
AI Output Verification & Validation

AI-generated code requires verification that goes substantially beyond standard code review and testing practices. The 2.74x higher vulnerability rate and 1.7x higher defect rate in AI co-authored code demand a layered verification strategy combining automated testing, static analysis, behavioral validation, regression testing, and mutation testing. This section defines the mandatory verification techniques, coverage requirements, and integration points for AI output validation within the AEEF framework.

Verification Philosophy

info

Human review (defined in Human-in-the-Loop Review Processes) is necessary but not sufficient. Automated verification provides the scalable, consistent safety net that human review alone cannot deliver. Both layers are REQUIRED.

The verification strategy follows a defense-in-depth model: multiple independent verification layers, each designed to catch different classes of defects. No single layer is expected to catch all issues; together, they provide comprehensive coverage.

Automated Testing Requirements

Test Generation and Review

  • AI-generated code MUST have corresponding tests that verify its behavior
  • Tests MAY be generated by AI tools, but AI-generated tests MUST be reviewed by a human for correctness before they are trusted as a verification mechanism
  • AI-generated tests MUST NOT be the sole verification for AI-generated implementation code without human validation of the tests themselves
  • Tests MUST verify behavior (what the code does) rather than implementation (how the code does it)
warning

AI tools frequently generate tests that are tautological (they test that the code does what the code does, rather than what the specification requires). Reviewers MUST verify that test assertions are grounded in requirements, not in the generated implementation.
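
For illustration, here is a minimal pytest sketch of the difference, using a hypothetical apply_discount function whose acceptance criteria cap discounts at 50%. The first test encodes the implementation's own (buggy) output and passes; the second is grounded in the specification and catches the defect.

```python
def apply_discount(price: float, discount_pct: float) -> float:
    """Hypothetical AI-generated implementation under review."""
    capped = min(discount_pct, 60.0)          # bug: the spec caps discounts at 50%
    return round(price * (1 - capped / 100), 2)


def test_discount_tautological():
    # Assertion value copied from the implementation's observed output,
    # so the incorrect 60% cap is "confirmed" rather than caught.
    assert apply_discount(100.0, 60.0) == 40.0


def test_discount_spec_grounded():
    # Assertion derived from the acceptance criteria ("discounts are capped at 50%");
    # this test fails against the buggy implementation above.
    assert apply_discount(100.0, 60.0) == 50.0
```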

Test Coverage Requirements

The following coverage thresholds are MANDATORY minimums for AI-generated code. These thresholds are intentionally higher than standard coverage requirements due to the elevated defect rates in AI-generated code.

| Metric | AI-Generated Code Minimum | Standard Code Minimum | Measurement Tool |
| --- | --- | --- | --- |
| Line Coverage | 90% | 80% | JaCoCo, coverage.py, Istanbul/c8 |
| Branch Coverage | 85% | 75% | JaCoCo, coverage.py, Istanbul/c8 |
| Function/Method Coverage | 95% | 85% | JaCoCo, coverage.py, Istanbul/c8 |
| Mutation Score | 70% | Not required | PIT, mutmut, Stryker |
| Integration Test Coverage | All public API endpoints | Critical paths only | Custom + framework tools |
| Error Path Coverage | All explicitly handled error paths | Critical error paths | Manual verification + coverage tools |
danger

Coverage thresholds are a floor, not a ceiling. Meeting coverage numbers with low-quality tests provides false confidence. Coverage MUST be paired with mutation testing and behavioral validation to be meaningful.
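
As one way to enforce these floors automatically, the following sketch reads a Cobertura-format report (such as the output of coverage.py's `coverage xml`) and fails the build below the AI-code minimums. The report path and hard-coded thresholds are assumptions for illustration; in practice they would come from pipeline configuration.

```python
"""Minimal sketch of a CI gate enforcing the AI-code coverage floors above."""
import sys
import xml.etree.ElementTree as ET

LINE_MIN = 0.90    # 90% line coverage floor for AI-generated code
BRANCH_MIN = 0.85  # 85% branch coverage floor for AI-generated code


def main(report_path: str = "coverage.xml") -> int:
    # Cobertura reports expose overall line-rate and branch-rate on the root element.
    root = ET.parse(report_path).getroot()
    line_rate = float(root.attrib.get("line-rate", 0.0))
    branch_rate = float(root.attrib.get("branch-rate", 0.0))

    failures = []
    if line_rate < LINE_MIN:
        failures.append(f"line coverage {line_rate:.1%} < {LINE_MIN:.0%}")
    if branch_rate < BRANCH_MIN:
        failures.append(f"branch coverage {branch_rate:.1%} < {BRANCH_MIN:.0%}")

    for failure in failures:
        print(f"COVERAGE GATE FAILED: {failure}")
    return 1 if failures else 0


if __name__ == "__main__":
    sys.exit(main(*sys.argv[1:]))
```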

Test Categories Required for AI-Generated Code

| Test Category | When Required | Purpose |
| --- | --- | --- |
| Unit Tests | ALWAYS | Verify individual function/method behavior in isolation |
| Integration Tests | When code interacts with external systems, databases, or APIs | Verify correct interaction between components |
| Contract Tests | When code implements or consumes an API | Verify API contract conformance |
| Property-Based Tests | RECOMMENDED for data transformation, parsing, and algorithmic code | Discover edge cases through randomized input generation |
| Boundary Tests | ALWAYS for code handling numeric ranges, string lengths, or collection sizes | Verify correct behavior at boundary conditions |
| Negative Tests | ALWAYS | Verify correct handling of invalid inputs, error conditions, and edge cases |
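
To make the property-based and boundary categories concrete, here is a minimal pytest sketch for a hypothetical truncate helper. It assumes the third-party hypothesis library is available; the function itself is illustrative, not part of any real codebase.

```python
import pytest
from hypothesis import given, strategies as st


def truncate(text: str, max_len: int) -> str:
    """Hypothetical AI-generated helper: shorten text to at most max_len characters."""
    return text if len(text) <= max_len else text[: max_len - 1] + "\u2026"


@given(st.text(), st.integers(min_value=1, max_value=1000))
def test_truncate_never_exceeds_limit(text, max_len):
    # Property-based: the result never exceeds the limit, for any generated input.
    assert len(truncate(text, max_len)) <= max_len


@pytest.mark.parametrize("length", [0, 1, 9, 10, 11])
def test_truncate_boundaries(length):
    # Boundary: behavior at, just below, and just above a limit of 10.
    result = truncate("x" * length, 10)
    assert len(result) <= 10
    assert (result == "x" * length) == (length <= 10)
```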

Static Analysis Requirements

Static analysis MUST be applied to all AI-generated code before it enters a protected branch. The following categories of static analysis are REQUIRED:

Mandatory Static Analysis Categories

| Category | Tools (Examples) | Enforcement |
| --- | --- | --- |
| Security Scanning (SAST) | Semgrep, SonarQube, CodeQL, Snyk Code | MUST pass with zero Critical/High findings |
| Dependency Vulnerability Scanning | Snyk, Dependabot, OWASP Dependency-Check | MUST pass with zero Critical findings; High findings require risk acceptance |
| Code Quality / Linting | ESLint, Pylint, Checkstyle, RuboCop | MUST pass with zero errors; warnings reviewed |
| Complexity Analysis | SonarQube, radon, lizard | MUST NOT exceed thresholds in Engineering Quality Standards |
| Type Checking | mypy, TypeScript strict mode, Flow | MUST pass with zero errors for typed languages |
| License Compliance | FOSSA, Black Duck, Licensee | MUST pass -- see Intellectual Property |

Static Analysis Configuration for AI Code

  • Static analysis tools SHOULD be configured with stricter rulesets for files or commits tagged as AI-generated
  • Teams SHOULD enable experimental or preview rules that detect common AI generation patterns (e.g., overly generic exception handling, unused imports, redundant null checks); the sketch after this list illustrates the exception-handling pattern
  • SAST findings on AI-generated code MUST NOT be suppressed without a documented justification reviewed by a Tier 2 or higher reviewer
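
The following Python sketch illustrates the overly generic exception handling mentioned above, together with a tightened version that static analysis and reviewers should expect instead. The function names and behavior are hypothetical.

```python
import json
import logging

logger = logging.getLogger(__name__)


def load_config_generated(path: str) -> dict:
    """Pattern frequently seen in AI output: every failure is swallowed silently."""
    try:
        with open(path) as fh:
            return json.load(fh)
    except Exception:                      # flagged: overly generic exception handling
        return {}


def load_config(path: str) -> dict:
    """Tightened version: handle only the failures the caller can act on."""
    try:
        with open(path) as fh:
            return json.load(fh)
    except FileNotFoundError:
        logger.warning("config file %s not found, using defaults", path)
        return {}
    except json.JSONDecodeError as exc:
        # Surface malformed configuration instead of hiding it behind a default.
        raise ValueError(f"invalid config file {path}: {exc}") from exc
```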

Behavioral Validation

Behavioral validation verifies that AI-generated code does what the specification requires, not merely that it executes without errors.

Behavioral Validation Techniques

  1. Specification-Based Testing: Tests derived directly from acceptance criteria or user stories, written before or independently of the AI-generated code
  2. Example-Based Validation: Running the code against known input/output pairs from the specification or existing system behavior
  3. Differential Testing: When AI-generated code replaces existing code, running both old and new implementations against the same inputs and comparing outputs (sketched after this list)
  4. Shadow Mode Execution: Deploying AI-generated code in a shadow or canary mode where it processes real traffic but its outputs are compared against the production system rather than served to users
  5. Domain Expert Review: For business-critical logic, a domain expert (not just a developer) SHOULD validate that the code's behavior matches business rules
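
A minimal sketch of the differential-testing technique, assuming a legacy_normalize function being replaced by an AI-generated new_normalize; both functions and the input corpus are hypothetical stand-ins.

```python
from typing import Callable, Iterable


def legacy_normalize(value: str) -> str:
    """Existing production implementation."""
    return value.strip().lower()


def new_normalize(value: str) -> str:
    """AI-generated replacement under review."""
    return " ".join(value.split()).lower()


def differential_test(
    old: Callable[[str], str],
    new: Callable[[str], str],
    corpus: Iterable[str],
) -> list[tuple[str, str, str]]:
    """Return every input on which the two implementations disagree."""
    return [(s, old(s), new(s)) for s in corpus if old(s) != new(s)]


if __name__ == "__main__":
    samples = ["  Hello  ", "a  b", "MIXED case", ""]
    for sample, old_out, new_out in differential_test(legacy_normalize, new_normalize, samples):
        # "a  b" diverges: the new code collapses internal whitespace, the old code does not.
        print(f"divergence on {sample!r}: old={old_out!r} new={new_out!r}")
```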

When to Apply Each Technique

| Technique | When REQUIRED | When RECOMMENDED |
| --- | --- | --- |
| Specification-Based Testing | Always | -- |
| Example-Based Validation | When known input/output pairs exist | Always |
| Differential Testing | When replacing existing functionality | When refactoring |
| Shadow Mode Execution | For high-risk production changes | For any customer-facing logic |
| Domain Expert Review | For financial, medical, or legal logic | For any business rule implementation |

Regression Testing

AI-generated code MUST NOT break existing functionality. Regression testing requirements are elevated for AI-assisted changes.

Regression Testing Requirements

  • The full regression test suite MUST pass before AI-generated code is merged into a protected branch
  • Regression test execution MUST be automated in the CI/CD pipeline and MUST NOT be bypassed
  • When AI-generated code modifies existing functions, all callers of those functions MUST be tested
  • Performance regression tests SHOULD be run for AI-generated code that is on a critical performance path (see the sketch after this list)
  • Visual regression tests SHOULD be run for AI-generated frontend code
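
As a rough illustration of the performance-regression point, the sketch below times a hypothetical hot-path function against an assumed baseline figure. Real pipelines would typically use a dedicated tool (e.g., pytest-benchmark) and store baselines outside the test file.

```python
import statistics
import time

BASELINE_SECONDS = 0.0005   # assumed figure recorded from the current production build
ALLOWED_REGRESSION = 1.20   # fail if the modified code is more than 20% slower


def parse_record(line: str) -> list[str]:
    """Hypothetical AI-modified function on a critical performance path."""
    return [field.strip() for field in line.split(",")]


def test_parse_record_performance():
    line = "alpha, beta ,gamma," * 50
    timings = []
    for _ in range(200):
        start = time.perf_counter()
        parse_record(line)
        timings.append(time.perf_counter() - start)
    # Median is used to reduce noise from individual slow iterations.
    assert statistics.median(timings) <= BASELINE_SECONDS * ALLOWED_REGRESSION
```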

Regression Failure Protocol

If regression tests fail after integrating AI-generated code:

  1. The AI-generated code MUST be reverted or the merge blocked
  2. The failure MUST be analyzed to determine if the AI output introduced the regression or exposed a pre-existing issue
  3. If the AI output introduced the regression, the prompt and constraints MUST be reviewed and updated before re-generation
  4. The incident SHOULD be logged in the team's AI defect tracker for trend analysis

Mutation Testing for AI Outputs

Mutation testing is the most effective technique for validating that tests actually detect defects, rather than merely executing code paths. It is especially valuable for AI-generated code and AI-generated tests.

What Is Mutation Testing

Mutation testing introduces small, systematic changes (mutations) to the code -- such as changing > to >=, removing a function call, or altering a return value -- and verifies that the test suite detects each mutation. If a mutation survives (tests still pass), it indicates a gap in test effectiveness.
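
The sketch below makes the mechanism concrete for a hypothetical is_overdrawn check: the mutant is written out by hand here, whereas a mutation tool would generate and apply it automatically.

```python
def is_overdrawn(balance: float) -> bool:
    return balance < 0          # original code


def is_overdrawn_mutant(balance: float) -> bool:
    return balance <= 0         # mutation: '<' changed to '<='


def test_weak():
    # Survives the mutation: never exercises the boundary value 0,
    # so the mutant passes this test and reveals a test gap.
    assert is_overdrawn(-10.0) is True
    assert is_overdrawn(100.0) is False


def test_boundary():
    # Kills the mutation: a zero balance is not overdrawn, so the mutant
    # fails this assertion and the mutation is counted as detected.
    assert is_overdrawn(0.0) is False
```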

Mutation Testing Requirements

  • Mutation testing MUST be applied to AI-generated code in Tier 2 (Standard Risk) and Tier 3 (High Risk) changes as defined in Human-in-the-Loop
  • Mutation testing is RECOMMENDED for Tier 1 (Low Risk) changes
  • The minimum mutation score (percentage of mutations killed) MUST be 70%
  • Surviving mutations MUST be reviewed: each surviving mutation is either a test gap to fix or an equivalent mutation to document

Mutation Testing Tools

| Language | Tool | Notes |
| --- | --- | --- |
| Java/Kotlin | PIT (pitest) | Integrates with Maven/Gradle; CI-friendly |
| Python | mutmut | Works with pytest; incremental mutation support |
| JavaScript/TypeScript | Stryker | Supports multiple test runners; dashboard available |
| C# | Stryker.NET | .NET ecosystem support |
| Go | go-mutesting | Community-maintained |

CI/CD Integration

All verification techniques described in this section MUST be integrated into the CI/CD pipeline and enforced as automated gates.

Pipeline Gate Configuration

PR Created (AI-assisted label detected)
|
v
Stage 1: Linting + Type Checking --> Block on failure
|
v
Stage 2: Unit Tests + Coverage --> Block if below thresholds
|
v
Stage 3: SAST + Dependency Scan --> Block on Critical/High
|
v
Stage 4: Integration Tests --> Block on failure
|
v
Stage 5: Mutation Testing --> Block if score < 70%
|
v
Stage 6: Regression Suite --> Block on failure
|
v
Human Review Gate --> Requires qualified reviewer approval
|
v
Merge Permitted
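
As an illustration of how Stage 5 might be enforced, the sketch below fails the build when the mutation score falls under 70%. The report file name and JSON schema are assumptions for this example; adapt the parsing to the actual output of PIT, mutmut, or Stryker.

```python
"""Sketch of the Stage 5 gate: block the merge if the mutation score is below 70%."""
import json
import sys

MINIMUM_SCORE = 0.70


def mutation_score(report: dict) -> float:
    # Assumed schema: {"killed": int, "survived": int, "no_coverage": int}.
    killed = report.get("killed", 0)
    total = killed + report.get("survived", 0) + report.get("no_coverage", 0)
    return killed / total if total else 0.0


def main(report_path: str = "mutation-report.json") -> int:
    with open(report_path) as fh:
        score = mutation_score(json.load(fh))
    print(f"mutation score: {score:.1%} (minimum {MINIMUM_SCORE:.0%})")
    return 0 if score >= MINIMUM_SCORE else 1


if __name__ == "__main__":
    sys.exit(main(*sys.argv[1:]))
```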

Teams MUST NOT configure CI pipelines to allow bypassing these gates for AI-generated code. Emergency bypass procedures MUST require written approval from a Tier 3 reviewer or engineering director and MUST be logged for audit purposes. For broader quality thresholds and architectural conformance checks, see Engineering Quality Standards.