AI Output Verification & Validation
AI-generated code requires verification that goes substantially beyond standard code review and testing practices. The 2.74x higher vulnerability rate and 1.7x higher defect rate in AI co-authored code demand a layered verification strategy combining automated testing, static analysis, behavioral validation, regression testing, and mutation testing. This section defines the mandatory verification techniques, coverage requirements, and integration points for AI output validation within the AEEF framework.
Verification Philosophy
Human review (defined in Human-in-the-Loop Review Processes) is necessary but not sufficient. Automated verification provides the scalable, consistent safety net that human review alone cannot deliver. Both layers are REQUIRED.
The verification strategy follows a defense-in-depth model: multiple independent verification layers, each designed to catch different classes of defects. No single layer is expected to catch all issues; together, they provide comprehensive coverage.
Automated Testing Requirements
Test Generation and Review
- AI-generated code MUST have corresponding tests that verify its behavior
- Tests MAY be generated by AI tools, but AI-generated tests MUST be reviewed by a human for correctness before they are trusted as a verification mechanism
- AI-generated tests MUST NOT be the sole verification for AI-generated implementation code without human validation of the tests themselves
- Tests MUST verify behavior (what the code does) rather than implementation (how the code does it)
AI tools frequently generate tests that are tautological (they test that the code does what the code does, rather than what the specification requires). Reviewers MUST verify that test assertions are grounded in requirements, not in the generated implementation.
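The following sketch is non-normative and purely illustrative. The `shipping_fee` function, its 50.00 threshold, and both tests are hypothetical; the point is that the second test's expected value comes from the specification rather than from the generated code's current behavior.

```python
# Hypothetical spec: shipping is free for order totals of 50.00 or more.
def shipping_fee(order_total: float) -> float:
    # AI-generated implementation under review; it uses > where the
    # specification requires >=, so the 50.00 boundary case is wrong.
    return 0.0 if order_total > 50.00 else 5.99


def test_shipping_fee_tautological():
    # Tautological: the expected value was copied from the generated code's
    # current output, so the boundary defect is silently locked in.
    assert shipping_fee(50.00) == 5.99


def test_shipping_fee_specification_grounded():
    # Grounded in the acceptance criteria: fails and exposes the defect.
    assert shipping_fee(50.00) == 0.0
```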
Test Coverage Requirements
The following coverage thresholds are MANDATORY minimums for AI-generated code. These thresholds are intentionally higher than standard coverage requirements due to the elevated defect rates in AI-generated code.
| Metric | AI-Generated Code Minimum | Standard Code Minimum | Measurement Tool |
|---|---|---|---|
| Line Coverage | 90% | 80% | JaCoCo, coverage.py, Istanbul/c8 |
| Branch Coverage | 85% | 75% | JaCoCo, coverage.py, Istanbul/c8 |
| Function/Method Coverage | 95% | 85% | JaCoCo, coverage.py, Istanbul/c8 |
| Mutation Score | 70% | Not required | PIT, mutmut, Stryker |
| Integration Test Coverage | All public API endpoints | Critical paths only | Custom + framework tools |
| Error Path Coverage | All explicitly handled error paths | Critical error paths | Manual verification + coverage tools |
Coverage thresholds are a floor, not a ceiling. Meeting coverage numbers with low-quality tests provides false confidence. Coverage MUST be paired with mutation testing and behavioral validation to be meaningful.
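As a non-normative illustration, the sketch below shows one way a CI job might enforce the line and branch thresholds above from a coverage.py JSON report. The report path and the `totals` key names (`covered_lines`, `num_statements`, `covered_branches`, `num_branches`) are assumptions and should be verified against the coverage.py version in use.

```python
import json
import sys

LINE_THRESHOLD = 90.0     # AI-generated code minimum from the table above
BRANCH_THRESHOLD = 85.0


def check_coverage(report_path: str = "coverage.json") -> int:
    # Assumes `coverage json` was run beforehand with branch coverage enabled.
    with open(report_path) as fh:
        totals = json.load(fh)["totals"]

    line_pct = 100.0 * totals["covered_lines"] / max(totals["num_statements"], 1)
    num_branches = totals.get("num_branches", 0)
    branch_pct = (
        100.0 * totals.get("covered_branches", 0) / num_branches
        if num_branches else 100.0
    )

    failures = []
    if line_pct < LINE_THRESHOLD:
        failures.append(f"line coverage {line_pct:.1f}% is below {LINE_THRESHOLD}%")
    if branch_pct < BRANCH_THRESHOLD:
        failures.append(f"branch coverage {branch_pct:.1f}% is below {BRANCH_THRESHOLD}%")

    for failure in failures:
        print(f"COVERAGE GATE FAILED: {failure}")
    return 1 if failures else 0


if __name__ == "__main__":
    sys.exit(check_coverage())
```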
Test Categories Required for AI-Generated Code
| Test Category | When Required | Purpose |
|---|---|---|
| Unit Tests | ALWAYS | Verify individual function/method behavior in isolation |
| Integration Tests | When code interacts with external systems, databases, or APIs | Verify correct interaction between components |
| Contract Tests | When code implements or consumes an API | Verify API contract conformance |
| Property-Based Tests | RECOMMENDED for data transformation, parsing, and algorithmic code | Discover edge cases through randomized input generation (see the sketch after this table) |
| Boundary Tests | ALWAYS for code handling numeric ranges, string lengths, or collection sizes | Verify correct behavior at boundary conditions |
| Negative Tests | ALWAYS | Verify correct handling of invalid inputs, error conditions, and edge cases |
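The sketch below is a non-normative illustration of the property-based and boundary test categories from the table above, using the Hypothesis library (an assumption; any property-based framework is acceptable). `truncate_name` and its 64-character limit are hypothetical.

```python
from hypothesis import given, strategies as st


def truncate_name(name: str, limit: int = 64) -> str:
    # Hypothetical AI-generated helper under test.
    return name[:limit]


# Property-based test: for any input, the result never exceeds the limit
# and is always a prefix of the original input.
@given(st.text())
def test_truncate_never_exceeds_limit(name):
    result = truncate_name(name)
    assert len(result) <= 64
    assert name.startswith(result)


# Boundary tests: exercise lengths just below, at, and just above the limit.
def test_truncate_at_boundaries():
    assert len(truncate_name("a" * 63)) == 63
    assert len(truncate_name("a" * 64)) == 64
    assert len(truncate_name("a" * 65)) == 64
```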
Static Analysis Requirements
Static analysis MUST be applied to all AI-generated code before it enters a protected branch. The following categories of static analysis are REQUIRED:
Mandatory Static Analysis Categories
| Category | Tools (Examples) | Enforcement |
|---|---|---|
| Security Scanning (SAST) | Semgrep, SonarQube, CodeQL, Snyk Code | MUST pass with zero Critical/High findings |
| Dependency Vulnerability Scanning | Snyk, Dependabot, OWASP Dependency-Check | MUST pass with zero Critical findings; High findings require risk acceptance |
| Code Quality / Linting | ESLint, Pylint, Checkstyle, RuboCop | MUST pass with zero errors; warnings reviewed |
| Complexity Analysis | SonarQube, radon, lizard | MUST NOT exceed thresholds in Engineering Quality Standards |
| Type Checking | mypy, TypeScript strict mode, Flow | MUST pass with zero errors for typed languages |
| License Compliance | FOSSA, Black Duck, Licensee | MUST pass -- see Intellectual Property |
Static Analysis Configuration for AI Code
- Static analysis tools SHOULD be configured with stricter rulesets for files or commits tagged as AI-generated
- Teams SHOULD enable experimental or preview rules that detect common AI generation patterns (e.g., overly generic exception handling, unused imports, redundant null checks), as illustrated in the sketch after this list
- SAST findings on AI-generated code MUST NOT be suppressed without a documented justification reviewed by a Tier 2 or higher reviewer
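The following non-normative sketch illustrates the kinds of patterns referenced above (an unused import, overly generic exception handling, and a redundant null check). Function and module names are hypothetical; the intent is to show what stricter rulesets should flag, not to prescribe specific rules.

```python
import json
import logging
import os                              # unused import: a frequent AI-generation artifact

logger = logging.getLogger(__name__)


def load_config_as_generated(path, defaults=None):
    """Typical AI output: blanket exception handling and a redundant guard."""
    defaults = defaults if defaults is not None else {}
    try:
        with open(path) as fh:
            data = json.load(fh)
        if defaults is not None:       # redundant: defaults was normalized above
            data = {**defaults, **data}
        return data
    except Exception:                  # swallows every error and hides the root cause
        return defaults


def load_config_as_reviewed(path, defaults=None):
    """Reviewed version: narrow handling, and failures stay visible to callers."""
    defaults = dict(defaults or {})
    try:
        with open(path) as fh:
            return {**defaults, **json.load(fh)}
    except (OSError, json.JSONDecodeError) as exc:
        logger.warning("Config at %s could not be loaded: %s", path, exc)
        raise
```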
Behavioral Validation
Behavioral validation verifies that AI-generated code does what the specification requires, not merely that it executes without errors.
Behavioral Validation Techniques
- Specification-Based Testing: Tests derived directly from acceptance criteria or user stories, written before or independently of the AI-generated code
- Example-Based Validation: Running the code against known input/output pairs from the specification or existing system behavior
- Differential Testing: When AI-generated code replaces existing code, running both old and new implementations against the same inputs and comparing outputs (see the sketch after this list)
- Shadow Mode Execution: Deploying AI-generated code in a shadow or canary mode where it processes real traffic but its outputs are compared against the production system rather than served to users
- Domain Expert Review: For business-critical logic, a domain expert (not just a developer) SHOULD validate that the code's behavior matches business rules
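As a non-normative illustration of differential testing, the sketch below runs a legacy implementation and its AI-generated replacement against the same randomized inputs and collects any divergences. `legacy_price` and `generated_price` are hypothetical stand-ins for the old and new code.

```python
import random


def legacy_price(quantity: int, unit_price: float) -> float:
    # Existing production implementation.
    return round(quantity * unit_price, 2)


def generated_price(quantity: int, unit_price: float) -> float:
    # AI-generated replacement under evaluation.
    return round(quantity * unit_price, 2)


def run_differential_test(num_cases: int = 10_000, seed: int = 42) -> list:
    rng = random.Random(seed)          # fixed seed keeps failures reproducible
    mismatches = []
    for _ in range(num_cases):
        quantity = rng.randint(0, 1_000)
        unit_price = round(rng.uniform(0.0, 500.0), 2)
        old = legacy_price(quantity, unit_price)
        new = generated_price(quantity, unit_price)
        if old != new:
            mismatches.append((quantity, unit_price, old, new))
    return mismatches


if __name__ == "__main__":
    diffs = run_differential_test()
    print(f"{len(diffs)} divergent cases found")
```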
When to Apply Each Technique
| Technique | When REQUIRED | When RECOMMENDED |
|---|---|---|
| Specification-Based Testing | Always | -- |
| Example-Based Validation | When known input/output pairs exist | Always |
| Differential Testing | When replacing existing functionality | When refactoring |
| Shadow Mode Execution | For high-risk production changes | For any customer-facing logic |
| Domain Expert Review | For financial, medical, or legal logic | For any business rule implementation |
Regression Testing
AI-generated code MUST NOT break existing functionality. Regression testing requirements are elevated for AI-assisted changes.
Regression Testing Requirements
- The full regression test suite MUST pass before AI-generated code is merged into a protected branch
- Regression test execution MUST be automated in the CI/CD pipeline and MUST NOT be bypassed
- When AI-generated code modifies existing functions, all callers of those functions MUST be tested
- Performance regression tests SHOULD be run for AI-generated code that is on a critical performance path (see the sketch after this list)
- Visual regression tests SHOULD be run for AI-generated frontend code
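The sketch below is a non-normative example of a performance regression check using pytest-benchmark (an assumption; any harness that supports baseline comparison is acceptable). `search_catalog` is a hypothetical function on a critical performance path.

```python
def search_catalog(items, term):
    # Hypothetical hot-path function whose performance must not regress.
    return [item for item in items if term in item]


def test_search_catalog_performance(benchmark):
    items = [f"product-{i}" for i in range(50_000)]
    result = benchmark(search_catalog, items, "product-499")
    assert result  # still verify behavior, not just timing

# In CI, compare against a stored baseline and fail on significant slowdowns,
# e.g.: pytest --benchmark-compare --benchmark-compare-fail=mean:10%
```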
Regression Failure Protocol
If regression tests fail after integrating AI-generated code:
- The AI-generated code MUST be reverted or the merge blocked
- The failure MUST be analyzed to determine whether the AI output introduced the regression or exposed a pre-existing issue
- If the AI output introduced the regression, the prompt and constraints MUST be reviewed and updated before re-generation
- The incident SHOULD be logged in the team's AI defect tracker for trend analysis
Mutation Testing for AI Outputs
Mutation testing is among the most effective techniques for validating that tests actually detect defects, rather than merely executing code paths. It is especially valuable for AI-generated code and AI-generated tests.
What Is Mutation Testing
Mutation testing introduces small, systematic changes (mutations) to the code -- such as changing > to >=, removing a function call, or altering a return value -- and verifies that the test suite detects each mutation. If a mutation survives (tests still pass), it indicates a gap in test effectiveness.
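The non-normative sketch below shows a boundary mutation and why tests that only exercise values far from the threshold fail to kill it. The discount function and its 100.0 threshold are hypothetical.

```python
def is_eligible_for_discount(order_total: float) -> bool:
    # A mutation tool might change the >= below to > (a boundary mutation).
    return order_total >= 100.0


def test_far_from_boundary_lets_mutant_survive():
    assert is_eligible_for_discount(500.0)        # passes for both >= and >
    assert not is_eligible_for_discount(10.0)     # passes for both >= and >


def test_at_boundary_kills_mutant():
    assert is_eligible_for_discount(100.0)        # fails if >= is mutated to >
```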
Mutation Testing Requirements
- Mutation testing MUST be applied to AI-generated code in Tier 2 (Standard Risk) and Tier 3 (High Risk) changes as defined in Human-in-the-Loop
- Mutation testing is RECOMMENDED for Tier 1 (Low Risk) changes
- The minimum mutation score (percentage of mutations killed) MUST be 70%
- Surviving mutations MUST be reviewed: each surviving mutation is either a test gap to fix or an equivalent mutation to document
Recommended Mutation Testing Tools
| Language | Tool | Notes |
|---|---|---|
| Java/Kotlin | PIT (pitest) | Integrates with Maven/Gradle; CI-friendly |
| Python | mutmut | Works with pytest; incremental mutation support |
| JavaScript/TypeScript | Stryker | Supports multiple test runners; dashboard available |
| C# | Stryker.NET | .NET ecosystem support |
| Go | go-mutesting | Community-maintained |
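As a non-normative illustration of enforcing the 70% minimum in CI, the sketch below computes a mutation score from a PIT XML report. The report path, element names, and status values are assumptions and should be checked against the PIT version in use; the other tools in the table expose scores through their own reports or dashboards.

```python
import sys
import xml.etree.ElementTree as ET

MINIMUM_SCORE = 70.0
KILLED_STATUSES = {"KILLED", "TIMED_OUT"}   # assumption: statuses counted as detected


def mutation_score(report_path: str) -> float:
    # Assumes a PIT-style report with <mutation status="..."> elements.
    mutations = ET.parse(report_path).getroot().findall("mutation")
    if not mutations:
        return 0.0
    killed = sum(1 for m in mutations if m.get("status") in KILLED_STATUSES)
    return 100.0 * killed / len(mutations)


if __name__ == "__main__":
    score = mutation_score("target/pit-reports/mutations.xml")
    print(f"Mutation score: {score:.1f}% (minimum {MINIMUM_SCORE}%)")
    sys.exit(0 if score >= MINIMUM_SCORE else 1)
```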
CI/CD Integration
All verification techniques described in this section MUST be integrated into the CI/CD pipeline and enforced as automated gates.
Pipeline Gate Configuration
```
PR Created (AI-assisted label detected)
        |
        v
Stage 1: Linting + Type Checking   --> Block on failure
        |
        v
Stage 2: Unit Tests + Coverage     --> Block if below thresholds
        |
        v
Stage 3: SAST + Dependency Scan    --> Block on Critical/High
        |
        v
Stage 4: Integration Tests         --> Block on failure
        |
        v
Stage 5: Mutation Testing          --> Block if score < 70%
        |
        v
Stage 6: Regression Suite          --> Block on failure
        |
        v
Human Review Gate                  --> Requires qualified reviewer approval
        |
        v
Merge Permitted
```
Teams MUST NOT configure CI pipelines to allow bypassing these gates for AI-generated code. Emergency bypass procedures MUST require written approval from a Tier 3 reviewer or engineering director and MUST be logged for audit purposes. For broader quality thresholds and architectural conformance checks, see Engineering Quality Standards.