AI Output Verification & Validation
AI-generated code requires verification that goes substantially beyond standard code review and testing practices. The 2.74x higher vulnerability rate and 1.7x higher defect rate in AI co-authored code demand a layered verification strategy combining automated testing, static analysis, behavioral validation, regression testing, and mutation testing. This section defines the mandatory verification techniques, coverage requirements, and integration points for AI output validation within the AEEF framework.
Verification Philosophy
Human review (defined in Human-in-the-Loop Review Processes) is necessary but not sufficient. Automated verification provides the scalable, consistent safety net that human review alone cannot deliver. Both layers are REQUIRED.
The verification strategy follows a defense-in-depth model: multiple independent verification layers, each designed to catch different classes of defects. No single layer is expected to catch all issues; together, they provide comprehensive coverage.
Automated Testing Requirements
Test Generation and Review
- AI-generated code MUST have corresponding tests that verify its behavior
- Tests MAY be generated by AI tools, but AI-generated tests MUST be reviewed by a human for correctness before they are trusted as a verification mechanism
- AI-generated tests MUST NOT be the sole verification for AI-generated implementation code without human validation of the tests themselves
- Tests MUST verify behavior (what the code does) rather than implementation (how the code does it)
AI tools frequently generate tests that are tautological (they test that the code does what the code does, rather than what the specification requires). Reviewers MUST verify that test assertions are grounded in requirements, not in the generated implementation.
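The following sketch is non-normative and purely illustrative. The `shipping_fee` function, its 50.00 threshold, and both tests are hypothetical; the point is that the second test's expected value comes from the specification rather than from the generated code's current behavior.

```python
# Hypothetical spec: shipping is free for order totals of 50.00 or more.
def shipping_fee(order_total: float) -> float:
    # AI-generated implementation under review; it uses > where the
    # specification requires >=, so the 50.00 boundary case is wrong.
    return 0.0 if order_total > 50.00 else 5.99


def test_shipping_fee_tautological():
    # Tautological: the expected value was copied from the generated code's
    # current output, so the boundary defect is silently locked in.
    assert shipping_fee(50.00) == 5.99


def test_shipping_fee_specification_grounded():
    # Grounded in the acceptance criteria: fails and exposes the defect.
    assert shipping_fee(50.00) == 0.0
```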
Test Coverage Requirements
The following coverage thresholds are MANDATORY minimums for AI-generated code. These thresholds are intentionally higher than standard coverage requirements due to the elevated defect rates in AI-generated code.
| Metric | AI-Generated Code Minimum | Standard Code Minimum | Measurement Tool |
|---|---|---|---|
| Line Coverage | 90% | 80% | JaCoCo, coverage.py, Istanbul/c8 |
| Branch Coverage | 85% | 75% | JaCoCo, coverage.py, Istanbul/c8 |
| Function/Method Coverage | 95% | 85% | JaCoCo, coverage.py, Istanbul/c8 |
| Mutation Score | 70% | Not required | PIT, mutmut, Stryker |
| Integration Test Coverage | All public API endpoints | Critical paths only | Custom + framework tools |
| Error Path Coverage | All explicitly handled error paths | Critical error paths | Manual verification + coverage tools |
Coverage thresholds are a floor, not a ceiling. Meeting coverage numbers with low-quality tests provides false confidence. Coverage MUST be paired with mutation testing and behavioral validation to be meaningful.
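As a non-normative illustration, the sketch below shows one way a CI job might enforce the line and branch thresholds above from a coverage.py JSON report. The report path and the `totals` key names (`covered_lines`, `num_statements`, `covered_branches`, `num_branches`) are assumptions and should be verified against the coverage.py version in use.

```python
import json
import sys

LINE_THRESHOLD = 90.0     # AI-generated code minimum from the table above
BRANCH_THRESHOLD = 85.0


def check_coverage(report_path: str = "coverage.json") -> int:
    # Assumes `coverage json` was run beforehand with branch coverage enabled.
    with open(report_path) as fh:
        totals = json.load(fh)["totals"]

    line_pct = 100.0 * totals["covered_lines"] / max(totals["num_statements"], 1)
    num_branches = totals.get("num_branches", 0)
    branch_pct = (
        100.0 * totals.get("covered_branches", 0) / num_branches
        if num_branches else 100.0
    )

    failures = []
    if line_pct < LINE_THRESHOLD:
        failures.append(f"line coverage {line_pct:.1f}% is below {LINE_THRESHOLD}%")
    if branch_pct < BRANCH_THRESHOLD:
        failures.append(f"branch coverage {branch_pct:.1f}% is below {BRANCH_THRESHOLD}%")

    for failure in failures:
        print(f"COVERAGE GATE FAILED: {failure}")
    return 1 if failures else 0


if __name__ == "__main__":
    sys.exit(check_coverage())
```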
Test Categories Required for AI-Generated Code
| Test Category | When Required | Purpose |
|---|---|---|
| Unit Tests | ALWAYS | Verify individual function/method behavior in isolation |
| Integration Tests | When code interacts with external systems, databases, or APIs | Verify correct interaction between components |
| Contract Tests | When code implements or consumes an API | Verify API contract conformance |
| Property-Based Tests | RECOMMENDED for data transformation, parsing, and algorithmic code | Discover edge cases through randomized input generation (see the sketch after this table) |
| Boundary Tests | ALWAYS for code handling numeric ranges, string lengths, or collection sizes | Verify correct behavior at boundary conditions |
| Negative Tests | ALWAYS | Verify correct handling of invalid inputs, error conditions, and edge cases |
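The sketch below is a non-normative illustration of the property-based and boundary test categories from the table above, using the Hypothesis library (an assumption; any property-based framework is acceptable). `truncate_name` and its 64-character limit are hypothetical.

```python
from hypothesis import given, strategies as st


def truncate_name(name: str, limit: int = 64) -> str:
    # Hypothetical AI-generated helper under test.
    return name[:limit]


# Property-based test: for any input, the result never exceeds the limit
# and is always a prefix of the original input.
@given(st.text())
def test_truncate_never_exceeds_limit(name):
    result = truncate_name(name)
    assert len(result) <= 64
    assert name.startswith(result)


# Boundary tests: exercise lengths just below, at, and just above the limit.
def test_truncate_at_boundaries():
    assert len(truncate_name("a" * 63)) == 63
    assert len(truncate_name("a" * 64)) == 64
    assert len(truncate_name("a" * 65)) == 64
```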
Static Analysis Requirements
Static analysis MUST be applied to all AI-generated code before it enters a protected branch. The following categories of static analysis are REQUIRED:
Mandatory Static Analysis Categories
| Category | Tools (Examples) | Enforcement |
|---|---|---|
| Security Scanning (SAST) | Semgrep, SonarQube, CodeQL, Snyk Code | MUST pass with zero Critical/High findings |
| Dependency Vulnerability Scanning | Snyk, Dependabot, OWASP Dependency-Check | MUST pass with zero Critical findings; High findings require risk acceptance |
| Code Quality / Linting | ESLint, Pylint, Checkstyle, RuboCop | MUST pass with zero errors; warnings reviewed |
| Complexity Analysis | SonarQube, radon, lizard | MUST NOT exceed thresholds in Engineering Quality Standards |
| Type Checking | mypy, TypeScript strict mode, Flow | MUST pass with zero errors for typed languages |
| License Compliance | FOSSA, Black Duck, Licensee | MUST pass -- see Intellectual Property |
Static Analysis Configuration for AI Code
- Static analysis tools SHOULD be configured with stricter rulesets for files or commits tagged as AI-generated
- Teams SHOULD enable experimental or preview rules that detect common AI generation patterns (e.g., overly generic exception handling, unused imports, redundant null checks), as illustrated in the sketch after this list
- SAST findings on AI-generated code MUST NOT be suppressed without a documented justification reviewed by a Tier 2 or higher reviewer
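The following non-normative sketch illustrates the kinds of patterns referenced above (an unused import, overly generic exception handling, and a redundant null check). Function and module names are hypothetical; the intent is to show what stricter rulesets should flag, not to prescribe specific rules.

```python
import json
import logging
import os                              # unused import: a frequent AI-generation artifact

logger = logging.getLogger(__name__)


def load_config_as_generated(path, defaults=None):
    """Typical AI output: blanket exception handling and a redundant guard."""
    defaults = defaults if defaults is not None else {}
    try:
        with open(path) as fh:
            data = json.load(fh)
        if defaults is not None:       # redundant: defaults was normalized above
            data = {**defaults, **data}
        return data
    except Exception:                  # swallows every error and hides the root cause
        return defaults


def load_config_as_reviewed(path, defaults=None):
    """Reviewed version: narrow handling, and failures stay visible to callers."""
    defaults = dict(defaults or {})
    try:
        with open(path) as fh:
            return {**defaults, **json.load(fh)}
    except (OSError, json.JSONDecodeError) as exc:
        logger.warning("Config at %s could not be loaded: %s", path, exc)
        raise
```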
Behavioral Validation
Behavioral validation verifies that AI-generated code does what the specification requires, not merely that it executes without errors.
Behavioral Validation Techniques
- Specification-Based Testing: Tests derived directly from acceptance criteria or user stories, written before or independently of the AI-generated code
- Example-Based Validation: Running the code against known input/output pairs from the specification or existing system behavior
- Differential Testing: When AI-generated code replaces existing code, running both old and new implementations against the same inputs and comparing outputs (see the sketch after this list)
- Shadow Mode Execution: Deploying AI-generated code in a shadow or canary mode where it processes real traffic but its outputs are compared against the production system rather than served to users
- Domain Expert Review: For business-critical logic, a domain expert (not just a developer) SHOULD validate that the code's behavior matches business rules
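As a non-normative illustration of differential testing, the sketch below runs a legacy implementation and its AI-generated replacement against the same randomized inputs and collects any divergences. `legacy_price` and `generated_price` are hypothetical stand-ins for the old and new code.

```python
import random


def legacy_price(quantity: int, unit_price: float) -> float:
    # Existing production implementation.
    return round(quantity * unit_price, 2)


def generated_price(quantity: int, unit_price: float) -> float:
    # AI-generated replacement under evaluation.
    return round(quantity * unit_price, 2)


def run_differential_test(num_cases: int = 10_000, seed: int = 42) -> list:
    rng = random.Random(seed)          # fixed seed keeps failures reproducible
    mismatches = []
    for _ in range(num_cases):
        quantity = rng.randint(0, 1_000)
        unit_price = round(rng.uniform(0.0, 500.0), 2)
        old = legacy_price(quantity, unit_price)
        new = generated_price(quantity, unit_price)
        if old != new:
            mismatches.append((quantity, unit_price, old, new))
    return mismatches


if __name__ == "__main__":
    diffs = run_differential_test()
    print(f"{len(diffs)} divergent cases found")
```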
When to Apply Each Technique
| Technique | When REQUIRED | When RECOMMENDED |
|---|---|---|
| Specification-Based Testing | Always | -- |
| Example-Based Validation | When known input/output pairs exist | Always |
| Differential Testing | When replacing existing functionality | When refactoring |
| Shadow Mode Execution | For high-risk production changes | For any customer-facing logic |
| Domain Expert Review | For financial, medical, or legal logic | For any business rule implementation |
Regression Testing
AI-generated code MUST NOT break existing functionality. Regression testing requirements are elevated for AI-assisted changes.
Regression Testing Requirements
- The full regression test suite MUST pass before AI-generated code is merged into a protected branch
- Regression test execution MUST be automated in the CI/CD pipeline and MUST NOT be bypassed
- When AI-generated code modifies existing functions, all callers of those functions MUST be tested
- Performance regression tests SHOULD be run for AI-generated code that is on a critical performance path (see the sketch after this list)
- Visual regression tests SHOULD be run for AI-generated frontend code
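The sketch below is a non-normative example of a performance regression check using pytest-benchmark (an assumption; any harness that supports baseline comparison is acceptable). `search_catalog` is a hypothetical function on a critical performance path.

```python
def search_catalog(items, term):
    # Hypothetical hot-path function whose performance must not regress.
    return [item for item in items if term in item]


def test_search_catalog_performance(benchmark):
    items = [f"product-{i}" for i in range(50_000)]
    result = benchmark(search_catalog, items, "product-499")
    assert result  # still verify behavior, not just timing

# In CI, compare against a stored baseline and fail on significant slowdowns,
# e.g.: pytest --benchmark-compare --benchmark-compare-fail=mean:10%
```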
Regression Failure Protocol
If regression tests fail after integrating AI-generated code:
- The AI-generated code MUST be reverted or the merge blocked
- The failure MUST be analyzed to determine whether the AI output introduced the regression or exposed a pre-existing issue
- If the AI output introduced the regression, the prompt and constraints MUST be reviewed and updated before re-generation
- The incident SHOULD be logged in the team's AI defect tracker for trend analysis
Mutation Testing for AI Outputs
Mutation testing is among the most effective techniques for validating that tests actually detect defects, rather than merely executing code paths. It is especially valuable for AI-generated code and AI-generated tests.
What Is Mutation Testing
Mutation testing introduces small, systematic changes (mutations) to the code -- such as changing > to >=, removing a function call, or altering a return value -- and verifies that the test suite detects each mutation. If a mutation survives (tests still pass), it indicates a gap in test effectiveness.
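The non-normative sketch below shows a boundary mutation and why tests that only exercise values far from the threshold fail to kill it. The discount function and its 100.0 threshold are hypothetical.

```python
def is_eligible_for_discount(order_total: float) -> bool:
    # A mutation tool might change the >= below to > (a boundary mutation).
    return order_total >= 100.0


def test_far_from_boundary_lets_mutant_survive():
    assert is_eligible_for_discount(500.0)        # passes for both >= and >
    assert not is_eligible_for_discount(10.0)     # passes for both >= and >


def test_at_boundary_kills_mutant():
    assert is_eligible_for_discount(100.0)        # fails if >= is mutated to >
```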
Mutation Testing Requirements
- Mutation testing MUST be applied to AI-generated code in Tier 2 (Standard Risk) and Tier 3 (High Risk) changes as defined in Human-in-the-Loop
- Mutation testing is RECOMMENDED for Tier 1 (Low Risk) changes
- The minimum mutation score (percentage of mutations killed) MUST be 70%
- Surviving mutations MUST be reviewed: each surviving mutation is either a test gap to fix or an equivalent mutation to document
Recommended Mutation Testing Tools
| Language | Tool | Notes |
|---|---|---|
| Java/Kotlin | PIT (pitest) | Integrates with Maven/Gradle; CI-friendly |
| Python | mutmut | Works with pytest; incremental mutation support |
| JavaScript/TypeScript | Stryker | Supports multiple test runners; dashboard available |
| C# | Stryker.NET | .NET ecosystem support |
| Go | go-mutesting | Community-maintained |
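As a non-normative illustration of enforcing the 70% minimum in CI, the sketch below computes a mutation score from a PIT XML report. The report path, element names, and status values are assumptions and should be checked against the PIT version in use; the other tools in the table expose scores through their own reports or dashboards.

```python
import sys
import xml.etree.ElementTree as ET

MINIMUM_SCORE = 70.0
KILLED_STATUSES = {"KILLED", "TIMED_OUT"}   # assumption: statuses counted as detected


def mutation_score(report_path: str) -> float:
    # Assumes a PIT-style report with <mutation status="..."> elements.
    mutations = ET.parse(report_path).getroot().findall("mutation")
    if not mutations:
        return 0.0
    killed = sum(1 for m in mutations if m.get("status") in KILLED_STATUSES)
    return 100.0 * killed / len(mutations)


if __name__ == "__main__":
    score = mutation_score("target/pit-reports/mutations.xml")
    print(f"Mutation score: {score:.1f}% (minimum {MINIMUM_SCORE}%)")
    sys.exit(0 if score >= MINIMUM_SCORE else 1)
```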
CI/CD Integration
All verification techniques described in this section MUST be integrated into the CI/CD pipeline and enforced as automated gates.
Pipeline Gate Configuration
```
PR Created (AI-assisted label detected)
        |
        v
Stage 1: Linting + Type Checking   --> Block on failure
        |
        v
Stage 2: Unit Tests + Coverage     --> Block if below thresholds
        |
        v
Stage 3: SAST + Dependency Scan    --> Block on Critical/High
        |
        v
Stage 4: Integration Tests         --> Block on failure
        |
        v
Stage 5: Mutation Testing          --> Block if score < 70%
        |
        v
Stage 6: Regression Suite          --> Block on failure
        |
        v
Human Review Gate                  --> Requires qualified reviewer approval
        |
        v
Merge Permitted
```
Teams MUST NOT configure CI pipelines to allow bypassing these gates for AI-generated code. Emergency bypass procedures MUST require written approval from a Tier 3 reviewer or engineering director and MUST be logged for audit purposes. For broader quality thresholds and architectural conformance checks, see Engineering Quality Standards.