Skip to main content

Production Monitoring & Drift Management

AI feature quality can degrade after release due to data drift, behavior drift, or changing user patterns. Monitoring MUST detect this early.

Telemetry Requirements

Minimum production telemetry:

  • Request volume and success/error rates
  • Latency percentiles and timeout rates
  • Quality proxy metrics (acceptance, correction, escalation)
  • Safety incidents and policy violations
  • Segment-level performance for critical cohorts

Drift Types

Drift TypeSignalResponse
Data driftInput distribution changes materiallyRe-evaluate model on fresh sample
Concept driftTarget behavior or business rules changedUpdate rubric/model/prompt and revalidate
Behavioral driftOutput style/quality degrades without infra failureTrigger incident triage and rollback decision

Alert Policy

  • Define warning and critical thresholds per metric.
  • Critical alerts MUST route to on-call owner within 5 minutes.
  • Repeated warnings over 7 days SHOULD trigger formal review.

Drift Response Runbook

  1. Detect and classify drift severity.
  2. Evaluate blast radius (users, revenue, compliance exposure).
  3. Apply mitigation: rollback, feature-flag reduction, or safe-mode fallback.
  4. Re-evaluate candidate fix against current data.
  5. Publish incident note and corrective action.

Monitoring Ownership

ResponsibilityOwner
Metric definition and SLOsProduct + Engineering
Dashboards and alertsPlatform/Observability
Drift triageAI feature on-call
Governance reportingCoE + Security