# Production Outages and RCA

## Incident response principles
- Stabilize first, analyze second.
- Speed of recovery beats perfect diagnosis.
- Blameless does not mean consequence-free: skip personal blame, but still assign follow-up work with owners and deadlines.

## Severity model (example)
- Sev0: total outage, revenue or safety at risk
- Sev1: major feature unavailable
- Sev2: degraded performance or partial impact
- Sev3: minor issue with workaround
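
The example levels above can be encoded so that tooling and people use the same labels. The following is a minimal sketch, not a standard: the `Severity` enum, the `classify` function, and its three boolean inputs are hypothetical names chosen for illustration, and a real team would likely base the decision on its own signals.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Example severity levels; a lower number is more severe."""
    SEV0 = 0  # total outage, revenue or safety at risk
    SEV1 = 1  # major feature unavailable
    SEV2 = 2  # degraded performance or partial impact
    SEV3 = 3  # minor issue with workaround


def classify(total_outage: bool, major_feature_down: bool, degraded: bool) -> Severity:
    """Return the most severe level whose condition holds (hypothetical inputs)."""
    if total_outage:
        return Severity.SEV0
    if major_feature_down:
        return Severity.SEV1
    if degraded:
        return Severity.SEV2
    return Severity.SEV3
```

Using `IntEnum` keeps the levels comparable, so a check such as `severity <= Severity.SEV1` can decide who gets paged.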

## First 15 minutes
- Declare the incident and assign roles.
- Stop the bleeding: rollback, feature flag off, or traffic shift.
- Start a shared, timestamped timeline of decisions and actions.
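
One way to picture the record these steps produce is a small incident object that holds the assigned roles and an append-only timeline. This is a sketch under assumptions: `Incident`, `TimelineEntry`, and `log` are illustrative names, and in practice the timeline usually lives in a chat channel or shared document rather than in code.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Optional


@dataclass
class TimelineEntry:
    at: datetime  # when the decision or action happened (UTC)
    author: str   # who recorded it
    note: str     # what was decided or done


@dataclass
class Incident:
    title: str
    roles: Dict[str, str] = field(default_factory=dict)  # role name -> person
    timeline: List[TimelineEntry] = field(default_factory=list)

    def log(self, author: str, note: str, at: Optional[datetime] = None) -> TimelineEntry:
        """Append an entry, stamping it with the current UTC time by default."""
        entry = TimelineEntry(at or datetime.now(timezone.utc), author, note)
        self.timeline.append(entry)
        return entry


# Usage: declare, assign roles, then record each mitigation decision.
incident = Incident("Checkout errors", roles={"commander": "alice", "comms": "bob"})
incident.log("alice", "Declared incident and assigned roles")
incident.log("alice", "Rolled back the latest deploy")
```

Recording times in UTC avoids ambiguity when responders sit in different time zones.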

## Communication
- Internal: short, frequent updates with clear status.
- External: honest impact, expected recovery time, and next update time.
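
The three elements of an external update can be kept consistent with a small template. The sketch below is illustrative only: `external_update`, its parameters, and the 30-minute default interval are assumptions, not a recommended cadence.

```python
from datetime import datetime, timedelta, timezone


def external_update(impact: str, expected_recovery: str, now: datetime,
                    next_update_in_minutes: int = 30) -> str:
    """Format a customer-facing update with impact, recovery estimate, and next update time."""
    next_update = now + timedelta(minutes=next_update_in_minutes)
    return (
        f"Impact: {impact}\n"
        f"Expected recovery: {expected_recovery}\n"
        f"Next update by: {next_update:%H:%M} UTC"
    )


# Usage with a fixed timestamp so the result is reproducible.
message = external_update(
    impact="Checkout is failing for some customers",
    expected_recovery="within 1 hour",
    now=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
)
```

Committing to a next update time, even when there is no news, is what keeps the updates frequent.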

## Root cause analysis (RCA)
### What to capture
- Timeline of events
- Trigger and contributing factors
- Customer impact
- Detection gaps and response gaps
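
Detection and response gaps become concrete once the timeline has timestamps. As a rough sketch, assuming three recorded moments (when impact started, when it was detected, and when it was mitigated), the gaps are simple differences. The function name `response_gaps` and the dictionary keys are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict


def response_gaps(started: datetime, detected: datetime,
                  mitigated: datetime) -> Dict[str, timedelta]:
    """Derive the detection gap and the response gap from three timeline timestamps."""
    return {
        "time_to_detect": detected - started,      # detection gap
        "time_to_mitigate": mitigated - detected,  # response gap
    }


# Usage with made-up timestamps for illustration.
gaps = response_gaps(
    started=datetime(2024, 1, 1, 12, 0),
    detected=datetime(2024, 1, 1, 12, 20),
    mitigated=datetime(2024, 1, 1, 12, 50),
)
```

A long `time_to_detect` points at alerting, while a long `time_to_mitigate` points at runbooks, tooling, or rollback speed.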

### What to avoid
- Personal blame
- Hypothetical fixes without owners
- Long investigations that produce no action

## Fixes that stick
- One or two high-leverage remediation items
- Clear owner and deadline
- Follow-up review to confirm the issue is actually solved
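
These criteria can be checked mechanically before an RCA is closed. The sketch below is one possible shape, with hypothetical names (`ActionItem`, `review_action_items`) and an assumed limit of two items taken from the first bullet.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    summary: str
    owner: Optional[str] = None
    due: Optional[date] = None
    verified: bool = False  # set after the follow-up review confirms the fix


def review_action_items(items: List[ActionItem], max_items: int = 2) -> List[str]:
    """Return human-readable problems; an empty list means the items pass."""
    problems = []
    if len(items) > max_items:
        problems.append(f"{len(items)} items listed; keep the {max_items} with the most leverage")
    for item in items:
        if not item.owner:
            problems.append(f"'{item.summary}' has no owner")
        if item.due is None:
            problems.append(f"'{item.summary}' has no deadline")
    return problems


# Usage: the second item is missing both an owner and a deadline.
items = [
    ActionItem("Add canary stage to deploys", owner="alice", due=date(2024, 2, 1)),
    ActionItem("Alert on checkout error rate"),
]
problems = review_action_items(items)
```

The `verified` flag is left for the follow-up review, so an item is only closed after someone confirms the issue is actually solved.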

## Post-incident learning
- Update runbooks
- Improve alerting and dashboards
- Reduce future blast radius
