Skip to content
← All runbooks

Incident response

Incident Response Procedure

Phase 12, general incident response for Atrium production.

Referenced from SECURITY.md.

Severity Definitions

LevelDefinitionExamplesResponse Time
P0Fund loss or unauthorized admin actionExploit, key compromise, unauthorized pauseImmediate
P1Service down or data integrity issueUI unreachable, incorrect balances shown< 1 hour
P2Performance degradationp95 > 5s, partial feature broken< 24 hours
P3Cosmetic or non-blockingTypo, minor UI glitchNext sprint

Procedure

1. Triage

  • Identify severity using definitions above
  • Assign incident commander (on-call, see runbooks/on-call-rotation.md)
  • Create incident channel in Discord: #incident-YYYY-MM-DD-<slug>

2. Communicate

SeverityInternalExternal
P0Discord #ops-alerts + #incident-*Twitter status update
P1Discord #ops-alerts + #incident-*Twitter if >30min downtime
P2Discord #ops-alertsNone
P3GitHub issueNone

3. Mitigate

P0, Fund safety:

# Emergency pause via Praetor multisig
cast send $PRAETOR_TIMELOCK "pause()" --private-key $MULTISIG_KEY_1
# Requires 2/3 multisig confirmation within 48h timelock
# For immediate action: use PosternKillSwitch
cast send $POSTERN_KILL_SWITCH "revokeAll(address)" $COMPROMISED_ACCOUNT

P1, Service restoration:

# Hotfix branch
git checkout -b hotfix/incident-YYYY-MM-DD
# Fix, test, push
git push -u origin hotfix/incident-YYYY-MM-DD
# Vercel auto-deploys preview; promote to production via Vercel UI

P2/P3, Ticket:

  • Create GitHub issue with incident label
  • Link to incident channel
  • Schedule for next sprint

4. Postmortem

Required for P0 and P1 within 48 hours.

Template: incidents/YYYY-MM-DD-<slug>.md

# Incident: <title>

**Date:** YYYY-MM-DD
**Severity:** P0/P1
**Duration:** X hours
**Impact:** <what users experienced>

## Timeline

- HH:MM, Alert fired
- HH:MM, Incident commander assigned
- HH:MM, Root cause identified
- HH:MM, Mitigation applied
- HH:MM, Service restored

## Root Cause (5 Whys)

1. Why did X happen? Because Y.
2. Why did Y happen? Because Z.
...

## Action Items

- [ ] <action>, owner, due date
- [ ] <action>, owner, due date

## Lessons Learned

<what we'll do differently>

5. Follow-Up

  • Action items tracked in docs/plan-tracker.md
  • Postmortem published to incidents/
  • Alert rules updated if detection was slow
  • Runbooks updated if response was unclear