Skip to content
← All runbooks

Incident response

Incident runbook: Archive weekly backtest cron failed

The archive service runs the weekly Atrium-vs-baseline backtest + publishes the result via ResearchAttestation.publishBacktest on chain. A failure means the next ResearchAttestation never lands and the /research dashboard renders stale.

Severity

  • SEV-3 if a single weekly run fails (next run can recover)
  • SEV-2 if the last 4 weekly runs all failed (research dashboard

staler than 30 days)

  • SEV-1 if the RESEARCH_SIGNER_KEY has been rotated but the GHA

secret is stale (signing failures with a now-public key)

Signals

  • .github/workflows/archive-weekly.yml shows red on the Actions tab
  • Discord ops webhook fires on if: failure()
  • Sentry events tagged service: archive
  • /research page: generatedAt field older than 14 days

Triage (30 min target)

  1. Open the failed workflow run. The notebook-execution step prints the

Python traceback if the model rebuild crashed.

  1. Check the publish-step: a failure here means

RESEARCH_SIGNER_KEY is wrong, or RESEARCH_CONTRACT_ADDR points at a redeployed instance.

  1. Check IPFS pinning: web3.storage token may have expired (free-tier

quota or token revocation).

Mitigations

SymptomFixRollback safe?
Notebook crashPin previous notebook commit; reopen issue with the failing cellyes
RESEARCH_SIGNER_KEY rotatedUpdate GHA secret + Vercel env; rerun cron via workflow_dispatchyes
IPFS pin 401Rotate WEB3_STORAGE_TOKEN; rerunyes
RPC throttleWait 60 min, rerun; if persistent, swap RPC endpointyes
Backtest produced impossible deltas (sanity-check fail)DO NOT publish; reopen issueyes

Resolution checklist

  • [ ] Latest archive run is green
  • [ ] On-chain ResearchAttestation.latestRoot() reflects the new run
  • [ ] /research page generatedAt is recent
  • [ ] Sentry events stop firing
  • [ ] Post-mortem in /incidents/ if SEV ≤ 2

Escalation contacts

  • On-call ops owner per runbooks/on-call-rotation.md
  • web3.storage support for IPFS pinning issues