
SRE (Site Reliability Engineering)

Monitoring, alerting, incident response, and reliability targets.

Monitoring

Monitor everything that can fail. But be intentional: every metric has a cost.

Metric Categories

Request metrics (RED)

  • Rate - Requests per second
  • Errors - Failed requests (5xx, timeouts)
  • Duration - Response time (p50, p95, p99)
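The three RED signals above can be computed from raw request records. A minimal sketch in Python (the `Request` record and `red_summary` helper are illustrative, not a real library API — in production these numbers come from your metrics pipeline):

```python
from dataclasses import dataclass

@dataclass
class Request:
    status: int          # HTTP status code
    duration_ms: float   # response time in milliseconds

def red_summary(requests: list[Request], window_seconds: float):
    """Rate (req/s), Errors (5xx fraction), Duration (p95 ms) over a window."""
    rate = len(requests) / window_seconds
    errors = sum(1 for r in requests if r.status >= 500)
    error_rate = errors / len(requests) if requests else 0.0
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank p95; real systems use histogram buckets instead.
    p95 = durations[int(0.95 * (len(durations) - 1))] if durations else 0.0
    return rate, error_rate, p95
```

Real monitoring systems track these as counters and histograms rather than keeping individual requests in memory, but the definitions are the same.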

Resource metrics (USE)

  • Utilization - CPU, memory, disk usage %
  • Saturation - Queue depth, threads waiting
  • Errors - Device errors, dropped packets

Business metrics

  • Signups, conversions, revenue
  • Feature usage
  • User retention

Metric Hygiene

  • Avoid high cardinality - Unique user IDs as labels explode your metrics storage
  • Use low cardinality - status_code, region, environment

  • Avoid vanity metrics - Page views without context
  • Track actionable metrics - Error rate, latency, business impact
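Why high cardinality hurts: the worst-case number of time series for a metric is roughly the product of the distinct values of each label. A quick illustration (label names and counts are made up):

```python
from math import prod

def max_series(label_cardinalities: dict[str, int]) -> int:
    """Worst-case series count: product of distinct values per label."""
    return prod(label_cardinalities.values())

# Low cardinality: bounded label sets stay cheap.
low = max_series({"status_code": 5, "region": 4, "environment": 3})

# High cardinality: a user_id label multiplies everything by your user count.
high = max_series({"status_code": 5, "region": 4, "user_id": 1_000_000})
```

Adding one unbounded label turns 60 series into 20 million, which is why IDs belong in logs or traces, not metric labels.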

Alerting

Only alert on conditions that require immediate human action.

Alert Rules

  1. Severity defines action

    • P1 (Critical) - Page someone now, business impact
    • P2 (High) - Alert during business hours
    • P3 (Low) - Report in standup, schedule fix
  2. Alert on outcomes, not causes

    • Bad: "Database connection pool at 80%"
    • Good: "API error rate > 1%"
  3. Prevent alert fatigue

    • One alert = one fix
    • No alerts that always fire
    • Test alerts weekly to ensure they work

Alert Template

Alert: Database Write Latency High
Severity: P2
Threshold: p99 latency > 500ms for 5 minutes
Action: Check slow query log, run query analysis
Runbook: docs/runbooks/slow-queries.md
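The "for 5 minutes" clause in the template above means the condition must hold across consecutive evaluation cycles before the alert fires, which filters out brief spikes. A minimal sketch (class and parameter names are illustrative, mimicking the `for:` clause in Prometheus-style alerting rules):

```python
class ForDurationAlert:
    """Fire only after the condition holds for N consecutive evaluations."""

    def __init__(self, threshold_ms: float, required_evals: int):
        self.threshold_ms = threshold_ms
        self.required = required_evals
        self.streak = 0

    def evaluate(self, p99_latency_ms: float) -> bool:
        if p99_latency_ms > self.threshold_ms:
            self.streak += 1
        else:
            self.streak = 0  # any healthy sample resets the clock
        return self.streak >= self.required

# e.g. evaluated once per minute: 500ms threshold held for 5 minutes
alert = ForDurationAlert(threshold_ms=500, required_evals=5)
```

With a one-minute evaluation interval, five consecutive breaches equal the template's "for 5 minutes".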

Incident Response

Every incident is a learning opportunity.

Severity Definition

| Level | Impact | Response Time | Team Communication |
| --- | --- | --- | --- |
| P1 | Service down | Immediate | Real-time in war room |
| P2 | Feature degraded | 30 minutes | Status updates during incident |
| P3 | Minor issues | Business hours | Report in standup |

Response Flow

  1. Declare incident - Establish war room, assign IC (Incident Commander)
  2. Acknowledge - Notify team immediately
  3. Investigate - IC directs investigation, collect facts
  4. Mitigate - Stop the bleeding first, find the root cause second
  5. Resolve - Implement fix, validate it works
  6. Communicate - Keep team updated
  7. Postmortem - Write blameless postmortem within 48 hours

Postmortem Sections

  • Timeline - What happened and when
  • Root cause - Why did this happen?
  • Impact - How many users affected, how long?
  • Mitigation - What did we do to stop it?
  • Action items - What prevents this next time? (Prioritized)

SLOs / SLIs

Know what reliability means for your service.

SLI (Service Level Indicator)

Measurable metric of your service health.

Examples:

  • API error rate (< 0.1%)
  • API latency p99 (< 200ms)
  • Database availability (>= 99.9%)
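A ratio-style SLI like the examples above is just good events over total events. A minimal sketch (function name is illustrative):

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """Fraction of successful events, as a percentage."""
    if total_events == 0:
        return 100.0  # no traffic: conventionally counts as meeting the SLI
    return 100.0 * good_events / total_events
```

For example, 999 successful requests out of 1000 gives a 99.9% availability SLI; an error-rate SLI is simply 100 minus this value.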

SLO (Service Level Objective)

Target for your SLI.

Example: "API error rate SLO: < 0.1% over 30 days"

Error Budget

If your SLO is 99.9%, your error budget is about 43 minutes of downtime per 30-day month.

Use it wisely:

  • Use budget to justify urgent fixes
  • Use budget to decide deployment timing
  • When budget exhausted - freeze changes, focus on stability
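An availability SLO converts directly into an allowed-downtime budget: total minutes in the window times the fraction you are allowed to fail. A sketch (function name is illustrative):

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)
```

A 99.9% SLO over 30 days yields 43.2 minutes of budget; tightening to 99.99% shrinks it to about 4.3 minutes, which is why each extra nine is dramatically more expensive.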

Runbooks

Every alert should link to a runbook.

Runbook Template

# Database Connection Pool Exhaustion

## Alert
"Database connections > 90% for 5 minutes"

## Investigation
1. Check current connections: SELECT count(*) FROM pg_stat_activity;
2. Find slow queries: SELECT query, mean_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;
   (on PostgreSQL 12 and earlier the column is mean_time)
3. Check for leaks: SELECT usename, state, count(*) FROM pg_stat_activity GROUP BY usename, state;

## Mitigation (Quick Fixes)
- Restart application servers
- Increase pool size (if safe)
- Kill long-idle connections:
  SELECT pg_terminate_backend(pid) FROM pg_stat_activity
  WHERE state = 'idle' AND state_change < now() - interval '10 minutes';

## Resolution (Permanent Fix)
- Fix slow query
- Add connection pooling (PgBouncer)
- Optimize database

## Escalation
If still not resolved, escalate to database team lead
