SRE (Site Reliability Engineering)
Monitoring, alerting, incident response, and reliability targets.
Monitoring
Monitor everything that can fail, but be intentional: every metric has a cost.
Metric Categories
Request metrics (RED)
- Rate - Requests per second
- Errors - Failed requests (5xx, timeouts)
- Duration - Response time (p50, p95, p99)
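As a rough illustration of the RED list above, the sketch below derives all three numbers from a batch of request records. The `Request` record shape and the `red_summary` helper are hypothetical names chosen for this example, and the percentile uses the simple nearest-rank method; a real metrics system would normally compute these from histograms instead.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One served request (hypothetical record shape, for illustration only)."""
    status: int                # HTTP status code
    duration_ms: float         # time taken to respond
    timed_out: bool = False    # True if the request hit a timeout


def percentile(sorted_values: list[float], pct: int) -> float:
    """Nearest-rank percentile of a sorted, non-empty list (pct is an integer 1..100)."""
    # Integer ceiling of pct * n / 100, which avoids floating-point rounding surprises.
    rank = (pct * len(sorted_values) + 99) // 100
    return sorted_values[max(rank, 1) - 1]


def red_summary(requests: list[Request], window_seconds: float) -> dict[str, float]:
    """Summarize Rate, Errors, and Duration for one observation window."""
    if not requests:
        raise ValueError("need at least one request to summarize")
    # Errors: 5xx responses and timeouts, as in the list above.
    failed = sum(1 for r in requests if r.status >= 500 or r.timed_out)
    durations = sorted(r.duration_ms for r in requests)
    return {
        "rate_rps": len(requests) / window_seconds,
        "error_ratio": failed / len(requests),
        "p50_ms": percentile(durations, 50),
        "p95_ms": percentile(durations, 95),
        "p99_ms": percentile(durations, 99),
    }
```

Reporting percentiles rather than an average keeps slow outliers visible, which is why the list above names p50, p95, and p99.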
Resource metrics (USE)
- Utilization - CPU, memory, disk usage %
- Saturation - Queue depth, threads waiting
- Errors - Device errors, dropped packets
Business metrics
- Signups, conversions, revenue
- Feature usage
- User retention
Metric Hygiene
- Avoid high cardinality - Unique user IDs as labels (explodes your metrics storage)
- Use low cardinality - status_code, region, environment
- Avoid vanity metrics - Page views (without context)
- Track actionable metrics - Error rate, latency, business impact
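To see why a user ID label is expensive, it helps to estimate how many time series one metric can produce. The sketch below assumes the worst case, where every combination of label values occurs; the `series_count` helper and the value counts are made up for illustration.

```python
from math import prod


def series_count(distinct_values_per_label: dict[str, int]) -> int:
    """Upper bound on time series for one metric: the product of distinct values per label."""
    return prod(distinct_values_per_label.values())


# Hypothetical label sets: 5 status codes, 4 regions, 3 environments.
low_cardinality = series_count({"status_code": 5, "region": 4, "environment": 3})

# The same metric with a user_id label, assuming 100,000 users.
high_cardinality = series_count(
    {"status_code": 5, "region": 4, "environment": 3, "user_id": 100_000}
)

print(low_cardinality)   # 60
print(high_cardinality)  # 6000000
```

Adding one unbounded label multiplies the series count by the number of distinct values, so a single label can dominate storage cost.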
Alerting
Only alert on conditions that require immediate human action.
Alert Rules
Severity defines action
- P1 (Critical) - Page someone now, business impact
- P2 (High) - Alert during business hours
- P3 (Low) - Report in standup, schedule fix
Alert on outcomes, not causes
- Bad: "Database connection pool at 80%"
- Good: "API error rate > 1%"
Prevent alert fatigue
- One alert = one fix
- No alerts that always fire
- Test alerts weekly to ensure they work
Alert Template
Alert: Database Write Latency High
Severity: P2
Threshold: p99 latency > 500ms for 5 minutes
Action: Check slow query log, run query analysis
Runbook: docs/runbooks/slow-queries.md
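The "for 5 minutes" clause in the template matters: the alert should fire only when the threshold is breached continuously, not on a single spike. A minimal sketch of that check follows, assuming samples arrive as `(timestamp_seconds, p99_ms)` pairs sorted oldest first; `should_fire` is a hypothetical helper, not part of any particular alerting tool.

```python
def should_fire(
    samples: list[tuple[float, float]],
    threshold_ms: float = 500.0,
    for_seconds: float = 300.0,
) -> bool:
    """Return True if the newest samples have all exceeded the threshold for at least for_seconds."""
    if not samples:
        return False
    latest_ts = samples[-1][0]
    breach_start = None
    # Walk backwards from the newest sample until the first non-breaching one.
    for ts, value in reversed(samples):
        if value <= threshold_ms:
            break
        breach_start = ts
    # Fire only if the unbroken breach spans the whole required duration.
    return breach_start is not None and latest_ts - breach_start >= for_seconds


# One sample per minute, all above 500 ms for a full five minutes.
sustained = [(t, 600.0) for t in range(0, 301, 60)]
# Same series, but latency dipped below the threshold at t=120.
interrupted = [(t, 400.0 if t == 120 else 600.0) for t in range(0, 301, 60)]

print(should_fire(sustained))    # True
print(should_fire(interrupted))  # False
```

Requiring a sustained breach is one way to avoid paging on brief spikes that resolve on their own, which supports the alert fatigue rules above.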
Incident Response
Every incident is a learning opportunity.
Severity Definition
| Level | Impact | Response Time | Team Communication |
|---|---|---|---|
| P1 | Service down | Immediate | Real-time in war room |
| P2 | Feature degraded | 30 minutes | Status updates during incident |
| P3 | Minor issues | Business hours | Report in standup |
Response Flow
- Declare incident - Establish war room, assign IC (Incident Commander)
- Acknowledge - Notify team immediately
- Investigate - IC directs investigation, collect facts
- Mitigate - Stop bleeding first, find root cause second
- Resolve - Implement fix, validate it works
- Communicate - Keep team updated
- Postmortem - Write blameless postmortem within 48 hours
Postmortem Sections
- Timeline - What happened and when
- Root cause - Why did this happen?
- Impact - How many users affected, how long?
- Mitigation - What did we do to stop it?
- Action items - What prevents this next time? (Prioritized)
SLOs / SLIs
Know what reliability means for your service.
SLI (Service Level Indicator)
A measurable indicator of your service's health.
Examples (the thresholds in parentheses are sample targets; the target itself belongs to the SLO):
- API error rate (< 0.1%)
- API latency p99 (< 200ms)
- Database availability (>= 99.9%)
SLO (Service Level Objective)
The target value for an SLI, measured over a defined time window.
Example: "API error rate SLO: < 0.1% over 30 days"
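The relationship between the two terms can be sketched in a few lines: the SLI is the measured value, and the SLO is the comparison against a target. The function names and request counts below are hypothetical, and the counts are assumed to cover the full 30-day window.

```python
def error_rate(failed: int, total: int) -> float:
    """SLI: fraction of requests that failed in the window (0.0 if there was no traffic)."""
    return failed / total if total else 0.0


def meets_slo(failed: int, total: int, max_error_rate: float = 0.001) -> bool:
    """SLO check: is the measured error rate below the target (0.1% by default)?"""
    return error_rate(failed, total) < max_error_rate


# Hypothetical 30-day totals.
print(meets_slo(failed=500, total=1_000_000))    # True  (0.05% error rate)
print(meets_slo(failed=2_000, total=1_000_000))  # False (0.2% error rate)
```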
Error Budget
If your availability SLO is 99.9%, your error budget is the remaining 0.1%: about 43 minutes of downtime per 30-day month (0.1% of 43,200 minutes).
Use it wisely:
- Use budget to justify urgent fixes
- Use budget to decide deployment timing
- When budget exhausted - freeze changes, focus on stability
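The budget arithmetic and the freeze rule above can be written down directly. This is a minimal sketch that treats the budget purely as downtime minutes in a 30-day window; the function names are hypothetical, and many teams count failed requests instead of minutes.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO such as 0.999."""
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget


def should_freeze(slo: float, downtime_minutes: float, window_days: int = 30) -> bool:
    """Apply the rule above: freeze changes once the budget is exhausted."""
    return budget_remaining(slo, downtime_minutes, window_days) <= 0


print(round(error_budget_minutes(0.999), 1))   # 43.2
print(should_freeze(0.999, downtime_minutes=10))  # False
print(should_freeze(0.999, downtime_minutes=50))  # True
```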
Runbooks
Every alert should link to a runbook.
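One way to enforce this rule is a small check that rejects alert definitions with missing fields. The sketch below assumes alerts are stored as dictionaries whose keys mirror the Alert Template fields; the `missing_fields` helper and that storage format are assumptions for illustration.

```python
# Field names mirror the Alert Template; the dictionary format itself is an assumption.
REQUIRED_FIELDS = ("alert", "severity", "threshold", "action", "runbook")


def missing_fields(alert: dict[str, str]) -> list[str]:
    """Return the required fields that are absent or blank in an alert definition."""
    return [field for field in REQUIRED_FIELDS if not alert.get(field, "").strip()]


complete = {
    "alert": "Database Write Latency High",
    "severity": "P2",
    "threshold": "p99 latency > 500ms for 5 minutes",
    "action": "Check slow query log, run query analysis",
    "runbook": "docs/runbooks/slow-queries.md",
}
no_runbook = {**complete, "runbook": ""}

print(missing_fields(complete))    # []
print(missing_fields(no_runbook))  # ['runbook']
```

A check like this could run in CI so that an alert without a runbook never reaches production.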
Runbook Template
# Database Connection Pool Exhaustion
## Alert
"Database connections > 90% for 5 minutes"
## Investigation
1. Check current connections: SELECT count(*) FROM pg_stat_activity;
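2. Find slow queries: SELECT query, calls, mean_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10; (requires the pg_stat_statements extension; the column is named mean_time before PostgreSQL 13)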
3. Check for leaks: SELECT usename, count(*) FROM pg_stat_activity GROUP BY usename;
## Mitigation (Quick Fixes)
- Restart application servers
- Increase pool size (if safe)
- Kill idle connections: SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND pid <> pg_backend_pid();
## Resolution (Permanent Fix)
- Fix slow query
- Add connection pooling (PgBouncer)
- Optimize database
## Escalation
If still not resolved, escalate to database team lead