
SRE (Site Reliability Engineering)

Monitoring, alerting, incident response, and reliability targets.

Monitoring

Monitor everything that can fail. But be intentional: every metric has a cost.

Metric Categories

Request metrics (RED)

  • Rate - Requests per second
  • Errors - Failed requests (5xx, timeouts)
  • Duration - Response time (p50, p95, p99)
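The three RED signals above can be computed from raw request records. A minimal sketch in Python (the `Request` record and `red_summary` helper are illustrative, not a real library API — in production these numbers come from your metrics pipeline):

```python
from dataclasses import dataclass

@dataclass
class Request:
    status: int          # HTTP status code
    duration_ms: float   # response time in milliseconds

def red_summary(requests: list[Request], window_seconds: float):
    """Rate (req/s), Errors (5xx fraction), Duration (p95 ms) over a window."""
    rate = len(requests) / window_seconds
    errors = sum(1 for r in requests if r.status >= 500)
    error_rate = errors / len(requests) if requests else 0.0
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank p95; real systems use histogram buckets instead.
    p95 = durations[int(0.95 * (len(durations) - 1))] if durations else 0.0
    return rate, error_rate, p95
```

Real monitoring systems track these as counters and histograms rather than keeping individual requests in memory, but the definitions are the same.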

Resource metrics (USE)

  • Utilization - CPU, memory, disk usage %
  • Saturation - Queue depth, threads waiting
  • Errors - Device errors, dropped packets

Business metrics

  • Signups, conversions, revenue
  • Feature usage
  • User retention

Metric Hygiene

  • Avoid high cardinality - Unique user IDs as labels explode your metrics storage
  • Use low cardinality - status_code, region, environment

  • Avoid vanity metrics - Page views without context
  • Track actionable metrics - Error rate, latency, business impact
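Why high cardinality hurts: the worst-case number of time series for a metric is roughly the product of the distinct values of each label. A quick illustration (label names and counts are made up):

```python
from math import prod

def max_series(label_cardinalities: dict[str, int]) -> int:
    """Worst-case series count: product of distinct values per label."""
    return prod(label_cardinalities.values())

# Low cardinality: bounded label sets stay cheap.
low = max_series({"status_code": 5, "region": 4, "environment": 3})

# High cardinality: a user_id label multiplies everything by your user count.
high = max_series({"status_code": 5, "region": 4, "user_id": 1_000_000})
```

Adding one unbounded label turns 60 series into 20 million, which is why IDs belong in logs or traces, not metric labels.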

Alerting

Only alert on conditions that require immediate human action.

Alert Rules

  1. Severity defines action

    • P1 (Critical) - Page someone now, business impact
    • P2 (High) - Alert during business hours
    • P3 (Low) - Report in standup, schedule fix
  2. Alert on outcomes, not causes

    • Bad: "Database connection pool at 80%"
    • Good: "API error rate > 1%"
  3. Prevent alert fatigue

    • One alert = one fix
    • No alerts that always fire
    • Test alerts weekly to ensure they work

Alert Template

Alert: Database Write Latency High
Severity: P2
Threshold: p99 latency > 500ms for 5 minutes
Action: Check slow query log, run query analysis
Runbook: docs/runbooks/slow-queries.md
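The "for 5 minutes" clause in the template above means the condition must hold across consecutive evaluation cycles before the alert fires, which filters out brief spikes. A minimal sketch (class and parameter names are illustrative, mimicking the `for:` clause in Prometheus-style alerting rules):

```python
class ForDurationAlert:
    """Fire only after the condition holds for N consecutive evaluations."""

    def __init__(self, threshold_ms: float, required_evals: int):
        self.threshold_ms = threshold_ms
        self.required = required_evals
        self.streak = 0

    def evaluate(self, p99_latency_ms: float) -> bool:
        if p99_latency_ms > self.threshold_ms:
            self.streak += 1
        else:
            self.streak = 0  # any healthy sample resets the clock
        return self.streak >= self.required

# e.g. evaluated once per minute: 500ms threshold held for 5 minutes
alert = ForDurationAlert(threshold_ms=500, required_evals=5)
```

With a one-minute evaluation interval, five consecutive breaches equal the template's "for 5 minutes".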

Incident Response

Every incident is a learning opportunity.

Severity Definition

| Level | Impact | Response Time | Team Communication |
| --- | --- | --- | --- |
| P1 | Service down | Immediate | Real-time in war room |
| P2 | Feature degraded | 30 minutes | Status updates during incident |
| P3 | Minor issues | Business hours | Report in standup |

Response Flow

  1. Declare incident - Establish war room, assign IC (Incident Commander)
  2. Acknowledge - Notify team immediately
  3. Investigate - IC directs investigation, collect facts
  4. Mitigate - Stop the bleeding first, find the root cause second
  5. Resolve - Implement fix, validate it works
  6. Communicate - Keep team updated
  7. Postmortem - Write blameless postmortem within 48 hours

Postmortem Sections

  • Timeline - What happened and when
  • Root cause - Why did this happen?
  • Impact - How many users affected, how long?
  • Mitigation - What did we do to stop it?
  • Action items - What prevents this next time? (Prioritized)

SLOs / SLIs

Know what reliability means for your service.

SLI (Service Level Indicator)

Measurable metric of your service health.

Examples:

  • API error rate (< 0.1%)
  • API latency p99 (< 200ms)
  • Database availability (>= 99.9%)
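A ratio-style SLI like the examples above is just good events over total events. A minimal sketch (function name is illustrative):

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """Fraction of successful events, as a percentage."""
    if total_events == 0:
        return 100.0  # no traffic: conventionally counts as meeting the SLI
    return 100.0 * good_events / total_events
```

For example, 999 successful requests out of 1000 gives a 99.9% availability SLI; an error-rate SLI is simply 100 minus this value.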

SLO (Service Level Objective)

Target for your SLI.

Example: "API error rate SLO: < 0.1% over 30 days"

Error Budget

If your SLO is 99.9%, your error budget is about 43 minutes of downtime per 30-day month.

Use it wisely:

  • Use budget to justify urgent fixes
  • Use budget to decide deployment timing
  • When budget exhausted - freeze changes, focus on stability
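An availability SLO converts directly into an allowed-downtime budget: total minutes in the window times the fraction you are allowed to fail. A sketch (function name is illustrative):

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)
```

A 99.9% SLO over 30 days yields 43.2 minutes of budget; tightening to 99.99% shrinks it to about 4.3 minutes, which is why each extra nine is dramatically more expensive.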

Runbooks

Every alert should link to a runbook.

Runbook Template

# Database Connection Pool Exhaustion

## Alert
"Database connections > 90% for 5 minutes"

## Investigation
1. Check current connections: SELECT count(*) FROM pg_stat_activity;
2. Find slow queries: SELECT query, mean_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;
   (on PostgreSQL 12 and earlier the column is mean_time)
3. Check for leaks: SELECT usename, state, count(*) FROM pg_stat_activity GROUP BY usename, state;

## Mitigation (Quick Fixes)
- Restart application servers
- Increase pool size (if safe)
- Kill long-idle connections:
  SELECT pg_terminate_backend(pid) FROM pg_stat_activity
  WHERE state = 'idle' AND state_change < now() - interval '10 minutes';

## Resolution (Permanent Fix)
- Fix slow query
- Add connection pooling (PgBouncer)
- Optimize database

## Escalation
If still not resolved, escalate to database team lead
