Site Reliability Engineering

Site Reliability Engineering: Reliability-by-Design Mentality

Focus: System behavior over time
Strength: Monitoring, automation, self-healing systems
Approach:
- Encode operational knowledge into code
- Measure, detect, and react automatically
- Minimize toil
Question asked:
“How do we make the system fix itself and page humans only for the unknown?”

1. Overview

Site Reliability Engineering (SRE) is a discipline that applies software engineering principles to infrastructure and operations. The goal is to create reliable, scalable, and automated systems by encoding operational knowledge into software.

In practice, this means building self-healing systems, reducing toil, and measuring reliability against explicit, numeric targets.

2. Core Principles

Automation First
- Toil (manual, repetitive operational work that grows with the size of the service) should be automated wherever possible.
- Common examples:
  - Automated service restarts
  - SSL certificate renewal
  - Deployment pipelines
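
As an illustration of automating a service restart, here is a minimal sketch of a remediation loop. The names `remediate`, `is_healthy`, `restart`, and `max_restarts` are hypothetical and chosen for this example; in a real setup the two callables would wrap whatever health check and restart mechanism the service actually uses.

```python
from typing import Callable


def remediate(
    is_healthy: Callable[[], bool],
    restart: Callable[[], None],
    max_restarts: int = 3,
) -> str:
    """Restart a service until it is healthy or the restart budget runs out.

    Returns "healthy" if nothing had to be done, "recovered" if a restart
    fixed the problem, and "escalate" if automation ran out of options.
    """
    if is_healthy():
        return "healthy"
    for _ in range(max_restarts):
        restart()
        if is_healthy():
            return "recovered"
    # The known fix did not work, so this is a case for a human.
    return "escalate"
```

Capping the number of restarts is the important design choice here: without a cap, the automation could hide a persistent fault behind an endless restart loop.
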
Monitoring & Observability
- Systems must be observable through metrics, logs, and traces.
- Key metrics:
  - CPU, memory, disk
  - Network and I/O
  - Application health (latency, errors, throughput)
  - SSL certificate expiration
  - Nginx or web server availability
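
To make one of these metrics concrete, the sketch below turns certificate expiration into a number that can be graphed and alerted on. It uses only the Python standard library. The function names and the 30-day and 7-day thresholds are assumptions made for this example, not recommended values.

```python
import socket
import ssl
from datetime import datetime, timezone


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the certificate's notAfter field, e.g. 'Jan  5 09:34:43 2018 GMT'."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` (timezone-aware) and the certificate's expiry time."""
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def expiry_status(days_left: float, warn_days: int = 30, critical_days: int = 7) -> str:
    """Map remaining days to a status (thresholds are illustrative)."""
    if days_left <= critical_days:
        return "critical"
    if days_left <= warn_days:
        return "warning"
    return "ok"
```

A scheduled job could call `fetch_not_after` for each monitored host, export `days_until_expiry` as a gauge, and trigger renewal automatically once the status leaves `"ok"`.
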
SLIs, SLOs, and SLAs
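- SLI (Service Level Indicator): A quantitative measure of service health (e.g., the fraction of HTTP requests that return a successful response).
- SLO (Service Level Objective): Target value for an SLI over a time window (e.g., 99.9% of requests succeed over 30 days).
- SLA (Service Level Agreement): Formal agreement with customers based on SLOs, typically with consequences if the target is missed.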
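
The relationship between an SLI and an SLO is often expressed as an error budget: the amount of failure the SLO still permits. The following is a minimal sketch of that arithmetic, assuming a request-based availability SLI. The function names and the 30-day window are illustrative choices.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of requests that succeeded; treated as 1.0 when there was no traffic."""
    if total_requests == 0:
        return 1.0
    return good_requests / total_requests


def error_budget_remaining(good_requests: int, total_requests: int, slo: float) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is missed)."""
    allowed_failures = (1.0 - slo) * total_requests
    actual_failures = total_requests - good_requests
    if allowed_failures == 0:
        # A 100% SLO (or zero traffic) leaves no room for any failure.
        return 1.0 if actual_failures == 0 else float("-inf")
    return 1.0 - actual_failures / allowed_failures


def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime a time-based availability SLO permits over the window."""
    return (1.0 - slo) * window_days * 24 * 60
```

For example, a 99.9% SLO over 30 days allows roughly 43 minutes of downtime, and a service that failed 500 of 1,000,000 requests against that SLO has spent about half of its budget.
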
Reliability by Design
- Systems should recover automatically from failures.
- Include guardrails:
  - Retry logic
  - Circuit breakers
  - Safe rollbacks
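
Two of these guardrails can be sketched in a few lines: retry with exponential backoff, and a simple circuit breaker. This is a simplified illustration, not a production implementation. The class and parameter names are hypothetical, and real code would usually retry only errors known to be transient and add jitter to the delays.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and calls are rejected without being tried."""


def retry(op: Callable[[], T], attempts: int = 3, base_delay: float = 0.1,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `op` up to `attempts` times, doubling the delay after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return op()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")


class CircuitBreaker:
    """Fail fast after repeated failures, then allow a trial call after a cool-down."""

    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, op: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: let this one call through as a trial (half-open).
        try:
            result = op()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None
        return result
```

The two are complementary: retries absorb brief, transient failures, while the breaker stops callers from piling more load onto a dependency that is clearly down.
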
Incident Management
- Automated alerts notify humans only when necessary.
- Postmortems document root causes and lessons learned.
- Continuous improvement reduces repeated incidents.
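
One way to express "notify humans only when necessary" is as an explicit routing rule. The sketch below is a deliberately small example; the `Alert` fields and the three destinations (`"log"`, `"page"`, `"ticket"`) are assumptions made for illustration, and a real policy would likely consider more signals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    name: str
    user_impact: bool      # are users affected right now?
    auto_remediated: bool  # did automation already resolve it?


def route(alert: Alert) -> str:
    """Decide where an alert goes so that pages are reserved for urgent cases."""
    if alert.auto_remediated:
        return "log"     # keep a record for trend analysis; interrupt nobody
    if alert.user_impact:
        return "page"    # unresolved and user-facing: wake someone up
    return "ticket"      # unresolved but not urgent: handle in working hours
```

Logging auto-remediated alerts instead of discarding them keeps the data needed for postmortems and for spotting faults that recur.
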