ISTEB Foundations in Site Reliability Engineering

ISTEB Foundations in SRE Intermediate — Quiz 2 Study Guide

Modern software systems are too large and complex to manage by hand. Site Reliability Engineers rely on automation, standardized tooling, and repeatable processes to keep infrastructure consistent, recoverable, and secure. This lesson covers the core concepts you'll need to master for Quiz 2 — from writing infrastructure code to deploying safely across multiple clouds.


Infrastructure as Code (IaC)

Infrastructure as Code means defining your servers, networks, databases, and other resources in text files — just like application code. Instead of clicking through a cloud console, you write a file that describes what you want, and a tool provisions it for you.

Primary Benefits

  • Repeatability: spin up identical environments every time
  • Auditability: track every change in version control
  • Speed: provision hundreds of resources in minutes
  • Reduced human error: no more missed checkboxes in a UI

Declarative vs. Imperative

    Style        What you write                                Example tools
    Declarative  *What* the end state should look like         Terraform, Puppet
    Imperative   *How* to get to the end state, step by step   Ansible (procedural style), Bash scripts

    # Declarative Terraform example: describe the desired state
    resource "aws_instance" "web" {
      ami           = "ami-0c55b159cbfafe1f0"
      instance_type = "t2.micro"
    }
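
To make the declarative idea concrete, here is a minimal Python sketch of what a declarative tool does on your behalf: it compares a desired state with the current state and works out the steps itself. The `plan` function and the resource dictionaries are hypothetical illustrations, not Terraform's actual implementation.

```python
def plan(desired: dict, current: dict) -> list[tuple[str, str]]:
    """Return the (action, resource_name) steps needed to reach `desired`."""
    actions = []
    for name, spec in desired.items():
        if name not in current:
            actions.append(("create", name))
        elif current[name] != spec:
            actions.append(("update", name))
    for name in current:
        if name not in desired:
            actions.append(("delete", name))
    return actions


# Hypothetical resources: the author only describes the end state.
desired = {"web": {"instance_type": "t2.micro"}}
current = {"web": {"instance_type": "t2.small"}, "old-db": {"instance_type": "t2.micro"}}

print(plan(desired, current))  # [('update', 'web'), ('delete', 'old-db')]
```

An imperative script would instead spell out each of those steps by hand, in order.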


    Popular IaC & Configuration Management Tools

    Terraform

    Terraform is one of the most widely used declarative IaC tools. It talks to cloud provider APIs and records what it has created in a state file (terraform.tfstate by default). Terraform treats the state file as its source of truth for mapping your configuration to real-world resources.

  • Modules: reusable, shareable blocks of Terraform configuration (think functions for infrastructure)
  • Drift detection: terraform plan compares the recorded state against the actual infrastructure and reports changes made outside Terraform
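
Drift detection can be pictured as a comparison between what the state file recorded and what the provider reports now. The sketch below is a simplified Python illustration. `recorded` and `actual` are hypothetical dictionaries standing in for the state file and a provider API response, and `find_drift` is not a Terraform function.

```python
def find_drift(recorded: dict, actual: dict) -> dict:
    """Map each drifted resource to its changed attributes as (recorded, actual) pairs."""
    drift = {}
    for name, attrs in recorded.items():
        live = actual.get(name)
        if live is None:
            # The resource was deleted outside the tool.
            drift[name] = {"__resource__": ("present", "missing")}
            continue
        changed = {
            key: (value, live.get(key))
            for key, value in attrs.items()
            if live.get(key) != value
        }
        if changed:
            drift[name] = changed
    return drift


recorded = {"web": {"instance_type": "t2.micro", "port": 443}}
actual = {"web": {"instance_type": "t2.large", "port": 443}}

print(find_drift(recorded, actual))  # {'web': {'instance_type': ('t2.micro', 't2.large')}}
```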

    Ansible

    Ansible is an agentless configuration management tool that uses YAML "playbooks." It connects over SSH and runs tasks in order — making it more imperative in style.

    # Ansible playbook snippet
    - name: Install nginx
      apt:
        name: nginx
        state: present

    Puppet

    Puppet uses a declarative, agent-based model. Agents on each server regularly "pull" the desired configuration from a Puppet server and enforce it — great for large fleets.


    Idempotence

    Idempotence means running the same operation multiple times produces the same result as running it once. This is critical in automation.

    Analogy: Pressing the elevator button ten times doesn't make it arrive ten times faster — and it doesn't cause ten elevators to arrive. One press = one result.

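    If your Ansible playbook says "nginx should be installed," running it 5 times won't install nginx 5 times. The module checks first, and if nginx is already present it does nothing. Non-idempotent scripts (for example, one that appends a line to a config file with echo >> on every run, or calls useradd without checking whether the user exists) can create duplicates or fail on re-runs.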
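
The difference can be sketched in Python with two hypothetical helpers that edit a list of config lines. `append_line` behaves like a naive script, while `ensure_line` checks the current state first, which is roughly the pattern idempotent modules follow.

```python
def append_line(lines: list[str], line: str) -> list[str]:
    """Non-idempotent: adds the line on every call."""
    return lines + [line]


def ensure_line(lines: list[str], line: str) -> list[str]:
    """Idempotent: adds the line only if it is missing."""
    return lines if line in lines else lines + [line]


config = ["worker_processes 4;"]
setting = "server_tokens off;"

once = ensure_line(config, setting)
twice = ensure_line(once, setting)
print(once == twice)  # True: the second run changed nothing

appended = append_line(append_line(config, setting), setting)
print(appended.count(setting))  # 2: the naive version duplicated the line
```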


    Version Control and Git

    Every IaC file should live in a version control system. Git is the standard.

  • Why it matters for IaC: you get a full history of infrastructure changes, can roll back bad configs, and enforce peer review via pull requests
  • Branching strategies protect production — changes go through dev → staging → main
  • Git enables CI/CD pipelines to trigger automatically when code is merged

    # Assumes an existing clone with a remote named "origin"
    git checkout -b feature/add-web-server
    git add main.tf
    git commit -m "Add web server resource"
    git push -u origin feature/add-web-server


    CI/CD Pipelines and Automation

    A CI/CD pipeline automates the journey from code commit to deployed infrastructure. In SRE, pipelines typically:

  • Lint the code (catch syntax errors and style issues)
  • Run policy as code checks (e.g., OPA, Sentinel) to enforce compliance rules
  • Run terraform plan or equivalent dry-run
  • Apply changes to staging, then production

    Linting and code quality checks catch many bad configs before they reach production. Tools like tflint for Terraform or ansible-lint for Ansible flag problems early.
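
The stage ordering above can be sketched as a small Python runner that stops at the first failing stage, so a failed lint or policy check never reaches the apply steps. The stage names and the lambda checks are hypothetical placeholders. A real pipeline would invoke the actual tools from its CI system.

```python
from typing import Callable


def run_pipeline(stages: list[tuple[str, Callable[[], bool]]]) -> list[str]:
    """Run stages in order and stop at the first failure; return the names of stages that ran."""
    ran = []
    for name, check in stages:
        ran.append(name)
        if not check():
            print(f"{name} failed, stopping before later stages")
            break
    return ran


stages = [
    ("lint", lambda: True),
    ("policy", lambda: False),  # simulate a policy violation
    ("plan", lambda: True),
    ("apply-staging", lambda: True),
    ("apply-production", lambda: True),
]

print(run_pipeline(stages))  # ['lint', 'policy']
```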


    Security Concepts

    Secrets Management

    Never store passwords, API keys, or certificates in your Git repository. Use dedicated secrets management tools:
  • HashiCorp Vault, AWS Secrets Manager, Azure Key Vault
  • Inject secrets at runtime, not at commit time
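
A minimal Python sketch of runtime injection: the application reads the secret from its environment when it starts, so nothing sensitive is committed. The variable name `DB_PASSWORD` and the helper `get_secret` are assumptions for illustration. In practice the value would be placed in the environment by the pipeline or fetched from a secrets manager.

```python
import os


def get_secret(name: str) -> str:
    """Read a secret from the environment at runtime; fail loudly if it is missing."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Required secret {name} is not set")
    return value


# Demonstration only: a real deployment sets this outside the code, never in the repository.
os.environ["DB_PASSWORD"] = "example-only"

password = get_secret("DB_PASSWORD")
print(len(password) > 0)  # True
```

Failing at startup when the secret is missing is a deliberate choice here: it surfaces a misconfigured environment immediately instead of at the first database call.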

    Policy as Code & Compliance

    Policy as code encodes security and compliance rules as machine-readable files. Tools like Open Policy Agent (OPA) can block a Terraform plan that would open port 22 to the world — automatically, before anything is deployed.
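
The port 22 rule can be illustrated with a short Python check. This is only a stand-in for the idea: OPA policies are written in Rego, and the `planned_rules` structure below is a hypothetical simplification of what a Terraform plan contains.

```python
def violations(planned_rules: list[dict]) -> list[str]:
    """Return a message for every rule that exposes SSH (port 22) to any address."""
    found = []
    for rule in planned_rules:
        covers_ssh = rule["from_port"] <= 22 <= rule["to_port"]
        open_to_world = "0.0.0.0/0" in rule["cidr_blocks"]
        if covers_ssh and open_to_world:
            found.append(f"{rule['name']}: port 22 is open to 0.0.0.0/0")
    return found


planned_rules = [
    {"name": "https", "from_port": 443, "to_port": 443, "cidr_blocks": ["0.0.0.0/0"]},
    {"name": "ssh", "from_port": 22, "to_port": 22, "cidr_blocks": ["0.0.0.0/0"]},
]

print(violations(planned_rules))  # ['ssh: port 22 is open to 0.0.0.0/0']
```

A pipeline would fail the build whenever the returned list is non-empty.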

    Risks of Poor IaC Practices

  • Exposed credentials in code
  • Unreviewed infrastructure changes
  • Configuration drift leading to security gaps


    Immutability and Deployment Strategies

    Immutable Infrastructure

    Instead of patching a running server, you replace it entirely with a new instance built from a pre-baked image. This eliminates "snowflake servers": unique, hand-configured machines that are hard to reproduce.

    Blue/Green Deployments

    Run two identical production environments: blue (current) and green (new). Once green passes its tests, the load balancer switches traffic to it. If something breaks, switch traffic back to blue, which is still running the previous version.

    [Users] → [Load Balancer] → [Blue Environment] (active)
                              → [Green Environment] (staging new version)

    This strategy supports incident recovery — you always have a known-good environment to fall back to.
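
The switch and fallback can be sketched in Python as a tiny router that remembers the previously active environment. `BlueGreenRouter` is a hypothetical stand-in for a load balancer, and the `healthy` flag represents whatever tests you run against the new environment.

```python
class BlueGreenRouter:
    """Tracks which environment receives traffic (hypothetical stand-in for a load balancer)."""

    def __init__(self) -> None:
        self.active = "blue"
        self.previous = None

    def switch_to(self, target: str, healthy: bool) -> bool:
        """Send traffic to `target` only if its health checks passed."""
        if not healthy:
            return False
        self.previous, self.active = self.active, target
        return True

    def rollback(self) -> None:
        """Return traffic to the previously active environment."""
        if self.previous is not None:
            self.active, self.previous = self.previous, self.active


router = BlueGreenRouter()
router.switch_to("green", healthy=True)
print(router.active)  # green

router.rollback()
print(router.active)  # blue
```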


    Multi-Cloud and Load Testing

    Multi-Cloud

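    Multi-cloud means running workloads across more than one provider, such as AWS, GCP, and Azure, at the same time. IaC tools like Terraform help here because the same workflow (write, plan, apply) applies regardless of provider, although the resource definitions themselves remain provider-specific. Key concern: keep provider-specific details isolated in your modules to limit vendor lock-in.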

    Load Testing and SLOs

    A Service Level Objective (SLO) defines a target for reliability (e.g., "99.9% of requests complete in under 200ms"). Load testing validates that your infrastructure can actually meet those SLOs under realistic traffic before you go live. Tools like k6, Locust, or JMeter simulate thousands of users.
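
Checking load-test results against the example SLO comes down to simple arithmetic: count the requests under the latency threshold and compare that fraction with the target. The Python sketch below assumes you already have a list of latency samples in milliseconds. The function name `slo_met` and the sample data are hypothetical.

```python
def slo_met(latencies_ms: list[float], threshold_ms: float, target: float) -> bool:
    """True if at least `target` fraction of requests finished under `threshold_ms`."""
    if not latencies_ms:
        raise ValueError("no samples to evaluate")
    fast = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return fast / len(latencies_ms) >= target


# 1,000 hypothetical samples: 998 fast requests and 2 slow ones.
samples = [120.0] * 998 + [450.0] * 2

print(slo_met(samples, threshold_ms=200, target=0.999))  # False: only 99.8% were under 200 ms
```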


    Key Takeaways

  • IaC treats infrastructure like software — it should be version-controlled, reviewed, tested, and deployed through automated pipelines, never manually.
  • Idempotence is non-negotiable in automation: your scripts and playbooks must be safe to run repeatedly without side effects.
  • Terraform manages state, detects drift, and uses modules for reuse; Ansible and Puppet handle configuration management with different agent and style trade-offs.
  • Security must be built into the pipeline — use secrets management tools, policy as code, and linting to catch problems before they reach production.
  • Immutability and blue/green deployments reduce risk and speed up incident recovery by ensuring you always have a clean, reproducible environment to fall back on.