$ curl localhost:9090/metrics

Monitoring & Observability
Instructor Guide

Prometheus + Grafana + Loki — the open-source toolkit that runs on every modern DevOps team's wall.

01
Why Monitoring Matters — and Observability vs Monitoring
"You can't fix what you can't see." The 3 questions every DevOps engineer must be able to answer in 30 seconds.

How to explain to students

Open with: "It's 2 AM. Your phone buzzes — production is down. You have 5 minutes to answer three questions: Is it really broken? Where is it broken? What changed?" If your team can't answer those, you don't have monitoring — you have guessing.

Monitoring tells you something is wrong ("CPU is at 95%"). Observability lets you ask new questions you didn't think to ask in advance ("Why does CPU spike only when a Karachi user uploads a PDF on Wednesdays?"). The first is dashboards + alerts. The second is high-cardinality data: structured logs and traces, alongside well-labelled metrics.

on-call.sh — the 3 questions
# 02:14 → PagerDuty: "Site is down"

# Q1. Is it really broken? (or just one user?)
$ curl -s -o /dev/null -w "%{http_code}" https://api.example.com/health
503
$ open https://grafana.example.com/d/overview
Error rate spiked from 0.2% → 28% at 02:11 UTC ✓ confirmed broken

# Q2. Where is it broken? (which service / region / version?)
$ grafana → "errors by service"
checkout-api: 90% of errors — others healthy
→ all errors from instances tagged version=v2.4.1

# Q3. What changed? (deploy / config / dependency?)
$ git log --since="3 hours ago" --oneline
a3f1c2 deploy v2.4.1 → checkout-api (02:08 UTC) ← 3 min before alert

# Decision: rollback to v2.4.0. Postmortem after.
$ aws ecs update-service --service checkout-api --task-definition checkout:42
🚨
Detect
Alarms find problems within minutes, not hours.
🔍
Diagnose
Dashboards + logs scope the blast radius — which service, which version.
↩️
Decide
Rollback, scale up, or page a specialist. With data, not gut feel.
📚
Learn
Every incident becomes a postmortem and a new alert / dashboard.

🎯 Practice Questions

Q1.
In one sentence each, define monitoring and observability. Pick the right word for: (a) "alert when DB CPU > 80%", (b) "find out which user's request triggered the 500".
Answer:
Monitoring = predefined dashboards + alerts on known signals. Answers "is the thing I'm watching healthy?"
Observability = ability to explore the unknown — slice metrics by arbitrary labels, search structured logs, follow traces. Answers "why is this user's request slow?"

(a) Monitoring — you knew in advance you cared about CPU.
(b) Observability — you're asking a new question that requires high-cardinality data (user ID, request ID).
Q2.
Your service has zero monitoring today. Pick the first two things to instrument before adding anything else. Justify in one sentence each.
💡 Think "is it up?" + "is it deploying?"
Q3.
A teammate says, "We have alerts when CPU hits 90% — that's enough." Why is alerting on resource utilisation alone often wrong? What's a better signal?
02
The Three Pillars — Metrics, Logs, Traces
Each answers a different question. Master all three and you can debug anything.

How to explain to students

Metrics = numbers over time. Cheap to store, fast to aggregate, perfect for dashboards and alerts. Tools: Prometheus, CloudWatch.
Logs = text events with structure (JSON ideally). Expensive to store, perfect for "what happened to this request?" Tools: Loki, ELK, CloudWatch Logs.
Traces = the request journey across services. Answers "why is this slow?" Tools: Tempo, Jaeger, X-Ray, OpenTelemetry.

The three are complementary, not competing. A great workflow: "alert fires off a metric → click into the dashboard → drill into logs for that time window → for slow requests, follow the trace".

three-pillars.txt
              METRICS              LOGS                  TRACES
────────────  ───────────────────  ────────────────────  ─────────────────────
Shape         number @ time        structured event      request span tree
Question      "what's the rate?"   "what happened?"      "where's the time?"
Cardinality   LOW (limited tags)   HIGH (per-event)      VERY HIGH
Storage cost  cheap                medium                expensive
Retention     90d–year+            7–30 days             7–14 days
Tools         Prometheus, CW       Loki, ELK, CW Logs    Tempo, Jaeger, X-Ray

# Same problem, three views
METRIC: http_5xx_total{service="api"} → 28% at 02:11
LOG: {ts:..., level:"error", trace_id:"abc", msg:"DB conn refused"}
TRACE: api-handler 320ms ─ db-query 295ms ✗ ECONNREFUSED
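
The LOG view above only works if every event is emitted in the same machine-parseable shape and carries a trace ID. A minimal sketch of such a logger follows; `formatLogLine`, `LogEvent`, and the example timestamp are illustrative assumptions, not part of any library.

```typescript
// Hypothetical helper: renders one event as a single JSON line.
// Field names (ts, level, trace_id, msg) mirror the LOG example above.
type LogLevel = "debug" | "info" | "warn" | "error";

interface LogEvent {
  level: LogLevel;
  traceId: string; // joins this log line to the matching trace
  msg: string;
  extra?: Record<string, string | number | boolean>; // extra context, e.g. service
}

function formatLogLine(ev: LogEvent, now: Date = new Date()): string {
  const line: Record<string, string | number | boolean> = {
    ts: now.toISOString(),
    level: ev.level,
    trace_id: ev.traceId,
    msg: ev.msg,
  };
  const extra = ev.extra || {};
  for (const key of Object.keys(extra)) {
    line[key] = extra[key];
  }
  return JSON.stringify(line); // one JSON object per line
}

// Example: the "DB conn refused" event from the three-views panel.
const exampleLine = formatLogLine(
  { level: "error", traceId: "abc", msg: "DB conn refused", extra: { service: "api" } },
  new Date("2024-01-01T02:11:00Z"), // assumed date, for a reproducible example
);
```

Because the trace ID is a field rather than free text, a log search by `trace_id` leads straight to the trace for the same request.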

🎯 Practice Questions

Q1.
Pick the right pillar for: (a) "alert when error rate > 1%", (b) "what was the body of request that crashed at 14:23?", (c) "is the slowness in our service or our DB?", (d) "track requests/sec over the last 30 days".
Answer:
(a) Metrics — error rate is a number over time, computed via PromQL: rate(http_errors_total[5m]) / rate(http_requests_total[5m]).
(b) Logs — request body is a one-off event with high cardinality. Search structured logs by trace ID or timestamp.
(c) Traces — distributed traces show time per span (api 50ms + db 250ms = 300ms). Pin-points the bottleneck.
(d) Metrics — long-retention numeric series; logs would be too expensive.
Q2.
Why are metrics "low cardinality" while logs are "high cardinality"? Why does that matter for cost?
💡 Each unique label combination is a separate time series in Prometheus.
Q3.
Your team logs every HTTP request to CloudWatch and the bill is exploding. List two changes that cut log volume by > 80% without losing debug ability.
03
Prometheus & PromQL Basics
The de-facto open-source metrics database — pull-based, time-series, queryable

How to explain to students

Prometheus has a deliberately simple architecture: at a fixed interval (15 seconds in the config below; the default is 1 minute) it pulls a /metrics endpoint from each target and stores the numbers. Targets expose metrics in a plain-text format. To monitor a Node.js app, you add the prom-client library; to monitor a Linux box, you run node-exporter.

PromQL is the query language. Three things to memorise: rate(counter[5m]) (per-second rate of a counter), histogram_quantile(0.95, ...) (95th percentile latency), and sum by (label)(...) (group by). With those three, you can build 80% of dashboards.

prometheus.yml + first queries
# prometheus.yml — config
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['ec2-host:9100']

  - job_name: 'myapp'
    metrics_path: /metrics
    static_configs:
      - targets: ['app1:3000', 'app2:3000']

# What /metrics actually looks like
$ curl http://app1:3000/metrics
http_requests_total{method="GET",status="200",route="/items"} 14823
http_requests_total{method="GET",status="500",route="/items"} 41
http_request_duration_seconds_bucket{le="0.05",route="/items"} 8200
http_request_duration_seconds_bucket{le="0.1",route="/items"} 12400
process_resident_memory_bytes 142360576

# PromQL — the 3 queries you'll write daily

# 1. Per-second request rate, grouped by status
sum by (status) (rate(http_requests_total[5m]))

# 2. Error rate as a percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100

# 3. p95 latency by route (the SLO favourite)
histogram_quantile(
  0.95,
  sum by (le, route) (rate(http_request_duration_seconds_bucket[5m]))
)
Key terms: scrape · /metrics · node-exporter · prom-client · rate() · histogram_quantile() · sum by()
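
To make rate() less magical for students, here is a simplified sketch of the arithmetic behind it: the counter's increase across the window divided by the elapsed seconds, with counter resets handled. It is an illustration only. Real Prometheus also extrapolates to the window boundaries, so its numbers differ slightly. `simpleRate` and `CounterSample` are hypothetical names.

```typescript
interface CounterSample {
  t: number;     // scrape time, in seconds
  value: number; // raw counter value at that time
}

// Simplified per-second rate over a window of scraped samples.
// A drop in value is treated as a counter reset (process restart),
// so the post-reset value counts as fresh increase.
function simpleRate(samples: CounterSample[]): number {
  if (samples.length < 2) return NaN; // need at least two samples
  let increase = 0;
  for (let i = 1; i < samples.length; i++) {
    const prev = samples[i - 1].value;
    const cur = samples[i].value;
    increase += cur >= prev ? cur - prev : cur;
  }
  const elapsed = samples[samples.length - 1].t - samples[0].t;
  return elapsed > 0 ? increase / elapsed : NaN;
}

// Four scrapes, 15 s apart; the process restarts before the last one.
const scrapes: CounterSample[] = [
  { t: 0, value: 100 },
  { t: 15, value: 130 },
  { t: 30, value: 160 },
  { t: 45, value: 10 }, // reset: the counter started again from 0
];
const perSecond = simpleRate(scrapes); // (30 + 30 + 10) / 45 requests per second
```

increase() is the same figure multiplied by the window length in seconds, which is why comparing increase() against a per-second threshold is off by that factor.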

🎯 Practice Questions

Q1.
Prometheus is "pull-based" while Datadog is "push-based". Why does Prometheus prefer pull, and what's one scenario where push (via Pushgateway) is needed?
Answer:
Pull advantages:
1. Prometheus controls the schedule: no thundering herds when 1000 instances all push at once.
2. Health-check by default: if a target stops responding to /metrics, Prometheus notices at the next scrape (up == 0).
3. Easier security: Prometheus reaches into the network; targets don't need outbound credentials.

Push is needed for short-lived jobs. A cron job that runs for 5 seconds can start and exit between two 15-second scrapes, so Prometheus never sees it, and its metrics vanish when the process exits. The job pushes its final metrics to Pushgateway, which then exposes them via pull. Use Pushgateway sparingly: it's an exception, not the default.
Q2.
Write a PromQL query for "request error rate per route over the last 5 minutes, as a percentage."
Q3.
Why is using increase(counter[5m]) often wrong, and rate(counter[5m]) right, when alerting on a per-second threshold?
💡 Units — increase is total, rate is per-second.
Q4.
Your http_request_duration_seconds_bucket has 10 buckets. What's the trade-off if you add 100 more buckets to capture finer percentile detail?
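
For the bucket questions it helps to see what histogram_quantile() can and cannot know. The sketch below follows the linear-interpolation approach Prometheus documents for classic histograms, in simplified form. The function name and the `+Inf` count of 14864 are assumptions made for the example (it is the sum of the two http_requests_total samples shown earlier).

```typescript
interface Bucket {
  le: number;    // upper bound in seconds; Infinity stands for the "+Inf" bucket
  count: number; // cumulative number of observations <= le
}

// Simplified quantile estimate from cumulative buckets. Assumes 0 < q <= 1.
function histogramQuantile(q: number, buckets: Bucket[]): number {
  const sorted = buckets.slice().sort((a, b) => a.le - b.le);
  const last = sorted.length - 1;
  if (last < 1 || sorted[last].le !== Infinity) return NaN; // needs a +Inf bucket
  const total = sorted[last].count;
  if (total === 0) return NaN;

  const rank = q * total;
  let idx = 0;
  while (sorted[idx].count < rank) idx++; // first bucket that contains the rank

  // Rank falls in +Inf: the best available answer is the highest finite bound.
  if (idx === last) return sorted[last - 1].le;

  const lower = idx === 0 ? 0 : sorted[idx - 1].le;
  const below = idx === 0 ? 0 : sorted[idx - 1].count;
  const inBucket = sorted[idx].count - below;
  // Assume observations are spread evenly inside the bucket.
  return lower + (sorted[idx].le - lower) * ((rank - below) / inBucket);
}

const itemsBuckets: Bucket[] = [
  { le: 0.05, count: 8200 },
  { le: 0.1, count: 12400 },
  { le: Infinity, count: 14864 }, // assumed total for the example
];
const p50 = histogramQuantile(0.5, itemsBuckets);  // interpolated inside (0, 0.05]
const p95 = histogramQuantile(0.95, itemsBuckets); // capped at 0.1: no finer data
```

The estimate is only as precise as the bucket boundaries around it. More buckets sharpen the answer, but every bucket is another time series per label combination.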
04
Instrumenting a Node.js App with prom-client
From "I have an API" to "/metrics is exposing real RED-method counters" in 20 lines

How to explain to students

Most apps need only a little code to cover the basics: Express + prom-client gives you HTTP counters and histograms in about 20 lines. Focus on the RED method: Rate, Errors, Duration. Every endpoint should expose at least these three.

Naming matters. Prometheus convention: <namespace>_<name>_<unit> (snake_case, suffix is the unit). E.g., myapp_http_request_duration_seconds, not requestTimeMs.

server.ts — RED instrumentation
import express from 'express';
import { Counter, Histogram, Registry, collectDefaultMetrics } from 'prom-client';

const registry = new Registry();
collectDefaultMetrics({ register: registry });  // Node.js process + GC metrics for free

const httpRequests = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status'],
  registers: [registry],
});

const httpDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'Request duration in seconds',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
  registers: [registry],
});

const app = express();

app.use((req, res, next) => {
  const end = httpDuration.startTimer();
  res.on('finish', () => {
    const labels = { method: req.method, route: req.route?.path ?? 'unknown', status: res.statusCode };
    httpRequests.inc(labels);
    end(labels);
  });
  next();
});

app.get('/metrics', async (_req, res) => {
  res.set('content-type', registry.contentType);
  res.end(await registry.metrics());
});

app.listen(3000);  // same port as the app1:3000 / app2:3000 scrape targets

# Verify locally
$ curl localhost:3000/metrics | grep http_requests_total
http_requests_total{method="GET",route="/items",status="200"} 14823
🔢
Counter
Only goes up. Use for "total things that happened" — requests, errors, jobs.
📈
Gauge
Up or down. Use for "current value" — queue depth, active connections.
📊
Histogram
Buckets that record distribution. Use for latency / sizes — supports percentile queries.
📐
RED method
Rate, Errors, Duration. The 3 metrics every HTTP service must expose.
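
The Counter and Gauge semantics in the cards above can be shown with two tiny classes. This is a teaching sketch only, not prom-client's implementation; `SketchCounter` and `SketchGauge` are hypothetical names.

```typescript
// A counter may only increase; decrements are rejected.
class SketchCounter {
  private total = 0;

  inc(by: number = 1): void {
    if (by < 0) throw new Error("counters can only go up");
    this.total += by;
  }

  read(): number {
    return this.total;
  }
}

// A gauge tracks a current value and may move in either direction.
class SketchGauge {
  private current = 0;

  set(value: number): void {
    this.current = value;
  }

  inc(by: number = 1): void {
    this.current += by;
  }

  dec(by: number = 1): void {
    this.current -= by;
  }

  read(): number {
    return this.current;
  }
}

// "Total things that happened" versus "current value".
const requestsServed = new SketchCounter();
const queueDepth = new SketchGauge();

requestsServed.inc(); // a request arrived
queueDepth.inc();     // job enqueued
queueDepth.inc();     // another job enqueued
queueDepth.dec();     // one job finished
```

A counter that resets to zero on restart is fine, because rate() compensates for resets. A gauge has no such protection, which is one reason to prefer counters for anything you can count.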

🎯 Practice Questions

Q1.
Pick Counter or Gauge for: (a) "messages processed since boot", (b) "current queue depth", (c) "5xx responses today", (d) "active websocket connections".
Q2.
Why include route as a label, but not user_id? What's the cardinality consequence?
Answer:
Each unique label-value combination becomes a separate time series in Prometheus. route has maybe 10–50 distinct values (your endpoints) — manageable. user_id has potentially millions of values — every user creates a new time series, and Prometheus's storage + memory blow up linearly.

Rule of thumb: label cardinality should be bounded and small. Anything user-scoped (user ID, request ID, IP) belongs in logs or traces, not metrics. If you genuinely need per-user metrics, sample / aggregate before exporting (e.g. "top 10 users by request count").
Q3.
A teammate names a metric request_time_in_milliseconds. Suggest two improvements following Prometheus naming conventions.
Q4.
Why are the default histogram buckets [0.005, 0.01, 0.025, ..., 10] a poor fit for an API where p95 is 20ms? What buckets would you pick instead?
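
The cardinality argument from Q2 is easy to make concrete with a little arithmetic: the worst-case number of time series for one metric is the product of the distinct values of each label. The label counts below are illustrative assumptions, and `maxSeries` is a hypothetical helper.

```typescript
// Worst-case series count for one metric: multiply the number of
// distinct values of every label.
function maxSeries(distinctValuesPerLabel: Record<string, number>): number {
  return Object.keys(distinctValuesPerLabel)
    .map((label) => distinctValuesPerLabel[label])
    .reduce((product, n) => product * n, 1);
}

// Bounded labels: a manageable number of series.
const boundedSeries = maxSeries({ method: 5, route: 30, status: 10 });

// Adding one unbounded label multiplies everything by the user count.
const withUserIdSeries = maxSeries({ method: 5, route: 30, status: 10, user_id: 1000000 });
```

With these assumed numbers the first metric tops out at 1,500 series and the second at 1.5 billion. For a histogram, multiply again by the number of buckets.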
05
Grafana — Dashboards, Variables, and Cloud Datasources
The visual layer over Prometheus, Loki, CloudWatch, and a hundred other backends

How to explain to students

Grafana doesn't store metrics — it queries other systems and visualises the result. You add data sources (Prometheus, Loki, CloudWatch, Postgres), build dashboards with panels, and use variables to make a single dashboard work for many environments / services.

A good dashboard has fewer panels, not more. Tom Wilkie's RED method gives an HTTP service just three panels (Rate, Errors, Duration); Brendan Gregg's USE method does the same for hosts (Utilisation, Saturation, Errors). Master those two layouts and you can read any dashboard in 30 seconds.

grafana — provisioned dashboard
# datasources.yaml — auto-provisioned on Grafana start
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    isDefault: true
  - name: Loki
    type: loki
    url: http://loki:3100
  - name: CloudWatch
    type: cloudwatch
    jsonData: { defaultRegion: eu-west-1 }

# RED dashboard — 3 panels for an HTTP service

┌─────────────────────────────────────────────────────────────────┐
│ Rate                 Errors               Duration p95          │
│ ▁▂▃▆▇▆▅▄             ▁▁▁▁▁▁▂▁             ▂▂▂▃▃▂▂▂              │
│ 142 req/s            0.3% errors          38ms                  │
└─────────────────────────────────────────────────────────────────┘

# Variable: $service — one dashboard, all services
label_values(http_requests_total, service)

# Panel queries reference $service
sum by (status) (rate(http_requests_total{service="$service"}[5m]))

# Cloud datasource: query CloudWatch from Grafana (Metrics Insights syntax)
# The aggregation period (e.g. 5m) is set in the panel's query options, not in the query text
SELECT AVG(CPUUtilization) FROM "AWS/EC2"
WHERE InstanceId = '$instance'
Key terms: datasource · panel · variable · RED dashboard · USE method · CloudWatch DS
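
Dashboard variables are plain text substitution applied before the query is sent to the datasource. The sketch below mimics that for the `$name` and `${name}` forms only. Grafana's real interpolation supports more syntaxes and formatting options, so treat `interpolate` as a hypothetical illustration.

```typescript
// Replaces $name and ${name} with values from vars.
// Unknown variables are left untouched.
function interpolate(query: string, vars: Record<string, string>): string {
  return query.replace(
    /\$\{(\w+)\}|\$(\w+)/g,
    (match: string, braced: string | undefined, bare: string | undefined): string => {
      const key = braced || bare || "";
      return Object.prototype.hasOwnProperty.call(vars, key) ? vars[key] : match;
    },
  );
}

const panelTemplate = 'sum by (status) (rate(http_requests_total{service="$service"}[5m]))';

// What the datasource receives when the dropdown is set to checkout-api.
const panelQuery = interpolate(panelTemplate, { service: "checkout-api" });

// What it receives when the variable has no value (see Q3 below).
const emptyQuery = interpolate(panelTemplate, { service: "" });
```

The second call yields the matcher `service=""`, which is valid PromQL but matches only series that lack a service label, so the panel normally shows no data.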

🎯 Practice Questions

Q1.
A dashboard has 47 panels. Why is this an anti-pattern? Suggest the rule for what should and shouldn't be on the front page.
Answer:
A dashboard with 47 panels is unreadable in an incident — it takes 5 minutes to find the broken one. The front-page dashboard should answer one question: "Is the service healthy?" 5–8 panels max, big and obvious.

Rule: front page = SLI / RED / USE only. Drilldowns (per-route latency, per-customer error rate, GC pauses) belong on linked sub-dashboards. Use row collapse + drilldown links to keep things tiered.
Q2.
Explain RED (Rate, Errors, Duration) vs USE (Utilisation, Saturation, Errors). When would you use each?
Q3.
A dashboard variable $service uses label_values(http_requests_total, service). What happens if no metrics with that label exist yet (e.g., new env)?
💡 The variable returns no values → $service interpolates to an empty string and the panels show "No data".
06
Loki — Log Aggregation that Plays Nicely with Grafana
"Prometheus, but for logs" — same labels, same Grafana, much cheaper than ELK

How to explain to students

Loki takes a different approach from Elasticsearch: it indexes only the labels (service, level, host), not the message body. This makes it 10–100× cheaper than ELK at the cost of slower full-text search. For most DevOps teams that's the right trade — "give me all logs for service=api, level=error in the last hour" is fast; brute-force "find the word 'kafka'" is slow.

The query language is LogQL — basically PromQL but for logs. {service="api"} |= "ERROR" filters; rate(...) turns logs into a metric. You can plot log-derived metrics on the same dashboard as Prometheus metrics — that's the killer feature.

loki — promtail config + LogQL
# promtail.yml — log shipper (one per host or k8s daemonset)
clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: containers
    docker_sd_configs: [{ host: unix:///var/run/docker.sock }]
    relabel_configs:
      - source_labels: [__meta_docker_container_label_service]
        target_label: service

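# LogQL: four queries you'll write often

# 1. All errors from the api service (the time range, e.g. last 15 minutes, comes from Grafana's time picker)
#    Assumes a level label is attached at ingestion; the promtail config above only sets service
{service="api", level="error"}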

# 2. Filter further by message substring
{service="api"} |= "DB connection refused"

# 3. Turn logs INTO a metric: per-second error rate, computed over 1-minute windows
sum by (service) (
  rate({level="error"}[1m])
)

# 4. Parse JSON logs in LogQL
{service="api"} | json | latency > 500  # slow requests only
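
The label-index trade-off can be shown with a toy model: labels pick the streams cheaply, then the message filter scans only the lines inside those streams. This is a sketch of the idea, not Loki's storage engine; `LogStream` and `queryLogs` are hypothetical names.

```typescript
interface LogStream {
  labels: Record<string, string>; // indexed: cheap to match
  lines: string[];                // not indexed: must be scanned
}

// Step 1: label selector narrows the streams, like {service="api"}.
// Step 2: substring filter scans the remaining lines, like |= "oom".
function queryLogs(
  streams: LogStream[],
  selector: Record<string, string>,
  contains: string,
): string[] {
  const matching = streams.filter((s) =>
    Object.keys(selector).every((k) => s.labels[k] === selector[k]),
  );
  const found: string[] = [];
  for (const stream of matching) {
    for (const line of stream.lines) {
      if (line.indexOf(contains) !== -1) found.push(line);
    }
  }
  return found;
}

const demoStreams: LogStream[] = [
  { labels: { service: "api", level: "error" }, lines: ["oom killed worker 3", "DB connection refused"] },
  { labels: { service: "web", level: "error" }, lines: ["oom in renderer"] },
];

// Equivalent in spirit to: {service="api"} |= "oom"
const oomLines = queryLogs(demoStreams, { service: "api" }, "oom");
```

An empty selector forces a scan of every line in every stream, which is the slow case the guide warns about.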

🎯 Practice Questions

Q1.
Loki indexes labels, not log bodies. Name two trade-offs of this design (one win, one loss).
Answer:
Win — cost: indexing only labels keeps storage 10–100× cheaper than Elasticsearch. A small team can keep 90 days of logs for under $50/month.

Loss — search speed: full-text searches without a label filter ("find the word 'oom' anywhere") are linear scans over compressed chunks — slow on large volumes. Always query with at least one label filter ({service="api"} |= "oom") so Loki can prune to the relevant chunks first.
Q2.
Why is structured (JSON) logging important in a Loki workflow?
💡 LogQL | json parses fields you can filter on.
Q3.
Write a LogQL query that finds API logs where the parsed JSON field latency_ms is greater than 1000.
07
Alerting — Alertmanager, Slack/Discord, and the "Wake Me Up" Rule
An alert that doesn't require a human action is a notification, not an alert

How to explain to students

Bad alerts erode trust faster than no alerts. Two rules: (1) every alert must have a runbook linking what to do, and (2) if an alert can't be acted on, it shouldn't page — it should be a notification or a dashboard signal.

Architecture: Prometheus evaluates alerting rules against metrics. Firing alerts go to Alertmanager, which de-duplicates, groups, silences, and routes them to Slack / Discord / email / PagerDuty based on labels (severity, team).

alerts.yml + alertmanager.yml
# alerts.yml — Prometheus alerting rules
groups:
  - name: api
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
           / sum(rate(http_requests_total[5m])) > 0.02
        for: 5m  # must persist 5 min before firing
        labels: { severity: page, team: backend }
        annotations:
          summary: 'API error rate > 2% (current {{ $value | humanizePercentage }})'
          runbook: https://wiki.example.com/runbooks/api-errors

# alertmanager.yml — routing
route:
  group_by: [alertname, service]
  group_wait: 30s  # batch alerts arriving in same window
  repeat_interval: 4h
  receiver: slack-default
  routes:
    - matchers: [severity="page"]
      receiver: pagerduty-oncall
    - matchers: [severity="warn"]
      receiver: slack-default

receivers:
  - name: slack-default
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/T0/B0/XXXXX'
        channel: '#alerts'
        title: '{{ .CommonAnnotations.summary }}'
        text: 'Runbook: {{ .CommonAnnotations.runbook }}'

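  - name: discord
    discord_configs:  # native Discord receiver (Alertmanager v0.25+); the generic webhook payload is not Discord-compatible
      - webhook_url: 'https://discord.com/api/webhooks/...'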
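
The routing section above boils down to two decisions per alert: which group it joins and which receiver gets it. A simplified sketch of both follows. It covers flat child routes with equality matchers only and ignores `continue`, nested routes, and regex matchers; all names are hypothetical.

```typescript
interface FiringAlert {
  labels: Record<string, string>;
}

interface ChildRoute {
  match: Record<string, string>; // equality matchers, e.g. severity="page"
  receiver: string;
}

// First matching child route wins; otherwise the top-level receiver is used.
function pickReceiver(alert: FiringAlert, defaultReceiver: string, routes: ChildRoute[]): string {
  for (const r of routes) {
    const matches = Object.keys(r.match).every((k) => alert.labels[k] === r.match[k]);
    if (matches) return r.receiver;
  }
  return defaultReceiver;
}

// Alerts sharing the same group_by label values are batched into one notification.
function groupKey(alert: FiringAlert, groupBy: string[]): string {
  return groupBy.map((label) => label + "=" + (alert.labels[label] || "")).join(",");
}

const childRoutes: ChildRoute[] = [
  { match: { severity: "page" }, receiver: "pagerduty-oncall" },
  { match: { severity: "warn" }, receiver: "slack-default" },
];

const pageAlert: FiringAlert = {
  labels: { alertname: "HighErrorRate", severity: "page", team: "backend", service: "api" },
};

const chosenReceiver = pickReceiver(pageAlert, "slack-default", childRoutes);
const chosenGroup = groupKey(pageAlert, ["alertname", "service"]);
```

An alert with no severity label falls through to the top-level receiver, which is why the default should be a low-noise channel rather than a pager.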
📖
Runbook required
Every alert links to "what to do." If you can't write one, the alert isn't ready.
for: 5m
The condition must persist before the alert fires. Filters out most false positives from short spikes.
🎯
Alert on symptoms
Error rate, latency — what users feel. Not "CPU high" (that's a cause).
🔇
Silences for maintenance
Use Alertmanager silences during planned work — don't disable rules.
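
The effect of `for:` is easiest to explain as a small state machine: an alert is pending while its expression is true, and fires only once it has been true continuously for the configured duration. The sketch below is a simplification of that behaviour with hypothetical names.

```typescript
type AlertState = "inactive" | "pending" | "firing";

interface RuleEvaluation {
  t: number;       // evaluation time, in seconds
  active: boolean; // was the alert expression true at this evaluation?
}

// Returns the alert's state after the last evaluation in the list.
function alertStateAt(evaluations: RuleEvaluation[], forSeconds: number): AlertState {
  let activeSince: number | null = null;
  let state: AlertState = "inactive";
  for (const e of evaluations) {
    if (!e.active) {
      activeSince = null; // any false evaluation resets the clock
      state = "inactive";
      continue;
    }
    if (activeSince === null) activeSince = e.t;
    state = e.t - activeSince >= forSeconds ? "firing" : "pending";
  }
  return state;
}

// A two-minute spike that clears: never fires with for: 5m.
const spikeState = alertStateAt(
  [{ t: 0, active: true }, { t: 60, active: true }, { t: 120, active: false }],
  300,
);

// Six one-minute evaluations in a row: fires at the 5-minute mark.
const sustainedState = alertStateAt(
  [0, 60, 120, 180, 240, 300].map((t) => ({ t: t, active: true })),
  300,
);
```

The cost of `for:` is detection delay: a real outage also waits the full duration before anyone is notified.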

🎯 Practice Questions

Q1.
A single alert pages twice a week and no one ever takes action on it. List three changes to consider, without disabling the alert.
Q2.
Why is "alert when CPU > 90%" usually a worse alert than "alert when API error rate > 1%"?
Answer:
CPU = cause. Error rate = symptom.

A service can run at 95% CPU and serve every user fine (it's just busy). A service can run at 30% CPU and be returning 500s to half the world (a downstream is broken). Alerting on CPU pages you when no one is suffering, and stays silent when they are.

Modern SRE practice: alert on SLI breaches (error rate, latency, availability) and use CPU only as a diagnostic signal during incident response. Google's SRE book draws the same line between symptoms and causes in its chapter on monitoring distributed systems.
Q3.
Write the Slack webhook config in Alertmanager that posts only alerts with severity=page to channel #oncall.
Q4.
A teammate wants to silence an alert during a 2-hour deploy window. How do you do this without editing alert rules?
08
Using AI to Write PromQL, LogQL & Grafana Dashboards
PromQL is concise but tricky — exactly the shape AI is best at

How to explain to students

PromQL syntax (rate, histogram_quantile, label matchers) is concise but unforgiving. AI shines here: describe the question in English, get the query, then refine. Same for Grafana dashboard JSON — AI can scaffold a 5-panel RED dashboard from a description, saving an hour of clicking.

The trap: AI sometimes invents metric names that don't exist in your fleet (http_requests_total exists; app_user_requests probably doesn't). Always verify by querying the metric first in Prometheus's expression browser.

AI prompts for observability
# ✅ Strong PromQL prompt
"Write PromQL queries for an HTTP service exposing these metrics:
- http_requests_total{method, route, status} (counter)
- http_request_duration_seconds_bucket{method, route, status, le} (histogram)
Give me:
1. Per-route requests/sec over 5 min
2. p95 latency per route
3. 5xx error ratio over 5 min as a percentage
4. SLO burn-rate alert: error budget for 99.9% availability over 30 days
Use the proper rate() / histogram_quantile() forms."

# ✅ Strong LogQL prompt
"Write LogQL for Loki to find: (a) all error logs from service=api in last 1h,
(b) requests where parsed JSON field 'latency_ms' > 1000,
(c) error rate per minute as a metric for a Grafana panel."

# Verify before pasting into a dashboard
$ curl -G "http://prometheus:9090/api/v1/query" \
  --data-urlencode 'query=<the AI-generated query>'
"resultType": "vector", "result": [...]  # non-empty = the metric exists

🎯 Practice Questions

Q1.
Take "show me errors in Grafana" and turn it into a 5-bullet detailed prompt that produces a working PromQL query.
Q2.
An AI-generated PromQL uses http_requests (no _total). Why might the query work locally but break in production, and what's the convention?
Answer:
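Prometheus naming convention: counter names should end in _total, and the OpenMetrics exposition format requires that suffix. rate() itself does not care about the name; it works on whatever series the selector matches, so a query is only correct if it uses the exact name your app exposes.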

If your local app exposes http_requests (without _total), AI's query rate(http_requests[5m]) works in your dev cluster — but breaks the moment it gets renamed in prod, or when a teammate writes the same query and assumes the canonical _total name.

Always verify against your actual /metrics endpoint before merging an AI-generated query.
Q3.
Why is asking AI to "generate a complete Grafana dashboard JSON" usually less useful than asking for individual panel queries?
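
The "verify before pasting" step from this module can be scripted. The sketch below builds the query URL and checks the JSON body returned by Prometheus's /api/v1/query endpoint. It deliberately leaves the HTTP call to the caller so it stays self-contained; `buildQueryUrl` and `queryReturnedData` are hypothetical helper names, and the base URL is the one used in the curl example.

```typescript
// Shape of the JSON body returned by GET /api/v1/query.
interface PromQueryResponse {
  status: "success" | "error";
  data?: { resultType: string; result: unknown[] };
  error?: string;
}

// Builds the same request that curl -G --data-urlencode sends.
function buildQueryUrl(baseUrl: string, promql: string): string {
  return baseUrl + "/api/v1/query?query=" + encodeURIComponent(promql);
}

// True only if the query parsed AND matched at least one series.
function queryReturnedData(body: PromQueryResponse): boolean {
  return body.status === "success" && body.data !== undefined && body.data.result.length > 0;
}

const checkUrl = buildQueryUrl("http://prometheus:9090", 'up{job="myapp"}');

// An invented metric name typically yields a successful but empty result.
const inventedMetric: PromQueryResponse = {
  status: "success",
  data: { resultType: "vector", result: [] },
};
const looksReal = queryReturnedData(inventedMetric);
```

An empty result does not prove the metric is missing, since a wrong label matcher gives the same answer, but it is a reliable signal that the query needs a second look before it goes on a dashboard.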
09
Project: Full Prometheus + Grafana + Loki Stack via Docker Compose
Bring up the whole observability stack on your laptop in one command, instrument an app, see metrics + logs together

How to explain to students

Walk through this on screen first. Once it works locally, students take the same compose stack and adapt it for their EC2 instances or homelab. The shape — Prometheus + node-exporter + Grafana + Loki + Promtail — is the same in production, just with persistent volumes, auth, and TLS in front.

compose.yaml — full stack
services:

  app:
    build: ./app
    ports: ["3000:3000"]
    labels: { service: api }  # for Loki to pick up

  prometheus:
    image: prom/prometheus:latest
    volumes: ['./prometheus.yml:/etc/prometheus/prometheus.yml']
    ports: ["9090:9090"]

  node-exporter:
    image: prom/node-exporter:latest
    pid: host

  loki:
    image: grafana/loki:latest
    ports: ["3100:3100"]

  promtail:
    image: grafana/promtail:latest
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - ./promtail.yml:/etc/promtail/config.yml

  grafana:
    image: grafana/grafana:latest
    ports: ["3001:3000"]
    environment: { GF_SECURITY_ADMIN_PASSWORD: admin }
    volumes:
      - ./grafana/datasources:/etc/grafana/provisioning/datasources
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards

$ docker compose up -d
✓ prometheus ✓ node-exporter ✓ loki ✓ promtail ✓ grafana ✓ app
Open Grafana → http://localhost:3001 (admin / admin)
📦
All in one compose
6 services, one up -d. Lower the barrier to "I have monitoring."
🗂️
Provisioned datasources
Grafana sees Prometheus + Loki on first boot — no clicking through wizards.
🔗
Metrics + logs linked
Click a Grafana panel → "show logs at this time" → Loki opens with the same window.
🚀
Production shape
Same setup with persistent volumes + auth + TLS = a real prod deployment.
10
Quiz: Observability Basics, Grafana & Alerting
5 MCQs + 2 fill-in-the-command questions

Sample quiz questions (interactive)

Q1. The "three pillars of observability" are:
A. Dashboards, alerts, runbooks
B. Metrics, logs, traces
C. Prometheus, Grafana, Loki
D. CPU, memory, disk
Q2. Prometheus is mostly:
A. Pull-based: scrapes /metrics endpoints
B. Push-based: apps push metrics to Prometheus
C. Log-based: parses log files
D. Trace-based: follows request spans
Q3. Best metric-type for "current queue depth"?
A. Counter
B. Gauge
C. Histogram
D. Summary
Q4. To get p95 latency you query:
A. rate(http_request_duration[5m])
B. avg(http_request_duration_seconds)
C. histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
D. max(http_request_duration_seconds)
Q5. Best practice: alert on
A. CPU > 90% always
B. Symptoms users care about (error rate, latency)
C. Anything that changes
D. Disk > 50% as an early warning

Fill-in-the-command

Fill 1: PromQL — error rate as a percentage over 5 minutes (assume http_requests_total{status}).
Fill 2: LogQL — all error-level logs from service=api.
11
Assignment: Slack/Discord Alert when CPU Exceeds 80%
A real, end-to-end alerting pipeline: instrument → scrape → rule → route → message in chat

How to explain to students

Frame as the on-call setup task: "By Monday, CPU above 80% on any of our hosts must post a message to #alerts in Slack within a few minutes." (The rule's for: 2m plus Alertmanager's group_wait means delivery always takes longer than 2 minutes.) Forces them to integrate Prometheus + node-exporter + Alertmanager + a Slack/Discord webhook end-to-end.

📋 Assignment Requirements

  • Run the Compose stack from Module 9 (Prometheus + Grafana + Loki + node-exporter) and add an Alertmanager service to it
  • Configure Prometheus to scrape node-exporter on at least one host (your laptop or an EC2)
  • Write a Prometheus alerting rule: HighCpu fires when 1-minute average CPU usage on any host > 80% for 2 minutes
  • Wire Alertmanager to a Slack incoming webhook OR Discord webhook (your choice)
  • Alert payload must include: hostname, current value (formatted), runbook URL (a placeholder is fine)
  • Demonstrate end-to-end by running stress --cpu 4 --timeout 180 and showing the resulting Slack/Discord post
  • Add a Grafana dashboard with one panel: CPU usage per host, last 1 hour
  • Bonus: Add a severity label and route severity=warn to a different channel than severity=page
  • Bonus: Add a silence via Alertmanager during a "maintenance window" — show the alert is suppressed
  • Bonus: Add a CPU-usage SLO ("CPU below 70% for 99% of the month") and a burn-rate alert
expected slack message
🚨 [FIRING] HighCpu — host: laptop-01
CPU usage: 91.2% (threshold: 80%)
Started: 14:23 UTC · Severity: page
Runbook: https://wiki.example.com/runbooks/high-cpu
📊
Grading rubric
Stack runs: 25. Rule fires correctly: 25. Slack/Discord delivery: 20. Dashboard panel: 15. Runbook + labels: 15.
🎯
Common mistakes
Forgot for: 2m (alerts flap), used node_cpu_seconds_total directly (need rate + (1 - idle)), webhook URL pasted in repo.
💡
Stretch
Replace CPU with a real SLI — request error rate, p95 latency. CPU is a teaching example, not a great real-world signal.
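
For the "rate + (1 - idle)" mistake listed above, the usual expression is `100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m])))`. The sketch below reproduces only the arithmetic, so students can check their rule by hand. The per-core idle values are invented for the example, and `cpuUsagePercent` is a hypothetical helper.

```typescript
// Input: one value per core, each the result of
// rate(node_cpu_seconds_total{mode="idle"}[1m]) for that core,
// i.e. the fraction of each second the core spent idle (0..1).
function cpuUsagePercent(idleRatePerCore: number[]): number {
  if (idleRatePerCore.length === 0) return NaN;
  const avgIdle =
    idleRatePerCore.reduce((sum, r) => sum + r, 0) / idleRatePerCore.length;
  return 100 * (1 - avgIdle); // busy share = everything that is not idle
}

// Four cores under stress --cpu 4 (invented sample values).
const hostCpu = cpuUsagePercent([0.1, 0.05, 0.12, 0.082]);

// The HighCpu rule compares this figure against the 80% threshold.
const overThreshold = hostCpu > 80;
```

Using the raw counter instead of its rate gives a number that only ever grows, so the threshold comparison is meaningless without rate().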