How to explain to students
Open with: "It's 2 AM. Your phone buzzes — production is down. You have 5 minutes to answer three questions: Is it really broken? Where is it broken? What changed?" If your team can't answer those, you don't have monitoring — you have guessing.
Monitoring tells you something is wrong ("CPU is at 95%"). Observability lets you ask new questions you didn't think to ask in advance ("Why does CPU spike only when a Karachi user uploads a PDF on Wednesdays?"). The first is dashboards + alerts. The second is high-cardinality metrics + structured logs + traces.
🎯 Practice Questions
Answer:
Observability = ability to explore the unknown — slice metrics by arbitrary labels, search structured logs, follow traces. Answers "why is this user's request slow?"
(a) Monitoring — you knew in advance you cared about CPU.
(b) Observability — you're asking a new question that requires high-cardinality data (user ID, request ID).
How to explain to students
Metrics = numbers over time. Cheap to store, fast to aggregate, perfect for dashboards and alerts. Tools: Prometheus, CloudWatch.
Logs = text events with structure (JSON ideally). Expensive to store, perfect for "what happened to this request?" Tools: Loki, ELK, CloudWatch Logs.
Traces = the request journey across services. Answers "why is this slow?" Tools: Tempo, Jaeger, X-Ray, OpenTelemetry.
The three are complementary, not competing. A great workflow: "alert fires off a metric → click into the dashboard → drill into logs for that time window → for slow requests, follow the trace".
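To make the three signals concrete, here is a minimal sketch of one request producing all of them. The function name, field names, and in-memory containers are illustrative assumptions, not any real library's API; a real service would use a metrics client, a logger, and a tracer instead.

```python
import json
import uuid


def record_request(route, status, duration_ms, metrics, logs, spans):
    """Emit a metric, a log line, and a span for one request; return the trace ID."""
    trace_id = uuid.uuid4().hex

    # Metric: a counter keyed only by low-cardinality labels.
    key = (route, status)
    metrics[key] = metrics.get(key, 0) + 1

    # Log: one structured JSON event carrying the high-cardinality trace ID.
    logs.append(json.dumps({
        "level": "error" if status >= 500 else "info",
        "route": route,
        "status": status,
        "duration_ms": duration_ms,
        "trace_id": trace_id,
    }))

    # Trace: one span for this hop; a real tracer would nest child spans.
    spans.append({"trace_id": trace_id, "name": f"GET {route}", "duration_ms": duration_ms})
    return trace_id
```

The shared trace_id is what lets you jump from a log line to the matching trace, while the metric stays cheap because it never sees that ID.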
🎯 Practice Questions
Answer:
(a) Metrics: rate(http_errors_total[5m]) / rate(http_requests_total[5m]).
(b) Logs: a request body is a one-off event with high cardinality. Search structured logs by trace ID or timestamp.
(c) Traces: distributed traces show time per span (api 50ms + db 250ms = 300ms), which pinpoints the bottleneck.
(d) Metrics: long-retention numeric series; logs would be too expensive.
How to explain to students
Prometheus has a deliberately simple architecture: at a configurable interval (15 seconds is a common choice) it pulls a /metrics endpoint from each target and stores the numbers. Targets expose metrics in a plain-text format. To monitor a Node.js app, you add the prom-client library; to monitor a Linux box, you run node-exporter.
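As a rough illustration of what a target serves, the sketch below renders a counter in the text exposition format and answers GET /metrics using only the Python standard library. The metric name, the REQUESTS dictionary, and the port are assumptions made up for this example; in practice a client library generates this output for you.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counter state: (route, status) -> count.
REQUESTS = {("/login", "200"): 3, ("/login", "500"): 1}


def render_metrics(requests):
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP myapp_http_requests_total Total HTTP requests handled.",
        "# TYPE myapp_http_requests_total counter",
    ]
    for (route, status), count in sorted(requests.items()):
        lines.append(
            f'myapp_http_requests_total{{route="{route}",status="{status}"}} {count}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUESTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, answering scrapes on the given port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() and opening http://localhost:8000/metrics in a browser shows exactly what a scrape would read.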
PromQL is the query language. Three things to memorise: rate(counter[5m]) (per-second rate of a counter), histogram_quantile(0.95, ...) (95th percentile latency), and sum by (label)(...) (group by). With those three, you can build 80% of dashboards.
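The arithmetic behind rate() and histogram_quantile() can be sketched in a few lines of Python. This is a simplification for teaching, not Prometheus's implementation: simple_rate handles counter resets but skips the window-edge extrapolation that the real rate() performs, and bucket_quantile assumes a non-empty classic histogram with 0 < q <= 1. Both function names are hypothetical.

```python
import math


def simple_rate(samples):
    """Per-second increase of a counter over (timestamp_s, value) samples."""
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero.
        increase += cur if cur < prev else cur - prev
    return increase / (samples[-1][0] - samples[0][0])


def bucket_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with (math.inf, total).
    """
    total = buckets[-1][1]
    rank = q * total
    lower, prev_count = 0.0, 0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                return lower  # rank falls past the last finite bound
            # Linear interpolation inside the bucket that contains the rank.
            return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)
        lower, prev_count = upper, count
    return lower
```

The interpolation step is why bucket boundaries matter: the estimate can only ever be as precise as the bucket that the target rank lands in.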
🎯 Practice Questions
Answer:
1. Prometheus controls the schedule: no thundering herds when 1000 instances all push at once.
2. Health-check by default: if a target stops responding to /metrics, Prometheus sees it at the next scrape (up == 0).
3. Easier security: Prometheus reaches into the network; targets don't need outbound credentials.
Push is needed for short-lived jobs. A cron job that runs for a few seconds may finish before Prometheus's next 15-second scrape, and any job that exits takes its final values with it. The job pushes its final metrics to Pushgateway, which then exposes them via pull. Use Pushgateway sparingly: it's an exception, not the default.
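A batch job's final push can be sketched with the standard library alone. The gateway address, job name, and metric name used in the usage note are placeholders for illustration; the URL shape (/metrics/job/<job>) and the text body follow Pushgateway's documented push interface.

```python
import urllib.request


def build_push_request(gateway_url, job, metrics_text):
    """Build (but do not send) a PUT that replaces this job's metric group."""
    return urllib.request.Request(
        url=f"{gateway_url.rstrip('/')}/metrics/job/{job}",
        data=metrics_text.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


def push(gateway_url, job, metrics_text):
    """Send the metrics; call this as the job's last step."""
    request = build_push_request(gateway_url, job, metrics_text)
    with urllib.request.urlopen(request, timeout=5) as resp:
        return resp.status
```

For example, push("http://pushgateway:9091", "nightly_backup", "backup_last_success_timestamp_seconds 1700000000\n") would record when a hypothetical backup last succeeded. The body must end with a newline.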
Why is increase(counter[5m]) often wrong, and rate(counter[5m]) right, when alerting on a per-second threshold?
http_request_duration_seconds_bucket has 10 buckets. What's the trade-off if you add 100 more buckets to capture finer percentile detail?
How to explain to students
Instrumenting the basics is cheap: Express + prom-client gives you HTTP counters and histograms in about 20 lines. Focus on the RED method: Rate, Errors, Duration. Every endpoint should expose at least these three.
Naming matters. Prometheus convention: <namespace>_<name>_<unit> (snake_case, suffix is the unit). E.g., myapp_http_request_duration_seconds, not requestTimeMs.
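One way to make the convention stick is a small lint check students can run over their metric names. The sketch below is a hypothetical helper, not part of any Prometheus tooling, and its list of non-base unit suffixes is an assumption chosen for this example.

```python
import re

SNAKE_CASE = re.compile(r"^[a-z][a-z0-9_]*$")
# Hypothetical lint list: suffixes that suggest a non-base unit.
NON_BASE_UNITS = ("_ms", "_milliseconds", "_minutes", "_kb", "_megabytes")


def lint_metric_name(name, is_counter=False):
    """Return a list of convention problems; an empty list means it looks fine."""
    problems = []
    if not SNAKE_CASE.match(name):
        problems.append("use snake_case")
    if name.lower().endswith(NON_BASE_UNITS):
        problems.append("use base units such as seconds or bytes")
    if is_counter and not name.endswith("_total"):
        problems.append("counters should end in _total")
    return problems
```

Running it on myapp_http_request_duration_seconds returns an empty list, while requestTimeMs is flagged for its casing.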
🎯 Practice Questions
Why use route as a label, but not user_id? What's the cardinality consequence?
Answer:
route has maybe 10–50 distinct values (your endpoints), which is manageable. user_id has potentially millions of values: every user creates a new time series, and Prometheus's storage and memory grow linearly with the number of series.
Rule of thumb: label cardinality should be bounded and small. Anything user-scoped (user ID, request ID, IP) belongs in logs or traces, not metrics. If you genuinely need per-user metrics, sample or aggregate before exporting (e.g. "top 10 users by request count").
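The cardinality arithmetic is worth doing on the board. In the sketch below, the label counts (30 routes, 5 status codes, one million users) are made-up illustrative figures, and series_count is a hypothetical helper.

```python
import math


def series_count(label_values):
    """Worst-case time series for one metric: the product of the
    number of distinct values of each label."""
    return math.prod(label_values.values())


bounded = series_count({"route": 30, "status": 5})
unbounded = series_count({"route": 30, "status": 5, "user_id": 1_000_000})
```

Here bounded is 150 series and unbounded is 150,000,000. For a histogram, multiply again by the number of buckets.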
A metric is named request_time_in_milliseconds. Suggest two improvements following Prometheus naming conventions.
Why are the buckets [0.005, 0.01, 0.025, ..., 10] a poor fit for an API where p95 is 20ms? What buckets would you pick instead?
How to explain to students
Grafana doesn't store metrics — it queries other systems and visualises the result. You add data sources (Prometheus, Loki, CloudWatch, Postgres), build dashboards with panels, and use variables to make a single dashboard work for many environments / services.
A good dashboard has fewer panels, not more. A RED dashboard for an HTTP service (the RED method is usually credited to Tom Wilkie) is just three panels: Rate, Errors, Duration. For hosts, use the USE layout (Utilisation, Saturation, Errors). Master those two layouts and you can read most dashboards in 30 seconds.
🎯 Practice Questions
Answer:
Rule: front page = SLI / RED / USE only. Drilldowns (per-route latency, per-customer error rate, GC pauses) belong on linked sub-dashboards. Use row collapse + drilldown links to keep things tiered.
A dashboard variable $service uses label_values(http_requests_total, service). What happens if no metrics with that label exist yet (e.g., new env)?
How to explain to students
Loki takes a different approach from Elasticsearch: it indexes only the labels (service, level, host), not the message body. This makes it 10–100× cheaper than ELK at the cost of slower full-text search. For most DevOps teams that's the right trade — "give me all logs for service=api, level=error in the last hour" is fast; brute-force "find the word 'kafka'" is slow.
The query language is LogQL — basically PromQL but for logs. {service="api"} |= "ERROR" filters; rate(...) turns logs into a metric. You can plot log-derived metrics on the same dashboard as Prometheus metrics — that's the killer feature.
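The index-labels-only design and the filter-then-rate pipeline can be imitated with a toy in-memory model. Everything here (the STREAMS data, the function names) is a hypothetical stand-in meant to show the shape of a LogQL query, not how Loki is implemented.

```python
# Hypothetical stand-in for Loki: each stream is identified by its
# label set, and only those labels are indexed.
STREAMS = {
    (("level", "error"), ("service", "api")): [
        (100.0, "ERROR oom while resizing upload"),
        (130.0, "ERROR timeout calling kafka"),
    ],
    (("level", "info"), ("service", "web")): [
        (110.0, "GET /health 200"),
    ],
}


def select_streams(streams, **matchers):
    """Label matching, like {service="api"}: a cheap lookup on indexed labels."""
    return [
        lines for labels, lines in streams.items()
        if all(dict(labels).get(k) == v for k, v in matchers.items())
    ]


def line_filter(selected, needle):
    """Substring filter, like |= "ERROR": scans only the selected streams."""
    return [(ts, line) for lines in selected for ts, line in lines if needle in line]


def log_rate(entries, window_s):
    """Turn matching lines into a per-second number over the window."""
    return len(entries) / window_s
```

Chaining the three, log_rate(line_filter(select_streams(STREAMS, service="api"), "ERROR"), 60), mirrors a rate over {service="api"} |= "ERROR" with a one-minute window.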
🎯 Practice Questions
Answer:
Loss (search speed): full-text searches without a label filter ("find the word 'oom' anywhere") are linear scans over compressed chunks, which is slow on large volumes. Always query with at least one label filter ({service="api"} |= "oom") so Loki can prune to the relevant chunks first.
| json parses fields you can filter on.
latency_ms is greater than 1000.
How to explain to students
Bad alerts erode trust faster than no alerts. Two rules: (1) every alert must link to a runbook that says what to do, and (2) if an alert can't be acted on, it shouldn't page; it should be a notification or a dashboard signal.
Architecture: Prometheus evaluates alerting rules against metrics. Firing alerts go to Alertmanager, which de-duplicates, groups, silences, and routes them to Slack / Discord / email / PagerDuty based on labels (severity, team).
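What Alertmanager does with incoming alerts can be sketched as a small pipeline: drop duplicates, group by a few labels, and pick a receiver from the severity label. The routing table, receiver names, and the process function are assumptions for illustration only; silencing and timing (group_wait, repeat intervals) are left out.

```python
from collections import defaultdict

# Hypothetical routing table: severity label -> receiver name.
ROUTES = {"page": "pagerduty", "warn": "slack-alerts"}
DEFAULT_RECEIVER = "email"


def process(alerts, group_by=("alertname", "team")):
    """Return {receiver: {group_key: [alerts]}} after dedup, grouping, routing."""
    seen = set()
    out = defaultdict(lambda: defaultdict(list))
    for labels in alerts:
        fingerprint = tuple(sorted(labels.items()))
        if fingerprint in seen:  # identical label set: a duplicate
            continue
        seen.add(fingerprint)
        receiver = ROUTES.get(labels.get("severity"), DEFAULT_RECEIVER)
        group_key = tuple(labels.get(name) for name in group_by)
        out[receiver][group_key].append(labels)
    return out
```

Grouping is what turns fifty alerts from one outage into a single notification per receiver instead of fifty pages.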
🎯 Practice Questions
Answer:
A service can run at 95% CPU and serve every user fine (it's just busy). A service can run at 30% CPU and be returning 500s to half the world (a downstream is broken). Alerting on CPU pages you when no one is suffering, and stays silent when they are.
Modern SRE practice: alert on SLI breaches (error rate, latency, availability) and use CPU only as a diagnostic signal during incident response. (Google's SRE guidance is commonly summarised as "page on symptoms, dashboard on causes.")
severity=page to channel #oncall.
How to explain to students
PromQL syntax (rate, histogram_quantile, label matchers) is concise but unforgiving. AI shines here: describe the question in English, get the query, then refine. Same for Grafana dashboard JSON — AI can scaffold a 5-panel RED dashboard from a description, saving an hour of clicking.
The trap: AI sometimes invents metric names that don't exist in your fleet (http_requests_total exists; app_user_requests probably doesn't). Always verify by querying the metric first in Prometheus's expression browser.
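The same check can be scripted against Prometheus's HTTP API, which lists known metric names at /api/v1/label/__name__/values. The helper names and the localhost address in the usage note are assumptions for this sketch.

```python
import json
import urllib.request


def metric_names_url(prometheus_url):
    """URL of the HTTP API endpoint that lists all metric names."""
    return f"{prometheus_url.rstrip('/')}/api/v1/label/__name__/values"


def parse_metric_names(payload):
    """Extract the set of metric names from the API's JSON response body."""
    doc = json.loads(payload)
    if doc.get("status") != "success":
        raise ValueError(f"Prometheus query failed: {doc}")
    return set(doc["data"])


def metric_exists(prometheus_url, name):
    """Ask a live Prometheus whether it currently knows a metric called `name`."""
    with urllib.request.urlopen(metric_names_url(prometheus_url), timeout=5) as resp:
        return name in parse_metric_names(resp.read())
```

Against a local stack, metric_exists("http://localhost:9090", "http_requests_total") should be true before an AI-generated query that uses that name is trusted.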
🎯 Practice Questions
http_requests (no _total). Why might the query work locally but break in production, and what's the convention?
Answer:
The convention is that counter names end in _total. Tools like the Prometheus operator's auto-generated rules assume this naming; rate() itself accepts any series name, so it will not flag a mismatch.
If your local app exposes http_requests (without _total), AI's query rate(http_requests[5m]) works in your dev cluster, but breaks the moment the metric gets renamed in prod, or when a teammate writes the same query and assumes the canonical _total name.
Always verify against your actual /metrics endpoint before merging an AI-generated query.
How to explain to students
Walk through this on screen first. Once it works locally, students take the same compose stack and adapt it for their EC2 instances or homelab. The shape — Prometheus + node-exporter + Grafana + Loki + Promtail — is the same in production, just with persistent volumes, auth, and TLS in front.
up -d. Lower the barrier to "I have monitoring."
Sample quiz questions (interactive)
Fill-in-the-command
http_requests_total{status}).
error-level logs from service=api.
How to explain to students
Frame it as the on-call setup task: "By Monday, CPU above 80% on any of our hosts must post a message to #alerts in Slack within 2 minutes." This forces them to integrate Prometheus + node-exporter + Alertmanager + a Slack/Discord webhook end-to-end.
📋 Assignment Requirements
- Run the Compose stack from Module 9 (Prometheus + Grafana + Loki + node-exporter + Alertmanager)
- Configure Prometheus to scrape node-exporter on at least one host (your laptop or an EC2 instance)
- Write a Prometheus alerting rule: HighCpu fires when 1-minute average CPU usage on any host > 80% for 2 minutes
- Wire Alertmanager to a Slack incoming webhook OR Discord webhook (your choice)
- Alert payload must include: hostname, current value (formatted), runbook URL (a placeholder is fine)
- Demonstrate end-to-end by running stress --cpu 4 --timeout 180 and showing the resulting Slack/Discord post
- Add a Grafana dashboard with one panel: CPU usage per host, last 1 hour
- Bonus: Add a severity label and route severity=warn to a different channel than severity=page
- Bonus: Add a silence via Alertmanager during a "maintenance window" and show the alert is suppressed
- Bonus: Add a CPU-usage SLO ("CPU below 70% for 99% of the month") and a burn-rate alert
for: 2m (alerts flap), used node_cpu_seconds_total directly (need rate + (1 - idle)), webhook URL pasted in repo.
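The "rate + (1 - idle)" step is easier to grade if students have seen the arithmetic once. The sketch below works it through for two cores with made-up counter values; it mirrors the usual PromQL shape, 100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m]))), and the function name is hypothetical.

```python
def cpu_usage_percent(idle_start, idle_end, window_s):
    """CPU usage across all cores from idle-mode counter samples.

    idle_start and idle_end map cpu id -> node_cpu_seconds_total{mode="idle"}
    at the start and end of the window.
    """
    idle_fractions = [
        (idle_end[cpu] - idle_start[cpu]) / window_s  # rate(): idle seconds per second
        for cpu in idle_start
    ]
    avg_idle = sum(idle_fractions) / len(idle_fractions)
    return 100 * (1 - avg_idle)


# Two cores over a 60-second window: idle for 12 s and 6 s respectively.
usage = cpu_usage_percent({"0": 5000.0, "1": 7000.0}, {"0": 5012.0, "1": 7006.0}, 60)
should_fire = usage > 80
```

With these numbers the cores are idle 20% and 10% of the time, so usage is about 85% and the HighCpu condition holds. The raw counter values (5000, 7000) say nothing on their own, which is why alerting on them directly is a mistake.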