Why Monitoring Matters — and Observability vs Monitoring

"You can't fix what you can't see." The 3 questions every DevOps engineer must be able to answer in 30 seconds.

How to explain to students

Open with: "It's 2 AM. Your phone buzzes — production is down. You have 5 minutes to answer three questions: Is it really broken? Where is it broken? What changed?" If your team can't answer those, you don't have monitoring — you have guessing.

Monitoring tells you something is wrong ("CPU is at 95%"). Observability lets you ask new questions you didn't think to ask in advance ("Why does CPU spike only when a Karachi user uploads a PDF on Wednesdays?"). The first is dashboards + alerts. The second is high-cardinality metrics + structured logs + traces.

on-call.sh — the 3 questions

# 02:14 → PagerDuty: "Site is down"

# Q1. Is it really broken? (or just one user?)
$ curl -s -o /dev/null -w "%{http_code}" https://api.example.com/health
503
$ open https://grafana.example.com/d/overview
Error rate spiked from 0.2% → 28% at 02:11 UTC ✓ confirmed broken

# Q2. Where is it broken? (which service / region / version?)
# In Grafana: open the "errors by service" panel
checkout-api: 90% of errors, others healthy
→ all errors from instances tagged version=v2.4.1

# Q3. What changed? (deploy / config / dependency?)
$ git log --since="3 hours ago" --oneline
a3f1c2 deploy v2.4.1 → checkout-api (02:08 UTC) ← 3 min before alert

# Decision: roll back to v2.4.0. Postmortem after.
$ aws ecs update-service --service checkout-api --task-definition checkout:42

🚨

Detect

Alarms find problems within minutes, not hours.

🔍

Diagnose

Dashboards + logs scope the blast radius — which service, which version.

↩️

Decide

Rollback, scale up, or page a specialist. With data, not gut feel.

📚

Learn

Every incident becomes a postmortem and a new alert / dashboard.

🎯 Practice Questions

Q1.

In one sentence each, define monitoring and observability. Pick the right word for: (a) "alert when DB CPU > 80%", (b) "find out which user's request triggered the 500".

Answer:

Monitoring = predefined dashboards + alerts on known signals. Answers "is the thing I'm watching healthy?"
Observability = ability to explore the unknown — slice metrics by arbitrary labels, search structured logs, follow traces. Answers "why is this user's request slow?"

(a) Monitoring — you knew in advance you cared about CPU.
(b) Observability — you're asking a new question that requires high-cardinality data (user ID, request ID).

Q2.

Your service has zero monitoring today. Pick the first two things to instrument before adding anything else. Justify in one sentence each.

💡 Think "is it up?" + "is it deploying?"

Q3.

A teammate says, "We have alerts when CPU hits 90% — that's enough." Why is alerting on resource utilisation alone often wrong? What's a better signal?

02

The Three Pillars — Metrics, Logs, Traces

Each answers a different question. Master all three and you can debug anything.

How to explain to students

Metrics = numbers over time. Cheap to store, fast to aggregate, perfect for dashboards and alerts. Tools: Prometheus, CloudWatch.
Logs = text events with structure (JSON ideally). Costlier to store than metrics, perfect for "what happened to this request?" Tools: Loki, ELK, CloudWatch Logs.
Traces = the request journey across services. Answers "why is this slow?" Tools: Tempo, Jaeger, X-Ray, OpenTelemetry.

The three are complementary, not competing. A great workflow: "alert fires off a metric → click into the dashboard → drill into logs for that time window → for slow requests, follow the trace".
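The "drill into logs for that time window" step can be sketched in a few lines. This is only an illustration, assuming JSON log lines with hypothetical fields ts, level, service, trace_id and msg (the sample values are made up): filter error events inside the alert window, then collect the trace IDs you would open next in the tracing tool.

```typescript
// Hypothetical shape of one structured log event; the field names are assumptions.
interface StructuredLog {
  ts: string; // ISO-8601 timestamp
  level: "info" | "warn" | "error";
  service: string;
  trace_id: string;
  msg: string;
}

// Keep only error events whose timestamp falls inside the alert's time window.
function errorsInWindow(logLines: string[], fromIso: string, toIso: string): StructuredLog[] {
  const from = Date.parse(fromIso);
  const to = Date.parse(toIso);
  return logLines
    .map((line) => JSON.parse(line) as StructuredLog)
    .filter((entry) => entry.level === "error")
    .filter((entry) => {
      const t = Date.parse(entry.ts);
      return t >= from && t <= to;
    });
}

// Two illustrative JSON log lines (made-up values).
const logLines: string[] = [
  JSON.stringify({ ts: "2024-01-01T02:10:30Z", level: "info", service: "api", trace_id: "aaa", msg: "ok" }),
  JSON.stringify({ ts: "2024-01-01T02:11:05Z", level: "error", service: "api", trace_id: "abc", msg: "DB conn refused" }),
];

const errorHits = errorsInWindow(logLines, "2024-01-01T02:11:00Z", "2024-01-01T02:16:00Z");
// Each hit's trace_id is the key you would look up next in the tracing backend.
const traceIdsToInspect = errorHits.map((entry) => entry.trace_id);
console.log(traceIdsToInspect);
```

The shared trace_id is what lets the log view and the trace view point at the same request.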

three-pillars.txt

              METRICS              LOGS                 TRACES
────────────  ───────────────────  ───────────────────  ─────────────────────
Shape         number @ time        structured event     request span tree
Question      "what's the rate?"   "what happened?"     "where's the time?"
Cardinality   LOW (limited tags)   HIGH (per-event)     VERY HIGH
Storage cost  cheap                medium               expensive
Retention     90d–year+            7–30 days            7–14 days
Tools         Prometheus, CW       Loki, ELK, CW Logs   Tempo, Jaeger, X-Ray

# Same problem, three views
METRIC: http_5xx_total{service="api"} → 28% at 02:11
LOG:    {ts:..., level:"error", trace_id:"abc", msg:"DB conn refused"}
TRACE:  api-handler 320ms ─ db-query 295ms ✗ ECONNREFUSED

🎯 Practice Questions

Q1.

Pick the right pillar for: (a) "alert when error rate > 1%", (b) "what was the body of request that crashed at 14:23?", (c) "is the slowness in our service or our DB?", (d) "track requests/sec over the last 30 days".

Answer:

(a) Metrics — error rate is a number over time, computed via PromQL: rate(http_errors_total[5m]) / rate(http_requests_total[5m]).
(b) Logs — request body is a one-off event with high cardinality. Search structured logs by trace ID or timestamp.
(c) Traces: distributed traces show time per span (api 50ms + db 250ms = 300ms), which pinpoints the bottleneck.
(d) Metrics — long-retention numeric series; logs would be too expensive.

Q2.

Why are metrics "low cardinality" while logs are "high cardinality"? Why does that matter for cost?

💡 Each unique label combination is a separate time series in Prometheus.

Q3.

Your team logs every HTTP request to CloudWatch and the bill is exploding. List two changes that cut log volume by > 80% without losing debug ability.

03

Prometheus & PromQL Basics

The de-facto open-source metrics database — pull-based, time-series, queryable

How to explain to students

Prometheus has a deliberately simple architecture: on a fixed interval (15 seconds in the config below; the built-in default is 1 minute) it pulls a /metrics endpoint from each target and stores the numbers. Targets expose metrics in a plain-text format. To monitor a Node.js app, you add the prom-client library; to monitor a Linux box, you run node-exporter.

PromQL is the query language. Three things to memorise: rate(counter[5m]) (per-second rate of a counter), histogram_quantile(0.95, ...) (95th percentile latency), and sum by (label)(...) (group by). With those three, you can build 80% of dashboards.
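To make rate() and histogram_quantile() less magical, here is a rough sketch of the arithmetic behind them. It is a simplification under stated assumptions: the real rate() also handles counter resets and extrapolates to the window edges, and the real histogram_quantile() works on per-bucket rates rather than raw counts. The function names and sample numbers are illustrative only.

```typescript
// One scraped counter sample: value observed at time t (seconds).
interface CounterSample {
  t: number;
  value: number;
}

// Simplified rate(): (last - first) / elapsed seconds.
// Assumption: no counter resets and no boundary extrapolation.
function simpleRate(samples: CounterSample[]): number {
  if (samples.length < 2) return NaN;
  const first = samples[0];
  const last = samples[samples.length - 1];
  return (last.value - first.value) / (last.t - first.t);
}

// One cumulative histogram bucket: count of observations <= le.
interface HistogramBucket {
  le: number;
  count: number;
}

// Simplified histogram_quantile(): find the bucket containing the target rank,
// then interpolate linearly between that bucket's lower and upper bound.
function simpleQuantile(q: number, buckets: HistogramBucket[]): number {
  const sorted = buckets.slice().sort((a, b) => a.le - b.le);
  const total = sorted[sorted.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  let prevLe = 0;
  let prevCount = 0;
  for (const b of sorted) {
    if (b.count >= rank) {
      // If the rank lands in the +Inf bucket, report the highest finite bound.
      if (!isFinite(b.le)) return prevLe;
      return prevLe + (b.le - prevLe) * ((rank - prevCount) / (b.count - prevCount));
    }
    prevLe = b.le;
    prevCount = b.count;
  }
  return NaN;
}

// Counter grew by 600 over 300 s: 2 requests per second.
const requestRate = simpleRate([{ t: 0, value: 1000 }, { t: 300, value: 1600 }]);

// 100 observations; rank 95 lands in the (0.1, 0.25] bucket, halfway through it.
const p95 = simpleQuantile(0.95, [
  { le: 0.05, count: 80 },
  { le: 0.1, count: 90 },
  { le: 0.25, count: 100 },
  { le: Infinity, count: 100 },
]);
console.log(requestRate, p95);
```

The interpolation step is why bucket boundaries matter: the reported percentile can only be as precise as the bucket it falls into.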

prometheus.yml + first queries

# prometheus.yml: config
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['ec2-host:9100']

  - job_name: 'myapp'
    metrics_path: /metrics
    static_configs:
      - targets: ['app1:3000', 'app2:3000']

# What /metrics actually looks like
$ curl http://app1:3000/metrics
http_requests_total{method="GET",status="200",route="/items"} 14823
http_requests_total{method="GET",status="500",route="/items"} 41
http_request_duration_seconds_bucket{le="0.05",route="/items"} 8200
http_request_duration_seconds_bucket{le="0.1",route="/items"} 12400
process_resident_memory_bytes 142360576

# PromQL: the 3 queries you'll write daily

# 1. Per-second request rate, grouped by status
sum by (status) (rate(http_requests_total[5m]))

# 2. Error rate as a percentage
  sum(rate(http_requests_total{status=~"5.."}[5m]))
/
  sum(rate(http_requests_total[5m])) * 100

# 3. p95 latency by route (the SLO favourite)
histogram_quantile(
  0.95,
  sum by (le, route) (rate(http_request_duration_seconds_bucket[5m]))
)

Key terms: scrape · /metrics · node-exporter · prom-client · rate() · histogram_quantile() · sum by()

🎯 Practice Questions

Q1.

Prometheus is "pull-based" while Datadog is "push-based". Why does Prometheus prefer pull, and what's one scenario where push (via Pushgateway) is needed?

Answer:

Pull advantages:
1. Prometheus controls the schedule: no thundering herds when 1000 instances all push at once.
2. Health-check by default: if a target stops responding to /metrics, Prometheus sees it at the next scrape (up == 0).
3. Easier security: Prometheus reaches into the network; targets don't need outbound credentials.

Push is needed for short-lived jobs. A cron job that runs for 5 seconds can start and exit between two 15-second scrapes, and even a longer job exits before its final values are scraped. The job pushes its final metrics to Pushgateway, which then exposes them via pull. Use Pushgateway sparingly: it's an exception, not the default.

Q2.

Write a PromQL query for "request error rate per route over the last 5 minutes, as a percentage."

Q3.

Why is using increase(counter[5m]) often wrong, and rate(counter[5m]) right, when alerting on a per-second threshold?

💡 Units — increase is total, rate is per-second.

Q4.

Your http_request_duration_seconds_bucket has 10 buckets. What's the trade-off if you add 100 more buckets to capture finer percentile detail?

04

Instrumenting a Node.js App with prom-client

From "I have an API" to "/metrics is exposing real RED-method counters" in 20 lines

How to explain to students

Instrumenting the basics takes little code: Express + prom-client gets you HTTP counters and histograms in about 20 lines. Focus on the RED method: Rate, Errors, Duration. Every endpoint should expose at least these three.

Naming matters. Prometheus convention: <namespace>_<name>_<unit> (snake_case, suffix is the unit). E.g., myapp_http_request_duration_seconds, not requestTimeMs.

server.ts — RED instrumentation

import express from 'express';
import { Counter, Histogram, Registry, collectDefaultMetrics } from 'prom-client';

const registry = new Registry();
collectDefaultMetrics({ register: registry }); // node + GC metrics for free

// Rate + Errors: one counter, labelled by status so 5xx can be filtered out
const httpRequests = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status'],
  registers: [registry],
});

// Duration: a histogram so percentiles can be computed in PromQL
const httpDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'Request duration in seconds',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
  registers: [registry],
});

const app = express();

// Record every request once the response has been sent
app.use((req, res, next) => {
  const end = httpDuration.startTimer();
  res.on('finish', () => {
    // req.route is only set when a route matched; the fallback keeps cardinality bounded
    const labels = { method: req.method, route: req.route?.path ?? 'unknown', status: res.statusCode };
    httpRequests.inc(labels);
    end(labels);
  });
  next();
});

app.get('/metrics', async (_req, res) => {
  res.set('content-type', registry.contentType);
  res.end(await registry.metrics());
});

app.listen(3000); // port assumed from the curl check below

# Verify locally
$ curl localhost:3000/metrics | grep http_requests_total
http_requests_total{method="GET",route="/items",status="200"} 14823

🔢

Counter

Only goes up. Use for "total things that happened" — requests, errors, jobs.

📈

Gauge

Up or down. Use for "current value" — queue depth, active connections.

📊

Histogram

Buckets that record distribution. Use for latency / sizes — supports percentile queries.

📐

RED method

Rate, Errors, Duration. The 3 metrics every HTTP service must expose.

🎯 Practice Questions

Q1.

Pick Counter or Gauge for: (a) "messages processed since boot", (b) "current queue depth", (c) "5xx responses today", (d) "active websocket connections".

Q2.

Why include route as a label, but not user_id? What's the cardinality consequence?

Answer:

Each unique label-value combination becomes a separate time series in Prometheus. route has maybe 10–50 distinct values (your endpoints) — manageable. user_id has potentially millions of values — every user creates a new time series, and Prometheus's storage + memory blow up linearly.

Rule of thumb: label cardinality should be bounded and small. Anything user-scoped (user ID, request ID, IP) belongs in logs or traces, not metrics. If you genuinely need per-user metrics, sample / aggregate before exporting (e.g. "top 10 users by request count").
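A quick back-of-the-envelope sketch of the arithmetic: the upper bound on time series for one metric is the product of the distinct values of each label. The label counts below are made-up illustrative numbers, not measurements.

```typescript
// Upper bound on series for one metric = product of distinct values per label.
function maxSeries(distinctValuesPerLabel: Record<string, number>): number {
  return Object.keys(distinctValuesPerLabel)
    .map((label) => distinctValuesPerLabel[label])
    .reduce((product, n) => product * n, 1);
}

// Bounded labels only: 5 methods x 30 routes x 10 status codes.
const boundedLabels = maxSeries({ method: 5, route: 30, status: 10 });

// Same metric with a user_id label for one million users.
const withUserId = maxSeries({ method: 5, route: 30, status: 10, user_id: 1000000 });

console.log(boundedLabels, withUserId);
```

In practice not every combination occurs, so the real count is lower than the bound, but a single unbounded label still multiplies whatever is already there.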

Q3.

A teammate names a metric request_time_in_milliseconds. Suggest two improvements following Prometheus naming conventions.

Q4.

Why are the default histogram buckets [0.005, 0.01, 0.025, ..., 10] a poor fit for an API where p95 is 20ms? What buckets would you pick instead?

05

Grafana — Dashboards, Variables, and Cloud Datasources

The visual layer over Prometheus, Loki, CloudWatch, and a hundred other backends

How to explain to students

Grafana doesn't store metrics — it queries other systems and visualises the result. You add data sources (Prometheus, Loki, CloudWatch, Postgres), build dashboards with panels, and use variables to make a single dashboard work for many environments / services.

A good dashboard has fewer panels, not more. Tom Wilkie's RED method gives an HTTP service just three panels (Rate, Errors, Duration). Brendan Gregg's USE method covers hosts (Utilisation, Saturation, Errors). Master those two layouts and you can read most dashboards in 30 seconds.
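Dashboard variables are essentially text substitution into the panel query before it is sent to the data source. The sketch below is a toy model of that idea, not Grafana's implementation (Grafana also supports multi-value variables and formatting options); the interpolate helper is hypothetical.

```typescript
// Replace each $name in the template with its value; unknown variables are left as-is.
function interpolate(queryTemplate: string, variables: Record<string, string>): string {
  return queryTemplate.replace(/\$([A-Za-z_][A-Za-z0-9_]*)/g, (whole: string, key: string) =>
    Object.prototype.hasOwnProperty.call(variables, key) ? variables[key] : whole
  );
}

const panelQuery = 'sum by (status) (rate(http_requests_total{service="$service"}[5m]))';

// The same panel definition serves any service the user picks from the dropdown.
const forCheckout = interpolate(panelQuery, { service: "checkout-api" });
const forSearch = interpolate(panelQuery, { service: "search-api" });
console.log(forCheckout);
console.log(forSearch);
```

One panel definition, many services: that is the whole point of a variable.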

grafana — provisioned dashboard

# datasources.yaml: auto-provisioned on Grafana start
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    isDefault: true
  - name: Loki
    type: loki
    url: http://loki:3100
  - name: CloudWatch
    type: cloudwatch
    jsonData: { defaultRegion: eu-west-1 }

# RED dashboard: 3 panels for an HTTP service
┌─────────────────────────────────────────────────────────────────┐
│  Rate                  Errors                Duration p95       │
│  ▁▂▃▆▇▆▅▄              ▁▁▁▁▁▁▂▁              ▂▂▂▃▃▂▂▂           │
│  142 req/s             0.3% errors           38ms               │
└─────────────────────────────────────────────────────────────────┘

# Variable: $service (one dashboard, all services)
label_values(http_requests_total, service)

# Panel queries reference $service
sum by (status) (rate(http_requests_total{service="$service"}[5m]))

# Cloud datasource: query CloudWatch from Grafana (Metrics Insights syntax).
# The 5m period is set in the panel's query options, not inside the query.
SELECT AVG(CPUUtilization) FROM SCHEMA("AWS/EC2", InstanceId)
WHERE InstanceId = '$instance'

Key terms: datasource · panel · variable · RED dashboard · USE method · CloudWatch DS

🎯 Practice Questions

Q1.

A dashboard has 47 panels. Why is this an anti-pattern? Suggest the rule for what should and shouldn't be on the front page.

Answer:

A dashboard with 47 panels is unreadable in an incident — it takes 5 minutes to find the broken one. The front-page dashboard should answer one question: "Is the service healthy?" 5–8 panels max, big and obvious.

Rule: front page = SLI / RED / USE only. Drilldowns (per-route latency, per-customer error rate, GC pauses) belong on linked sub-dashboards. Use row collapse + drilldown links to keep things tiered.

Q2.

Explain RED (Rate, Errors, Duration) vs USE (Utilisation, Saturation, Errors). When would you use each?

Q3.

A dashboard variable $service uses label_values(http_requests_total, service). What happens if no metrics with that label exist yet (e.g., new env)?

💡 Variable returns empty → every panel that filters on $service shows no data.

06

Loki — Log Aggregation that Plays Nicely with Grafana

"Prometheus, but for logs" — same labels, same Grafana, much cheaper than ELK

How to explain to students

Loki takes a different approach from Elasticsearch: it indexes only the labels (service, level, host), not the message body. This makes it far cheaper to run than ELK (the 10–100× figure often quoted depends heavily on volume and retention) at the cost of slower full-text search. For most DevOps teams that's the right trade: "give me all logs for service=api, level=error in the last hour" is fast; brute-force "find the word 'kafka'" is slow.

The query language is LogQL — basically PromQL but for logs. {service="api"} |= "ERROR" filters; rate(...) turns logs into a metric. You can plot log-derived metrics on the same dashboard as Prometheus metrics — that's the killer feature.
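The label-index trade-off can be shown with a toy model. This is not how Loki is implemented internally; it only assumes that logs are stored in chunks tagged with labels, that a label selector picks chunks without reading them, and that the |= filter then scans every line of the selected chunks. The chunk contents are made up.

```typescript
// A stored chunk: a label set plus the raw lines written under it.
interface LogChunk {
  labels: Record<string, string>;
  lines: string[];
}

interface QueryResult {
  scannedLines: number;
  matches: string[];
}

// Step 1: select chunks by label equality (cheap, index-only).
// Step 2: scan the selected chunks line by line for the substring (expensive).
function queryLogs(chunks: LogChunk[], selector: Record<string, string>, needle: string): QueryResult {
  const selected = chunks.filter((chunk) =>
    Object.keys(selector).every((key) => chunk.labels[key] === selector[key])
  );
  let scannedLines = 0;
  const matches: string[] = [];
  for (const chunk of selected) {
    for (const line of chunk.lines) {
      scannedLines++;
      if (line.indexOf(needle) !== -1) matches.push(line);
    }
  }
  return { scannedLines, matches };
}

const storedChunks: LogChunk[] = [
  { labels: { service: "api", level: "error" }, lines: ["DB connection refused", "timeout calling billing"] },
  { labels: { service: "api", level: "info" }, lines: ["GET /items 200", "GET /items 200"] },
  { labels: { service: "worker", level: "error" }, lines: ["DB connection refused"] },
];

// Narrow selector: only one chunk is scanned.
const withSelector = queryLogs(storedChunks, { service: "api", level: "error" }, "DB connection refused");
// Empty selector: every line in every chunk is scanned.
const withoutSelector = queryLogs(storedChunks, {}, "DB connection refused");
console.log(withSelector.scannedLines, withoutSelector.scannedLines);
```

The scannedLines count is the cost you pay; the more the label selector narrows things down, the less there is to brute-force.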

loki — promtail config + LogQL

# promtail.yml: log shipper (one per host or k8s daemonset)
clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: containers
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
    relabel_configs:
      - source_labels: [__meta_docker_container_label_service]
        target_label: service

# LogQL: four queries you'll write often

# 1. All errors from the api service (the time range, e.g. last 15 minutes, comes from Grafana's time picker)
{service="api", level="error"}

# 2. Filter further by message substring
{service="api"} |= "DB connection refused"

# 3. Turn logs INTO a metric: errors per second, averaged over 1-minute windows
sum by (service) (
  rate({level="error"}[1m])
)

# 4. Parse JSON logs in LogQL
{service="api"} | json | latency > 500   # slow requests only

🎯 Practice Questions

Q1.

Loki indexes labels, not log bodies. Name two trade-offs of this design (one win, one loss).

Answer:

Win (cost): indexing only labels keeps the index small and lets log chunks sit in cheap object storage, which is where the large savings over Elasticsearch come from. Even a small team can afford to keep 90 days of logs.

Loss — search speed: full-text searches without a label filter ("find the word 'oom' anywhere") are linear scans over compressed chunks — slow on large volumes. Always query with at least one label filter ({service="api"} |= "oom") so Loki can prune to the relevant chunks first.

Q2.

Why is structured (JSON) logging important in a Loki workflow?

💡 LogQL | json parses fields you can filter on.

Q3.

Write a LogQL query that finds API logs where the parsed JSON field latency_ms is greater than 1000.

07

Alerting — Alertmanager, Slack/Discord, and the "Wake Me Up" Rule

An alert that doesn't require a human action is a notification, not an alert

How to explain to students

Bad alerts erode trust faster than no alerts. Two rules: (1) every alert must have a runbook linking what to do, and (2) if an alert can't be acted on, it shouldn't page — it should be a notification or a dashboard signal.

Architecture: Prometheus evaluates alerting rules against metrics. Firing alerts go to Alertmanager, which de-duplicates, groups, silences, and routes them to Slack / Discord / email / PagerDuty based on labels (severity, team).
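The routing and grouping steps can be sketched as two small functions. This is a toy model under assumptions: equality matchers only, a single level of child routes, and first match wins. Real Alertmanager also supports regex matchers, nested routes, and a continue flag. The receiver names are reused from the config in this module.

```typescript
type AlertLabels = Record<string, string>;

interface RouteRule {
  matchers: AlertLabels; // every key must equal the alert's label value
  receiver: string;
}

// First child route whose matchers all match wins; otherwise use the default receiver.
function pickReceiver(alert: AlertLabels, routes: RouteRule[], defaultReceiver: string): string {
  for (const route of routes) {
    const allMatch = Object.keys(route.matchers).every((key) => alert[key] === route.matchers[key]);
    if (allMatch) return route.receiver;
  }
  return defaultReceiver;
}

// Alerts with the same values for the group_by labels are batched into one notification.
function groupKey(alert: AlertLabels, groupBy: string[]): string {
  return groupBy.map((key) => `${key}=${alert[key] ?? ""}`).join(",");
}

const routeRules: RouteRule[] = [
  { matchers: { severity: "page" }, receiver: "pagerduty-oncall" },
  { matchers: { severity: "warn" }, receiver: "slack-default" },
];

const pagingAlert: AlertLabels = { alertname: "HighErrorRate", service: "api", severity: "page", team: "backend" };
const infoAlert: AlertLabels = { alertname: "DiskFilling", service: "db", severity: "info" };

const pagingReceiver = pickReceiver(pagingAlert, routeRules, "slack-default");
const infoReceiver = pickReceiver(infoAlert, routeRules, "slack-default");
const pagingGroup = groupKey(pagingAlert, ["alertname", "service"]);
console.log(pagingReceiver, infoReceiver, pagingGroup);
```

Routing decides who hears about an alert; grouping decides how many messages they get.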

alerts.yml + alertmanager.yml

# alerts.yml: Prometheus alerting rules
groups:
  - name: api
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.02
        for: 5m   # must persist 5 min before firing
        labels: { severity: page, team: backend }
        annotations:
          summary: 'API error rate > 2% (current {{ $value | humanizePercentage }})'
          runbook: https://wiki.example.com/runbooks/api-errors

# alertmanager.yml: routing
route:
  group_by: [alertname, service]
  group_wait: 30s   # batch alerts arriving in same window
  repeat_interval: 4h
  receiver: slack-default
  routes:
    - matchers: [severity="page"]
      receiver: pagerduty-oncall
    - matchers: [severity="warn"]
      receiver: slack-default

receivers:
  - name: slack-default
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/T0/B0/XXXXX'
        channel: '#alerts'
        title: '{{ .CommonAnnotations.summary }}'
        text: 'Runbook: {{ .CommonAnnotations.runbook }}'
  - name: pagerduty-oncall   # must be defined because the route above references it
    pagerduty_configs:
      - routing_key: '<your PagerDuty integration key>'   # placeholder
  - name: discord
    discord_configs:   # native Discord receiver (Alertmanager 0.25+); the generic webhook payload is not Discord-compatible
      - webhook_url: 'https://discord.com/api/webhooks/...'

📖

Runbook required

Every alert links to "what to do." If you can't write one, the alert isn't ready.

⏳

for: 5m

Persist condition before firing. Filters out most false positives caused by short spikes.

🎯

Alert on symptoms

Error rate, latency — what users feel. Not "CPU high" (that's a cause).

🔇

Silences for maintenance

Use Alertmanager silences during planned work — don't disable rules.
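The for: clause from the cards above is easiest to understand as a small state machine: inactive → pending → firing. The sketch below is a simplified model, assuming one evaluation per timestamp and a single alert instance; the evaluateRule helper and its inputs are hypothetical.

```typescript
type AlertState = "inactive" | "pending" | "firing";

// times[i] is an evaluation timestamp in seconds; breached[i] says whether the
// rule expression was true at that moment. The alert fires only after the
// condition has held continuously for forSeconds.
function evaluateRule(times: number[], breached: boolean[], forSeconds: number): AlertState[] {
  const states: AlertState[] = [];
  let activeSince: number | null = null;
  for (let i = 0; i < times.length; i++) {
    if (!breached[i]) {
      activeSince = null; // any healthy evaluation resets the clock
      states.push("inactive");
      continue;
    }
    if (activeSince === null) activeSince = times[i];
    states.push(times[i] - activeSince >= forSeconds ? "firing" : "pending");
  }
  return states;
}

// A 2-minute spike with for: 5m never leaves "pending", so nobody is paged.
const shortSpike = evaluateRule([0, 60, 120, 180], [true, true, false, false], 300);

// A sustained breach turns "firing" once 300 seconds have elapsed.
const sustained = evaluateRule(
  [0, 60, 120, 180, 240, 300, 360],
  [true, true, true, true, true, true, true],
  300
);
console.log(shortSpike.join(","), sustained.join(","));
```

The trade-off is detection delay: a longer for: means fewer false pages but a later first notification.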

🎯 Practice Questions

Q1.

Two pages a week from a single alert no one ever takes action on. List three changes to consider — without disabling the alert.

Q2.

Why is "alert when CPU > 90%" usually a worse alert than "alert when API error rate > 1%"?

Answer:

CPU = cause. Error rate = symptom.

A service can run at 95% CPU and serve every user fine (it's just busy). A service can run at 30% CPU and be returning 500s to half the world (a downstream is broken). Alerting on CPU pages you when no one is suffering, and stays silent when they are.

Modern SRE practice: alert on SLI breaches (error rate, latency, availability) and use CPU only as a diagnostic signal during incident response. (Google's SRE book draws the same line between symptoms and causes in its monitoring guidance.)

Q3.

Write the Slack webhook config in Alertmanager that posts only alerts with severity=page to channel #oncall.

Q4.

A teammate wants to silence an alert during a 2-hour deploy window. How do you do this without editing alert rules?

08

Using AI to Write PromQL, LogQL & Grafana Dashboards

PromQL is concise but tricky — exactly the shape AI is best at

How to explain to students

PromQL syntax (rate, histogram_quantile, label matchers) is concise but unforgiving. AI shines here: describe the question in English, get the query, then refine. Same for Grafana dashboard JSON — AI can scaffold a 5-panel RED dashboard from a description, saving an hour of clicking.

The trap: AI sometimes invents metric names that don't exist in your fleet (http_requests_total exists; app_user_requests probably doesn't). Always verify by querying the metric first in Prometheus's expression browser.

AI prompts for observability

# ✅ Strong PromQL prompt
"Write PromQL queries for an HTTP service exposing these metrics:
- http_requests_total{method, route, status} (counter)
- http_request_duration_seconds_bucket{method, route, status, le} (histogram)
Give me:
1. Per-route requests/sec over 5 min
2. p95 latency per route
3. 5xx error ratio over 5 min as a percentage
4. SLO burn-rate alert: error budget for 99.9% availability over 30 days
Use the proper rate() / histogram_quantile() forms."

# ✅ Strong LogQL prompt
"Write LogQL for Loki to find: (a) all error logs from service=api in last 1h,
(b) requests where parsed JSON field 'latency_ms' > 1000,
(c) error rate per minute as a metric for a Grafana panel."

# Verify before pasting into a dashboard
$ curl -G "http://prometheus:9090/api/v1/query" \
    --data-urlencode 'query=<the AI-generated query>'
"resultType": "vector", "result": [...] # non-empty = the metric exists

🎯 Practice Questions

Q1.

Take "show me errors in Grafana" and turn it into a 5-bullet detailed prompt that produces a working PromQL query.

Q2.

An AI-generated PromQL uses http_requests (no _total). Why might the query work locally but break in production, and what's the convention?

Answer:

Prometheus convention: counter names should end in _total (the OpenMetrics format requires it). rate() itself works on any counter regardless of its name, but client libraries, shared dashboards, and rule sets commonly assume the suffix.

If your local app exposes http_requests (without _total), AI's query rate(http_requests[5m]) works in your dev cluster — but breaks the moment it gets renamed in prod, or when a teammate writes the same query and assumes the canonical _total name.

Always verify against your actual /metrics endpoint before merging an AI-generated query.

Q3.

Why is asking AI to "generate a complete Grafana dashboard JSON" usually less useful than asking for individual panel queries?

09

Project: Full Prometheus + Grafana + Loki Stack via Docker Compose

Bring up the whole observability stack on your laptop in one command, instrument an app, see metrics + logs together

How to explain to students

Walk through this on screen first. Once it works locally, students take the same compose stack and adapt it for their EC2 instances or homelab. The shape — Prometheus + node-exporter + Grafana + Loki + Promtail — is the same in production, just with persistent volumes, auth, and TLS in front.

compose.yaml — full stack

services:
  app:
    build: ./app
    ports: ["3000:3000"]
    labels: { service: api }   # for Loki to pick up

  prometheus:
    image: prom/prometheus:latest
    volumes: ['./prometheus.yml:/etc/prometheus/prometheus.yml']
    ports: ["9090:9090"]

  node-exporter:
    image: prom/node-exporter:latest
    pid: host

  loki:
    image: grafana/loki:latest
    ports: ["3100:3100"]

  promtail:
    image: grafana/promtail:latest
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - ./promtail.yml:/etc/promtail/config.yml

  grafana:
    image: grafana/grafana:latest
    ports: ["3001:3000"]
    environment: { GF_SECURITY_ADMIN_PASSWORD: admin }
    volumes:
      - ./grafana/datasources:/etc/grafana/provisioning/datasources
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards

$ docker compose up -d
✓ prometheus ✓ node-exporter ✓ loki ✓ promtail ✓ grafana ✓ app

Open Grafana → http://localhost:3001 (admin / admin)

📦

All in one compose

6 services, one up -d. Lower the barrier to "I have monitoring."

🗂️

Provisioned datasources

Grafana sees Prometheus + Loki on first boot — no clicking through wizards.

🔗

Metrics + logs linked

Click a Grafana panel → "show logs at this time" → Loki opens with the same window.

🚀

Production shape

Same setup with persistent volumes + auth + TLS = a real prod deployment.

10

Quiz: Observability Basics, Grafana & Alerting

5 MCQs + 2 fill-in-the-command questions

Sample quiz questions (interactive)

Q1. The "three pillars of observability" are:
A. Dashboards, alerts, runbooks
B. Metrics, logs, traces
C. Prometheus, Grafana, Loki
D. CPU, memory, disk

Q2. Prometheus is mostly:
A. Pull-based: scrapes /metrics endpoints
B. Push-based: apps push metrics to Prometheus
C. Log-based: parses log files
D. Trace-based: follows request spans

Q3. Best metric type for "current queue depth"?
A. Counter
B. Gauge
C. Histogram
D. Summary

Q4. To get p95 latency you query:
A. rate(http_request_duration[5m])
B. avg(http_request_duration_seconds)
C. histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
D. max(http_request_duration_seconds)

Q5. Best practice is to alert on:
A. CPU > 90% always
B. Symptoms users care about (error rate, latency)
C. Anything that changes
D. Disk > 50% (early warning)

Fill-in-the-command

Fill 1: PromQL — error rate as a percentage over 5 minutes (assume http_requests_total{status}).

Fill 2: LogQL — all error-level logs from service=api.

11

Assignment: Slack/Discord Alert when CPU Exceeds 80%

A real, end-to-end alerting pipeline: instrument → scrape → rule → route → message in chat

▾

How to explain to students

Frame as the on-call setup task: "By Monday, CPU above 80% on any of our hosts must post a message to #alerts in Slack within 2 minutes." Forces them to integrate Prometheus + node-exporter + Alertmanager + a Slack/Discord webhook end-to-end.

📋 Assignment Requirements

Run the Compose stack from Module 9 (Prometheus + Grafana + Loki + node-exporter) and add an Alertmanager service to it
Configure Prometheus to scrape node-exporter on at least one host (your laptop or an EC2)
Write a Prometheus alerting rule: HighCpu fires when 1-minute average CPU usage on any host > 80% for 2 minutes
Wire Alertmanager to a Slack incoming webhook OR Discord webhook (your choice)
Alert payload must include: hostname, current value (formatted), runbook URL (a placeholder is fine)
Demonstrate end-to-end by running stress --cpu 4 --timeout 180 and showing the resulting Slack/Discord post
Add a Grafana dashboard with one panel: CPU usage per host, last 1 hour
Bonus: Add a severity label and route severity=warn to a different channel than severity=page
Bonus: Add a silence via Alertmanager during a "maintenance window" — show the alert is suppressed
Bonus: Add a CPU-usage SLO ("CPU below 70% for 99% of the month") and a burn-rate alert

expected slack message

🚨 [FIRING] HighCpu — host: laptop-01

CPU usage: 91.2% (threshold: 80%)

Started: 14:23 UTC · Severity: page

Runbook: https://wiki.example.com/runbooks/high-cpu

📊

Grading rubric

Stack runs: 25. Rule fires correctly: 25. Slack/Discord delivery: 20. Dashboard panel: 15. Runbook + labels: 15.

🎯

Common mistakes

Forgot for: 2m (alerts flap), used node_cpu_seconds_total directly (need rate + (1 - idle)), webhook URL pasted in repo.

💡

Stretch

Replace CPU with a real SLI — request error rate, p95 latency. CPU is a teaching example, not a great real-world signal.
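On the "need rate + (1 - idle)" mistake: node_cpu_seconds_total is a counter of CPU-seconds per mode, so its raw value is never a percentage. A commonly used PromQL form is 100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m]))). The sketch below shows the same arithmetic for a single core, with made-up readings; the cpuUsagePercent helper is hypothetical.

```typescript
// One reading of the idle-mode counter for a single core, in CPU-seconds.
interface IdleReading {
  t: number; // wall-clock time of the scrape, in seconds
  idleSeconds: number; // cumulative idle CPU-seconds at that time
}

// CPU usage % = 100 * (1 - per-second rate of idle time).
function cpuUsagePercent(earlier: IdleReading, later: IdleReading): number {
  const idleRate = (later.idleSeconds - earlier.idleSeconds) / (later.t - earlier.t);
  return 100 * (1 - idleRate);
}

// Over 60 s the core was idle for 15 s: 75% busy, below the 80% threshold.
const calmUsage = cpuUsagePercent({ t: 0, idleSeconds: 5000 }, { t: 60, idleSeconds: 5015 });

// Over 60 s the core was idle for only 6 s: about 90% busy, above the threshold.
const stressedUsage = cpuUsagePercent({ t: 0, idleSeconds: 5000 }, { t: 60, idleSeconds: 5006 });

console.log(calmUsage, stressedUsage);
```

Comparing the raw counter against 80 would be meaningless, because the counter only ever grows.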