$ terraform apply && docker build && gh workflow run deploy

DevOps Capstone Project

Build a complete production-grade DevOps pipeline — Terraform → Docker → CI/CD → AWS → Monitoring → Security

01
Capstone Overview — What You're Building
Architecture, tech stack, deliverables — the full picture before you write a single line

The capstone at a glance

The capstone is a complete, production-grade DevOps project that integrates every module in this course into one coherent system. You'll provision infrastructure with Terraform, containerise an app with Docker, automate deployments with GitHub Actions (OIDC, no long-lived keys), deploy to AWS, add Grafana + Prometheus monitoring, and harden the pipeline with Trivy security scanning. The final deliverable is a working system + polished documentation + a 30-minute live walkthrough.

Pick any simple full-stack app (a Node.js REST API + React frontend, a Python Flask API, etc.). The DevOps layer — not the app code — is what you're graded on.

bash — capstone-architecture.sh
### CAPSTONE STACK ###

Infrastructure (Terraform)
VPC + subnets + security groups
EC2 instance (or ECS cluster) for the app
S3 bucket for Terraform state
IAM roles for OIDC GitHub Actions trust

Application (Docker)
Dockerfile (multi-stage, non-root USER)
docker-compose.yml (app + postgres + monitoring)
Image pushed to AWS ECR

CI/CD (GitHub Actions)
PR: lint → test → Trivy scan → build
Main: push to ECR → deploy to EC2/ECS
OIDC: no AWS_SECRET_ACCESS_KEY in secrets

Monitoring (Grafana + Prometheus)
Prometheus scrapes app + node_exporter
Grafana dashboard: RED metrics + CPU/memory
Alertmanager: CPU alert → Slack webhook

Security
Trivy image scan in CI — CRITICAL blocks deploy
gitleaks pre-commit hook
Secrets in AWS Secrets Manager, not .env
🏗️
Everything as code
No manual clicks. Infrastructure, pipelines, dashboards — all version-controlled.
🔁
Full deploy loop
A git push triggers lint → test → build → scan → push to ECR → deploy — no manual steps.
📊
Observable from day one
Grafana is part of the initial Compose stack — not added as an afterthought.
🔒
Secure by design
Trivy scans, no secrets in code, OIDC auth, non-root containers — baked in from the start.
capstone aws terraform docker github-actions

🎯 Practice Questions

Q1.
Draw (on paper or in a diagram tool) your capstone architecture. Label each component: where does Terraform start? Where does the CI pipeline hand off to the app? Where does Prometheus scrape from?
Answer:
Expected diagram flow: GitHub repo → GitHub Actions (CI) → Trivy scan → ECR push → SSH/ECS deploy to EC2 (provisioned by Terraform). On the EC2: Docker Compose runs app + Prometheus + Grafana. Prometheus scrapes app (:3000/metrics) and node_exporter (:9100). Alertmanager sends alerts to Slack. Terraform state lives in S3.
Q2.
Why does the capstone use OIDC for GitHub Actions → AWS authentication instead of storing AWS_ACCESS_KEY_ID as a GitHub secret? What specific attack does OIDC prevent?
Q3.
Your teammate asks: "Why do we need Terraform if we can just click through the AWS console?" Write a 3-sentence answer that references state management, team collaboration, and reproducibility.
02
Terraform — Provision AWS Infrastructure
VPC, EC2, IAM OIDC role, S3 state backend — infrastructure as code from scratch

Infrastructure layer

The Terraform layer provisions everything the app needs to run: a VPC, a public subnet, a security group (ports 22, 80, 443, 9090 for Prometheus, 3001 for Grafana, since the app itself already occupies 3000 on the host), an EC2 instance with an IAM instance profile, an ECR repository, an S3 bucket for Terraform state, and, critically, an IAM OIDC provider and role that GitHub Actions can assume without long-lived keys.

bash — terraform-structure.sh
# Directory layout for capstone Terraform
$ tree infra/
infra/
├── main.tf # VPC, subnet, security group, EC2
├── iam.tf # OIDC provider + GitHub Actions role
├── ecr.tf # ECR repository
├── outputs.tf # EC2 public IP, ECR URL, role ARN
├── variables.tf # region, instance_type, github_repo
├── terraform.tfvars # var values (never commit secrets)
└── backend.tf # S3 + DynamoDB remote state

# Bootstrap once (chicken-and-egg: S3 bucket before Terraform)
$ aws s3 mb s3://capstone-tfstate-${RANDOM}
make_bucket: capstone-tfstate-14423

# Then provision everything else
$ cd infra && terraform init && terraform plan -out=tfplan
Plan: 14 to add, 0 to change, 0 to destroy.
$ terraform apply tfplan
Apply complete! Resources: 14 added, 0 changed, 0 destroyed.

# Grab outputs for GitHub Actions secrets
$ terraform output -raw ecr_url
123456789012.dkr.ecr.ap-southeast-1.amazonaws.com/capstone
$ terraform output -raw github_actions_role_arn
arn:aws:iam::123456789012:role/github-actions-capstone
🔐
OIDC role in Terraform
Create the GitHub OIDC provider and IAM role in iam.tf — the role ARN becomes a GitHub Actions env var.
🗄️
Remote state first
Bootstrap the S3 bucket manually once, then all Terraform state lives remotely — safe for team use.
📤
Outputs drive the pipeline
ECR URL, EC2 IP, and role ARN as Terraform outputs flow directly into GitHub Actions environment variables.
🔄
Destroy safely
terraform destroy tears down the whole stack cleanly — great for cost control while iterating.
terraform aws-iam oidc ec2 ecr

🎯 Practice Questions

Q1.
Write the Terraform iam.tf snippet that creates an OIDC provider for GitHub Actions and an IAM role that trusts it. The role should allow push to ECR and describe EC2 instances.
Answer:
resource "aws_iam_openid_connect_provider" "github" {
  url             = "https://token.actions.githubusercontent.com"
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = ["6938fd4d98bab03faadb97b34396831e3780aea1"]
}

resource "aws_iam_role" "github_actions" {
  name = "github-actions-capstone"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Federated = aws_iam_openid_connect_provider.github.arn }
      Action    = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        StringLike = {
          "token.actions.githubusercontent.com:sub" = "repo:YOUR_ORG/YOUR_REPO:*"
        }
      }
    }]
  })
}

# Still to add: a permissions policy on this role (for example an aws_iam_role_policy) that allows the ECR push actions and ec2:DescribeInstances.
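To make the trust condition above more concrete, here is a minimal JavaScript sketch (JavaScript to match the Node.js examples in this course) of which OIDC tokens a StringLike pattern would admit. It imitates IAM's wildcard rules, where * matches any run of characters and ? matches exactly one. The function name subMatches and the sample sub claims are illustrative assumptions; IAM performs the real evaluation.

```javascript
// Illustrative only: approximates IAM StringLike matching on the token's `sub` claim.
function subMatches(pattern, sub) {
  // Escape regex metacharacters, but leave * and ? for wildcard handling.
  const escaped = pattern.replace(/[.+^${}()|[\]\\]/g, '\\$&');
  const regex = new RegExp(
    '^' + escaped.replace(/\*/g, '.*').replace(/\?/g, '.') + '$'
  );
  return regex.test(sub);
}

// Hypothetical org/repo names, following GitHub's documented `sub` formats.
const anyRefInRepo = 'repo:my-org/my-repo:*';
const mainOnly = 'repo:my-org/my-repo:ref:refs/heads/main';

console.log(subMatches(anyRefInRepo, 'repo:my-org/my-repo:ref:refs/heads/main'));
console.log(subMatches(anyRefInRepo, 'repo:someone-else/fork:ref:refs/heads/main'));
console.log(subMatches(mainOnly, 'repo:my-org/my-repo:pull_request'));
```

The broad pattern lets any workflow run in the repo assume the role, including pull request runs. Tightening it to the main-branch form is one way to limit the role to deploys from main.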
Q2.
What does terraform output -raw ecr_url return and why is piping this value into a GitHub Actions secret better than hardcoding it?
Q3.
You run terraform plan and see "14 to add, 0 to change, 0 to destroy." You then run it again immediately. What does Terraform report and why?
03
Docker — Containerise the Application
Multi-stage Dockerfile, non-root user, Docker Compose with healthchecks, push to ECR

Container layer

The application runs inside Docker containers — both locally (via Compose) and on EC2/ECS in production. A production-grade Dockerfile uses multi-stage builds to keep the final image small, runs as a non-root user, and exposes only the necessary port. The Compose file adds PostgreSQL, Prometheus, Grafana, and node_exporter as sidecar services.

dockerfile — Dockerfile
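# Stage 1: build (needs devDependencies for the build step)
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build
# Drop devDependencies so only runtime packages are copied to stage 2
RUN npm prune --omit=dev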

# Stage 2: runtime (no dev deps, no build tools)
FROM node:20-alpine
RUN addgroup -S appgroup && adduser -S appuser -G appgroup
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
USER appuser
EXPOSE 3000
CMD ["node", "dist/server.js"]

# Build and push to ECR
$ docker build -t capstone:latest .
$ docker tag capstone:latest $ECR_URL:latest
$ aws ecr get-login-password | docker login --username AWS --password-stdin $ECR_URL
$ docker push $ECR_URL:latest
latest: digest: sha256:abc123... size: 28476
🏗️
Multi-stage build
Builder stage compiles; runtime stage runs. Dev dependencies never reach production.
👤
Non-root USER
Running as root in a container means a breakout gives full host access. Always create a dedicated user.
🔗
Compose for local dev
One docker compose up starts the full stack: app + database + monitoring. Mirrors production.
📦
ECR as registry
AWS ECR is private, co-located with your ECS/EC2, and integrates natively with IAM — no registry credentials to manage separately.
docker multi-stage ecr compose

🎯 Practice Questions

Q1.
Run docker images before and after converting a single-stage Dockerfile to multi-stage. What size reduction do you see? Record the before/after numbers.
Answer:
Typical result for a Node.js app: single-stage with node:20 base ≈ 900MB–1.2GB. Multi-stage with node:20-alpine runtime ≈ 120–200MB. That's roughly a 78–90% reduction (900MB → 200MB at the low end, 1.2GB → 120MB at the high end). Alpine strips all non-essential OS packages; multi-stage removes build tools and dev dependencies.
Q2.
Write the docker-compose.yml service block for the monitoring sidecar (Prometheus + Grafana). Include volume mounts for config and data persistence, and a healthcheck for Prometheus.
Q3.
What is the Trivy command to scan your built Docker image before pushing to ECR? What flag makes it exit with a non-zero code if any CRITICAL severity CVEs are found?
04
CI/CD Pipeline — GitHub Actions with OIDC
Full workflow: lint → test → Trivy scan → build → push → deploy — no long-lived AWS keys

Pipeline design

The GitHub Actions workflow has two jobs: ci (runs on every PR and on pushes to main) and deploy (runs only on push to main, after ci passes). The CI job runs lint, unit tests, and a Trivy image scan; if any step fails, the job fails and, with branch protection requiring it, the PR cannot merge. The deploy job authenticates to AWS via OIDC (no long-lived AWS keys stored), builds and pushes the SHA-tagged image to ECR, and then restarts the container on EC2 via SSH or updates the ECS task definition.

yaml — .github/workflows/deploy.yml
name: CI/CD Pipeline
on:
  pull_request:
  push:
    branches: [main]

jobs:
  ci:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm run lint && npm test
      - name: Build image for scanning
        run: docker build -t capstone:${{ github.sha }} .
      - name: Trivy scan
        uses: aquasecurity/trivy-action@master # pin to a release tag or commit SHA in real use
        with:
          image-ref: capstone:${{ github.sha }}
          exit-code: '1' # fail on CRITICAL or HIGH
          severity: CRITICAL,HIGH

  deploy:
    needs: ci
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    permissions:
      id-token: write # required for OIDC
      contents: read
    env:
      ECR_URL: ${{ secrets.ECR_URL }} # assumed secret, set from `terraform output -raw ecr_url`
    steps:
      - uses: actions/checkout@v4 # the image build below needs the source
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: ap-southeast-1
      - run: |
          aws ecr get-login-password | docker login --username AWS --password-stdin $ECR_URL
          docker build -t $ECR_URL:${{ github.sha }} . && docker push $ECR_URL:${{ github.sha }}
      - name: Deploy to EC2 # assumes an SSH key for the host was loaded in an earlier step
        run: ssh -o StrictHostKeyChecking=no ec2-user@${{ secrets.EC2_HOST }} 'cd /app && NEW_SHA=${{ github.sha }} ./deploy.sh'
🔑
OIDC = no keys
id-token: write + configure-aws-credentials with role-to-assume — GitHub mints a short-lived token, AWS trusts it.
🛡️
Trivy gates the deploy
Trivy scan runs in the ci job; deploy uses needs: ci, so a CRITICAL or HIGH CVE blocks production.
🏷️
SHA tagging
Tagging images with github.sha makes every deploy traceable — you know exactly which commit is running.
🔀
PR vs main split
ci runs on all events; deploy only on main push — developers get fast feedback without triggering deploys.
github-actions oidc trivy ci-cd ecr
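As a rough illustration of how the Trivy gate decides pass or fail, the sketch below reads a report in the JSON shape that trivy image --format json produces (a Results array whose entries may carry a Vulnerabilities list) and blocks on the chosen severities. The gate function, the trimmed sample report, and the placeholder CVE IDs are assumptions for illustration; in the pipeline itself the exit-code and severity settings do this work.

```javascript
// Counts findings per severity and reports whether any blocking severity is present.
function gate(report, blocking = ['CRITICAL', 'HIGH']) {
  const counts = {};
  for (const result of report.Results ?? []) {
    // Clean targets may have no Vulnerabilities key at all.
    for (const vuln of result.Vulnerabilities ?? []) {
      counts[vuln.Severity] = (counts[vuln.Severity] ?? 0) + 1;
    }
  }
  const blocked = blocking.some((sev) => (counts[sev] ?? 0) > 0);
  return { counts, blocked };
}

// Trimmed, made-up report: real reports contain many more fields.
const sampleReport = {
  Results: [
    {
      Target: 'capstone:abc123 (alpine 3.19)',
      Vulnerabilities: [
        { VulnerabilityID: 'CVE-0000-0001', Severity: 'HIGH' },
        { VulnerabilityID: 'CVE-0000-0002', Severity: 'LOW' },
      ],
    },
    { Target: 'package-lock.json' },
  ],
};

const outcome = gate(sampleReport);
console.log(outcome.counts, outcome.blocked ? 'BLOCK' : 'PASS');
```

Passing ['CRITICAL'] as the second argument models a looser gate that lets HIGH findings through, which is the trade-off behind the severity setting in the workflow.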

🎯 Practice Questions

Q1.
Why does the deploy job have permissions: id-token: write but the CI job does not? What happens if you add that permission to the CI job as well?
Answer:
The id-token: write permission allows the job to request an OIDC JWT from GitHub's token endpoint — it's only needed for the job that calls configure-aws-credentials with OIDC. Adding it to the CI job is harmless but unnecessary — it won't trigger AWS auth unless you also call configure-aws-credentials. The principle of least privilege suggests not adding it unless needed.
Q2.
Explain what happens step-by-step when a developer opens a PR to main. Which jobs run? What would cause the PR to be blocked from merging?
Q3.
The deploy script on the EC2 server (./deploy.sh) needs to pull the new image and restart the container with zero downtime. Write the 5-line shell script that achieves this using Docker Compose.
Q4.
What is the difference between tagging an image as :latest versus :${{ github.sha }}? Why does using SHA tags make rollbacks easier?
05
End-to-End Testing of Microservices
Integration tests, contract tests, and smoke tests that run in the CI pipeline

Testing strategy

Unit tests verify individual functions; end-to-end (E2E) tests verify the whole system. In a CI pipeline, the testing pyramid looks like: unit tests (many, fast), integration tests (moderate, medium-speed), and smoke tests (few, run post-deploy). For the capstone, the CI job runs unit + integration tests against a test database; a post-deploy smoke test hits the real staging endpoint.

yaml — test strategy in CI
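# GitHub Actions: spin up test DB, run tests
services:
  postgres:
    image: postgres:15
    env:
      POSTGRES_PASSWORD: testpass
      POSTGRES_DB: test # matches the database name in DATABASE_URL below
    ports:
      - 5432:5432 # expose to the runner so localhost works
    options: >-
      --health-cmd pg_isready
      --health-interval 5s
      --health-retries 5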

# Run integration tests against real Postgres
$ DATABASE_URL=postgres://postgres:testpass@localhost/test npm test
✔ POST /api/users creates a user (45ms)
✔ GET /api/users/:id returns user (12ms)
✔ DELETE /api/users/:id soft-deletes (8ms)
3 passing (65ms)

# Smoke test after deploy (staging env)
$ curl -sf https://staging.myapp.com/health | jq .status
"ok"

# If smoke test fails → trigger rollback
$ if ! curl -sf https://staging.myapp.com/health; then ./rollback.sh; fi
🧪
Integration over mocks
GitHub Actions service containers give you a real Postgres in CI — mocks hide real-world failures.
💨
Smoke tests post-deploy
A simple /health endpoint check after deploy catches broken deploys before users do.
🔁
Auto-rollback on failure
Pair the smoke test with a rollback script — failed smoke test restores the previous SHA-tagged image.
📐
Testing pyramid
Many unit (milliseconds) → fewer integration (seconds) → minimal E2E (minutes). Don't invert the pyramid.
testing integration-tests smoke-tests ci-pipeline
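The smoke-test-then-rollback idea in this section can also be sketched in JavaScript, for cases where a single curl is too brittle because the container needs a few seconds to start. This is a minimal sketch under stated assumptions: smokeTest and healthProbe are hypothetical helper names, the retry counts are arbitrary defaults, and the probe expects the {"status":"ok"} health response used in this course. It relies on the global fetch available in Node 18 and later.

```javascript
// Retries a health probe a few times before giving up.
// `probe` is any async function that resolves to true when the service is healthy.
async function smokeTest(probe, { attempts = 5, delayMs = 2000 } = {}) {
  for (let i = 1; i <= attempts; i++) {
    try {
      if (await probe()) return true;
    } catch {
      // A network error counts as a failed attempt, not a crash.
    }
    if (i < attempts) {
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  return false;
}

// Hypothetical probe: healthy means HTTP 2xx and a JSON body with status "ok".
const healthProbe = (url) => async () => {
  const res = await fetch(url);
  if (!res.ok) return false;
  const body = await res.json();
  return body.status === 'ok';
};
```

A CI step could call smokeTest(healthProbe('https://staging.myapp.com/health')) and exit non-zero when it resolves to false, which is the signal to run rollback.sh.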

🎯 Practice Questions

Q1.
Add a /health endpoint to your application that returns {"status":"ok","db":"connected"}. The endpoint should actually check the database connection — not just return a hardcoded string.
Answer:
Node.js + Postgres example:
app.get('/health', async (req, res) => {
  try {
    await db.query('SELECT 1');
    res.json({ status: 'ok', db: 'connected' });
  } catch (err) {
    res.status(503).json({ status: 'error', db: err.message });
  }
});
This fails fast when the database is unreachable, which is exactly what a health check should do.
Q2.
Why is using a real Postgres container in CI better than mocking the database layer? Give a concrete example of a bug that mocks would miss but an integration test would catch.
Q3.
Write the rollback.sh script that re-deploys the previous Docker image on the EC2 server when a smoke test fails. Assume the previous SHA is stored in a file called .last_sha.
06
AWS Deployment — EC2, ECS & Elastic Beanstalk
Choose the right AWS compute target and deploy your Docker container to production

Compute target decision

Three AWS options for running a Dockerised app — pick one for your capstone. EC2 + Docker Compose: simplest, most control, you manage updates. ECS Fargate: serverless containers, AWS manages the host, integrates with ALB and service discovery. Elastic Beanstalk: managed PaaS, best for teams who want one-command deploys without thinking about ECS task definitions.

bash — deployment-options.sh
# OPTION A: EC2 + Compose (capstone default)
# deploy.sh runs on the server via SSH in GitHub Actions
$ cat deploy.sh
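#!/bin/bash
set -euo pipefail
# ECR_URL and NEW_SHA must be set by the caller (e.g. the CI job)
: "${ECR_URL:?}" "${NEW_SHA:?}"
export ECR_IMAGE="$ECR_URL:$NEW_SHA" # the Compose file is assumed to use image: ${ECR_IMAGE}
aws ecr get-login-password | docker login --username AWS --password-stdin "$ECR_URL"
docker compose pull app
docker compose up -d --no-deps app
if [ -f .current_sha ]; then cp .current_sha .last_sha; fi
echo "$NEW_SHA" > .current_sha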

# OPTION B: ECS Fargate (GitHub Actions step)
- name: Deploy to ECS
  uses: aws-actions/amazon-ecs-deploy-task-definition@v1
  with:
    task-definition: ecs-task-def.json
    service: capstone-service
    cluster: capstone-cluster
    wait-for-service-stability: true

# OPTION C: Elastic Beanstalk
$ eb deploy capstone-prod --label v1.2.0
INFO: Deploying new version to instance(s).
INFO: New application version was deployed to running EC2 instances.
🖥️
EC2 + Compose
Most control, lowest abstraction. Good for capstone — you see every moving part.
ECS Fargate
No server management: AWS runs the hosts, and tasks scale out through ECS Service Auto Scaling once you configure it. Best for production at scale.
🌱
Elastic Beanstalk
Managed PaaS — eb deploy handles load balancer, rolling deploy, and health checks automatically.
🔄
Rolling deploy
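docker compose up -d --no-deps app recreates only the app container, keeping the database and monitoring running. With a single container this is an in-place replace with a brief gap in service, not a true rolling update.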
ec2 ecs-fargate elastic-beanstalk deployment

🎯 Practice Questions

Q1.
Your capstone app is deployed to EC2 using Docker Compose. A deployment fails halfway — the new container crashes at startup. What state is the system in? How do you recover?
Answer:
Docker Compose's --no-deps means only the app container was replaced. If the new container crashes, the old one is gone — the app is down. Recovery: docker compose up -d --no-deps app with the previous SHA-tagged image (pulled from ECR). This is why storing .last_sha and having a rollback.sh is critical — recovery becomes a one-command operation.
Q2.
What is the difference between ECS Fargate and EC2 for running containers? When would a production team choose Fargate over EC2 + Compose?
Q3.
The Elastic Beanstalk eb deploy command accepts a --label flag. What is this label used for and how does it help you perform a rollback if the deployment fails?
07
Rollback via CloudFront & Versioned Deploys
How to recover from a bad deploy in under 2 minutes using SHA-tagged images and CloudFront

Rollback strategy

A rollback is a deploy of the previous known-good version. It should be a one-command operation. For the capstone: (1) every image is tagged with its github.sha, (2) CloudFront serves the static frontend, (3) API calls go to the EC2 origin. A bad deploy on EC2 is rolled back by SSH + rollback.sh. A bad frontend deploy is rolled back by re-deploying the old static build to S3 and then invalidating the CloudFront cache.

bash — rollback-runbook.sh
### API ROLLBACK (EC2 + Docker) ###
$ cat rollback.sh
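#!/bin/bash
set -euo pipefail
PREV_SHA=$(cat .last_sha)
echo "Rolling back to $PREV_SHA"
docker pull "$ECR_URL:$PREV_SHA"
ECR_IMAGE="$ECR_URL:$PREV_SHA" docker compose up -d --no-deps app
echo "$PREV_SHA" > .current_sha # keep the bookkeeping in step with what is running
echo "Rollback complete. Running: $PREV_SHA"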

### FRONTEND ROLLBACK (S3 + CloudFront) ###
# 1. Re-upload the previous build artifact
$ aws s3 sync s3://capstone-builds/$PREV_SHA/ s3://capstone-frontend/ --delete

# 2. Invalidate CloudFront cache so edge serves new (old) files
$ aws cloudfront create-invalidation --distribution-id $CF_DIST_ID --paths "/*"
{
"Invalidation": {
"Status": "InProgress",
"InvalidationBatch": { "Paths": { "Items": ["/*"] } }
}
}
# Typically propagates within about a minute, though it can take longer; Status becomes "Completed" when done
One-command rollback
SHA-tagged images in ECR mean any previous version can be pulled and deployed in under 60 seconds.
🌐
CloudFront invalidation
Old static files cached at edge nodes must be explicitly invalidated — S3 sync alone is not enough.
📋
Runbook over heroics
A rollback script that anyone on the team can run beats a complex procedure only the senior engineer knows.
🔍
Smoke test after rollback
Always run the smoke test after rollback — confirm the previous version is actually healthy before declaring victory.
cloudfront rollback s3 incident-response

🎯 Practice Questions

Q1.
Walk through a full rollback scenario: production goes down at 2am. Your smoke test failed after the last deploy. Write the 5 commands you run — in order — to restore service.
Answer:
1. ssh ec2-user@$EC2_HOST — get on the server
2. ./rollback.sh — restore previous image
3. docker compose ps app — confirm container is Up
4. curl -sf http://localhost:3000/health — confirm app responds
5. aws cloudfront create-invalidation --distribution-id $CF_DIST_ID --paths "/*" — clear edge cache if frontend was also deployed. Then notify the team and write a post-mortem.
Q2.
Why does CloudFront serve stale content even after you've re-uploaded files to S3? What is the default CloudFront cache TTL and how do you override it per file type?
Q3.
What is the difference between a "rollback" and a "revert"? When would you use a git revert instead of a deployment rollback?
08
Monitoring & Alerting — Grafana + Prometheus
Wire up the full observability stack: metrics, dashboards, and a CPU alert to Slack

Observability layer

The monitoring stack runs as Docker Compose services alongside the application: Prometheus scrapes metrics from the app (via prom-client) and from node_exporter (host CPU/memory/disk). Grafana visualises both with a RED dashboard (Rate, Errors, Duration) and a USE dashboard (CPU Utilisation, Saturation, Errors). Alertmanager routes CPU alerts to a Slack webhook.

yaml — prometheus/alerts.yml
groups:
  - name: capstone.rules
    rules:

      - alert: HighCPU
        expr: 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "CPU above 80% for 2 minutes on {{ $labels.instance }}"

      - alert: AppHighErrorRate
        # Aggregate both sides by instance so 5xx requests are divided by all requests, not by themselves
        expr: sum by(instance)(rate(http_requests_total{status=~"5.."}[5m])) / sum by(instance)(rate(http_requests_total[5m])) > 0.05
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% on {{ $labels.instance }}"

# Verify alerts loaded correctly
$ curl -s localhost:9090/api/v1/rules | jq '.data.groups[].rules[].name'
"HighCPU"
"AppHighErrorRate"
📈
RED + USE dashboards
RED (Rate/Errors/Duration) for your app; USE (Utilisation/Saturation/Errors) for the host. Two dashboards, full picture.
⏱️
Alert "for" duration
The for: 2m clause prevents alert flapping — the condition must persist for 2 minutes before firing.
🔔
Slack webhook
Alertmanager sends structured messages to a Slack channel — include the instance name and metric value in annotations.
🩺
prom-client in the app
Add prom-client (Node.js) or prometheus_client (Python) to instrument HTTP request count, duration, and errors.
prometheus grafana alertmanager monitoring node-exporter
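To show what the for: clause does to a noisy signal, here is a simplified JavaScript model of the alert lifecycle: the alert is pending while the condition holds and only fires once it has held continuously for the configured duration. The alertStates function and the sample CPU readings are assumptions for illustration; Prometheus evaluates rules on its own schedule and its real state machine has more detail than this.

```javascript
// Simplified model of `for:`. Any sample at or below the threshold resets the timer.
function alertStates(samples, threshold, forSeconds) {
  let activeSince = null;
  return samples.map(({ t, value }) => {
    if (value <= threshold) {
      activeSince = null;
      return 'inactive';
    }
    if (activeSince === null) activeSince = t;
    return t - activeSince >= forSeconds ? 'firing' : 'pending';
  });
}

// Made-up CPU percentages sampled once a minute (t in seconds).
const cpu = [
  { t: 0, value: 85 },   // spike begins
  { t: 60, value: 90 },
  { t: 120, value: 88 }, // condition has now held for 2 minutes
  { t: 180, value: 40 }, // back to normal
  { t: 240, value: 95 }, // brief blip
];

console.log(alertStates(cpu, 80, 120));
// -> [ 'pending', 'pending', 'firing', 'inactive', 'pending' ]
```

With forSeconds set to 0, the blip at t = 240 would fire immediately, which is the flapping behaviour the for: 2m clause is there to suppress.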

🎯 Practice Questions

Q1.
Instrument your capstone app with prom-client. Add a counter for total HTTP requests by method and status code. Write the middleware function and the /metrics endpoint that Prometheus scrapes.
Answer:
const client = require('prom-client');

const httpRequests = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'status'],
});

// Count each request once the response has been sent
app.use((req, res, next) => {
  res.on('finish', () => {
    httpRequests.inc({ method: req.method, status: res.statusCode });
  });
  next();
});

// Endpoint that Prometheus scrapes
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});
Q2.
Why does the CPU alert use a for: 2m clause? What would happen if you removed it and set the threshold to 80%?
Q3.
Write the Alertmanager alertmanager.yml receiver block that sends a message to a Slack webhook URL stored in an environment variable called SLACK_WEBHOOK.
09
Security — Trivy, Secrets & Pipeline Hardening
Scan images, manage secrets properly, and lock down the pipeline before the final assessment

Security checklist for the capstone

A production-grade capstone must pass a security review. The five areas checked: (1) no secrets in git history, (2) Trivy CRITICAL/HIGH scan blocking the pipeline, (3) all runtime secrets in AWS Secrets Manager (not .env files on the server), (4) non-root USER in Dockerfile, (5) IAM roles follow least privilege (OIDC role can only push to ECR and deploy to ECS — not admin).

bash — security-audit.sh
# 1. Check git history for leaked secrets
$ gitleaks detect --source . --verbose
No leaks found.

# 2. Trivy scan — must pass before merge
$ trivy image --exit-code 1 --severity CRITICAL,HIGH $ECR_URL:latest
Total: 0 (CRITICAL: 0, HIGH: 0)
capstone:latest (alpine 3.19) — no vulnerabilities found

# 3. Secrets Manager: fetched by the app at startup, never baked into the image or a .env file
$ aws secretsmanager get-secret-value --secret-id capstone/db-password --query SecretString --output text
{"DB_PASSWORD":"my-secure-pass"}

# 4. Confirm container runs as non-root
$ docker inspect capstone:latest | jq '.[0].Config.User'
"appuser"

# 5. IAM role policy — scope to minimum needed
$ aws iam get-role-policy --role-name github-actions-capstone --policy-name deploy | jq '.PolicyDocument.Statement[].Action'
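["ecr:GetAuthorizationToken","ecr:BatchCheckLayerAvailability","ecr:InitiateLayerUpload","ecr:UploadLayerPart","ecr:CompleteLayerUpload","ecr:PutImage","ecr:BatchGetImage","ecs:UpdateService"]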
🔍
Gitleaks pre-commit
Install as a pre-commit hook — prevents secrets reaching the remote before CI ever runs.
🛡️
Trivy in CI gate
--exit-code 1 with --severity CRITICAL,HIGH means any CRITICAL or HIGH CVE blocks the PR merge. No exceptions during capstone assessment.
🔐
Secrets Manager at runtime
The app calls Secrets Manager on startup — secrets never sit in .env files or GitHub secrets in plaintext.
📏
Least-privilege IAM
The OIDC role can only do what the pipeline needs: ECR push + ECS update. No AdministratorAccess.
trivy gitleaks secrets-manager iam-least-privilege devsecops
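One way to keep the least-privilege check from being a one-off manual review is to lint the policy document itself. The sketch below is a hedged illustration in JavaScript: findOverbroad is a hypothetical helper, the allow-list of ecr: and ecs: prefixes reflects only what this capstone's pipeline needs, and the sample policies are invented. It is no substitute for a proper analyser such as IAM Access Analyzer.

```javascript
// Returns every allowed action that is a wildcard or falls outside the allow-list.
function findOverbroad(policy, allowedPrefixes = ['ecr:', 'ecs:']) {
  // Statement and Action may each be a single value or an array.
  const statements = [].concat(policy.Statement ?? []);
  const findings = [];
  for (const stmt of statements) {
    if (stmt.Effect !== 'Allow') continue;
    for (const action of [].concat(stmt.Action ?? [])) {
      const wildcard = action === '*' || action.endsWith(':*');
      const allowed = allowedPrefixes.some((prefix) => action.startsWith(prefix));
      if (wildcard || !allowed) findings.push(action);
    }
  }
  return findings;
}

// Invented examples: a tight deploy policy and an overly broad one.
const tight = {
  Version: '2012-10-17',
  Statement: [
    { Effect: 'Allow', Action: ['ecr:PutImage', 'ecs:UpdateService'], Resource: '*' },
  ],
};
const loose = {
  Version: '2012-10-17',
  Statement: { Effect: 'Allow', Action: ['ecr:*', 'iam:PassRole'], Resource: '*' },
};

console.log(findOverbroad(tight)); // nothing to report
console.log(findOverbroad(loose)); // both actions are flagged
```

Running a check like this in CI against the output of aws iam get-role-policy would turn item 5 of the audit into a repeatable gate.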

🎯 Practice Questions

Q1.
Run trivy image against your capstone image. If you find any CRITICAL or HIGH CVEs, what are the options for fixing them? List at least 3 remediation approaches.
Answer:
Options: (1) Update the base image: change FROM node:20-alpine to FROM node:20-alpine3.19 (pinned, latest patch). (2) Update the vulnerable package: add RUN apk upgrade --no-cache to the runtime stage to apply OS patches. (3) Remove the package: if a vulnerable library isn't used, delete it from the image. (4) Accept and document: use --ignore-unfixed for CVEs with no available fix, and track them in your SECURITY.md.
Q2.
A developer accidentally commits DB_PASSWORD=supersecret in a .env file to a public GitHub repo. What are the first 3 actions to take in the correct order?
Q3.
Configure your app to retrieve the database password from AWS Secrets Manager at startup. Write the Node.js code that fetches the secret and sets it as an environment variable before the database pool is initialised.
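A possible starting point for this exercise, sketched in JavaScript: the part that can be shown without AWS access is turning the JSON SecretString (like the {"DB_PASSWORD": ...} output in the audit above) into environment variables before the pool is created. The applySecret helper is a hypothetical name, and the choice not to overwrite variables that are already set is an assumption made so that local overrides keep working. The SDK call appears only as a comment because it needs real credentials.

```javascript
// Copies key/value pairs from a JSON SecretString into an env-like object.
// Values that are already set are left alone so local overrides still win.
function applySecret(secretString, env = process.env) {
  const parsed = JSON.parse(secretString);
  for (const [key, value] of Object.entries(parsed)) {
    if (env[key] === undefined) env[key] = String(value);
  }
  return Object.keys(parsed);
}

// In the real app the string would come from the AWS SDK v3, roughly (not run here):
//   const { SecretsManagerClient, GetSecretValueCommand } =
//     require('@aws-sdk/client-secrets-manager');
//   const out = await new SecretsManagerClient({}).send(
//     new GetSecretValueCommand({ SecretId: 'capstone/db-password' }));
//   applySecret(out.SecretString);
//   // ...only then create the database pool

// Demonstration with a throwaway object instead of process.env.
const fakeEnv = {};
const keys = applySecret('{"DB_PASSWORD":"example-only"}', fakeEnv);
console.log(keys, Object.keys(fakeEnv));
```

The ordering matters more than the helper: the fetch has to finish before any module reads process.env.DB_PASSWORD, so the pool should be created inside the same async startup function.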
10
Documentation — README & Architecture Diagram
Professional documentation that lets any engineer understand and operate your system in 30 minutes

Documentation as code

The capstone README is as important as the code. An interviewer, hiring manager, or future team member should be able to read it and: (1) understand what the system does, (2) spin up a local dev environment in under 5 minutes, (3) understand how to deploy and roll back, (4) know where to look when something breaks. The architecture diagram replaces a thousand words — draw it once, link it in the README.

bash — readme-structure.md
### CAPSTONE README STRUCTURE ###

# Capstone: [App Name] — Full DevOps Pipeline

## Architecture
![architecture diagram](./docs/architecture.png)
Brief description: what each component does and how they connect.

## Tech Stack
| Layer      | Technology           |
|------------|----------------------|
| App        | Node.js + Express    |
| DB         | PostgreSQL 15        |
| IaC        | Terraform 1.7        |
| CI/CD      | GitHub Actions       |
| Monitoring | Prometheus + Grafana |
| Registry   | AWS ECR              |
| Hosting    | EC2 (t3.micro)       |

## Local Development
```bash
git clone https://github.com/you/capstone
cp .env.example .env # fill in secrets
docker compose up -d
open http://localhost:3000
```

## Deploying to AWS
## Rolling Back
## Monitoring
## Security Decisions
📐
Architecture diagram
Use draw.io, Excalidraw, or Mermaid. Show: GitHub → Actions → ECR → EC2, Prometheus scraping, Grafana, CloudFront.
5-minute local setup
If the "Local Development" section takes more than 5 minutes to follow, it's too complex. Simplify with Compose.
🔄
Runbooks in README
Include a "Rolling Back" section with the exact command. Your future self at 2am will thank you.
🔒
Security decisions
Document why you chose Secrets Manager over env vars, why non-root user, why OIDC. Shows engineering judgment.
documentation readme architecture-diagram runbook

🎯 Practice Questions

Q1.
Draw your capstone architecture diagram using Excalidraw or draw.io. Include: GitHub repo, GitHub Actions, ECR, EC2, Docker Compose, Prometheus, Grafana, Alertmanager, CloudFront. Export as PNG and add to your README.
Q2.
Write a "Security Decisions" section for your README that explains 3 choices you made (e.g., OIDC instead of access keys, Secrets Manager instead of .env, non-root Docker user). For each, explain the threat it mitigates.
Q3.
Hand your README to a classmate who hasn't seen your project. Time how long it takes them to get the app running locally. What did they get stuck on? Fix those friction points.
11
Final Assessment — 30-Minute Walkthrough + Q&A
What the panel evaluates, how to prepare, and questions you should be ready to answer

Assessment structure

The 30-minute assessment has three parts: (1) 10-minute demo — you walk through a live deploy from a git push to a running container on AWS, (2) 15-minute technical Q&A — the panel asks deep questions on any module in the course, (3) 5-minute architectural review — "what would you do differently if you had to scale this to 10,000 users?" The panel is looking for depth of understanding, not memorised answers.

bash — assessment-prep.sh
### DEMO SCRIPT (rehearse this 3+ times) ###

1. Show the repo structure and Terraform infra/
2. Make a small code change, push to a feature branch, open a PR
3. Show the CI checks running: lint → test → Trivy
4. Merge the PR → show deploy job running with OIDC
5. Open the live URL — show the change is live
6. Open Grafana — show the deploy spike in the RED dashboard
7. Trigger the CPU alert by running a load test (optional)
8. Run rollback.sh to demonstrate recovery

### LIKELY Q&A QUESTIONS ###
"Why OIDC instead of access keys?"
"What happens if Trivy finds a CRITICAL CVE?"
"How would you add staging + production environments?"
"What's the MTTR if production goes down at 2am?"
"How would you scale this to 100 users? 10,000 users?"
"What would you add first if you had another week?"
🎭
Live demo, no slides
The panel wants to see real infrastructure working — not PowerPoint. Make sure your EC2 is running before you start.
🧠
Know your "why"s
Every tool choice should have a reason: "I chose Fargate because..." or "I chose EC2 because..." No random decisions.
🔮
Scale questions
Scaling answer template: database (RDS Multi-AZ), app (ECS auto-scaling), CDN (CloudFront), cache (ElastiCache).
Honest about gaps
"I didn't implement X because of time, but I know the approach would be..." beats bluffing every time.
assessment presentation demo q-and-a

🎯 Practice Questions

Q1.
Answer the scale question: "How would you evolve this capstone architecture to handle 10,000 concurrent users?" List specific AWS services you'd add and why.
Answer:
At 10,000 concurrent users: (1) Database: Move from Postgres on EC2 to RDS Multi-AZ with a read replica for read-heavy queries. (2) App tier: Replace the single EC2 instance with ECS Fargate behind an Application Load Balancer; set auto-scaling on CPU/request count. (3) Static assets: Already on CloudFront; tune cache TTLs and enable compression at the edge. (4) Caching: Add ElastiCache (Redis) to cache frequent DB queries and session data. (5) Async work: Move background jobs to SQS + Lambda to avoid blocking the API. (6) Monitoring: Upgrade to Datadog or managed Grafana Cloud for cross-region visibility.
Q2.
Practice the full 30-minute demo with a classmate acting as the panel. Time each section. What parts run over? What questions catch you off guard?
Q3.
The panel asks: "What would you do differently if you rebuilt this from scratch?" Give an honest answer that shows technical growth — not just "I'd do everything the same."
12
Assignment: Polish Docs & Ship the Full Project
Final submission checklist — everything that must be live and documented before the assessment

Final submission

The capstone is assessed on the combination of working infrastructure + clean documentation + clear explanations. Use this checklist to ensure nothing is missing before your 30-minute walkthrough. Submit the GitHub repo URL and the live URL to your instructor at least 24 hours before the assessment so they can review the pipeline and documentation independently.

bash — submission-checklist.sh
### INFRASTRUCTURE ✓ ###
☐ terraform apply completes with 0 errors
☐ EC2 instance is running and accessible via SSH
☐ ECR repository contains at least 3 SHA-tagged images
☐ S3 + DynamoDB Terraform state backend configured

### CI/CD ✓ ###
☐ GitHub Actions pipeline: all 3 stages (lint/test/Trivy) pass on PR
☐ OIDC authentication working (no AWS_SECRET_ACCESS_KEY in secrets)
☐ Push to main triggers automatic deploy to EC2
☐ Smoke test runs post-deploy and rollback.sh works

### MONITORING ✓ ###
☐ Grafana dashboard live at http://[EC2_IP]:3001
☐ RED dashboard showing app request rate and error rate
☐ CPU alert configured and tested (fire + resolve)
☐ Slack webhook receiving alert messages

### SECURITY ✓ ###
☐ gitleaks: 0 secrets found in git history
☐ trivy image: 0 CRITICAL/HIGH CVEs in final image
☐ All secrets in AWS Secrets Manager (not .env on server)
☐ Dockerfile runs as non-root user

### DOCUMENTATION ✓ ###
☐ README: architecture diagram + local setup + deploy + rollback
☐ Architecture diagram exported as PNG in /docs
☐ SECURITY.md documenting decisions + known limitations
☐ .env.example (never .env) committed to the repo

📋 Final Assessment Grading (100 points)

  • Working infrastructure (25pts) — Terraform provisions cleanly, EC2 running, ECR populated with SHA-tagged images
  • CI/CD pipeline (25pts) — All stages pass, OIDC auth, auto-deploy on main push, smoke test + rollback demonstrated live
  • Monitoring (20pts) — Grafana dashboard live, RED metrics, CPU alert fires to Slack, alert resolves correctly
  • Security (15pts) — No secrets in git, Trivy clean, Secrets Manager used, non-root Dockerfile, least-privilege IAM
  • Documentation (15pts) — README with architecture diagram, local setup under 5 minutes, runbooks, security decisions documented
  • Q&A depth (bonus up to 10pts) — Demonstrates understanding beyond surface level: trade-offs explained, scale thinking, honest gaps acknowledged