The capstone at a glance
The capstone is a complete, production-grade DevOps project that integrates every module in this course into one coherent system. You'll provision infrastructure with Terraform, containerise an app with Docker, automate deployments with GitHub Actions (OIDC, no long-lived keys), deploy to AWS, add Grafana + Prometheus monitoring, and harden the pipeline with Trivy security scanning. The final deliverable is a working system + polished documentation + a 30-minute live walkthrough.
Pick any simple full-stack app (a Node.js REST API + React frontend, a Python Flask API, etc.). The DevOps layer — not the app code — is what you're graded on.
🎯 Practice Questions
Why avoid storing AWS_ACCESS_KEY_ID as a GitHub secret? What specific attack does OIDC prevent?
Infrastructure layer
The Terraform layer provisions everything the app needs to run: a VPC, a public subnet, a security group (ports 22, 80, 443, 9090 for Prometheus, 3000 for Grafana), an EC2 instance with an IAM instance profile, an ECR repository, an S3 bucket for Terraform state, and — critically — an IAM OIDC provider and role that GitHub Actions can assume without long-lived keys.
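To see which workflows an OIDC role like this would trust, it can help to model the wildcard match that IAM's StringLike operator applies to the token's sub claim. The sketch below is an approximation written for illustration, not AWS's implementation; the function names are hypothetical, and the subject strings follow the repo:ORG/REPO:... shape that GitHub uses for its tokens.

```javascript
// Sketch of IAM StringLike matching: "*" matches any run of characters,
// "?" matches exactly one character, everything else is literal and case-sensitive.
function stringLikeToRegExp(pattern) {
  // Escape regex metacharacters except the two wildcards we translate below.
  const escaped = pattern.replace(/[.+^${}()|[\]\\]/g, "\\$&");
  const source = escaped.replace(/\*/g, ".*").replace(/\?/g, ".");
  return new RegExp(`^${source}$`);
}

// True when the token's sub claim satisfies the trust policy pattern.
function subjectAllowed(pattern, sub) {
  return stringLikeToRegExp(pattern).test(sub);
}

// A repo-wide pattern accepts any branch, tag, or pull request in that repo.
const repoWide = "repo:YOUR_ORG/YOUR_REPO:*";
// A tighter pattern (an assumption, not from the course) accepts main only.
const mainOnly = "repo:YOUR_ORG/YOUR_REPO:ref:refs/heads/main";

console.log(subjectAllowed(repoWide, "repo:YOUR_ORG/YOUR_REPO:pull_request"));
console.log(subjectAllowed(mainOnly, "repo:YOUR_ORG/YOUR_REPO:pull_request"));
```

The first call prints true and the second prints false, which is the practical difference between trusting the whole repository and trusting only pushes to main.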
- iam.tf: the role ARN becomes a GitHub Actions env var.
- terraform destroy tears down the whole stack cleanly, which is great for cost control while iterating.
🎯 Practice Questions
Write an iam.tf snippet that creates an OIDC provider for GitHub Actions and an IAM role that trusts it. The role should allow push to ECR and describe EC2 instances.
Answer:
# Identity provider that lets AWS verify tokens issued by GitHub Actions
resource "aws_iam_openid_connect_provider" "github" {
  url             = "https://token.actions.githubusercontent.com"
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = ["6938fd4d98bab03faadb97b34396831e3780aea1"]
}
# Role that workflows in one repository may assume through OIDC
resource "aws_iam_role" "github_actions" {
  name = "github-actions-capstone"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Federated = aws_iam_openid_connect_provider.github.arn }
      Action    = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        # Only accept tokens minted for AWS STS
        StringEquals = {
          "token.actions.githubusercontent.com:aud" = "sts.amazonaws.com"
        }
        # Only accept tokens from this repository (any branch, tag, or PR)
        StringLike = {
          "token.actions.githubusercontent.com:sub" = "repo:YOUR_ORG/YOUR_REPO:*"
        }
      }
    }]
  })
}
This snippet covers the trust relationship only; the ECR push and ec2:DescribeInstances permissions still have to be attached to the role as a separate policy.
What does terraform output -raw ecr_url return, and why is piping this value into a GitHub Actions secret better than hardcoding it?
You run terraform plan and see "14 to add, 0 to change, 0 to destroy." You then run it again immediately. What does Terraform report and why?
Container layer
The application runs inside Docker containers — both locally (via Compose) and on EC2/ECS in production. A production-grade Dockerfile uses multi-stage builds to keep the final image small, runs as a non-root user, and exposes only the necessary port. The Compose file adds PostgreSQL, Prometheus, Grafana, and node_exporter as sidecar services.
- docker compose up starts the full stack: app + database + monitoring. Mirrors production.
🎯 Practice Questions
Run docker images before and after converting a single-stage Dockerfile to multi-stage. What size reduction do you see? Record the before/after numbers.
Answer: A single-stage image on the node:20 base is ≈ 900MB–1.2GB. A multi-stage build with a node:20-alpine runtime is ≈ 120–200MB. That's roughly a 78–90% reduction. Alpine strips all non-essential OS packages; multi-stage removes build tools and dev dependencies.
Write the docker-compose.yml service block for the monitoring sidecar (Prometheus + Grafana). Include volume mounts for config and data persistence, and a healthcheck for Prometheus.
Pipeline design
The GitHub Actions workflow has two jobs: ci (runs on every PR) and deploy (runs on push to main). The CI job runs lint, unit tests, and Trivy image scan — if any fails, the PR cannot merge. The deploy job authenticates to AWS via OIDC (no secrets stored), pulls the latest image, and restarts the container on EC2 via SSH or updates the ECS task definition.
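The trigger rules described above can be modelled as a small decision function. This is only a sketch of the logic, not workflow syntax: jobsToRun and imageRef are hypothetical helper names, while the event name, ref, and result values mirror what GitHub exposes as github.event_name, github.ref, and needs.ci.result.

```javascript
// ci runs for every event; deploy runs only after a successful ci
// on a push to the main branch.
function jobsToRun({ eventName, ref, ciResult }) {
  const jobs = ["ci"];
  const isMainPush = eventName === "push" && ref === "refs/heads/main";
  if (isMainPush && ciResult === "success") jobs.push("deploy");
  return jobs;
}

// Builds an image reference tagged with the full commit SHA, so the
// running container can always be traced back to one commit.
function imageRef(repositoryUrl, sha) {
  if (!/^[0-9a-f]{40}$/.test(sha)) {
    throw new Error(`not a full commit SHA: ${sha}`);
  }
  return `${repositoryUrl}:${sha}`;
}

console.log(jobsToRun({ eventName: "pull_request", ref: "refs/pull/7/merge", ciResult: "success" }));
console.log(jobsToRun({ eventName: "push", ref: "refs/heads/main", ciResult: "success" }));
```

Rejecting anything that is not a 40-character SHA is a design choice in this sketch: it keeps mutable tags such as latest out of the deploy path.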
- id-token: write + configure-aws-credentials with role-to-assume. GitHub mints a short-lived token and AWS trusts it.
- Trivy runs in the ci job; deploy uses needs: ci, so a CRITICAL CVE blocks production.
- Tagging images with github.sha makes every deploy traceable: you know exactly which commit is running.
- ci runs on all events; deploy only on main push, so developers get fast feedback without triggering deploys.
🎯 Practice Questions
Why does the deploy job need permissions: id-token: write but the CI job does not? What happens if you add that permission to the CI job as well?
Answer: The id-token: write permission allows the job to request an OIDC JWT from GitHub's token endpoint; it's only needed for the job that calls configure-aws-credentials with OIDC. Adding it to the CI job is harmless but unnecessary: it won't trigger AWS auth unless you also call configure-aws-credentials. The principle of least privilege suggests not adding it unless needed.
The deploy script (./deploy.sh) needs to pull the new image and restart the container with zero downtime. Write the 5-line shell script that achieves this using Docker Compose.
What is the difference between tagging an image :latest versus :${{ github.sha }}? Why does using SHA tags make rollbacks easier?
Testing strategy
Unit tests verify individual functions; end-to-end (E2E) tests verify the whole system. In a CI pipeline, the testing pyramid looks like: unit tests (many, fast), integration tests (moderate, medium-speed), and smoke tests (few, run post-deploy). For the capstone, the CI job runs unit + integration tests against a test database; a post-deploy smoke test hits the real staging endpoint.
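A post-deploy smoke test usually needs to tolerate a few seconds of startup before it declares the deploy broken. The sketch below shows one way to do that with retries, assuming Node 18+ (for the built-in fetch) and a health endpoint that returns a JSON body with a status field. waitForHealthy and httpHealthCheck are hypothetical names, and the attempt count and delay are arbitrary defaults.

```javascript
// Polls a health check until it passes or the attempts run out.
// `check` is any async function that resolves to true when the service is healthy.
async function waitForHealthy(check, { attempts = 5, delayMs = 2000 } = {}) {
  for (let i = 1; i <= attempts; i++) {
    try {
      if (await check()) return true;
    } catch (err) {
      // A refused connection or bad response counts as "not healthy yet".
    }
    if (i < attempts) {
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  return false;
}

// Builds a check that calls an HTTP health endpoint with Node's built-in fetch.
function httpHealthCheck(url) {
  return async () => {
    const res = await fetch(url);
    if (!res.ok) return false;
    const body = await res.json();
    return body.status === "ok";
  };
}
```

In a pipeline step this could be wired up as waitForHealthy(httpHealthCheck(url)).then((ok) => process.exit(ok ? 0 : 1)), where url points at the staging health endpoint, so that a failed check fails the job and can trigger a rollback.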
- A /health endpoint check after deploy catches broken deploys before users do.
🎯 Practice Questions
Add a /health endpoint to your application that returns {"status":"ok","db":"connected"}. The endpoint should actually check the database connection, not just return a hardcoded string.
Answer:
// Assumes an Express `app` and a database client `db` with an async query() method.
app.get('/health', async (req, res) => {
  try {
    // A cheap round trip that proves the connection really works
    await db.query('SELECT 1');
    res.json({ status: 'ok', db: 'connected' });
  } catch (err) {
    // 503 tells load balancers and smoke tests that this instance is unhealthy
    res.status(503).json({ status: 'error', db: err.message });
  }
});
This fails fast when the database is unreachable, which is exactly what a health check should do.
Write a rollback.sh script that re-deploys the previous Docker image on the EC2 server when a smoke test fails. Assume the previous SHA is stored in a file called .last_sha.
Compute target decision
Three AWS options for running a Dockerised app — pick one for your capstone. EC2 + Docker Compose: simplest, most control, you manage updates. ECS Fargate: serverless containers, AWS manages the host, integrates with ALB and service discovery. Elastic Beanstalk: managed PaaS, best for teams who want one-command deploys without thinking about ECS task definitions.
- eb deploy handles load balancer, rolling deploy, and health checks automatically.
- docker compose up -d --no-deps app replaces only the app container, keeping the database and monitoring running.
🎯 Practice Questions
Answer: --no-deps means only the app container was replaced. If the new container crashes, the old one is gone and the app is down. Recovery: run docker compose up -d --no-deps app with the previous SHA-tagged image (pulled from ECR). This is why storing .last_sha and having a rollback.sh is critical: recovery becomes a one-command operation.
The eb deploy command accepts a --label flag. What is this label used for and how does it help you perform a rollback if the deployment fails?
Rollback strategy
A rollback is a deploy of the previous known-good version. It should be a one-command operation. For the capstone: (1) every image is tagged with its github.sha, (2) CloudFront serves the static frontend, (3) API calls go to the EC2 origin. A bad deploy on EC2 is rolled back by SSH + rollback.sh. A bad frontend deploy is rolled back by re-deploying the old static build to S3 and then invalidating the CloudFront cache, so the edge stops serving the bad version.
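Picking "the previous known-good version" is simple to state but easy to get wrong when several deploys in a row were bad. The sketch below assumes a hypothetical deploy log, an array of { sha, healthy } entries ordered oldest to newest, and walks backwards past the failing deploy; it is an illustration of the selection rule, not part of the course's rollback.sh.

```javascript
// history: deploys ordered oldest to newest, each { sha, healthy }.
// The newest entry is the deploy being rolled back, so it is skipped.
function rollbackTarget(history) {
  for (let i = history.length - 2; i >= 0; i--) {
    if (history[i].healthy) return history[i].sha;
  }
  return null; // nothing healthy to go back to
}

// Placeholder SHAs for illustration only.
const history = [
  { sha: "1111111", healthy: true },
  { sha: "2222222", healthy: false },
  { sha: "3333333", healthy: false },
];

console.log(rollbackTarget(history));
```

Here the function returns "1111111", skipping the earlier failed deploy as well as the current one. Keeping a single previous SHA on the server is the one-entry version of the same idea.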
🎯 Practice Questions
Answer:
1. ssh ec2-user@$EC2_HOST (get on the server)
2. ./rollback.sh (restore previous image)
3. docker compose ps app (confirm container is Up)
4. curl -sf http://localhost:3000/health (confirm app responds)
5. aws cloudfront create-invalidation --distribution-id $CF_DIST_ID --paths "/*" (clear edge cache if frontend was also deployed)
Then notify the team and write a post-mortem.
Observability layer
The monitoring stack runs as Docker Compose services alongside the application: Prometheus scrapes metrics from the app (via prom-client) and from node_exporter (host CPU/memory/disk). Grafana visualises both with a RED dashboard (Rate, Errors, Duration) and a USE dashboard (CPU Utilisation, Saturation, Errors). Alertmanager routes CPU alerts to a Slack webhook.
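Alert rules for a stack like this normally carry a for: duration, meaning the condition has to hold across consecutive evaluations before the alert fires. The sketch below simulates that behaviour over a list of CPU samples so the effect on short spikes is easy to see. It is a simplified model rather than Prometheus code, and firstFiringTime, the 80% threshold, and the one-minute evaluation spacing are all assumptions made for the example.

```javascript
// samples: [{ t: <ms timestamp>, value: <number> }] in evaluation order.
// Returns the timestamp at which the alert starts firing, or null if it never does.
function firstFiringTime(samples, threshold, forMs) {
  let pendingSince = null;
  for (const { t, value } of samples) {
    if (value > threshold) {
      if (pendingSince === null) pendingSince = t; // alert becomes pending
      if (t - pendingSince >= forMs) return t;     // held long enough: firing
    } else {
      pendingSince = null; // condition cleared, so the timer restarts
    }
  }
  return null;
}

const MINUTE = 60_000;
// A one-minute spike, a dip, then sustained high CPU.
const cpu = [95, 95, 50, 95, 95, 95].map((value, i) => ({ t: i * MINUTE, value }));

console.log(firstFiringTime(cpu, 80, 2 * MINUTE)); // waits for the sustained period
console.log(firstFiringTime(cpu, 80, 0));          // fires on the very first sample
```

With a two-minute hold the alert ignores the initial spike and fires only once the later high readings have lasted two minutes; with no hold it fires immediately, which is the flapping behaviour the clause is meant to prevent.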
- The for: 2m clause prevents alert flapping: the condition must persist for 2 minutes before firing.
- Use prom-client (Node.js) or prometheus_client (Python) to instrument HTTP request count, duration, and errors.
🎯 Practice Questions
Instrument your app with prom-client. Add a counter for total HTTP requests by method and status code. Write the middleware function and the /metrics endpoint that Prometheus scrapes.
Answer:
// Assumes an existing Express `app`.
const client = require('prom-client');
// Counter labelled by HTTP method and response status code
const httpRequests = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'status']
});
// Middleware: count each request once its response has been sent
app.use((req, res, next) => {
  res.on('finish', () => {
    httpRequests.inc({ method: req.method, status: res.statusCode });
  });
  next();
});
// Endpoint that Prometheus scrapes
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});
What is the purpose of the for: 2m clause? What would happen if you removed it and set the threshold to 80%?
Write an alertmanager.yml receiver block that sends a message to a Slack webhook URL stored in an environment variable called SLACK_WEBHOOK.
Security checklist for the capstone
A production-grade capstone must pass a security review. The five areas checked: (1) no secrets in git history, (2) Trivy CRITICAL/HIGH scan blocking the pipeline, (3) all runtime secrets in AWS Secrets Manager (not .env files on the server), (4) non-root USER in Dockerfile, (5) IAM roles follow least privilege (OIDC role can only push to ECR and deploy to ECS — not admin).
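The scan gate in item (2) comes down to one rule: fail if the report contains any finding at a blocking severity. The sketch below applies that rule to a parsed Trivy JSON report, assuming the usual layout of a Results array whose entries carry a Target and an optional Vulnerabilities list. blockingFindings is a hypothetical helper, and the vulnerability IDs in the sample are placeholders, not real CVEs.

```javascript
// Returns every finding whose severity should block the pipeline.
function blockingFindings(report, severities = ["CRITICAL", "HIGH"]) {
  const blocked = new Set(severities);
  return (report.Results ?? []).flatMap((result) =>
    // Targets with no findings may omit the Vulnerabilities key entirely.
    (result.Vulnerabilities ?? [])
      .filter((v) => blocked.has(v.Severity))
      .map((v) => ({ target: result.Target, id: v.VulnerabilityID, severity: v.Severity }))
  );
}

// Hand-written sample report with placeholder IDs.
const sampleReport = {
  Results: [
    {
      Target: "capstone-app (alpine)",
      Vulnerabilities: [
        { VulnerabilityID: "CVE-0000-0001", Severity: "CRITICAL" },
        { VulnerabilityID: "CVE-0000-0002", Severity: "LOW" },
      ],
    },
    { Target: "package-lock.json" },
  ],
};

const findings = blockingFindings(sampleReport);
console.log(findings.length > 0 ? "block the merge" : "scan clean");
```

In the pipeline itself Trivy's own exit code does this job; a filter like this is mainly useful for summarising what blocked the build.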
- Running Trivy with --exit-code 1 fails the job when a matching vulnerability is found, so a CRITICAL CVE blocks the PR merge. No exceptions during capstone assessment.
🎯 Practice Questions
Run trivy image against your capstone image. If you find any CRITICAL or HIGH CVEs, what are the options for fixing them? List at least 3 remediation approaches.
Answer: (1) Update the base image: FROM node:20-alpine → FROM node:20-alpine3.19 (pinned, latest patch), (2) Update the vulnerable package: add RUN apk upgrade --no-cache to the runtime stage to apply OS patches, (3) Remove the package: if a vulnerable library isn't used, delete it from the image, (4) Accept and document: use --ignore-unfixed for CVEs with no available fix, and track in your SECURITY.md.
You accidentally pushed DB_PASSWORD=supersecret in a .env file to a public GitHub repo. What are the first 3 actions to take in the correct order?
Documentation as code
The capstone README is as important as the code. An interviewer, hiring manager, or future team member should be able to read it and: (1) understand what the system does, (2) spin up a local dev environment in under 5 minutes, (3) understand how to deploy and roll back, (4) know where to look when something breaks. The architecture diagram replaces a thousand words — draw it once, link it in the README.
Assessment structure
The 30-minute assessment has three parts: (1) 10-minute demo — you walk through a live deploy from a git push to a running container on AWS, (2) 15-minute technical Q&A — the panel asks deep questions on any module in the course, (3) 5-minute architectural review — "what would you do differently if you had to scale this to 10,000 users?" The panel is looking for depth of understanding, not memorised answers.
Final submission
The capstone is assessed on the combination of working infrastructure + clean documentation + clear explanations. Use this checklist to ensure nothing is missing before your 30-minute walkthrough. Submit the GitHub repo URL and the live URL to your instructor at least 24 hours before the assessment so they can review the pipeline and documentation independently.
📋 Final Assessment Grading (100 points)
- Working infrastructure (25pts) — Terraform provisions cleanly, EC2 running, ECR populated with SHA-tagged images
- CI/CD pipeline (25pts) — All stages pass, OIDC auth, auto-deploy on main push, smoke test + rollback demonstrated live
- Monitoring (20pts) — Grafana dashboard live, RED metrics, CPU alert fires to Slack, alert resolves correctly
- Security (15pts) — No secrets in git, Trivy clean, Secrets Manager used, non-root Dockerfile, least-privilege IAM
- Documentation (15pts) — README with architecture diagram, local setup under 5 minutes, runbooks, security decisions documented
- Q&A depth (bonus up to 10pts) — Demonstrates understanding beyond surface level: trade-offs explained, scale thinking, honest gaps acknowledged