
15 Cloud Engineer Interview Questions (2026)

April 23, 2026

1. Walk me through how you would structure a Terraform codebase for a multi-environment deployment.

IaC structure is one of the clearest signals of cloud-engineering maturity. Interviewers want to see that you've thought about module reuse, state isolation, and environment parity — not just copy-pasted directories. Bonus signal if you mention remote state and locking.

I structure Terraform around reusable modules in a modules/ directory and per-environment root configurations in envs/dev, envs/staging, and envs/prod that consume those modules with environment-specific variables. State lives in S3 with DynamoDB locking, and each environment has its own state file so a bad dev apply can never touch prod. I keep the module interfaces stable and version them with Git tags so rolling out a change is a conscious promotion between environments.
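As a sketch, the per-environment root described above might look like this (the bucket, lock table, module repo, and version tag are all hypothetical):

```hcl
# envs/prod/main.tf — illustrative layout, not a drop-in configuration.
# Each environment has its own root module and its own state file.
terraform {
  backend "s3" {
    bucket         = "acme-terraform-state"   # assumed state bucket
    key            = "prod/terraform.tfstate" # per-environment state key
    region         = "eu-west-1"
    dynamodb_table = "terraform-locks"        # DynamoDB table for state locking
  }
}

module "network" {
  # Pinned to a Git tag, so promoting a module change to prod is deliberate
  source      = "git::https://example.com/acme/terraform-modules.git//network?ref=v1.4.0"
  envname     = "prod"
  cidr_block  = "10.0.0.0/16"
}
```

Because the state key differs per environment, `terraform apply` in envs/dev can never lock or mutate prod's state.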

2. Explain how you would implement least-privilege IAM for a new microservice on AWS.

IAM is where most cloud security incidents originate, and hiring managers want to see real operational habits. Watch for specifics — role-per-service, scoped resource ARNs, condition keys — rather than a generic "principle of least privilege" answer.

I create a dedicated IAM role per service and attach a customer-managed policy scoped to the exact actions and resource ARNs that service needs. I start by denying everything, then add permissions driven by CloudTrail logs from a staging environment. For workloads on EKS I use IRSA so pods assume the role directly without long-lived credentials, and I audit with IAM Access Analyzer monthly to catch over-permissioned roles that have drifted.
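A minimal policy in that style might look like the following; the queue ARN, account ID, and Sid are illustrative, and a real service would likely need a few more actions surfaced by CloudTrail:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "OrdersServiceQueueAccess",
      "Effect": "Allow",
      "Action": [
        "sqs:ReceiveMessage",
        "sqs:DeleteMessage",
        "sqs:GetQueueAttributes"
      ],
      "Resource": "arn:aws:sqs:eu-west-1:123456789012:orders-queue",
      "Condition": {
        "StringEquals": { "aws:RequestedRegion": "eu-west-1" }
      }
    }
  ]
}
```

Note the scoped resource ARN and the region condition key rather than `"Resource": "*"`.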

3. How would you design a CI/CD pipeline for a containerised application?

Pipeline design questions probe your end-to-end thinking — source, build, test, security scanning, and safe deployment. Strong candidates mention image scanning, rollback, and progressive delivery rather than just "push to main, deploy to prod."

I'd wire up GitHub Actions or GitLab CI to run unit tests and linting on every PR, build a container image on merge to main, and push it to ECR with both a semver and a git-SHA tag. Trivy or Snyk scans the image before it's promoted. Deployment is handled by Argo CD watching the Git repo, with Argo Rollouts for canary or blue-green so I can roll back via Git revert if metrics degrade.
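The build-and-push half of that pipeline could be sketched in GitHub Actions like this; the role ARN, repository name, and region are assumptions:

```yaml
# .github/workflows/build.yml — an illustrative sketch, not a drop-in pipeline.
name: build-and-push
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      id-token: write      # OIDC federation to AWS, no stored access keys
      contents: read
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/ci-ecr-push
          aws-region: eu-west-1
      - id: login
        uses: aws-actions/amazon-ecr-login@v2
      - run: |
          docker build -t "$REPO:$GITHUB_SHA" .
          docker push "$REPO:$GITHUB_SHA"
        env:
          REPO: ${{ steps.login.outputs.registry }}/app
```

An image-scanning step (Trivy or Snyk) would sit between build and push, and Argo CD picks the tag up from Git rather than this workflow deploying directly.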

4. Describe how you would set up monitoring and alerting for a new production service.

Observability questions test whether you understand the difference between metrics, logs, and traces and how to avoid alert fatigue. The weakest answers list tools without strategy; strong answers talk about SLOs and paging thresholds.

I start by defining the service's SLOs — typically availability and p95 latency — and build alerts only on symptoms that indicate SLO burn. Metrics go to Prometheus or CloudWatch, logs to a centralised store like Loki or CloudWatch Logs with structured JSON, and traces to something OpenTelemetry-compatible. I keep paging alerts under ten per service and everything else goes to a ticket queue so we don't normalise getting woken up.
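A symptom-based alert in that spirit might look like this Prometheus rule; the service label, metric names, and thresholds are assumptions:

```yaml
# Hypothetical alerting rule: page on error-ratio SLO burn,
# not on individual host or container symptoms.
groups:
  - name: checkout-slo
    rules:
      - alert: CheckoutHighErrorRatio
        expr: |
          sum(rate(http_requests_total{service="checkout",code=~"5.."}[5m]))
            / sum(rate(http_requests_total{service="checkout"}[5m])) > 0.02
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "checkout 5xx ratio above 2% for 10m (SLO burn)"
```

Anything below `severity: page` would route to the ticket queue instead of the pager.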

5. How do you approach cost optimisation on AWS?

FinOps has become a core cloud-engineer responsibility. Interviewers want concrete levers you've pulled — not just "we used Reserved Instances once." Listen for continuous practices over one-off projects.

I run monthly cost reviews using Cost Explorer with tagging enforced via Service Control Policies so every resource rolls up to a cost centre. My biggest wins have typically come from rightsizing with Compute Optimizer, covering steady-state compute with Savings Plans, and migrating stateless workloads to Graviton. I also set up budget alerts per account at 80 percent so surprise bills surface before month-end.
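The review itself can start from a single Cost Explorer query; dates here are illustrative, and the command assumes Cost Explorer is enabled and credentials are configured:

```shell
# Last month's spend broken down by AWS service, for the cost review.
aws ce get-cost-and-usage \
  --time-period Start=2026-03-01,End=2026-04-01 \
  --granularity MONTHLY \
  --metrics UnblendedCost \
  --group-by Type=DIMENSION,Key=SERVICE
```

Swapping the `--group-by` key to a cost-allocation tag gives the per-cost-centre view once tagging is enforced.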

6. Explain VPC peering versus Transit Gateway.


Networking fundamentals separate candidates who understand cloud infrastructure from those who only know application deployment. Expect follow-ups on route tables and transitive routing.

VPC peering is a point-to-point connection between two VPCs with non-transitive routing: if VPC A is peered with B and B with C, A cannot reach C through B; you need a direct A-to-C peering, which is why a full mesh of n VPCs needs n(n-1)/2 connections. Transit Gateway acts as a hub-and-spoke router, supports transitive routing, and scales to thousands of VPCs and on-prem connections via Direct Connect. I default to Transit Gateway beyond three VPCs because the peering mesh gets unmanageable quickly.
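The hub-and-spoke shape is simple to express in Terraform; this sketch assumes `aws_vpc.app` and its subnets are defined elsewhere:

```hcl
# Illustrative hub-and-spoke: one Transit Gateway, VPCs attach as spokes.
resource "aws_ec2_transit_gateway" "hub" {
  description = "shared routing hub"
}

resource "aws_ec2_transit_gateway_vpc_attachment" "app" {
  transit_gateway_id = aws_ec2_transit_gateway.hub.id
  vpc_id             = aws_vpc.app.id        # assumes an aws_vpc.app resource
  subnet_ids         = aws_subnet.app[*].id  # one attachment subnet per AZ
}
```

Adding a new VPC is one more attachment plus route-table entries, rather than a new peering connection to every existing VPC.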

7. How do you handle secrets management in a cloud environment?

Secrets handling is a common interview trap because it's easy to talk about and hard to do well. Watch for candidates who mention rotation, audit logging, and avoiding secrets in environment variables or CI logs.

I use AWS Secrets Manager or HashiCorp Vault depending on the stack, with automatic rotation enabled for database credentials. Applications fetch secrets at startup via IAM-authenticated SDK calls — never baked into container images or CI variables. For CI itself, I use OIDC federation so GitHub Actions assumes an AWS role without storing static keys, and I audit Secrets Manager access logs to spot unusual patterns.
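The startup-time fetch can be as simple as the following; the secret name is hypothetical, and the call authenticates via the task or pod IAM role rather than stored keys:

```shell
# Fetch a DB credential at container startup via the workload's IAM role.
# Nothing is baked into the image or exposed in CI logs.
DB_PASSWORD="$(aws secretsmanager get-secret-value \
  --secret-id prod/orders/db-password \
  --query SecretString --output text)"
export DB_PASSWORD
```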

8. Walk me through troubleshooting a Kubernetes pod that is stuck in CrashLoopBackOff.

Hands-on debugging questions are where candidates either demonstrate real experience or expose the gap. Strong answers walk through a systematic sequence rather than listing random kubectl commands.

I start with kubectl describe pod to see the events and the exit code from the previous container instance. Then kubectl logs --previous to get the logs from the crashed container — that usually reveals the cause. If it's an image-pull issue it's usually permissions or a typo in the image tag; if it's a runtime crash I'll shell into a debug container with kubectl debug or run the image locally. Liveness probe misconfiguration is the sneakiest cause — too aggressive and the pod gets killed before it's ready.
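The sequence above, in order; pod, namespace, and container names are placeholders:

```shell
# 1. Events and the previous container's exit code
kubectl describe pod orders-7d9f -n prod
kubectl get pod orders-7d9f -n prod \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'

# 2. Logs from the crashed instance, not the current restart
kubectl logs orders-7d9f -n prod --previous

# 3. If the image has no shell, attach an ephemeral debug container
kubectl debug orders-7d9f -n prod -it --image=busybox --target=app
```

Exit code 137 points at OOM kills or probe-driven SIGKILLs; 1 usually means an application error visible in the previous logs.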

9. How would you design a disaster recovery strategy for a business-critical application?

DR questions test whether you think in terms of RTO and RPO rather than just "we have backups." Interviewers want to hear about the cost-complexity tradeoff between DR postures.

I align the strategy to business-defined RTO and RPO targets. For a tier-one application needing sub-hour recovery, I'd run active-active across two regions with global load balancing and cross-region replication on the data layer. For less critical services, pilot-light or warm-standby is more cost-effective. Whatever the posture, I test failover at least quarterly — unreliable DR is worse than no DR because it creates false confidence.
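At the DNS layer, regional failover can be sketched like this; the zone resource, hostnames, and health-check path are assumptions:

```hcl
# Illustrative Route 53 failover: traffic shifts to the standby region
# when the primary's health check fails. Assumes an aws_route53_zone.main.
resource "aws_route53_health_check" "primary" {
  fqdn              = "app.eu-west-1.example.com"
  type              = "HTTPS"
  resource_path     = "/healthz"
  failure_threshold = 3
  request_interval  = 30
}

resource "aws_route53_record" "app_primary" {
  zone_id         = aws_route53_zone.main.zone_id
  name            = "app.example.com"
  type            = "CNAME"
  ttl             = 60
  records         = ["app.eu-west-1.example.com"]
  set_identifier  = "primary"
  health_check_id = aws_route53_health_check.primary.id

  failover_routing_policy {
    type = "PRIMARY"
  }
}
# A matching SECONDARY record points at the standby region's endpoint.
```

DNS failover only covers routing; the quarterly test is what proves the data layer actually comes up in the second region.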

10. What is your approach to immutable infrastructure?

Immutability is a cultural shift as much as a technical one. Interviewers want candidates who've internalised why we replace rather than modify, and the operational hygiene that implies.

Every change goes through our Terraform and container image pipeline — no SSH-ing into servers to patch something live. If a bug exists in prod, the fix lands in Git, builds a new image, and rolls out through the deployment pipeline. It's slower in the moment but eliminates configuration drift and makes every environment reproducible. The few times I've broken the rule and hotfixed a box, it bit me within a month.

11. How do you manage multi-account AWS organisations?


Large AWS estates are messy without good account structure. Interviewers are looking for Control Tower or Organizations knowledge and an understanding of why we isolate workloads.

I use AWS Organizations with Control Tower for baseline guardrails, structured into OUs by environment and workload sensitivity. Each team gets separate dev, staging, and prod accounts so blast radius is bounded. SCPs enforce non-negotiables like "no public S3 buckets" and "only approved regions." Cross-account access goes through IAM Identity Center with time-bound sessions rather than long-lived access keys.
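A region-restriction SCP in that style follows the pattern below; the approved regions are illustrative, and the `NotAction` list of global-service exemptions is abbreviated:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyOutsideApprovedRegions",
      "Effect": "Deny",
      "NotAction": ["iam:*", "organizations:*", "route53:*", "support:*"],
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:RequestedRegion": ["eu-west-1", "eu-west-2"]
        }
      }
    }
  ]
}
```

Because SCPs apply at the OU level, every account under that OU inherits the guardrail regardless of what its local IAM policies allow.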

12. Tell me about an incident you led or participated in.

Incident-response stories reveal calm under pressure, structured thinking, and a blameless-postmortem mindset. Interviewers are watching for self-reflection and concrete remediation.

We had a two-hour outage caused by a Terraform apply that unintentionally detached an EBS volume from our primary database. I was on-call and led the response — declared the incident, rolled back the change, and restored from snapshot. The root-cause post-mortem identified three contributing factors: missing prevent_destroy, no required PR approvals for prod changes, and insufficient alerting on volume-attach state. We shipped all three fixes within a fortnight and shared the write-up publicly inside the company.
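The first of those fixes is a one-line Terraform guardrail; the resource name and size here are illustrative:

```hcl
# Guardrail added after the incident: Terraform refuses to plan a
# destroy (or a replace) of this volume until the block is removed.
resource "aws_ebs_volume" "primary_db" {
  availability_zone = "eu-west-1a"
  size              = 500

  lifecycle {
    prevent_destroy = true
  }
}
```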

13. How do you decide between managed services and self-hosted alternatives?

The "build vs buy" question tests pragmatism. Interviewers don't want ideologues either way — they want someone who can articulate the tradeoffs in context.

I start with managed services by default because operational burden compounds over time. I'd self-host only when the managed option has a genuine dealbreaker — unacceptable cost at scale, a feature gap, or compliance that rules out the managed tier. I've self-hosted PostgreSQL for cost reasons at one role and regretted it by year two when the maintenance load caught up. Managed services aren't cheaper per instance; they're cheaper per engineer-hour.

14. What does a good on-call rotation look like to you?

Sustainable on-call is a quality-of-life issue and often a leading indicator of team health. Interviewers want to hear that you care about it structurally, not just personally.

A good rotation has enough engineers that you're on-call no more than one week in six, follow-the-sun coverage where possible so no one regularly carries the pager through the night, and clear escalation paths. Crucially, every page needs to be reviewed — noisy alerts get tuned or deleted rather than normalised. If we're getting paged for the same reason twice, that's a backlog item, not a burden on the next rotation.

15. Why a four-day week for this role?

Companies offering reduced schedules want engineers who protect deep work and avoid the always-on culture that cloud-ops roles can drift into. Interviewers are probing how you'd actually make 32 hours work.

Cloud engineering rewards focused blocks of deep work — Terraform refactors, debugging tricky networking issues, post-mortem write-ups — and fragmented calendars are the enemy of all of that. A four-day week forces better runbook hygiene, documentation, and automation so the team isn't dependent on one person being online. I think it pushes a team toward genuinely resilient operations rather than people-as-fallback.
