1. Walk me through how you would design an ETL pipeline for a new data source.
Pipeline design is the bread and butter of the role, and the answer reveals how a candidate thinks about reliability, cost, and downstream consumers. Hiring managers listen for structured reasoning — source profiling, extraction cadence, transformation logic, loading strategy — rather than a rote recital of tools. Strong candidates ask clarifying questions about volume, latency, and SLAs before jumping to architecture.
I'd start by profiling the source — schema stability, volume, update patterns, and whether it supports CDC or just full snapshots. For a typical batch source I'd land raw data in S3 or GCS as Parquet, then use dbt on Snowflake for transformations into staging, intermediate, and mart layers. Airflow or Dagster would orchestrate, with idempotent tasks, retries, and alerting via PagerDuty. I'd also add Great Expectations tests on staging tables and monitor row counts and freshness in Monte Carlo or a homegrown dashboard.
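The idempotency point above is the load-bearing one: reruns must produce the same output, not duplicates. A minimal sketch of what that looks like for the raw landing step, using hypothetical names and local files standing in for S3/GCS:

```python
import json
from datetime import date
from pathlib import Path


def land_raw_partition(records: list[dict], base_dir: str,
                       source: str, run_date: date) -> Path:
    """Write one day's extract to a deterministic partition path.

    Overwriting the same path on a rerun (rather than appending) is what
    makes the task idempotent: a retry after a failure produces exactly
    the same output as a clean first run.
    """
    part = Path(base_dir) / source / f"dt={run_date.isoformat()}" / "data.json"
    part.parent.mkdir(parents=True, exist_ok=True)
    part.write_text("\n".join(json.dumps(r) for r in records))
    return part
```

The same property is what makes orchestrator-level retries safe to enable without fear of double-counting downstream.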
2. How do you decide between batch and streaming for a given pipeline?
The trade-off between batch and streaming is a classic architecture decision, and the answer shows whether someone chases novelty or solves the actual problem. Streaming adds operational complexity, infrastructure cost, and debugging difficulty — it should be justified, not assumed. Look for candidates who anchor the decision in business requirements, not Kafka enthusiasm.
My default is batch unless the use case genuinely needs sub-minute latency — fraud detection, real-time personalisation, operational dashboards feeding live decisions. For most analytical reporting, hourly or even daily batch with dbt and Airflow is simpler, cheaper, and easier to debug. When streaming is justified, I've used Kafka with Flink or Spark Structured Streaming, landing to an Iceberg or Delta Lake table that batch jobs can also consume. I try to avoid running two parallel code paths for the same logic.
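The decision rule described above can be written down as a toy triage function. This is purely illustrative (the thresholds and names are assumptions, not a standard), but it captures the idea that latency requirements, not tooling preference, drive the choice:

```python
def choose_processing_mode(max_staleness_seconds: int,
                           drives_live_decision: bool) -> str:
    """Rule of thumb: default to batch unless the consumer genuinely
    needs sub-minute freshness to act on. Thresholds are illustrative."""
    if drives_live_decision and max_staleness_seconds < 60:
        return "streaming"
    if max_staleness_seconds < 3600:
        return "micro-batch"  # hourly or finer batch runs
    return "batch"
```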
3. Explain slowly changing dimensions and when you would use each type.
Dimensional modelling still underpins most analytical warehouses, and SCDs are where candidates reveal whether they've actually shipped a warehouse or just read about one. The interviewer is listening for a clear distinction between the types and a realistic sense of when Type 2 history is worth the complexity. Bonus points for mentioning surrogate keys and effective dating.
Type 1 overwrites the old value — fine for correcting typos or when history does not matter. Type 2 preserves history by inserting a new row with effective_from and effective_to timestamps and a current_flag — essential for things like sales territory changes where you need point-in-time reporting. Type 3 adds a previous_value column, which I rarely use because it only captures one level of history. In practice I default to Type 2 for anything business-critical and Type 1 for lookup attributes, using hashed surrogate keys to make joins stable.
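A sketch of the Type 2 mechanics described above, in plain Python with an assumed row shape (key, effective_from, effective_to, current_flag) rather than any particular warehouse's MERGE syntax:

```python
from datetime import date


def scd2_upsert(rows: list[dict], business_key: str,
                new_attrs: dict, as_of: date) -> list[dict]:
    """Close the current row for a business key and append a new version.

    This is the Type 2 pattern: history is preserved as extra rows with
    effective dating, so point-in-time joins remain possible.
    """
    out = []
    for row in rows:
        if row["key"] == business_key and row["current_flag"]:
            # Close out the previously-current version.
            out.append(dict(row, effective_to=as_of, current_flag=False))
        else:
            out.append(row)
    out.append({"key": business_key, **new_attrs,
                "effective_from": as_of, "effective_to": None,
                "current_flag": True})
    return out
```

In a warehouse this would be a single MERGE statement; the row-by-row version just makes the effective-dating logic explicit.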
4. How would you handle a late-arriving fact in a data warehouse?
Late-arriving data is a realistic production headache, and this question separates people who have operated warehouses from those who have only built greenfield ones. The interviewer wants to hear about dimension lookups, reprocessing windows, and how the candidate keeps metrics stable when data drifts in days later. Answers that ignore idempotency are a red flag.
If the fact arrives after its dimension rows exist, I just load it normally with the correct surrogate key lookup based on the event timestamp. If the dimension is not there yet, I will insert a placeholder row with an inferred_flag so the fact still loads, then update it when the real dimension arrives. For reprocessing, I design pipelines to be idempotent over a rolling window — typically 7 to 14 days — using merge operations rather than append-only loads so reruns do not duplicate.
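The placeholder-dimension trick above is easy to show in miniature. Assuming a dict keyed by business key stands in for the dimension table, the fact load never has to fail on a missing key:

```python
def lookup_or_infer(dim: dict[str, dict], business_key: str) -> dict:
    """Return the dimension row for a fact's key, inserting a placeholder
    flagged inferred=True when the real dimension has not arrived yet.
    The placeholder is overwritten once the genuine row shows up."""
    if business_key not in dim:
        dim[business_key] = {"key": business_key, "inferred": True}
    return dim[business_key]
```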
5. What is your approach to data quality testing?
Data quality failures silently break decisions, so engineers who take testing seriously are rare and valuable. The answer reveals whether the candidate treats data like code — with tests, CI, and ownership — or whether quality is someone else's problem. Listen for layered testing, not just one magic tool.
I layer it. Source-level checks on row counts and schema drift, staging-level tests using dbt tests or Great Expectations for nulls, uniqueness, referential integrity, and accepted ranges. At the mart layer I add business logic tests — revenue never negative, active_users >= paying_users, and so on. All tests run in CI on pull requests against a sample, then again post-load in production with alerting to a dedicated Slack channel. Critical tables also get freshness SLAs monitored independently.
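In production these checks live in dbt tests or Great Expectations suites, but the shape of a layered check set is simple enough to sketch directly. The column names and rules here are examples, not a fixed schema:

```python
def run_checks(rows: list[dict]) -> list[str]:
    """Toy versions of the layered checks described above: uniqueness,
    nulls, and a business-logic rule. Returns failure messages."""
    failures = []
    ids = [r["id"] for r in rows]
    if len(ids) != len(set(ids)):
        failures.append("id not unique")
    if any(r["revenue"] is None for r in rows):
        failures.append("revenue contains nulls")
    if any(r["revenue"] is not None and r["revenue"] < 0 for r in rows):
        failures.append("revenue negative")
    return failures
```

A non-empty return would fail the CI run or fire the post-load alert, depending on which layer it ran in.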
6. How do you control costs on a cloud data warehouse like Snowflake or BigQuery?
Warehouse bills spiral quickly, and a data engineer who cannot reason about cost will eventually become a problem. Interviewers look for practical levers — clustering, partitioning, warehouse sizing, query optimisation — rather than vague gestures at optimisation. This is especially important for senior candidates.
A few levers I reach for regularly. On Snowflake I right-size warehouses per workload, use auto-suspend aggressively, and separate transformation warehouses from BI ones so heavy jobs do not block dashboards. I partition and cluster large tables on high-cardinality filter columns, rewrite queries that scan whole tables, and use materialized views or incremental dbt models for anything run repeatedly. I also set resource monitors with hard caps and review the top 20 most expensive queries weekly with the analytics team.
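The weekly top-20 review mentioned above is just an aggregation over query history. A minimal sketch, noting that the field names are illustrative and not Snowflake's actual `QUERY_HISTORY` schema:

```python
def top_expensive_queries(history: list[dict], n: int = 20) -> list[tuple[str, float]]:
    """Sum credits per query fingerprint and return the n most expensive,
    mirroring the weekly cost review described above."""
    totals: dict[str, float] = {}
    for q in history:
        totals[q["fingerprint"]] = totals.get(q["fingerprint"], 0.0) + q["credits"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]
```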
7. Describe a time you had to debug a broken pipeline under time pressure.
Behavioural questions on incidents reveal composure, diagnostic method, and communication — all critical when the CEO's board deck is blank at 8am. Strong answers follow a clear narrative: detection, triage, root cause, fix, follow-up. Weak answers either blame others or skip the post-mortem.
Our nightly revenue pipeline failed silently one Monday because a source system started sending timestamps in a different timezone. Dashboards showed a 40% drop in Sunday sales. I caught it in the morning Slack alerts, rolled the mart tables back to Friday's snapshot within 30 minutes so the exec team had working numbers, then traced the issue to a schema contract we had not enforced. I added a test for timezone format on ingest and wrote a short post-mortem. Nothing fancy — just fast triage, clear comms, and a durable fix.
8. How do you model a data vault versus a dimensional warehouse?
Data vault comes up with enterprise-scale candidates, and the answer reveals depth of modelling knowledge. Interviewers want to hear when each approach fits — vault for auditability and source-agnostic loading, dimensional for consumption. Candidates who dogmatically prefer one are usually less experienced than they sound.
Data vault splits entities into hubs (business keys), links (relationships), and satellites (descriptive attributes with full history). It is append-only, highly auditable, and great when you are integrating many source systems with changing schemas. Dimensional modelling — facts and conformed dimensions — is better for consumption because it is intuitive for analysts. In practice I often use vault as the raw integration layer and build dimensional marts on top, though for smaller orgs I skip vault entirely and go straight to Kimball-style dimensional.
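The hub/satellite split described above is mechanical enough to sketch. Field names are hypothetical; the hash key is a join key for stability across sources, not a security measure (vault implementations conventionally use MD5 for this):

```python
import hashlib


def to_hub_and_satellite(record: dict, business_key_field: str, load_ts: str):
    """Split a source record into a hub row (business key only) and a
    satellite row (descriptive attributes plus load metadata)."""
    bk = record[business_key_field]
    hk = hashlib.md5(bk.encode()).hexdigest()  # stable surrogate join key
    hub = {"hub_key": hk, "business_key": bk}
    sat = {"hub_key": hk, "load_ts": load_ts,
           **{k: v for k, v in record.items() if k != business_key_field}}
    return hub, sat
```

Because satellites are append-only, a changed attribute just becomes a new satellite row with a later `load_ts`, which is where the full audit history comes from.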
9. What is your experience with dbt and how do you structure projects?
dbt has become the default transformation layer, and project structure reveals how a candidate thinks about maintainability. Interviewers listen for layered models, clear naming, and thoughtful use of macros and tests. Monolithic projects with 2000-line models signal trouble.
I use the staging, intermediate, marts convention. Staging models are thin — one per source table, renaming and casting only. Intermediate models handle joins and business logic in reusable chunks. Marts are the consumable layer, materialized as tables with clustering where volume demands it. I keep macros for genuinely repeated logic, use sources with freshness checks, and run dbt build in CI with a slim selector so PRs only test what changed. Exposures document downstream dashboards so we know what breaks if a mart changes.
10. How do you approach schema evolution in a data lake?
Schema drift is an unavoidable operational reality, and the answer shows whether a candidate has managed data at scale. Interviewers want concrete mechanisms — schema registries, table formats, contracts — rather than hand-waving about flexibility. This is increasingly central with lakehouse architectures.
On a lakehouse with Iceberg or Delta, schema evolution is much saner than with raw Parquet — you get additive column changes and type widening without rewriting files. I pair that with a schema registry (Confluent or a homegrown one in Git) and CI checks that fail PRs introducing breaking changes. For producers, I push schema contracts with explicit versioning; consumers read through views that insulate them from raw table changes. Breaking changes require a coordinated migration window, not a silent redeploy.
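The CI gate mentioned above reduces to comparing an old and new schema and allowing only additive changes. A minimal sketch with schemas as column-to-type dicts; in practice type widening (say int to bigint) would be whitelisted rather than flagged:

```python
def breaking_changes(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Flag removed columns and type changes; added columns are fine,
    matching the additive-evolution policy described above."""
    problems = []
    for col, typ in old.items():
        if col not in new:
            problems.append(f"column removed: {col}")
        elif new[col] != typ:
            problems.append(f"type changed: {col} {typ} -> {new[col]}")
    return problems
```

A non-empty result fails the PR, forcing the breaking change through the coordinated migration path instead.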
11. Walk me through how you would optimise a slow SQL query on a billion-row table.
SQL fluency is non-negotiable for data engineers, and this question exposes whether the candidate thinks in execution plans or just in syntax. Good answers reference partitioning, join strategies, predicate pushdown, and actual profiling — not just adding an index. Indexes rarely apply to columnar warehouses, which is itself a signal.
I start with the query plan — on Snowflake that is the profile view, on BigQuery the execution graph. I am looking for full table scans, huge intermediate result sets, or skewed joins. Common wins are adding partition and cluster keys aligned with the filter and join columns, rewriting subqueries as CTEs or vice versa depending on the engine, replacing correlated subqueries with window functions, and pre-aggregating large fact tables into incremental models. I also check warehouse sizing — sometimes the query is fine and the compute is just undersized.
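The correlated-subquery point above has a direct procedural analogue: instead of re-scanning the table once per customer, make one pass and keep the running winner, which is what a window function lets the engine do. Illustrative row shapes:

```python
def latest_order_per_customer(orders: list[dict]) -> dict[str, dict]:
    """One pass over the fact rows instead of a per-customer correlated
    lookup: the same idea as replacing a correlated subquery with
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY ts DESC)."""
    latest: dict[str, dict] = {}
    for o in orders:
        cur = latest.get(o["customer_id"])
        if cur is None or o["ts"] > cur["ts"]:
            latest[o["customer_id"]] = o
    return latest
```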
12. How do you handle PII and sensitive data in your pipelines?
Privacy is both a legal and ethical obligation, and casual handling of PII is a serious red flag. Interviewers want clear patterns — column-level encryption, masking, access controls, retention — plus awareness of GDPR and similar regulations. Senior candidates should talk about data contracts with source teams.
PII gets tagged at ingest using column-level metadata, and access is controlled through role-based masking policies — analysts see hashed values, a small authorised group sees cleartext. For right-to-be-forgotten requests we keep a deletion queue and run a weekly job that propagates deletes through warehouse and downstream marts. Retention is enforced with automatic expiration on raw tables. I also work with the security team to review new data sources for sensitive fields before they land rather than discovering them in a mart later.
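The role-based masking described above can be sketched in a few lines. Role names are hypothetical; a keyed HMAC (rather than a bare hash) means masked values can still be joined on deterministically but cannot be reversed or brute-forced without the key:

```python
import hashlib
import hmac


def mask_pii(value: str, secret: bytes, role: str) -> str:
    """Return cleartext for the authorised role, a keyed hash for
    everyone else. Illustrative role name, not a real policy engine."""
    if role == "pii_reader":
        return value
    return hmac.new(secret, value.encode(), hashlib.sha256).hexdigest()
```

In a warehouse this logic lives in a masking policy attached to the column, so it applies regardless of which query touches the data.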
13. Tell me about a time you disagreed with an analyst or stakeholder on a data decision.
This probes collaboration, influence, and ego. Data engineers sit between producers and consumers, and the ability to push back constructively matters. Interviewers want someone who advocates for correctness without being obstructive — and who ultimately serves the business, not their own architectural preferences.
An analyst wanted real-time streaming for a marketing dashboard that was only reviewed weekly. The cost would have been roughly six times our batch setup. I asked to sit with them for an hour and watch how they actually used the dashboard, then proposed hourly refresh with a clearly labelled "last updated" timestamp. That solved their actual concern — staleness during campaign launches — at a fraction of the cost. I learned to ask what problem they are solving, not what solution they want.
14. How do you keep up with the modern data stack?
The data tooling landscape shifts fast, and stale knowledge becomes a liability quickly. Interviewers are not looking for buzzword bingo — they want genuine curiosity and a filter for what is hype versus what solves problems. A candidate who namechecks ten tools they have never used is worse than one who has gone deep on three.
I follow a few newsletters — Benn Stancil, Seattle Data Guy, the dbt blog — and read post-mortems from engineering orgs I respect. I set aside Friday afternoons for small experiments, usually running a toy pipeline with a tool I am curious about. Conferences like Coalesce or Data Council are worth it every couple of years. Mostly I try to stay sceptical; I adopt tools when they solve a concrete pain in what we are running, not because they trended on LinkedIn.
15. Where do you want to be in three years, and what are you looking for in a role?
Motivation and fit questions often feel throwaway but carry weight — they reveal whether the candidate has a direction and whether this team can serve it. Vague answers about growth or impact are less useful than concrete interests. A grounded answer also signals someone who will stick around.
I want to be a staff-level data engineer owning a meaningful platform area — probably around streaming or data quality, both of which I have been drawn to. Short term I am looking for a team that takes data seriously as a product, with analysts and engineers working closely rather than lobbing tickets over a wall. Work-life balance matters to me — I do my best work when I have got space to think, which is partly why a reduced-hours setup appeals.