ETL/ELT Patterns Interview Q&A

25 interview questions on ETL/ELT patterns and data transformation on Azure

Question 1: What is the difference between ETL and ELT?

Answer: ETL transforms before loading (staging area). ELT loads raw data first, transforms using target system capabilities. ELT leverages cloud data warehouse compute.

Question 2: When would you use ADF Copy Activity vs Data Flow?

Answer: Copy Activity: Simple data movement, format conversion, faster execution. Data Flow: Complex transformations (joins, aggregations), visual debugging, Spark-based.

Question 3: How do you implement idempotent data loading?

Answer: Use Delta Lake merge (upsert), watermarks with unique keys, checkpointing, and deduplication logic. Ensure same data can be reprocessed safely.

Question 4: What is the benefit of using Delta Lake for ETL?

Answer: ACID transactions, schema evolution, time travel for reprocessing, data skipping, and streaming support. Enables reliable ETL operations.

Question 5: How do you handle slowly changing dimensions?

Answer: SCD Type 1: Overwrite (MERGE UPDATE). SCD Type 2: Add new row with effective dates (MERGE INSERT). SCD Type 3: Add column for previous value.

Question 6: What is the benefit of incremental loading?

Answer: Processes only changed data, reducing processing time and costs. Use watermarks, CDC, or timestamps for change detection.

Question 7: How do you implement data quality checks in ETL?

Answer: Great Expectations for Python validation, ADF Data Flows for visual rules, Purview for classification, and SQL checks in Synapse.

Question 8: What is the difference between batch and micro-batch?

Answer: Batch: Process on schedule (hourly/daily). Micro-batch: Process small batches continuously (seconds). Use batch for large datasets; micro-batch for near-real-time.

Question 9: How do you handle schema changes in ETL?

Answer: Schema drift in ADF, Delta Lake mergeSchema, Purview schema discovery, and schema registry for versioned schemas.

Question 10: What is the benefit of using Synapse Serverless for ELT?

Answer: Query data in-place from ADLS, pay-per-TB-scanned, no infrastructure management, and external tables for schema-on-read.

Question 11: How do you implement fan-out/fan-in pattern?

Answer: Databricks workflows with task dependencies, ADF foreach activity, or parallel processing with merge at the end.

Question 12: What is the benefit of using stored procedures in Synapse?

Answer: Encapsulate business logic, parameterized execution, version control, and performance optimization with precompiled plans.

Question 13: How do you handle late-arriving data?

Answer: Stream Analytics late arrival policy, ADF watermarks with flexible windows, Delta Lake time travel for reprocessing.

Question 14: What is the difference between full and incremental load?

Answer: Full: Replace entire datasets (simple, expensive). Incremental: Process only changes (complex, efficient). Use watermarks for incremental.

Question 15: How do you implement CDC in Azure?

Answer: ADF CDC for SQL Server, Debezium for open-source databases, Cosmos DB Change Feed for NoSQL, and Event Hubs for streaming.

Question 16: What is the benefit of using CTAS in Synapse?

Answer: Creates new table with optimal distribution and indexing. Faster than INSERT INTO for large data loads. Avoids fragmentation.

Question 17: How do you handle data lineage in ETL?

Answer: Purview automatic lineage from ADF/Databricks/Synapse, custom lineage via SDK, and documentation in business glossary.

Question 18: What is the benefit of using parameters in ADF?

Answer: Reusability, environment promotion, dynamic content expressions, and separation of configuration from logic.

Question 19: How do you implement error handling in ETL?

Answer: Try-catch patterns, retry policies, dead-letter queues, logging, alerting, and quarantine for failed records.

Question 20: What is the difference between watermarks and CDC?

Answer: Watermarks: Track last processed timestamp/ID (requires timestamp column). CDC: Capture database changes from transaction log (more accurate, requires CDC enabled).

Question 21: How do you optimize ETL performance?

Answer: Parallel processing, appropriate file formats (Parquet), partitioning, CTAS for Synapse, and auto-scaling clusters.

Question 22: What is the benefit of using Databricks for ETL?

Answer: Spark-based processing, Delta Lake integration, notebook development, job orchestration, and Unity Catalog governance.

Question 23: How do you implement data masking in ETL?

Answer: Dynamic data masking in Synapse, custom masking logic in transformations, and Purview sensitivity labels.

Question 24: What is the benefit of using ADF Mapping Data Flows?

Answer: Visual ETL development, Spark-based execution, debugging with data preview, and integration with ADF orchestration.

Question 25: How do you test ETL pipelines?

Answer: Unit tests for transformation logic, integration tests with sample data, data quality validation, performance testing, and end-to-end testing.