ETL/ELT Patterns Interview Q&A
25 interview questions on ETL/ELT patterns and data transformation on Azure
Question 1: What is the difference between ETL and ELT?
Answer: ETL transforms before loading (staging area). ELT loads raw data first, transforms using target system capabilities. ELT leverages cloud data warehouse compute.
Question 2: When would you use ADF Copy Activity vs Data Flow?
Answer: Copy Activity: Simple data movement, format conversion, faster execution. Data Flow: Complex transformations (joins, aggregations), visual debugging, Spark-based.
Question 3: How do you implement idempotent data loading?
Answer: Use Delta Lake merge (upsert), watermarks with unique keys, checkpointing, and deduplication logic. Ensure same data can be reprocessed safely.
Question 4: What is the benefit of using Delta Lake for ETL?
Answer: ACID transactions, schema evolution, time travel for reprocessing, data skipping, and streaming support. Enables reliable ETL operations.
Question 5: How do you handle slowly changing dimensions?
Answer: SCD Type 1: Overwrite (MERGE UPDATE). SCD Type 2: Add new row with effective dates (MERGE INSERT). SCD Type 3: Add column for previous value.
Question 6: What is the benefit of incremental loading?
Answer: Processes only changed data, reducing processing time and costs. Use watermarks, CDC, or timestamps for change detection.
Question 7: How do you implement data quality checks in ETL?
Answer: Great Expectations for Python validation, ADF Data Flows for visual rules, Purview for classification, and SQL checks in Synapse.
Question 8: What is the difference between batch and micro-batch?
Answer: Batch: Process on schedule (hourly/daily). Micro-batch: Process small batches continuously (seconds). Use batch for large datasets; micro-batch for near-real-time.
Question 9: How do you handle schema changes in ETL?
Answer: Schema drift in ADF, Delta Lake mergeSchema, Purview schema discovery, and schema registry for versioned schemas.
Question 10: What is the benefit of using Synapse Serverless for ELT?
Answer: Query data in-place from ADLS, pay-per-TB-scanned, no infrastructure management, and external tables for schema-on-read.
Question 11: How do you implement fan-out/fan-in pattern?
Answer: Databricks workflows with task dependencies, ADF foreach activity, or parallel processing with merge at the end.
Question 12: What is the benefit of using stored procedures in Synapse?
Answer: Encapsulate business logic, parameterized execution, version control, and performance optimization with precompiled plans.
Question 13: How do you handle late-arriving data?
Answer: Stream Analytics late arrival policy, ADF watermarks with flexible windows, Delta Lake time travel for reprocessing.
Question 14: What is the difference between full and incremental load?
Answer: Full: Replace entire datasets (simple, expensive). Incremental: Process only changes (complex, efficient). Use watermarks for incremental.
Question 15: How do you implement CDC in Azure?
Answer: ADF CDC for SQL Server, Debezium for open-source databases, Cosmos DB Change Feed for NoSQL, and Event Hubs for streaming.
Question 16: What is the benefit of using CTAS in Synapse?
Answer: Creates new table with optimal distribution and indexing. Faster than INSERT INTO for large data loads. Avoids fragmentation.
Question 17: How do you handle data lineage in ETL?
Answer: Purview automatic lineage from ADF/Databricks/Synapse, custom lineage via SDK, and documentation in business glossary.
Question 18: What is the benefit of using parameters in ADF?
Answer: Reusability, environment promotion, dynamic content expressions, and separation of configuration from logic.
Question 19: How do you implement error handling in ETL?
Answer: Try-catch patterns, retry policies, dead-letter queues, logging, alerting, and quarantine for failed records.
Question 20: What is the difference between watermarks and CDC?
Answer: Watermarks: Track last processed timestamp/ID (requires timestamp column). CDC: Capture database changes from transaction log (more accurate, requires CDC enabled).
Question 21: How do you optimize ETL performance?
Answer: Parallel processing, appropriate file formats (Parquet), partitioning, CTAS for Synapse, and auto-scaling clusters.
Question 22: What is the benefit of using Databricks for ETL?
Answer: Spark-based processing, Delta Lake integration, notebook development, job orchestration, and Unity Catalog governance.
Question 23: How do you implement data masking in ETL?
Answer: Dynamic data masking in Synapse, custom masking logic in transformations, and Purview sensitivity labels.
Question 24: What is the benefit of using ADF Mapping Data Flows?
Answer: Visual ETL development, Spark-based execution, debugging with data preview, and integration with ADF orchestration.
Question 25: How do you test ETL pipelines?
Answer: Unit tests for transformation logic, integration tests with sample data, data quality validation, performance testing, and end-to-end testing.