Synapse Analytics Interview Q&A
25 interview questions on Synapse Analytics pools, performance, and data warehousing
Question 1: What is the difference between Dedicated and Serverless SQL Pools?
Answer: Dedicated: Reserved compute, DWU-based, predictable performance. Serverless: Pay-per-TB-scanned, auto-scale, for ad-hoc exploration. Use dedicated for production; serverless for exploration.
Question 2: How do you choose a distribution strategy?
Answer: Hash: Large fact tables (join columns). Replicated: Small dimension tables (<2GB). Round Robin: Staging tables. Choose based on table size, join patterns, and query performance.
Question 3: What is the benefit of CTAS over INSERT INTO?
Answer: CTAS creates a new table with optimal distribution and indexing. INSERT INTO appends to existing tables. Use CTAS for initial loads; INSERT INTO for incremental updates.
Question 4: How do you optimize query performance?
Answer: 1) Correct distribution, 2) Clustered Columnstore Indexes, 3) Statistics, 4) Partitioning, 5) Result set caching, 6) Materialized views, 7) SET statements.
Question 5: What is the maximum DWU for Dedicated SQL Pool?
Answer: DW6000c (12 compute nodes, 12TB storage). Scale based on workload requirements. Use auto-pause for non-24/7 workloads.
Question 6: How do you handle data skew?
Answer: Check with DBCC PDW_SHOWSPACEUSED. Change distribution key. Use Round Robin for staging. Consider replicated tables for small dimensions.
Question 7: What is the benefit of Clustered Columnstore Indexes?
Answer: Optimal compression (10x), columnar storage, batch-mode execution. Default and recommended for fact tables.
Question 8: How do you implement SCD in Synapse?
Answer: SCD Type 1: MERGE UPDATE. SCD Type 2: MERGE INSERT with effective dates. Use Synapse Serverless for staging changes.
Question 9: What is the difference between external tables and views?
Answer: External tables: Point to files in ADLS with schema. Views: Virtual tables defined by SQL queries. Use external tables for raw data; views for curated data.
Question 10: How do you monitor Synapse performance?
Answer: sys.dm_pdw_exec_requests for query history, DBCC PDW_SHOWSPACEUSED for distribution, Azure Monitor for metrics, and custom KQL queries.
Question 11: What is the benefit of result set caching?
Answer: Stores query results in memory for 5 minutes. Subsequent identical queries return cached results instantly. Reduces compute usage.
Question 12: How do you handle schema evolution?
Answer: Use CTAS to recreate tables with new schema, ALTER TABLE for additive changes, or Synapse Serverless for schema-on-read exploration.
Question 13: What is the difference between CTAS and CREATE TABLE?
Answer: CTAS: Creates table from query results with distribution/indexing. CREATE TABLE: Creates empty table with schema. Use CTAS for data loading.
Question 14: How do you implement incremental loading?
Answer: Use watermarks, MERGE statement for upserts, partition switching for large deletes, and Synapse Pipelines for orchestration.
Question 15: What is the benefit of Synapse Link?
Answer: HTAP capabilities, real-time analytics on operational data (Cosmos DB), no ETL pipeline required, consistent view of transactional data.
Question 16: How do you optimize Synapse costs?
Answer: Auto-pause for non-24/7 workloads, reserved capacity for stable workloads, right-size DWU, and use serverless for ad-hoc queries.
Question 17: What is the difference between Synapse and Azure SQL DW?
Answer: Synapse: Unified analytics (SQL, Spark, Pipelines). Azure SQL DW: Dedicated SQL pool only. Synapse provides more capabilities and integration.
Question 18: How do you handle cross-database queries?
Answer: Use Synapse Serverless with external tables, or create views that reference external data. Cross-database queries are limited in Dedicated pools.
Question 19: What is the benefit of dedicated pools for BI?
Answer: Predictable performance, low latency for dashboards, integration with Power BI, and support for complex queries with large datasets.
Question 20: How do you implement data masking?
Answer: Use dynamic data masking in Synapse SQL, column-level security, and row-level security for sensitive data protection.
Question 21: What is the difference between Synapse and Databricks?
Answer: Synapse: SQL-focused data warehousing with Spark. Databricks: Spark-focused analytics with SQL. Use Synapse for SQL workloads; Databricks for Spark workloads.
Question 22: How do you handle large table loads?
Answer: Use CTAS for initial loads, MERGE for incremental updates, PolyBase for bulk loading, and partition switching for fast deletes.
Question 23: What is the benefit of Synapse Pipelines?
Answer: Built-in orchestration, integration with SQL/Spark pools, monitoring, and Git integration. Simplifies data pipeline management within Synapse.
Question 24: How do you implement data governance in Synapse?
Answer: Purview integration, row-level security, column-level security, dynamic data masking, and audit logging.
Question 25: What is the future of Synapse Analytics?
Answer: Deeper integration with Fabric, enhanced serverless capabilities, improved Spark performance, and expanded ecosystem integration.