Azure Synapse Analytics: Pools, Serverless & Architecture
Enterprise data warehousing with dedicated pools, serverless querying, and unified analytics
Synapse Workspace Architecture
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β SYNAPSE WORKSPACE ARCHITECTURE β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β SYNAPSE WORKSPACE β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β β
β β ββββββββββββββββββββ ββββββββββββββββββββ β β
β β β SQL Pool β β SQL Pool β β β
β β β (Dedicated) β β (Serverless) β β β
β β β β β β β β
β β β DW200c-DW6000c β β Pay per TB β β β
β β β Reserved compute β β scanned β β β
β β β β β β β β
β β β Distributions: β β External tables β β β
β β β Hash/Round Robin β β Views β β β
β β β Replicated β β Lake databases β β β
β β ββββββββββββββββββββ ββββββββββββββββββββ β β
β β β β
β β ββββββββββββββββββββ ββββββββββββββββββββ β β
β β β Spark Pool β β Pipelines β β β
β β β β β β β β
β β β Spark 3.3/3.4 β β ADF-powered β β β
β β β Auto-scale β β Orchestration β β β
β β β Delta Lake β β Monitoring β β β
β β β Notebooks β β Triggers β β β
β β ββββββββββββββββββββ ββββββββββββββββββββ β β
β β β β
β β ββββββββββββββββββββ ββββββββββββββββββββ β β
β β β Data Explorer β β Studio β β β
β β β Pool β β β β β
β β β β β SQL scripts β β β
β β β Kusto queries β β Notebooks β β β
β β β Log analytics β β Data flows β β β
β β ββββββββββββββββββββ β Power BI β β β
β β ββββββββββββββββββββ β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β INTEGRATED CONNECTIVITY: β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β’ Azure Active Directory Authentication β β
β β β’ Managed Virtual Network β β
β β β’ Managed Private Endpoints β β
β β β’ Git Integration (Azure DevOps / GitHub) β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Dedicated SQL Pool Distribution Strategies
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β DISTRIBUTION STRATEGIES β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β HASH DISTRIBUTION β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Distribution Key: CustomerID β β
β β β β
β β Compute Node 1 Compute Node 2 Compute Node 3 β β
β β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ β β
β β β Customer β β Customer β β Customer β β β
β β β ID: 1,4,7... β β ID: 2,5,8... β β ID: 3,6,9... β β β
β β β (Hash mod 3) β β (Hash mod 3) β β (Hash mod 3) β β β
β β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ β β
β β β β
β β Best for: Large fact tables, join columns β β
β β Avoid: Small dimension tables (causes skew) β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β ROUND ROBIN DISTRIBUTION β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Data distributed evenly across all distributions β β
β β β β
β β Compute Node 1 Compute Node 2 Compute Node 3 β β
β β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ β β
β β β Rows: 1,4,7 β β Rows: 2,5,8 β β Rows: 3,6,9 β β β
β β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ β β
β β β β
β β Best for: Loading raw data, staging tables β β
β β Avoid: Queries requiring joins (full shuffle needed) β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β REPLICATED DISTRIBUTION β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Full copy on each Compute Node β β
β β β β
β β Compute Node 1 Compute Node 2 Compute Node 3 β β
β β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ β β
β β β Full Table β β Full Table β β Full Table β β β
β β β Copy β β Copy β β Copy β β β
β β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ β β
β β β β
β β Best for: Small dimension tables (<2GB) β β
β β Avoid: Large tables (memory constraints) β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
CTAS (Create Table As Select) Pattern
-- Create a distributed table from staging
CREATE TABLE [dbo].[FactSales]
WITH
(
DISTRIBUTION = HASH(SaleDate),
CLUSTERED COLUMNSTORE INDEX,
PARTITION = (SaleDate RANGE RIGHT FOR VALUES
('2024-01-01', '2024-02-01', '2024-03-01',
'2024-04-01', '2024-05-01', '2024-06-01'))
)
AS
SELECT
s.SaleID,
s.CustomerKey,
p.ProductKey,
s.SaleDate,
s.Quantity,
s.UnitPrice,
s.Quantity * s.UnitPrice AS TotalAmount,
d.DateKey,
c.CustomerSegment
FROM [staging].[Sales] s
INNER JOIN [dim].[Customers] c ON s.CustomerID = c.CustomerID
INNER JOIN [dim].[Products] p ON s.ProductID = p.ProductID
INNER JOIN [dim].[Dates] d ON s.SaleDate = d.FullDate
WHERE s.SaleDate >= '2024-01-01';
-- Verify distribution
DBCC PDW_SHOWSPACEUSED('dbo.FactSales');
Statistics and Indexes
-- Create statistics for better query plans
CREATE STATISTICS STAT_FactSales_SaleDate
ON [dbo].[FactSales](SaleDate);
CREATE STATISTICS STAT_FactSales_CustomerKey
ON [dbo].[FactSales](CustomerKey);
-- Create indexed view for common aggregations
CREATE VIEW [dbo].[vw_DailySalesSummary]
WITH SCHEMABINDING
AS
SELECT
SaleDate,
COUNT_BIG(*) AS TotalTransactions,
SUM(TotalAmount) AS DailyRevenue
FROM [dbo].[FactSales]
GROUP BY SaleDate;
CREATE UNIQUE CLUSTERED INDEX IX_vw_DailySalesSummary
ON [dbo].[vw_DailySalesSummary](SaleDate);
βΉοΈ
Pro Tip: Use CTAS (Create Table As Select) for data loading instead of INSERT INTO. CTAS creates a new table with optimal distribution and indexing, avoiding fragmentation of existing tables.
Serverless SQL Pool - External Tables
-- Create external data source pointing to ADLS
CREATE EXTERNAL DATA SOURCE [AzureDataLake]
WITH (
LOCATION = 'https://stdatalake001.dfs.core.windows.net',
CREDENTIAL = [ManagedIdentityCredential]
);
-- Create external file format for Parquet
CREATE EXTERNAL FILE FORMAT [ParquetFormat]
WITH (
FORMAT_TYPE = PARQUET,
DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);
-- Create external table
CREATE EXTERNAL TABLE [dbo].[ExternalSales]
WITH (
LOCATION = 'curated/sales/',
DATA_SOURCE = [AzureDataLake],
FILE_FORMAT = [ParquetFormat]
)
AS
SELECT * FROM OPENROWSET(
BULK 'curated/sales/**/*.parquet',
FORMAT = 'PARQUET'
) WITH (
SaleID BIGINT,
CustomerKey INT,
ProductKey INT,
SaleDate DATE,
Quantity INT,
UnitPrice DECIMAL(18,2),
TotalAmount DECIMAL(18,2)
) AS [Sales];
-- Query with pushdown computation
SELECT
SaleDate,
SUM(TotalAmount) AS Revenue,
COUNT(*) AS Transactions
FROM [dbo].[ExternalSales]
WHERE SaleDate >= '2024-01-01'
GROUP BY SaleDate
ORDER BY SaleDate;
Synapse Pool Sizing Guide
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β POOL SIZING RECOMMENDATIONS β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β WORKLOAD DWU NODES STORAGE COST/MO β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β Development DW100c 1 250 GB $750 β
β Small Production DW500c 2 1 TB $3,750 β
β Medium Production DW1000c 2 2 TB $7,500 β
β Large Production DW3000c 6 6 TB $22,500 β
β Enterprise DW6000c 12 12 TB $45,000 β
β β
β AUTO-PAUSE CONFIGURATION: β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Inactivity timeout: 1 hour (default) β β
β β Resume time: 3-5 minutes β β
β β Cost savings: Up to 70% for non-24/7 workloads β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β RESERVATION DISCOUNTS: β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β 1-year commitment: 30% discount β β
β β 3-year commitment: 50% discount β β
β β DWU flexibility: Scale up/down within commitment β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Python SDK for Synapse
from azure.identity import DefaultAzureCredential
from azure.synapse.artifacts import ArtifactsClient
from azure.synapse.spark import SparkClient
import time
credential = DefaultAzureCredential()
# Artifacts Client
artifacts_client = ArtifactsClient(
credential=credential,
endpoint="https://syn-workspace.dev.azuresynapse.net"
)
# Run a SQL script
script_run = artifacts_client.sql_script.create_sql_script(
sql_script_name="daily_etl",
properties={
"content": {
"query": "EXEC sp_DailyETL @date = '2024-01-15'",
"currentConnection": {
"name": "Built-in"
}
}
}
)
# Submit Spark job
spark_client = SparkClient(
credential=credential,
endpoint="https://syn-workspace.dev.azuresynapse.net",
spark_pool_name="SparkPool01"
)
spark_client.spark_batch.create_spark_batch_job(
spark_batch_job={
"file": "abfss://notebooks@stdatalake001.dfs.core.windows.net/etl_job.py",
"configuration": {
"spark.dynamicAllocation.enabled": "true",
"spark.dynamicAllocation.minExecutors": "1",
"spark.dynamicAllocation.maxExecutors": "10"
}
}
)
Interview Questions
Q1: Explain the difference between CTAS and INSERT INTO in Synapse. A: CTAS creates a new table with optimal distribution and indexing based on the WITH clause. INSERT INTO appends to existing tables but doesn't change distribution. Use CTAS for initial loads and large transformations; INSERT INTO for incremental updates.
Q2: How do you optimize query performance in Synapse Dedicated SQL Pool? A: 1) Choose correct distribution (Hash for facts, Replicated for small dims), 2) Use Clustered Columnstore Indexes, 3) Update statistics regularly, 4) Use partitioning for large tables, 5) Implement result-set caching, 6) Use materialized views for common aggregations.
Q3: When would you use Serverless vs Dedicated SQL Pool? A: Serverless for ad-hoc exploration, data lake querying, and pay-per-use scenarios. Dedicated for production data warehousing with predictable performance, complex joins, and high-concurrency requirements.