Azure Synapse Analytics: Pools, Serverless & Architecture

Enterprise data warehousing with dedicated pools, serverless querying, and unified analytics

Synapse Workspace Architecture

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────┐
│                    SYNAPSE WORKSPACE ARCHITECTURE                    │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  SYNAPSE WORKSPACE                                                  │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │                                                               │   │
│  │  ┌──────────────────┐  ┌──────────────────┐                │   │
│  │  │ SQL Pool         │  │ SQL Pool         │                │   │
│  │  │ (Dedicated)      │  │ (Serverless)     │                │   │
│  │  │                  │  │                  │                │   │
│  │  │ DW100c-DW30000c  │  │ Pay per TB       │                │   │
│  │  │ Reserved compute │  │ scanned          │                │   │
│  │  │                  │  │                  │                │   │
│  │  │ Distributions:   │  │ External tables  │                │   │
│  │  │ Hash/Round Robin │  │ Views            │                │   │
│  │  │ Replicated       │  │ Lake databases   │                │   │
│  │  └──────────────────┘  └──────────────────┘                │   │
│  │                                                               │   │
│  │  ┌──────────────────┐  ┌──────────────────┐                │   │
│  │  │ Spark Pool       │  │ Pipelines        │                │   │
│  │  │                  │  │                  │                │   │
│  │  │ Spark 3.3/3.4    │  │ ADF-powered      │                │   │
│  │  │ Auto-scale       │  │ Orchestration    │                │   │
│  │  │ Delta Lake       │  │ Monitoring       │                │   │
│  │  │ Notebooks        │  │ Triggers         │                │   │
│  │  └──────────────────┘  └──────────────────┘                │   │
│  │                                                               │   │
│  │  ┌──────────────────┐  ┌──────────────────┐                │   │
│  │  │ Data Explorer    │  │ Studio           │                │   │
│  │  │ Pool             │  │                  │                │   │
│  │  │                  │  │ SQL scripts      │                │   │
│  │  │ Kusto queries    │  │ Notebooks        │                │   │
│  │  │ Log analytics    │  │ Data flows       │                │   │
│  │  └──────────────────┘  │ Power BI         │                │   │
│  │                        └──────────────────┘                │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  INTEGRATED CONNECTIVITY:                                           │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ • Azure Active Directory Authentication                      │   │
│  │ • Managed Virtual Network                                     │   │
│  │ • Managed Private Endpoints                                   │   │
│  │ • Git Integration (Azure DevOps / GitHub)                    │   │
│  └─────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────┘

Dedicated SQL Pool Distribution Strategies

┌─────────────────────────────────────────────────────────────────┐
│                    DISTRIBUTION STRATEGIES                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  HASH DISTRIBUTION                                              │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ Distribution Key: CustomerID                              │   │
│  │                                                           │   │
│  │ Compute Node 1    Compute Node 2    Compute Node 3       │   │
│  │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐     │   │
│  │ │ Customer     │ │ Customer     │ │ Customer     │     │   │
│  │ │ ID: 1,4,7... │ │ ID: 2,5,8... │ │ ID: 3,6,9... │     │   │
│  │ │ (Hash mod 3) │ │ (Hash mod 3) │ │ (Hash mod 3) │     │   │
│  │ └──────────────┘ └──────────────┘ └──────────────┘     │   │
│  │                                                           │   │
│  │ Best for: Large fact tables, join columns                │   │
│  │ Avoid: Low-cardinality keys (causes skew)                │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  ROUND ROBIN DISTRIBUTION                                       │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ Data distributed evenly across all distributions         │   │
│  │                                                           │   │
│  │ Compute Node 1    Compute Node 2    Compute Node 3       │   │
│  │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐     │   │
│  │ │ Rows: 1,4,7  │ │ Rows: 2,5,8  │ │ Rows: 3,6,9  │     │   │
│  │ └──────────────┘ └──────────────┘ └──────────────┘     │   │
│  │                                                           │   │
│  │ Best for: Loading raw data, staging tables               │   │
│  │ Avoid: Queries requiring joins (full shuffle needed)     │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  REPLICATED DISTRIBUTION                                        │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ Full copy on each Compute Node                           │   │
│  │                                                           │   │
│  │ Compute Node 1    Compute Node 2    Compute Node 3       │   │
│  │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐     │   │
│  │ │ Full Table   │ │ Full Table   │ │ Full Table   │     │   │
│  │ │ Copy         │ │ Copy         │ │ Copy         │     │   │
│  │ └──────────────┘ └──────────────┘ └──────────────┘     │   │
│  │                                                           │   │
│  │ Best for: Small dimension tables (<2GB)                  │   │
│  │ Avoid: Large tables (memory constraints)                 │   │
│  └─────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘
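The effect of the distribution key on skew can be sketched in a few lines of Python. This is a simulation only: MD5 stands in for the pool's internal hash function (which is not public), the data is hypothetical, and the helper names are made up for illustration. It relies on the fact that a dedicated SQL pool always spreads a table over 60 distributions, which are then mapped onto the compute nodes.

```python
import hashlib
from collections import Counter

NUM_DISTRIBUTIONS = 60  # a dedicated SQL pool always uses 60 distributions


def hash_distribution(key: str) -> int:
    """Map a key to a distribution. MD5 is a stand-in, not Synapse's real hash."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_DISTRIBUTIONS


def rows_per_distribution(keys, strategy: str) -> Counter:
    """Count how many rows land on each distribution for a given strategy."""
    counts = Counter({d: 0 for d in range(NUM_DISTRIBUTIONS)})
    for row_number, key in enumerate(keys):
        if strategy == "hash":
            counts[hash_distribution(key)] += 1
        elif strategy == "round_robin":
            counts[row_number % NUM_DISTRIBUTIONS] += 1
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return counts


def skew_ratio(counts: Counter) -> float:
    """Largest distribution divided by the mean; 1.0 means perfectly even."""
    mean = sum(counts.values()) / NUM_DISTRIBUTIONS
    return max(counts.values()) / mean


# Hypothetical data: 60,000 rows, 10,000 customers, only 6 distinct sale dates.
customer_keys = [f"customer-{i % 10_000}" for i in range(60_000)]
date_keys = [f"2024-0{1 + i % 6}-01" for i in range(60_000)]

even = skew_ratio(rows_per_distribution(customer_keys, "round_robin"))
by_customer = skew_ratio(rows_per_distribution(customer_keys, "hash"))
by_date = skew_ratio(rows_per_distribution(date_keys, "hash"))

print(f"round robin: {even:.2f}  hash(customer): {by_customer:.2f}  hash(date): {by_date:.2f}")
```

With six distinct dates, at most six of the 60 distributions receive any rows, so the date-keyed ratio is at least 10, while the customer-keyed ratio stays close to 1. Replicated tables are not modeled here because every node simply holds the whole table.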

CTAS (Create Table As Select) Pattern

-- Create a distributed table from staging
CREATE TABLE [dbo].[FactSales]
WITH
(
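    -- Hash on a high-cardinality join key; hashing on a date puts each day's rows in one distribution
    DISTRIBUTION = HASH(CustomerKey),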
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION = (SaleDate RANGE RIGHT FOR VALUES
        ('2024-01-01', '2024-02-01', '2024-03-01',
         '2024-04-01', '2024-05-01', '2024-06-01'))
)
AS
SELECT
    s.SaleID,
    c.CustomerKey,
    p.ProductKey,
    s.SaleDate,
    s.Quantity,
    s.UnitPrice,
    s.Quantity * s.UnitPrice AS TotalAmount,
    d.DateKey,
    c.CustomerSegment
FROM [staging].[Sales] s
INNER JOIN [dim].[Customers] c ON s.CustomerID = c.CustomerID
INNER JOIN [dim].[Products] p ON s.ProductID = p.ProductID
INNER JOIN [dim].[Dates] d ON s.SaleDate = d.FullDate
WHERE s.SaleDate >= '2024-01-01';

-- Verify distribution
DBCC PDW_SHOWSPACEUSED('dbo.FactSales');
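The boundary list in the PARTITION clause above is easy to get wrong by hand once it grows past a few months. A minimal sketch of generating it, assuming monthly partitions on the first day of each month; `monthly_boundaries` and `partition_clause` are hypothetical helper names, not part of any Synapse tooling:

```python
from datetime import date


def monthly_boundaries(start: date, months: int) -> list[str]:
    """Return `months` first-of-month boundary values, beginning with start's month."""
    values = []
    year, month = start.year, start.month
    for _ in range(months):
        values.append(date(year, month, 1).isoformat())
        month += 1
        if month > 12:  # roll over into the next year
            year, month = year + 1, 1
    return values


def partition_clause(column: str, boundaries: list[str]) -> str:
    """Format the boundaries as a RANGE RIGHT clause for a CTAS WITH block."""
    quoted = ", ".join(f"'{b}'" for b in boundaries)
    return f"PARTITION = ({column} RANGE RIGHT FOR VALUES ({quoted}))"


clause = partition_clause("SaleDate", monthly_boundaries(date(2024, 1, 1), 6))
print(clause)
```

With RANGE RIGHT, each boundary value belongs to the partition on its right, so six boundaries produce seven partitions: one for everything before 2024-01-01 and one per month after that.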

Statistics and Indexes

-- Create statistics for better query plans
CREATE STATISTICS STAT_FactSales_SaleDate
ON [dbo].[FactSales](SaleDate);

CREATE STATISTICS STAT_FactSales_CustomerKey
ON [dbo].[FactSales](CustomerKey);

-- Create a materialized view for common aggregations
-- (dedicated SQL pools do not support indexed views; use CREATE MATERIALIZED VIEW)
CREATE MATERIALIZED VIEW [dbo].[vw_DailySalesSummary]
WITH (DISTRIBUTION = HASH(SaleDate))
AS
SELECT
    SaleDate,
    COUNT_BIG(*) AS TotalTransactions,
    SUM(TotalAmount) AS DailyRevenue
FROM [dbo].[FactSales]
GROUP BY SaleDate;

Pro Tip: For large loads and transformations, prefer CTAS (Create Table As Select) over INSERT INTO. CTAS is a parallel, minimally logged operation that builds a new table with the distribution and index you specify in its WITH clause, so it avoids fragmenting an existing table.
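Because CTAS always creates a new table, it is commonly paired with a rename swap so that readers see the rebuilt table under the original name. The sketch below only builds the T-SQL strings; the `_new` and `_old` suffixes and the `ctas_swap_statements` helper are assumptions made for illustration, and the statements would still need to be run against the dedicated pool.

```python
def ctas_swap_statements(schema: str, table: str, with_clause: str, select_sql: str) -> list[str]:
    """Build the statements for a CTAS rebuild followed by a rename swap."""
    new_name, old_name = f"{table}_new", f"{table}_old"
    return [
        # 1. Build the replacement table in one parallel operation.
        f"CREATE TABLE [{schema}].[{new_name}] WITH ({with_clause}) AS {select_sql};",
        # 2. Move the current table out of the way (the new name takes no schema prefix).
        f"RENAME OBJECT [{schema}].[{table}] TO [{old_name}];",
        # 3. Promote the rebuilt table to the original name.
        f"RENAME OBJECT [{schema}].[{new_name}] TO [{table}];",
        # 4. Drop the previous copy once the swap has succeeded.
        f"DROP TABLE [{schema}].[{old_name}];",
    ]


statements = ctas_swap_statements(
    schema="dbo",
    table="FactSales",
    with_clause="DISTRIBUTION = HASH(CustomerKey), CLUSTERED COLUMNSTORE INDEX",
    select_sql="SELECT * FROM [staging].[Sales]",
)
for statement in statements:
    print(statement)
```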

Serverless SQL Pool - External Tables

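-- Credential used by the data source (assumes a database master key already exists)
CREATE DATABASE SCOPED CREDENTIAL [ManagedIdentityCredential]
WITH IDENTITY = 'Managed Identity';

-- Create external data source pointing to ADLS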
CREATE EXTERNAL DATA SOURCE [AzureDataLake]
WITH (
    LOCATION = 'https://stdatalake001.dfs.core.windows.net',
    CREDENTIAL = [ManagedIdentityCredential]
);

-- Create external file format for Parquet
CREATE EXTERNAL FILE FORMAT [ParquetFormat]
WITH (
    FORMAT_TYPE = PARQUET,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);

-- Create an external table over the existing Parquet files
-- (metadata only; adding AS SELECT would make this a CETAS that writes new files)
CREATE EXTERNAL TABLE [dbo].[ExternalSales]
(
    SaleID BIGINT,
    CustomerKey INT,
    ProductKey INT,
    SaleDate DATE,
    Quantity INT,
    UnitPrice DECIMAL(18,2),
    TotalAmount DECIMAL(18,2)
)
WITH (
    -- The trailing /** includes files in subfolders
    LOCATION = 'curated/sales/**',
    DATA_SOURCE = [AzureDataLake],
    FILE_FORMAT = [ParquetFormat]
);

-- Aggregate over the external table; with Parquet, only the referenced columns are read
SELECT
    SaleDate,
    SUM(TotalAmount) AS Revenue,
    COUNT(*) AS Transactions
FROM [dbo].[ExternalSales]
WHERE SaleDate >= '2024-01-01'
GROUP BY SaleDate
ORDER BY SaleDate;
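Since the serverless pool bills per TB of data processed, a rough cost estimate for a query like the one above is simple arithmetic. The sketch below assumes a price of 5 USD per TB, which is a placeholder to replace with the current rate for your region; the function name and the example data sizes are hypothetical.

```python
ASSUMED_PRICE_PER_TB_USD = 5.00  # assumption: check current pricing for your region
BYTES_PER_TB = 1024 ** 4


def estimate_query_cost(bytes_processed: int, price_per_tb: float = ASSUMED_PRICE_PER_TB_USD) -> float:
    """Estimate the charge for a serverless query from the bytes it processes."""
    if bytes_processed < 0:
        raise ValueError("bytes_processed must be non-negative")
    return bytes_processed / BYTES_PER_TB * price_per_tb


# Hypothetical comparison: scanning a whole 2 TB dataset versus a query that
# reads only 200 GB of it thanks to column selection and a date filter.
full_scan = estimate_query_cost(2 * BYTES_PER_TB)
pruned_scan = estimate_query_cost(200 * 1024 ** 3)

print(f"full scan: ${full_scan:.2f}, pruned scan: ${pruned_scan:.2f}")
```

The point of the comparison is that selecting fewer columns and filtering on partition folders reduces the bytes processed, which is the only lever on serverless cost.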

Synapse Pool Sizing Guide

┌─────────────────────────────────────────────────────────────────┐
│                    POOL SIZING RECOMMENDATIONS                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  WORKLOAD              DWU       NODES    STORAGE    COST/MO   │
│  ───────────────────────────────────────────────────────────── │
│  Development           DW100c    1        250 GB     $750      │
│  Small Production      DW500c    1        1 TB       $3,750    │
│  Medium Production     DW1000c   2        2 TB       $7,500    │
│  Large Production      DW3000c   6        6 TB       $22,500   │
│  Enterprise            DW6000c   12       12 TB      $45,000   │
│                                                                 │
│  PAUSE/RESUME CONFIGURATION:                                    │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ Auto-pause: not built in; schedule pause/resume jobs    │   │
│  │ Resume time: 3-5 minutes                                │   │
│  │ Cost savings: Up to 70% for non-24/7 workloads         │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  RESERVATION DISCOUNTS:                                         │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ 1-year commitment: 30% discount                         │   │
│  │ 3-year commitment: 50% discount                         │   │
│  │ DWU flexibility: Scale up/down within commitment        │   │
│  └─────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘
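The savings figure for non-24/7 workloads follows from how many hours per week the pool actually runs. A minimal sketch of that arithmetic, assuming compute is billed only while the pool is running and taking the always-on monthly figures from the table above as given; the function names and the 50-hour schedule are illustrative, and storage charges are ignored.

```python
HOURS_PER_WEEK = 168


def savings_fraction(active_hours_per_week: float) -> float:
    """Fraction of always-on compute cost saved by pausing outside active hours."""
    if not 0 <= active_hours_per_week <= HOURS_PER_WEEK:
        raise ValueError("active hours must be between 0 and 168")
    return 1 - active_hours_per_week / HOURS_PER_WEEK


def monthly_compute_cost(always_on_monthly_usd: float, active_hours_per_week: float) -> float:
    """Scale the always-on monthly cost by the share of the week the pool runs."""
    return always_on_monthly_usd * (1 - savings_fraction(active_hours_per_week))


# Example: a DW1000c pool ($7,500/month always-on in the table above) that runs
# 10 hours a day on weekdays, i.e. 50 hours per week.
cost = monthly_compute_cost(7_500, 50)
saved = savings_fraction(50)

print(f"compute cost: ${cost:,.2f}/month, savings: {saved:.1%}")
```

A 50-hour week works out to roughly 70% savings, which matches the upper bound quoted in the table; a pool that runs most of the week saves far less.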

Python SDK for Synapse

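from azure.identity import DefaultAzureCredential
from azure.synapse.artifacts import ArtifactsClient
from azure.synapse.artifacts.models import (
    SqlConnection,
    SqlScript,
    SqlScriptContent,
    SqlScriptResource,
)
from azure.synapse.spark import SparkClient
from azure.synapse.spark.models import SparkBatchJobOptions

WORKSPACE_ENDPOINT = "https://syn-workspace.dev.azuresynapse.net"

credential = DefaultAzureCredential()

# Artifacts client: manages workspace artifacts (SQL scripts, notebooks, pipelines)
artifacts_client = ArtifactsClient(
    credential=credential,
    endpoint=WORKSPACE_ENDPOINT
)

# Publish a SQL script artifact. This stores the script in the workspace; it does
# not execute it (run it from Studio, a pipeline, or a SQL connection).
# Method and model names assume a recent azure-synapse-artifacts release.
script_poller = artifacts_client.sql_script.begin_create_or_update_sql_script(
    sql_script_name="daily_etl",
    sql_script=SqlScriptResource(
        name="daily_etl",
        properties=SqlScript(
            content=SqlScriptContent(
                query="EXEC sp_DailyETL @date = '2024-01-15'",
                current_connection=SqlConnection(type="SqlOnDemand", name="Built-in"),
            )
        ),
    ),
)
script_poller.result()  # block until the artifact is saved

# Submit a Spark batch job
spark_client = SparkClient(
    credential=credential,
    endpoint=WORKSPACE_ENDPOINT,
    spark_pool_name="SparkPool01"
)

# `name` and `file` are required. The driver/executor sizes are illustrative
# values and should be matched to the node size of your Spark pool.
batch_job = spark_client.spark_batch.create_spark_batch_job(
    SparkBatchJobOptions(
        name="etl_job",
        file="abfss://notebooks@stdatalake001.dfs.core.windows.net/etl_job.py",
        configuration={
            "spark.dynamicAllocation.enabled": "true",
            "spark.dynamicAllocation.minExecutors": "1",
            "spark.dynamicAllocation.maxExecutors": "10"
        },
        driver_memory="28g",
        driver_cores=4,
        executor_memory="28g",
        executor_cores=4,
        executor_count=2,
    )
)
print(batch_job.id, batch_job.state)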
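Submitting a batch job returns immediately, so a script usually needs to poll until the job finishes. The helper below is a generic sketch rather than part of the SDK: it takes any callable that returns the current state string, and the set of terminal states is an assumption based on the Livy batch states.

```python
import time
from typing import Callable

# Assumed terminal Livy states; anything else is treated as "still running".
TERMINAL_STATES = {"success", "dead", "killed", "error"}


def wait_for_batch(
    get_state: Callable[[], str],
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_state() until it returns a terminal state or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still in state {state!r} after {timeout_seconds}s")
        sleep(poll_seconds)


# Stand-in for a real job: yields a fixed sequence of states.
fake_states = iter(["not_started", "starting", "running", "success"])
final_state = wait_for_batch(lambda: next(fake_states), poll_seconds=0, sleep=lambda _: None)
print(final_state)
```

Against a real workspace, `get_state` could be something like `lambda: spark_client.spark_batch.get_spark_batch_job(job_id).state`, assuming the job id returned at submission and an SDK version that exposes that method and attribute.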

Interview Questions

Q1: Explain the difference between CTAS and INSERT INTO in Synapse.
A: CTAS creates a new table in a single parallel operation, with the distribution and index given in its WITH clause. INSERT INTO appends to an existing table and keeps that table's distribution. Use CTAS for initial loads and large transformations, and INSERT INTO for incremental updates.

Q2: How do you optimize query performance in Synapse Dedicated SQL Pool?
A:
1) Choose the correct distribution (Hash for facts, Replicated for small dims).
2) Use Clustered Columnstore Indexes.
3) Update statistics regularly.
4) Use partitioning for large tables.
5) Implement result-set caching.
6) Use materialized views for common aggregations.

Q3: When would you use Serverless vs Dedicated SQL Pool?
A: Serverless for ad-hoc exploration, data lake querying, and pay-per-use scenarios. Dedicated for production data warehousing with predictable performance, complex joins, and high-concurrency requirements.