Synapse Performance: Distribution, Indexing & Caching

Optimize Synapse Dedicated SQL Pool performance with distribution, indexing, and caching strategies

Performance Optimization Layers

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────┐
│                    SYNAPSE PERFORMANCE OPTIMIZATION                  │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  LAYER 1: DISTRIBUTION                                              │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ HASH: Distribute by join/group-by key (even distribution)    │   │
│  │ ROUND ROBIN: Random distribution (fast load, no skew)        │   │
│  │ REPLICATE: Full copy on each node (small tables <2GB)        │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  LAYER 2: INDEXING                                                  │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ CLUSTERED COLUMNSTORE INDEX: Default, best for analytics     │   │
│  │ CLUSTERED INDEX: B-tree, best for point lookups              │   │
│  │ NONCLUSTERED INDEX: Secondary lookup paths                    │   │
│  │ NONCLUSTERED COLUMNSTORE INDEX: Columnar for secondary       │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  LAYER 3: STATISTICS & CACHING                                     │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ AUTO CREATE STATISTICS: Enabled by default                   │   │
│  │ MANUALLY CREATED STATISTICS: For complex queries             │   │
│  │ RESULT SET CACHING: Cache query results for reuse            │   │
│  │ MATERIALIZED VIEWS: Pre-computed aggregations                │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  LAYER 4: QUERY TUNING                                              │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ USE CTAS for data loading (not INSERT INTO)                  │   │
│  │ USE SET statements for session-level optimization            │   │
│  │ AVOID cross-database queries                                 │   │
│  │ USE partition switching for large deletes                    │   │
│  └─────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────┘

Distribution Analysis

-- Check distribution skew
DBCC PDW_SHOWSPACEUSED('dbo.FactSales');

-- Output:
-- index_name | distribution_id | row_count | size_in_bytes
-- -----------+-----------------+-----------+-------------
-- CCI_1      | 0               | 1000000   | 50000000
-- CCI_1      | 1               | 1000000   | 50000000
-- CCI_1      | 2               | 1000001   | 50000010

-- Check statistics
sys.dm_pdw_exec_stats
WHERE command LIKE '%SELECT%'

-- Monitor query performance
SELECT
    query_id,
    sql_text,
    start_time,
    end_time,
    total_elapsed_time / 1000 AS duration_seconds,
    row_count,
    spilled_thread_count
FROM sys.dm_pdw_exec_requests
WHERE status = 'Completed'
ORDER BY total_elapsed_time DESC;

Index Optimization

-- Create CCI for fact table
CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactSales
ON dbo.FactSales;

-- Create CCI for fact table
CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactSales
ON dbo.FactSales;

-- Create nonclustered index for lookup queries
CREATE NONCLUSTERED INDEX IX_FactSales_CustomerKey
ON dbo.FactSales(customer_key)
INCLUDE (sale_date, total_amount);

-- Create materialized view for common aggregations
CREATE MATERIALIZED VIEW MV_DailySalesSummary
WITH (DISTRIBUTION = HASH(sale_date))
AS
SELECT
    sale_date,
    COUNT(*) AS transaction_count,
    SUM(total_amount) AS daily_revenue
FROM dbo.FactSales
GROUP BY sale_date;

SET Statements for Optimization

-- Enable result set caching
SET RESULT_SET_CACHING ON;

-- Enable statistics profiling
SET STATISTICS PROFILE ON;

-- Adjust DOP (Degree of Parallelism)
SET statement_dop = 4;

-- Enable bulk insert optimization
SET BULK_INSERT_BATCH_SIZE = 100000;

ℹ️

Pro Tip: Use EXPLAIN to analyze query plans. Look for distributed moves (shuffles) and partition scans as performance bottlenecks. Optimize distribution keys to minimize data movement.

Interview Questions

Q1: How do you identify and fix data skew in Synapse? A: Use DBCC PDW_SHOWSPACEUSED to check distribution sizes. If skew exists, change distribution key to a more evenly distributed column, or use Round Robin for staging tables.

Q2: What is the difference between CCI and clustered index? A: CCI (Columnstore) is optimal for analytics (scans, aggregations). Clustered (B-tree) is optimal for point lookups and range scans. Use CCI for fact tables; clustered for dimension tables.

Q3: How does result set caching improve performance? A: Result set caching stores query results in memory for 5 minutes (configurable). Subsequent identical queries return cached results instantly, reducing compute usage and query time.