Databricks Workflows, Jobs & SQL Analytics

Production workflow orchestration, SQL analytics, and automated job scheduling on Azure Databricks

Databricks Workflows Architecture

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────┐
│                    DATABRICKS WORKFLOWS ARCHITECTURE                 │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │                    DATABRICKS WORKSPACE                      │   │
│  │                                                               │   │
│  │  WORKFLOW: Daily Sales ETL                                   │   │
│  │  ┌─────────────────────────────────────────────────────┐    │   │
│  │  │  Task 1: Extract Raw Data                           │    │   │
│  │  │  ┌─────────────────────────────────────────────┐   │    │   │
│  │  │  │ Cluster: Job Cluster (auto-create)           │   │    │   │
│  │  │  │ Notebook: /extract_raw_data                   │   │    │   │
│  │  │  │ Timeout: 3600s                               │   │    │   │
│  │  │  └─────────────────────────────────────────────┘   │    │   │
│  │  │         │                                            │    │   │
│  │  │         ▼ (Success)                                  │    │   │
│  │  │  Task 2: Transform Data                             │    │   │
│  │  │  ┌─────────────────────────────────────────────┐   │    │   │
│  │  │  │ Cluster: Existing Cluster                    │   │    │   │
│  │  │  │ Notebook: /transform_curated                  │   │    │   │
│  │  │  │ Timeout: 7200s                               │   │    │   │
│  │  │  └─────────────────────────────────────────────┘   │    │   │
│  │  │         │                                            │    │   │
│  │  │         ▼ (Success)                                  │    │   │
│  │  │  Task 3: Load to Synapse                            │    │   │
│  │  │  ┌─────────────────────────────────────────────┐   │    │   │
│  │  │  │ Cluster: Job Cluster (auto-create)           │   │    │   │
│  │  │  │ Notebook: /load_to_synapse                    │   │    │   │
│  │  │  │ Timeout: 3600s                               │   │    │   │
│  │  │  └─────────────────────────────────────────────┘   │    │   │
│  │  └─────────────────────────────────────────────────────┘    │   │
│  │                                                               │   │
│  │  SQL WAREHOUSES                                               │   │
│  │  ┌─────────────────────────────────────────────────────┐    │   │
│  │  │  Serverless SQL Warehouse                          │    │   │
│  │  │  ┌─────────────────────────────────────────────┐   │    │   │
│  │  │  │ Auto-scale: 1-16 slots                       │   │    │   │
│  │  │  │ Always-on: Configurable                      │   │    │   │
│  │  │  │ Query execution: Serverless                  │   │    │   │
│  │  │  └─────────────────────────────────────────────┘   │    │   │
│  │  │                                                    │    │   │
│  │  │  Classic SQL Warehouse                            │    │   │
│  │  │  ┌─────────────────────────────────────────────┐   │    │   │
│  │  │  │ Cluster: Shared/Prophet/Enterprise           │   │    │   │
│  │  │  │ Size: 2X-Small to 2X-Large                   │   │    │   │
│  │  │  │ Auto-stop: Configurable                      │   │    │   │
│  │  │  └─────────────────────────────────────────────┘   │    │   │
│  │  └─────────────────────────────────────────────────────┘    │   │
│  └─────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────┘

Workflow Configuration

{
  "name": "daily_sales_etl",
  "tasks": [
    {
      "task_key": "extract_raw",
      "description": "Extract raw sales data from API",
      "new_cluster": {
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "Standard_D4s_v3",
        "num_workers": 2,
        "autoscale": {
          "min_workers": 1,
          "max_workers": 4
        },
        "spark_conf": {
          "spark.databricks.delta.optimizeWrite.enabled": "true",
          "spark.databricks.delta.autoCompact.enabled": "true"
        },
        "custom_tags": {
          "team": "data-engineering",
          "project": "sales-etl"
        }
      },
      "notebook_task": {
        "notebook_path": "/Repos/data_engineering/extract_raw_data",
        "base_parameters": {
          "date": "{{job.parameters.process_date}}",
          "source": "{{job.parameters.source}}"
        }
      },
      "timeout_seconds": 3600,
      "retry_on_failure": {
        "max_retries": 3,
        "retry_interval_seconds": 300
      },
      " libraries": [
        {
          "pypi": {
            "package": "requests==2.31.0"
          }
        }
      ]
    },
    {
      "task_key": "transform_curated",
      "description": "Transform raw data to curated format",
      "depends_on": [
        {
          "task_key": "extract_raw",
          "dependency_types": ["SUCCESS"]
        }
      ],
      "existing_cluster_id": "1234-567890-abcde",
      "notebook_task": {
        "notebook_path": "/Repos/data_engineering/transform_curated"
      },
      "timeout_seconds": 7200
    }
  ],
  "schedule": {
    "quartz_cron_expression": "0 0 2 * * ?",
    "timezone_id": "America/New_York"
  },
  "max_concurrent_runs": 1,
  "email_notifications": {
    "on_failure": ["data-team@company.com"],
    "on_start": ["data-team@company.com"]
  }
}

SQL Warehouse Configuration

# Databricks SQL Warehouse operations
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import sql

w = WorkspaceClient()

# Create SQL Warehouse
warehouse = w.sql_warehouses.create(
    name="Analytics Warehouse",
    cluster_size="2X-Small",
    max_num_clusters=4,
    min_num_clusters=1,
    auto_stop_mins=10,
    enable_serverless_compute=True,
    channel="CHANNEL_10_4"
)

print(f"Warehouse ID: {warehouse.id}")

# Get warehouse status
status = w.sql_warehouses.get(warehouse.id)
print(f"State: {status.state}")
print(f"Health: {status.health}")

# Execute SQL query
result = w.statement_execution.execute_statement(
    warehouse_id=warehouse.id,
    statement="SELECT * FROM sales.fact_sales WHERE sale_date >= '2024-01-01' LIMIT 100"
)

for row in result.result.data_array:
    print(row)

Parameterized SQL Queries

-- Databricks SQL with parameters
SELECT
    sale_date,
    product_category,
    SUM(total_amount) AS revenue,
    COUNT(*) AS transactions,
    AVG(total_amount) AS avg_order_value
FROM sales.fact_sales
WHERE sale_date BETWEEN :start_date AND :end_date
  AND product_category = :category
GROUP BY sale_date, product_category
ORDER BY sale_date;

-- Dashboard query with filters
WITH daily_metrics AS (
    SELECT
        sale_date,
        COUNT(DISTINCT customer_key) AS unique_customers,
        SUM(total_amount) AS revenue,
        SUM(quantity) AS units_sold
    FROM sales.fact_sales
    WHERE sale_date >= DATE_SUB(CURRENT_DATE(), 30)
    GROUP BY sale_date
)
SELECT
    sale_date,
    revenue,
    unique_customers,
    units_sold,
    LAG(revenue) OVER (ORDER BY sale_date) AS prev_day_revenue,
    ROUND((revenue - LAG(revenue) OVER (ORDER BY sale_date)) / 
          LAG(revenue) OVER (ORDER BY sale_date) * 100, 2) AS growth_pct
FROM daily_metrics
ORDER BY sale_date DESC;

ℹ️

Pro Tip: Use Serverless SQL Warehouses for ad-hoc queries and dashboards—they start in seconds and scale automatically. Use Classic Warehouses for ETL workloads requiring persistent connections.

Interview Questions

Q1: When would you use Databricks Workflows vs Azure Data Factory? A: Databricks Workflows for Spark-based ETL within Databricks ecosystem. ADF for orchestrating across multiple services (SQL, ADLS, Functions). Use ADF when Databricks is one step in a larger pipeline; use Workflows when all steps are Databricks notebooks.

Q2: How do you optimize SQL warehouse performance in Databricks? A: 1) Use appropriate warehouse size, 2) Enable serverless compute, 3) Optimize table statistics and Z-ordering, 4) Use materialized views for frequent queries, 5) Configure result caching, 6) Use parameterized queries to avoid recompilation.

Q3: Explain the concept of a "job cluster" vs "interactive cluster" in Databricks. A: Job clusters are ephemeral, created for specific jobs and destroyed after completion. Interactive clusters persist for development/exploration. Job clusters are cost-efficient for scheduled workloads; interactive clusters for ad-hoc analysis.