Databricks Workflows, Jobs & SQL Analytics
Production workflow orchestration, SQL analytics, and automated job scheduling on Azure Databricks
Databricks Workflows Architecture
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β DATABRICKS WORKFLOWS ARCHITECTURE β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β DATABRICKS WORKSPACE β β
β β β β
β β WORKFLOW: Daily Sales ETL β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β Task 1: Extract Raw Data β β β
β β β βββββββββββββββββββββββββββββββββββββββββββββββ β β β
β β β β Cluster: Job Cluster (auto-create) β β β β
β β β β Notebook: /extract_raw_data β β β β
β β β β Timeout: 3600s β β β β
β β β βββββββββββββββββββββββββββββββββββββββββββββββ β β β
β β β β β β β
β β β βΌ (Success) β β β
β β β Task 2: Transform Data β β β
β β β βββββββββββββββββββββββββββββββββββββββββββββββ β β β
β β β β Cluster: Existing Cluster β β β β
β β β β Notebook: /transform_curated β β β β
β β β β Timeout: 7200s β β β β
β β β βββββββββββββββββββββββββββββββββββββββββββββββ β β β
β β β β β β β
β β β βΌ (Success) β β β
β β β Task 3: Load to Synapse β β β
β β β βββββββββββββββββββββββββββββββββββββββββββββββ β β β
β β β β Cluster: Job Cluster (auto-create) β β β β
β β β β Notebook: /load_to_synapse β β β β
β β β β Timeout: 3600s β β β β
β β β βββββββββββββββββββββββββββββββββββββββββββββββ β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β β
β β SQL WAREHOUSES β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β Serverless SQL Warehouse β β β
β β β βββββββββββββββββββββββββββββββββββββββββββββββ β β β
β β β β Auto-scale: 1-16 slots β β β β
β β β β Always-on: Configurable β β β β
β β β β Query execution: Serverless β β β β
β β β βββββββββββββββββββββββββββββββββββββββββββββββ β β β
β β β β β β
β β β Classic SQL Warehouse β β β
β β β βββββββββββββββββββββββββββββββββββββββββββββββ β β β
β β β β Cluster: Shared/Prophet/Enterprise β β β β
β β β β Size: 2X-Small to 2X-Large β β β β
β β β β Auto-stop: Configurable β β β β
β β β βββββββββββββββββββββββββββββββββββββββββββββββ β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Workflow Configuration
{
"name": "daily_sales_etl",
"tasks": [
{
"task_key": "extract_raw",
"description": "Extract raw sales data from API",
"new_cluster": {
"spark_version": "13.3.x-scala2.12",
"node_type_id": "Standard_D4s_v3",
"num_workers": 2,
"autoscale": {
"min_workers": 1,
"max_workers": 4
},
"spark_conf": {
"spark.databricks.delta.optimizeWrite.enabled": "true",
"spark.databricks.delta.autoCompact.enabled": "true"
},
"custom_tags": {
"team": "data-engineering",
"project": "sales-etl"
}
},
"notebook_task": {
"notebook_path": "/Repos/data_engineering/extract_raw_data",
"base_parameters": {
"date": "{{job.parameters.process_date}}",
"source": "{{job.parameters.source}}"
}
},
"timeout_seconds": 3600,
"retry_on_failure": {
"max_retries": 3,
"retry_interval_seconds": 300
},
" libraries": [
{
"pypi": {
"package": "requests==2.31.0"
}
}
]
},
{
"task_key": "transform_curated",
"description": "Transform raw data to curated format",
"depends_on": [
{
"task_key": "extract_raw",
"dependency_types": ["SUCCESS"]
}
],
"existing_cluster_id": "1234-567890-abcde",
"notebook_task": {
"notebook_path": "/Repos/data_engineering/transform_curated"
},
"timeout_seconds": 7200
}
],
"schedule": {
"quartz_cron_expression": "0 0 2 * * ?",
"timezone_id": "America/New_York"
},
"max_concurrent_runs": 1,
"email_notifications": {
"on_failure": ["data-team@company.com"],
"on_start": ["data-team@company.com"]
}
}
SQL Warehouse Configuration
# Databricks SQL Warehouse operations
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import sql
w = WorkspaceClient()
# Create SQL Warehouse
warehouse = w.sql_warehouses.create(
name="Analytics Warehouse",
cluster_size="2X-Small",
max_num_clusters=4,
min_num_clusters=1,
auto_stop_mins=10,
enable_serverless_compute=True,
channel="CHANNEL_10_4"
)
print(f"Warehouse ID: {warehouse.id}")
# Get warehouse status
status = w.sql_warehouses.get(warehouse.id)
print(f"State: {status.state}")
print(f"Health: {status.health}")
# Execute SQL query
result = w.statement_execution.execute_statement(
warehouse_id=warehouse.id,
statement="SELECT * FROM sales.fact_sales WHERE sale_date >= '2024-01-01' LIMIT 100"
)
for row in result.result.data_array:
print(row)
Parameterized SQL Queries
-- Databricks SQL with parameters
SELECT
sale_date,
product_category,
SUM(total_amount) AS revenue,
COUNT(*) AS transactions,
AVG(total_amount) AS avg_order_value
FROM sales.fact_sales
WHERE sale_date BETWEEN :start_date AND :end_date
AND product_category = :category
GROUP BY sale_date, product_category
ORDER BY sale_date;
-- Dashboard query with filters
WITH daily_metrics AS (
SELECT
sale_date,
COUNT(DISTINCT customer_key) AS unique_customers,
SUM(total_amount) AS revenue,
SUM(quantity) AS units_sold
FROM sales.fact_sales
WHERE sale_date >= DATE_SUB(CURRENT_DATE(), 30)
GROUP BY sale_date
)
SELECT
sale_date,
revenue,
unique_customers,
units_sold,
LAG(revenue) OVER (ORDER BY sale_date) AS prev_day_revenue,
ROUND((revenue - LAG(revenue) OVER (ORDER BY sale_date)) /
LAG(revenue) OVER (ORDER BY sale_date) * 100, 2) AS growth_pct
FROM daily_metrics
ORDER BY sale_date DESC;
βΉοΈ
Pro Tip: Use Serverless SQL Warehouses for ad-hoc queries and dashboardsβthey start in seconds and scale automatically. Use Classic Warehouses for ETL workloads requiring persistent connections.
Interview Questions
Q1: When would you use Databricks Workflows vs Azure Data Factory? A: Databricks Workflows for Spark-based ETL within Databricks ecosystem. ADF for orchestrating across multiple services (SQL, ADLS, Functions). Use ADF when Databricks is one step in a larger pipeline; use Workflows when all steps are Databricks notebooks.
Q2: How do you optimize SQL warehouse performance in Databricks? A: 1) Use appropriate warehouse size, 2) Enable serverless compute, 3) Optimize table statistics and Z-ordering, 4) Use materialized views for frequent queries, 5) Configure result caching, 6) Use parameterized queries to avoid recompilation.
Q3: Explain the concept of a "job cluster" vs "interactive cluster" in Databricks. A: Job clusters are ephemeral, created for specific jobs and destroyed after completion. Interactive clusters persist for development/exploration. Job clusters are cost-efficient for scheduled workloads; interactive clusters for ad-hoc analysis.