Databricks on GCP Architecture
Unity Catalog
# Unity Catalog for data governance
# Create catalog
spark.sql("CREATE CATALOG IF NOT EXISTS production")
# Create schema (database)
spark.sql("CREATE SCHEMA IF NOT EXISTS production.analytics")
# Create table with Delta Lake
spark.sql("""
    CREATE TABLE IF NOT EXISTS production.analytics.sales (
        order_id STRING,
        customer_id STRING,
        amount DOUBLE,
        order_date DATE
    )
    USING DELTA
    LOCATION 'gs://my-delta-lake/sales/'
""")
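# Grant permissions (USE CATALOG and USE SCHEMA are needed alongside SELECT or MODIFY)
spark.sql("GRANT USE CATALOG, USE SCHEMA, SELECT ON CATALOG production TO `data-analysts@company.com`")
spark.sql("GRANT USE CATALOG ON CATALOG production TO `data-engineers@company.com`")
spark.sql("GRANT USE SCHEMA, MODIFY ON SCHEMA production.analytics TO `data-engineers@company.com`")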
Cluster Configuration
# Cluster configuration via API
cluster_config = {
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "n2-standard-8",
    # Set either a fixed "num_workers" or an "autoscale" range, not both
    "autoscale": {
        "min_workers": 2,
        "max_workers": 10
    },
    "spark_conf": {
        "spark.sql.adaptive.enabled": "true",
        "spark.sql.adaptive.coalescePartitions.enabled": "true",
        "spark.databricks.delta.optimizeWrite.enabled": "true",
        "spark.databricks.delta.autoCompact.enabled": "true"
    },
    # Init scripts on DBFS are deprecated on newer runtimes;
    # workspace files or Unity Catalog volumes are the usual replacement
    "init_scripts": [
        {
            "dbfs": {
                "destination": "dbfs:/init-scripts/install-packages.sh"
            }
        }
    ],
    # GCP-specific settings live under "gcp_attributes" in the Clusters API
    "gcp_attributes": {
        "availability": "ON_DEMAND_GCP",
        "zone_id": "us-central1-a"
    }
}
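As a rough sketch of how such a configuration might be submitted, the helper below builds (but does not send) a request for the Clusters API `clusters/create` endpoint using only the standard library. The workspace URL and token are placeholders, and the function names `validate_cluster_config` and `build_create_cluster_request` are hypothetical, not part of any Databricks SDK.

```python
import json
import urllib.request


def validate_cluster_config(config: dict) -> None:
    """Reject configs that set both a fixed size and an autoscale range."""
    if "num_workers" in config and "autoscale" in config:
        raise ValueError("Set either num_workers or autoscale, not both")
    autoscale = config.get("autoscale")
    if autoscale and autoscale["min_workers"] > autoscale["max_workers"]:
        raise ValueError("autoscale.min_workers exceeds autoscale.max_workers")


def build_create_cluster_request(host: str, token: str, config: dict) -> urllib.request.Request:
    """Build a POST request for /api/2.0/clusters/create without sending it."""
    validate_cluster_config(config)
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.0/clusters/create",
        data=json.dumps(config).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder config, host and token: substitute real values before sending.
    example_config = {
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "n2-standard-8",
        "autoscale": {"min_workers": 2, "max_workers": 10},
    }
    request = build_create_cluster_request(
        "https://example.gcp.databricks.com", "<personal-access-token>", example_config
    )
    print(request.get_method(), request.full_url)
    # To actually create the cluster: urllib.request.urlopen(request)
```

Keeping validation separate from the HTTP call makes it possible to check a configuration in CI before any cluster is created.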
Best Practice: Use Unity Catalog for centralized governance. Enable Delta Lake auto-optimization (optimizeWrite, autoCompact). Use preemptible workers (for example, availability PREEMPTIBLE_WITH_FALLBACK_GCP) for fault-tolerant batch jobs. Implement cluster policies for cost control. Store credentials in Databricks secrets rather than in notebooks or cluster configuration.
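To make the cluster-policy recommendation concrete, here is a minimal sketch of a cost-control policy written in the attribute-path style that Databricks policy definitions use (`range`, `allowlist`, `fixed`). The specific limits are illustrative assumptions, and `policy_violations` is a hypothetical local checker for demonstration only; Databricks enforces real policies server-side.

```python
# Policy definition: attribute path -> constraint. The limits are illustrative.
COST_POLICY = {
    "autoscale.max_workers": {"type": "range", "maxValue": 10},
    "node_type_id": {"type": "allowlist", "values": ["n2-standard-4", "n2-standard-8"]},
    "autotermination_minutes": {"type": "fixed", "value": 30},
}


def lookup(config: dict, path: str):
    """Resolve a dotted attribute path such as 'autoscale.max_workers'."""
    node = config
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def policy_violations(policy: dict, config: dict) -> list:
    """Return human-readable messages for each rule the config breaks."""
    problems = []
    for path, rule in policy.items():
        value = lookup(config, path)
        if value is None:
            continue  # absent attributes are skipped in this sketch
        kind = rule["type"]
        if kind == "fixed" and value != rule["value"]:
            problems.append(f"{path} must be {rule['value']}, got {value}")
        elif kind == "allowlist" and value not in rule["values"]:
            problems.append(f"{path}={value} is not in the allowlist")
        elif kind == "range":
            if "minValue" in rule and value < rule["minValue"]:
                problems.append(f"{path}={value} is below {rule['minValue']}")
            if "maxValue" in rule and value > rule["maxValue"]:
                problems.append(f"{path}={value} is above {rule['maxValue']}")
    return problems


if __name__ == "__main__":
    oversized = {"node_type_id": "n2-highmem-32", "autoscale": {"min_workers": 2, "max_workers": 40}}
    for message in policy_violations(COST_POLICY, oversized):
        print(message)
```

The dotted-path lookup is deliberately simple and does not handle `spark_conf` keys, which themselves contain dots.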
Common Interview Questions
Q1: What is Unity Catalog?
Answer: Unity Catalog is Databricks' unified governance solution for data and AI. It provides centralized access control, auditing, lineage, and data discovery across Databricks workspaces. It supports Delta Lake, external tables, and files.
Q2: How does Databricks on GCP differ from Dataproc?
Answer: Databricks provides a managed Spark environment with Delta Lake, Unity Catalog, and collaborative notebooks. Dataproc is Google's managed Spark/Hadoop service and gives more control over cluster configuration. Databricks excels at ML and collaboration; Dataproc suits cost-optimized batch processing.
Q3: What is the benefit of Delta Lake on GCP?
Answer: Delta Lake adds ACID transactions, schema enforcement and evolution, and time travel on top of Parquet files. It enables reliable data pipelines, incremental processing, and point-in-time recovery.
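Time travel and point-in-time recovery can be illustrated with the Delta SQL statements `DESCRIBE HISTORY`, `VERSION AS OF`, and `RESTORE TABLE`. The sketch below only builds the SQL strings so it runs without a cluster; the helper names are hypothetical, and the table name is borrowed from the Unity Catalog example earlier in this document.

```python
def history_query(table: str) -> str:
    """List the commits (versions) recorded in the table's transaction log."""
    return f"DESCRIBE HISTORY {table}"


def version_as_of_query(table: str, version: int) -> str:
    """Read the table as it looked at a given version (time travel)."""
    if version < 0:
        raise ValueError("Delta table versions start at 0")
    return f"SELECT * FROM {table} VERSION AS OF {version}"


def restore_statement(table: str, version: int) -> str:
    """Roll the table back to a given version (point-in-time recovery)."""
    if version < 0:
        raise ValueError("Delta table versions start at 0")
    return f"RESTORE TABLE {table} TO VERSION AS OF {version}"


if __name__ == "__main__":
    table = "production.analytics.sales"
    print(history_query(table))
    print(version_as_of_query(table, 3))
    print(restore_statement(table, 3))
    # In a Databricks notebook, pass each string to spark.sql(...)
```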
Q4: How do you optimize Spark jobs on Databricks?
Answer: 1) Enable Adaptive Query Execution, 2) Use the Databricks disk cache (formerly Delta cache) for repeatedly read data, 3) Tune spark.sql.shuffle.partitions to the size of the shuffle, 4) Use broadcast joins for small tables, 5) Enable auto-compact and optimize-write, 6) Use preemptible workers for batch.
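Several of these settings can be expressed as Spark configuration keys. The sketch below assembles them into a dictionary and derives a shuffle partition count from the expected shuffle size; the 128 MB target per partition is a common rule of thumb used here as an assumption, and the helper names are hypothetical.

```python
import math


def suggested_shuffle_partitions(shuffle_input_gb: float, target_partition_mb: int = 128) -> int:
    """Estimate a partition count so each shuffle partition is near the target size."""
    return max(1, math.ceil(shuffle_input_gb * 1024 / target_partition_mb))


def tuning_conf(shuffle_input_gb: float, broadcast_threshold_mb: int = 10) -> dict:
    """Collect the session-level settings named in the answer above."""
    return {
        "spark.sql.adaptive.enabled": "true",
        "spark.sql.adaptive.coalescePartitions.enabled": "true",
        "spark.sql.shuffle.partitions": str(suggested_shuffle_partitions(shuffle_input_gb)),
        # Tables smaller than this many bytes are broadcast automatically
        "spark.sql.autoBroadcastJoinThreshold": str(broadcast_threshold_mb * 1024 * 1024),
        "spark.databricks.delta.optimizeWrite.enabled": "true",
        "spark.databricks.delta.autoCompact.enabled": "true",
    }


def apply_conf(spark, conf: dict) -> None:
    """Apply the settings to an existing SparkSession."""
    for key, value in conf.items():
        spark.conf.set(key, value)


if __name__ == "__main__":
    for key, value in tuning_conf(shuffle_input_gb=64).items():
        print(f"{key} = {value}")
```

With Adaptive Query Execution enabled, the shuffle partition count acts as a starting point that Spark may coalesce at runtime, so erring slightly high is usually less harmful than erring low.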
Q5: How do you integrate Databricks with BigQuery?
Answer: Use the BigQuery connector for Databricks to read/write BigQuery tables. Use Databricks for complex Spark transformations and BigQuery for analytics. Consider using Delta Lake as a unified format accessible by both systems.
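A minimal sketch of the connector usage, assuming the Spark BigQuery connector is available on the cluster under the `bigquery` format name. The project, dataset, table, and bucket names are placeholders, and the helper functions are hypothetical wrappers around the connector's `table`, `filter`, and `temporaryGcsBucket` options.

```python
from typing import Optional


def bigquery_read_options(project: str, dataset: str, table: str,
                          row_filter: Optional[str] = None) -> dict:
    """Options for reading a BigQuery table, with an optional pushed-down filter."""
    options = {"table": f"{project}.{dataset}.{table}"}
    if row_filter:
        options["filter"] = row_filter
    return options


def bigquery_write_options(project: str, dataset: str, table: str, temp_bucket: str) -> dict:
    """Options for writing via a temporary GCS bucket (indirect write)."""
    return {
        "table": f"{project}.{dataset}.{table}",
        "temporaryGcsBucket": temp_bucket,
    }


def read_bigquery(spark, options: dict):
    """Load a BigQuery table as a Spark DataFrame."""
    return spark.read.format("bigquery").options(**options).load()


def write_bigquery(df, options: dict, mode: str = "append") -> None:
    """Write a Spark DataFrame to BigQuery."""
    df.write.format("bigquery").options(**options).mode(mode).save()


if __name__ == "__main__":
    # Placeholder identifiers: replace with real project, dataset and bucket names.
    print(bigquery_read_options("my-project", "analytics", "sales", "order_date >= '2024-01-01'"))
    print(bigquery_write_options("my-project", "analytics", "sales_agg", "my-temp-bucket"))
```

In a notebook, `read_bigquery(spark, ...)` would return a DataFrame that can be transformed with Spark and written back with `write_bigquery`.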