
Databricks on GCP: Unity Catalog & Clusters



Databricks on GCP

Master Databricks on GCP including Unity Catalog, cluster management, Spark optimization, and integration with GCP services.

18 min read · Advanced

Databricks on GCP Architecture

Dataflow vs Dataproc: When to Use What
Dataflow
Apache Beam (Serverless)
✓ Fully managed, no cluster setup
✓ Auto-scaling (up and down)
✓ Unified stream + batch
✓ Exactly-once processing
✓ Pay per vCPU and GB of memory consumed
✗ Limited customization
✗ Harder to debug
✗ Requires the Apache Beam programming model
Use for: New pipelines, streaming, ETL jobs, serverless-first teams
Dataproc
Spark/Hadoop (Managed)
✓ Full Spark/Hadoop ecosystem
✓ Easy migration from on-prem
✓ Custom scripts & libraries
✓ Preemptible/Spot VMs (up to 91% off)
✓ Jupyter/Zeppelin available as optional components
✗ Cluster management needed
✗ Autoscaling must be configured per cluster
✗ Idle cluster costs money
Use for: Existing Spark code, ML workloads, lift-and-shift from on-prem Hadoop

Unity Catalog

# Unity Catalog for data governance
# Create catalog
spark.sql("CREATE CATALOG IF NOT EXISTS production")

# Create schema (database)
spark.sql("CREATE SCHEMA IF NOT EXISTS production.analytics")

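# Create an external Delta table. The gs:// path must be covered by a
# Unity Catalog external location; omit LOCATION for a managed table.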
spark.sql("""
CREATE TABLE IF NOT EXISTS production.analytics.sales (
    order_id STRING,
    customer_id STRING,
    amount DOUBLE,
    order_date DATE
)
USING DELTA
LOCATION 'gs://my-delta-lake/sales/'
""")

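# Grant permissions. Privileges granted on a catalog are inherited by its
# schemas and tables; principals also need USE CATALOG and USE SCHEMA.
spark.sql("GRANT USE CATALOG, USE SCHEMA, SELECT ON CATALOG production TO `data-analysts@company.com`")
spark.sql("GRANT USE CATALOG ON CATALOG production TO `data-engineers@company.com`")
spark.sql("GRANT USE SCHEMA, SELECT, MODIFY ON SCHEMA production.analytics TO `data-engineers@company.com`")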
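Reading a single table in Unity Catalog generally requires `USE CATALOG` on the parent catalog and `USE SCHEMA` on the parent schema, in addition to `SELECT` on the table itself. The following is a minimal sketch that generates those three statements for one principal. The helper names `quote_ident` and `read_grants` are hypothetical and only build SQL strings; in a notebook each string could be passed to `spark.sql`.

```python
def quote_ident(name: str) -> str:
    """Backtick-quote an identifier or principal, escaping embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def read_grants(catalog: str, schema: str, table: str, principal: str) -> list:
    """Return the GRANT statements for table-level read access (hypothetical helper)."""
    p = quote_ident(principal)
    c = quote_ident(catalog)
    s = f"{c}.{quote_ident(schema)}"
    t = f"{s}.{quote_ident(table)}"
    return [
        f"GRANT USE CATALOG ON CATALOG {c} TO {p}",
        f"GRANT USE SCHEMA ON SCHEMA {s} TO {p}",
        f"GRANT SELECT ON TABLE {t} TO {p}",
    ]


statements = read_grants(
    "production", "analytics", "sales", "data-analysts@company.com"
)
for stmt in statements:
    print(stmt)  # in a Databricks notebook: spark.sql(stmt)
```

Granting at table level keeps access narrow; granting the same privileges on the catalog instead trades that precision for less maintenance.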

Cluster Configuration

# Cluster configuration via API (payload for the Clusters API create call)
cluster_config = {
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "n2-standard-8",
    # Use either "num_workers" (fixed size) or "autoscale", not both.
    "autoscale": {
        "min_workers": 2,
        "max_workers": 10
    },
    "spark_conf": {
        "spark.sql.adaptive.enabled": "true",
        "spark.sql.adaptive.coalescePartitions.enabled": "true",
        "spark.databricks.delta.optimizeWrite.enabled": "true",
        "spark.databricks.delta.autoCompact.enabled": "true"
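    },
    # Note: recent Databricks Runtime versions no longer support init scripts
    # stored on DBFS; workspace files, volumes, or gs:// paths are alternatives.
    "init_scripts": [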
        {
            "dbfs": {
                "destination": "dbfs:/init-scripts/install-packages.sh"
            }
        }
    ],
    "gcp_attributes": {
        "availability": "ON_DEMAND_GCP",
        "zone_id": "us-central1-a"
    }
}
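A cluster spec is easy to get subtly wrong, for example by setting a fixed worker count and an autoscale range at the same time. The sketch below checks a spec locally before it is sent anywhere. `validate_cluster_spec` is a hypothetical helper, and the rules it applies are a simplified subset chosen for illustration, not the full validation the Clusters API performs.

```python
import json


def validate_cluster_spec(spec: dict) -> list:
    """Return a list of problems found in a cluster spec (empty if none)."""
    problems = []
    has_fixed = "num_workers" in spec
    has_auto = "autoscale" in spec
    if has_fixed and has_auto:
        problems.append("set either num_workers or autoscale, not both")
    if not has_fixed and not has_auto:
        problems.append("set one of num_workers or autoscale")
    if has_auto:
        lo = spec["autoscale"].get("min_workers")
        hi = spec["autoscale"].get("max_workers")
        if lo is None or hi is None or lo > hi:
            problems.append("autoscale needs min_workers <= max_workers")
    if "spark_version" not in spec:
        problems.append("missing field: spark_version")
    return problems


# Example spec; the values mirror the configuration in this section.
example_spec = {
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "n2-standard-8",
    "autoscale": {"min_workers": 2, "max_workers": 10},
    "gcp_attributes": {"availability": "ON_DEMAND_GCP", "zone_id": "us-central1-a"},
}

problems = validate_cluster_spec(example_spec)
payload = json.dumps(example_spec, indent=2)  # JSON body for the create request
print(problems or "spec looks consistent")
```

Running a check like this in CI catches conflicting settings before a job fails at cluster start.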


Best Practice: Use Unity Catalog for centralized governance. Enable Delta Lake auto-optimization (optimizeWrite, autoCompact). Use preemptible workers for fault-tolerant batch jobs, since they can be reclaimed at any time. Enforce cluster policies for cost control. Store credentials in Databricks secrets rather than in notebooks or cluster configs.
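To make the cluster policy recommendation concrete, here is a sketch of a policy definition using the `fixed`, `allowlist`, and `range` rule types, plus a small local checker. The specific limits and node types are assumptions for illustration. `policy_violations` is a hypothetical helper that approximates the idea of policy enforcement; it is not how Databricks evaluates policies internally.

```python
policy_definition = {
    # Cap autoscaling so a single cluster cannot grow without bound.
    "autoscale.max_workers": {"type": "range", "maxValue": 10},
    # Restrict node types to a short approved list (assumed values).
    "node_type_id": {"type": "allowlist", "values": ["n2-standard-4", "n2-standard-8"]},
    # Force idle clusters to shut down after 30 minutes.
    "autotermination_minutes": {"type": "fixed", "value": 30},
}


def lookup(spec: dict, dotted_path: str):
    """Resolve a dotted attribute path such as 'autoscale.max_workers'."""
    value = spec
    for part in dotted_path.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def policy_violations(spec: dict, policy: dict) -> list:
    """Return human-readable violations of the policy (simplified local check)."""
    found = []
    for path, rule in policy.items():
        value = lookup(spec, path)
        if rule["type"] == "fixed" and value != rule["value"]:
            found.append(f"{path} must be {rule['value']}")
        elif rule["type"] == "allowlist" and value not in rule["values"]:
            found.append(f"{path} must be one of {rule['values']}")
        elif rule["type"] == "range" and value is not None and value > rule["maxValue"]:
            found.append(f"{path} must be <= {rule['maxValue']}")
    return found


oversized = {
    "node_type_id": "n2-standard-8",
    "autoscale": {"min_workers": 2, "max_workers": 50},
    "autotermination_minutes": 30,
}
print(policy_violations(oversized, policy_definition))
```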


Common Interview Questions

Q1: What is Unity Catalog?

Answer: Unity Catalog is Databricks' unified governance solution for data and AI. It provides centralized access control, auditing, lineage, and data discovery across Databricks workspaces. It supports Delta Lake, external tables, and files.

Q2: How does Databricks on GCP differ from Dataproc?

Answer: Databricks provides a managed Spark environment with Delta Lake, Unity Catalog, and collaborative notebooks. Dataproc is Google's managed Spark/Hadoop service with more control over cluster configuration. Databricks excels at ML and collaboration, while Dataproc suits cost-optimized batch processing.

Q3: What is the benefit of Delta Lake on GCP?

Answer: Delta Lake provides ACID transactions, schema evolution, time travel, and data quality enforcement on Parquet files. It enables reliable data pipelines, incremental processing, and point-in-time recovery.
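Time travel and point-in-time recovery can be illustrated with the Delta SQL clauses `VERSION AS OF`, `TIMESTAMP AS OF`, and `RESTORE TABLE`. The sketch below only builds the SQL strings; the helper names are hypothetical, and the table name is taken from the Unity Catalog example. In a notebook the strings would be passed to `spark.sql`.

```python
from typing import Optional


def time_travel_query(
    table: str, version: Optional[int] = None, timestamp: Optional[str] = None
) -> str:
    """Build a query that reads an earlier snapshot of a Delta table."""
    if (version is None) == (timestamp is None):
        raise ValueError("pass exactly one of version or timestamp")
    if version is not None:
        return f"SELECT * FROM {table} VERSION AS OF {int(version)}"
    return f"SELECT * FROM {table} TIMESTAMP AS OF '{timestamp}'"


def restore_statement(table: str, version: int) -> str:
    """Build a statement that rolls the table back to an earlier version."""
    return f"RESTORE TABLE {table} TO VERSION AS OF {int(version)}"


by_version = time_travel_query("production.analytics.sales", version=3)
by_time = time_travel_query("production.analytics.sales", timestamp="2024-01-01")
rollback = restore_statement("production.analytics.sales", 3)
print(by_version)
print(by_time)
print(rollback)
```

Available versions depend on the table's retention settings, so older snapshots may no longer be readable after cleanup.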

Q4: How do you optimize Spark jobs on Databricks?

Answer: 1) Enable Adaptive Query Execution, 2) Use the Databricks disk cache (formerly the Delta cache), 3) Tune spark.sql.shuffle.partitions to the shuffle size, 4) Use broadcast joins for small tables, 5) Enable auto-compact and optimize-write, 6) Use preemptible workers for batch.
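For the shuffle-partition point, one common rule of thumb is to size partitions at roughly 128 MiB and round the count up to a multiple of the available cores. The sketch below implements that heuristic. The 128 MiB target and the helper name `suggest_shuffle_partitions` are assumptions for illustration, not a Databricks recommendation; with Adaptive Query Execution enabled, Spark may coalesce partitions further at runtime.

```python
import math


def suggest_shuffle_partitions(
    shuffle_bytes: int,
    total_cores: int,
    target_partition_bytes: int = 128 * 1024**2,  # assumed ~128 MiB target
) -> int:
    """Estimate a shuffle partition count from data size and core count."""
    by_size = math.ceil(shuffle_bytes / target_partition_bytes)
    # Round up to whole "waves" so every core has work in each wave.
    waves = max(1, math.ceil(by_size / total_cores))
    return waves * total_cores


# Example: a 50 GiB shuffle on a cluster with 32 worker cores.
partitions = suggest_shuffle_partitions(50 * 1024**3, total_cores=32)
print(partitions)
# In a notebook, the value could be applied with:
# spark.conf.set("spark.sql.shuffle.partitions", str(partitions))
```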

Q5: How do you integrate Databricks with BigQuery?

Answer: Use the BigQuery connector for Databricks to read/write BigQuery tables. Use Databricks for complex Spark transformations and BigQuery for analytics. Consider using Delta Lake as a unified format accessible by both systems.
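A minimal sketch of reading and writing BigQuery tables through the Spark BigQuery connector follows. It assumes the connector is available on the cluster and that the cluster's service account can access the dataset. The function names, the table name, and the bucket name in the usage comments are hypothetical.

```python
def read_bigquery_table(spark, table: str, filter_expr: str = ""):
    """Load a BigQuery table as a DataFrame, optionally pushing down a filter."""
    reader = spark.read.format("bigquery").option("table", table)
    if filter_expr:
        reader = reader.option("filter", filter_expr)
    return reader.load()


def write_bigquery_table(df, table: str, temp_bucket: str, mode: str = "append"):
    """Write a DataFrame to BigQuery, staging data in a GCS bucket."""
    (
        df.write.format("bigquery")
        .option("table", table)
        .option("temporaryGcsBucket", temp_bucket)
        .mode(mode)
        .save()
    )


# Usage in a Databricks notebook, where `spark` is predefined:
# sales = read_bigquery_table(
#     spark, "my-project.analytics.sales", "order_date >= '2024-01-01'"
# )
# write_bigquery_table(sales, "my-project.analytics.sales_copy", "my-temp-bucket")
```

Passing the session in as an argument keeps the functions easy to exercise with a stub outside Databricks.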
