Vertex AI Workbench & Colab Enterprise

Master Vertex AI Workbench, Colab Enterprise, Jupyter notebooks, and data engineering workflows including Spark integration and model deployment.

14 min read · Intermediate

Notebook Environments Overview

Google Cloud provides multiple notebook environments for data engineering and ML workflows.

Environment Comparison

🐝 Dataproc Architecture for Data Engineering

Interview Tip: Dataproc is ideal for migrating existing Spark/Hadoop workloads to GCP. Use preemptible VMs for secondary workers (never for the master) to save up to 91%. Scheduled cluster deletion and autoscaling help control costs. Pin image versions to control which Spark/Hadoop versions a cluster runs.

Vertex AI Workbench Setup

# Create Vertex AI Workbench instance
gcloud notebooks instances create my-workbench \
  --vm-image-project=deeplearning-platform-release \
  --vm-image-family=common-cpu-notebooks \
  --machine-type=n1-standard-4 \
  --location=us-central1-a \
  --boot-disk-size=100 \
  --boot-disk-type=PD_SSD \
  --subnet=projects/my-project/regions/us-central1/subnetworks/default \
  --no-public-ip

# For GPU workloads
gcloud notebooks instances create gpu-workbench \
  --vm-image-project=deeplearning-platform-release \
  --vm-image-family=gpu-notebooks \
  --machine-type=n1-standard-8 \
  --accelerator-type=NVIDIA_TESLA_T4 \
  --accelerator-core-count=1 \
  --location=us-central1-a \
  --boot-disk-size=200

Data Engineering Workflows

BigQuery Integration

# Vertex AI Workbench notebook cell
from google.cloud import bigquery
import pandas as pd
import matplotlib.pyplot as plt

# Initialize BigQuery client
client = bigquery.Client(project="my-project")

# Query data
query = """
SELECT
    DATE(order_date) as date,
    product_category,
    COUNT(*) as order_count,
    SUM(amount) as revenue
FROM `project.analytics.orders`
WHERE order_date >= '2025-01-01'
GROUP BY 1, 2
ORDER BY 1, 2
"""

df = client.query(query).to_dataframe()

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Revenue by date
df.groupby('date')['revenue'].sum().plot(ax=axes[0])
axes[0].set_title('Daily Revenue')
axes[0].set_ylabel('Revenue ($)')

# Revenue by category
df.groupby('product_category')['revenue'].sum().plot(kind='bar', ax=axes[1])
axes[1].set_title('Revenue by Category')
axes[1].set_ylabel('Revenue ($)')

plt.tight_layout()
plt.show()
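
Before running a query like the one above against a large table, it can help to estimate how much data it would scan. The sketch below shows one way to do that with a BigQuery dry run. The helper names are hypothetical, and the on-demand rate of $6.25 per TiB is an assumption for illustration only; check current pricing for your region and edition.

```python
# Sketch: estimate on-demand query cost from a dry run.
# ASSUMPTION: 6.25 USD per TiB scanned; illustrative, not authoritative.
ASSUMED_USD_PER_TIB = 6.25
BYTES_PER_TIB = 2 ** 40


def estimate_cost_usd(bytes_processed: int,
                      usd_per_tib: float = ASSUMED_USD_PER_TIB) -> float:
    """Convert a bytes-scanned figure into an approximate on-demand cost."""
    return bytes_processed / BYTES_PER_TIB * usd_per_tib


def dry_run_bytes(sql: str, project: str) -> int:
    """Return the bytes a query would scan without running it (hypothetical helper)."""
    # Imported lazily so the arithmetic above works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client(project=project)
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=job_config)
    return job.total_bytes_processed
```

In a notebook cell this could be used as `estimate_cost_usd(dry_run_bytes(query, "my-project"))` before calling `to_dataframe()`.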

Spark Integration

# PySpark in Vertex AI Workbench
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, countDistinct, to_date

# Create Spark session
spark = SparkSession.builder \
    .appName("Data Engineering") \
    .config("spark.jars.packages", "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.34.0") \
    .getOrCreate()

# Read from BigQuery
df = spark.read \
    .format("bigquery") \
    .option("table", "project.analytics.events") \
    .option("filter", "event_date >= '2025-01-01'") \
    .load()

# Process data
processed_df = df \
    .withColumn("event_date", to_date(col("event_timestamp"))) \
    .groupBy("event_date", "event_type") \
    .agg(
        count("*").alias("event_count"),
        countDistinct("user_id").alias("unique_users")
    )

# Write back to BigQuery
processed_df.write \
    .format("bigquery") \
    .option("table", "project.analytics.daily_event_summary") \
    .option("temporaryGcsBucket", "my-temp-bucket") \
    .mode("overwrite") \
    .save()

spark.stop()
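
To make the aggregation step concrete, here is a tiny in-memory version of the same grouping logic in plain Python (not Spark). It is only a sketch for checking expectations on a handful of rows; the function name and the sample rows are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime


def summarize_events(rows):
    """Group rows by (event_date, event_type); count rows and distinct users."""
    counts = defaultdict(int)
    users = defaultdict(set)
    for row in rows:
        # Same idea as to_date(col("event_timestamp")) in the Spark job.
        event_date = datetime.fromisoformat(row["event_timestamp"]).date()
        key = (event_date.isoformat(), row["event_type"])
        counts[key] += 1
        users[key].add(row["user_id"])
    return {
        key: {"event_count": counts[key], "unique_users": len(users[key])}
        for key in counts
    }


# Hypothetical sample data, not taken from any real table.
sample_rows = [
    {"event_timestamp": "2025-01-01T09:00:00", "event_type": "click", "user_id": "u1"},
    {"event_timestamp": "2025-01-01T10:30:00", "event_type": "click", "user_id": "u1"},
    {"event_timestamp": "2025-01-01T11:00:00", "event_type": "click", "user_id": "u2"},
    {"event_timestamp": "2025-01-02T08:00:00", "event_type": "view", "user_id": "u1"},
]
summary = summarize_events(sample_rows)
```

Comparing a small result like `summary` against the Spark output for the same rows is a cheap sanity check before running the job on the full table.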

Dataflow Pipeline Development

# Develop Dataflow pipelines in notebooks
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def develop_dataflow_pipeline():
    """Develop and test Dataflow pipeline."""
    # Test pipeline locally first
    pipeline_options = PipelineOptions([
        '--runner', 'DirectRunner',  # Local testing
        '--project', 'my-project',
    ])

    with beam.Pipeline(options=pipeline_options) as pipeline:
        (
            pipeline
            # Expects newline-delimited JSON: one object per line
            | 'Read' >> beam.io.ReadFromText('gs://my-bucket/sample.json')
            | 'Parse' >> beam.Map(json.loads)
            | 'Transform' >> beam.Map(lambda x: {
                'id': x['id'],
                'value': x['value'] * 2
            })
            | 'Write' >> beam.io.WriteToText('gs://my-bucket/output/')
        )

    # Deploy to Dataflow for production
    # Change runner to DataflowRunner and deploy
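
One way to make the switch from local testing to Dataflow less error-prone is to keep the transform logic in a plain function and build the runner arguments in one place. The sketch below assumes the same record shape as the pipeline above; `parse_and_double`, `build_pipeline_args`, and the `gs://my-bucket/temp` location are hypothetical names chosen for illustration.

```python
import json


def parse_and_double(line: str) -> dict:
    """Parse one JSON line and double its 'value' field."""
    record = json.loads(line)
    return {"id": record["id"], "value": record["value"] * 2}


def build_pipeline_args(runner: str, project: str,
                        region: str = "us-central1",
                        temp_location: str = "gs://my-bucket/temp") -> list:
    """Build Beam command-line style arguments for local or Dataflow runs."""
    args = ["--runner", runner, "--project", project]
    if runner == "DataflowRunner":
        # Dataflow needs a region and a GCS location for temporary files.
        args += ["--region", region, "--temp_location", temp_location]
    return args
```

The resulting list can be passed to `PipelineOptions(...)`, and `parse_and_double` can be used in a single `beam.Map` step while also being unit-testable without Beam installed.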

Environment Configuration

Custom Container Images

# Custom Dockerfile for data engineering
FROM gcr.io/deeplearning-platform-release/base-cu113

# Install data engineering packages
RUN pip install \
    google-cloud-bigquery==3.14.0 \
    google-cloud-storage==2.14.0 \
    apache-beam[gcp]==2.52.0 \
    dbt-bigquery==1.7.0 \
    great-expectations==0.18.0 \
    pandas==2.1.0 \
    sqlalchemy==2.0.23

# Install system dependencies
RUN apt-get update && apt-get install -y \
    jq \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Set working directory
WORKDIR /home/jupyter

Startup Script

#!/bin/bash
# Startup script for Vertex AI Workbench

# Install additional packages
pip install google-cloud-bigquery pandas dbt-bigquery

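# Set gcloud defaults for project and region
gcloud config set project my-project
gcloud config set compute/region us-central1

# Mount GCS bucket
mkdir -p /home/jupyter/data-lake
gcsfuse my-data-lake /home/jupyter/data-lake

# Set environment variables
# Note: exports apply only to this script's shell and its child processes;
# persist them separately if notebook kernels need to read them.
export GOOGLE_CLOUD_PROJECT=my-project
export BIGQUERY_DATASET=analytics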

echo "Environment configured successfully"

Best Practice: Use environment variables for configuration instead of hardcoding project IDs. Use GCSFuse to mount buckets for easy file access. Set up automatic shutdown schedules to avoid unnecessary costs. Use Git integration for version control of notebooks.
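
As a minimal sketch of the environment-variable advice, notebook code could read its settings once at the top instead of embedding project IDs in each cell. The variable names match the startup script above; the `NotebookConfig` class and `load_config` function are hypothetical.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class NotebookConfig:
    project_id: str
    dataset: str


def load_config(env=None) -> NotebookConfig:
    """Read settings from the environment; fail early if the project is missing."""
    env = os.environ if env is None else env
    project_id = env.get("GOOGLE_CLOUD_PROJECT")
    if not project_id:
        raise RuntimeError("GOOGLE_CLOUD_PROJECT is not set")
    # ASSUMPTION: fall back to the 'analytics' dataset used elsewhere in this guide.
    dataset = env.get("BIGQUERY_DATASET", "analytics")
    return NotebookConfig(project_id=project_id, dataset=dataset)
```

A cell can then call `config = load_config()` and pass `config.project_id` to `bigquery.Client(...)`, so the same notebook runs unchanged in another project.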

⚠️ Cost Alert

Always monitor your BigQuery costs using INFORMATION_SCHEMA. Set up budget alerts at 50%, 80%, and 100% thresholds.
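
The sketch below illustrates both ideas: a query against the `INFORMATION_SCHEMA.JOBS_BY_PROJECT` view to see who is billing the most bytes, and a small check for which budget thresholds a given spend has crossed. Real budget alerts are configured in Cloud Billing; the threshold function here only illustrates the arithmetic. The function names, the `region-us` qualifier, and the 30-day window are assumptions.

```python
BUDGET_THRESHOLDS = (0.5, 0.8, 1.0)  # 50%, 80%, 100%


def crossed_thresholds(spend_usd: float, budget_usd: float,
                       thresholds=BUDGET_THRESHOLDS) -> list:
    """Return the thresholds that current spend has reached or exceeded."""
    if budget_usd <= 0:
        raise ValueError("budget_usd must be positive")
    ratio = spend_usd / budget_usd
    return [t for t in thresholds if ratio >= t]


def bytes_billed_by_user_sql(region: str = "region-us", days: int = 30) -> str:
    """Build a query listing bytes billed per user over the last `days` days."""
    return f"""
SELECT
    user_email,
    SUM(total_bytes_billed) AS bytes_billed
FROM `{region}`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL {days} DAY)
  AND job_type = 'QUERY'
GROUP BY user_email
ORDER BY bytes_billed DESC
"""
```

The SQL string can be passed to `client.query(...)` from a notebook, provided the caller has permission to list jobs in the project.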

Cost Optimization

# Cost optimization strategies for notebooks
cost_strategies = {
    "auto_shutdown": {
        "description": "Automatically stop idle instances",
        # Idle shutdown is set through instance metadata (value in seconds)
        "setting": "gcloud workbench instances update INSTANCE --location=ZONE --metadata=idle-timeout-seconds=3600"
    },
    "right_sizing": {
        "description": "Use appropriate machine types",
        "recommendation": "Start with n1-standard-4, upgrade only if needed"
    },
    "preemptible": {
        "description": "Use preemptible VMs for non-critical work",
        "savings": "Up to 70% discount"
    },
    "persistent_disk": {
        "description": "Use persistent disks for data",
        "benefit": "Survives instance deletion, smaller boot disks"
    }
}

# Pricing example
pricing = {
    "n1-standard-4": "$0.19/hr (~$139/month)",
    "n1-standard-8": "$0.38/hr (~$277/month)",
    "preemptible_4": "$0.06/hr (~$44/month)",
    "ssd_100gb": "$17/month"
}
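
The monthly figures above follow from multiplying the hourly rate by roughly 730 hours per month. A short sketch of that arithmetic also shows why idle shutdown matters; the 8 hours per day and 22 working days are assumptions chosen for illustration, and the hourly rate is the one quoted above.

```python
HOURS_PER_MONTH = 730  # average hours in a month (8760 / 12)


def monthly_cost(hourly_rate_usd: float, hours: float = HOURS_PER_MONTH) -> float:
    """Approximate monthly cost for a given hourly rate and hours of use."""
    return round(hourly_rate_usd * hours, 2)


always_on = monthly_cost(0.19)              # n1-standard-4 running 24/7
working_hours = monthly_cost(0.19, 8 * 22)  # ASSUMPTION: 8 h/day, 22 days/month
savings = round(always_on - working_hours, 2)
```

Under these assumptions an instance that is stopped outside working hours costs about a quarter of one left running all month.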

Common Interview Questions

Q1: What is the difference between Vertex AI Workbench and Colab Enterprise?

Answer: Vertex AI Workbench provides managed JupyterLab with full control over infrastructure, including custom machine types and GPU/TPU support. Colab Enterprise is a fully managed service with automatic infrastructure management and built-in collaboration features. Choose Workbench for custom environments, Colab Enterprise for simplicity.

Q2: How do you connect Vertex AI Workbench to BigQuery?

Answer: Use the BigQuery client library in Python notebooks. Authenticate with the instance's service account (Application Default Credentials pick it up automatically) or Workload Identity Federation. Use the %%bigquery cell magic for SQL queries. Mount GCS buckets using GCSFuse for file-based data access.

Q3: What are the benefits of using notebooks for data engineering?

Answer: Notebooks provide interactive development, visualization, and documentation in one environment. They're excellent for data exploration and prototyping. For production, convert notebook code to scripts or DAGs. Use version control for notebook management.

Q4: How do you deploy notebook code to production?

Answer: 1) Extract production code from notebooks, 2) Convert to scripts or Airflow DAGs, 3) Use Cloud Build for CI/CD, 4) Deploy to Dataflow or Dataproc, 5) Schedule with Cloud Composer. Notebooks are for development; production code should be in version-controlled scripts.
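
Step 1 can be sketched with the standard library alone, since an .ipynb file is JSON. The function below joins code cells into one script and drops IPython magics and shell escapes; it is a simplified illustration (the function names are hypothetical), and `jupyter nbconvert --to script` is the usual tool for real projects.

```python
import json


def notebook_to_script(notebook: dict) -> str:
    """Join code cells into one script, dropping magics and shell escapes."""
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", [])
        text = "".join(source) if isinstance(source, list) else source
        # Lines starting with % or ! are IPython-only and would break a script.
        lines = [ln for ln in text.splitlines()
                 if not ln.lstrip().startswith(("%", "!"))]
        if any(ln.strip() for ln in lines):
            chunks.append("\n".join(lines))
    return "\n\n".join(chunks) + "\n"


def script_from_file(path: str) -> str:
    """Load an .ipynb file from disk and convert it."""
    with open(path, encoding="utf-8") as handle:
        return notebook_to_script(json.load(handle))


# Hypothetical in-memory notebook for demonstration.
sample_notebook = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Exploration"]},
        {"cell_type": "code", "source": ["%matplotlib inline\n", "import pandas as pd\n"]},
        {"cell_type": "code", "source": ["total = 1 + 1\n"]},
    ]
}
script = notebook_to_script(sample_notebook)
```

The extracted script still needs review before it goes into a DAG: exploratory cells, plots, and hardcoded paths usually have to be removed or parameterized.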

Q5: How do you secure notebook environments?

Answer: 1) Use VPC-SC for network isolation, 2) Enable CMEK for data encryption, 3) Use service accounts with minimal permissions, 4) Disable public IPs, 5) Implement IAM controls, 6) Enable audit logging, 7) Use Secret Manager for credentials.
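
For point 7, a notebook can fetch credentials at run time rather than storing them in cells. The following is a minimal sketch, assuming the `google-cloud-secret-manager` package is installed and the instance's service account can access the secret; the helper names and the `db-password` secret ID are hypothetical.

```python
def secret_version_name(project_id: str, secret_id: str,
                        version: str = "latest") -> str:
    """Build the resource name Secret Manager expects for a secret version."""
    return f"projects/{project_id}/secrets/{secret_id}/versions/{version}"


def read_secret(project_id: str, secret_id: str, version: str = "latest") -> str:
    """Fetch a secret value as text (hypothetical helper)."""
    # Imported lazily so the name helper works without the client library.
    from google.cloud import secretmanager

    client = secretmanager.SecretManagerServiceClient()
    name = secret_version_name(project_id, secret_id, version)
    response = client.access_secret_version(request={"name": name})
    return response.payload.data.decode("utf-8")
```

A cell would call something like `read_secret("my-project", "db-password")` and keep the value in memory only, so nothing sensitive is saved in the notebook file or its outputs.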
