Google Cloud Platform Overview for Data Engineering

GCP Project Structure

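A GCP project is the basic container for GCP resources and sits below the optional organization and folder levels of the resource hierarchy. Each project has a globally unique project ID and a project number, is linked to a billing account, and carries its own IAM policy and resource configuration.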

Project Hierarchy

🔐 GCP IAM Hierarchy & Roles

Interview Tip: Follow the principle of least privilege — grant only the permissions needed. Use predefined roles over basic roles. Service accounts should have minimal permissions and use Workload Identity in GKE instead of key-based auth.

Creating a Project

# Create a new project
gcloud projects create my-data-engineering-project --name="Data Engineering Project"

# Set the project as default
gcloud config set project my-data-engineering-project

# Enable billing
gcloud billing accounts list
gcloud billing projects link my-data-engineering-project --billing-account=ACCOUNT_ID

Project Best Practices

Use descriptive project names with department prefixes (e.g., analytics-data-lake)
Separate projects by environment (dev, staging, prod)
Use labels for cost allocation and resource management
Enable budget alerts for cost monitoring
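
The naming and environment-separation advice above can be sketched as a small helper. This is a minimal illustration, not an official convention: the build_project_id and default_labels functions and the department/workload/environment layout are assumptions made for this example. The validation pattern follows the documented project ID rules (6 to 30 characters, lowercase letters, digits, and hyphens, starting with a letter and not ending with a hyphen).

```python
import re

# Project ID rules: 6-30 chars, lowercase letters, digits, hyphens,
# must start with a letter and must not end with a hyphen.
PROJECT_ID_RE = re.compile(r"^[a-z][a-z0-9-]{4,28}[a-z0-9]$")

ENVIRONMENTS = ("dev", "staging", "prod")  # assumed environment names


def build_project_id(department: str, workload: str, env: str) -> str:
    """Build a hypothetical '<department>-<workload>-<env>' project ID."""
    if env not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {env!r}")
    project_id = f"{department}-{workload}-{env}".lower()
    if not PROJECT_ID_RE.fullmatch(project_id):
        raise ValueError(f"invalid project ID: {project_id!r}")
    return project_id


def default_labels(department: str, env: str) -> dict:
    """Labels for cost allocation (the keys are illustrative)."""
    return {"department": department, "environment": env}


if __name__ == "__main__":
    for env in ENVIRONMENTS:
        print(build_project_id("analytics", "data-lake", env))
```

Generating IDs from one function keeps dev, staging, and prod projects consistently named, which makes label-based cost reports easier to read.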

Billing Management

GCP billing is based on pay-as-you-go pricing with no upfront costs. Understanding billing is crucial for data engineering cost optimization.

Billing Account Structure

GCP Pricing Models for Data Engineering

Pricing Model	Max Discount	Description	Best For
On-Demand	None	Pay per use, no commitment	Dev/Test
Committed (1yr)	Up to 37%	1-year commitment	Steady production
Committed (3yr)	Up to 55%	3-year commitment	Long-term infra
Preemptible/Spot	Up to 91%	Short-lived VMs	Batch processing
Sustained Use	Up to 30%	Auto discounts for long use	Always-on
Serverless	N/A	Pay per query/invocation	Event-driven
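
To make the "up to" discounts listed above concrete, the following sketch computes the best-case cost of a workload under each model. The baseline spend is a made-up figure and best_case_cost is a hypothetical helper. Real savings are usually lower than these maximums and depend on machine type and region.

```python
# Maximum discounts from the pricing models above ("up to" figures, so
# these are best-case numbers, not guaranteed savings).
MAX_DISCOUNT = {
    "on_demand": 0.00,
    "committed_1yr": 0.37,
    "committed_3yr": 0.55,
    "spot": 0.91,
    "sustained_use": 0.30,
}


def best_case_cost(on_demand_cost: float, model: str) -> float:
    """Return the lowest possible cost under a pricing model."""
    return round(on_demand_cost * (1 - MAX_DISCOUNT[model]), 2)


if __name__ == "__main__":
    baseline = 1000.0  # hypothetical monthly on-demand spend in USD
    for model in MAX_DISCOUNT:
        print(f"{model:15s} {best_case_cost(baseline, model):8.2f}")
```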

⚠️ Cost Alert

Always monitor your BigQuery costs using INFORMATION_SCHEMA. Set up budget alerts at 50%, 80%, and 100% thresholds.

Cost Optimization Strategies

# Check current billing status
gcloud billing accounts describe ACCOUNT_ID

# Set up budget alerts (percent is a ratio: 0.50 means 50% of the budget)
gcloud billing budgets create \
  --billing-account=ACCOUNT_ID \
  --display-name="Data Engineering Budget" \
  --budget-amount=1000 \
  --threshold-rule=percent=0.50 \
  --threshold-rule=percent=0.80 \
  --threshold-rule=percent=1.00
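
Budget thresholds are fractions of the budget amount, so it can help to see which spend levels would actually trigger an alert. The sketch below is plain arithmetic under that assumption (1.0 means 100% of the budget). The function names are hypothetical and nothing here calls the Billing API.

```python
def threshold_amounts(budget: float, ratios: list) -> dict:
    """Map each threshold ratio (1.0 = 100%) to the spend that triggers it."""
    return {ratio: round(budget * ratio, 2) for ratio in sorted(ratios)}


def triggered(budget: float, ratios: list, spend: float) -> list:
    """Return the threshold ratios that the current spend has reached."""
    amounts = threshold_amounts(budget, ratios)
    return [ratio for ratio, amount in amounts.items() if spend >= amount]


if __name__ == "__main__":
    budget = 1000.0               # same amount as the budget example above
    ratios = [0.50, 0.80, 1.00]   # 50%, 80%, and 100% thresholds
    print(threshold_amounts(budget, ratios))
    print(triggered(budget, ratios, spend=820.0))
```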

Resource Pricing Models

Resource	Pricing Model	Best Practice
Compute Engine	Per-second billing (1-minute minimum)	Use Spot (preemptible) VMs for batch jobs
Cloud Storage	Per-GB per-month	Use lifecycle rules for cost optimization
BigQuery	Per-TiB scanned (on-demand)	Partition tables, use dry runs
Dataflow	Per-vCPU per-hour, billed per second	Use autoscaling, monitor utilization
Dataproc	Per-vCPU per-hour on top of Compute Engine charges	Use preemptible workers, auto-scaling
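
Per-second billing matters most for short batch jobs. The following sketch estimates the vCPU cost of a single VM run, assuming Compute Engine's per-second billing with a one-minute minimum. The hourly rate is a placeholder rather than a published price, and vm_cost is a hypothetical helper.

```python
import math

MIN_BILLED_SECONDS = 60  # per-second billing with a 1-minute minimum


def vm_cost(runtime_seconds: float, vcpus: int, usd_per_vcpu_hour: float) -> float:
    """Estimate the vCPU cost of one VM run under per-second billing."""
    billed_seconds = max(MIN_BILLED_SECONDS, math.ceil(runtime_seconds))
    return round(billed_seconds / 3600 * vcpus * usd_per_vcpu_hour, 4)


if __name__ == "__main__":
    rate = 0.03  # hypothetical USD per vCPU-hour, not a real list price
    print(vm_cost(30, vcpus=4, usd_per_vcpu_hour=rate))    # billed as 60 s
    print(vm_cost(5400, vcpus=4, usd_per_vcpu_hour=rate))  # 1.5 hours
```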

Identity and Access Management (IAM)

IAM controls who can access which resources. It's essential for securing data pipelines and lakes.

IAM Roles Hierarchy

📊 BigQuery Architecture for Data Engineering

Interview Tip: BigQuery separates storage and compute. Query compute is billed either on demand by bytes scanned or through capacity pricing by slots, and storage is billed separately. Always partition and cluster tables to reduce the bytes scanned.

Service Accounts

Service accounts are used by applications and compute resources to access GCP services.

# Create a service account for data pipelines
gcloud iam service-accounts create data-pipeline-sa \
    --display-name="Data Pipeline Service Account"

# Grant roles to the service account
gcloud projects add-iam-policy-binding my-project \
    --member="serviceAccount:data-pipeline-sa@my-project.iam.gserviceaccount.com" \
    --role="roles/bigquery.dataEditor"

gcloud projects add-iam-policy-binding my-project \
    --member="serviceAccount:data-pipeline-sa@my-project.iam.gserviceaccount.com" \
    --role="roles/storage.objectAdmin"

# Generate a key for the service account (only when keyless options such as
# Workload Identity are not available; store the file securely)
gcloud iam service-accounts keys create key.json \
    --iam-account=data-pipeline-sa@my-project.iam.gserviceaccount.com

Best Practices

Use service accounts for applications, not user accounts
Follow principle of least privilege
Rotate keys regularly
Use Workload Identity for GKE workloads
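
One way to check least privilege in practice is to scan a project's IAM policy for basic roles (roles/owner, roles/editor, roles/viewer). The sketch below assumes a policy dictionary shaped like the JSON printed by gcloud projects get-iam-policy PROJECT_ID --format=json. The function name and the sample members are invented for illustration.

```python
BASIC_ROLES = {"roles/owner", "roles/editor", "roles/viewer"}


def find_basic_role_bindings(policy: dict) -> list:
    """Return (member, role) pairs that are granted a basic role."""
    findings = []
    for binding in policy.get("bindings", []):
        role = binding.get("role")
        if role in BASIC_ROLES:
            for member in binding.get("members", []):
                findings.append((member, role))
    return findings


if __name__ == "__main__":
    # Hypothetical policy; members and project name are made up.
    sample_policy = {
        "bindings": [
            {"role": "roles/editor",
             "members": ["serviceAccount:data-pipeline-sa@my-project.iam.gserviceaccount.com"]},
            {"role": "roles/bigquery.dataEditor",
             "members": ["serviceAccount:data-pipeline-sa@my-project.iam.gserviceaccount.com"]},
        ]
    }
    for member, role in find_basic_role_bindings(sample_policy):
        print(f"{member} has basic role {role}; consider a predefined role")
```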

Regions and Zones

GCP resources are deployed in regions and zones. Understanding this is crucial for data residency, latency, and availability.

Global Architecture

Region Selection Criteria

# List available regions
gcloud compute regions list

# List zones in a region
gcloud compute zones list --filter="region:us-central1"

# Create a zonal resource (a VM lives in one zone of the region)
gcloud compute instances create my-instance \
    --zone=us-central1-a \
    --machine-type=e2-medium \
    --image-family=debian-11 \
    --image-project=debian-cloud

Data Residency Considerations

Service	Multi-Region Options	Single Region Options
BigQuery	US, EU	us-central1, europe-west1, etc.
Cloud Storage	US, EU, ASIA	us-central1, europe-west1, etc.
Cloud SQL	Regional only	us-central1, europe-west1, etc.
Firestore	nam5, eur3	us-east1, asia-southeast1, etc.
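
A data residency requirement can be checked mechanically once resource locations are known. The sketch below assumes an EU-only policy and a hand-built mapping of resource names to locations. Both the policy and the helper names are hypothetical, and a real audit would read locations from each service's API.

```python
# Multi-region names that keep data in the EU ("EU" for BigQuery and
# Cloud Storage, "eur3" for Firestore); single regions start with "europe-".
EU_MULTI_REGIONS = {"eu", "eur3"}
EU_REGION_PREFIX = "europe-"


def is_eu_location(location: str) -> bool:
    """Return True if a location name is an EU multi-region or EU region."""
    name = location.strip().lower()
    return name in EU_MULTI_REGIONS or name.startswith(EU_REGION_PREFIX)


def non_compliant(resources: dict) -> list:
    """Return the names of resources located outside the EU, sorted."""
    return sorted(name for name, loc in resources.items()
                  if not is_eu_location(loc))


if __name__ == "__main__":
    inventory = {  # hypothetical resource inventory
        "bigquery:analytics": "EU",
        "gcs:raw-landing": "europe-west1",
        "gcs:scratch": "us-central1",
    }
    print(non_compliant(inventory))
```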

Core Data Services Overview

BigQuery

Serverless, highly scalable data warehouse for analytics.

-- Example: Creating a dataset and table
CREATE SCHEMA my_dataset
OPTIONS (
  location = 'US',
  description = 'Analytics dataset'
);

CREATE TABLE my_dataset.sales (
  sale_id INT64,
  product_id STRING,
  quantity INT64,
  amount FLOAT64,
  sale_date DATE
)
PARTITION BY sale_date
CLUSTER BY product_id;
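
Partitioning by sale_date pays off because a date filter lets BigQuery skip partitions entirely. The sketch below models that effect with made-up partition sizes. It does not query BigQuery, and the price per TiB is a parameter the caller supplies rather than a quoted rate.

```python
from datetime import date
from typing import Optional


def bytes_scanned(partitions: dict,
                  start: Optional[date] = None,
                  end: Optional[date] = None) -> int:
    """Sum the bytes of the daily partitions a date-range filter touches."""
    return sum(size for day, size in partitions.items()
               if (start is None or day >= start)
               and (end is None or day <= end))


def on_demand_cost(num_bytes: int, usd_per_tib: float) -> float:
    """Convert scanned bytes into an on-demand cost estimate."""
    return round(num_bytes / 2**40 * usd_per_tib, 4)


if __name__ == "__main__":
    gib = 2**30
    # Hypothetical table: 30 daily partitions of 10 GiB each.
    table = {date(2024, 1, day): 10 * gib for day in range(1, 31)}
    full = bytes_scanned(table)
    week = bytes_scanned(table, date(2024, 1, 1), date(2024, 1, 7))
    print(f"full scan: {full / gib:.0f} GiB, one week: {week / gib:.0f} GiB")
```

A dry run (bq query --dry_run) reports the real bytes a query would process, which is the number to feed into on_demand_cost.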

Cloud Storage (GCS)

Object storage for data lake foundation.

# Create a bucket with lifecycle rules
gsutil mb -l us-central1 -c STANDARD gs://my-data-lake-bucket

# Set lifecycle rule for cost optimization
cat > lifecycle.json << EOF
{
  "rule": [
    {
      "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
      "condition": {"age": 30}
    },
    {
      "action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
      "condition": {"age": 90}
    },
    {
      "action": {"type": "Delete"},
      "condition": {"age": 365}
    }
  ]
}
EOF

gsutil lifecycle set lifecycle.json gs://my-data-lake-bucket
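
To reason about what these rules do over time, the following sketch replays the same lifecycle configuration against an object's age. It is a simplified model: it assumes objects start in STANDARD and that each rule applies as soon as its age condition is met, whereas real lifecycle actions run asynchronously and may lag. The function name is hypothetical.

```python
import json

# Same rules as the lifecycle.json shown above.
LIFECYCLE = json.loads("""
{
  "rule": [
    {"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
     "condition": {"age": 30}},
    {"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
     "condition": {"age": 90}},
    {"action": {"type": "Delete"}, "condition": {"age": 365}}
  ]
}
""")


def lifecycle_outcome(age_days: int, initial_class: str = "STANDARD") -> str:
    """Return the storage class (or 'DELETED') expected at a given age."""
    state = initial_class
    # Walk the rules from the youngest age condition to the oldest;
    # the last matching rule determines the outcome.
    for rule in sorted(LIFECYCLE["rule"], key=lambda r: r["condition"]["age"]):
        if age_days >= rule["condition"]["age"]:
            action = rule["action"]
            state = "DELETED" if action["type"] == "Delete" else action["storageClass"]
    return state


if __name__ == "__main__":
    for age in (0, 30, 90, 365):
        print(age, lifecycle_outcome(age))
```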

Dataflow

Managed service for stream and batch data processing.

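# Example: Simple Dataflow pipeline
# Runs locally by default; pass --runner=DataflowRunner plus --project,
# --region, and --temp_location on the command line to run on Dataflow.
import apache_beam as beam

def to_row(fields):
    # WriteToBigQuery expects dicts keyed by column name
    return {'id': fields[0], 'name': fields[1], 'value': int(fields[2])}

with beam.Pipeline() as pipeline:
    (
        pipeline
        | 'Read from GCS' >> beam.io.ReadFromText('gs://bucket/input/*.csv')
        | 'Parse CSV' >> beam.Map(lambda line: line.split(','))
        # Keep rows with exactly the three schema fields and an integer value
        | 'Filter valid records' >> beam.Filter(
            lambda row: len(row) == 3 and row[2].strip().isdigit())
        | 'Convert to dicts' >> beam.Map(to_row)
        | 'Write to BigQuery' >> beam.io.WriteToBigQuery(
            'project:dataset.table',
            schema='id:STRING,name:STRING,value:INTEGER',
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND
        )
    )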
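
The parsing step is easier to test when it lives in a plain function outside the pipeline. The sketch below is one possible version, assuming the three-column schema (id, name, value) used in the example above. parse_line is a hypothetical name, and it uses the csv module so that quoted commas are handled.

```python
import csv
from typing import Iterator

FIELDS = ("id", "name", "value")  # matches the BigQuery schema above


def parse_line(line: str) -> Iterator[dict]:
    """Yield one BigQuery-ready row dict, or nothing if the line is invalid."""
    try:
        fields = next(csv.reader([line]))
    except (csv.Error, StopIteration):
        return
    if len(fields) != len(FIELDS):
        return  # wrong number of columns
    try:
        value = int(fields[2])
    except ValueError:
        return  # header row or non-numeric value
    yield {"id": fields[0], "name": fields[1].strip(), "value": value}


if __name__ == "__main__":
    for raw in ['a1,"Widget, large",42', "id,name,value", "broken,row"]:
        print(raw, "->", list(parse_line(raw)))
```

Because it yields zero or one rows, a function like this could be applied with beam.FlatMap(parse_line) in place of separate parse and filter steps.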

Pub/Sub

Real-time messaging service for event-driven architectures.

# Example: Publishing messages to Pub/Sub
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path('my-project', 'my-topic')

data = b'{"event": "purchase", "amount": 99.99}'
future = publisher.publish(topic_path, data)
print(f'Published message ID: {future.result()}')
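
Pub/Sub message data must be bytes and attribute values must be strings, so it is common to wrap serialization in a helper. The following is a minimal sketch under the assumption that events are JSON-serializable dictionaries. encode_event and decode_event are hypothetical names, not part of the client library.

```python
import json


def encode_event(event: dict, **attributes) -> tuple:
    """Serialize an event to UTF-8 JSON bytes plus string-valued attributes."""
    data = json.dumps(event, separators=(",", ":"), sort_keys=True).encode("utf-8")
    attrs = {key: str(value) for key, value in attributes.items()}
    return data, attrs


def decode_event(data: bytes) -> dict:
    """Inverse of encode_event for the message payload."""
    return json.loads(data.decode("utf-8"))


if __name__ == "__main__":
    data, attrs = encode_event({"event": "purchase", "amount": 99.99},
                               source="web", schema_version=1)
    print(data, attrs)
```

The result could then be passed to the publisher as publisher.publish(topic_path, data, **attrs), since the client accepts attributes as keyword arguments.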

Data Engineering Architecture Patterns

Batch Processing Pattern

Architecture Diagram

Data Sources → GCS (Raw) → Dataflow → GCS (Processed) → BigQuery → Looker

Real-time Processing Pattern

Architecture Diagram

Data Sources → Pub/Sub → Dataflow Streaming → BigQuery → Dashboards

Data Lake Pattern

Architecture Diagram

Raw Data → GCS (Bronze) → GCS (Silver) → GCS (Gold) → BigQuery → Analytics
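
The Bronze, Silver, and Gold layers in the data lake pattern are usually just path conventions inside Cloud Storage. The sketch below shows one possible layout. The layer/dataset/dt=YYYY-MM-DD structure and both function names are assumptions for illustration, not a GCP requirement.

```python
from datetime import date

LAYERS = ("bronze", "silver", "gold")


def object_path(bucket: str, layer: str, dataset: str, day: date, filename: str) -> str:
    """Build a date-partitioned object path for a data lake layer."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer!r}")
    return f"gs://{bucket}/{layer}/{dataset}/dt={day.isoformat()}/{filename}"


def promote(path: str) -> str:
    """Return the same object path in the next layer (bronze -> silver -> gold)."""
    scheme, _, bucket, layer, rest = path.split("/", 4)
    if layer not in LAYERS or layer == LAYERS[-1]:
        raise ValueError(f"cannot promote from layer: {layer!r}")
    next_layer = LAYERS[LAYERS.index(layer) + 1]
    return f"{scheme}//{bucket}/{next_layer}/{rest}"


if __name__ == "__main__":
    raw = object_path("my-data-lake-bucket", "bronze", "sales",
                      date(2024, 1, 15), "part-0001.csv")
    print(raw)
    print(promote(raw))
```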

Cost Monitoring and Optimization

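# Monitor BigQuery costs: TiB billed per user over the last 7 days
bq query --use_legacy_sql=false '
SELECT user_email, SUM(total_bytes_billed) / POW(1024, 4) AS tib_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY user_email
ORDER BY tib_billed DESC'

# List Dataflow jobs (open a job to inspect its resource usage)
gcloud dataflow jobs list --region=us-central1

# Set up cost alerts (percent is a ratio: 0.80 means 80% of the budget)
gcloud billing budgets create \
  --billing-account=ACCOUNT_ID \
  --display-name="Data Engineering Alert" \
  --budget-amount=500 \
  --threshold-rule=percent=0.80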

Next Steps

Set up a GCP project with billing
Configure IAM and service accounts
Create a data lake in Cloud Storage
Set up BigQuery for analytics
Build your first Dataflow pipeline
Implement monitoring and cost controls
