Google Cloud Platform Overview for Data Engineering
GCP Project Structure
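A GCP project is the basic organizing unit for GCP resources: every resource belongs to exactly one project, and projects can in turn sit under folders and an organization. Each project has a unique project ID and project number, and carries its own billing account link, IAM policy, and enabled APIs.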
Project Hierarchy
Creating a Project
# Create a new project
gcloud projects create my-data-engineering-project --name="Data Engineering Project"
# Set the project as default
gcloud config set project my-data-engineering-project
# Enable billing
gcloud billing accounts list
gcloud billing projects link my-data-engineering-project --billing-account=ACCOUNT_ID
Project Best Practices
- Use descriptive project names with department prefixes (e.g., analytics-data-lake)
- Separate projects by environment (dev, staging, prod)
- Use labels for cost allocation and resource management
- Enable budget alerts for cost monitoring
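The naming and labeling practices above can be sketched as a small helper. This is a minimal illustration, not an official tool: the function names, the `department-workload-env` pattern, and the label keys are assumptions made for the example. The only GCP rule it relies on is the project ID format (6 to 30 characters, lowercase letters, digits, and hyphens, starting with a letter and not ending with a hyphen).

```python
import re

# Project ID format: 6-30 chars, lowercase letters, digits, hyphens;
# must start with a letter and must not end with a hyphen.
PROJECT_ID_RE = re.compile(r"[a-z][a-z0-9-]{4,28}[a-z0-9]")

# Environments taken from the best-practice list above.
ENVIRONMENTS = ("dev", "staging", "prod")


def build_project_id(department: str, workload: str, env: str) -> str:
    """Build a project ID like 'analytics-data-lake-dev' (hypothetical convention)."""
    if env not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {env!r}")
    project_id = f"{department}-{workload}-{env}".lower()
    if not PROJECT_ID_RE.fullmatch(project_id):
        raise ValueError(f"invalid project ID: {project_id!r}")
    return project_id


def default_labels(department: str, env: str) -> dict:
    """Labels for cost allocation; the keys are illustrative, not required by GCP."""
    return {"department": department, "environment": env}


print(build_project_id("analytics", "data-lake", "dev"))
```

Validating the ID before calling `gcloud projects create` catches names that are too long or badly formed without a round trip to the API.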
Billing Management
GCP billing is based on pay-as-you-go pricing with no upfront costs. Understanding billing is crucial for data engineering cost optimization.
Billing Account Structure
Monitor BigQuery costs by querying the INFORMATION_SCHEMA.JOBS views, which record the bytes billed for each query job. Set up budget alerts at several thresholds, such as 50%, 75%, and 90% of the budget.
Cost Optimization Strategies
# Check current billing status
gcloud billing accounts describe ACCOUNT_ID
# Set up budget alerts (thresholds are fractions of the budget: 0.50 = 50%)
gcloud billing budgets create \
  --billing-account=ACCOUNT_ID \
  --display-name="Data Engineering Budget" \
  --budget-amount=1000USD \
  --threshold-rule=percent=0.50 \
  --threshold-rule=percent=0.75 \
  --threshold-rule=percent=0.90
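To make the alert behavior concrete, the sketch below works out which of the 50%, 75%, and 90% thresholds a given spend has reached against the 1000 budget from the command above. The function name is hypothetical, and it only models actual spend, not forecasted spend.

```python
def crossed_thresholds(budget: float, spend: float, percents=(50, 75, 90)) -> list:
    """Return the threshold percentages whose amount the spend has reached."""
    return [p for p in percents if spend >= budget * p / 100]


# With a budget of 1000, a spend of 760 has passed the 50% and 75% marks.
print(crossed_thresholds(1000, 760))
```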
Resource Pricing Models
| Resource | Pricing Model | Best Practice |
|---|---|---|
| Compute Engine | Per-second billing (1-minute minimum) | Use Spot (preemptible) VMs for fault-tolerant batch jobs |
| Cloud Storage | Per-GB per-month | Use lifecycle rules for cost optimization |
| BigQuery | Per-TiB scanned (on-demand queries) | Partition tables, use dry runs |
| Dataflow | Per vCPU-hour plus memory and storage, billed per second | Use autoscaling, monitor utilization |
| Dataproc | Per vCPU-hour on top of Compute Engine charges, billed per second | Use preemptible workers, auto-scaling |
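As a rough illustration of the BigQuery row, the sketch below converts a byte count (for example, the figure reported by a dry run such as `bq query --dry_run`) into an estimated on-demand cost. The price per TiB is deliberately a parameter: the 5.0 used here is a placeholder, not a quoted rate, so check the current price list for your region.

```python
TIB = 1024 ** 4  # bytes in one tebibyte


def estimate_query_cost(bytes_scanned: int, price_per_tib: float) -> float:
    """Estimate on-demand query cost from the bytes a query would scan."""
    if bytes_scanned < 0:
        raise ValueError("bytes_scanned must be non-negative")
    return bytes_scanned / TIB * price_per_tib


# Half a TiB at a placeholder price of 5.0 per TiB.
print(estimate_query_cost(TIB // 2, price_per_tib=5.0))
```

Because partitioning and clustering reduce the bytes scanned, they lower this estimate directly.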
Identity and Access Management (IAM)
IAM controls who can access which resources. It's essential for securing data pipelines and lakes.
IAM Roles Hierarchy
Service Accounts
Service accounts are used by applications and compute resources to access GCP services.
# Create a service account for data pipelines
gcloud iam service-accounts create data-pipeline-sa \
--display-name="Data Pipeline Service Account"
# Grant roles to the service account
gcloud projects add-iam-policy-binding my-project \
--member="serviceAccount:data-pipeline-sa@my-project.iam.gserviceaccount.com" \
--role="roles/bigquery.dataEditor"
gcloud projects add-iam-policy-binding my-project \
--member="serviceAccount:data-pipeline-sa@my-project.iam.gserviceaccount.com" \
--role="roles/storage.objectAdmin"
# Generate a key for the service account
gcloud iam service-accounts keys create key.json \
--iam-account=data-pipeline-sa@my-project.iam.gserviceaccount.com
Best Practices
- Use service accounts for applications, not user accounts
- Follow the principle of least privilege: grant only the roles a workload needs
- Rotate service account keys regularly
- Use Workload Identity for GKE workloads
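One way to check least privilege is to scan a project's IAM policy for service accounts that hold broad basic roles. The sketch below assumes the policy has the shape printed by `gcloud projects get-iam-policy PROJECT_ID --format=json` (a `bindings` list of `role` and `members`). The function name and the sample policy are made up for illustration.

```python
# Basic roles that are usually too broad for a pipeline service account.
BASIC_ROLES = {"roles/owner", "roles/editor"}


def overprivileged_service_accounts(policy: dict) -> list:
    """Return (member, role) pairs where a service account holds a basic role."""
    findings = []
    for binding in policy.get("bindings", []):
        role = binding.get("role")
        if role not in BASIC_ROLES:
            continue
        for member in binding.get("members", []):
            if member.startswith("serviceAccount:"):
                findings.append((member, role))
    return sorted(findings)


# Hypothetical policy used only to exercise the function.
sample_policy = {
    "bindings": [
        {"role": "roles/editor",
         "members": ["serviceAccount:data-pipeline-sa@my-project.iam.gserviceaccount.com",
                     "user:alice@example.com"]},
        {"role": "roles/bigquery.dataEditor",
         "members": ["serviceAccount:data-pipeline-sa@my-project.iam.gserviceaccount.com"]},
    ]
}
print(overprivileged_service_accounts(sample_policy))
```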
Regions and Zones
GCP resources are deployed in regions and zones. Understanding this is crucial for data residency, latency, and availability.
Global Architecture
Region Selection Criteria
# List available regions
gcloud compute regions list
# List zones in a region
gcloud compute zones list --filter="region:us-central1"
# Create a zonal resource (a VM instance in zone us-central1-a)
gcloud compute instances create my-instance \
--zone=us-central1-a \
--machine-type=e2-medium \
--image-family=debian-11 \
--image-project=debian-cloud
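Zone names embed their region: dropping the final `-a`, `-b`, and so on from `us-central1-a` gives `us-central1`. The sketch below uses that naming convention to check whether a set of zonal resources is co-located in one region, which matters for latency and cross-region transfer cost. The function names are hypothetical.

```python
def region_of(zone: str) -> str:
    """Derive the region from a zone name, e.g. 'us-central1-a' -> 'us-central1'."""
    region, sep, suffix = zone.rpartition("-")
    if not sep or not region or not suffix:
        raise ValueError(f"not a zone name: {zone!r}")
    return region


def same_region(zones) -> bool:
    """True when every zone in the iterable belongs to a single region."""
    return len({region_of(z) for z in zones}) <= 1


print(region_of("us-central1-a"))
print(same_region(["us-central1-a", "us-central1-b"]))
```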
Data Residency Considerations
| Service | Multi-Region Options | Single Region Options |
|---|---|---|
| BigQuery | US, EU, ASIA | us-central1, europe-west1, etc. |
| Cloud Storage | US, EU, ASIA | us-central1, europe-west1, etc. |
| Cloud SQL | Regional only | us-central1, europe-west1, etc. |
| Firestore | nam5, eur3 | asia-southeast1, us-east1, etc. |
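A residency requirement can be expressed as an allow-list of locations and checked against an inventory of resources. The sketch below is a minimal illustration under assumed inputs: the inventory dictionary, the resource names, and the EU-only policy are all hypothetical, and in practice the locations would come from each service's API or CLI.

```python
def residency_violations(resources: dict, allowed: set) -> list:
    """Return names of resources whose location is outside the allowed set."""
    allowed_normalized = {loc.lower() for loc in allowed}
    return sorted(
        name for name, location in resources.items()
        if location.lower() not in allowed_normalized
    )


# Hypothetical inventory: resource name -> location.
inventory = {
    "bq:analytics_dataset": "EU",
    "gcs:my-data-lake-bucket": "us-central1",
    "sql:orders-db": "europe-west1",
}
print(residency_violations(inventory, allowed={"EU", "europe-west1"}))
```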
Core Data Services Overview
BigQuery
Serverless, highly scalable data warehouse for analytics.
-- Example: Creating a dataset and table
CREATE SCHEMA my_dataset
OPTIONS (
  location = 'US',
  description = 'Analytics dataset'
);

CREATE TABLE my_dataset.sales (
  sale_id INT64,
  product_id STRING,
  quantity INT64,
  amount FLOAT64,
  sale_date DATE
)
PARTITION BY sale_date
CLUSTER BY product_id;
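Partitioning only saves money when queries filter on the partitioning column. The sketch below builds a query against the `my_dataset.sales` table above with a constant date range on `sale_date`, so BigQuery can prune partitions outside that range. The function name and the aggregate chosen are illustrative assumptions. The dates are rendered from `date` objects rather than raw strings, which keeps the literal well-formed.

```python
from datetime import date


def sales_revenue_query(start: date, end: date) -> str:
    """Build a revenue-by-product query restricted to a sale_date range."""
    if start > end:
        raise ValueError("start must not be after end")
    return (
        "SELECT product_id, SUM(amount) AS revenue\n"
        "FROM my_dataset.sales\n"
        f"WHERE sale_date BETWEEN DATE '{start.isoformat()}' "
        f"AND DATE '{end.isoformat()}'\n"
        "GROUP BY product_id"
    )


print(sales_revenue_query(date(2024, 1, 1), date(2024, 1, 31)))
```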
Cloud Storage (GCS)
Object storage for data lake foundation.
# Create a bucket
gsutil mb -l us-central1 -c STANDARD gs://my-data-lake-bucket
# Define lifecycle rules for cost optimization:
# Nearline after 30 days, Coldline after 90 days, delete after 365 days
cat > lifecycle.json << EOF
{
  "rule": [
    {
      "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
      "condition": {"age": 30}
    },
    {
      "action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
      "condition": {"age": 90}
    },
    {
      "action": {"type": "Delete"},
      "condition": {"age": 365}
    }
  ]
}
EOF
# Apply the lifecycle rules to the bucket
gsutil lifecycle set lifecycle.json gs://my-data-lake-bucket
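To see what these rules do to an object over time, the sketch below evaluates the same configuration for a given object age. It is a simplified model, not the service's implementation: it considers only the `age` condition, ignores the object's current storage class, and encodes the documented precedence that Delete wins over SetStorageClass and that the colder class wins among storage class changes. The function name and the `COLDNESS` ranking are assumptions made for the example.

```python
import json

LIFECYCLE = json.loads("""
{
  "rule": [
    {"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
     "condition": {"age": 30}},
    {"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
     "condition": {"age": 90}},
    {"action": {"type": "Delete"}, "condition": {"age": 365}}
  ]
}
""")

# Higher number = colder, cheaper at-rest storage class.
COLDNESS = {"STANDARD": 0, "NEARLINE": 1, "COLDLINE": 2, "ARCHIVE": 3}


def action_for_age(age_days: int, config: dict = LIFECYCLE):
    """Return 'Delete', a target storage class, or None for an object of this age."""
    matching = [
        rule["action"] for rule in config["rule"]
        if age_days >= rule["condition"]["age"]
    ]
    if not matching:
        return None
    if any(action["type"] == "Delete" for action in matching):
        return "Delete"
    coldest = max(matching, key=lambda action: COLDNESS[action["storageClass"]])
    return coldest["storageClass"]


for age in (10, 45, 100, 400):
    print(age, action_for_age(age))
```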
Dataflow
Managed service for stream and batch data processing.
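# Example: Simple Dataflow pipeline
import apache_beam as beam

FIELDS = ['id', 'name', 'value']  # column order expected in each CSV line

with beam.Pipeline() as pipeline:
    (
        pipeline
        | 'Read from GCS' >> beam.io.ReadFromText('gs://bucket/input/*.csv')
        | 'Parse CSV' >> beam.Map(lambda line: line.split(','))
        # Keep only rows with exactly one value per schema field
        | 'Filter valid records' >> beam.Filter(lambda row: len(row) == len(FIELDS))
        # WriteToBigQuery expects dictionaries keyed by column name;
        # int() fails on non-numeric values such as a header row
        | 'To dict' >> beam.Map(
            lambda row: {'id': row[0], 'name': row[1], 'value': int(row[2])})
        | 'Write to BigQuery' >> beam.io.WriteToBigQuery(
            'project:dataset.table',
            schema='id:STRING,name:STRING,value:INTEGER',
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND
        )
    )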
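The parse and filter steps are easier to test when they live in a plain function that needs no Beam runner. The sketch below is one possible version, using only the standard library. The function name is hypothetical, and the three fields follow the `id`, `name`, `value` schema passed to `WriteToBigQuery` above. Using the `csv` module instead of `split(',')` handles quoted values that contain commas.

```python
import csv

FIELDS = ("id", "name", "value")


def parse_record(line: str):
    """Parse one CSV line into a row dict, or return None if it is malformed."""
    row = next(csv.reader([line]), [])
    if len(row) != len(FIELDS):
        return None
    try:
        value = int(row[2])
    except ValueError:
        return None  # non-numeric value, e.g. a header line
    return {"id": row[0], "name": row[1], "value": value}


print(parse_record('a1,"Widget, large",7'))
print(parse_record("id,name,value"))
```

In a pipeline, a function like this could be applied with `beam.Map` followed by a `beam.Filter` that drops the `None` results.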
Pub/Sub
Real-time messaging service for event-driven architectures.
# Example: Publishing messages to Pub/Sub
from google.cloud import pubsub_v1
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path('my-project', 'my-topic')
data = b'{"event": "purchase", "amount": 99.99}'
future = publisher.publish(topic_path, data)
print(f'Published message ID: {future.result()}')
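Pub/Sub message data must be bytes, and message attributes must be strings. The sketch below shows one way to build both from Python values instead of hand-writing a byte literal. The helper names are hypothetical, and sorting the keys is a choice made here to get a stable payload, not something Pub/Sub requires.

```python
import json


def encode_event(event: dict) -> bytes:
    """Serialize an event dict to compact UTF-8 JSON bytes for the message data."""
    return json.dumps(event, separators=(",", ":"), sort_keys=True).encode("utf-8")


def encode_attributes(attributes: dict) -> dict:
    """Convert attribute keys and values to strings."""
    return {str(key): str(value) for key, value in attributes.items()}


payload = encode_event({"event": "purchase", "amount": 99.99})
print(payload)
print(encode_attributes({"source": "web", "version": 2}))
```

The resulting bytes can be passed as the `data` argument of `publisher.publish`, with the attributes supplied as keyword arguments.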
Data Engineering Architecture Patterns
Batch Processing Pattern
Data Sources → GCS (Raw) → Dataflow → GCS (Processed) → BigQuery → Looker
Real-time Processing Pattern
Data Sources → Pub/Sub → Dataflow Streaming → BigQuery → Dashboards
Data Lake Pattern
Raw Data → GCS (Bronze) → GCS (Silver) → GCS (Gold) → BigQuery → Analytics
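The Bronze, Silver, and Gold stages are often kept apart by object prefix within a bucket. The sketch below shows one possible path layout. The `layer/source/dt=YYYY-MM-DD/filename` convention, the function name, and the sample source name are assumptions for illustration, not a GCS requirement.

```python
from datetime import date

LAYERS = ("bronze", "silver", "gold")


def layer_path(bucket: str, layer: str, source: str, day: date, filename: str) -> str:
    """Build a gs:// path for one file in one data lake layer."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer!r}")
    return f"gs://{bucket}/{layer}/{source}/dt={day.isoformat()}/{filename}"


print(layer_path("my-data-lake-bucket", "bronze", "orders", date(2024, 1, 15), "part-0001.csv"))
```

Date-based prefixes like `dt=2024-01-15` make it simple to reprocess or expire a single day of data in any layer.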
Cost Monitoring and Optimization
# Inspect a table's storage footprint in bytes (one driver of BigQuery cost)
bq show --format=prettyjson my-project:my_dataset.my_table | jq '.numBytes'
# List Dataflow jobs to review (the listing itself does not include cost figures)
gcloud dataflow jobs list --region=us-central1
# Set up cost alerts (the threshold is a fraction of the budget: 0.80 = 80%)
gcloud billing budgets create \
  --billing-account=ACCOUNT_ID \
  --display-name="Data Engineering Alert" \
  --budget-amount=500USD \
  --threshold-rule=percent=0.80
Next Steps
- Set up a GCP project with billing
- Configure IAM and service accounts
- Create a data lake in Cloud Storage
- Set up BigQuery for analytics
- Build your first Dataflow pipeline
- Implement monitoring and cost controls