VPC Architecture for Data Engineering
A Virtual Private Cloud (VPC) is the fundamental networking construct in GCP. For data engineers, proper VPC design ensures secure, low-latency communication between data services while controlling costs.
VPC Components
VPC Modes: Auto vs. Custom
Auto Mode VPC
# Auto mode creates subnets automatically in all regions
gcloud compute networks create auto-vpc \
--subnet-mode=auto \
--bgp-routing-mode=regional
# Auto mode creates /20 subnets in each region
# 10.128.0.0/20 for us-central1
# 10.142.0.0/20 for us-east1
# etc.
Custom Mode VPC (Recommended for Data Engineering)
# Custom mode gives you full control over subnet CIDR ranges
gcloud compute networks create data-engineering-vpc \
--subnet-mode=custom \
--bgp-routing-mode=global
# Create subnets for different data workloads
gcloud compute networks subnets create data-pipeline-subnet \
--network=data-engineering-vpc \
--region=us-central1 \
--range=10.0.1.0/24 \
--enable-private-ip-google-access
gcloud compute networks subnets create analytics-subnet \
--network=data-engineering-vpc \
--region=us-central1 \
--range=10.0.2.0/24 \
--enable-private-ip-google-access
gcloud compute networks subnets create management-subnet \
--network=data-engineering-vpc \
--region=us-central1 \
--range=10.0.3.0/24
Best Practice: Always use Custom mode VPC for data engineering. It provides predictable IP addressing, allows you to create subnets only where needed, and prevents IP range conflicts when peering with other VPCs.
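The range-conflict point can be checked before any subnet or peering is created. The following is a minimal sketch using Python's standard `ipaddress` module; the `find_overlaps` helper and the `peer-range` entry are illustrative names and values, not part of any GCP tooling.

```python
import ipaddress
from itertools import combinations

def find_overlaps(ranges):
    """Return pairs of names whose CIDR ranges overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in ranges.items()}
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(nets.items(), 2)
        if net_a.overlaps(net_b)
    ]

# Subnet ranges from the custom mode example
planned = {
    "data-pipeline-subnet": "10.0.1.0/24",
    "analytics-subnet": "10.0.2.0/24",
    "management-subnet": "10.0.3.0/24",
}
print(find_overlaps(planned))  # []

# Hypothetical range from a VPC you intend to peer with
with_peer = {**planned, "peer-range": "10.0.2.128/25"}
print(find_overlaps(with_peer))  # [('analytics-subnet', 'peer-range')]
```

Running a check like this against every VPC you expect to peer with is cheap, whereas renumbering a subnet after workloads are deployed is not.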
Private Google Access
Private Google Access allows VMs without external IP addresses to reach Google APIs and services (BigQuery, GCS, Pub/Sub, etc.) over Google's internal network.
Configuration
# Enable Private Google Access on a subnet
gcloud compute networks subnets update data-pipeline-subnet \
--region=us-central1 \
--enable-private-ip-google-access
# Verify Private Google Access is enabled
gcloud compute networks subnets describe data-pipeline-subnet \
--region=us-central1 \
--format="value(privateIpGoogleAccess)"
Private Service Connect (Modern Alternative)
Private Service Connect provides more granular control over private access to Google APIs.
# Reserve a global internal IP address for a Private Service Connect endpoint.
# The address must sit outside every subnet range in the VPC.
gcloud compute addresses create bigquery-endpoint \
--global \
--network=data-engineering-vpc \
--purpose=PRIVATE_SERVICE_CONNECT \
--addresses=10.0.100.1
# Create the endpoint (a global forwarding rule). Endpoints for Google APIs target
# an API bundle (all-apis or vpc-sc), not a single service such as BigQuery, and
# the endpoint name is limited to 1-20 lowercase letters and digits.
gcloud compute forwarding-rules create bigqueryendpoint \
--global \
--network=data-engineering-vpc \
--address=bigquery-endpoint \
--target-google-apis-bundle=all-apis
VPC Peering for Data Engineering
VPC Peering connects two VPC networks, allowing private communication across projects.
Setting Up VPC Peering
# Create peering from Project A to Project B
gcloud compute networks peerings create peer-a-to-b \
--project=project-a-data-platform \
--network=data-platform-vpc \
--peer-project=project-b-analytics \
--peer-network=analytics-vpc
# Create the matching peering from Project B to Project A.
# The peering only becomes ACTIVE once both sides are configured;
# subnet routes are then exchanged automatically.
gcloud compute networks peerings create peer-b-to-a \
--project=project-b-analytics \
--network=analytics-vpc \
--peer-project=project-a-data-platform \
--peer-network=data-platform-vpc
Warning: VPC Peering is non-transitive. If Project A peers with B, and B peers with C, A cannot reach C. For this scenario, consider Network Connectivity Center or implement full mesh peering.
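To make the non-transitivity concrete, here is a small sketch that models peerings as unordered pairs and counts what a full mesh would require. The network names and both helper functions are hypothetical.

```python
from itertools import combinations

def can_reach(peerings, src, dst):
    """Peering is non-transitive: only directly peered networks can talk."""
    return frozenset((src, dst)) in {frozenset(p) for p in peerings}

def full_mesh(networks):
    """Every pair of networks needs its own peering relationship."""
    return list(combinations(sorted(networks), 2))

peerings = [("vpc-a", "vpc-b"), ("vpc-b", "vpc-c")]
print(can_reach(peerings, "vpc-a", "vpc-b"))  # True
print(can_reach(peerings, "vpc-a", "vpc-c"))  # False: no route via vpc-b

# A full mesh grows as n * (n - 1) / 2 relationships
print(len(full_mesh(["vpc-a", "vpc-b", "vpc-c", "vpc-d"])))  # 6
```

Each relationship also has to be configured from both sides, so the number of peering commands is double the pair count. That quadratic growth is why a hub-style design tends to scale better than full mesh once more than a handful of VPCs are involved.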
Shared VPC for Data Engineering
Shared VPC allows an organization to connect resources from multiple projects to a common VPC network, centrally managed by a host project.
# Enable Shared VPC in host project
gcloud compute shared-vpc enable PROJECT_ID_HOST
# Associate service project
gcloud compute shared-vpc associated-projects add PROJECT_ID_SERVICE \
--host-project=PROJECT_ID_HOST
Cloud NAT for Data Engineering
Cloud NAT gives VMs without external IPs outbound internet access without exposing them to unsolicited inbound connections. It is useful for Dataproc clusters and other VMs that need to download packages or call external APIs.
# Create Cloud Router
gcloud compute routers create data-engineering-router \
--network=data-engineering-vpc \
--region=us-central1
# Create Cloud NAT with logging for both translations and errors
gcloud compute routers nats create data-engineering-nat \
--router=data-engineering-router \
--region=us-central1 \
--auto-allocate-nat-external-ips \
--nat-all-subnet-ip-ranges \
--enable-logging \
--log-filter=ALL
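When NAT IPs are reserved manually instead of auto-allocated, it helps to estimate how many are needed. The sketch below assumes static port allocation, where each NAT IP offers 64,512 usable source ports per protocol and each VM is reserved a minimum of 64 ports by default. The function name and the VM counts are illustrative.

```python
import math

PORTS_PER_NAT_IP = 64_512        # 65,536 minus the 1,024 well-known ports
DEFAULT_MIN_PORTS_PER_VM = 64    # Cloud NAT default with static allocation

def nat_ips_needed(vm_count, min_ports_per_vm=DEFAULT_MIN_PORTS_PER_VM):
    """Estimate NAT IPs required so every VM gets its reserved ports."""
    vms_per_ip = PORTS_PER_NAT_IP // min_ports_per_vm
    return math.ceil(vm_count / vms_per_ip)

print(nat_ips_needed(500))        # 1
print(nat_ips_needed(500, 1024))  # 8
```

Raising the per-VM minimum is a common response to port exhaustion on workers that open many concurrent connections to the same destination, and the second call shows how quickly that increases the IP count.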
VPC Firewall Rules for Data Engineering
Firewall rules control traffic to and from VM instances. For data engineering, you need rules that allow pipeline communication while blocking unauthorized access.
# Allow internal communication between data pipeline instances
gcloud compute firewall-rules create allow-internal-data-pipeline \
--network=data-engineering-vpc \
--allow=tcp,udp,icmp \
--source-ranges=10.0.1.0/24,10.0.2.0/24 \
--description="Allow internal communication between data pipeline subnets"
# Allow SSH only from specific management subnet
gcloud compute firewall-rules create allow-ssh-management \
--network=data-engineering-vpc \
--allow=tcp:22 \
--source-ranges=10.0.3.0/24 \
--target-tags=dataproc-master,dataproc-worker \
--description="Allow SSH only from management subnet"
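# Deny all other ingress traffic. Priority 65534 keeps this rule below the allow
# rules above (default priority 1000); at equal priority, deny overrides allow.
gcloud compute firewall-rules create deny-all-ingress \
--network=data-engineering-vpc \
--action=DENY \
--direction=INGRESS \
--rules=all \
--source-ranges=0.0.0.0/0 \
--priority=65534 \
--description="Deny all other ingress traffic"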
Best Practice: Layer firewall rules by priority: 1) allow internal communication, 2) allow specific management access, 3) deny all other traffic using a higher priority number (lower precedence) than the allow rules. Use network tags or service accounts to target specific instances.
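The layering only works if priorities are set deliberately: VPC firewall rules are evaluated lowest priority number first, and when an allow and a deny rule share a priority, the deny wins. The sketch below is a simplified model of that evaluation (ingress only, ignoring target tags and protocols); the `Rule` class and `evaluate` function are hypothetical, not a GCP API.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Rule:
    name: str
    action: str                    # "allow" or "deny"
    source_ranges: Tuple[str, ...]
    priority: int = 1000           # GCP default; lower number wins
    port: Optional[int] = None     # None means all ports

def evaluate(rules, src_ip, port):
    """Return the action of the winning rule for an ingress connection."""
    ip = ipaddress.ip_address(src_ip)
    matching = [
        r for r in rules
        if r.port in (None, port)
        and any(ip in ipaddress.ip_network(c) for c in r.source_ranges)
    ]
    if not matching:
        return "deny"  # implied deny-all ingress rule
    # Lowest priority number wins; on a tie, deny sorts ahead of allow.
    return min(matching, key=lambda r: (r.priority, r.action != "deny")).action

def layered(deny_priority):
    return [
        Rule("allow-internal", "allow", ("10.0.1.0/24", "10.0.2.0/24")),
        Rule("allow-ssh", "allow", ("10.0.3.0/24",), port=22),
        Rule("deny-all", "deny", ("0.0.0.0/0",), priority=deny_priority),
    ]

print(evaluate(layered(65534), "10.0.1.5", 8080))    # allow
print(evaluate(layered(65534), "203.0.113.7", 22))   # deny
print(evaluate(layered(1000), "10.0.1.5", 8080))     # deny: tie goes to deny
```

The last call shows the pitfall: a deny-all rule left at the default priority of 1000 shadows every allow rule that also uses the default.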
Network Security for Data Pipelines
VPC Flow Logs
VPC Flow Logs capture network traffic metadata for analysis and security monitoring.
# Enable VPC Flow Logs on data pipeline subnet
# (5-second aggregation, 50% of flows sampled, all metadata fields included)
gcloud compute networks subnets update data-pipeline-subnet \
--region=us-central1 \
--enable-flow-logs \
--logging-aggregation-interval=interval-5-sec \
--logging-flow-sampling=0.5 \
--logging-metadata=include-all
Network Policies for GKE
# Kubernetes NetworkPolicy for data engineering GKE cluster
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: data-pipeline-network-policy
  namespace: data-engineering
spec:
  podSelector:
    matchLabels:
      app: dataflow-worker
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: dataflow-job-manager
      ports:
        - protocol: TCP
          port: 8080
  egress:
    - to:
        - ipBlock:
            cidr: 10.0.0.0/8 # Internal GCP
      ports:
        - protocol: TCP
          port: 443
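Once a policy lists Egress under policyTypes, anything the egress rules do not match is dropped. The following sketch models the single egress rule above to show what that implies; the `egress_allowed` helper and the sample destination addresses are illustrative assumptions.

```python
import ipaddress

# The one egress rule from the policy: TCP 443 to 10.0.0.0/8
EGRESS_RULES = [{"cidr": "10.0.0.0/8", "protocol": "TCP", "port": 443}]

def egress_allowed(dst_ip, protocol, port, rules=EGRESS_RULES):
    """True if some egress rule matches destination, protocol and port."""
    ip = ipaddress.ip_address(dst_ip)
    return any(
        ip in ipaddress.ip_network(r["cidr"])
        and protocol == r["protocol"]
        and port == r["port"]
        for r in rules
    )

print(egress_allowed("10.0.100.1", "TCP", 443))    # True: internal HTTPS
print(egress_allowed("10.0.0.10", "UDP", 53))      # False: DNS lookups blocked
print(egress_allowed("199.36.153.8", "TCP", 443))  # False: outside 10.0.0.0/8
```

The second and third results suggest that a policy like this usually needs an extra egress rule for cluster DNS, and that Google API traffic only passes if it is directed at an address inside 10.0.0.0/8, such as a Private Service Connect endpoint. The 199.36.153.8 address belongs to the private.googleapis.com range, which falls outside the allowed block.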
Performance Optimization
Choosing the Right Machine Type
# Data processing machine types
machine_types = {
    "dataflow-worker": "n1-standard-4",    # 4 vCPU, 15GB RAM
    "dataproc-master": "n1-standard-8",    # 8 vCPU, 30GB RAM
    "dataproc-worker": "n1-standard-4",    # 4 vCPU, 15GB RAM
    "airflow-worker": "n1-standard-2",     # 2 vCPU, 7.5GB RAM
    "airflow-scheduler": "n1-standard-4",  # 4 vCPU, 15GB RAM
}
Network Bandwidth Considerations
# Approximate maximum egress bandwidth by machine type.
# Figures depend on CPU platform and networking tier; confirm them against
# the current Compute Engine machine-type documentation.
network_perf = {
    "2gbps": {
        "machine_types": ["n1-standard-1"],
        "use_case": "Light data processing, small transfers"
    },
    "10gbps": {
        "machine_types": ["n1-standard-2", "n1-standard-4"],
        "use_case": "Medium workloads, Dataflow workers"
    },
    "16gbps": {
        "machine_types": ["n1-standard-8"],
        "use_case": "Larger workers, Dataproc master nodes"
    },
    "32gbps": {
        "machine_types": ["n1-standard-16", "n1-standard-32"],
        "use_case": "Heavy processing, large Dataproc clusters"
    },
    "100gbps": {
        "machine_types": ["a2-highgpu-8g", "a2-megagpu-16g"],
        "use_case": "GPU workloads, ML training"
    }
}
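Cost Tip: Traffic between VMs in the same zone over internal IPs is free. Traffic between zones in the same region and traffic between regions are billed per GB, and internet egress costs the most (on the order of $0.12/GB in Premium Tier for the first terabytes; check current pricing). Design your data architecture to minimize cross-region and internet traffic.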
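A rough monthly estimate makes the trade-off easier to reason about. The sketch below is illustrative only: the per-GB rates and traffic volumes are placeholder assumptions, not quoted prices, and should be replaced with figures from the current pricing page.

```python
from decimal import Decimal

# Placeholder rates in USD per GB (assumptions for illustration only)
ASSUMED_RATES = {
    "same_zone": Decimal("0.00"),
    "cross_region": Decimal("0.02"),
    "internet": Decimal("0.12"),
}

def monthly_egress_cost(gb_by_path, rates=ASSUMED_RATES):
    """Sum the cost of each traffic path: GB transferred times rate."""
    return sum(
        (rates[path] * Decimal(gb) for path, gb in gb_by_path.items()),
        Decimal("0"),
    )

# Hypothetical monthly volumes in GB
traffic = {"same_zone": 50_000, "cross_region": 10_000, "internet": 500}
print(monthly_egress_cost(traffic))  # 260.00
```

Under these assumed numbers, the 50 TB that stays in one zone costs nothing, while the much smaller cross-region and internet volumes account for the whole bill.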
Common Interview Questions
Q1: When would you use Shared VPC vs. VPC Peering?
Answer: Shared VPC is best when a central networking team manages network resources across multiple projects. It provides centralized control over subnets, firewall rules, and routes. VPC Peering is simpler for connecting two independent projects, especially when each team manages their own VPC. Shared VPC is preferred in enterprise environments with strict network governance.
Q2: What is Private Google Access and why is it important for data engineering?
Answer: Private Google Access allows VMs without external IPs to reach Google APIs (BigQuery, GCS, etc.) over Google's internal network. It is enabled per subnet. This is critical for data engineering because it: 1) Eliminates the need for public IPs on processing VMs, 2) Reduces attack surface, 3) Keeps API traffic on Google's network instead of the public internet, 4) Helps meet compliance policies that forbid public IPs on data-processing VMs.
Q3: How do you design a VPC for a multi-project data platform?
Answer: Use Shared VPC with a dedicated networking host project. Create separate subnets for each workload (data pipeline, analytics, management) with appropriate CIDR ranges. Enable Private Google Access on all data subnets. Implement firewall rules at the host project level. Use VPC Flow Logs for monitoring. This provides centralized network governance while allowing project-level resource management.
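The subnet-per-workload part of this answer can be sketched as a small allocation helper. It assumes a hypothetical 10.0.0.0/16 block reserved for the host project and hands out consecutive /24 ranges; `plan_subnets` and the workload names are illustrative.

```python
import ipaddress

def plan_subnets(parent_cidr, workloads, new_prefix=24):
    """Assign consecutive, non-overlapping subnets of parent_cidr."""
    parent = ipaddress.ip_network(parent_cidr)
    available = 2 ** (new_prefix - parent.prefixlen)
    if len(workloads) > available:
        raise ValueError(f"{parent_cidr} only holds {available} /{new_prefix} subnets")
    blocks = parent.subnets(new_prefix=new_prefix)
    return {name: str(block) for name, block in zip(workloads, blocks)}

plan = plan_subnets("10.0.0.0/16", ["data-pipeline", "analytics", "management"])
for name, cidr in plan.items():
    print(f"{name}: {cidr}")
# data-pipeline: 10.0.0.0/24
# analytics: 10.0.1.0/24
# management: 10.0.2.0/24
```

Allocating from a single parent block keeps every service project's ranges disjoint by construction, which matters if the Shared VPC is later peered or connected to on-premises networks.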
Q4: Explain the difference between Cloud NAT and Proxy VM for internet access.
Answer: Cloud NAT is a managed, highly available service that provides internet access to private VMs without single points of failure. A Proxy VM is a self-managed VM running NAT/proxy software. Cloud NAT is preferred because it's fully managed, scales automatically, provides higher availability, and has no maintenance overhead. Use Cloud NAT for production data workloads.
Q5: How do you secure data flow between VPCs in different projects?
Answer: Options include: 1) VPC Peering for direct private connectivity, 2) Shared VPC for centralized management, 3) Private Service Connect for service-specific access, 4) Cloud VPN for encrypted connectivity, 5) Cloud Interconnect for hybrid scenarios. For data engineering, VPC Peering or Shared VPC with Private Google Access provides the best balance of security and performance.