GCP VPC Networking for Data Engineering

Design secure, high-performance VPC architectures for data engineering workloads including Private Google Access, VPC peering, and network security.

16 min read · Advanced

VPC Architecture for Data Engineering

A Virtual Private Cloud (VPC) is the fundamental networking construct in GCP. For data engineers, proper VPC design ensures secure, low-latency communication between data services while controlling costs.

VPC Components

🌐 GCP VPC Architecture for Data Engineering

Interview Tip: GCP VPCs are global (unlike AWS VPCs which are regional). Always use Custom mode for data engineering to control CIDR ranges and enable Private Google Access.

VPC Modes: Auto vs. Custom

Auto Mode VPC

# Auto mode creates subnets automatically in all regions
gcloud compute networks create auto-vpc \
  --subnet-mode=auto \
  --bgp-routing-mode=regional

# Auto mode creates one /20 subnet per region, all carved from 10.128.0.0/9
# 10.128.0.0/20 for us-central1
# 10.142.0.0/20 for us-east1
# etc.

Custom Mode VPC (Recommended for Data Engineering)

# Custom mode gives you full control over subnet CIDR ranges
gcloud compute networks create data-engineering-vpc \
  --subnet-mode=custom \
  --bgp-routing-mode=global

# Create subnets for different data workloads
gcloud compute networks subnets create data-pipeline-subnet \
  --network=data-engineering-vpc \
  --region=us-central1 \
  --range=10.0.1.0/24 \
  --enable-private-ip-google-access

gcloud compute networks subnets create analytics-subnet \
  --network=data-engineering-vpc \
  --region=us-central1 \
  --range=10.0.2.0/24 \
  --enable-private-ip-google-access

gcloud compute networks subnets create management-subnet \
  --network=data-engineering-vpc \
  --region=us-central1 \
  --range=10.0.3.0/24

✨

Best Practice: Always use Custom mode VPC for data engineering. It provides predictable IP addressing, allows you to create subnets only where needed, and prevents IP range conflicts when peering with other VPCs.
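The conflict check behind this advice can be sketched with Python's standard `ipaddress` module. This is a minimal sketch, not a definitive tool: the subnet names and ranges are the ones from the custom mode example, the helper names `find_overlaps` and `collides_with_auto_mode` are hypothetical, and the block assumes auto mode VPCs draw their subnets from 10.128.0.0/9.

```python
import ipaddress
from itertools import combinations

# Subnet ranges from the custom mode example.
PLANNED_SUBNETS = {
    "data-pipeline-subnet": "10.0.1.0/24",
    "analytics-subnet": "10.0.2.0/24",
    "management-subnet": "10.0.3.0/24",
}

# Assumption: auto mode VPCs allocate their regional subnets from this block.
AUTO_MODE_BLOCK = ipaddress.ip_network("10.128.0.0/9")


def find_overlaps(subnets):
    """Return sorted pairs of subnet names whose CIDR ranges overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}
    return sorted(
        (a, b)
        for a, b in combinations(sorted(nets), 2)
        if nets[a].overlaps(nets[b])
    )


def collides_with_auto_mode(subnets):
    """Return names of subnets that intersect the auto mode block."""
    return sorted(
        name
        for name, cidr in subnets.items()
        if ipaddress.ip_network(cidr).overlaps(AUTO_MODE_BLOCK)
    )


if __name__ == "__main__":
    print(find_overlaps(PLANNED_SUBNETS))            # []
    print(collides_with_auto_mode(PLANNED_SUBNETS))  # []
```

Running a check like this before peering two VPCs catches overlapping ranges early, since peering cannot be established between networks whose subnet ranges collide.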

Private Google Access

Private Google Access allows VMs without external IP addresses to reach Google APIs and services (BigQuery, GCS, Pub/Sub, etc.) over Google's internal network.

Configuration

# Enable Private Google Access on a subnet
gcloud compute networks subnets update data-pipeline-subnet \
  --region=us-central1 \
  --enable-private-ip-google-access

# Verify Private Google Access is enabled
gcloud compute networks subnets describe data-pipeline-subnet \
  --region=us-central1 \
  --format="value(privateIpGoogleAccess)"
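To audit many subnets at once, the same `privateIpGoogleAccess` field can be read from JSON output. The sketch below assumes input shaped like `gcloud compute networks subnets list --format=json`; the sample data and the function name `subnets_missing_pga` are hypothetical and only illustrate the idea.

```python
import json


def subnets_missing_pga(subnets_json):
    """Return names of subnets where privateIpGoogleAccess is absent or false."""
    subnets = json.loads(subnets_json)
    return sorted(
        s["name"] for s in subnets if not s.get("privateIpGoogleAccess", False)
    )


# Hypothetical sample, shaped like the JSON list that gcloud prints.
SAMPLE = json.dumps([
    {"name": "data-pipeline-subnet", "privateIpGoogleAccess": True},
    {"name": "analytics-subnet", "privateIpGoogleAccess": True},
    {"name": "management-subnet", "privateIpGoogleAccess": False},
])

if __name__ == "__main__":
    print(subnets_missing_pga(SAMPLE))  # ['management-subnet']
```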

Private Service Connect (Modern Alternative)

Private Service Connect provides more granular control over private access to Google APIs.

# Reserve an internal IP address for a Private Service Connect endpoint.
# The address must not fall inside any subnet range of the VPC.
gcloud compute addresses create psc-googleapis-ip \
  --global \
  --purpose=PRIVATE_SERVICE_CONNECT \
  --addresses=10.0.100.1 \
  --network=data-engineering-vpc

# Create the endpoint (a global forwarding rule).
# Endpoints for Google APIs target an API bundle (all-apis or vpc-sc),
# not a single service, so BigQuery is reached through the bundle.
# Endpoint names are limited to lowercase letters and digits.
gcloud compute forwarding-rules create pscgoogleapis \
  --global \
  --network=data-engineering-vpc \
  --address=psc-googleapis-ip \
  --target-google-apis-bundle=all-apis

VPC Peering for Data Engineering

VPC Peering connects two VPC networks, allowing private communication across projects.

📊 BigQuery Architecture for Data Engineering

Interview Tip: BigQuery separates storage and compute. Queries are billed either on demand by bytes scanned or through capacity pricing by slots (compute). Partition and cluster tables to reduce the bytes each query scans.

Setting Up VPC Peering

# Create peering from Project A to Project B (run in Project A)
gcloud compute networks peerings create peer-a-to-b \
  --network=data-platform-vpc \
  --peer-project=project-b-analytics \
  --peer-network=analytics-vpc

# Create peering from Project B to Project A (run in Project B).
# Peering becomes active only after both sides are configured;
# subnet routes are then exchanged automatically.
gcloud compute networks peerings create peer-b-to-a \
  --network=analytics-vpc \
  --peer-project=project-a-data-platform \
  --peer-network=data-platform-vpc

⚠️

Warning: VPC Peering is non-transitive. If Project A peers with B, and B peers with C, A cannot reach C. For this scenario, consider Network Connectivity Center or implement full mesh peering.
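The non-transitive rule can be illustrated with a small model. This is a sketch under simple assumptions: each peering is an unordered pair of VPC names, the names A, B, and C are placeholders, and the helpers `can_reach` and `full_mesh` are hypothetical.

```python
from itertools import combinations


def can_reach(peerings, src, dst):
    """True only when src and dst are directly peered; no transit via a third VPC."""
    direct = {frozenset(pair) for pair in peerings}
    return frozenset((src, dst)) in direct


def full_mesh(vpcs):
    """Every peering pair needed so each VPC can reach every other VPC."""
    return list(combinations(sorted(vpcs), 2))


# A peers with B, and B peers with C.
PEERINGS = [("A", "B"), ("B", "C")]

if __name__ == "__main__":
    print(can_reach(PEERINGS, "A", "B"))          # True
    print(can_reach(PEERINGS, "A", "C"))          # False
    print(len(full_mesh(["A", "B", "C", "D"])))   # 6
```

A full mesh of n networks needs n(n-1)/2 peering pairs, and each pair is configured from both sides, which is why a hub service becomes attractive as the number of VPCs grows.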

Shared VPC for Data Engineering

Shared VPC allows an organization to connect resources from multiple projects to a common VPC network, centrally managed by a host project.

# Enable Shared VPC in host project
gcloud compute shared-vpc enable PROJECT_ID_HOST

# Associate service project
gcloud compute shared-vpc associated-projects add PROJECT_ID_SERVICE \
  --host-project=PROJECT_ID_HOST

Cloud NAT for Data Engineering

Cloud NAT provides internet access to private VMs without exposing them to incoming traffic. Essential for Dataproc/VMs that need to download packages or access external APIs.

# Create Cloud Router
gcloud compute routers create data-engineering-router \
  --network=data-engineering-vpc \
  --region=us-central1

# Create Cloud NAT and log both translations and errors
gcloud compute routers nats create data-engineering-nat \
  --router=data-engineering-router \
  --region=us-central1 \
  --auto-allocate-nat-external-ips \
  --nat-all-subnet-ip-ranges \
  --enable-logging \
  --log-filter=ALL
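When sizing Cloud NAT for a large Dataproc cluster, port capacity is the usual constraint. The sketch below is a rough estimate only. It assumes 64,512 usable source ports per NAT IP address and a static minimum of 64 ports per VM; both figures should be checked against current Cloud NAT documentation, and `nat_ips_needed` is a hypothetical helper.

```python
import math

# Assumption: usable source ports per NAT IP (65,536 minus the first 1,024).
PORTS_PER_NAT_IP = 64_512


def nat_ips_needed(vm_count, min_ports_per_vm=64):
    """Estimate how many NAT IPs are needed with static port allocation."""
    if vm_count <= 0:
        return 0
    vms_per_ip = PORTS_PER_NAT_IP // min_ports_per_vm
    return math.ceil(vm_count / vms_per_ip)


if __name__ == "__main__":
    print(nat_ips_needed(1000))        # 1
    print(nat_ips_needed(1009))        # 2
    print(nat_ips_needed(500, 1024))   # 8
```

Raising the minimum ports per VM helps workers that open many concurrent connections to the same destination, at the cost of fewer VMs served per NAT IP.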

VPC Firewall Rules for Data Engineering

Firewall rules control traffic to and from VM instances. For data engineering, you need rules that allow pipeline communication while blocking unauthorized access.

# Allow internal communication between data pipeline instances
gcloud compute firewall-rules create allow-internal-data-pipeline \
  --network=data-engineering-vpc \
  --allow=tcp,udp,icmp \
  --source-ranges=10.0.1.0/24,10.0.2.0/24 \
  --description="Allow internal communication between data pipeline subnets"

# Allow SSH only from specific management subnet
gcloud compute firewall-rules create allow-ssh-management \
  --network=data-engineering-vpc \
  --allow=tcp:22 \
  --source-ranges=10.0.3.0/24 \
  --target-tags=dataproc-master,dataproc-worker \
  --description="Allow SSH only from management subnet"

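# Deny all other ingress traffic.
# The priority number must be higher (lower precedence) than the allow rules,
# which default to 1000; at equal priority a deny rule overrides an allow rule.
gcloud compute firewall-rules create deny-all-ingress \
  --network=data-engineering-vpc \
  --action=DENY \
  --direction=INGRESS \
  --rules=all \
  --source-ranges=0.0.0.0/0 \
  --priority=65534 \
  --description="Deny all other ingress traffic"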

✨

Best Practice: Implement firewall rules in a layered approach: 1) Allow internal communication, 2) Allow specific management access, 3) Deny all other traffic. Use network tags to target specific instances.
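How the layers interact depends on rule priority. The following sketch models ingress evaluation in a simplified way: the lowest priority number wins, a deny beats an allow at the same priority, and unmatched traffic hits the implied deny. The priorities are illustrative assumptions, target tags and protocols are ignored, and `Rule` and `evaluate_ingress` are hypothetical names.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    priority: int
    action: str            # "allow" or "deny"
    source_ranges: tuple   # CIDR strings
    ports: tuple = ()      # empty tuple means every port


def evaluate_ingress(rules, source_ip, port):
    """Return (action, rule_name) for the first matching rule."""
    ip = ipaddress.ip_address(source_ip)
    # Lower priority number first; on a tie, deny sorts ahead of allow.
    ordered = sorted(rules, key=lambda r: (r.priority, r.action != "deny"))
    for rule in ordered:
        in_range = any(ip in ipaddress.ip_network(c) for c in rule.source_ranges)
        port_ok = not rule.ports or port in rule.ports
        if in_range and port_ok:
            return rule.action, rule.name
    return "deny", "implied-deny-ingress"


RULES = [
    Rule("allow-internal-data-pipeline", 1000, "allow",
         ("10.0.1.0/24", "10.0.2.0/24")),
    Rule("allow-ssh-management", 1000, "allow", ("10.0.3.0/24",), (22,)),
    Rule("deny-all-ingress", 65534, "deny", ("0.0.0.0/0",)),
]

if __name__ == "__main__":
    print(evaluate_ingress(RULES, "10.0.1.5", 9092))   # allowed by the internal rule
    print(evaluate_ingress(RULES, "203.0.113.9", 22))  # denied by deny-all-ingress
```

If the deny rule shared priority 1000 with the allow rules, the tie-break would make it win and block the internal traffic too, so the catch-all deny needs the largest priority number of the three.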

Network Security for Data Pipelines

VPC Flow Logs

VPC Flow Logs capture network traffic metadata for analysis and security monitoring.

# Enable VPC Flow Logs on data pipeline subnet
gcloud compute networks subnets update data-pipeline-subnet \
  --region=us-central1 \
  --enable-flow-logs \
  --logging-aggregation-interval=interval-5-sec \
  --logging-sample-rate=0.5 \
  --logging-metadata=include-all

Network Policies for GKE

# Kubernetes NetworkPolicy for data engineering GKE cluster
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: data-pipeline-network-policy
  namespace: data-engineering
spec:
  podSelector:
    matchLabels:
      app: dataflow-worker
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: dataflow-job-manager
      ports:
        - protocol: TCP
          port: 8080
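  egress:
    - to:
        - ipBlock:
            cidr: 10.0.0.0/8  # RFC 1918 private range (internal VPC addresses)
      ports:
        - protocol: TCP
          port: 443
    # Allow DNS lookups; an Egress policy without this also blocks name resolution
    - ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53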
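The effect of the `ipBlock` egress rule can be checked with a small model. This sketch considers only that one rule (TCP 443 to 10.0.0.0/8), and `egress_allowed` is a hypothetical helper. The address 199.36.153.8 belongs to the `private.googleapis.com` range, and 10.0.100.1 is the internal address used in the Private Service Connect example.

```python
import ipaddress

# The ipBlock egress rule from the policy: (CIDR, protocol, port).
EGRESS_RULES = [("10.0.0.0/8", "TCP", 443)]


def egress_allowed(dest_ip, protocol, port, rules=EGRESS_RULES):
    """True if any rule matches the destination address, protocol, and port."""
    ip = ipaddress.ip_address(dest_ip)
    return any(
        ip in ipaddress.ip_network(cidr) and protocol == proto and port == p
        for cidr, proto, p in rules
    )


if __name__ == "__main__":
    print(egress_allowed("10.0.100.1", "TCP", 443))    # True
    print(egress_allowed("199.36.153.8", "TCP", 443))  # False
```

Google API addresses outside 10.0.0.0/8 are not covered by this rule, so pods that call BigQuery or GCS need either an internal endpoint address or an additional `ipBlock` for the Google API range.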

Performance Optimization

Choosing the Right Machine Type

# Data processing machine types
machine_types = {
    "dataflow-worker": "n1-standard-4",      # 4 vCPU, 15GB RAM
    "dataproc-master": "n1-standard-8",      # 8 vCPU, 30GB RAM
    "dataproc-worker": "n1-standard-4",      # 4 vCPU, 15GB RAM
    "airflow-worker": "n1-standard-2",       # 2 vCPU, 7.5GB RAM
    "airflow-scheduler": "n1-standard-4",    # 4 vCPU, 15GB RAM
}

Network Bandwidth Considerations

# Approximate maximum egress bandwidth per VM (default network tier).
# Verify against the current Compute Engine documentation before sizing.
network_perf = {
    "2gbps": {
        "machine_types": ["n1-standard-1"],
        "use_case": "Light data processing, small transfers"
    },
    "10gbps": {
        "machine_types": ["n1-standard-2", "n1-standard-4"],
        "use_case": "Medium workloads, Dataflow workers"
    },
    "16gbps": {
        "machine_types": ["n1-standard-8"],
        "use_case": "Dataproc masters, larger pipeline workers"
    },
    "32gbps": {
        "machine_types": ["n1-standard-16", "n1-standard-32"],
        "use_case": "Heavy processing, large Dataproc clusters"
    },
    "100gbps": {
        "machine_types": ["a2-highgpu-8g", "a2-megagpu-16g"],
        "use_case": "GPU workloads, ML training"
    }
}
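Bandwidth tiers are easier to compare once translated into transfer time. The sketch below gives a best-case estimate that assumes the link is fully used and ignores protocol overhead, disk throughput, and service limits. It uses decimal units (1 TB = 10^12 bytes), and `transfer_seconds` is a hypothetical helper.

```python
def transfer_seconds(terabytes, gbps):
    """Best-case seconds to move `terabytes` of data over a `gbps` link."""
    bits = terabytes * 1e12 * 8
    return bits / (gbps * 1e9)


if __name__ == "__main__":
    print(transfer_seconds(1, 10))  # 800.0
    print(transfer_seconds(1, 32))  # 250.0
```

Real transfers are usually slower than this bound, so the numbers are most useful for comparing machine types rather than predicting job duration.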

ℹ️

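Cost Tip: For data processing workloads, traffic between VMs in the same zone over internal IPs is free, while traffic between zones in the same region is billed at a small per-GB rate. Cross-region egress costs roughly $0.01/GB or more depending on the region pair, and internet egress costs about $0.12/GB. Rates change over time, so check the current pricing page. Design your data architecture to minimize cross-region and internet traffic.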
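A quick estimate shows how quickly these charges add up. The sketch below uses the per-GB rates quoted in the tip as illustrative assumptions, not current list prices, and treats local traffic as free. `monthly_egress_cost` and the traffic volumes are hypothetical.

```python
from decimal import Decimal

# Illustrative rates in USD per GB, taken from the tip; verify current pricing.
RATES = {
    "local": Decimal("0"),
    "cross_region": Decimal("0.01"),
    "internet": Decimal("0.12"),
}


def monthly_egress_cost(local_gb, cross_region_gb, internet_gb):
    """Estimate monthly egress cost in USD for the given traffic volumes."""
    return (
        RATES["local"] * local_gb
        + RATES["cross_region"] * cross_region_gb
        + RATES["internet"] * internet_gb
    )


if __name__ == "__main__":
    # 50 TB local, 5 TB cross-region, 1 TB to the internet.
    print(monthly_egress_cost(50_000, 5_000, 1_000))  # 170.00
```

In this example the 1 TB of internet egress costs more than twice as much as the 5 TB of cross-region traffic, which is why keeping pipelines and their storage in the same region matters.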

💬

Common Interview Questions

Q1: When would you use Shared VPC vs. VPC Peering?

Answer: Shared VPC is best when a central networking team manages network resources across multiple projects. It provides centralized control over subnets, firewall rules, and routes. VPC Peering is simpler for connecting two independent projects, especially when each team manages their own VPC. Shared VPC is preferred in enterprise environments with strict network governance.

Q2: What is Private Google Access and why is it important for data engineering?

Answer: Private Google Access allows VMs without external IPs to reach Google APIs (BigQuery, GCS, etc.) over Google's internal network. This is critical for data engineering because it: 1) Eliminates the need for public IPs on processing VMs, 2) Reduces attack surface, 3) Keeps API traffic on Google's network instead of routing it through a NAT gateway or proxy, 4) Supports compliance requirements that forbid exposing data processing VMs to the public internet.

Q3: How do you design a VPC for a multi-project data platform?

Answer: Use Shared VPC with a dedicated networking host project. Create separate subnets for each workload (data pipeline, analytics, management) with appropriate CIDR ranges. Enable Private Google Access on all data subnets. Implement firewall rules at the host project level. Use VPC Flow Logs for monitoring. This provides centralized network governance while allowing project-level resource management.

Q4: Explain the difference between Cloud NAT and Proxy VM for internet access.

Answer: Cloud NAT is a managed, highly available service that provides internet access to private VMs without single points of failure. A Proxy VM is a self-managed VM running NAT/proxy software. Cloud NAT is preferred because it's fully managed, scales automatically, provides higher availability, and has no maintenance overhead. Use Cloud NAT for production data workloads.

Q5: How do you secure data flow between VPCs in different projects?

Answer: Options include: 1) VPC Peering for direct private connectivity, 2) Shared VPC for centralized management, 3) Private Service Connect for service-specific access, 4) Cloud VPN for encrypted connectivity, 5) Cloud Interconnect for hybrid scenarios. For data engineering, VPC Peering or Shared VPC with Private Google Access provides the best balance of security and performance.