GCP IAM for Data Engineering

Master Identity and Access Management, service accounts, workload identity federation, and security patterns for data engineering pipelines.


IAM Fundamentals for Data Engineers

Identity and Access Management (IAM) is the backbone of GCP security. For data engineers, IAM controls who can access data, run pipelines, and manage infrastructure. Misconfigured IAM is a frequent cause of data exposure in cloud environments.

IAM Hierarchy

[Figure: GCP IAM Hierarchy & Roles]

Interview Tip: Follow the principle of least privilege — grant only the permissions needed. Use predefined roles over basic roles. Service accounts should have minimal permissions and use Workload Identity in GKE instead of key-based auth.

IAM Roles for Data Engineers

GCP provides three types of IAM roles:

Role Type	Description	Example
Basic Roles	Owner, Editor, Viewer	`roles/viewer` — too broad for data eng
Predefined Roles	Service-specific	`roles/bigquery.dataEditor`
Custom Roles	User-defined	Combine specific permissions

⚠️

Security Warning: Never use basic roles (Owner/Editor/Viewer) for data engineering service accounts. These roles are overly permissive and violate the principle of least privilege. Always use predefined or custom roles.
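
One way to act on this warning is to scan an exported policy for basic roles granted to service accounts. The sketch below assumes the policy is a dict in the shape produced by `gcloud projects get-iam-policy PROJECT_ID --format=json`. The function name `find_basic_role_bindings` and the sample policy are hypothetical and exist only for illustration.

```python
BASIC_ROLES = {"roles/owner", "roles/editor", "roles/viewer"}


def find_basic_role_bindings(policy):
    """Return (role, member) pairs where a service account holds a basic role."""
    findings = []
    for binding in policy.get("bindings", []):
        if binding.get("role") not in BASIC_ROLES:
            continue
        for member in binding.get("members", []):
            if member.startswith("serviceAccount:"):
                findings.append((binding["role"], member))
    return findings


# Hypothetical policy used only for illustration.
sample_policy = {
    "bindings": [
        {
            "role": "roles/editor",
            "members": [
                "serviceAccount:etl@my-project.iam.gserviceaccount.com",
                "user:alice@example.com",
            ],
        },
        {
            "role": "roles/bigquery.dataEditor",
            "members": ["serviceAccount:etl@my-project.iam.gserviceaccount.com"],
        },
    ]
}

for role, member in find_basic_role_bindings(sample_policy):
    print(f"{member} has basic role {role}")
```

A check like this could run in CI against each project's exported policy, so a basic role on a pipeline account is caught before it reaches production.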

Essential Data Engineering IAM Roles

# Predefined IAM roles for data engineering pipelines
Data Ingestion:
  - roles/storage.objectAdmin      # GCS read/write
  - roles/pubsub.publisher        # Publish to Pub/Sub
  - roles/pubsub.subscriber       # Subscribe to Pub/Sub

Data Processing:
  - roles/dataflow.developer       # Create/manage Dataflow jobs
  - roles/dataproc.editor          # Manage Dataproc clusters
  - roles/cloudfunctions.invoker   # Invoke Cloud Functions

Data Storage:
  - roles/bigquery.dataEditor      # Read/write BigQuery datasets
  - roles/bigquery.jobUser         # Run BigQuery queries
  - roles/bigtable.admin           # Full Bigtable access

Orchestration:
  - roles/composer.worker          # Cloud Composer operations
  - roles/iam.serviceAccountUser   # Act as service accounts

Monitoring:
  - roles/monitoring.metricWriter  # Write custom metrics
  - roles/logging.logWriter        # Write logs
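
As a rough illustration of how the roles above map onto per-stage service accounts, the following sketch turns a stage-to-roles mapping into IAM policy bindings. The `STAGE_ROLES` mapping, the account names, and the helper `build_bindings` are assumptions made for this example. The right set of roles depends on what each pipeline actually does.

```python
# Hypothetical mapping from pipeline service account to the roles it needs,
# drawn from the list above.
STAGE_ROLES = {
    "ingestion-pipeline": ["roles/storage.objectAdmin", "roles/pubsub.publisher"],
    "processing-pipeline": ["roles/dataflow.developer", "roles/bigquery.dataEditor"],
    "analytics-pipeline": ["roles/bigquery.jobUser"],
}


def build_bindings(project_id, stage_roles):
    """Build IAM policy bindings (one per role), sorted by role name."""
    members_by_role = {}
    for account_id, roles in stage_roles.items():
        member = f"serviceAccount:{account_id}@{project_id}.iam.gserviceaccount.com"
        for role in roles:
            members_by_role.setdefault(role, []).append(member)
    return [
        {"role": role, "members": sorted(members)}
        for role, members in sorted(members_by_role.items())
    ]


bindings = build_bindings("my-project", STAGE_ROLES)
for binding in bindings:
    print(binding["role"], "->", ", ".join(binding["members"]))
```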

Service Accounts Deep Dive

Service accounts are special accounts used by applications, not humans. They authenticate to GCP APIs and are essential for data pipeline security.

Types of Service Accounts

Dataflow vs Dataproc: When to Use What

Dataflow

Apache Beam (Serverless)

✓ Fully managed, no cluster setup

✓ Auto-scaling (up and down)

✓ Unified stream + batch

✓ Exactly-once processing

✓ Pay per CPU/GB-second

✗ Limited customization

✗ Harder to debug

✗ Vendor lock-in (Beam)

Use for: New pipelines, streaming, ETL jobs, serverless-first teams

Dataproc

Spark/Hadoop (Managed)

✓ Full Spark/Hadoop ecosystem

✓ Easy migration from on-prem

✓ Custom scripts & libraries

✓ Preemptible/Spot VMs (up to 91% off)

✓ Jupyter/Zeppelin built-in

✗ Cluster management needed

✗ Manual scaling

✗ Idle cluster costs money

Use for: Existing Spark code, ML workloads, lift-and-shift from on-prem Hadoop

Creating Service Accounts for Data Pipelines

from google.cloud import iam_admin_v1

def create_service_account(project_id, account_id, display_name):
    """Create a service account for a data pipeline."""
    client = iam_admin_v1.IAMClient()

    request = iam_admin_v1.CreateServiceAccountRequest(
        name=f"projects/{project_id}",
        account_id=account_id,
        service_account=iam_admin_v1.ServiceAccount(
            display_name=display_name,
            description="Service account for data engineering pipeline"
        )
    )

    service_account = client.create_service_account(request=request)
    print(f"Created service account: {service_account.email}")
    return service_account

# Create separate service accounts for each pipeline stage
create_service_account(
    "my-project",
    "ingestion-pipeline",
    "Data Ingestion Pipeline SA"
)

create_service_account(
    "my-project",
    "processing-pipeline",
    "Data Processing Pipeline SA"
)

create_service_account(
    "my-project",
    "analytics-pipeline",
    "Analytics Pipeline SA"
)

✨

Best Practice: Create separate service accounts for each pipeline stage (ingestion, processing, storage, analytics). This provides fine-grained access control and audit trails. If one account is compromised, the blast radius is limited.
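
When creating many per-stage accounts, it can help to validate the account IDs before calling the API. The sketch below assumes the documented naming rule for service account IDs (6 to 30 characters, lowercase letters, digits, and hyphens, starting with a letter and not ending with a hyphen). The helper `is_valid_account_id` is a hypothetical name, and the API remains the final authority on what it accepts.

```python
import re

# Assumed rule: 6-30 characters, lowercase letters, digits and hyphens,
# starting with a letter and not ending with a hyphen.
ACCOUNT_ID_PATTERN = re.compile(r"[a-z][a-z0-9-]{4,28}[a-z0-9]")


def is_valid_account_id(account_id):
    """Return True if account_id satisfies the assumed naming rule."""
    return ACCOUNT_ID_PATTERN.fullmatch(account_id) is not None


for candidate in ["ingestion-pipeline", "etl", "Analytics-Pipeline", "processing-"]:
    status = "ok" if is_valid_account_id(candidate) else "rejected"
    print(f"{candidate}: {status}")
```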

Workload Identity Federation

Workload Identity Federation allows external identity providers (AWS, Azure, GitHub Actions, etc.) to access GCP resources without service account keys. This eliminates the security risk of long-lived credentials.

Architecture

[Figure: BigQuery Architecture for Data Engineering]

Interview Tip: BigQuery separates storage and compute. On-demand queries are billed by bytes scanned, while capacity-based pricing bills for slots (compute). Always partition and cluster tables to reduce the data each query scans.

Setup Example: GitHub Actions → GCP

# Step 1: Create Workload Identity Pool
gcloud iam workload-identity-pools create "github-pool" \
  --location="global" \
  --display-name="GitHub Actions Pool"

# Step 2: Create Provider for GitHub
# The attribute condition limits the provider to tokens issued for repositories owned by my-org
gcloud iam workload-identity-pools providers create-oidc "github-provider" \
  --workload-identity-pool="github-pool" \
  --location="global" \
  --display-name="GitHub Provider" \
  --attribute-mapping="google.subject=assertion.sub,attribute.repository=assertion.repository" \
  --attribute-condition="assertion.repository_owner == 'my-org'" \
  --issuer-uri="https://token.actions.githubusercontent.com"

# Step 3: Allow GitHub repo to impersonate service account
# The member must include the project number (123456789 here) that owns the pool
gcloud iam service-accounts add-iam-policy-binding "data-pipeline@my-project.iam.gserviceaccount.com" \
  --role="roles/iam.workloadIdentityUser" \
  --member="principalSet://iam.googleapis.com/projects/123456789/locations/global/workloadIdentityPools/github-pool/attribute.repository/my-org/my-repo"

# GitHub Actions workflow using Workload Identity
name: Deploy Data Pipeline
on:
  push:
    branches: [main]

permissions:
  id-token: write  # Required for Workload Identity
  contents: read

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: google-github-actions/auth@v2
        with:
          workload_identity_provider: 'projects/123456789/locations/global/workloadIdentityPools/github-pool/providers/github-provider'
          service_account: 'data-pipeline@my-project.iam.gserviceaccount.com'

      - uses: google-github-actions/setup-gcloud@v2

      - name: Deploy Dataflow Job
        run: |
          gcloud dataflow jobs run my-pipeline \
            --gcs-location=gs://my-bucket/templates/pipeline \
            --parameters=input=gs://input-bucket/data/
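
The provider name used in the workflow and the `principalSet` member used in the policy binding both follow fixed formats keyed by the project number, not the project ID. The sketch below shows one way to build them consistently. The helper names are hypothetical, and the values are the placeholder identifiers from the setup example.

```python
def provider_resource_name(project_number, pool_id, provider_id):
    """Full resource name of a workload identity pool provider."""
    return (
        f"projects/{project_number}/locations/global/"
        f"workloadIdentityPools/{pool_id}/providers/{provider_id}"
    )


def repository_principal_set(project_number, pool_id, repository):
    """IAM member matching every identity whose mapped repository attribute equals `repository`."""
    return (
        "principalSet://iam.googleapis.com/"
        f"projects/{project_number}/locations/global/"
        f"workloadIdentityPools/{pool_id}/attribute.repository/{repository}"
    )


# Placeholder values taken from the setup example.
print(provider_resource_name("123456789", "github-pool", "github-provider"))
print(repository_principal_set("123456789", "github-pool", "my-org/my-repo"))
```

Generating both strings from the same inputs avoids a common setup failure, where the workflow and the policy binding point at different pools or projects.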

IAM Conditions for Data Engineering

IAM Conditions allow you to restrict access based on resource attributes, time, or request attributes. This is powerful for data engineering scenarios.

# Example: Restrict BigQuery access by time and resource
# Each dict is the "condition" field of an IAM policy binding (CEL expression).

# Condition: Only allow access during business hours (09:00-16:59 New York time)
condition = {
    "expression": "request.time.getHours('America/New_York') >= 9 && request.time.getHours('America/New_York') < 17",
    "title": "Business hours only",
    "description": "Restrict data access to business hours"
}

# Condition: Only allow access to specific datasets
# BigQuery resource names take the form projects/PROJECT_ID/datasets/DATASET_ID
dataset_condition = {
    "expression": "resource.name.startsWith('projects/my-project/datasets/analytics_')",
    "title": "Analytics datasets only",
    "description": "Only allow access to analytics_* datasets"
}

# Condition: Restrict by time AND resource
# getDayOfWeek() returns an integer: 0 = Sunday ... 6 = Saturday
combined_condition = {
    "expression": "resource.name.startsWith('projects/my-project/datasets/sensitive_') && request.time.getDayOfWeek('America/New_York') >= 1 && request.time.getDayOfWeek('America/New_York') <= 5",
    "title": "Sensitive data business days only",
    "description": "Access sensitive datasets only on weekdays"
}

ℹ️

Pro Tip: Use IAM Conditions to enforce data access policies like time-based restrictions, dataset-level permissions, and environment-based access. Combined with audit logs, this provides a comprehensive data governance framework.
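
A condition only takes effect once it is attached to a role binding. The sketch below shows, on plain dicts, how a conditional binding might be added to a policy. It assumes the JSON policy shape used by `get-iam-policy` and `set-iam-policy`. The helper name, the member, and the expiry condition are hypothetical. Policies containing conditional bindings must use policy version 3, which the helper sets.

```python
import copy


def add_conditional_binding(policy, role, member, condition):
    """Return a copy of `policy` with one extra conditional binding."""
    updated = copy.deepcopy(policy)
    updated.setdefault("bindings", []).append(
        {"role": role, "members": [member], "condition": dict(condition)}
    )
    # Policies that contain conditional bindings must use policy version 3.
    updated["version"] = 3
    return updated


# Hypothetical inputs for illustration only.
base_policy = {"version": 1, "bindings": []}
expiry_condition = {
    "title": "Temporary access",
    "description": "Expires at the start of 2026 (UTC)",
    "expression": "request.time < timestamp('2026-01-01T00:00:00Z')",
}

new_policy = add_conditional_binding(
    base_policy,
    "roles/bigquery.dataViewer",
    "serviceAccount:analytics-pipeline@my-project.iam.gserviceaccount.com",
    expiry_condition,
)
print(new_policy["version"], len(new_policy["bindings"]))
```

Working on a copy keeps the original policy intact, which makes it easier to diff the two before writing the change back.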

Service Account Key Management

Service account keys are the most sensitive credentials in GCP. If compromised, they provide full access to the service account's permissions.

Key Rotation Strategy

from google.api_core.exceptions import AlreadyExists
from google.cloud import iam_admin_v1, secretmanager

def rotate_service_account_key(project_id, service_account_email):
    """Create a new service account key and store it in Secret Manager."""
    client = iam_admin_v1.IAMClient()

    # Create new key
    request = iam_admin_v1.CreateServiceAccountKeyRequest(
        name=f"projects/{project_id}/serviceAccounts/{service_account_email}",
        key_algorithm=iam_admin_v1.ServiceAccountKeyAlgorithm.KEY_ALG_RSA_2048
    )
    new_key = client.create_service_account_key(request=request)

    # Store new key in Secret Manager
    sm_client = secretmanager.SecretManagerClient()

    secret_id = f"{service_account_email.split('@')[0]}-key"
    parent = f"projects/{project_id}"

    # Create the secret on first rotation; reuse it if it already exists
    try:
        secret_name = sm_client.create_secret(
            request={
                "parent": parent,
                "secret_id": secret_id,
                "secret": {"replication": {"automatic": {}}}
            }
        ).name
    except AlreadyExists:
        secret_name = sm_client.secret_path(project_id, secret_id)

    # Add the key as a new secret version ("parent" is the secret's resource name)
    sm_client.add_secret_version(
        request={
            "parent": secret_name,
            "payload": {"data": new_key.private_key_data}
        }
    )

    print(f"New key created and stored in Secret Manager: {secret_id}")
    return new_key

# Rotate keys every 90 days (recommended). Once consumers have switched to the
# new key, delete the old one with client.delete_service_account_key().
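
To decide which keys are due under a 90-day policy, key metadata can be checked offline. The sketch below assumes a list of dicts shaped like the output of `gcloud iam service-accounts keys list --format=json`, using the `name`, `keyType`, and `validAfterTime` fields. The helper name and the sample data are hypothetical.

```python
from datetime import datetime, timedelta, timezone

MAX_KEY_AGE = timedelta(days=90)


def keys_due_for_rotation(keys, now=None):
    """Return names of user-managed keys older than MAX_KEY_AGE."""
    now = now or datetime.now(timezone.utc)
    due = []
    for key in keys:
        # Google-managed keys are rotated by Google; skip them.
        if key.get("keyType") != "USER_MANAGED":
            continue
        created = datetime.strptime(
            key["validAfterTime"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        if now - created > MAX_KEY_AGE:
            due.append(key["name"])
    return due


# Hypothetical key metadata for illustration only.
sample_keys = [
    {"name": "keys/old", "keyType": "USER_MANAGED", "validAfterTime": "2024-01-01T00:00:00Z"},
    {"name": "keys/new", "keyType": "USER_MANAGED", "validAfterTime": "2024-05-01T00:00:00Z"},
    {"name": "keys/sys", "keyType": "SYSTEM_MANAGED", "validAfterTime": "2023-01-01T00:00:00Z"},
]
reference_time = datetime(2024, 6, 1, tzinfo=timezone.utc)
print(keys_due_for_rotation(sample_keys, now=reference_time))
```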

Disable Default Service Account

# Disable the default compute service account in production
gcloud iam service-accounts disable \
  PROJECT_NUMBER-compute@developer.gserviceaccount.com \
  --project=PROJECT_ID

# Create and use dedicated service accounts instead
gcloud iam service-accounts create data-pipeline \
  --display-name="Data Pipeline Service Account" \
  --project=PROJECT_ID
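
Before disabling a default account, it is worth checking whether any pipeline still runs as one. The sketch below flags default service accounts by email pattern. It assumes the usual naming of the Compute Engine default account (`PROJECT_NUMBER-compute@developer.gserviceaccount.com`) and the App Engine default account (`PROJECT_ID@appspot.gserviceaccount.com`). The helper name is hypothetical.

```python
import re

# Assumed email patterns for default service accounts.
DEFAULT_SA_PATTERNS = [
    re.compile(r"\d+-compute@developer\.gserviceaccount\.com"),  # Compute Engine default
    re.compile(r"[a-z0-9.:-]+@appspot\.gserviceaccount\.com"),   # App Engine default
]


def is_default_service_account(email):
    """Return True if the email looks like a default service account."""
    return any(pattern.fullmatch(email) for pattern in DEFAULT_SA_PATTERNS)


for email in [
    "123456789-compute@developer.gserviceaccount.com",
    "my-project@appspot.gserviceaccount.com",
    "data-pipeline@my-project.iam.gserviceaccount.com",
]:
    label = "default" if is_default_service_account(email) else "dedicated"
    print(f"{email}: {label}")
```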

IAM Audit Logging for Data Engineering

Audit logs are essential for compliance and debugging data access patterns.

# Data Access audit log configuration for BigQuery and Cloud Storage
def setup_audit_logging(project_id):
    """Build the auditConfigs section of the project's IAM policy."""
    # Audit configs live in the IAM policy, not in the Logging API.
    # Each entry names a service and the Data Access log types to record.

    # BigQuery data access audit config
    bigquery_audit_config = {
        "service": "bigquery.googleapis.com",
        "auditLogConfigs": [
            {
                "logType": "DATA_READ",
                "exemptedMembers": []
            },
            {
                "logType": "DATA_WRITE",
                "exemptedMembers": []
            }
        ]
    }

    # GCS audit config
    gcs_audit_config = {
        "service": "storage.googleapis.com",
        "auditLogConfigs": [
            {
                "logType": "DATA_READ",
                "exemptedMembers": []
            }
        ]
    }

    # This function only builds the configuration; it does not apply it.
    # To apply, set these entries as "auditConfigs" in the project's IAM policy
    # (in practice, use Terraform or gcloud for production).
    print(f"Audit configs prepared for BigQuery and GCS in {project_id}")
    return [bigquery_audit_config, gcs_audit_config]

✨

Best Practice: Keep DATA_READ and DATA_WRITE audit logs for compliance. BigQuery Data Access logs are enabled by default; for Cloud Storage and most other services they must be turned on explicitly. These logs capture who accessed what data and when. Store audit logs in a separate project to prevent tampering. Use log sinks to export to BigQuery for analysis.
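
As a small illustration of the "who accessed what" question, the sketch below counts accesses per principal and resource from exported audit log entries. It assumes entries are JSON objects with the standard audit log fields `protoPayload.authenticationInfo.principalEmail` and `protoPayload.resourceName`. The helper name and the sample entries are hypothetical.

```python
from collections import Counter


def summarize_data_access(entries):
    """Count audit log entries per (principal, resource) pair."""
    counts = Counter()
    for entry in entries:
        payload = entry.get("protoPayload", {})
        principal = payload.get("authenticationInfo", {}).get("principalEmail", "unknown")
        resource = payload.get("resourceName", "unknown")
        counts[(principal, resource)] += 1
    return counts


# Hypothetical exported log entries for illustration only.
sample_entries = [
    {"protoPayload": {
        "authenticationInfo": {"principalEmail": "analytics-pipeline@my-project.iam.gserviceaccount.com"},
        "resourceName": "projects/my-project/datasets/analytics_sales"}},
    {"protoPayload": {
        "authenticationInfo": {"principalEmail": "analytics-pipeline@my-project.iam.gserviceaccount.com"},
        "resourceName": "projects/my-project/datasets/analytics_sales"}},
    {"protoPayload": {
        "authenticationInfo": {"principalEmail": "alice@example.com"},
        "resourceName": "projects/my-project/datasets/sensitive_hr"}},
]

for (principal, resource), count in summarize_data_access(sample_entries).most_common():
    print(f"{principal} accessed {resource} {count} time(s)")
```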

💬

Common Interview Questions

Q1: What is the difference between a service account and a user account?

Answer: User accounts represent humans and authenticate via OAuth/interactive login. Service accounts represent applications and authenticate with keys or with short-lived tokens obtained from the metadata server. Service accounts don't have passwords, can't be used interactively, and are designed for automation. Data pipelines should always use service accounts, not user accounts.

Q2: How do you secure service account keys?

Answer: Never store keys in source code or environment variables. Store keys in Secret Manager, rotate them on a fixed schedule (for example, every 90 days; user-managed keys are not rotated automatically, so this has to be scripted), restrict key creation permissions, and monitor key usage via audit logs. Better yet, use Workload Identity Federation to eliminate keys entirely for supported workloads.

Q3: Explain the principle of least privilege in the context of data engineering.

Answer: Each service account should have only the minimum permissions required. An ingestion service account should only be able to publish to Pub/Sub, not read BigQuery data. A processing service account should only be able to read input data and write to specific output locations. This limits blast radius if a service account is compromised.

Q4: When would you use IAM Conditions?

Answer: IAM Conditions are useful for time-based access (allow queries only during business hours), resource-based access (allow access only to specific datasets), and environment-based access (allow production access only from specific VPCs). They complement IAM roles by adding fine-grained restrictions without creating custom roles.

Q5: What is Workload Identity Federation and why is it important?

Answer: Workload Identity Federation allows external identity providers (AWS, Azure, GitHub, etc.) to access GCP resources using OIDC tokens instead of service account keys. This eliminates the security risk of long-lived credentials and is recommended by Google's security best practices for CI/CD pipelines and multi-cloud architectures.
