IAM Fundamentals for Data Engineers
Identity and Access Management (IAM) is the backbone of GCP security. For data engineers, IAM controls who can access data, run pipelines, and manage infrastructure. Misconfigured IAM is one of the most common causes of data exposure in cloud environments.
IAM Hierarchy
IAM Roles for Data Engineers
GCP provides three types of IAM roles:
| Role Type | Description | Example |
|---|---|---|
| Basic Roles | Owner, Editor, Viewer | roles/viewer (too broad for data engineering) |
| Predefined Roles | Service-specific | roles/bigquery.dataEditor |
| Custom Roles | User-defined | Combine specific permissions |
⚠️
Security Warning: Never use basic roles (Owner/Editor/Viewer) for data engineering service accounts. These roles are overly permissive and violate the principle of least privilege. Always use predefined or custom roles.
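One way to act on this warning is to scan an exported IAM policy for basic roles held by service accounts. The following is a minimal sketch, assuming the policy has the JSON shape printed by `gcloud projects get-iam-policy --format=json`. The function name `find_basic_role_bindings` and the sample policy are hypothetical.

```python
# Basic roles that should not be granted to pipeline service accounts.
BASIC_ROLES = {"roles/owner", "roles/editor", "roles/viewer"}


def find_basic_role_bindings(policy):
    """Return (member, role) pairs where a service account holds a basic role."""
    findings = []
    for binding in policy.get("bindings", []):
        role = binding.get("role")
        if role not in BASIC_ROLES:
            continue
        for member in binding.get("members", []):
            # Only service accounts are flagged; human users are out of scope here.
            if member.startswith("serviceAccount:"):
                findings.append((member, role))
    return findings


# Hypothetical policy used only for illustration.
sample_policy = {
    "bindings": [
        {
            "role": "roles/editor",
            "members": [
                "serviceAccount:etl@my-project.iam.gserviceaccount.com",
                "user:alice@example.com",
            ],
        },
        {
            "role": "roles/bigquery.dataEditor",
            "members": ["serviceAccount:etl@my-project.iam.gserviceaccount.com"],
        },
    ]
}

for member, role in find_basic_role_bindings(sample_policy):
    print(f"{member} holds basic role {role}")
```

A check like this could run in CI against a policy export, so that a basic role on a service account fails the build rather than reaching production.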
Essential Data Engineering IAM Roles
# Predefined IAM roles for data engineering pipelines
Data Ingestion:
  - roles/storage.objectAdmin        # GCS read/write
  - roles/pubsub.publisher           # Publish to Pub/Sub
  - roles/pubsub.subscriber          # Subscribe to Pub/Sub
Data Processing:
  - roles/dataflow.developer         # Create/manage Dataflow jobs
  - roles/dataproc.editor            # Manage Dataproc clusters
  - roles/cloudfunctions.invoker     # Invoke Cloud Functions
Data Storage:
  - roles/bigquery.dataEditor        # Read/write BigQuery datasets
  - roles/bigquery.jobUser           # Run BigQuery queries
  - roles/bigtable.admin             # Full Bigtable access
Orchestration:
  - roles/composer.worker            # Cloud Composer operations
  - roles/iam.serviceAccountUser     # Act as service accounts
Monitoring:
  - roles/monitoring.metricWriter    # Write custom metrics
  - roles/logging.logWriter          # Write logs
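To show how a role list like this turns into policy bindings, here is a small sketch that maps pipeline stages to roles and produces binding dictionaries in the IAM policy shape (`role` plus `members`). The `STAGE_ROLES` grouping, the function name `build_bindings`, and the account names are assumptions chosen for illustration. A real pipeline may need a different subset of roles.

```python
# Hypothetical mapping from pipeline stage to the roles it needs.
STAGE_ROLES = {
    "ingestion": ["roles/storage.objectAdmin", "roles/pubsub.publisher"],
    "processing": ["roles/dataflow.developer", "roles/bigquery.jobUser"],
    "monitoring": ["roles/monitoring.metricWriter", "roles/logging.logWriter"],
}


def build_bindings(project_id, stage_accounts):
    """Build IAM bindings from a {stage: service_account_id} mapping."""
    members_by_role = {}
    for stage, account_id in stage_accounts.items():
        member = f"serviceAccount:{account_id}@{project_id}.iam.gserviceaccount.com"
        for role in STAGE_ROLES[stage]:
            members_by_role.setdefault(role, []).append(member)
    # One binding per role, sorted so the output is stable across runs.
    return [
        {"role": role, "members": sorted(members)}
        for role, members in sorted(members_by_role.items())
    ]


bindings = build_bindings("my-project", {"ingestion": "ingestion-pipeline"})
for binding in bindings:
    print(binding["role"], binding["members"])
```

Keeping the mapping in one place makes it easy to review which stage holds which role before anything is applied to a project.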
Service Accounts Deep Dive
Service accounts are special accounts used by applications, not humans. They authenticate to GCP APIs and are essential for data pipeline security.
Types of Service Accounts
Creating Service Accounts for Data Pipelines
from google.cloud import iam_admin_v1


def create_service_account(project_id, account_id, display_name):
    """Create a service account for a data pipeline."""
    client = iam_admin_v1.IAMClient()
    request = iam_admin_v1.CreateServiceAccountRequest(
        name=f"projects/{project_id}",
        account_id=account_id,
        service_account=iam_admin_v1.ServiceAccount(
            display_name=display_name,
            description="Service account for data engineering pipeline",
        ),
    )
    service_account = client.create_service_account(request=request)
    print(f"Created service account: {service_account.email}")
    return service_account


# Create separate service accounts for each pipeline stage
create_service_account(
    "my-project",
    "ingestion-pipeline",
    "Data Ingestion Pipeline SA",
)
create_service_account(
    "my-project",
    "processing-pipeline",
    "Data Processing Pipeline SA",
)
create_service_account(
    "my-project",
    "analytics-pipeline",
    "Analytics Pipeline SA",
)
✨
Best Practice: Create separate service accounts for each pipeline stage (ingestion, processing, storage, analytics). This provides fine-grained access control and audit trails. If one account is compromised, the blast radius is limited.
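When generating one account per stage, it helps to validate the account IDs before calling the API. The sketch below assumes the documented ID rule (6 to 30 characters, lowercase letters, digits, and hyphens, starting with a letter) and the standard `ACCOUNT_ID@PROJECT_ID.iam.gserviceaccount.com` email format. The helper name `service_account_email` is hypothetical.

```python
import re

# 6-30 characters: a lowercase letter, then letters/digits/hyphens,
# ending with a letter or digit.
ACCOUNT_ID_PATTERN = re.compile(r"^[a-z][-a-z0-9]{4,28}[a-z0-9]$")


def service_account_email(project_id, account_id):
    """Validate an account ID and return the email the account would get."""
    if not ACCOUNT_ID_PATTERN.match(account_id):
        raise ValueError(f"Invalid service account ID: {account_id!r}")
    return f"{account_id}@{project_id}.iam.gserviceaccount.com"


for stage in ("ingestion", "processing", "analytics"):
    print(service_account_email("my-project", f"{stage}-pipeline"))
```

The emails produced here are the values that appear as `serviceAccount:` members in IAM bindings and as principals in audit logs, which is what makes per-stage accounts traceable.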
Workload Identity Federation
Workload Identity Federation allows external identity providers (AWS, Azure, GitHub Actions, etc.) to access GCP resources without service account keys. This eliminates the security risk of long-lived credentials.
Architecture
Setup Example: GitHub Actions → GCP
# Step 1: Create Workload Identity Pool
gcloud iam workload-identity-pools create "github-pool" \
  --location="global" \
  --display-name="GitHub Actions Pool"
# Step 2: Create Provider for GitHub
# (the attribute condition rejects tokens from repositories outside my-org)
gcloud iam workload-identity-pools providers create-oidc "github-provider" \
  --workload-identity-pool="github-pool" \
  --location="global" \
  --display-name="GitHub Provider" \
  --attribute-mapping="google.subject=assertion.sub,attribute.repository=assertion.repository" \
  --attribute-condition="assertion.repository_owner == 'my-org'" \
  --issuer-uri="https://token.actions.githubusercontent.com"
# Step 3: Allow GitHub repo to impersonate service account
# (123456789 is the project number, not the project ID)
gcloud iam service-accounts add-iam-policy-binding "data-pipeline@my-project.iam.gserviceaccount.com" \
  --role="roles/iam.workloadIdentityUser" \
  --member="principalSet://iam.googleapis.com/projects/123456789/locations/global/workloadIdentityPools/github-pool/attribute.repository/my-org/my-repo"
# GitHub Actions workflow using Workload Identity
name: Deploy Data Pipeline
on:
  push:
    branches: [main]
permissions:
  id-token: write   # Required for Workload Identity
  contents: read
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: google-github-actions/auth@v2
        with:
          workload_identity_provider: 'projects/123456789/locations/global/workloadIdentityPools/github-pool/providers/github-provider'
          service_account: 'data-pipeline@my-project.iam.gserviceaccount.com'
      - uses: google-github-actions/setup-gcloud@v2
      - name: Deploy Dataflow Job
        run: |
          gcloud dataflow jobs run my-pipeline \
            --gcs-location=gs://my-bucket/templates/pipeline \
            --parameters=input=gs://input-bucket/data/
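The provider name used in the workflow and the `principalSet://` member used in the policy binding are long strings that are easy to mistype. Both are built from the project number (not the project ID), the pool ID, and either the provider ID or a mapped attribute. The sketch below assembles them with two hypothetical helpers, `provider_resource_name` and `repository_principal_set`, assuming a pool in the `global` location and the `attribute.repository` mapping shown above.

```python
def provider_resource_name(project_number, pool_id, provider_id):
    """Full resource name of a workload identity pool provider."""
    return (
        f"projects/{project_number}/locations/global/"
        f"workloadIdentityPools/{pool_id}/providers/{provider_id}"
    )


def repository_principal_set(project_number, pool_id, repository):
    """IAM member matching every identity whose mapped repository attribute equals `repository`."""
    return (
        "principalSet://iam.googleapis.com/"
        f"projects/{project_number}/locations/global/"
        f"workloadIdentityPools/{pool_id}/attribute.repository/{repository}"
    )


print(provider_resource_name("123456789", "github-pool", "github-provider"))
print(repository_principal_set("123456789", "github-pool", "my-org/my-repo"))
```

Generating both strings from the same inputs keeps the workflow file and the IAM binding in agreement when the pool or repository changes.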
IAM Conditions for Data Engineering
IAM Conditions allow you to restrict access based on resource attributes, time, or request attributes. This is powerful for data engineering scenarios.
# Example: Restrict BigQuery access by time and resource
# Condition: Only allow access during business hours (09:00-17:59 New York time)
condition = {
    "expression": "request.time.getHours('America/New_York') >= 9 && request.time.getHours('America/New_York') <= 17",
    "title": "Business hours only",
    "description": "Restrict data access to business hours",
}
# Condition: Only allow access to specific datasets
# (BigQuery resource names include the real project ID, here my-project)
dataset_condition = {
    "expression": "resource.name.startsWith('projects/my-project/datasets/analytics_')",
    "title": "Analytics datasets only",
    "description": "Only allow access to analytics_* datasets",
}
# Condition: Restrict by time AND resource
# (getDayOfWeek returns a number: 0 = Sunday through 6 = Saturday)
combined_condition = {
    "expression": "resource.name.startsWith('projects/my-project/datasets/sensitive_') && request.time.getDayOfWeek('America/New_York') >= 1 && request.time.getDayOfWeek('America/New_York') <= 5",
    "title": "Sensitive data business days only",
    "description": "Access sensitive datasets only on weekdays",
}
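A condition only takes effect once it is attached to a role binding, and a policy that contains conditional bindings must be written with policy version 3. The sketch below shows that step on a plain policy dictionary. The helper name `add_conditional_binding` and the member are hypothetical, and fetching or writing the policy (for example with `gcloud` or a client library) is left out.

```python
import copy


def add_conditional_binding(policy, role, member, condition):
    """Return a copy of `policy` with one conditional binding added."""
    updated = copy.deepcopy(policy)
    updated.setdefault("bindings", []).append(
        {"role": role, "members": [member], "condition": dict(condition)}
    )
    # Conditional bindings require policy version 3.
    updated["version"] = 3
    return updated


business_hours = {
    "title": "Business hours only",
    "description": "Restrict data access to business hours",
    "expression": (
        "request.time.getHours('America/New_York') >= 9 && "
        "request.time.getHours('America/New_York') <= 17"
    ),
}

policy = add_conditional_binding(
    {"version": 1, "bindings": []},
    "roles/bigquery.dataViewer",
    "serviceAccount:analytics-pipeline@my-project.iam.gserviceaccount.com",
    business_hours,
)
print(policy["version"], policy["bindings"][0]["condition"]["title"])
```

Working on a copy leaves the policy that was read untouched, which makes it easier to diff the old and new versions before writing anything back.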
ℹ️
Pro Tip: Use IAM Conditions to enforce data access policies like time-based restrictions, dataset-level permissions, and environment-based access. Combined with audit logs, this provides a comprehensive data governance framework.
Service Account Key Management
Service account keys are among the most sensitive credentials in GCP. User-managed keys do not expire by default, so anyone who obtains one can act with all of the service account's permissions until the key is deleted.
Key Rotation Strategy
from google.api_core import exceptions
from google.cloud import iam_admin_v1, secretmanager


def rotate_service_account_key(project_id, service_account_email):
    """Create a new service account key and store it in Secret Manager."""
    iam_client = iam_admin_v1.IAMClient()
    # Create new key
    request = iam_admin_v1.CreateServiceAccountKeyRequest(
        name=f"projects/{project_id}/serviceAccounts/{service_account_email}",
        key_algorithm=iam_admin_v1.ServiceAccountKeyAlgorithm.KEY_ALG_RSA_2048,
    )
    new_key = iam_client.create_service_account_key(request=request)

    # Store new key in Secret Manager
    sm_client = secretmanager.SecretManagerClient()
    secret_id = f"{service_account_email.split('@')[0]}-key"
    secret_name = sm_client.secret_path(project_id, secret_id)
    # Create the secret on first use; later rotations only add a version
    try:
        sm_client.create_secret(
            request={
                "parent": f"projects/{project_id}",
                "secret_id": secret_id,
                "secret": {"replication": {"automatic": {}}},
            }
        )
    except exceptions.AlreadyExists:
        pass
    # Store key data as a new secret version
    sm_client.add_secret_version(
        request={
            "parent": secret_name,
            "payload": {"data": new_key.private_key_data},
        }
    )
    print(f"New key created and stored in Secret Manager: {secret_id}")
    # Note: the previous key stays valid until it is deleted separately
    return new_key
# Rotate keys every 90 days (recommended)
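To decide which keys are due, the 90-day rule can be applied to a key listing. This is a sketch under stated assumptions: the input is a list of dictionaries with the `name`, `keyType`, and `validAfterTime` fields as printed by `gcloud iam service-accounts keys list --format=json`, and timestamps look like `2024-01-01T00:00:00Z`. The function name `keys_due_for_rotation` and the sample keys are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def keys_due_for_rotation(keys, now=None, max_age_days=90):
    """Return names of user-managed keys created more than `max_age_days` ago."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    due = []
    for key in keys:
        # Google-managed keys are rotated by Google and are skipped.
        if key.get("keyType") != "USER_MANAGED":
            continue
        created = datetime.fromisoformat(key["validAfterTime"].replace("Z", "+00:00"))
        if created < cutoff:
            due.append(key["name"])
    return due


# Hypothetical key listing used only for illustration.
sample_keys = [
    {"name": "keys/old", "keyType": "USER_MANAGED", "validAfterTime": "2024-01-01T00:00:00Z"},
    {"name": "keys/new", "keyType": "USER_MANAGED", "validAfterTime": "2024-05-20T00:00:00Z"},
    {"name": "keys/sys", "keyType": "SYSTEM_MANAGED", "validAfterTime": "2023-01-01T00:00:00Z"},
]
reference_time = datetime(2024, 6, 1, tzinfo=timezone.utc)
print(keys_due_for_rotation(sample_keys, now=reference_time))
```

Passing `now` explicitly keeps the check deterministic in tests. A scheduled job would omit it and use the current time.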
Disable Default Service Account
# Disable the default compute service account in production
gcloud iam service-accounts disable \
  PROJECT_NUMBER-compute@developer.gserviceaccount.com \
  --project=PROJECT_ID
# Create and use dedicated service accounts instead
gcloud iam service-accounts create data-pipeline \
  --display-name="Data Pipeline Service Account" \
  --project=PROJECT_ID
IAM Audit Logging for Data Engineering
Audit logs are essential for compliance and debugging data access patterns.
# Build Data Access audit log settings for BigQuery and GCS
def setup_audit_logging(project_id):
    """Build auditConfigs entries for data engineering services.

    Data Access audit logs are configured in the project's IAM policy
    (the auditConfigs field), not through the Cloud Logging API.
    """
    # BigQuery data access audit config
    bigquery_audit_config = {
        "service": "bigquery.googleapis.com",
        "auditLogConfigs": [
            {"logType": "DATA_READ", "exemptedMembers": []},
            {"logType": "DATA_WRITE", "exemptedMembers": []},
        ],
    }
    # GCS audit config
    gcs_audit_config = {
        "service": "storage.googleapis.com",
        "auditLogConfigs": [
            {"logType": "DATA_READ", "exemptedMembers": []},
        ],
    }
    # Apply audit configurations by merging them into the IAM policy
    # (In practice, use Terraform or gcloud for production)
    audit_configs = [bigquery_audit_config, gcs_audit_config]
    print(f"Audit configs prepared for BigQuery and GCS in {project_id}")
    return audit_configs
✨
Best Practice: Capture DATA_READ and DATA_WRITE audit logs for compliance. BigQuery writes Data Access logs by default, while most other services, including Cloud Storage, need them enabled explicitly. These logs capture who queried what data and when. Store audit logs in a separate project to prevent tampering. Use log sinks to export to BigQuery for analysis.
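To illustrate the "who queried what data and when" part, the sketch below reduces exported audit log entries to those three facts. It assumes entries in the JSON form of Cloud Audit Logs, where the caller is in `protoPayload.authenticationInfo.principalEmail` and Data Access logs have a `logName` ending in `cloudaudit.googleapis.com%2Fdata_access`. The function name `summarize_data_access` and the sample entries are hypothetical.

```python
DATA_ACCESS_SUFFIX = "cloudaudit.googleapis.com%2Fdata_access"


def summarize_data_access(entries):
    """Extract who/what/when from Data Access audit log entries."""
    rows = []
    for entry in entries:
        # Skip Admin Activity and other log streams.
        if not entry.get("logName", "").endswith(DATA_ACCESS_SUFFIX):
            continue
        payload = entry.get("protoPayload", {})
        rows.append(
            {
                "when": entry.get("timestamp"),
                "who": payload.get("authenticationInfo", {}).get("principalEmail"),
                "method": payload.get("methodName"),
                "resource": payload.get("resourceName"),
            }
        )
    return rows


# Hypothetical log entries used only for illustration.
sample_entries = [
    {
        "logName": "projects/my-project/logs/cloudaudit.googleapis.com%2Fdata_access",
        "timestamp": "2024-06-01T12:00:00Z",
        "protoPayload": {
            "authenticationInfo": {
                "principalEmail": "analytics-pipeline@my-project.iam.gserviceaccount.com"
            },
            "methodName": "google.cloud.bigquery.v2.JobService.InsertJob",
            "resourceName": "projects/my-project/jobs/job_123",
        },
    },
    {
        "logName": "projects/my-project/logs/cloudaudit.googleapis.com%2Factivity",
        "timestamp": "2024-06-01T12:05:00Z",
        "protoPayload": {"methodName": "SetIamPolicy"},
    },
]

for row in summarize_data_access(sample_entries):
    print(row["when"], row["who"], row["method"], row["resource"])
```

The same fields are available as columns once a log sink exports the entries to BigQuery, so this kind of summary can also be written as a query.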
Common Interview Questions
Q1: What is the difference between a service account and a user account?
Answer: User accounts represent humans and authenticate via OAuth/interactive login. Service accounts represent applications and authenticate with short-lived tokens from the metadata server, through impersonation, or, as a last resort, with service account keys. Service accounts don't have passwords, can't be used interactively, and are designed for automation. Data pipelines should always use service accounts, not user accounts.
Q2: How do you secure service account keys?
Answer: Never store keys in source code or environment variables. Use Secret Manager to store keys, rotate them on a schedule (for example every 90 days) and delete the old key after each rotation, restrict key creation permissions, and monitor key usage via audit logs. Better yet, use Workload Identity Federation to eliminate keys entirely for supported workloads.
Q3: Explain the principle of least privilege in the context of data engineering.
Answer: Each service account should have only the minimum permissions required. An ingestion service account should only be able to publish to Pub/Sub, not read BigQuery data. A processing service account should only be able to read input data and write to specific output locations. This limits blast radius if a service account is compromised.
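A quick way to demonstrate this in an interview or a review is to compare the roles an account holds with the roles it needs. The sketch below is illustrative only: the function name `excess_roles` and the role lists are assumptions, not a prescribed set for any real pipeline.

```python
def excess_roles(granted, required):
    """Return roles an account holds beyond what it needs, sorted for stable output."""
    return sorted(set(granted) - set(required))


# Hypothetical ingestion account that was also given BigQuery read access.
granted_roles = ["roles/pubsub.publisher", "roles/bigquery.dataViewer"]
required_roles = ["roles/pubsub.publisher"]

print(excess_roles(granted_roles, required_roles))
```

Any role returned here is a candidate for removal, since it widens the blast radius without serving the pipeline stage.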
Q4: When would you use IAM Conditions?
Answer: IAM Conditions are useful for time-based access (allow queries only during business hours), resource-based access (allow access only to specific datasets), and environment-based access (allow production access only from specific VPCs). They complement IAM roles by adding fine-grained restrictions without creating custom roles.
Q5: What is Workload Identity Federation and why is it important?
Answer: Workload Identity Federation allows external identity providers (AWS, Azure, GitHub, etc.) to access GCP resources using OIDC tokens instead of service account keys. This eliminates the security risk of long-lived credentials and is recommended by Google's security best practices for CI/CD pipelines and multi-cloud architectures.