Data Governance on GCP

Master data governance on GCP including Dataplex, Cloud DLP, policy tags, access controls, and compliance frameworks.

18 min read · Advanced

Data Governance Framework on GCP

🛡️ GCP Security Architecture for Data Engineering

Interview Tip: GCP follows a shared responsibility model: Google secures the infrastructure, and you secure your data. Encryption at rest is on by default; add CMEK where you need control over the keys for sensitive data, use VPC Service Controls to limit data exfiltration, and use Cloud DLP to classify and protect PII.

Policy Tags for Column-Level Security

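from google.cloud import datacatalog_v1

# Policy tag taxonomies are managed through the Data Catalog policy tag API
client = datacatalog_v1.PolicyTagManagerClient()

# Create the taxonomy that will hold the classification tags
taxonomy = client.create_taxonomy(
    request={
        "parent": "projects/my-project/locations/us-central1",
        "taxonomy": {
            "display_name": "Data Classification",
            "description": "Policy tags for data classification",
            # Required for the tags to enforce column-level access in BigQuery
            "activated_policy_types": [
                datacatalog_v1.Taxonomy.PolicyType.FINE_GRAINED_ACCESS_CONTROL
            ],
        },
    }
)

# Policy tags are created one at a time under the taxonomy
levels = [
    ("Public", "Non-sensitive data"),
    ("Internal", "Internal use only"),
    ("Confidential", "Sensitive data requiring protection"),
    ("Restricted", "Highly sensitive PII/PHI"),
]

for display_name, description in levels:
    policy_tag = client.create_policy_tag(
        request={
            "parent": taxonomy.name,
            "policy_tag": {
                "display_name": display_name,
                "description": description,
            },
        }
    )
    # The resource name is what gets attached to a BigQuery column
    print(display_name, policy_tag.name)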

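# Apply a policy tag to a BigQuery column.
# Policy tags are attached through the table schema; this uses the BigQuery Python client.
from google.cloud import bigquery

bq = bigquery.Client()
table = bq.get_table("project.dataset.users")
tag = "projects/my-project/locations/us-central1/taxonomies/123/policyTags/456"

# Rebuild the schema, tagging only the email column
table.schema = [
    bigquery.SchemaField(name=f.name, field_type=f.field_type, mode=f.mode,
                         description=f.description,
                         policy_tags=bigquery.PolicyTagList(names=[tag]))
    if f.name == "email" else f
    for f in table.schema
]
bq.update_table(table, ["schema"])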

Cloud DLP Integration

from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()

# Inspect data for sensitive information
def inspect_data(project_id, content):
    """Inspect data for PII."""
    parent = f"projects/{project_id}"

    inspect_config = {
        "info_types": [
            {"name": "EMAIL_ADDRESS"},
            {"name": "PHONE_NUMBER"},
            {"name": "CREDIT_CARD_NUMBER"},
            {"name": "US_SOCIAL_SECURITY_NUMBER"}
        ],
        "min_likelihood": "LIKELY"
    }

    response = client.inspect_content(
        request={
            "parent": parent,
            "inspect_config": inspect_config,
            "item": {"value": content}
        }
    )

    # Each finding carries the matched infoType, its likelihood, and its location
    return response.result.findings

# De-identify sensitive data
def deidentify_data(project_id, content):
    """De-identify sensitive data."""
    parent = f"projects/{project_id}"

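    deidentify_config = {
        "info_type_transformations": {
            "transformations": [
                {
                    "info_types": [{"name": "EMAIL_ADDRESS"}],
                    "primitive_transformation": {
                        "character_mask_config": {
                            "masking_character": "*",
                            # 0 means every character of the match is masked
                            "number_to_mask": 0,
                            "reverse_order": False
                        }
                    }
                }
            ]
        }
    }

    response = client.deidentify_content(
        request={
            "parent": parent,
            "deidentify_config": deidentify_config,
            # Tell DLP which infoTypes to detect before masking
            "inspect_config": {"info_types": [{"name": "EMAIL_ADDRESS"}]},
            "item": {"value": content}
        }
    )

    # The de-identified text is on the response itself, not under `result`
    return response.item.value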
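
The findings from an inspect call are usually aggregated before anyone acts on them. The sketch below counts findings per infoType. It assumes each finding exposes `info_type.name`, as DLP findings do, and uses stand-in objects instead of a live API response so it runs without credentials; `count_by_info_type` and `fake_finding` are hypothetical helper names.

```python
from collections import Counter
from types import SimpleNamespace


def count_by_info_type(findings):
    """Count findings per infoType name."""
    return Counter(finding.info_type.name for finding in findings)


def fake_finding(name):
    """Stand-in shaped like a DLP finding (illustration only)."""
    return SimpleNamespace(info_type=SimpleNamespace(name=name))


sample_findings = [
    fake_finding("EMAIL_ADDRESS"),
    fake_finding("PHONE_NUMBER"),
    fake_finding("EMAIL_ADDRESS"),
]

counts = count_by_info_type(sample_findings)
for info_type, total in counts.most_common():
    print(f"{info_type}: {total}")
```

With a real response, the same function could be given the findings list returned by the inspect call.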

Best Practice: Implement a data classification framework: Public, Internal, Confidential, Restricted. Use policy tags for column-level security in BigQuery. Apply Cloud DLP for automated PII detection. Enable audit logging for compliance. Review access controls quarterly.
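
One way to connect automated PII detection to the four-level framework is to map each detected infoType to a level and keep the strictest one per column. The sketch below is illustrative only: the mapping in `INFO_TYPE_LEVEL` and the default of Internal are assumptions, not a Google-defined scheme, and should be replaced by your own policy.

```python
from enum import IntEnum


class Level(IntEnum):
    """Classification levels, ordered from least to most sensitive."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


# Hypothetical mapping from DLP infoType names to levels; adjust to your policy
INFO_TYPE_LEVEL = {
    "EMAIL_ADDRESS": Level.CONFIDENTIAL,
    "PHONE_NUMBER": Level.CONFIDENTIAL,
    "CREDIT_CARD_NUMBER": Level.RESTRICTED,
    "US_SOCIAL_SECURITY_NUMBER": Level.RESTRICTED,
}


def classify_column(info_types, default=Level.INTERNAL):
    """Return the strictest level implied by the infoTypes found in a column."""
    levels = [INFO_TYPE_LEVEL.get(name, default) for name in info_types]
    return max(levels, default=default)


print(classify_column(["EMAIL_ADDRESS", "CREDIT_CARD_NUMBER"]).name)
print(classify_column([]).name)
```

The resulting level name can then be used to pick which policy tag to attach to the column.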

Common Interview Questions

Q1: What are the key components of data governance?

Answer: 1) Data quality management, 2) Data security and access control, 3) Data lineage tracking, 4) Data cataloging and discovery, 5) Compliance management, 6) Data retention policies, 7) Privacy protection.

Q2: How do you implement column-level security in BigQuery?

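Answer: Create a taxonomy of policy tags, one per sensitivity level, and attach the tags to columns. Grant the Data Catalog Fine-Grained Reader role on a policy tag to the principals who may read those columns. For anyone without that role, a query that references a tagged column fails with an access-denied error; returning NULL or masked values instead requires a separate data masking policy. The levels are user-defined: Public, Internal, Confidential, and Restricted are a common convention, not built-in tiers.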
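
The IAM step can be sketched as follows. The helper `fine_grained_reader_binding` and the member addresses are hypothetical; the code only builds the binding and does not call any API. Applying it would mean adding the binding to the policy tag's IAM policy, for example through `PolicyTagManagerClient.set_iam_policy`.

```python
# Role that allows reading columns protected by a policy tag
FINE_GRAINED_READER = "roles/datacatalog.categoryFineGrainedReader"


def fine_grained_reader_binding(members):
    """Build an IAM binding granting read access to columns under a policy tag."""
    valid_prefixes = ("user:", "group:", "serviceAccount:")
    for member in members:
        if not member.startswith(valid_prefixes):
            raise ValueError(f"Unsupported IAM member format: {member}")
    # De-duplicate and sort so the binding is stable across runs
    return {"role": FINE_GRAINED_READER, "members": sorted(set(members))}


binding = fine_grained_reader_binding(
    ["user:dpo@example.com", "group:analysts@example.com"]
)
print(binding)
```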

Q3: What is Cloud DLP and when should you use it?

Answer: Cloud DLP (now part of Sensitive Data Protection) detects, classifies, and de-identifies sensitive data. Use it for: 1) PII detection in data lakes, 2) Data masking for non-production environments, 3) Compliance auditing, 4) Automated classification of sensitive data.

Q4: How do you handle GDPR data deletion requests?

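Answer: 1) Identify all data stores containing the user's data, 2) Delete the records, using soft deletes only as an intermediate step with a short retention period, 3) Account for BigQuery time travel (2 to 7 days) and the 7-day fail-safe period that follows, during which deleted rows remain recoverable, 4) Use Cloud DLP to scan for residual PII, 5) Document the deletion for compliance auditing.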
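
Rows deleted from BigQuery stay recoverable through the time travel window (2 to 7 days, 7 by default) and then a 7-day fail-safe period, so a deletion record is more useful if it notes when the data actually becomes unrecoverable. A minimal sketch of that calculation, with `earliest_unrecoverable` as a hypothetical helper name:

```python
from datetime import datetime, timedelta, timezone

FAIL_SAFE_DAYS = 7  # fail-safe period that follows the time travel window


def earliest_unrecoverable(deleted_at, time_travel_days=7):
    """Return when deleted data leaves both time travel and fail-safe storage."""
    if not 2 <= time_travel_days <= 7:
        raise ValueError("time travel window must be between 2 and 7 days")
    return deleted_at + timedelta(days=time_travel_days + FAIL_SAFE_DAYS)


deleted_at = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(earliest_unrecoverable(deleted_at).date())                      # 2024-01-15
print(earliest_unrecoverable(deleted_at, time_travel_days=2).date())  # 2024-01-10
```

Shortening the dataset's time travel window brings the date forward, at the cost of a shorter recovery window for accidental deletes.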

Q5: What is the purpose of audit logs in data governance?

Answer: Audit logs track who accessed what data and when. They're essential for: 1) Compliance auditing (HIPAA, GDPR), 2) Security incident investigation, 3) Access pattern analysis, 4) Data usage tracking, 5) Policy enforcement verification.
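
As an illustration of "who accessed what", the sketch below builds a Cloud Logging filter for BigQuery Data Access audit entries, optionally narrowed to one principal. The helper name `data_access_filter`, the project ID, and the email are placeholders; the resulting string could be pasted into Logs Explorer or passed to a logging client.

```python
def data_access_filter(project_id, principal_email=None):
    """Build a Cloud Logging filter for BigQuery Data Access audit entries."""
    clauses = [
        # Data Access audit log for the project (the slash is URL-encoded as %2F)
        f'logName="projects/{project_id}/logs/cloudaudit.googleapis.com%2Fdata_access"',
        'protoPayload.serviceName="bigquery.googleapis.com"',
    ]
    if principal_email:
        # Restrict to actions performed by a single identity
        clauses.append(
            f'protoPayload.authenticationInfo.principalEmail="{principal_email}"'
        )
    return " AND ".join(clauses)


log_filter = data_access_filter("my-project", "analyst@example.com")
print(log_filter)
```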
