πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Cloud DLP: De-identification, Risk Analysis & Templates

GCP Data EngineeringCloud DLP⭐ Premium

Advertisement

Cloud DLP Deep Dive

Master Cloud DLP including sensitive data detection, de-identification, risk analysis, and compliance patterns.

16 min readIntermediate

Cloud DLP Architecture

πŸ›‘οΈ GCP Security Architecture for Data Engineering
GCP Security: Defense-in-Depth for Data EngineeringENCRYPTIONAt RestAES-256, auto-rotated by defaultIn TransitTLS 1.2+, internal Google networkCMEKCustomer-managed encryption keysCSEKCustomer-supplied encryption keysVPC SERVICE CONTROLSService PerimetersBoundaries around GCP servicesAccess LevelsIP, device, identity conditionsDry-run ModeTest before enforcementBridge PerimetersCross-perimeter accessCloud Armorβ€’ DDoS Protectionβ€’ WAF Rules (OWASP)β€’ IP Allow/Deny Listsβ€’ Adaptive ProtectionCloud DLP APIβ€’ Data Classificationβ€’ Sensitive Data Detectionβ€’ De-identificationβ€’ InfoType TemplatesCloud KMSβ€’ Key Rings & Keysβ€’ HSM / External KMSβ€’ Key Rotationβ€’ IAM for KeysSHARED RESPONSIBILITY MODELGoogle Manages: Physical security, network, hypervisorYou Manage: Data, IAM, configs, application codeDATA ENGINEERING SECURITY CHECKLISTβœ“ Enable CMEK for BigQueryβœ“ VPC-SC for data projectsβœ“ DLP for sensitive dataβœ“ Audit logs enabled
Interview Tip: GCP follows a shared responsibility model β€” Google secures the infrastructure, you secure your data. Enable encryption at rest (default), use CMEK for sensitive data, implement VPC Service Controls for data exfiltration prevention, and use Cloud DLP to classify and protect PII.

Implementation

from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()

# Inspect data for PII
def inspect_data(project_id, content):
    """Inspect data for sensitive information."""
    parent = f"projects/{project_id}"

    inspect_config = {
        "info_types": [
            {"name": "EMAIL_ADDRESS"},
            {"name": "PHONE_NUMBER"},
            {"name": "CREDIT_CARD_NUMBER"},
            {"name": "US_SOCIAL_SECURITY_NUMBER"},
            {"name": "PERSON_NAME"}
        ],
        "min_likelihood": "LIKELY",
        "max_findings": 100
    }

    response = client.inspect_content(
        request={
            "parent": parent,
            "inspect_config": inspect_config,
            "item": {"value": content}
        }
    )

    return response.result.info_type_inspectations

# De-identify sensitive data
def deidentify_data(project_id, content):
    """De-identify sensitive data."""
    parent = f"projects/{project_id}"

    deidentify_config = {
        "info_type_transformations": {
            "transformations": [
                {
                    "info_types": [{"name": "EMAIL_ADDRESS"}],
                    "primitive_transformation": {
                        "character_mask_config": {
                            "masking_character": "*",
                            "number_to_mask": 0,
                            "reverse_order": False
                        }
                    }
                },
                {
                    "info_types": [{"name": "PHONE_NUMBER"}],
                    "primitive_transformation": {
                        "replace_config": {
                            "new_value": {
                                "string_value": "[PHONE REDACTED]"
                            }
                        }
                    }
                }
            ]
        }
    }

    response = client.deidentify_content(
        request={
            "parent": parent,
            "deidentify_config": deidentify_config,
            "item": {"value": content}
        }
    )

    return response.result.item.value

# Risk analysis
def analyze_risk(project_id, table_reference):
    """Analyze re-identification risk."""
    parent = f"projects/{project_id}"

    risk_analysis_config = {
        "quasi_ids": [
            {"name": "age", "type_": "INTEGER"},
            {"name": "zip_code", "type_": "STRING"},
            {"name": "gender", "type_": "STRING"}
        ],
        "numeric_stats_result_method": "MEAN",
        "categorical_stats_result_method": "KY_ANONYMITY",
        "k_anonymity_config": {
            "quasi_ids_field": [
                {"name": "age"},
                {"name": "zip_code"},
                {"name": "gender"}
            ]
        }
    }

    response = client.reidentify_content(
        request={
            "parent": parent,
            "reidentify_config": deidentify_config,
            "item": {"table": table_reference}
        }
    )

    return response.result

✨

Best Practice: Use Cloud DLP to scan data lakes for PII. Implement de-identification for non-production environments. Use risk analysis to identify quasi-identifiers. Create inspection templates for consistency. Schedule regular scans for compliance.

πŸ’¬

Common Interview Questions

Q1: What are the de-identification methods in Cloud DLP?

Answer: 1) Masking (character masking), 2) Redaction (replace with placeholder), 3) Tokenization (reversible), 4) Bucketing (numeric ranges), 5) Crypto hashing (irreversible). Choose based on reversibility and compliance requirements.

Q2: When should you use tokenization vs. masking?

Answer: Tokenization is reversible - use when you need to recover original data (testing, analytics). Masking is irreversible - use for permanent de-identification (GDPR compliance). Tokenization requires a token vault; masking is simpler.

Q3: What is k-anonymity?

Answer: K-anonymity ensures each record is indistinguishable from at least k-1 other records based on quasi-identifiers. Higher k values mean lower re-identification risk. Cloud DLP can calculate k-anonymity for datasets.

Q4: How do you scan BigQuery tables for PII?

Answer: Use Cloud DLP inspection jobs to scan BigQuery tables. Configure info types and minimum likelihood. Results can be exported to BigQuery for analysis. Schedule scans for ongoing monitoring.

Q5: What are the compliance benefits of Cloud DLP?

Answer: 1) Automated PII detection, 2) Consistent de-identification, 3) Risk analysis for quasi-identifiers, 4) Audit logging, 5) Template-based policies, 6) Integration with Dataplex for governance.

Advertisement