πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Data Security & Compliance: Protecting Data Assets

Module 4: Advanced DE & CareerAdvanced Data Engineering🟒 Free Lesson

Advertisement

Data Security: Protecting the Crown Jewels

Data security encompasses the policies, technologies, and practices that protect data from unauthorized access, corruption, and theft throughout its lifecycle.

Why Data Security Matters


The Cost of Data Breaches:

  • Average cost: $4.45M per incident (IBM, 2023)
  • Regulatory fines: up to 4% of global revenue (GDPR, CCPA)
  • Reputation damage and lost trust

Data Engineer Responsibility:

  • Build security into every layer of the data platform
  • Implement encryption, access control, and monitoring
  • Ensure compliance with regulations

Key Insight: Data engineers must build security into every layer of the data platform to protect against unauthorized access, corruption, and theft.


Architecture Overview

Six Defense Layers1. NetworkVPC/SubnetsFirewallsPrivate LinkVPN GatewayPerimeterDDoS ProtectWAF RulesNetwork ACLs2. IdentitySSO / MFARBAC / ABACAPI KeysOAuth 2.0Who are you?What can you do?Least PrivilegeSession Mgmt3. DataAt-Rest EncryptDynamic MaskingTokenizationKey ManagementAES-256-GCMColumn-levelPII ProtectionHSM / KMS4. ApplicationInput ValidationSQL InjectionCSRF ProtectionOutput EncodingSanitizeParameterizeRate LimitCORS Policy5. MonitoringAudit LoggingAnomaly DetectionAlertingSIEM IntegrationCloudTrailQuery PatternsAccess AnomaliesData Exfil Detect6. ComplianceGDPRCCPAHIPAASOXRight to EraseData PortabilityConsent MgmtBreach Notice

Data security is implemented across six defense layers:

Layer 1: Network β€” VPC/Subnets, Firewalls, Private Link Layer 2: Identity β€” SSO/MFA, RBAC/ABAC, API Keys Layer 3: Data β€” Encryption at Rest, Dynamic Masking, Tokenization Layer 4: Application β€” Input Validation, SQL Injection Prevention, CSRF Protection Layer 5: Monitoring β€” Audit Logging, Anomaly Detection, Alerting Layer 6: Compliance β€” GDPR, CCPA, HIPAA


Encryption

Encryption transforms plaintext data into ciphertext using an algorithm and key. At-rest encryption protects stored data; in-transit encryption protects data moving between systems; in-use encryption (confidential computing) protects data during processing.

Encryption Overhead

  • At-Rest: AES-256-GCM, overhead = ~3-5% storage, ~1-2% CPU
  • In-Transit: TLS 1.3, overhead = ~5-10ms latency per connection
  • In-Use: SGX/SEV, overhead = ~10-30% CPU, ~20-50% memory
  • Key Management Cost: HSM = 1βˆ’5/key/month,KMS=1-5/key/month, KMS =1/key + API calls
  • Encryption at Scale: 1 PB Γ— 3% overhead = 3 TB additional storage
-- Snowflake: Column-level encryption
CREATE TABLE customers_encrypted (
    customer_id     INT PRIMARY KEY,
    email           VARBINARY(256) ENCRYPT,  -- Always encrypted
    phone           VARBINARY(128) ENCRYPT,
    full_name       VARCHAR(200),            -- Not encrypted
    ssn_last_four   VARCHAR(4) MASKED WITH (
        FUNCTION = 'partial_mask(0, "***-**-****")'
    )
);

-- BigQuery: Column-level security
CREATE TABLE customers_masked (
    customer_id     INT64,
    email           STRING,
    ssn             STRING
);

-- Apply column-level access policy
CREATE OR REPLACE ROW ACCESS POLICY ssn_policy
ON `project.dataset.customers_masked`
GRANT TO ('role:pii_viewer')
FILTER USING (TRUE);

CREATE OR REPLACE ROW ACCESS POLICY ssn_masking
ON `project.dataset.customers_masked`
GRANT TO ('role:regular_analyst')
FILTER USING (TRUE);

-- Redshift: Encryption at rest
CREATE TABLE customers_secure (
    customer_id     INT ENCODE az64,
    email           VARCHAR(300) ENCODE az64,
    full_name       VARCHAR(200) ENCODE az64
)
ENCRYPTED  -- Enable at-rest encryption
SORTKEY(customer_id);

-- AWS KMS key management
-- aws kms create-key --description "Data warehouse encryption"
-- aws kms create-alias --alias-name alias/data-warehouse --target-key-id <key-id>

Access Control

RBAC vs ABAC ComparisonRBAC (Role-Based Access Control)Users {'\u2192'} Roles {'\u2192'} Permissionsanalyst1analystSELECT on martsengineer1engineerALL on martsadmin1adminALL on DBRole: fixed set of permissions assigned to usersBest for: Simple, static permissionsABAC (Attribute-Based Access Control)User + Resource + Environment {'\u2192'} PolicyUser: deptResource: classEnv: IP rangeIF dept = finance AND class = internalGRANT SELECTIF dept = eng AND IP = internalGRANT ALLDynamic policies based on attributesBest for: Complex, dynamic scenarios

Role-Based Access Control (RBAC) assigns permissions to roles, and users are assigned to roles. Users inherit all permissions of their assigned roles.

Attribute-Based Access Control (ABAC) assigns permissions based on attributes of the user, resource, and environment. Policies evaluate conditions like department, classification, and time of access.

-- Snowflake RBAC Implementation
-- Create roles
CREATE ROLE data_analyst;
CREATE ROLE data_engineer;
CREATE ROLE data_admin;
CREATE ROLE pii_viewer;

-- Grant database access
GRANT USAGE ON DATABASE analytics TO ROLE data_analyst;
GRANT USAGE ON DATABASE analytics TO ROLE data_engineer;
GRANT ALL ON DATABASE analytics TO ROLE data_admin;

-- Grant schema access
GRANT USAGE ON SCHEMA analytics.marts TO ROLE data_analyst;
GRANT ALL ON SCHEMA analytics.marts TO ROLE data_engineer;

-- Grant table access
GRANT SELECT ON ALL TABLES IN SCHEMA analytics.marts TO ROLE data_analyst;
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA analytics.marts TO ROLE data_engineer;

-- Grant PII access
GRANT SELECT ON TABLE analytics.raw.customers TO ROLE pii_viewer;

-- Assign roles to users
GRANT ROLE data_analyst TO USER analyst1;
GRANT ROLE data_engineer TO USER engineer1;
GRANT ROLE pii_viewer TO USER compliance_officer;

-- Row-level security
CREATE ROW ACCESS POLICY region_policy AS (region_col VARCHAR) RETURNS BOOLEAN ->
    CASE
        WHEN CURRENT_ROLE() = 'data_admin' THEN TRUE
        WHEN CURRENT_ROLE() = 'data_analyst' AND region_col = 'US' THEN TRUE
        WHEN CURRENT_ROLE() = 'data_engineer' THEN TRUE
        ELSE FALSE
    END;

-- Apply policy to table
ALTER TABLE fact_orders ADD ROW ACCESS POLICY region_policy ON (region);

-- Dynamic data masking
CREATE MASKING POLICY email_mask AS (val VARCHAR) RETURNS VARCHAR ->
    CASE
        WHEN CURRENT_ROLE() IN ('data_admin', 'pii_viewer') THEN val
        WHEN CURRENT_ROLE() = 'data_analyst' THEN REGEXP_REPLACE(val, '.+@', '***@')
        ELSE '***@***.com'
    END;

ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY email_mask;

-- ABAC: Attribute-based access
CREATE NETWORK POLYGON company_vpcs AS $$
    10.0.0.0/8
    172.16.0.0/12
    192.168.0.0/16
$$;

CREATE ACCESS POLICY abac_policy AS (resource_tags VARIANT) RETURNS BOOLEAN ->
    CASE
        WHEN ARRAY_CONTAINS('public'::VARIANT, resource_tags['classification']) THEN TRUE
        WHEN CURRENT_ROLE() = 'data_admin' THEN TRUE
        WHEN CURRENT_ROLE() = 'data_analyst'
            AND ARRAY_CONTAINS('internal'::VARIANT, resource_tags['classification'])
            AND CURRENT_IP() IN (SELECT value FROM TABLE(split_to_table('10.0.0.0/8,172.16.0.0/12'))) THEN TRUE
        ELSE FALSE
    END;

Data Masking and Tokenization

Data masking transforms sensitive data to protect it while maintaining its format and usability for non-sensitive operations. Types include static masking (permanent), dynamic masking (role-dependent), and deterministic masking (consistent across systems).

Tokenization replaces sensitive data elements with non-sensitive equivalents (tokens) that have no exploitable value. The mapping between tokens and original values is stored in a secure token vault.

# Data Masking Implementation
import hashlib
import re
from typing import Dict, Any

class DataMasker:
    """Apply various masking strategies to sensitive data."""

    def __init__(self, salt: str = "default-salt"):
        self.salt = salt
        self.token_vault: Dict[str, str] = {}

    def mask_email(self, email: str, mask_type: str = "partial") -> str:
        """Mask email address."""
        if mask_type == "full":
            return "redacted@example.com"
        elif mask_type == "partial":
            local, domain = email.split("@")
            return f"{local[0]}***@{domain}"
        elif mask_type == "hash":
            return hashlib.sha256(f"{email}{self.salt}".encode()).hexdigest()[:16] + "@masked.com"
        return email

    def mask_ssn(self, ssn: str, mask_type: str = "last_four") -> str:
        """Mask Social Security Number."""
        if mask_type == "last_four":
            return f"***-**-{ssn[-4:]}"
        elif mask_type == "full":
            return "***-**-****"
        elif mask_type == "hash":
            return hashlib.sha256(f"{ssn}{self.salt}".encode()).hexdigest()[:11]
        return ssn

    def mask_phone(self, phone: str, mask_type: str = "last_four") -> str:
        """Mask phone number."""
        digits = re.sub(r'\D', '', phone)
        if mask_type == "last_four":
            return f"({digits[:3]}) ***-{digits[-4:]}"
        elif mask_type == "full":
            return "(***) ***-****"
        return phone

    def tokenize(self, value: str, category: str) -> str:
        """Replace value with a token, storing mapping in vault."""
        if value in self.token_vault:
            return self.token_vault[value]

        token = hashlib.sha256(f"{value}{self.salt}{category}".encode()).hexdigest()[:16]
        self.token_vault[token] = value
        return token

    def detokenize(self, token: str) -> str:
        """Retrieve original value from token (requires vault access)."""
        return self.token_vault.get(token, "TOKEN_NOT_FOUND")

    def mask_dataframe(self, df, masking_rules: Dict[str, Dict]) -> Any:
        """Apply masking rules to an entire DataFrame."""
        import pandas as pd
        masked_df = df.copy()

        for column, rules in masking_rules.items():
            if column in masked_df.columns:
                mask_type = rules.get("type", "partial")
                if mask_type == "email":
                    masked_df[column] = masked_df[column].apply(
                        lambda x: self.mask_email(str(x), rules.get("level", "partial"))
                    )
                elif mask_type == "ssn":
                    masked_df[column] = masked_df[column].apply(
                        lambda x: self.mask_ssn(str(x), rules.get("level", "last_four"))
                    )
                elif mask_type == "phone":
                    masked_df[column] = masked_df[column].apply(
                        lambda x: self.mask_phone(str(x), rules.get("level", "last_four"))
                    )
                elif mask_type == "hash":
                    masked_df[column] = masked_df[column].apply(
                        lambda x: hashlib.sha256(f"{x}{self.salt}".encode()).hexdigest()[:16]
                    )

        return masked_df

# Usage
masker = DataMasker(salt="company-specific-salt-2025")

print(masker.mask_email("john.doe@company.com", "partial"))  # j***@company.com
print(masker.mask_ssn("123-45-6789", "last_four"))           # ***-**-6789
print(masker.tokenize("john.doe@company.com", "email"))      # Token
print(masker.detokenize("token_value"))                       # Original

GDPR and CCPA Compliance

The General Data Protection Regulation (GDPR) is a European Union regulation that governs the collection, processing, and storage of personal data of EU residents. Key requirements include consent, right to erasure, data portability, and breach notification.

RequirementGDPR ArticleImplementationData Engineer Role
ConsentArt. 6, 7Consent management platformTrack consent metadata
Right to ErasureArt. 17Data deletion pipelineImplement soft/hard delete
Data PortabilityArt. 20Export APIBuild export endpoints
Breach NotificationArt. 33Incident responseMonitor and alert
Data MinimizationArt. 5(1)(c)Schema designOnly collect needed data
Purpose LimitationArt. 5(1)(b)Access controlsEnforce usage policies
Storage LimitationArt. 5(1)(e)Retention policiesAuto-expire data
Privacy by DesignArt. 25ArchitectureBuild privacy in
-- GDPR: Right to Erasure implementation
-- Soft delete approach
CREATE TABLE customer_erasure_log (
    erasure_id     UUID PRIMARY KEY DEFAULT UUID(),
    customer_id    VARCHAR(50) NOT NULL,
    requested_at   TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    completed_at   TIMESTAMP,
    status         VARCHAR(20) DEFAULT 'pending',
    tables_affected INT DEFAULT 0
);

-- Erasure stored procedure
CREATE OR REPLACE PROCEDURE erase_customer_data(customer_id_param VARCHAR)
RETURNS TABLE (tables_affected INT, status VARCHAR)
AS $$
DECLARE
    tables_affected INT := 0;
BEGIN
    -- Log erasure request
    INSERT INTO customer_erasure_log (customer_id, status)
    VALUES (customer_id_param, 'in_progress');

    -- Erase from all tables (soft delete)
    UPDATE dim_customer SET
        full_name = 'REDACTED',
        email = 'REDACTED',
        phone = 'REDACTED',
        address_line1 = 'REDACTED',
        is_current = FALSE,
        valid_to = CURRENT_TIMESTAMP
    WHERE customer_id = customer_id_param;

    tables_affected := tables_affected + 1;

    UPDATE fact_orders SET
        customer_key = NULL
    WHERE customer_key IN (
        SELECT customer_key FROM dim_customer WHERE customer_id = customer_id_param
    );
    tables_affected := tables_affected + 1;

    -- Update erasure log
    UPDATE customer_erasure_log SET
        status = 'completed',
        completed_at = CURRENT_TIMESTAMP,
        tables_affected = tables_affected
    WHERE customer_id = customer_id_param;

    RETURN TABLE(SELECT tables_affected, 'completed');
END;
$$;

-- CCPA: Opt-out of sale
CREATE TABLE customer_opt_out (
    customer_id    VARCHAR(50) PRIMARY KEY,
    opt_out_date   TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    opt_out_method VARCHAR(50),
    is_opted_out   BOOLEAN DEFAULT TRUE
);

-- Data retention policy
CREATE OR REPLACE TASK enforce_data_retention
    WAREHOUSE = 'etl_wh'
    SCHEDULE = 'USING CRON 0 2 * * * UTC'
AS
BEGIN
    -- Delete data older than retention period
    DELETE FROM raw_events WHERE event_date < DATEADD(year, -3, CURRENT_DATE());
    DELETE FROM staging_events WHERE event_date < DATEADD(year, -2, CURRENT_DATE());

    -- Archive old data to cold storage
    COPY INTO @archive_stage/retention/
    FROM fact_orders
    WHERE order_date < DATEADD(year, -7, CURRENT_DATE())
    FILE_FORMAT = (TYPE = PARQUET);
END;

Key Concepts Summary

ConceptDescriptionThreat MitigatedImplementation
At-Rest EncryptionEncrypt stored dataPhysical theftAES-256, cloud KMS
In-Transit EncryptionEncrypt data in motionMan-in-the-middleTLS 1.3
RBACRole-based permissionsUnauthorized accessSnowflake roles, IAM
ABACAttribute-based policiesComplex access scenariosCustom policies
Data MaskingTransform sensitive dataUnauthorized viewingDynamic masking
TokenizationReplace with tokensData exposure in breachToken vault
Audit LoggingTrack all accessAccountabilityCloudTrail, audit tables
Data RetentionAuto-expire old dataRegulatory non-complianceScheduled deletion
Right to ErasureDelete personal dataGDPR violationErasure pipelines
Breach DetectionMonitor for anomaliesUndetected breachesSIEM, alerting

Performance Metrics

Security MeasureImplementation CostOverheadRisk Reduction
At-Rest EncryptionLow (built-in)1-2%High
TLS 1.3Low (built-in)5-10msHigh
RBACMediumNegligibleHigh
Dynamic MaskingMedium5-15%Medium
TokenizationHigh10-20%Very High
Audit LoggingMedium5-10%Medium
Data RetentionLowNegligibleMedium
Erasure PipelineHighVariableVery High

10 Best Practices

  1. Enable encryption at rest by default on all cloud storage and databases
  2. Use TLS 1.3 for all data in transit β€” never allow unencrypted connections
  3. Implement least-privilege RBAC β€” users get minimum required permissions
  4. Apply dynamic data masking on PII columns β€” analysts see masked data by default
  5. Maintain audit logs for all data access β€” track who queried what and when
  6. Implement data retention policies β€” auto-expire data per regulatory requirements
  7. Build GDPR erasure pipelines β€” test the right-to-erasure workflow regularly
  8. Use tokenization for highly sensitive fields (SSN, credit card numbers)
  9. Monitor for anomalous access patterns β€” alert on unusual query volumes or patterns
  10. Conduct regular security audits β€” review access logs and permissions quarterly

  • Defense-in-depth strategy applies security at every layer (network, identity, data, application)
  • RBAC provides role-based access; ABAC adds attribute-based policy flexibility
  • Dynamic data masking protects PII while maintaining data usability
  • GDPR/CCPA compliance requires data retention, erasure, and consent management
  • Encryption at rest and in transit are baseline requirements, not optional

See Also

⭐

Premium Content

Data Security & Compliance: Protecting Data Assets

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
πŸ’ΌInterview Prep
πŸ“œCertificates
🀝Community Access

Already a member? Log in

Need Expert Data Engineering Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement