Data Security: Protecting the Crown Jewels
Data security encompasses the policies, technologies, and practices that protect data from unauthorized access, corruption, and theft throughout its lifecycle.
Why Data Security Matters
The Cost of Data Breaches:
- Average cost: $4.45M per incident (IBM, 2023)
- Regulatory fines: up to 4% of global revenue (GDPR, CCPA)
- Reputation damage and lost trust
Data Engineer Responsibility:
- Build security into every layer of the data platform
- Implement encryption, access control, and monitoring
- Ensure compliance with regulations
Key Insight: Data engineers must build security into every layer of the data platform to protect against unauthorized access, corruption, and theft.
Architecture Overview
Data security is implemented across six defense layers:
Layer 1: Network β VPC/Subnets, Firewalls, Private Link Layer 2: Identity β SSO/MFA, RBAC/ABAC, API Keys Layer 3: Data β Encryption at Rest, Dynamic Masking, Tokenization Layer 4: Application β Input Validation, SQL Injection Prevention, CSRF Protection Layer 5: Monitoring β Audit Logging, Anomaly Detection, Alerting Layer 6: Compliance β GDPR, CCPA, HIPAA
Encryption
Encryption transforms plaintext data into ciphertext using an algorithm and key. At-rest encryption protects stored data; in-transit encryption protects data moving between systems; in-use encryption (confidential computing) protects data during processing.
Encryption Overhead
- At-Rest: AES-256-GCM, overhead = ~3-5% storage, ~1-2% CPU
- In-Transit: TLS 1.3, overhead = ~5-10ms latency per connection
- In-Use: SGX/SEV, overhead = ~10-30% CPU, ~20-50% memory
- Key Management Cost: HSM = 1/key + API calls
- Encryption at Scale: 1 PB Γ 3% overhead = 3 TB additional storage
-- Snowflake: Column-level encryption
CREATE TABLE customers_encrypted (
customer_id INT PRIMARY KEY,
email VARBINARY(256) ENCRYPT, -- Always encrypted
phone VARBINARY(128) ENCRYPT,
full_name VARCHAR(200), -- Not encrypted
ssn_last_four VARCHAR(4) MASKED WITH (
FUNCTION = 'partial_mask(0, "***-**-****")'
)
);
-- BigQuery: Column-level security
CREATE TABLE customers_masked (
customer_id INT64,
email STRING,
ssn STRING
);
-- Apply column-level access policy
CREATE OR REPLACE ROW ACCESS POLICY ssn_policy
ON `project.dataset.customers_masked`
GRANT TO ('role:pii_viewer')
FILTER USING (TRUE);
CREATE OR REPLACE ROW ACCESS POLICY ssn_masking
ON `project.dataset.customers_masked`
GRANT TO ('role:regular_analyst')
FILTER USING (TRUE);
-- Redshift: Encryption at rest
CREATE TABLE customers_secure (
customer_id INT ENCODE az64,
email VARCHAR(300) ENCODE az64,
full_name VARCHAR(200) ENCODE az64
)
ENCRYPTED -- Enable at-rest encryption
SORTKEY(customer_id);
-- AWS KMS key management
-- aws kms create-key --description "Data warehouse encryption"
-- aws kms create-alias --alias-name alias/data-warehouse --target-key-id <key-id>
Access Control
Role-Based Access Control (RBAC) assigns permissions to roles, and users are assigned to roles. Users inherit all permissions of their assigned roles.
Attribute-Based Access Control (ABAC) assigns permissions based on attributes of the user, resource, and environment. Policies evaluate conditions like department, classification, and time of access.
-- Snowflake RBAC Implementation
-- Create roles
CREATE ROLE data_analyst;
CREATE ROLE data_engineer;
CREATE ROLE data_admin;
CREATE ROLE pii_viewer;
-- Grant database access
GRANT USAGE ON DATABASE analytics TO ROLE data_analyst;
GRANT USAGE ON DATABASE analytics TO ROLE data_engineer;
GRANT ALL ON DATABASE analytics TO ROLE data_admin;
-- Grant schema access
GRANT USAGE ON SCHEMA analytics.marts TO ROLE data_analyst;
GRANT ALL ON SCHEMA analytics.marts TO ROLE data_engineer;
-- Grant table access
GRANT SELECT ON ALL TABLES IN SCHEMA analytics.marts TO ROLE data_analyst;
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA analytics.marts TO ROLE data_engineer;
-- Grant PII access
GRANT SELECT ON TABLE analytics.raw.customers TO ROLE pii_viewer;
-- Assign roles to users
GRANT ROLE data_analyst TO USER analyst1;
GRANT ROLE data_engineer TO USER engineer1;
GRANT ROLE pii_viewer TO USER compliance_officer;
-- Row-level security
CREATE ROW ACCESS POLICY region_policy AS (region_col VARCHAR) RETURNS BOOLEAN ->
CASE
WHEN CURRENT_ROLE() = 'data_admin' THEN TRUE
WHEN CURRENT_ROLE() = 'data_analyst' AND region_col = 'US' THEN TRUE
WHEN CURRENT_ROLE() = 'data_engineer' THEN TRUE
ELSE FALSE
END;
-- Apply policy to table
ALTER TABLE fact_orders ADD ROW ACCESS POLICY region_policy ON (region);
-- Dynamic data masking
CREATE MASKING POLICY email_mask AS (val VARCHAR) RETURNS VARCHAR ->
CASE
WHEN CURRENT_ROLE() IN ('data_admin', 'pii_viewer') THEN val
WHEN CURRENT_ROLE() = 'data_analyst' THEN REGEXP_REPLACE(val, '.+@', '***@')
ELSE '***@***.com'
END;
ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY email_mask;
-- ABAC: Attribute-based access
CREATE NETWORK POLYGON company_vpcs AS $$
10.0.0.0/8
172.16.0.0/12
192.168.0.0/16
$$;
CREATE ACCESS POLICY abac_policy AS (resource_tags VARIANT) RETURNS BOOLEAN ->
CASE
WHEN ARRAY_CONTAINS('public'::VARIANT, resource_tags['classification']) THEN TRUE
WHEN CURRENT_ROLE() = 'data_admin' THEN TRUE
WHEN CURRENT_ROLE() = 'data_analyst'
AND ARRAY_CONTAINS('internal'::VARIANT, resource_tags['classification'])
AND CURRENT_IP() IN (SELECT value FROM TABLE(split_to_table('10.0.0.0/8,172.16.0.0/12'))) THEN TRUE
ELSE FALSE
END;
Data Masking and Tokenization
Data masking transforms sensitive data to protect it while maintaining its format and usability for non-sensitive operations. Types include static masking (permanent), dynamic masking (role-dependent), and deterministic masking (consistent across systems).
Tokenization replaces sensitive data elements with non-sensitive equivalents (tokens) that have no exploitable value. The mapping between tokens and original values is stored in a secure token vault.
# Data Masking Implementation
import hashlib
import re
from typing import Dict, Any
class DataMasker:
"""Apply various masking strategies to sensitive data."""
def __init__(self, salt: str = "default-salt"):
self.salt = salt
self.token_vault: Dict[str, str] = {}
def mask_email(self, email: str, mask_type: str = "partial") -> str:
"""Mask email address."""
if mask_type == "full":
return "redacted@example.com"
elif mask_type == "partial":
local, domain = email.split("@")
return f"{local[0]}***@{domain}"
elif mask_type == "hash":
return hashlib.sha256(f"{email}{self.salt}".encode()).hexdigest()[:16] + "@masked.com"
return email
def mask_ssn(self, ssn: str, mask_type: str = "last_four") -> str:
"""Mask Social Security Number."""
if mask_type == "last_four":
return f"***-**-{ssn[-4:]}"
elif mask_type == "full":
return "***-**-****"
elif mask_type == "hash":
return hashlib.sha256(f"{ssn}{self.salt}".encode()).hexdigest()[:11]
return ssn
def mask_phone(self, phone: str, mask_type: str = "last_four") -> str:
"""Mask phone number."""
digits = re.sub(r'\D', '', phone)
if mask_type == "last_four":
return f"({digits[:3]}) ***-{digits[-4:]}"
elif mask_type == "full":
return "(***) ***-****"
return phone
def tokenize(self, value: str, category: str) -> str:
"""Replace value with a token, storing mapping in vault."""
if value in self.token_vault:
return self.token_vault[value]
token = hashlib.sha256(f"{value}{self.salt}{category}".encode()).hexdigest()[:16]
self.token_vault[token] = value
return token
def detokenize(self, token: str) -> str:
"""Retrieve original value from token (requires vault access)."""
return self.token_vault.get(token, "TOKEN_NOT_FOUND")
def mask_dataframe(self, df, masking_rules: Dict[str, Dict]) -> Any:
"""Apply masking rules to an entire DataFrame."""
import pandas as pd
masked_df = df.copy()
for column, rules in masking_rules.items():
if column in masked_df.columns:
mask_type = rules.get("type", "partial")
if mask_type == "email":
masked_df[column] = masked_df[column].apply(
lambda x: self.mask_email(str(x), rules.get("level", "partial"))
)
elif mask_type == "ssn":
masked_df[column] = masked_df[column].apply(
lambda x: self.mask_ssn(str(x), rules.get("level", "last_four"))
)
elif mask_type == "phone":
masked_df[column] = masked_df[column].apply(
lambda x: self.mask_phone(str(x), rules.get("level", "last_four"))
)
elif mask_type == "hash":
masked_df[column] = masked_df[column].apply(
lambda x: hashlib.sha256(f"{x}{self.salt}".encode()).hexdigest()[:16]
)
return masked_df
# Usage
masker = DataMasker(salt="company-specific-salt-2025")
print(masker.mask_email("john.doe@company.com", "partial")) # j***@company.com
print(masker.mask_ssn("123-45-6789", "last_four")) # ***-**-6789
print(masker.tokenize("john.doe@company.com", "email")) # Token
print(masker.detokenize("token_value")) # Original
GDPR and CCPA Compliance
The General Data Protection Regulation (GDPR) is a European Union regulation that governs the collection, processing, and storage of personal data of EU residents. Key requirements include consent, right to erasure, data portability, and breach notification.
| Requirement | GDPR Article | Implementation | Data Engineer Role |
|---|---|---|---|
| Consent | Art. 6, 7 | Consent management platform | Track consent metadata |
| Right to Erasure | Art. 17 | Data deletion pipeline | Implement soft/hard delete |
| Data Portability | Art. 20 | Export API | Build export endpoints |
| Breach Notification | Art. 33 | Incident response | Monitor and alert |
| Data Minimization | Art. 5(1)(c) | Schema design | Only collect needed data |
| Purpose Limitation | Art. 5(1)(b) | Access controls | Enforce usage policies |
| Storage Limitation | Art. 5(1)(e) | Retention policies | Auto-expire data |
| Privacy by Design | Art. 25 | Architecture | Build privacy in |
-- GDPR: Right to Erasure implementation
-- Soft delete approach
CREATE TABLE customer_erasure_log (
erasure_id UUID PRIMARY KEY DEFAULT UUID(),
customer_id VARCHAR(50) NOT NULL,
requested_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
completed_at TIMESTAMP,
status VARCHAR(20) DEFAULT 'pending',
tables_affected INT DEFAULT 0
);
-- Erasure stored procedure
CREATE OR REPLACE PROCEDURE erase_customer_data(customer_id_param VARCHAR)
RETURNS TABLE (tables_affected INT, status VARCHAR)
AS $$
DECLARE
tables_affected INT := 0;
BEGIN
-- Log erasure request
INSERT INTO customer_erasure_log (customer_id, status)
VALUES (customer_id_param, 'in_progress');
-- Erase from all tables (soft delete)
UPDATE dim_customer SET
full_name = 'REDACTED',
email = 'REDACTED',
phone = 'REDACTED',
address_line1 = 'REDACTED',
is_current = FALSE,
valid_to = CURRENT_TIMESTAMP
WHERE customer_id = customer_id_param;
tables_affected := tables_affected + 1;
UPDATE fact_orders SET
customer_key = NULL
WHERE customer_key IN (
SELECT customer_key FROM dim_customer WHERE customer_id = customer_id_param
);
tables_affected := tables_affected + 1;
-- Update erasure log
UPDATE customer_erasure_log SET
status = 'completed',
completed_at = CURRENT_TIMESTAMP,
tables_affected = tables_affected
WHERE customer_id = customer_id_param;
RETURN TABLE(SELECT tables_affected, 'completed');
END;
$$;
-- CCPA: Opt-out of sale
CREATE TABLE customer_opt_out (
customer_id VARCHAR(50) PRIMARY KEY,
opt_out_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
opt_out_method VARCHAR(50),
is_opted_out BOOLEAN DEFAULT TRUE
);
-- Data retention policy
CREATE OR REPLACE TASK enforce_data_retention
WAREHOUSE = 'etl_wh'
SCHEDULE = 'USING CRON 0 2 * * * UTC'
AS
BEGIN
-- Delete data older than retention period
DELETE FROM raw_events WHERE event_date < DATEADD(year, -3, CURRENT_DATE());
DELETE FROM staging_events WHERE event_date < DATEADD(year, -2, CURRENT_DATE());
-- Archive old data to cold storage
COPY INTO @archive_stage/retention/
FROM fact_orders
WHERE order_date < DATEADD(year, -7, CURRENT_DATE())
FILE_FORMAT = (TYPE = PARQUET);
END;
Key Concepts Summary
| Concept | Description | Threat Mitigated | Implementation |
|---|---|---|---|
| At-Rest Encryption | Encrypt stored data | Physical theft | AES-256, cloud KMS |
| In-Transit Encryption | Encrypt data in motion | Man-in-the-middle | TLS 1.3 |
| RBAC | Role-based permissions | Unauthorized access | Snowflake roles, IAM |
| ABAC | Attribute-based policies | Complex access scenarios | Custom policies |
| Data Masking | Transform sensitive data | Unauthorized viewing | Dynamic masking |
| Tokenization | Replace with tokens | Data exposure in breach | Token vault |
| Audit Logging | Track all access | Accountability | CloudTrail, audit tables |
| Data Retention | Auto-expire old data | Regulatory non-compliance | Scheduled deletion |
| Right to Erasure | Delete personal data | GDPR violation | Erasure pipelines |
| Breach Detection | Monitor for anomalies | Undetected breaches | SIEM, alerting |
Performance Metrics
| Security Measure | Implementation Cost | Overhead | Risk Reduction |
|---|---|---|---|
| At-Rest Encryption | Low (built-in) | 1-2% | High |
| TLS 1.3 | Low (built-in) | 5-10ms | High |
| RBAC | Medium | Negligible | High |
| Dynamic Masking | Medium | 5-15% | Medium |
| Tokenization | High | 10-20% | Very High |
| Audit Logging | Medium | 5-10% | Medium |
| Data Retention | Low | Negligible | Medium |
| Erasure Pipeline | High | Variable | Very High |
10 Best Practices
- Enable encryption at rest by default on all cloud storage and databases
- Use TLS 1.3 for all data in transit β never allow unencrypted connections
- Implement least-privilege RBAC β users get minimum required permissions
- Apply dynamic data masking on PII columns β analysts see masked data by default
- Maintain audit logs for all data access β track who queried what and when
- Implement data retention policies β auto-expire data per regulatory requirements
- Build GDPR erasure pipelines β test the right-to-erasure workflow regularly
- Use tokenization for highly sensitive fields (SSN, credit card numbers)
- Monitor for anomalous access patterns β alert on unusual query volumes or patterns
- Conduct regular security audits β review access logs and permissions quarterly
- Defense-in-depth strategy applies security at every layer (network, identity, data, application)
- RBAC provides role-based access; ABAC adds attribute-based policy flexibility
- Dynamic data masking protects PII while maintaining data usability
- GDPR/CCPA compliance requires data retention, erasure, and consent management
- Encryption at rest and in transit are baseline requirements, not optional
See Also
- Data Governance & Catalog β Metadata and lineage for compliance
- Data Lakehouse β Unity Catalog row and column-level security
- Snowflake Fundamentals β Snowflake security features
- Infrastructure as Code β Security policies as code
- CI/CD for Data Pipelines β Security scanning in CI
- Cost Optimization β Security-related cost considerations