Data Quality on AWS

Master Glue DataBrew, validation patterns, and data quality frameworks.

Module: AWS Data Engineering • Topic 24 of 65 • Premium Content

Data Quality Framework

Architecture Diagram
┌──────────────────────────────────────────────────────────────────────┐
│                        DATA QUALITY FRAMEWORK                        │
│                                                                      │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │  VALIDATION LAYERS                                             │  │
│  │                                                                │  │
│  │  1. COMPLETENESS: Are all expected records present?            │  │
│  │  2. ACCURACY: Do values match expected ranges?                 │  │
│  │  3. CONSISTENCY: Are values consistent across sources?         │  │
│  │  4. TIMELINESS: Is data arriving within SLA?                   │  │
│  │  5. UNIQUENESS: Are there duplicate records?                   │  │
│  │  6. VALIDITY: Do values conform to formats/rules?              │  │
│  └────────────────────────────────────────────────────────────────┘  │
│                                                                      │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │  AWS SERVICES                                                  │  │
│  │                                                                │  │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐          │  │
│  │  │ Glue         │  │ Lambda       │  │ CloudWatch   │          │  │
│  │  │ DataBrew     │  │ Validation   │  │ Metrics      │          │  │
│  │  └──────────────┘  └──────────────┘  └──────────────┘          │  │
│  └────────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────────┘
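To make the validation layers concrete, here is a minimal sketch that scores a small batch of records against five of the six dimensions in plain Python. The record keys (id, amount, sale_date, loaded_at), the one-hour SLA, and the function name quality_report are assumptions chosen for illustration. Consistency is left out because it needs a second source to compare against.

```python
import re
from datetime import datetime, timedelta, timezone

# Hypothetical format rule: sale_date must look like YYYY-MM-DD
DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def quality_report(records, expected_count, now, sla=timedelta(hours=1)):
    """Return a score between 0.0 and 1.0 for each quality dimension."""
    total = len(records)
    if total == 0 or expected_count <= 0:
        raise ValueError("need at least one record and a positive expected_count")

    # Accuracy: amount is present and inside the expected range (> 0)
    accurate = sum(1 for r in records if r.get("amount") is not None and r["amount"] > 0)
    # Uniqueness: distinct ids relative to the number of rows
    unique_ids = len({r["id"] for r in records})
    # Validity: sale_date conforms to the expected format
    valid_dates = sum(1 for r in records if DATE_PATTERN.match(r.get("sale_date") or ""))
    # Timeliness: the row was loaded within the SLA window
    timely = sum(1 for r in records if now - r["loaded_at"] <= sla)

    return {
        # Completeness: share of the expected records that arrived
        "completeness": min(total / expected_count, 1.0),
        "accuracy": accurate / total,
        "uniqueness": unique_ids / total,
        "validity": valid_dates / total,
        "timeliness": timely / total,
    }


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
sample = [
    {"id": 1, "amount": 10.0, "sale_date": "2024-01-01", "loaded_at": now - timedelta(minutes=5)},
    {"id": 2, "amount": -3.0, "sale_date": "2024-01-01", "loaded_at": now - timedelta(minutes=5)},
    {"id": 2, "amount": 7.5, "sale_date": "01/01/2024", "loaded_at": now - timedelta(hours=3)},
    {"id": 3, "amount": None, "sale_date": "2024-01-01", "loaded_at": now - timedelta(minutes=30)},
]
report = quality_report(sample, expected_count=5, now=now)
```

Each score is a simple ratio, so the same numbers can be compared against thresholds or published as metrics.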

DataBrew Profile Job

import boto3

databrew = boto3.client('databrew')

# Create a profile job that also evaluates a data quality ruleset.
# The ruleset must already exist and target the same dataset.
response = databrew.create_profile_job(
    DatasetName='sales-data',
    Name='sales-quality-profile',
    RoleArn='arn:aws:iam::123456789012:role/DataBrewRole',
    OutputLocation={
        'Bucket': 'data-quality-reports',
        'Key': 'profiles/sales/'
    },
    # ValidationConfigurations takes a list; CHECK_ALL is the supported mode
    ValidationConfigurations=[
        {
            'RulesetArn': 'arn:aws:databrew:us-east-1:123456789012:ruleset/sales-validation-rules',
            'ValidationMode': 'CHECK_ALL'
        }
    ],
    Tags={'Team': 'data-engineering'}
)

# Start the profile job; the response carries the RunId of the new run
run = databrew.start_job_run(
    Name='sales-quality-profile'
)
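A profile job runs asynchronously, so a pipeline usually waits for the run to finish before reading the report. The sketch below polls describe_job_run until the run reaches a terminal state. The helper name wait_for_job_run, the polling interval, and the poll limit are assumptions; the client is passed in (for example boto3.client('databrew')) so the helper can be exercised without AWS access.

```python
import time

# States after which a DataBrew job run will not change any further
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"}


def wait_for_job_run(client, job_name, run_id, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll describe_job_run until the run finishes, then return its state."""
    for _ in range(max_polls):
        state = client.describe_job_run(Name=job_name, RunId=run_id)["State"]
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
    raise TimeoutError(f"{job_name} run {run_id} did not finish after {max_polls} polls")
```

With the job above, this could be called as wait_for_job_run(databrew, 'sales-quality-profile', run_id), where run_id is the RunId value returned by start_job_run.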

Validation Rules

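import boto3

databrew = boto3.client('databrew')

# Create a ruleset bound to the dataset it validates (TargetArn is required).
# Check expressions reference columns and values through ':'-prefixed
# substitution variables; column names are wrapped in backticks.
# The condition keywords used here (is_not_missing, is_between) are
# assumptions; confirm them against the DataBrew "Available checks" reference.
databrew.create_ruleset(
    Name='sales-validation-rules',
    Description='Validation rules for sales data',
    TargetArn='arn:aws:databrew:us-east-1:123456789012:dataset/sales-data',
    Rules=[
        {
            'Name': 'not-null-amount',
            'CheckExpression': ':col1 is_not_missing',
            'SubstitutionMap': {':col1': '`amount`'},
            # Pass when at least 99% of rows satisfy the check
            'Threshold': {
                'Value': 99.0,
                'Type': 'GREATER_THAN_OR_EQUAL',
                'Unit': 'PERCENTAGE'
            }
        },
        {
            'Name': 'positive-amount',
            'CheckExpression': ':col1 > :val1',
            'SubstitutionMap': {':col1': '`amount`', ':val1': '0'},
            'Threshold': {
                'Value': 99.5,
                'Type': 'GREATER_THAN_OR_EQUAL',
                'Unit': 'PERCENTAGE'
            }
        },
        {
            'Name': 'valid-date',
            'CheckExpression': ':col1 is_between :start and :end',
            'SubstitutionMap': {
                ':col1': '`sale_date`',
                ':start': '2020-01-01',
                ':end': '2025-12-31'
            },
            # Threshold Type has no EQUAL option; >= 100% is equivalent
            'Threshold': {
                'Value': 100.0,
                'Type': 'GREATER_THAN_OR_EQUAL',
                'Unit': 'PERCENTAGE'
            }
        }
    ]
)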
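To show what a pass/fail threshold means, the sketch below evaluates one rule locally: it counts the rows that satisfy a check and compares the result with a threshold described by Value, Type (the comparison), and Unit (COUNT or PERCENTAGE). This is an illustration of the idea only, not DataBrew's own evaluation logic, and the function name rule_passes is hypothetical.

```python
import operator

# Map threshold comparison names to Python comparison functions
COMPARATORS = {
    "GREATER_THAN_OR_EQUAL": operator.ge,
    "LESS_THAN_OR_EQUAL": operator.le,
    "GREATER_THAN": operator.gt,
    "LESS_THAN": operator.lt,
}


def rule_passes(rows, check, threshold):
    """Return True when enough rows satisfy `check` to meet the threshold."""
    passed = sum(1 for row in rows if check(row))
    if threshold["Unit"] == "PERCENTAGE":
        # Share of passing rows, expressed on a 0-100 scale
        observed = 100.0 * passed / len(rows) if rows else 0.0
    else:
        # COUNT: compare the raw number of passing rows
        observed = passed
    return COMPARATORS[threshold["Type"]](observed, threshold["Value"])


rows = [{"amount": 5}, {"amount": 0}, {"amount": 3}, {"amount": 8}]
is_positive = lambda row: row["amount"] > 0

# 3 of 4 rows are positive (75%), so a 99.5% threshold fails
strict = rule_passes(rows, is_positive,
                     {"Value": 99.5, "Type": "GREATER_THAN_OR_EQUAL", "Unit": "PERCENTAGE"})
lenient = rule_passes(rows, is_positive,
                      {"Value": 75.0, "Type": "GREATER_THAN_OR_EQUAL", "Unit": "PERCENTAGE"})
```

A threshold below 100% lets a rule tolerate a small share of bad rows instead of failing the whole dataset on a single outlier.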

Interview Q&A

Q1: What are the 6 dimensions of data quality?

Answer: Completeness, Accuracy, Consistency, Timeliness, Uniqueness, Validity. Each measures a different aspect of data trustworthiness.

Q2: How does DataBrew help with data quality?

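Answer: DataBrew provides data profiling, built-in transformations, and validation rules. A profile job computes column statistics such as missing values, duplicates, and value distributions, and when a ruleset is attached it also reports which rules passed or failed.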

Q3: How do you implement data quality in Glue ETL?

Answer: Add validation steps in ETL jobs, use DataBrew rulesets, implement Lambda validation functions, and set CloudWatch alarms for quality metrics.
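The Lambda and CloudWatch part of this answer can be sketched as follows: a handler validates the records in its event and publishes the pass rate as a custom metric with put_metric_data, which a CloudWatch alarm can then watch. The event shape (dataset, records), the DataQuality namespace, the PassRate metric name, and the amount check are all assumptions for illustration. The CloudWatch client is injectable so the helpers can run without AWS credentials.

```python
def validate_records(records):
    """Return (passed, failed) counts for a simple amount check."""
    passed = sum(1 for r in records if r.get("amount") is not None and r["amount"] > 0)
    return passed, len(records) - passed


def publish_quality_metric(cloudwatch, dataset, passed, failed):
    """Publish the pass rate (0-100) as a custom CloudWatch metric."""
    total = passed + failed
    pass_rate = 100.0 * passed / total if total else 0.0
    cloudwatch.put_metric_data(
        Namespace="DataQuality",  # hypothetical namespace
        MetricData=[
            {
                "MetricName": "PassRate",  # hypothetical metric name
                "Dimensions": [{"Name": "Dataset", "Value": dataset}],
                "Value": pass_rate,
                "Unit": "Percent",
            }
        ],
    )
    return pass_rate


def lambda_handler(event, context, cloudwatch=None):
    """Validate the records in the event and report the pass rate."""
    if cloudwatch is None:
        # Imported lazily so the helpers above can be tested without boto3
        import boto3
        cloudwatch = boto3.client("cloudwatch")
    passed, failed = validate_records(event.get("records", []))
    pass_rate = publish_quality_metric(cloudwatch, event.get("dataset", "unknown"), passed, failed)
    return {"passed": passed, "failed": failed, "pass_rate": pass_rate}
```

An alarm on the PassRate metric with a "less than" condition would then notify the team whenever quality drops below the agreed level.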

Summary

  • Dimensions: Completeness, Accuracy, Consistency, Timeliness, Uniqueness, Validity
  • DataBrew: Profiling, validation rules, built-in transformations
  • Rulesets: Define validation expressions with pass/fail thresholds
  • Monitoring: CloudWatch metrics for quality scores and alerts
  • Best Practice: Validate at ingestion, transformation, and serving layers
