Data Quality: Purview, Great Expectations & ADF

Enterprise data quality management with automated validation, monitoring, and remediation

Data Quality Framework

Architecture Diagram
┌─────────────────────────────────────────────────────────────────────┐
│                    DATA QUALITY FRAMEWORK                           │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  QUALITY DIMENSIONS                                                 │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │ • Completeness: No missing values                            │   │
│  │ • Accuracy: Values match real-world entities                 │   │
│  │ • Consistency: Same data across systems                      │   │
│  │ • Timeliness: Data is current and available when needed      │   │
│  │ • Uniqueness: No duplicate records                           │   │
│  │ • Validity: Values conform to defined formats/ranges         │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  MONITORING TOOLS:                                                  │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │                                                              │   │
│  │  ┌──────────┐  ┌──────────────┐  ┌──────────────┐            │   │
│  │  │ Purview  │  │ Great        │  │ ADF Data     │            │   │
│  │  │ Data     │  │ Expectations │  │ Flows        │            │   │
│  │  │ Quality  │  │ (Python)     │  │ (Visual)     │            │   │
│  │  │ Rules    │  │              │  │              │            │   │
│  │  └──────────┘  └──────────────┘  └──────────────┘            │   │
│  │                                                              │   │
│  │  AUTOMATED QUALITY CHECKS:                                   │   │
│  │  ┌───────────────────────────────────────────────────────┐    │   │
│  │  │ Ingestion ──> Validate ──> Transform ──> Load         │    │   │
│  │  │              │                                        │    │   │
│  │  │              ├─ Null check                            │    │   │
│  │  │              ├─ Format validation                     │    │   │
│  │  │              ├─ Range validation                      │    │   │
│  │  │              ├─ Referential integrity                 │    │   │
│  │  │              ├─ Uniqueness check                      │    │   │
│  │  │              └─ Business rule validation              │    │   │
│  │  └───────────────────────────────────────────────────────┘    │   │
│  └──────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────┘
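
The Validate stage in the diagram can be sketched in plain Python to show how the individual checks combine. This is a minimal illustration, not a production validator: the field names (`sale_id`, `email`, `quantity`), the range limits, and the `validate_batch` helper are assumptions chosen to match the sales example used elsewhere on this page.

```python
import re
from typing import Callable

# Assumed email pattern and range limits, for illustration only.
EMAIL_RE = re.compile(r"^[\w\.-]+@[\w\.-]+\.\w+$")

# Row-level checks: each takes one record and returns True when it passes.
CHECKS: dict[str, Callable[[dict], bool]] = {
    "null_check": lambda r: r.get("sale_id") is not None,
    "format_validation": lambda r: bool(EMAIL_RE.match(r.get("email") or "")),
    "range_validation": lambda r: isinstance(r.get("quantity"), int)
    and 1 <= r["quantity"] <= 10000,
}


def validate_batch(rows: list[dict]) -> dict[int, list[str]]:
    """Return {row_index: [names of failed checks]} for rows that fail any check."""
    failures: dict[int, list[str]] = {}
    seen_ids = set()
    for index, row in enumerate(rows):
        failed = [name for name, check in CHECKS.items() if not check(row)]
        # Uniqueness is a batch-level check, so it needs state across rows.
        sale_id = row.get("sale_id")
        if sale_id is not None:
            if sale_id in seen_ids:
                failed.append("uniqueness_check")
            seen_ids.add(sale_id)
        if failed:
            failures[index] = failed
    return failures


# Hypothetical sample records.
rows = [
    {"sale_id": 1, "email": "a@example.com", "quantity": 5},
    {"sale_id": 1, "email": "b@example.com", "quantity": 2},    # duplicate sale_id
    {"sale_id": None, "email": "not-an-email", "quantity": 0},  # fails three checks
]
failures = validate_batch(rows)
print(failures)
```

Returning the names of the failed checks, rather than a single boolean, keeps enough detail to decide later whether a record can be repaired or must be rejected.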

Great Expectations Implementation

# Great Expectations suite for sales data
# Written against the GX Core 1.x fluent API; 0.x releases use different entry points.
import great_expectations as gx
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
context = gx.get_context()

# Create the expectation suite and register it with the context
suite = context.suites.add(gx.ExpectationSuite(name="sales_data_quality"))

# Add expectations
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToNotBeNull(column="sale_id")
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeUnique(column="sale_id")
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeBetween(
        column="quantity", min_value=1, max_value=10000
    )
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToMatchRegex(
        column="email", regex=r"^[\w\.-]+@[\w\.-]+\.\w+$"
    )
)
suite.add_expectation(
    gx.expectations.ExpectTableRowCountToBeBetween(
        min_value=1000, max_value=1000000
    )
)

# Load the raw sales data as a Spark DataFrame
df = spark.read.parquet("abfss://raw@stdatalake001.dfs.core.windows.net/sales/")

# Register the DataFrame with GX and fetch it as a batch
# (the data source, asset, and batch definition names are illustrative)
data_source = context.data_sources.add_spark(name="sales_spark")
data_asset = data_source.add_dataframe_asset(name="sales")
batch_definition = data_asset.add_batch_definition_whole_dataframe("sales_batch")
batch = batch_definition.get_batch(batch_parameters={"dataframe": df})

# Run validation
result = batch.validate(suite)

# Generate report
print(f"Success: {result.success}")
print(f"Statistics: {result.statistics}")
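
Printing `result.success` only says whether the whole suite passed. To see which expectations failed, one option is to walk the serialized result, for example the dictionary returned by `result.to_json_dict()`. The sketch below assumes that dictionary has a `results` list whose entries carry `success`, `expectation_config`, and `result` keys; the `summarize_failures` helper and the sample dictionary are hypothetical and hand-written for illustration, not real validation output.

```python
def summarize_failures(result: dict) -> list[str]:
    """Build one summary line per failed expectation in a serialized validation result."""
    lines = []
    for item in result.get("results", []):
        if item.get("success"):
            continue
        config = item.get("expectation_config", {})
        # Assumption: GX 1.x stores the expectation name under "type",
        # while older releases used "expectation_type".
        name = config.get("type") or config.get("expectation_type", "unknown")
        # Table-level expectations have no "column" argument.
        column = config.get("kwargs", {}).get("column", "<table>")
        detail = item.get("result", {})
        lines.append(f"{name} on {column}: {detail}")
    return lines


# Hand-written sample shaped like a serialized validation result (illustrative only).
sample_result = {
    "success": False,
    "results": [
        {
            "success": True,
            "expectation_config": {
                "type": "expect_column_values_to_not_be_null",
                "kwargs": {"column": "sale_id"},
            },
            "result": {"unexpected_count": 0},
        },
        {
            "success": False,
            "expectation_config": {
                "type": "expect_column_values_to_be_between",
                "kwargs": {"column": "quantity", "min_value": 1, "max_value": 10000},
            },
            "result": {"unexpected_count": 3},
        },
        {
            "success": False,
            "expectation_config": {
                "type": "expect_table_row_count_to_be_between",
                "kwargs": {"min_value": 1000, "max_value": 1000000},
            },
            "result": {"observed_value": 12},
        },
    ],
}

summary = summarize_failures(sample_result)
for line in summary:
    print(line)
```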

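ADF Data Flow Quality Rules

The snippet below is a simplified, illustrative representation of a Filter transformation named DataQualityCheck. In a mapping data flow, the same condition is written in data flow script as a filter(...) expression on the incoming stream.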

{
  "name": "DataQualityCheck",
  "type": "Filter",
  "typeProperties": {
    "filterExpression": {
      "value": "!isNull(sale_id) && !isNull(customer_id) && quantity > 0 && unit_price > 0",
      "type": "Expression"
    }
  }
}
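
A Filter transformation keeps only the rows that satisfy the expression, so failing rows disappear from the output. When those rows need to be inspected later, an alternative is to split the stream instead of filtering it. The following Python sketch mirrors the filter expression above and routes failing rows to a quarantine list; `passes_quality_check`, `split_rows`, and the sample rows are hypothetical names and data used only for illustration.

```python
def passes_quality_check(row: dict) -> bool:
    """Python mirror of:
    !isNull(sale_id) && !isNull(customer_id) && quantity > 0 && unit_price > 0
    """
    return (
        row.get("sale_id") is not None
        and row.get("customer_id") is not None
        # A missing quantity or price is treated as failing, like a null comparison.
        and (row.get("quantity") or 0) > 0
        and (row.get("unit_price") or 0) > 0
    )


def split_rows(rows: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate rows into (valid, quarantined) instead of dropping failures."""
    valid, quarantined = [], []
    for row in rows:
        (valid if passes_quality_check(row) else quarantined).append(row)
    return valid, quarantined


# Hypothetical sample rows.
rows = [
    {"sale_id": 1, "customer_id": 10, "quantity": 2, "unit_price": 9.5},
    {"sale_id": 2, "customer_id": None, "quantity": 1, "unit_price": 4.0},
    {"sale_id": 3, "customer_id": 11, "quantity": 0, "unit_price": 4.0},
]
valid, quarantined = split_rows(rows)
print(f"valid={len(valid)} quarantined={len(quarantined)}")
```

In a data flow, a comparable result is usually achieved with a Conditional Split transformation that sends the failing branch to a separate sink.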

Pro Tip: Implement data quality checks at multiple stages: ingestion (schema validation), transformation (business rules), and loading (referential integrity).
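
The staged approach in the tip can be sketched as three small check functions run in pipeline order. Everything here is an assumption made for illustration: the expected schema, the business rule (`total` must equal `quantity * unit_price`), the set of known customer IDs, and the `run_stages` helper.

```python
# Assumed schema for an incoming sales record.
EXPECTED_SCHEMA = {"sale_id": int, "customer_id": int, "quantity": int, "unit_price": float}


def check_schema(row: dict) -> list[str]:
    """Ingestion stage: every expected column is present with the expected type."""
    errors = []
    for column, expected_type in EXPECTED_SCHEMA.items():
        if column not in row:
            errors.append(f"missing column: {column}")
        elif not isinstance(row[column], expected_type):
            errors.append(f"wrong type for {column}")
    return errors


def check_business_rules(row: dict) -> list[str]:
    """Transformation stage: hypothetical rule that total = quantity * unit_price."""
    expected = round(row["quantity"] * row["unit_price"], 2)
    if row.get("total") != expected:
        return [f"total {row.get('total')} != {expected}"]
    return []


def check_referential_integrity(row: dict, known_customer_ids: set) -> list[str]:
    """Loading stage: customer_id must exist in the customer dimension."""
    if row["customer_id"] not in known_customer_ids:
        return [f"unknown customer_id: {row['customer_id']}"]
    return []


def run_stages(row: dict, known_customer_ids: set) -> tuple:
    """Run the stage checks in order and stop at the first stage that reports errors."""
    stages = [
        ("ingestion", lambda: check_schema(row)),
        ("transformation", lambda: check_business_rules(row)),
        ("loading", lambda: check_referential_integrity(row, known_customer_ids)),
    ]
    for stage, check in stages:
        errors = check()
        if errors:
            return stage, errors
    return None, []


good_row = {"sale_id": 1, "customer_id": 42, "quantity": 4, "unit_price": 2.5, "total": 10.0}
print(run_stages(good_row, {42}))
print(run_stages({**good_row, "customer_id": 99}, {42}))
```

Stopping at the first failing stage means later checks can rely on earlier guarantees; for example, the business rule can assume `quantity` and `unit_price` exist because the schema check has already passed.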

Interview Questions

Q1: What are the six dimensions of data quality?
A: 1) Completeness (no missing values), 2) Accuracy (correct values), 3) Consistency (uniform across systems), 4) Timeliness (current data), 5) Uniqueness (no duplicates), 6) Validity (format compliance).

Q2: How do you handle data quality failures in production?
A: 1) Quarantine failed records, 2) Alert data owners, 3) Log failures for analysis, 4) Implement automated remediation where possible, 5) Track quality metrics over time, 6) Escalate critical failures.

Q3: What is the difference between data validation and data profiling?
A: Validation checks data against predefined rules (pass/fail). Profiling analyzes data to understand its characteristics (distribution, patterns, anomalies). Both are essential for maintaining data quality.
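
The contrast in Q3 can be made concrete with a short sketch using only the Python standard library. `profile_column` describes a column without judging it, while `validate_column` applies a fixed rule and returns pass or fail. Both function names, the sample values, and the 1 to 10000 range are assumptions for illustration.

```python
from collections import Counter
from statistics import mean


def profile_column(values: list) -> dict:
    """Profiling: describe what the data looks like, with no pass/fail verdict."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "null_count": len(values) - len(present),
        "distinct_count": len(set(present)),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "mean": mean(present) if present else None,
        "most_common": Counter(present).most_common(1),
    }


def validate_column(values: list, min_value: int, max_value: int) -> bool:
    """Validation: apply a predefined rule and return pass/fail."""
    return all(v is not None and min_value <= v <= max_value for v in values)


# Hypothetical quantity column with one null and one outlier.
quantities = [1, 2, 2, None, 50000]
profile = profile_column(quantities)
is_valid = validate_column(quantities, min_value=1, max_value=10000)
print(profile)
print(f"valid: {is_valid}")
```

A profile like this is often the starting point for choosing validation thresholds, since it shows the actual range and null rate before any rule is fixed.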
