Data Quality: Purview, Great Expectations & ADF
Enterprise data quality management with automated validation, monitoring, and remediation
Data Quality Framework
Quality dimensions:
• Completeness: No missing values
• Accuracy: Values match real-world entities
• Consistency: Same data across systems
• Timeliness: Data is current and available when needed
• Uniqueness: No duplicate records
• Validity: Values conform to defined formats/ranges
Monitoring tools:
• Purview Data Quality Rules
• Great Expectations (Python)
• ADF Data Flows (Visual)
Automated quality checks:
Ingestion --> Validate --> Transform --> Load
• Null check
• Format validation
• Range validation
• Referential integrity
• Uniqueness check
• Business rule validation
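The six automated checks listed above can be sketched in plain Python. This is a minimal illustration and not the API of Purview, Great Expectations, or ADF. The field names, the `known_customers` set, and the business rule (`total` equals `quantity * unit_price`) are assumptions made up for the example.

```python
import re
from collections import Counter

EMAIL_RE = re.compile(r"^[\w\.-]+@[\w\.-]+\.\w+$")


def run_checks(rows, known_customers):
    """Return {check_name: [indexes of failing rows]} for a list of dict rows."""
    id_counts = Counter(r.get("sale_id") for r in rows)
    failures = {name: [] for name in (
        "null_check", "format_validation", "range_validation",
        "referential_integrity", "uniqueness_check", "business_rule",
    )}
    for i, r in enumerate(rows):
        qty, price, total = r.get("quantity"), r.get("unit_price"), r.get("total")
        # Null check: key columns must be present.
        if r.get("sale_id") is None or r.get("customer_id") is None:
            failures["null_check"].append(i)
        # Format validation: email must match the pattern.
        if not EMAIL_RE.match(r.get("email") or ""):
            failures["format_validation"].append(i)
        # Range validation: quantity must be within 1..10000.
        if qty is None or not 1 <= qty <= 10000:
            failures["range_validation"].append(i)
        # Referential integrity: customer must exist in the reference set.
        if r.get("customer_id") not in known_customers:
            failures["referential_integrity"].append(i)
        # Uniqueness check: sale_id must not repeat within the batch.
        if id_counts[r.get("sale_id")] > 1:
            failures["uniqueness_check"].append(i)
        # Hypothetical business rule: total must equal quantity * unit_price.
        if None not in (qty, price, total) and abs(total - qty * price) > 0.005:
            failures["business_rule"].append(i)
    return failures


rows = [
    {"sale_id": 1, "customer_id": "C1", "email": "a@example.com",
     "quantity": 2, "unit_price": 5.0, "total": 10.0},
    {"sale_id": 1, "customer_id": "C9", "email": "not-an-email",
     "quantity": 0, "unit_price": 5.0, "total": 1.0},
]
report = run_checks(rows, known_customers={"C1"})
for check, failing in report.items():
    print(check, failing)
```

Returning row indexes per check, rather than a single pass/fail flag, makes it possible to route only the failing rows to remediation.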
Great Expectations Implementation
# Great Expectations suite for sales data
# Assumes Great Expectations 1.x and a working PySpark session with access to the lake.
import great_expectations as gx
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
context = gx.get_context()
# Create the expectation suite and register it with the context
suite = context.suites.add(gx.ExpectationSuite(name="sales_data_quality"))
# Add expectations
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToNotBeNull(column="sale_id")
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeUnique(column="sale_id")
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeBetween(
        column="quantity", min_value=1, max_value=10000
    )
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToMatchRegex(
        column="email", regex=r"^[\w\.-]+@[\w\.-]+\.\w+$"
    )
)
suite.add_expectation(
    gx.expectations.ExpectTableRowCountToBeBetween(
        min_value=1000, max_value=1000000
    )
)
# Read the data with Spark and wrap the DataFrame in a batch
# (the data source, asset, and batch definition names are illustrative)
df = spark.read.parquet("abfss://raw@stdatalake001.dfs.core.windows.net/sales/")
data_source = context.data_sources.add_spark(name="sales_spark")
data_asset = data_source.add_dataframe_asset(name="sales")
batch_definition = data_asset.add_batch_definition_whole_dataframe("sales_batch")
batch = batch_definition.get_batch(batch_parameters={"dataframe": df})
# Run validation
result = batch.validate(suite)
# Generate report
print(f"Success: {result.success}")
print(f"Statistics: {result.statistics}")
ADF Data Flow Quality Rules
{
  "name": "DataQualityCheck",
  "type": "Filter",
  "typeProperties": {
    "filterExpression": {
      "value": "!isNull(sale_id) && !isNull(customer_id) && quantity > 0 && unit_price > 0",
      "type": "Expression"
    }
  }
}
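A filter keeps only the rows for which the expression is true, so failing rows disappear from the stream. The sketch below mirrors the expression above in plain Python and keeps the rejected rows as a second output, similar in spirit to a Conditional Split. The function names and sample rows are illustrative assumptions, not part of ADF.

```python
def passes_quality_check(row):
    """Plain-Python equivalent of the filter expression above."""
    return (
        row.get("sale_id") is not None
        and row.get("customer_id") is not None
        and (row.get("quantity") or 0) > 0      # a null quantity fails the check
        and (row.get("unit_price") or 0) > 0    # a null unit_price fails the check
    )


def split_rows(rows):
    """Partition rows into (valid, rejected) instead of silently dropping failures."""
    valid, rejected = [], []
    for row in rows:
        (valid if passes_quality_check(row) else rejected).append(row)
    return valid, rejected


sample = [
    {"sale_id": 1, "customer_id": 10, "quantity": 3, "unit_price": 9.99},
    {"sale_id": 2, "customer_id": None, "quantity": 1, "unit_price": 4.50},
    {"sale_id": 3, "customer_id": 11, "quantity": 0, "unit_price": 4.50},
]
valid, rejected = split_rows(sample)
print(len(valid), "valid;", len(rejected), "rejected")
```

Keeping the rejected rows makes it possible to count and inspect failures rather than only observing a smaller output.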
Pro Tip: Implement data quality checks at multiple stages: ingestion (schema validation), transformation (business rules), and loading (referential integrity).
Interview Questions
Q1: What are the six dimensions of data quality? A: 1) Completeness (no missing values), 2) Accuracy (correct values), 3) Consistency (uniform across systems), 4) Timeliness (current data), 5) Uniqueness (no duplicates), 6) Validity (format compliance).
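To make the dimensions in Q1 measurable, some of them can be expressed as ratios between 0 and 1. The sketch below is one possible scoring scheme, not a standard; the column names, the one-day freshness window, and the sample rows are assumptions. Accuracy and consistency are left out because they need a trusted reference or a second system to compare against.

```python
from datetime import datetime, timedelta, timezone


def _ratio(hits, total):
    """Return hits/total, treating an empty input as fully compliant."""
    return hits / total if total else 1.0


def completeness(rows, column):
    """Share of rows where the column is present and not None."""
    return _ratio(sum(r.get(column) is not None for r in rows), len(rows))


def uniqueness(rows, column):
    """Share of distinct values among the non-null values of the column."""
    values = [r[column] for r in rows if r.get(column) is not None]
    return _ratio(len(set(values)), len(values))


def validity(rows, column, predicate):
    """Share of non-null values that satisfy the given rule."""
    values = [r[column] for r in rows if r.get(column) is not None]
    return _ratio(sum(bool(predicate(v)) for v in values), len(values))


def timeliness(rows, column, now, max_age):
    """Share of rows whose timestamp is no older than max_age."""
    return _ratio(sum(now - r[column] <= max_age for r in rows), len(rows))


now = datetime(2024, 1, 2, tzinfo=timezone.utc)
rows = [
    {"sale_id": 1, "quantity": 5, "loaded_at": now - timedelta(hours=1)},
    {"sale_id": 2, "quantity": 0, "loaded_at": now - timedelta(hours=2)},
    {"sale_id": 2, "quantity": 7, "loaded_at": now - timedelta(days=3)},
    {"sale_id": None, "quantity": 1, "loaded_at": now - timedelta(hours=1)},
]
scores = {
    "completeness": completeness(rows, "sale_id"),
    "uniqueness": uniqueness(rows, "sale_id"),
    "validity": validity(rows, "quantity", lambda q: 1 <= q <= 10000),
    "timeliness": timeliness(rows, "loaded_at", now, timedelta(days=1)),
}
print(scores)
```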
Q2: How do you handle data quality failures in production? A: 1) Quarantine failed records, 2) Alert data owners, 3) Log failures for analysis, 4) Implement automated remediation where possible, 5) Track quality metrics over time, 6) Escalate critical failures.
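The first few steps of the Q2 answer (quarantine, log, escalate) can be sketched as follows. This is a minimal illustration under assumed rules: `find_problem`, `process_batch`, the 5% escalation threshold, and the sample rows are all hypothetical, and a real pipeline would write quarantined rows to storage and send alerts through its own channels.

```python
import logging

logger = logging.getLogger("data_quality")


def find_problem(row):
    """Return a reason string if the row fails a rule, else None (rules are illustrative)."""
    if row.get("sale_id") is None:
        return "missing sale_id"
    if (row.get("quantity") or 0) <= 0:
        return "non-positive quantity"
    return None


def process_batch(rows, escalation_threshold=0.05):
    """Quarantine failing rows with a reason, log them, and flag batches to escalate."""
    clean, quarantined = [], []
    for row in rows:
        reason = find_problem(row)
        if reason is None:
            clean.append(row)
        else:
            quarantined.append({"row": row, "reason": reason})
            logger.warning("Quarantined row %r: %s", row, reason)
    failure_rate = len(quarantined) / len(rows) if rows else 0.0
    escalate = failure_rate > escalation_threshold
    if escalate:
        logger.error("Failure rate %.1f%% exceeds threshold", failure_rate * 100)
    return {
        "clean": clean,
        "quarantined": quarantined,
        "failure_rate": failure_rate,
        "escalate": escalate,
    }


batch = [
    {"sale_id": 1, "quantity": 2},
    {"sale_id": None, "quantity": 1},
    {"sale_id": 3, "quantity": 0},
    {"sale_id": 4, "quantity": 6},
]
outcome = process_batch(batch)
print(outcome["failure_rate"], outcome["escalate"])
```

Storing the failure reason next to each quarantined row supports later analysis, and the per-batch failure rate is the kind of metric that can be tracked over time.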
Q3: What is the difference between data validation and data profiling? A: Validation checks data against predefined rules and returns pass/fail. Profiling analyzes data to understand its characteristics (distribution, patterns, anomalies). Profiling usually comes first, because its findings suggest which validation rules are worth writing.
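The contrast in Q3 can be shown on a single column. In the sketch below, `profile` only describes the values, while `validate` applies a fixed rule and returns pass/fail. Both function names, the 1 to 10000 range, and the sample values are assumptions for illustration.

```python
from statistics import mean


def profile(values):
    """Profiling: describe the data without judging it."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "nulls": len(values) - len(present),
        "distinct": len(set(present)),
        "min": min(present, default=None),
        "max": max(present, default=None),
        "mean": mean(present) if present else None,
    }


def validate(values, min_value, max_value):
    """Validation: pass only if every value is non-null and within the range."""
    return all(v is not None and min_value <= v <= max_value for v in values)


quantities = [1, 5, 5, None, 12000]
stats = profile(quantities)
ok = validate(quantities, min_value=1, max_value=10000)
print(stats)
print("valid:", ok)
```

Here the profile exposes the null and the outlier that cause validation to fail, which is the kind of finding that informs where to set rule thresholds.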