Data Governance & Catalog: Managing Data at Scale

Data Governance: Trust, discoverability, and compliance at scale

Data governance is the collection of processes, policies, and standards that ensure data is managed as a strategic asset.

Why Data Governance Matters

Problems Without Governance:

Duplicated datasets with conflicting definitions
Unknown data lineage making compliance impossible
Stale data eroding trust
Security breaches from ungoverned access

Key Insight: Data governance ensures data is trustworthy, discoverable, secure, and compliant.

Architecture Overview

The Data Governance architecture has three interconnected pillars:

Data Catalog: Central repository that provides metadata storage, search and discovery, lineage tracking, and quality monitoring.

Governance Components: Manage metadata (technical, business, operational), access control (RBAC, ABAC, masking), quality management (rules, alerts, scoring), and lineage tracking (column, table, and pipeline level).

Policy Engine: Enforces naming conventions, schema standards, SLA requirements, compliance (GDPR, CCPA), retention policies, and classification rules.
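As a rough illustration of what a policy engine does, the sketch below evaluates an asset record against a handful of rules. The rule names, the snake_case pattern, and the retention limits are assumptions made for this example, not values prescribed by any regulation or tool.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical policy parameters; patterns and limits are illustrative only.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9_]*$")
ALLOWED_CLASSIFICATIONS = {"public", "internal", "confidential", "restricted"}
MAX_RETENTION_DAYS = {"restricted": 365, "confidential": 730}


@dataclass
class Policy:
    name: str
    check: Callable[[Dict], bool]  # returns True when the asset complies


POLICIES: List[Policy] = [
    Policy("naming_convention", lambda a: bool(SNAKE_CASE.match(a["table_name"]))),
    Policy("has_owner", lambda a: bool(a.get("owner_team"))),
    Policy("valid_classification",
           lambda a: a.get("classification") in ALLOWED_CLASSIFICATIONS),
    Policy("retention_limit",
           lambda a: a.get("retention_days", 0)
           <= MAX_RETENTION_DAYS.get(a.get("classification"), float("inf"))),
]


def evaluate(asset: Dict) -> List[str]:
    """Return the names of all policies the asset violates."""
    return [p.name for p in POLICIES if not p.check(asset)]


compliant = {"table_name": "fact_orders", "owner_team": "data-platform",
             "classification": "internal", "retention_days": 1095}
violating = {"table_name": "FactOrders", "owner_team": "",
             "classification": "restricted", "retention_days": 900}

print(evaluate(compliant))  # []
print(evaluate(violating))  # ['naming_convention', 'has_owner', 'retention_limit']
```

Keeping each rule as a small named predicate makes it straightforward to report exactly which policy failed, which is what a violation count metric would be built on.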

Metadata Management

Metadata is data about data. The three core types are technical metadata (schema, types, size), operational metadata (freshness, quality, lineage), and business metadata (definitions, owners, glossary terms). Social and administrative metadata complete the picture, as the table below shows.

Metadata Type	Content	Consumer	Update Frequency
Technical	Schema, types, partitions	Engineers	On schema change
Operational	Freshness, quality, lineage	Engineers	Real-time
Business	Definitions, owners, glossary	Analysts	Weekly
Social	Ratings, reviews, usage	Everyone	Continuous
Administrative	Access logs, cost, retention	Governance	Daily

# Metadata Collection Service
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional
import hashlib
import json

@dataclass
class TableMetadata:
    """Comprehensive metadata for a data asset."""
    table_id: str
    database: str
    schema_name: str
    table_name: str
    description: str = ""
    owner_team: str = ""
    owner_email: str = ""
    domain: str = ""
    classification: str = "internal"  # public, internal, confidential, restricted
    tags: List[str] = field(default_factory=list)
    created_at: datetime = field(default_factory=datetime.now)
    updated_at: datetime = field(default_factory=datetime.now)
    row_count: int = 0
    size_bytes: int = 0
    last_updated: Optional[datetime] = None
    freshness_sla_hours: int = 24
    quality_score: float = 0.0
    lineage_upstream: List[str] = field(default_factory=list)
    lineage_downstream: List[str] = field(default_factory=list)

    def to_dict(self) -> Dict:
        return {
            "table_id": self.table_id,
            "fqn": f"{self.database}.{self.schema_name}.{self.table_name}",
            "description": self.description,
            "owner": {"team": self.owner_team, "email": self.owner_email},
            "domain": self.domain,
            "classification": self.classification,
            "tags": self.tags,
            "statistics": {
                "row_count": self.row_count,
                "size_bytes": self.size_bytes,
                "last_updated": self.last_updated.isoformat() if self.last_updated else None
            },
            "quality": {"score": self.quality_score},
            "lineage": {
                "upstream": self.lineage_upstream,
                "downstream": self.lineage_downstream
            }
        }

    def compute_hash(self) -> str:
        """Compute hash for change detection."""
        content = json.dumps(self.to_dict(), sort_keys=True, default=str)
        return hashlib.sha256(content.encode()).hexdigest()

class MetadataCatalog:
    """Central metadata catalog for all data assets."""

    def __init__(self):
        self.assets: Dict[str, TableMetadata] = {}
        self.lineage_edges: List[Dict] = []

    def register_asset(self, metadata: TableMetadata):
        """Register a data asset in the catalog."""
        self.assets[metadata.table_id] = metadata
        self._update_lineage(metadata)

    def search(self, query: str, domain: Optional[str] = None) -> List[TableMetadata]:
        """Search assets by description, name, or tags (case-insensitive)."""
        q = query.lower()
        results = []
        for asset in self.assets.values():
            if domain is not None and asset.domain != domain:
                continue
            # Substring match on description and name; exact match on tags
            if q in asset.description.lower() or \
               q in asset.table_name.lower() or \
               q in [t.lower() for t in asset.tags]:
                results.append(asset)
        return results

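    def get_lineage(self, table_id: str, direction: str = "both") -> Dict:
        """Get lineage for an asset; direction is "upstream", "downstream", or "both"."""
        asset = self.assets.get(table_id)
        if not asset:
            return {"error": "Asset not found"}

        result = {
            "table_id": table_id,
            "depth": self._calculate_lineage_depth(table_id)
        }
        # Include only the requested direction(s)
        if direction in ("upstream", "both"):
            result["upstream"] = asset.lineage_upstream
        if direction in ("downstream", "both"):
            result["downstream"] = asset.lineage_downstream
        return result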

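    def _update_lineage(self, metadata: TableMetadata):
        """Update lineage graph with new edges, skipping ones already recorded."""
        for upstream in metadata.lineage_upstream:
            edge = {
                "source": upstream,
                "target": metadata.table_id,
                "type": "data_flow"
            }
            # Re-registering an asset must not duplicate its edges
            if edge not in self.lineage_edges:
                self.lineage_edges.append(edge)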

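    def _calculate_lineage_depth(self, table_id: str) -> int:
        """Calculate the longest upstream path (in edges) that ends at table_id."""
        def longest(node: str, on_path: frozenset) -> int:
            best = 0
            for edge in self.lineage_edges:
                source = edge["source"]
                # Skip nodes already on the current path to guard against cycles
                if edge["target"] == node and source not in on_path:
                    best = max(best, 1 + longest(source, on_path | {source}))
            return best

        # A single shared visited set would stop at the first (possibly shorter)
        # path to a node, so each path tracks its own visited nodes instead.
        return longest(table_id, frozenset({table_id}))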

# Usage
catalog = MetadataCatalog()

catalog.register_asset(TableMetadata(
    table_id="orders_001",
    database="analytics",
    schema_name="marts",
    table_name="fact_orders",
    description="Fact table containing all customer orders",
    owner_team="data-platform",
    owner_email="data-platform@company.com",
    domain="sales",
    classification="internal",
    tags=["finance", "daily", "production"],
    row_count=15000000,
    size_bytes=2_500_000_000,
    quality_score=0.998,
    lineage_upstream=["staging.stg_shopify_orders", "staging.stg_stripe_payments"],
    lineage_downstream=["mart.revenue_dashboard", "mart.customer_360"]
))

results = catalog.search("orders", domain="sales")
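The `freshness_sla_hours` and `last_updated` fields above are stored but never checked. A minimal sketch of how a freshness check might use them follows; the function name `is_stale` and the choice to treat a never-updated asset as stale are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_stale(last_updated: Optional[datetime],
             freshness_sla_hours: int,
             now: Optional[datetime] = None) -> bool:
    """An asset is stale if it was never updated or its age exceeds the SLA."""
    if last_updated is None:
        return True
    now = now or datetime.now(timezone.utc)
    return now - last_updated > timedelta(hours=freshness_sla_hours)


# Fixed reference time so the example is reproducible
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
recent_is_stale = is_stale(datetime(2024, 1, 2, 0, 0, tzinfo=timezone.utc), 24, now)  # 12 h old
old_is_stale = is_stale(datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc), 24, now)     # 36 h old

print(recent_is_stale, old_is_stale)  # False True
```

The example uses timezone-aware datetimes throughout; mixing them with naive values such as `datetime.now()` would raise a `TypeError` on subtraction.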

Data Lineage

Data lineage tracks the flow of data from its origin through transformations to its final destination. It answers: where did this data come from, what transformations were applied, and who consumes it?

Lineage Complexity

Tables: T
Edges: E (average dependencies per table = E/T)
Lineage Depth: D = max path length from source to sink
Impact Analysis Cost: O(T + E) for a full downstream traversal that visits each table once; roughly O(E × D) if each of the D levels rescans the edge list
Metadata Storage: M = T × (avg_metadata_size) + E × (avg_edge_size)
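To make these quantities concrete, the sketch below computes T, E, E/T, and D for a small hypothetical lineage graph and runs a downstream traversal. The table names are invented for the example, and the depth calculation assumes the graph is acyclic.

```python
from collections import defaultdict, deque
from functools import lru_cache

# Hypothetical lineage edges as (source, target) pairs.
edges = [
    ("raw.orders", "staging.stg_orders"),
    ("raw.payments", "staging.stg_payments"),
    ("staging.stg_orders", "mart.fact_orders"),
    ("staging.stg_payments", "mart.fact_orders"),
    ("mart.fact_orders", "mart.revenue_dashboard"),
]

downstream = defaultdict(list)
tables = set()
for source, target in edges:
    downstream[source].append(target)
    tables.update((source, target))

T, E = len(tables), len(edges)


@lru_cache(maxsize=None)
def longest_path_from(table: str) -> int:
    """Longest path (in edges) starting at `table`; assumes no cycles."""
    return max((1 + longest_path_from(t) for t in downstream[table]), default=0)


D = max(longest_path_from(t) for t in tables)


def impacted(table: str) -> list:
    """Breadth-first downstream traversal; each table is visited once."""
    seen, queue, order = {table}, deque([table]), []
    while queue:
        for nxt in downstream[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
                order.append(nxt)
    return order


print(f"T={T}, E={E}, E/T={E / T:.2f}, D={D}")  # T=6, E=5, E/T=0.83, D=3
print(impacted("raw.orders"))
# ['staging.stg_orders', 'mart.fact_orders', 'mart.revenue_dashboard']
```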

-- Lineage query: Find all upstream sources
WITH RECURSIVE lineage_upstream AS (
    -- Base: target table
    SELECT
        target_table,
        source_table,
        1 AS depth
    FROM data_lineage
    WHERE target_table = 'mart.fact_orders'

    UNION ALL

    -- Recursive: follow upstream
    SELECT
        l.target_table,
        dl.source_table,
        l.depth + 1
    FROM lineage_upstream l
    JOIN data_lineage dl ON l.source_table = dl.target_table
    WHERE l.depth < 10  -- Prevent infinite loops
)

SELECT DISTINCT
    source_table,
    depth
FROM lineage_upstream
ORDER BY depth;

-- Lineage query: Impact analysis
WITH RECURSIVE impact AS (
    SELECT target_table, source_table, 1 AS depth
    FROM data_lineage
    WHERE source_table = 'staging.stg_orders'

    UNION ALL

    SELECT dl.target_table, dl.source_table, i.depth + 1
    FROM impact i
    JOIN data_lineage dl ON i.target_table = dl.source_table
    WHERE i.depth < 10
)

SELECT DISTINCT target_table, depth
FROM impact
ORDER BY depth;

-- Column-level lineage
SELECT
    source_table,
    source_column,
    target_table,
    target_column,
    transformation_type
FROM column_lineage
WHERE target_table = 'mart.fact_orders'
ORDER BY target_column;
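The column-level query above returns only direct sources. Tracing a column back through several hops can be sketched in Python as follows; the rows are hypothetical and simply mirror the shape of the `column_lineage` table.

```python
from typing import Dict, List, Set, Tuple

ColumnRef = Tuple[str, str]  # (table, column)

# Hypothetical rows shaped like the column_lineage table.
column_lineage: List[Dict[str, str]] = [
    {"source_table": "staging.stg_orders", "source_column": "gross_amount",
     "target_table": "mart.fact_orders", "target_column": "net_amount",
     "transformation_type": "expression"},
    {"source_table": "staging.stg_orders", "source_column": "discount",
     "target_table": "mart.fact_orders", "target_column": "net_amount",
     "transformation_type": "expression"},
    {"source_table": "raw.orders", "source_column": "amount",
     "target_table": "staging.stg_orders", "target_column": "gross_amount",
     "transformation_type": "rename"},
]


def upstream_columns(target: ColumnRef) -> Set[ColumnRef]:
    """Collect every column that feeds `target`, directly or transitively."""
    found: Set[ColumnRef] = set()
    stack = [target]
    while stack:
        table, column = stack.pop()
        for row in column_lineage:
            if (row["target_table"], row["target_column"]) == (table, column):
                src = (row["source_table"], row["source_column"])
                if src not in found:  # also protects against cycles
                    found.add(src)
                    stack.append(src)
    return found


for ref in sorted(upstream_columns(("mart.fact_orders", "net_amount"))):
    print(ref)
# ('raw.orders', 'amount')
# ('staging.stg_orders', 'discount')
# ('staging.stg_orders', 'gross_amount')
```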

Data Catalog Tools Comparison

A data catalog is an organized inventory of data assets that provides metadata management, data discovery, data lineage, and governance capabilities. It serves as the central interface for users to find, understand, and trust data.

Tool	Type	Key Features	Cost	Best For
DataHub	Open Source	Metadata, lineage, discovery	Free	Self-hosted, custom
Amundsen	Open Source	Search, discovery, lineage	Free	Lyft-style architecture
OpenMetadata	Open Source	Metadata, lineage, quality	Free	Modern, all-in-one
Alation	Enterprise	Collaboration, governance	$$$$	Large enterprises
Collibra	Enterprise	Governance, cataloging	$$$$	Compliance-heavy
AWS Glue Catalog	Managed	Schema, partitioning	Pay-per-use	AWS-native
Databricks Unity	Managed	Governance, lineage	Included	Databricks users
Atlan	Commercial	Modern UI, automation	$$$	Data teams

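# OpenMetadata catalog integration
# Module paths and field names follow recent openmetadata-ingestion releases
# and differ between versions; verify them against the installed SDK.
from metadata.generated.schema.api.data.createTable import CreateTableRequest
from metadata.generated.schema.entity.data.table import Column, DataType, Table
from metadata.generated.schema.entity.services.connections.metadata.openMetadataConnection import (
    OpenMetadataConnection,
)
from metadata.ingestion.ometa.ometa_api import OpenMetadata

# Connect to OpenMetadata (a secured server also needs authProvider and securityConfig)
server_config = OpenMetadataConnection(hostPort="http://localhost:8585/api")
metadata = OpenMetadata(server_config)

# Create table entity; databaseSchema is the fully qualified name of a schema
# that must already exist in OpenMetadata
table_request = CreateTableRequest(
    name="fact_orders",
    description="Fact table containing all customer orders",
    columns=[
        Column(name="order_key", dataType=DataType.BIGINT, description="Surrogate key"),
        Column(name="order_id", dataType=DataType.STRING, description="Natural key"),
        Column(name="customer_key", dataType=DataType.INT, description="FK to dim_customers"),
        Column(name="order_date", dataType=DataType.DATE, description="Order date"),
        Column(name="net_amount", dataType=DataType.DECIMAL, description="Net order amount"),
    ],
    databaseSchema="analytics.marts",
    # Tags and owner are attached with TagLabel and EntityReference objects,
    # whose required fields vary by release, so they are left out of this request:
    # tags=[{"tagFQN": "Finance"}, {"tagFQN": "Daily"}],
    # owner={"type": "team", "name": "data-platform"},
)
table = metadata.create_or_update(data=table_request)

# Search catalog by fully qualified name pattern (requires a populated search index)
results = metadata.es_search_from_fqn(entity_type=Table, fqn_search_string="*orders*")
for result in results or []:
    print(f"{result.fullyQualifiedName}: {result.description}")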

Key Concepts Summary

Concept	Description	Tool/Implementation	Metric
Metadata Catalog	Central repository for data metadata	DataHub, Amundsen, OpenMetadata	Catalog coverage
Data Lineage	Track data flow and transformations	dbt, Apache Atlas, Marquez	Lineage completeness
Data Quality	Automated quality monitoring	Great Expectations, Soda	Quality score
Access Control	RBAC/ABAC for data access	Unity Catalog, Purview	Policy compliance
Data Classification	Sensitivity labeling	Manual + ML-assisted	Classification coverage
Schema Registry	Schema versioning and evolution	Confluent, AWS Glue	Schema compliance
Data Catalog Search	Discoverable data assets	Custom UI + Elasticsearch	Findability score
Policy Engine	Automated governance enforcement	OPA, Custom rules	Violation count
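For the data classification row, a name-based first pass can be sketched as below. The patterns and the rule that a table inherits its most sensitive column label are assumptions for illustration; a real deployment would typically combine this with content sampling or ML-assisted detection and manual review.

```python
import re
from typing import Dict, List

# Hypothetical name patterns per sensitivity level, checked most sensitive first.
RULES = [
    ("restricted", re.compile(r"(ssn|passport|card_number|password)")),
    ("confidential", re.compile(r"(email|phone|address|birth)")),
]
DEFAULT_LEVEL = "internal"
RANK = {"public": 0, "internal": 1, "confidential": 2, "restricted": 3}


def classify_column(name: str) -> str:
    """Return the first matching sensitivity level for a column name."""
    lowered = name.lower()
    for level, pattern in RULES:
        if pattern.search(lowered):
            return level
    return DEFAULT_LEVEL


def classify_table(columns: List[str]) -> Dict:
    """A table inherits the most sensitive label among its columns."""
    labels = {c: classify_column(c) for c in columns}
    table_level = max(labels.values(), key=RANK.__getitem__, default=DEFAULT_LEVEL)
    return {"table": table_level, "columns": labels}


result = classify_table(["order_id", "customer_email", "card_number_hash"])
print(result["table"])    # restricted
print(result["columns"])
```

Classification coverage, the metric in the table, would then be the share of tables that carry a label at all.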

Performance Metrics

Metric	Without Governance	With Governance	Target
Dataset Discovery Time	Hours-Days	Seconds-Minutes	< 1 min
Data Quality Score	60-80%	95-99%	> 98%
Duplicate Datasets	30-50%	< 5%	< 3%
Compliance Violations	Unknown	Tracked	0 critical
Lineage Coverage	20-40%	90-100%	> 95%
Time to Trust New Data	Weeks	Hours	< 1 day
Security Incidents	Reactive	Proactive	0 breaches
Data Literacy Score	Low	Medium-High	> 80%
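Coverage-style metrics such as those in the table can be derived directly from catalog records. The sketch below is one possible way to compute them; the inventory, the field names, and the decision to count source tables as lineage-covered are all assumptions.

```python
from typing import Callable, Dict, List

# Hypothetical asset inventory; field names are assumptions for illustration.
assets: List[Dict] = [
    {"name": "fact_orders", "description": "All customer orders",
     "owner": "data-platform", "upstream": ["stg_orders"]},
    {"name": "stg_orders", "description": "",
     "owner": "data-platform", "upstream": ["raw_orders"]},
    {"name": "tmp_export", "description": "", "owner": "", "upstream": []},
    {"name": "raw_orders", "description": "Raw order events",
     "owner": "ingestion", "upstream": [], "is_source": True},
]


def coverage(items: List[Dict], predicate: Callable[[Dict], bool]) -> float:
    """Share of items (0.0 to 1.0) that satisfy the predicate."""
    return sum(1 for i in items if predicate(i)) / len(items) if items else 0.0


# Documented: has both a description and an owner
documented = coverage(assets, lambda a: bool(a["description"]) and bool(a["owner"]))
# Source tables have no upstream by definition, so they count as covered
with_lineage = coverage(assets, lambda a: bool(a["upstream"]) or a.get("is_source", False))

print(f"Catalog coverage: {documented:.0%}")    # Catalog coverage: 50%
print(f"Lineage coverage: {with_lineage:.0%}")  # Lineage coverage: 75%
```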

10 Best Practices

1. Implement a data catalog from day one: retroactive cataloging is much harder
2. Automate metadata collection: use hooks in dbt, Airflow, and ingestion tools
3. Enforce data quality at ingestion: reject bad data before it enters the lake
4. Track column-level lineage: understand the impact of schema changes
5. Implement data classification: label all datasets by sensitivity level
6. Use a policy engine: automate governance enforcement instead of relying on manual review
7. Create a business glossary: define metrics consistently across the organization
8. Implement data SLAs: define freshness, quality, and availability requirements
9. Monitor governance compliance: track metrics and alert on violations
10. Make governance self-serve: provide tools and templates for domain teams
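The practice of enforcing quality at ingestion can be sketched as a simple gate that splits a batch into accepted and rejected rows and blocks the load when too few rows pass. The rules, the field names, and the 98% default threshold are assumptions chosen for this example.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]

# Hypothetical rules for an orders feed.
RULES: Dict[str, Callable[[Row], bool]] = {
    "order_id_present": lambda r: bool(r.get("order_id")),
    "amount_non_negative": lambda r: isinstance(r.get("net_amount"), (int, float))
    and r["net_amount"] >= 0,
}


def gate(rows: List[Row], min_pass_rate: float = 0.98) -> Tuple[List[Row], List[Row], bool]:
    """Split rows into accepted and rejected, and decide whether to load the batch."""
    accepted: List[Row] = []
    rejected: List[Row] = []
    for row in rows:
        passed = all(rule(row) for rule in RULES.values())
        (accepted if passed else rejected).append(row)
    pass_rate = len(accepted) / len(rows) if rows else 1.0
    return accepted, rejected, pass_rate >= min_pass_rate


batch = [
    {"order_id": "A1", "net_amount": 10.0},
    {"order_id": "", "net_amount": 5.0},    # missing order_id
    {"order_id": "A3", "net_amount": -1},   # negative amount
]
accepted, rejected, load_ok = gate(batch)
print(len(accepted), len(rejected), load_ok)  # 1 2 False
```

Rejected rows would normally be written to a quarantine location with the failing rule names so that producers can fix them.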

Key Takeaways

Metadata catalogs enable data discovery, trust, and governance at scale
Column-level lineage tracks data flow from source to consumer for impact analysis
Automated quality monitoring catches issues before they reach consumers
RBAC/ABAC policies enforce access controls programmatically rather than through manual review
Governance must be self-serve for domain teams in decentralized architectures
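As an illustration of how RBAC, ABAC, and masking can combine, the sketch below grants unmasked reads only when a role's clearance covers the asset's classification and the user's domain matches the asset's domain. The role names, clearance levels, and masking format are hypothetical.

```python
from typing import Dict

RANK = {"public": 0, "internal": 1, "confidential": 2, "restricted": 3}

# Hypothetical roles: the highest classification each role may read unmasked.
ROLE_CLEARANCE = {
    "analyst": "internal",
    "data_steward": "confidential",
    "security_admin": "restricted",
}


def can_read(user: Dict, asset: Dict) -> bool:
    """RBAC: role clearance must cover the classification.
    ABAC: the user's domain must match the asset's domain unless the asset is public."""
    clearance = ROLE_CLEARANCE.get(user["role"], "public")
    role_ok = RANK[clearance] >= RANK[asset["classification"]]
    attr_ok = asset["classification"] == "public" or user["domain"] == asset["domain"]
    return role_ok and attr_ok


def mask_email(value: str) -> str:
    """Keep the first character and the domain; hide the rest."""
    local, _, domain = value.partition("@")
    return f"{local[:1]}***@{domain}" if domain else "***"


def read_email(user: Dict, asset: Dict, value: str) -> str:
    """Return the raw value for authorized users and a masked value otherwise."""
    return value if can_read(user, asset) else mask_email(value)


asset = {"classification": "confidential", "domain": "sales"}
steward = {"role": "data_steward", "domain": "sales"}
analyst = {"role": "analyst", "domain": "sales"}

print(read_email(steward, asset, "jane.doe@example.com"))  # jane.doe@example.com
print(read_email(analyst, asset, "jane.doe@example.com"))  # j***@example.com
```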