Data Mesh Architecture on GCP
Dataplex Implementation
from google.cloud import dataplex_v1

client = dataplex_v1.DataplexServiceClient()

# Create the lake. create_lake returns a long-running operation;
# result() blocks until it finishes and returns the Lake resource.
lake = client.create_lake(
    request={
        "parent": "projects/my-project/locations/us-central1",
        "lake_id": "data-mesh-lake",
        "lake": {
            "display_name": "Data Mesh Lake",
            "description": "Central data mesh lake",
        },
    }
).result()

# Create a curated zone for the sales domain inside the lake
zone = client.create_zone(
    request={
        "parent": lake.name,
        "zone_id": "sales-zone",
        "zone": {
            "display_name": "Sales Zone",
            "type_": dataplex_v1.Zone.Type.CURATED,
            "resource_spec": {
                "location_type": dataplex_v1.Zone.ResourceSpec.LocationType.SINGLE_REGION
            },
        },
    }
).result()

# Attach an existing BigQuery dataset to the zone as an asset
asset = client.create_asset(
    request={
        "parent": zone.name,
        "asset_id": "sales-bigquery",
        "asset": {
            "display_name": "Sales BigQuery Dataset",
            "resource_spec": {
                "name": "projects/my-project/datasets/sales",
                "type_": dataplex_v1.Asset.ResourceSpec.Type.BIGQUERY_DATASET,
            },
            "discovery_spec": {
                "enabled": True,
                # Cron expression: run discovery every 6 hours
                "schedule": "0 */6 * * *",
            },
        },
    }
).result()
Data Products
# Define data product metadata
data_product = {
    "name": "sales-revenue",
    "domain": "sales",
    "description": "Daily sales revenue aggregated by product and region",
    "owner": "sales-data-team@company.com",
    "sla": {
        "freshness": "daily",
        "availability": "99.9%",
        "quality": "99.5% completeness",
    },
    "assets": [
        {
            "type": "bigquery_table",
            "resource": "project_sales.analytics.daily_revenue",
            "format": "BigQuery table",
        },
        {
            "type": "gcs_bucket",
            "resource": "gs://data-mesh/sales/daily_revenue/",
            "format": "Parquet files",
        },
    ],
    "quality_checks": [
        {"type": "completeness", "threshold": 0.995},
        {"type": "freshness", "threshold_hours": 24},
        {"type": "accuracy", "threshold": 0.99},
    ],
    "lineage": {
        "upstream": ["raw_sales", "raw_customers"],
        "downstream": ["executive_dashboard", "sales_forecast_ml"],
    },
}
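The quality_checks entries in a descriptor like this only matter if something evaluates them. The following is a minimal sketch of that evaluation, assuming the measured metrics arrive as a plain dict from some profiling job. The function name evaluate_quality_checks, the observed dict, and its values are hypothetical and not part of Dataplex.

```python
def evaluate_quality_checks(checks, observed):
    """Return a list of (check_type, passed) pairs.

    Ratio checks ("threshold") pass when the observed value is >= the threshold.
    Age checks ("threshold_hours") pass when the observed age is <= the threshold.
    A metric that is missing from `observed` counts as a failure.
    """
    results = []
    for check in checks:
        name = check["type"]
        value = observed.get(name)
        if value is None:
            passed = False
        elif "threshold_hours" in check:
            passed = value <= check["threshold_hours"]
        else:
            passed = value >= check["threshold"]
        results.append((name, passed))
    return results


checks = [
    {"type": "completeness", "threshold": 0.995},
    {"type": "freshness", "threshold_hours": 24},
    {"type": "accuracy", "threshold": 0.99},
]

# Hypothetical measurements: completeness and accuracy are ratios,
# freshness is the age of the newest partition in hours.
observed = {"completeness": 0.997, "freshness": 30, "accuracy": 0.992}

results = evaluate_quality_checks(checks, observed)
for name, passed in results:
    print(f"{name}: {'PASS' if passed else 'FAIL'}")
```

With these sample numbers the freshness check fails, because the data is 30 hours old against a 24-hour limit, while the two ratio checks pass.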
Best Practice: Data products should be self-describing, discoverable, and addressable. Include SLAs, quality metrics, and lineage information. Use Dataplex for automated discovery and quality monitoring. Each domain should own and manage its own data products.
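One way to enforce the "self-describing" part is to reject descriptors that omit key metadata before they are published. This is a minimal sketch under the assumption that descriptors are plain Python dicts. The REQUIRED_FIELDS tuple and the missing_fields helper are illustrative choices, not a Dataplex or Data Catalog feature.

```python
# Fields this sketch treats as mandatory for a publishable data product.
REQUIRED_FIELDS = (
    "name",
    "domain",
    "description",
    "owner",
    "sla",
    "assets",
    "quality_checks",
    "lineage",
)


def missing_fields(product):
    """Return the required fields that are absent or empty, in declaration order."""
    return [field for field in REQUIRED_FIELDS if not product.get(field)]


# A hypothetical draft descriptor that is not ready to publish.
draft = {
    "name": "sales-revenue",
    "domain": "sales",
    "description": "Daily sales revenue",
    "assets": [],  # present but empty, so it is reported as missing
    "lineage": {"upstream": ["raw_sales"]},
}

problems = missing_fields(draft)
print(problems)
```

A check like this could run in a domain team's CI pipeline so that incomplete descriptors never reach the catalog.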
Common Interview Questions
Q1: What is Data Mesh and why is it important?
Answer: Data Mesh is a decentralized data architecture where domain teams own their data products. It eliminates the bottleneck of centralized data teams, improves data quality through domain expertise, and scales data operations. On GCP, Dataplex provides the governance layer for federated data management.
Q2: How does Dataplex support Data Mesh?
Answer: Dataplex provides: 1) Zones for domain isolation, 2) Assets for resource discovery, 3) Data quality rules per zone, 4) Lineage tracking across domains, 5) Central governance with domain autonomy, 6) Integration with Data Catalog for metadata.
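To make the lineage point concrete, the sketch below shows what an impact analysis over lineage data does, independent of how a platform stores it. It assumes lineage has been exported into a plain dict that maps each node to its direct consumers. The graph contents and the downstream_of function are hypothetical.

```python
from collections import deque


def downstream_of(graph, start):
    """Return every node reachable from `start`, in breadth-first order."""
    seen = []
    queue = deque(graph.get(start, []))
    while queue:
        node = queue.popleft()
        # Skip nodes already visited and guard against cycles back to the start.
        if node in seen or node == start:
            continue
        seen.append(node)
        queue.extend(graph.get(node, []))
    return seen


# Hypothetical lineage: each key feeds the nodes in its list.
graph = {
    "raw_sales": ["sales-revenue"],
    "raw_customers": ["sales-revenue"],
    "sales-revenue": ["executive_dashboard", "sales_forecast_ml"],
}

impacted = downstream_of(graph, "raw_sales")
print(impacted)
```

Here a change to raw_sales affects the sales-revenue product and, through it, both downstream consumers.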
Q3: What makes a good data product?
Answer: A good data product is: 1) Self-describing (metadata, documentation), 2) Discoverable (cataloged in Data Catalog), 3) Addressable (unique identifier, stable endpoint), 4) Trustworthy (SLAs, quality checks), 5) Interoperable (standard formats), 6) Secure (access controls).
Q4: How do you implement cross-domain data sharing?
Answer: 1) Define data products with clear interfaces, 2) Use Dataplex zones for access control, 3) Implement data contracts (SLAs, schemas), 4) Use BigQuery authorized datasets for cross-project access, 5) Track lineage for impact analysis.
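The schema part of a data contract can be checked mechanically. The following is a minimal sketch, assuming both the contract and the published table schema are available as column-name to type-name dicts. The contract_violations function and the sample schemas are hypothetical, and this sketch treats extra columns as allowed.

```python
def contract_violations(contract_schema, actual_schema):
    """Compare column -> type mappings and return human-readable violations.

    A violation is a contracted column that is missing or whose type changed.
    Columns that exist only in the actual schema are not reported.
    """
    violations = []
    for column, expected_type in contract_schema.items():
        actual_type = actual_schema.get(column)
        if actual_type is None:
            violations.append(f"missing column: {column}")
        elif actual_type != expected_type:
            violations.append(
                f"type changed: {column} {expected_type} -> {actual_type}"
            )
    return violations


# Hypothetical contract agreed between the sales domain and its consumers.
contract = {"product_id": "STRING", "region": "STRING", "revenue": "NUMERIC"}

# Hypothetical schema of the table as currently published.
actual = {"product_id": "STRING", "revenue": "FLOAT64", "discount": "NUMERIC"}

violations = contract_violations(contract, actual)
for violation in violations:
    print(violation)
```

Allowing additive columns keeps producers free to evolve a product without breaking consumers; a stricter contract could flag them as well.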
Q5: What are the challenges of implementing Data Mesh?
Answer: 1) Cultural shift from centralized to federated, 2) Standardization across domains, 3) Governance without bureaucracy, 4) Cross-domain data discovery, 5) Maintaining data quality across teams, 6) Tooling and platform investment.