Data Lake Architecture on GCP
BigLake Tables
-- Create a BigLake external table over existing Iceberg metadata
CREATE EXTERNAL TABLE `project.dataset.iceberg_sales`
WITH CONNECTION `us-central1.my-connection`
OPTIONS (
  format = 'ICEBERG',
  uris = ['gs://my-data-lake/iceberg/sales/metadata/v1.metadata.json']
);
-- Create a BigLake external table over a Delta Lake table
-- (the URI points at the table root, which contains _delta_log/)
CREATE EXTERNAL TABLE `project.dataset.delta_sales`
WITH CONNECTION `us-central1.my-connection`
OPTIONS (
  format = 'DELTA_LAKE',
  uris = ['gs://my-data-lake/delta/sales']
);
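-- Create a BigLake external table over Hive-partitioned files.
-- The format option names the underlying file type; PARQUET is assumed here,
-- so substitute the actual format (ORC, AVRO, CSV, ...) of the files.
CREATE EXTERNAL TABLE `project.dataset.hive_sales`
WITH PARTITION COLUMNS
WITH CONNECTION `us-central1.my-connection`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-data-lake/hive/sales/*'],
  hive_partition_uri_prefix = 'gs://my-data-lake/hive/sales'
);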
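Once defined, BigLake tables are queried with ordinary BigQuery SQL, regardless of the underlying format. A minimal sketch, assuming both tables expose order_id and amount columns (hypothetical names; the schemas are not given here):

```sql
-- Find orders whose amount differs between the Iceberg and Delta Lake copies.
-- order_id and amount are hypothetical column names.
SELECT
  i.order_id,
  i.amount AS iceberg_amount,
  d.amount AS delta_amount
FROM `project.dataset.iceberg_sales` AS i
JOIN `project.dataset.delta_sales` AS d
  ON i.order_id = d.order_id
WHERE i.amount != d.amount;
```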
Open Table Format Comparison
Best Practice: For new data lake projects on GCP, prefer Apache Iceberg. It offers hidden partitioning and partition evolution, and BigLake integrates closely with it. For existing Delta Lake workloads, BigLake still provides unified access through BigQuery.
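Hidden partitioning and partition evolution are easiest to see on the engine that writes the table. The following is a rough sketch in Spark SQL, assuming the Iceberg Spark SQL extensions are enabled and an Iceberg catalog is registered under the hypothetical name lake; the table and column names are illustrative only:

```sql
-- Hidden partitioning: partition by a transform of order_ts.
-- No separate date column has to be stored or maintained.
CREATE TABLE lake.sales.orders (
  order_id BIGINT,
  order_ts TIMESTAMP,
  amount   DECIMAL(10, 2)
) USING iceberg
PARTITIONED BY (days(order_ts));

-- Readers filter on the source column; Iceberg prunes partitions itself.
SELECT order_id, amount
FROM lake.sales.orders
WHERE order_ts >= TIMESTAMP '2024-01-01 00:00:00';

-- Partition evolution: switch new data to monthly partitions.
-- This is a metadata change; existing files are not rewritten.
ALTER TABLE lake.sales.orders ADD PARTITION FIELD months(order_ts);
ALTER TABLE lake.sales.orders DROP PARTITION FIELD days(order_ts);
```

Data written before the change keeps its daily layout, while new writes use the monthly spec.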
Common Interview Questions
Q1: What is the difference between a data lake and a data warehouse?
Answer: A data lake stores raw data in any form (structured, semi-structured, or unstructured) at low cost, typically in object storage such as GCS. A data warehouse stores structured, processed data optimized for analytics, such as BigQuery. Data lakes suit exploration; warehouses suit reporting. Modern lakehouses combine both capabilities.
Q2: What is BigLake and why is it important?
Answer: BigLake provides unified governance across data lakes and warehouses. It supports open table formats (Iceberg, Delta Lake), fine-grained access control, and works with BigQuery, Dataproc, and Dataflow. It eliminates data silos while maintaining governance.
Q3: When would you use Iceberg vs. Delta Lake?
Answer: Iceberg is recommended for new projects on GCP due to better integration, hidden partitioning, and partition evolution. Delta Lake is preferred for existing Spark-based workloads or when using Databricks. Both provide ACID transactions and time travel.
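Time travel looks much the same in both formats when queried from Spark SQL (Spark 3.3 or later). A minimal sketch, assuming a hypothetical table lake.sales.orders; the timestamp and version values are placeholders:

```sql
-- Read the table as it existed at a point in time.
SELECT *
FROM lake.sales.orders TIMESTAMP AS OF '2024-01-01 00:00:00';

-- Read a specific version: an Iceberg snapshot ID or a Delta Lake
-- commit version, depending on the table format. 1234567890 is a placeholder.
SELECT *
FROM lake.sales.orders VERSION AS OF 1234567890;
```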
Q4: How do you govern a data lake on GCS?
Answer: 1) Use Dataplex for data discovery and lineage, 2) Implement BigLake for fine-grained (row- and column-level) access control, 3) Use Data Catalog for metadata management, 4) Apply Cloud DLP (now part of Sensitive Data Protection) to detect sensitive data, 5) Enable Cloud Audit Logs, 6) Use policy tags for column-level security.
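As one illustration of fine-grained access control, BigQuery row access policies can restrict which rows a principal sees. A minimal sketch, assuming the table supports row-level security and has a region column; the policy name, group address, and column are hypothetical:

```sql
-- Members of the group see only rows where region = 'US'.
-- us_only, the group address, and the region column are hypothetical.
CREATE ROW ACCESS POLICY us_only
ON `project.dataset.hive_sales`
GRANT TO ('group:us-analysts@example.com')
FILTER USING (region = 'US');
```

Once a table has at least one row access policy, principals not covered by any policy see no rows, so policies should be planned for every group that needs access.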
Q5: What are the cost benefits of a lakehouse architecture?
Answer: 1) Raw data is stored cheaply in GCS, 2) Compute scales independently of storage, 3) Open formats reduce vendor lock-in, 4) With on-demand pricing, BigQuery bills for the bytes each query scans rather than for idle compute, 5) Lifecycle policies move or delete aging objects to reduce storage costs, 6) Shared infrastructure reduces operational overhead.