Looker Architecture
Looker is a modern BI and data platform that runs on a semantic modeling layer (LookML). It connects to databases directly, providing real-time analytics without data movement.
Architecture Overview
LookML Modeling
View Definition
# views/orders.view
view: orders {
sql_table_name: `project.analytics.orders` ;;
dimension: order_id {
primary_key: yes
type: string
sql: ${TABLE}.order_id ;;
description: "Unique order identifier"
}
dimension: customer_id {
type: string
sql: ${TABLE}.customer_id ;;
hidden: yes
}
dimension_group: order_date {
type: time
timeframes: [
raw, time, date, week, month, quarter, year
]
sql: ${TABLE}.order_date ;;
description: "Date when order was placed"
}
dimension: amount {
type: number
sql: ${TABLE}.amount ;;
description: "Order amount in USD"
}
dimension: amount_tier {
type: tier
tiers: [0, 100, 500, 1000, 5000]
style: integer
sql: ${amount} ;;
description: "Order amount buckets"
}
dimension: status {
type: string
sql: ${TABLE}.status ;;
description: "Order status (pending, completed, cancelled)"
}
# Measures
measure: total_revenue {
type: sum
sql: ${amount} ;;
description: "Total revenue from orders"
value_format_name: usd
drill_fields: [order_detail*]
}
measure: average_order_value {
type: average
sql: ${amount} ;;
description: "Average order value"
value_format_name: usd
}
measure: order_count {
type: count
drill_fields: [order_detail*]
description: "Total number of orders"
}
measure: unique_customers {
type: count_distinct
sql: ${customer_id} ;;
description: "Number of unique customers"
}
# Sets for drilling
set: order_detail {
fields: [
order_id,
order_date_date,
amount,
status,
customers.customer_name
]
}
}
Explore Definition
# explores/orders.explore
explore: orders {
label: "Sales Orders"
description: "Explore for analyzing sales orders and revenue"
group_label: "Sales"
join: customers {
type: left_outer
sql_on: ${orders.customer_id} = ${customers.customer_id} ;;
relationship: many_to_one
}
join: order_items {
type: left_outer
sql_on: ${orders.order_id} = ${order_items.order_id} ;;
relationship: one_to_many
}
join: products {
type: left_outer
sql_on: ${order_items.product_id} = ${products.product_id} ;;
relationship: many_to_one
}
# Always filter
always_filter: {
filters: [order_date_date: "30 days"]
}
# Aggregate awareness
aggregate_table: daily_orders {
query: {
dimensions: [order_date_date]
measures: [order_count, total_revenue]
}
materialization: {
datagroup_trigger: daily_etl
}
}
# Access control
access_filter: {
field: customers.region
user_attribute: allowed_regions
}
}
β¨
Best Practice: Use derived tables sparingly β prefer native database views or materialized views for performance. Create aggregate tables for commonly accessed aggregations. Use always_filter to prevent full-table scans. Document all dimensions and measures for user self-service.
Derived Tables
# Derived table for daily sales summary
view: daily_sales_summary {
derived_table: {
sql:
SELECT
DATE(order_date) as order_date,
COUNT(*) as order_count,
SUM(amount) as total_revenue,
COUNT(DISTINCT customer_id) as unique_customers,
AVG(amount) as avg_order_value
FROM `project.analytics.orders`
WHERE status = 'completed'
GROUP BY 1
;;
datagroup_trigger: daily_etl
distribution: "order_date"
sortkeys: ["order_date"]
}
dimension: order_date {
type: date
sql: ${TABLE}.order_date ;;
}
dimension: order_count {
type: number
sql: ${TABLE}.order_count ;;
}
dimension: total_revenue {
type: number
sql: ${TABLE}.total_revenue ;;
value_format_name: usd
}
measure: total_orders {
type: sum
sql: ${order_count} ;;
}
measure: total_revenue_sum {
type: sum
sql: ${total_revenue} ;;
value_format_name: usd
}
}
Data Engineering Integration
BigQuery + Looker Pipeline
# Example: Materialized view in BigQuery for Looker
from google.cloud import bigquery
client = bigquery.Client()
# Create materialized view optimized for Looker
query = """
CREATE MATERIALIZED VIEW `project.analytics.looker_sales_mv`
PARTITION BY DATE(order_date)
CLUSTER BY product_category, region
AS
SELECT
DATE(o.order_date) as order_date,
p.product_category,
c.region,
COUNT(*) as order_count,
SUM(o.amount) as total_revenue,
COUNT(DISTINCT o.customer_id) as unique_customers
FROM `project.analytics.orders` o
JOIN `project.analytics.products` p ON o.product_id = p.product_id
JOIN `project.analytics.customers` c ON o.customer_id = c.customer_id
WHERE o.status = 'completed'
GROUP BY 1, 2, 3
"""
job = client.query(query)
job.result()
print("Materialized view created for Looker")
Looker + Dataflow Integration
# Looker API for data export
import looker_api
def export_lookml_to_dataflow():
"""Export LookML metadata for Dataflow pipeline."""
client = looker_api.looker30SDK(
base_url="https://my-looker.company.com:19999",
client_id="my_client_id",
client_secret="my_client_secret"
)
# Get all explores
explores = client.all_lookml_models()
# Export for Dataflow pipeline configuration
for model in explores:
for explore in model.explores:
print(f"Model: {model.name}, Explore: {explore.name}")
return explores
Embedded Analytics
# Looker Embed SDK
import looker_sdk
def create_embed_url(user_id, dashboard_id):
"""Create embedded dashboard URL."""
sdk = looker_sdk.init31("looker.ini")
# Create embed user
embed_user = sdk.create_sso_embed_user(
body=looker_sdk.models314.CreateSsoEmbedUser(
first_name="Embedded",
last_name="User",
external_user_id=user_id,
session_length=3600,
forced_params=[
{"name": "filter_field", "value": "user_id"}
]
)
)
# Generate embed URL
url = sdk.embed_url(
target_url=f"/embed/dashboards/{dashboard_id}",
session_length=3600
)
return url
Always monitor your BigQuery costs using INFORMATION_SCHEMA. Set up budget alerts at 50%, 80%, and 100% thresholds.
Cost Optimization
# Looker pricing model
pricing = {
"platform_fee": "Custom pricing based on users/features",
"bigquery_costs": "Standard BigQuery pricing applies",
"optimization": {
"use_materialized_views": "Reduce query costs",
"aggregate_tables": "Pre-compute common aggregations",
"cache_results": "Leverage Looker caching layer",
"use_pdt": "Persistent derived tables for expensive queries"
}
}
βΉοΈ
Cost Tip: Looker costs are primarily driven by platform licensing and underlying database query costs. Optimize by: 1) Using materialized views in BigQuery, 2) Creating aggregate tables for dashboards, 3) Implementing caching, 4) Using persistent derived tables for expensive computations.
Common Interview Questions
Q1: What is LookML and why is it important?
Answer: LookML is Looker's semantic modeling language that defines business logic. It provides a centralized layer for dimensions, measures, and relationships. LookML ensures consistent metrics across all reports and dashboards. Changes to LookML automatically propagate to all dependent content.
Q2: What is the difference between derived tables and persistent derived tables?
Answer: Derived tables are virtual tables computed at query time. Persistent derived tables (PDTs) are materialized in the database and refreshed on a schedule. PDTs improve performance for expensive queries but require database storage and refresh time.
Q3: How does Looker integrate with BigQuery?
Answer: Looker connects directly to BigQuery using the BigQuery SQL dialect. It supports native BigQuery features like partitioning, clustering, and nested/repeated fields. Looker can create materialized views and aggregate tables for performance optimization.
Q4: What is embedded analytics in Looker?
Answer: Embedded analytics allows you to integrate Looker dashboards and reports into your own applications. Users interact with Looker content without leaving your application. It supports single sign-on, custom styling, and parameter passing for personalized experiences.
Q5: How do you optimize Looker performance?
Answer: 1) Create aggregate tables for common queries, 2) Use persistent derived tables for expensive computations, 3) Implement caching with datagroups, 4) Optimize LookML joins and explores, 5) Use BigQuery materialized views, 6) Limit dashboard tile complexity.