Looker Deep Dive for Data Engineering

Master Looker including LookML modeling, explores, derived tables, dashboards, embedded analytics, and data engineering integration patterns.

18 min readAdvanced

Looker Architecture

Looker is a modern BI and data platform that runs on a semantic modeling layer (LookML). It connects to databases directly, providing real-time analytics without data movement.

Architecture Overview

🏗️ GCP Data Engineering Reference Architecture

Interview Tip: GCP's data engineering stack is serverless-first. Dataflow (Apache Beam) handles both streaming and batch. BigQuery is the flagship analytics service.

LookML Modeling

View Definition

# views/orders.view
view: orders {
  sql_table_name: `project.analytics.orders` ;;

  dimension: order_id {
    primary_key: yes
    type: string
    sql: ${TABLE}.order_id ;;
    description: "Unique order identifier"
  }

  dimension: customer_id {
    type: string
    sql: ${TABLE}.customer_id ;;
    hidden: yes
  }

  dimension_group: order_date {
    type: time
    timeframes: [
      raw, time, date, week, month, quarter, year
    ]
    sql: ${TABLE}.order_date ;;
    description: "Date when order was placed"
  }

  dimension: amount {
    type: number
    sql: ${TABLE}.amount ;;
    description: "Order amount in USD"
  }

  dimension: amount_tier {
    type: tier
    tiers: [0, 100, 500, 1000, 5000]
    style: integer
    sql: ${amount} ;;
    description: "Order amount buckets"
  }

  dimension: status {
    type: string
    sql: ${TABLE}.status ;;
    description: "Order status (pending, completed, cancelled)"
  }

  # Measures
  measure: total_revenue {
    type: sum
    sql: ${amount} ;;
    description: "Total revenue from orders"
    value_format_name: usd
    drill_fields: [order_detail*]
  }

  measure: average_order_value {
    type: average
    sql: ${amount} ;;
    description: "Average order value"
    value_format_name: usd
  }

  measure: order_count {
    type: count
    drill_fields: [order_detail*]
    description: "Total number of orders"
  }

  measure: unique_customers {
    type: count_distinct
    sql: ${customer_id} ;;
    description: "Number of unique customers"
  }

  # Sets for drilling
  set: order_detail {
    fields: [
      order_id,
      order_date_date,
      amount,
      status,
      customers.customer_name
    ]
  }
}

Explore Definition

# explores/orders.explore
explore: orders {
  label: "Sales Orders"
  description: "Explore for analyzing sales orders and revenue"
  group_label: "Sales"

  join: customers {
    type: left_outer
    sql_on: ${orders.customer_id} = ${customers.customer_id} ;;
    relationship: many_to_one
  }

  join: order_items {
    type: left_outer
    sql_on: ${orders.order_id} = ${order_items.order_id} ;;
    relationship: one_to_many
  }

  join: products {
    type: left_outer
    sql_on: ${order_items.product_id} = ${products.product_id} ;;
    relationship: many_to_one
  }

  # Always filter
  always_filter: {
    filters: [order_date_date: "30 days"]
  }

  # Aggregate awareness
  aggregate_table: daily_orders {
    query: {
      dimensions: [order_date_date]
      measures: [order_count, total_revenue]
    }
    materialization: {
      datagroup_trigger: daily_etl
    }
  }

  # Access control
  access_filter: {
    field: customers.region
    user_attribute: allowed_regions
  }
}

✨

Best Practice: Use derived tables sparingly — prefer native database views or materialized views for performance. Create aggregate tables for commonly accessed aggregations. Use always_filter to prevent full-table scans. Document all dimensions and measures for user self-service.

Derived Tables

# Derived table for daily sales summary
view: daily_sales_summary {
  derived_table: {
    sql:
      SELECT
        DATE(order_date) as order_date,
        COUNT(*) as order_count,
        SUM(amount) as total_revenue,
        COUNT(DISTINCT customer_id) as unique_customers,
        AVG(amount) as avg_order_value
      FROM `project.analytics.orders`
      WHERE status = 'completed'
      GROUP BY 1
    ;;
    datagroup_trigger: daily_etl
    distribution: "order_date"
    sortkeys: ["order_date"]
  }

  dimension: order_date {
    type: date
    sql: ${TABLE}.order_date ;;
  }

  dimension: order_count {
    type: number
    sql: ${TABLE}.order_count ;;
  }

  dimension: total_revenue {
    type: number
    sql: ${TABLE}.total_revenue ;;
    value_format_name: usd
  }

  measure: total_orders {
    type: sum
    sql: ${order_count} ;;
  }

  measure: total_revenue_sum {
    type: sum
    sql: ${total_revenue} ;;
    value_format_name: usd
  }
}

Data Engineering Integration

BigQuery + Looker Pipeline

# Example: Materialized view in BigQuery for Looker
from google.cloud import bigquery

client = bigquery.Client()

# Create materialized view optimized for Looker
query = """
CREATE MATERIALIZED VIEW `project.analytics.looker_sales_mv`
PARTITION BY DATE(order_date)
CLUSTER BY product_category, region
AS
SELECT
    DATE(o.order_date) as order_date,
    p.product_category,
    c.region,
    COUNT(*) as order_count,
    SUM(o.amount) as total_revenue,
    COUNT(DISTINCT o.customer_id) as unique_customers
FROM `project.analytics.orders` o
JOIN `project.analytics.products` p ON o.product_id = p.product_id
JOIN `project.analytics.customers` c ON o.customer_id = c.customer_id
WHERE o.status = 'completed'
GROUP BY 1, 2, 3
"""

job = client.query(query)
job.result()
print("Materialized view created for Looker")

Looker + Dataflow Integration

# Looker API for data export
import looker_api

def export_lookml_to_dataflow():
    """Export LookML metadata for Dataflow pipeline."""
    client = looker_api.looker30SDK(
        base_url="https://my-looker.company.com:19999",
        client_id="my_client_id",
        client_secret="my_client_secret"
    )

    # Get all explores
    explores = client.all_lookml_models()

    # Export for Dataflow pipeline configuration
    for model in explores:
        for explore in model.explores:
            print(f"Model: {model.name}, Explore: {explore.name}")

    return explores

Embedded Analytics

# Looker Embed SDK
import looker_sdk

def create_embed_url(user_id, dashboard_id):
    """Create embedded dashboard URL."""
    sdk = looker_sdk.init31("looker.ini")

    # Create embed user
    embed_user = sdk.create_sso_embed_user(
        body=looker_sdk.models314.CreateSsoEmbedUser(
            first_name="Embedded",
            last_name="User",
            external_user_id=user_id,
            session_length=3600,
            forced_params=[
                {"name": "filter_field", "value": "user_id"}
            ]
        )
    )

    # Generate embed URL
    url = sdk.embed_url(
        target_url=f"/embed/dashboards/{dashboard_id}",
        session_length=3600
    )

    return url

⚠️ Cost Alert

Always monitor your BigQuery costs using INFORMATION_SCHEMA. Set up budget alerts at 50%, 80%, and 100% thresholds.

Cost Optimization

# Looker pricing model
pricing = {
    "platform_fee": "Custom pricing based on users/features",
    "bigquery_costs": "Standard BigQuery pricing applies",
    "optimization": {
        "use_materialized_views": "Reduce query costs",
        "aggregate_tables": "Pre-compute common aggregations",
        "cache_results": "Leverage Looker caching layer",
        "use_pdt": "Persistent derived tables for expensive queries"
    }
}

ℹ️

Cost Tip: Looker costs are primarily driven by platform licensing and underlying database query costs. Optimize by: 1) Using materialized views in BigQuery, 2) Creating aggregate tables for dashboards, 3) Implementing caching, 4) Using persistent derived tables for expensive computations.

💬

Common Interview Questions

Q1: What is LookML and why is it important?

Answer: LookML is Looker's semantic modeling language that defines business logic. It provides a centralized layer for dimensions, measures, and relationships. LookML ensures consistent metrics across all reports and dashboards. Changes to LookML automatically propagate to all dependent content.

Q2: What is the difference between derived tables and persistent derived tables?

Answer: Derived tables are virtual tables computed at query time. Persistent derived tables (PDTs) are materialized in the database and refreshed on a schedule. PDTs improve performance for expensive queries but require database storage and refresh time.

Q3: How does Looker integrate with BigQuery?

Answer: Looker connects directly to BigQuery using the BigQuery SQL dialect. It supports native BigQuery features like partitioning, clustering, and nested/repeated fields. Looker can create materialized views and aggregate tables for performance optimization.

Q4: What is embedded analytics in Looker?

Answer: Embedded analytics allows you to integrate Looker dashboards and reports into your own applications. Users interact with Looker content without leaving your application. It supports single sign-on, custom styling, and parameter passing for personalized experiences.

Q5: How do you optimize Looker performance?

Answer: 1) Create aggregate tables for common queries, 2) Use persistent derived tables for expensive computations, 3) Implement caching with datagroups, 4) Optimize LookML joins and explores, 5) Use BigQuery materialized views, 6) Limit dashboard tile complexity.

Looker: Data Modeling, Dashboards & Embedded Analytics