Recursive CTEs: Hierarchical Queries, Graph Traversal, Scheduling

Interview Question: "Write a recursive CTE to find all ancestors of a node in a hierarchy. How do you prevent infinite loops? What's the difference between depth-first and breadth-first traversal in SQL?" — Asked at Netflix, Spotify, Uber for Staff Engineer roles

ℹ️

Difficulty: Advanced | Companies: Netflix, Spotify, Uber, Airbnb, Stripe | Time: 45-60 minutes

Recursive CTE Anatomy

A recursive CTE has three components:

\text{Recursive CTE} = \text{Anchor Member} \cup \text{Recursive Member} \mid \text{Base Case} \cup \text{Inductive Step}

-- Create organizational hierarchy
CREATE TABLE employees (
    emp_id INT PRIMARY KEY,
    name VARCHAR(100),
    manager_id INT,
    title VARCHAR(100),
    hire_date DATE,
    salary DECIMAL(12,2)
);

INSERT INTO employees VALUES
(1, 'CEO', NULL, 'Chief Executive Officer', '2020-01-01', 250000.00),
(2, 'CTO', 1, 'Chief Technology Officer', '2020-02-01', 220000.00),
(3, 'CFO', 1, 'Chief Financial Officer', '2020-03-01', 200000.00),
(4, 'VP Eng', 2, 'VP of Engineering', '2020-04-01', 180000.00),
(5, 'VP Product', 2, 'VP of Product', '2020-05-01', 175000.00),
(6, 'Director', 4, 'Director of Engineering', '2021-01-01', 150000.00),
(7, 'Manager 1', 6, 'Engineering Manager', '2021-06-01', 130000.00),
(8, 'Manager 2', 6, 'Engineering Manager', '2021-07-01', 125000.00),
(9, 'Sr Dev 1', 7, 'Senior Developer', '2022-01-01', 120000.00),
(10, 'Sr Dev 2', 7, 'Senior Developer', '2022-02-01', 115000.00),
(11, 'Dev 1', 8, 'Developer', '2022-03-01', 100000.00),
(12, 'Dev 2', 8, 'Developer', '2022-04-01', 95000.00),
(13, 'Intern 1', 9, 'Intern', '2023-06-01', 50000.00);

Depth-First Traversal with Path

-- Recursive CTE for complete hierarchy with path
WITH RECURSIVE emp_hierarchy AS (
    -- Anchor: Top-level managers (CEO)
    SELECT 
        emp_id,
        name,
        manager_id,
        title,
        hire_date,
        salary,
        1 AS depth,
        name::TEXT AS path,
        ARRAY[emp_id] AS path_ids,
        name::TEXT AS search_path
    FROM employees
    WHERE manager_id IS NULL
    
    UNION ALL
    
    -- Recursive: Find all subordinates
    SELECT 
        e.emp_id,
        e.name,
        e.manager_id,
        e.title,
        e.hire_date,
        e.salary,
        h.depth + 1,
        h.path || ' → ' || e.name,
        h.path_ids || e.emp_id,
        h.search_path || '/' || e.name
    FROM employees e
    INNER JOIN emp_hierarchy h ON e.manager_id = h.emp_id
    WHERE h.depth < 10  -- Prevent infinite recursion
)
SELECT 
    emp_id,
    name,
    title,
    depth,
    path,
    salary,
    -- Calculate path length
    array_length(path_ids, 1) - 1 AS hierarchy_level,
    -- Check if leaf node
    CASE 
        WHEN NOT EXISTS (
            SELECT 1 FROM employees WHERE manager_id = emp_hierarchy.emp_id
        ) THEN 'LEAF'
        ELSE 'BRANCH'
    END AS node_type
FROM emp_hierarchy
ORDER BY path_ids;

Output:

emp_id	name	title	depth	path	salary	hierarchy_level	node_type
1	CEO	Chief Executive Officer	1	CEO	250000.00	0	BRANCH
2	CTO	Chief Technology Officer	2	CEO → CTO	220000.00	1	BRANCH
3	CFO	Chief Financial Officer	2	CEO → CFO	200000.00	1	LEAF
4	VP Eng	VP of Engineering	3	CEO → CTO → VP Eng	180000.00	2	BRANCH
5	VP Product	VP of Product	3	CEO → CTO → VP Product	175000.00	2	LEAF
6	Director	Director of Engineering	4	CEO → CTO → VP Eng → Director	150000.00	3	BRANCH
7	Manager 1	Engineering Manager	5	CEO → CTO → VP Eng → Director → Manager 1	130000.00	4	BRANCH
8	Manager 2	Engineering Manager	5	CEO → CTO → VP Eng → Director → Manager 2	125000.00	4	BRANCH
9	Sr Dev 1	Senior Developer	6	CEO → CTO → VP Eng → Director → Manager 1 → Sr Dev 1	120000.00	5	BRANCH
10	Sr Dev 2	Senior Developer	6	CEO → CTO → VP Eng → Director → Manager 1 → Sr Dev 2	115000.00	5	LEAF
11	Dev 1	Developer	6	CEO → CTO → VP Eng → Director → Manager 2 → Dev 1	100000.00	5	LEAF
12	Dev 2	Developer	6	CEO → CTO → VP Eng → Director → Manager 2 → Dev 2	95000.00	5	LEAF
13	Intern 1	Intern	7	CEO → CTO → VP Eng → Director → Manager 1 → Sr Dev 1 → Intern 1	50000.00	6	LEAF

⚠️

Infinite Recursion Prevention: Always include a depth limit or cycle detection. Without it, circular references cause infinite loops.

Breadth-First Search Pattern

-- BFS traversal with level-by-level processing
WITH RECURSIVE bfs_traversal AS (
    SELECT 
        emp_id,
        name,
        manager_id,
        1 AS level,
        ARRAY[emp_id] AS visited,
        name::TEXT AS level_path
    FROM employees
    WHERE manager_id IS NULL
    
    UNION ALL
    
    SELECT 
        e.emp_id,
        e.name,
        e.manager_id,
        b.level + 1,
        b.visited || e.emp_id,
        b.level_path || ' → ' || e.name
    FROM employees e
    INNER JOIN bfs_traversal b ON e.manager_id = b.emp_id
    WHERE e.emp_id != ALL(b.visited)  -- Cycle detection
)
SELECT 
    level,
    COUNT(*) AS nodes_at_level,
    STRING_AGG(name, ', ' ORDER BY name) AS members,
    MAX(salary) AS max_salary,
    ROUND(AVG(salary), 2) AS avg_salary
FROM bfs_traversal
JOIN employees USING (emp_id)
GROUP BY level
ORDER BY level;

Output:

level	nodes_at_level	members	max_salary	avg_salary
1	1	CEO	250000.00	250000.00
2	2	CFO, CTO	220000.00	210000.00
3	2	VP Eng, VP Product	180000.00	177500.00
4	1	Director	150000.00	150000.00
5	2	Manager 1, Manager 2	130000.00	127500.00
6	4	Dev 1, Dev 2, Sr Dev 1, Sr Dev 2	120000.00	107500.00
7	1	Intern 1	50000.00	50000.00

Graph Traversal: Shortest Path

-- Create graph table for network connections
CREATE TABLE network_connections (
    source_node INT,
    target_node INT,
    connection_cost DECIMAL(10,2),
    connection_type VARCHAR(50)
);

INSERT INTO network_connections VALUES
(1, 2, 10.00, 'direct'),
(1, 3, 15.00, 'direct'),
(2, 3, 5.00, 'indirect'),
(2, 4, 20.00, 'direct'),
(3, 4, 8.00, 'direct'),
(3, 5, 12.00, 'direct'),
(4, 5, 3.00, 'direct'),
(4, 6, 18.00, 'direct'),
(5, 6, 7.00, 'direct');

-- Find shortest path using recursive CTE
WITH RECURSIVE shortest_path AS (
    -- Start from source node 1
    SELECT 
        source_node,
        target_node,
        connection_cost AS total_cost,
        1 AS hops,
        ARRAY[source_node, target_node] AS path,
        source_node || ' → ' || target_node::TEXT AS path_text
    FROM network_connections
    WHERE source_node = 1
    
    UNION ALL
    
    -- Extend path by one hop
    SELECT 
        sp.source_node,
        nc.target_node,
        sp.total_cost + nc.connection_cost,
        sp.hops + 1,
        sp.path || nc.target_node,
        sp.path_text || ' → ' || nc.target_node
    FROM shortest_path sp
    INNER JOIN network_connections nc 
        ON sp.target_node = nc.source_node
    WHERE nc.target_node != ALL(sp.path)  -- No cycles
        AND sp.hops < 10  -- Max depth
)
SELECT DISTINCT ON (source_node, target_node)
    source_node,
    target_node,
    total_cost,
    hops,
    path,
    path_text
FROM shortest_path
WHERE target_node = 6  -- Destination
ORDER BY source_node, target_node, total_cost;

Output:

source_node	target_node	total_cost	hops	path	path_text
1	6	30.00	3	{1,2,4,6}	1 → 2 → 4 → 6
1	6	31.00	3	{1,3,4,6}	1 → 3 → 4 → 6
1	6	34.00	4	{1,2,3,4,6}	1 → 2 → 3 → 4 → 6
1	6	37.00	4	{1,2,4,5,6}	1 → 2 → 4 → 5 → 6

Scheduling with Recursive CTE

-- Create project tasks with dependencies
CREATE TABLE project_tasks (
    task_id INT PRIMARY KEY,
    task_name VARCHAR(100),
    duration_days INT,
    depends_on INT,
    priority INT
);

INSERT INTO project_tasks VALUES
(1, 'Requirements', 5, NULL, 1),
(2, 'Design', 10, 1, 2),
(3, 'Database Schema', 7, 2, 3),
(4, 'API Development', 15, 2, 3),
(5, 'Frontend Dev', 12, 2, 3),
(6, 'Integration', 8, 3, 4),
(7, 'Testing', 10, 4, 4),
(8, 'Deployment', 3, 5, 5),
(9, 'Documentation', 5, 4, 5),
(10, 'Go Live', 1, 6, 6);

-- Calculate critical path using recursive CTE
WITH RECURSIVE task_schedule AS (
    -- Anchor: Tasks with no dependencies
    SELECT 
        task_id,
        task_name,
        duration_days,
        depends_on,
        priority,
        0 AS start_day,
        duration_days AS end_day,
        ARRAY[task_id] AS dependency_chain,
        task_id::TEXT AS chain_text,
        0 AS max_chain_length
    FROM project_tasks
    WHERE depends_on IS NULL
    
    UNION ALL
    
    -- Recursive: Calculate start/end based on dependencies
    SELECT 
        t.task_id,
        t.task_name,
        t.duration_days,
        t.depends_on,
        t.priority,
        ts.end_day AS start_day,
        ts.end_day + t.duration_days AS end_day,
        ts.dependency_chain || t.task_id,
        ts.chain_text || ' → ' || t.task_name,
        array_length(ts.dependency_chain, 1)
    FROM project_tasks t
    INNER JOIN task_schedule ts ON t.depends_on = ts.task_id
    WHERE t.task_id != ALL(ts.dependency_chain)  -- No cycles
)
SELECT 
    task_id,
    task_name,
    duration_days,
    start_day,
    end_day,
    end_day - start_day AS actual_duration,
    dependency_chain,
    -- Calculate total project duration
    MAX(end_day) OVER () AS total_project_days,
    -- Identify critical path tasks
    CASE 
        WHEN end_day = MAX(end_day) OVER () THEN 'CRITICAL'
        WHEN end_day >= MAX(end_day) OVER () - 5 THEN 'NEAR CRITICAL'
        ELSE 'NORMAL'
    END AS status
FROM task_schedule
ORDER BY start_day, task_id;

Output:

task_id	task_name	duration_days	start_day	end_day	actual_duration	dependency_chain	total_project_days	status
1	Requirements	5	0	5	5	{1}	53	CRITICAL
2	Design	10	5	15	10	{1,2}	53	CRITICAL
3	Database Schema	7	15	22	7	{1,2,3}	53	CRITICAL
4	API Development	15	15	30	15	{1,2,4}	53	CRITICAL
5	Frontend Dev	12	15	27	12	{1,2,5}	53	NORMAL
6	Integration	8	30	38	8	{1,2,4,6}	53	CRITICAL
7	Testing	10	30	40	10	{1,2,4,7}	53	NORMAL
8	Deployment	3	27	30	3	{1,2,5,8}	53	NORMAL
9	Documentation	5	30	35	5	{1,2,4,9}	53	NORMAL
10	Go Live	1	38	39	1	{1,2,4,6,10}	53	CRITICAL

ℹ️

Critical Path: The longest path through the project network determines minimum project duration. Tasks on this path have zero slack.

Cycle Detection Pattern

-- Detect cycles in graph
WITH RECURSIVE cycle_detection AS (
    SELECT 
        source_node,
        target_node,
        1 AS depth,
        ARRAY[source_node, target_node] AS path,
        FALSE AS has_cycle
    FROM network_connections
    
    UNION ALL
    
    SELECT 
        cd.source_node,
        nc.target_node,
        cd.depth + 1,
        cd.path || nc.target_node,
        nc.target_node = ANY(cd.path)  -- Cycle detected!
    FROM cycle_detection cd
    INNER JOIN network_connections nc ON cd.target_node = nc.source_node
    WHERE NOT cd.has_cycle
        AND cd.depth < 100  -- Safety limit
)
SELECT 
    source_node,
    target_node,
    depth,
    path,
    has_cycle
FROM cycle_detection
WHERE has_cycle = TRUE;

Mathematical Formulation

For a directed graph $G = (V, E)$ , the shortest path problem seeks:

\min_{p \in P(s,t)} \sum_{(u,v) \in p} w(u,v)

Where $P(s,t)$ is the set of all paths from $s$ to $t$ and $w(u,v)$ is the edge weight.

The recursive CTE implements a modified Dijkstra's algorithm:

d(v) = \min_{(u,v) \in E} \{d(u) + w(u,v)\}

Time complexity: $O(V \cdot E)$ in worst case with recursive CTE.

⚠️

Performance Warning: Recursive CTEs can be expensive for deep hierarchies. Consider materializing results or using iterative approaches for production systems.