
Chain-of-Thought Reasoning — Making LLMs Think Step by Step

Chain-of-thought prompting enables LLMs to solve complex problems by decomposing them into intermediate reasoning steps. This guide covers CoT variants, self-consistency, tree-of-thought methods, and practical applications.

Zero-Shot CoT — "Let's think step by step" unlocks reasoning capabilities
Self-Consistency — Sample multiple reasoning paths and majority-vote the answer
Tree-of-Thought — Explore branching reasoning strategies for harder problems

Thinking step by step is not just for humans anymore.

Chain-of-Thought Reasoning

Chain-of-thought (CoT) prompting enables LLMs to solve complex problems by decomposing them into intermediate reasoning steps. This tutorial covers CoT variants, theoretical foundations, and practical applications.

Definition: Chain-of-Thought (CoT) Prompting

Chain-of-thought prompting elicits multi-step reasoning from language models by providing or requesting intermediate reasoning steps before the final answer. This technique can substantially improve performance on tasks that require logical, arithmetic, or commonsense reasoning.

CoT Variants

Zero-Shot CoT

Simply append "Let's think step by step" to the prompt:

```python
prompt = """Q: A juggler can juggle 16 balls. Half are golf balls, and half the golf balls are blue. How many blue golf balls?
A: Let's think step by step."""

# Illustrative model completion:
# Total balls = 16
# Golf balls = 16 / 2 = 8
# Blue golf balls = 8 / 2 = 4
# The answer is 4.
```

Few-Shot CoT

Provide reasoning examples in the prompt:

```python
# Fill the placeholder with prompt.format(question=...)
prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of 3. How many does he have now?
A: Roger started with 5 balls. 2 cans of 3 is 6 balls. 5 + 6 = 11. The answer is 11.

Q: The school cafeteria ordered 42 apples for the lunches. They used 6 for Monday's lunches. Then they bought 4 more cases of 3 apples each. How many apples do they have now?
A: The cafeteria started with 42 apples. They used 6, leaving 42 - 6 = 36. They bought 4 cases of 3 = 12 apples. 36 + 12 = 48. The answer is 48.

Q: {question}
A: Let's think step by step."""
```
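A few-shot prompt like this can also be assembled from a list of worked examples. The following is a minimal sketch; the helper name `build_few_shot_cot_prompt` and the new question about eggs are made up for illustration.

```python
from typing import List, Tuple


def build_few_shot_cot_prompt(examples: List[Tuple[str, str]], question: str) -> str:
    """Join (question, worked answer) pairs and append the new question."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in examples]
    # The final block leaves the answer open and adds the CoT trigger phrase.
    blocks.append(f"Q: {question}\nA: Let's think step by step.")
    return "\n\n".join(blocks)


demonstrations = [
    (
        "Roger has 5 tennis balls. He buys 2 more cans of 3. How many does he have now?",
        "Roger started with 5 balls. 2 cans of 3 is 6 balls. 5 + 6 = 11. The answer is 11.",
    ),
]

few_shot_prompt = build_few_shot_cot_prompt(
    demonstrations, "A baker has 12 eggs and uses 5. How many are left?"
)
print(few_shot_prompt)
```

Keeping the demonstrations in a list makes it easy to swap examples in and out when comparing prompt variants.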

Self-Consistency

Generate multiple reasoning paths and select the most common answer:

Self-Consistency Voting

$$\hat{y} = \arg\max_y \sum_{i=1}^{n} \mathbb{1}[y_i = y]$$

Here,

$\hat{y}$ = Final answer (majority vote)
$n$ = Number of sampled reasoning paths
$y_i$ = Answer from reasoning path $i$
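As a minimal sketch of the vote, the arg max over answer counts can be written with `collections.Counter`. The function name `majority_vote` and the sampled answers are made up for illustration.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer and the full vote counts."""
    counts = Counter(answers)
    winner, _ = counts.most_common(1)[0]
    return winner, dict(counts)


# Hypothetical final answers extracted from five sampled reasoning paths.
sampled_answers = ["4", "4", "8", "4", "2"]
winner, counts = majority_vote(sampled_answers)
print(winner, counts)  # 4 {'4': 3, '8': 1, '2': 1}
```

On a tie, `most_common` returns the answer that was encountered first, so a real pipeline may want an explicit tie-breaking rule.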

Self-Consistency Probability

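$$P(y \mid x) = \sum_{r \in \mathcal{R}_y} P(r \mid x) \cdot P(y \mid r, x)$$

Here,

$\mathcal{R}_y$ = Set of reasoning paths leading to answer $y$
$P(r \mid x)$ = Probability of reasoning path $r$ given input $x$
$P(y \mid r, x)$ = Probability of answer $y$ given reasoning path $r$ and input $x$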
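A small worked example may help here. The sketch below assumes each reasoning path ends on exactly one answer (so the answer probability given the path is 1) and uses made-up path probabilities; it then sums the path probabilities per answer.

```python
from collections import defaultdict

# Hypothetical reasoning paths: (P(r | x), answer the path ends on).
paths = [(0.40, "11"), (0.25, "11"), (0.20, "9"), (0.15, "13")]


def answer_distribution(paths):
    """Sum path probabilities over all paths that lead to the same answer."""
    totals = defaultdict(float)
    for path_prob, answer in paths:
        # Assumption: P(y | r, x) = 1 for the answer the path ends on.
        totals[answer] += path_prob
    return dict(totals)


dist = answer_distribution(paths)
best = max(dist, key=dist.get)
print(best, dist)
```

Here two different paths reach "11", so its total probability (0.65) exceeds that of any single path.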

Tree-of-Thought (ToT)

Definition: Tree-of-Thought

Tree-of-Thought explores multiple reasoning branches at each step, evaluates them, and prunes unpromising paths. It uses a search algorithm (BFS or DFS) to find the best reasoning trajectory.

ToT State Evaluation

$$V(s) = \text{LLM}(\text{"Evaluate if state } s \text{ is promising"})$$

Here,

$s$ = Current reasoning state
$V(s)$ = Value/quality score of the state
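The search loop can be sketched as a breadth-first search with a beam. In the sketch below, `propose`, `evaluate`, and `is_goal` are plain Python stand-ins for what would be LLM calls in a real system, and the toy task (reach 10 from 1 using "+3" or "*2") is made up for illustration.

```python
def tree_of_thought_bfs(root, propose, evaluate, is_goal, beam_width=3, max_depth=5):
    """Expand states level by level, keeping only the best-scoring ones."""
    frontier = [root]
    for _ in range(max_depth):
        # Branch: generate candidate next states from every kept state.
        candidates = [child for state in frontier for child in propose(state)]
        if not candidates:
            return None
        for candidate in candidates:
            if is_goal(candidate):
                return candidate
        # Evaluate and prune: keep the beam_width most promising states.
        candidates.sort(key=evaluate, reverse=True)
        frontier = candidates[:beam_width]
    return None


TARGET = 10


def propose(state):
    """Stand-in for an LLM proposing next steps: apply +3 or *2."""
    last = state[-1]
    return [state + (last + 3,), state + (last * 2,)]


def evaluate(state):
    """Stand-in for V(s): states closer to the target score higher."""
    return -abs(TARGET - state[-1])


def is_goal(state):
    return state[-1] == TARGET


solution = tree_of_thought_bfs((1,), propose, evaluate, is_goal)
print(solution)  # (1, 4, 7, 10)
```

A depth-first variant would follow the single best child first and backtrack when the score drops below a threshold.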

When CoT Helps

Tasks Where CoT Excels

| Task Type | Example | CoT Improvement |
|---|---|---|
| Arithmetic | Multi-step math | +30-40% |
| Logic | Syllogisms | +20-30% |
| Commonsense | Physical reasoning | +15-25% |
| Symbolic | Variable tracking | +25-35% |
| Multi-hop QA | Reading comprehension | +10-20% |

Tasks Where CoT Hurts

Simple factual recall: CoT adds unnecessary complexity
Classification: CoT can lead to overthinking
Creative writing: CoT constrains creativity
Translation: CoT is not helpful for direct mapping

CoT is most effective when the problem requires multiple reasoning steps that cannot be easily compressed into a single inference. If the answer can be retrieved from memory, CoT may actually hurt performance.

Mathematical Foundation

CoT as Search

CoT can be viewed as a search problem in the space of reasoning sequences:

CoT Search Space

$$\mathcal{R} = \{(r_1, r_2, \ldots, r_k) : P(r_{t+1} \mid r_1, \ldots, r_t, x) > \tau \text{ for all } 0 \le t < k\}$$

Here,

$\mathcal{R}$ = Space of valid reasoning sequences
$r_t$ = Reasoning step at position $t$
$\tau$ = Probability threshold for pruning
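To make the threshold concrete, the sketch below enumerates the sequences that survive pruning in a tiny hand-made step graph. The step names and probabilities are invented, and each step depends only on the previous one, which is a simplification of conditioning on the full prefix and the input.

```python
# Hypothetical next-step probabilities, keyed by the current step.
STEP_PROBS = {
    "start": {"parse": 0.9, "guess": 0.1},
    "parse": {"compute": 0.8, "guess": 0.2},
    "compute": {"answer": 0.95, "guess": 0.05},
    "guess": {"answer": 1.0},
}


def valid_sequences(tau, state="start", prefix=()):
    """Return every step sequence whose steps all have probability > tau."""
    if state == "answer":
        return [prefix]
    sequences = []
    for next_step, prob in STEP_PROBS[state].items():
        if prob > tau:  # prune low-probability steps
            sequences.extend(valid_sequences(tau, next_step, prefix + (next_step,)))
    return sequences


kept = valid_sequences(tau=0.3)
print(kept)  # [('parse', 'compute', 'answer')]
```

Lowering the threshold admits more sequences: with `tau=0.0`, all four routes through this graph are kept.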

Self-Consistency as Ensemble

Self-consistency can be viewed as an ensemble of different reasoning strategies:

Ensemble View of Self-Consistency

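$$P(y \mid x) \approx \frac{1}{n} \sum_{i=1}^{n} P(y \mid r_i, x), \quad r_i \sim P(r \mid x)$$

Here,

$n$ = Number of sampled reasoning paths
$r_i$ = Reasoning path $i$, sampled from $P(r \mid x)$
$P(y \mid r_i, x)$ = Answer probability given reasoning path

Because each path is sampled from $P(r \mid x)$, the path probability is already accounted for by the sampling and does not appear as a separate weight in the average.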
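One way to read the ensemble view is that each sampled path casts a vote, and the vote fractions approximate the answer distribution. The sketch below simulates this with made-up path probabilities and a fixed random seed; the function name `estimate_answer_probs` is hypothetical.

```python
import random

# Hypothetical reasoning paths: (P(r | x), answer the path ends on).
paths = [(0.40, "11"), (0.25, "11"), (0.20, "9"), (0.15, "13")]


def estimate_answer_probs(paths, n, seed=0):
    """Sample n paths according to their probability and count the votes."""
    rng = random.Random(seed)
    weights = [prob for prob, _ in paths]
    sampled = rng.choices(paths, weights=weights, k=n)
    counts = {}
    for _, answer in sampled:
        counts[answer] = counts.get(answer, 0) + 1
    return {answer: count / n for answer, count in counts.items()}


estimate = estimate_answer_probs(paths, n=10_000)
print(estimate)
```

With enough samples, the estimated share for "11" should approach 0.65, the sum of the two path probabilities that lead to it.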

Implementation

```python
from collections import Counter

import torch  # backend required by transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

# Gated model: requires accepting the license on Hugging Face.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")


def generate_cot(question, n_samples=5, temperature=0.7):
    prompt = f"Q: {question}\nA: Let's think step by step.\n"
    inputs = tokenizer(prompt, return_tensors="pt")
    answers = []

    for _ in range(n_samples):
        output = model.generate(
            **inputs,
            max_new_tokens=512,
            temperature=temperature,
            do_sample=True,
        )
        # Decode only the newly generated tokens, not the prompt.
        response = tokenizer.decode(
            output[0][inputs["input_ids"].shape[-1]:],
            skip_special_tokens=True,
        )

        # Extract final answer
        if "The answer is" in response:
            answer = response.split("The answer is")[-1].strip().split(".")[0]
            answers.append(answer)

    # Majority vote (no vote is possible if no path produced an answer)
    if not answers:
        return None, {}
    counts = Counter(answers)
    return counts.most_common(1)[0][0], dict(counts)


question = "If a train travels at 60 mph for 2.5 hours, then at 80 mph for 1.5 hours, what is the total distance?"
answer, distribution = generate_cot(question)
print(f"Answer: {answer}")
print(f"Distribution: {distribution}")
```
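The string split in the implementation above cuts a decimal answer such as 3.5 at the first period. A regex-based extractor is one possible alternative for numeric answers; the pattern and the function name `extract_numeric_answer` below are assumptions for illustration, not part of the original code.

```python
import re

# Matches an optional "$", then a number with optional thousands separators
# and an optional decimal part, directly after "The answer is".
ANSWER_PATTERN = re.compile(r"The answer is\s*\$?(-?\d[\d,]*(?:\.\d+)?)")


def extract_numeric_answer(response):
    """Return the last stated numeric answer as a float, or None."""
    matches = ANSWER_PATTERN.findall(response)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


print(extract_numeric_answer("60 * 2.5 = 150. 80 * 1.5 = 120. The answer is 270 miles."))  # 270.0
print(extract_numeric_answer("Half of 7 is 3.5. The answer is 3.5."))  # 3.5
```

Normalizing answers to numbers before voting also keeps "270" and "270 miles" from being counted as different answers.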

For more on prompting strategies, see our module on Prompt Engineering.

Practice Exercises

Empirical: Compare zero-shot, few-shot, and CoT prompting on 10 arithmetic problems. Measure accuracy for each.
Self-Consistency: Implement self-consistency with 3, 5, 10, and 20 samples. At what point does accuracy plateau?
ToT: Implement a simple tree-of-thought system for a planning problem. Compare with standard CoT.
Analysis: Identify 5 problems where CoT helps and 5 where it hurts. What pattern emerges?

Key Takeaways:

CoT prompting enables multi-step reasoning by decomposing problems
Zero-shot CoT ("Let's think step by step") provides easy improvement
Self-consistency selects the majority answer from multiple reasoning paths
Tree-of-thought explores and prunes reasoning branches
CoT excels at arithmetic, logic, commonsense, and multi-hop reasoning
CoT can hurt performance on simple recall and classification tasks

Advanced CoT Methods

Auto-CoT

Auto-CoT automatically generates chain-of-thought demonstrations by clustering questions and selecting representative examples from each cluster. This eliminates the need for manual CoT example crafting.
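The cluster-then-select step can be sketched in plain Python. Everything specific here is an assumption for illustration: the questions, the 2-D "embeddings" (a real system would compute them with a sentence encoder), and the naive k-means initialization.

```python
import math

# Hypothetical questions paired with made-up 2-D embeddings.
QUESTIONS = [
    ("What is 12 + 7?", (0.9, 0.1)),
    ("All cats are animals. Tom is a cat. Is Tom an animal?", (0.1, 0.9)),
    ("What is 15 * 3?", (1.0, 0.2)),
    ("What is 8 - 5?", (0.8, 0.0)),
    ("No birds are fish. A robin is a bird. Is a robin a fish?", (0.2, 1.0)),
    ("All squares are rectangles. Is every square a rectangle?", (0.0, 0.8)),
]


def kmeans(points, k, iters=10):
    """Plain k-means; naively initializes centroids with the first k points."""
    centroids = list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for idx, point in enumerate(points):
            nearest = min(range(k), key=lambda c: math.dist(point, centroids[c]))
            clusters[nearest].append(idx)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                dims = range(len(points[0]))
                centroids[c] = tuple(
                    sum(points[i][d] for i in members) / len(members) for d in dims
                )
    return centroids, clusters


def select_demonstrations(questions, k):
    """Pick the question closest to each cluster centroid."""
    points = [embedding for _, embedding in questions]
    centroids, clusters = kmeans(points, k)
    chosen = []
    for centroid, members in zip(centroids, clusters):
        if members:
            best = min(members, key=lambda i: math.dist(points[i], centroid))
            chosen.append(questions[best][0])
    return chosen


demos = select_demonstrations(QUESTIONS, k=2)
print(demos)
```

Each selected question would then be answered with zero-shot CoT, and the resulting question and reasoning pair used as a demonstration in the few-shot prompt.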

Program-of-Thought

Instead of natural language reasoning, generate executable code that solves the problem. The code is executed to produce the answer. This combines the reasoning capabilities of LLMs with the precision of program execution.
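A minimal sketch of the execute-the-code step follows. The string `generated_program` is a hand-written stand-in for model output, and the convention that the program stores its result in a variable named `answer` is an assumption.

```python
# Stand-in for code an LLM might generate for the train-distance question.
generated_program = """
leg1 = 60 * 2.5
leg2 = 80 * 1.5
answer = leg1 + leg2
"""


def run_program_of_thought(program):
    """Execute generated code and return the value it assigns to `answer`."""
    # Emptying the builtins limits what the code can call, but this is
    # NOT a real sandbox; untrusted code needs proper isolation.
    namespace = {"__builtins__": {}}
    exec(program, namespace)
    return namespace.get("answer")


result = run_program_of_thought(generated_program)
print(result)  # 270.0
```

The arithmetic is done by the Python interpreter rather than by the model, which is where the precision gain comes from.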

Graph-of-Thought

Extends tree-of-thought by allowing reasoning paths to merge and form a graph structure. This enables sharing of intermediate results across different reasoning branches, improving efficiency and solution quality.
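The merging idea can be illustrated with a toy sorting task: split the input into independent sub-problems, solve each, then merge partial results so that every merged node has two parents. In the sketch below, plain Python calls stand in for what would be LLM calls, and the function name `graph_of_thought_sort` is made up.

```python
import heapq


def graph_of_thought_sort(values, chunk_size=3):
    """Sort by solving chunks independently, then merging results pairwise."""
    # Branch: each chunk is an independent "thought".
    thoughts = [
        sorted(values[i:i + chunk_size]) for i in range(0, len(values), chunk_size)
    ]
    # Aggregate: merge neighbouring thoughts until one node remains.
    while len(thoughts) > 1:
        merged = []
        for i in range(0, len(thoughts), 2):
            pair = thoughts[i:i + 2]  # the last pair may hold a single thought
            merged.append(list(heapq.merge(*pair)))
        thoughts = merged
    return thoughts[0] if thoughts else []


print(graph_of_thought_sort([5, 2, 9, 1, 7, 3, 8]))  # [1, 2, 3, 5, 7, 8, 9]
```

A tree cannot express the merge step, because a tree node has only one parent; the graph structure is what lets partial results be combined.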

Evaluating CoT Quality

When evaluating CoT, assess both the reasoning process and the final answer. Key evaluation criteria include:

Logical coherence of the reasoning steps
Faithfulness to the provided context or problem
Completeness of the reasoning chain
Accuracy of the final answer
Conciseness of the reasoning process
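Some of these criteria can be checked mechanically. As a rough sketch covering only simple `a op b = c` statements, the code below verifies the arithmetic inside a reasoning chain; the regex and the function name `check_arithmetic_steps` are assumptions, and the other criteria still need human or model-based review.

```python
import re

NUMBER = r"-?\d+(?:\.\d+)?"
STEP_PATTERN = re.compile(rf"({NUMBER})\s*([+\-*/])\s*({NUMBER})\s*=\s*({NUMBER})")

OPERATIONS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}


def check_arithmetic_steps(chain):
    """Return (step_text, is_correct) for each 'a op b = c' found in the chain."""
    results = []
    for left, op, right, claimed in STEP_PATTERN.findall(chain):
        a, b, c = float(left), float(right), float(claimed)
        if op == "/" and b == 0:
            correct = False  # division by zero can never be a valid step
        else:
            correct = abs(OPERATIONS[op](a, b) - c) < 1e-9
        results.append((f"{left} {op} {right} = {claimed}", correct))
    return results


chain = "They used 6, leaving 42 - 6 = 36. They bought 4 cases of 3 = 12 apples. 36 + 12 = 48."
print(check_arithmetic_steps(chain))
# [('42 - 6 = 36', True), ('36 + 12 = 48', True)]
```

A chain can have correct arithmetic and still be wrong overall, so this check complements the final-answer comparison rather than replacing it.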

What to Learn Next

-> In-Context Learning: Teaching LLMs new tasks without training, purely through prompts.

-> Prompt Engineering: Getting the most out of language models through effective input design.

-> RAG System Design: Building production-ready retrieval systems for grounded generation.

-> Retrieval-Augmented Generation: Combining LLMs with external knowledge for accurate, cited answers.

-> LLM Agent Frameworks: Building autonomous agents that reason, plan, and act.

-> Building Production LLM Apps: From prototype to production, deploying LLMs at scale.
