
Chain-of-Thought Reasoning


Chain-of-Thought Reasoning: Making LLMs Think Step by Step

Chain-of-thought prompting enables LLMs to solve complex problems by decomposing them into intermediate reasoning steps. This guide covers CoT variants, self-consistency, tree-of-thought methods, and practical applications.

  • Zero-Shot CoT: "Let's think step by step" unlocks reasoning capabilities
  • Self-Consistency: Sample multiple reasoning paths and majority-vote the answer
  • Tree-of-Thought: Explore branching reasoning strategies for harder problems

Thinking step by step is not just for humans anymore.

Chain-of-Thought Reasoning

Chain-of-thought (CoT) prompting enables LLMs to solve complex problems by decomposing them into intermediate reasoning steps. This tutorial covers CoT variants, theoretical foundations, and practical applications.

Definition: Chain-of-Thought (CoT) Prompting

Chain-of-thought prompting elicits multi-step reasoning from language models by providing or requesting intermediate reasoning steps before the final answer. This technique can substantially improve performance on tasks that require logical, arithmetic, or commonsense reasoning.

CoT Variants

Zero-Shot CoT

Simply append "Let's think step by step" to the prompt:

```python
prompt = """Q: A juggler can juggle 16 balls. Half of the balls are golf balls, and half of the golf balls are blue. How many blue golf balls are there?
A: Let's think step by step."""

# A typical model completion for this prompt:
#   1. Total balls = 16
#   2. Golf balls = 16 / 2 = 8
#   3. Blue golf balls = 8 / 2 = 4
#   The answer is 4.
```
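The same pattern can be wrapped in a small helper so that every question gets the trigger phrase in a consistent position. This is a minimal sketch; the function name `build_zero_shot_cot_prompt` and the sample question are illustrative assumptions, not part of any library.

```python
def build_zero_shot_cot_prompt(
    question: str,
    trigger: str = "Let's think step by step.",
) -> str:
    """Wrap a question in the Q/A format and append the reasoning trigger."""
    return f"Q: {question.strip()}\nA: {trigger}"


# Hypothetical question used only to show the resulting prompt string.
zero_shot_prompt = build_zero_shot_cot_prompt(
    "If I have 3 apples and eat 1, how many are left?"
)
print(zero_shot_prompt)
```

Keeping the trigger as a parameter makes it easy to compare alternative phrasings on the same set of questions.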

Few-Shot CoT

Provide reasoning examples in the prompt:

```python
prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of 3. How many does he have now?
A: Roger started with 5 balls. 2 cans of 3 is 6 balls. 5 + 6 = 11. The answer is 11.

Q: The school cafeteria ordered 42 apples for the lunches. They used 6 for Monday's lunches. Then they bought 4 more cases of 3 apples each. How many apples do they have now?
A: The cafeteria started with 42 apples. They used 6, leaving 42 - 6 = 36. They bought 4 cases of 3 = 12 apples. 36 + 12 = 48. The answer is 48.

Q: {question}
A: Let's think step by step."""
```
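When the demonstrations live in a list rather than a hard-coded string, the prompt can be assembled programmatically. The sketch below assumes demonstrations are stored as (question, worked answer) pairs; the helper name `build_few_shot_cot_prompt` and the final question are hypothetical.

```python
from typing import List, Tuple


def build_few_shot_cot_prompt(examples: List[Tuple[str, str]], question: str) -> str:
    """Join (question, worked answer) pairs, then append the new question."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in examples]
    # Leave the last answer open so the model continues with its own reasoning.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


examples = [
    (
        "Roger has 5 tennis balls. He buys 2 more cans of 3. How many does he have now?",
        "Roger started with 5 balls. 2 cans of 3 is 6 balls. 5 + 6 = 11. The answer is 11.",
    ),
]
few_shot_prompt = build_few_shot_cot_prompt(
    examples, "A baker has 12 eggs and uses 5. How many are left?"
)
print(few_shot_prompt)
```

Because the demonstrations already show the reasoning format, the trailing trigger phrase is optional here; the sketch leaves it out.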

Self-Consistency

Generate multiple reasoning paths and select the most common answer:

Self-Consistency Voting

$$\hat{y} = \arg\max_y \sum_{i=1}^{n} \mathbb{1}[y_i = y]$$

Here,

  • $\hat{y}$ = Final answer (majority vote)
  • $n$ = Number of sampled reasoning paths
  • $y_i$ = Answer from reasoning path $i$
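The vote itself takes only a few lines. The sketch below assumes the final answer has already been extracted from each sampled path as a string; `majority_vote` and the sample answers are illustrative and do not come from a real model run.

```python
from collections import Counter
from typing import List, Tuple


def majority_vote(answers: List[str]) -> Tuple[str, float]:
    """Return the most frequent answer and the share of samples that agree with it."""
    counts = Counter(answers)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)


# Made-up answers from n = 5 sampled reasoning paths.
sampled_answers = ["270", "270", "260", "270", "280"]
winner, share = majority_vote(sampled_answers)
print(winner, share)
```

The agreement share is a rough confidence signal: a low share suggests the sampled paths disagree and the answer deserves a second look.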

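Self-Consistency Probability

$$P(y \mid x) = \sum_{r \in \mathcal{R}_y} P(r \mid x) \cdot P(y \mid r, x)$$

Here,

  • $\mathcal{R}_y$ = Set of reasoning paths leading to answer $y$
  • $P(r \mid x)$ = Probability of reasoning path $r$ given input $x$
  • $P(y \mid r, x)$ = Probability of answer $y$ given reasoning path $r$ and input $x$

The majority-vote answer $\hat{y}$ is the $y$ with the largest $P(y \mid x)$.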
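A toy calculation shows why summing over paths can pick a different answer than the single most probable path. The sketch assumes each path ends in exactly one answer (so the answer probability given the path is 1) and uses invented path probabilities; `marginalize_answers` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def marginalize_answers(paths: List[Tuple[str, float]]) -> Dict[str, float]:
    """Sum path probabilities per answer and normalize over the given paths."""
    totals: Dict[str, float] = defaultdict(float)
    for answer, path_prob in paths:
        totals[answer] += path_prob
    norm = sum(totals.values())
    return {answer: prob / norm for answer, prob in totals.items()}


# (answer reached, assumed probability of the path); values are made up.
paths = [("270", 0.30), ("270", 0.25), ("260", 0.35), ("280", 0.10)]
answer_dist = marginalize_answers(paths)
best_answer = max(answer_dist, key=answer_dist.get)
print(best_answer, answer_dist)
```

The most probable single path ends in "260", but the two paths that reach "270" together carry more probability mass, so "270" wins after marginalization.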

Tree-of-Thought (ToT)

Definition: Tree-of-Thought

Tree-of-Thought explores multiple reasoning branches at each step, evaluates them, and prunes unpromising paths. It uses a search algorithm (BFS or DFS) to find the best reasoning trajectory.

ToT State Evaluation

$$V(s) = \text{LLM}(\text{``Evaluate if state } s \text{ is promising''})$$

Here,

  • $s$ = Current reasoning state
  • $V(s)$ = Value/quality score of the state
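The expand, evaluate, and prune loop can be sketched as a breadth-first search with a fixed beam width. In a real system both `expand` and `evaluate` would be LLM calls; here they are plain Python stand-ins on a toy number puzzle (reach 14 from 1 using "+3" or "*2"), and all function names are illustrative assumptions.

```python
from typing import Callable, List, Optional, Tuple

State = Tuple[int, ...]  # the sequence of intermediate values so far


def tree_of_thought_bfs(
    root: State,
    expand: Callable[[State], List[State]],
    evaluate: Callable[[State], float],
    is_solution: Callable[[State], bool],
    beam_width: int = 2,
    max_depth: int = 4,
) -> Optional[State]:
    """Breadth-first search that keeps only the best `beam_width` states per level."""
    frontier = [root]
    for _ in range(max_depth):
        candidates = [child for state in frontier for child in expand(state)]
        for candidate in candidates:
            if is_solution(candidate):
                return candidate
        # Prune: keep the states the evaluator scores highest.
        frontier = sorted(candidates, key=evaluate, reverse=True)[:beam_width]
        if not frontier:
            break
    return None


TARGET = 14


def expand(state: State) -> List[State]:
    """Stand-in for 'propose next thoughts': apply +3 or *2 to the last value."""
    last = state[-1]
    return [state + (last + 3,), state + (last * 2,)]


def evaluate(state: State) -> float:
    """Stand-in for the LLM value function V(s): closer to the target is better."""
    return -abs(TARGET - state[-1])


solution = tree_of_thought_bfs((1,), expand, evaluate, lambda s: s[-1] == TARGET)
print(solution)  # (1, 4, 7, 14)
```

Swapping the sort-and-truncate step for a stack-based traversal would give the DFS variant; the beam width controls how many branches survive each level.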

When CoT Helps

Tasks Where CoT Excels

| Task Type | Example | CoT Improvement |
|---|---|---|
| Arithmetic | Multi-step math | +30-40% |
| Logic | Syllogisms | +20-30% |
| Commonsense | Physical reasoning | +15-25% |
| Symbolic | Variable tracking | +25-35% |
| Multi-hop QA | Reading comprehension | +10-20% |

Tasks Where CoT Hurts

  • Simple factual recall: CoT adds unnecessary complexity
  • Classification: CoT can lead to overthinking
  • Creative writing: CoT constrains creativity
  • Translation: CoT is not helpful for direct mapping

CoT is most effective when the problem requires multiple reasoning steps that cannot be easily compressed into a single inference. If the answer can be retrieved from memory, CoT may actually hurt performance.

Mathematical Foundation

CoT as Search

CoT can be viewed as a search problem in the space of reasoning sequences:

CoT Search Space

$$\mathcal{R} = \{(r_1, r_2, \ldots, r_k) : P(r_{t+1} \mid r_1, \ldots, r_t, x) > \tau \text{ for all } 0 \le t < k\}$$

Here,

  • $\mathcal{R}$ = Space of valid reasoning sequences
  • $r_t$ = Reasoning step at position $t$
  • $\tau$ = Probability threshold for pruning
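One way to read the threshold condition: a reasoning sequence stays in the search space only if every step clears the threshold. The sketch below applies that filter to invented per-step probabilities; in practice these would come from the model's token log-probabilities, and the names used here are hypothetical.

```python
from typing import Dict, List


def is_valid_sequence(step_probs: List[float], tau: float) -> bool:
    """A sequence is kept only if every step probability exceeds tau."""
    return all(p > tau for p in step_probs)


# Made-up conditional probabilities for each step of three candidate paths.
candidate_paths: Dict[str, List[float]] = {
    "path_a": [0.9, 0.8, 0.7],
    "path_b": [0.9, 0.2, 0.95],  # one weak step disqualifies the whole path
    "path_c": [0.6, 0.55, 0.6],
}
tau = 0.5
kept = [name for name, probs in candidate_paths.items() if is_valid_sequence(probs, tau)]
print(kept)  # ['path_a', 'path_c']
```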

Self-Consistency as Ensemble

Self-consistency can be viewed as an ensemble of different reasoning strategies:

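Ensemble View of Self-Consistency

$$P(y \mid x) \approx \frac{1}{n} \sum_{i=1}^{n} P(y \mid r_i, x), \qquad r_i \sim P(r \mid x)$$

Here,

  • $n$ = Number of sampled reasoning paths
  • $r_i$ = Reasoning path $i$, sampled from $P(r \mid x)$
  • $P(y \mid r_i, x)$ = Answer probability given reasoning path

Because the paths are drawn from $P(r \mid x)$, the path probability is already accounted for by the sampling, so the average does not weight each term by $P(r_i \mid x)$ again.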

Implementation

```python
from collections import Counter

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)


def generate_cot(question, n_samples=5, temperature=0.7):
    prompt = f"Q: {question}\nA: Let's think step by step.\n"
    inputs = tokenizer(prompt, return_tensors="pt")
    prompt_len = inputs["input_ids"].shape[-1]
    answers = []

    for _ in range(n_samples):
        with torch.no_grad():
            output = model.generate(
                **inputs,
                max_new_tokens=512,
                temperature=temperature,
                do_sample=True,
            )
        # Decode only the newly generated tokens, not the prompt
        response = tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)

        # Extract final answer (note: splitting on "." truncates decimals such as 12.5)
        if "The answer is" in response:
            answer = response.split("The answer is")[-1].strip().split(".")[0]
            answers.append(answer)

    # Majority vote; return None if no sample produced a parsable answer
    counts = Counter(answers)
    if not counts:
        return None, {}
    return counts.most_common(1)[0][0], dict(counts)


question = "If a train travels at 60 mph for 2.5 hours, then at 80 mph for 1.5 hours, what is the total distance?"
answer, distribution = generate_cot(question)
print(f"Answer: {answer}")
print(f"Distribution: {distribution}")
```
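Splitting on the first period is brittle: it cuts decimal answers short and keeps trailing units, so "270" and "270 miles" would be counted as different votes. A regex-based extractor is one possible alternative for numeric answers. The sketch below is an assumption about how that could look; `extract_numeric_answer` and its pattern are not part of the code above or of any library.

```python
import re
from typing import Optional

# Matches "The answer is", an optional "$", then a number that may contain
# thousands separators and a decimal part.
ANSWER_PATTERN = re.compile(r"The answer is\s*\$?(-?\d[\d,]*(?:\.\d+)?)")


def extract_numeric_answer(response: str) -> Optional[str]:
    """Return the last numeric answer in the response, or None if there is none."""
    matches = ANSWER_PATTERN.findall(response)
    if not matches:
        return None
    return matches[-1].replace(",", "")


print(extract_numeric_answer("60 * 2.5 = 150. 80 * 1.5 = 120. The answer is 270 miles."))  # 270
print(extract_numeric_answer("The answer is 1,250.5."))  # 1250.5
print(extract_numeric_answer("I am not sure."))  # None
```

Normalizing answers before counting them keeps the majority vote from being split across formatting variants of the same value.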

For more on prompting strategies, see our module on Prompt Engineering.

Practice Exercises

  1. Empirical: Compare zero-shot, few-shot, and CoT prompting on 10 arithmetic problems. Measure accuracy for each.
  2. Self-Consistency: Implement self-consistency with 3, 5, 10, and 20 samples. At what point does accuracy plateau?
  3. ToT: Implement a simple tree-of-thought system for a planning problem. Compare with standard CoT.
  4. Analysis: Identify 5 problems where CoT helps and 5 where it hurts. What pattern emerges?

Key Takeaways:

  • CoT prompting enables multi-step reasoning by decomposing problems
  • Zero-shot CoT ("Let's think step by step") provides easy improvement
  • Self-consistency selects the majority answer from multiple reasoning paths
  • Tree-of-thought explores and prunes reasoning branches
  • CoT excels at arithmetic, logic, commonsense, and multi-hop reasoning
  • CoT can hurt performance on simple recall and classification tasks

Advanced CoT Methods

Auto-CoT

Auto-CoT automatically generates chain-of-thought demonstrations by clustering questions and selecting representative examples from each cluster. This eliminates the need for manual CoT example crafting.
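The cluster-then-select step can be sketched with a tiny k-means over question embeddings. Everything specific here is an assumption made for illustration: the 2-D "embeddings" are made up (a real pipeline would use a sentence encoder), the k-means uses a naive first-k initialization, and the helper names are hypothetical. Generating the reasoning chain for each selected question is left out.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def squared_distance(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: Sequence[Point], k: int, iters: int = 10):
    """Plain k-means with a naive init (first k points); returns clusters and centroids."""
    centroids = list(points[:k])
    clusters: List[List[int]] = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for idx, p in enumerate(points):
            nearest = min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            clusters[nearest].append(idx)
        centroids = [
            tuple(sum(points[i][d] for i in members) / len(members) for d in range(2))
            if members else centroids[c]
            for c, members in enumerate(clusters)
        ]
    return clusters, centroids


def pick_demonstrations(questions: List[str], embeddings: Sequence[Point], k: int) -> List[str]:
    """Pick the question closest to each cluster centroid as its representative."""
    clusters, centroids = kmeans(embeddings, k)
    demos = []
    for members, centroid in zip(clusters, centroids):
        if members:
            rep = min(members, key=lambda i: squared_distance(embeddings[i], centroid))
            demos.append(questions[rep])
    return demos


questions = [
    "What is 12 + 7?",
    "What is 45 - 18?",
    "What is 9 * 6?",
    "All cats are mammals. Tom is a cat. Is Tom a mammal?",
    "No birds are fish. A robin is a bird. Is a robin a fish?",
    "All squares are rectangles. Is every square a rectangle?",
]
# Made-up 2-D embeddings: arithmetic questions near (1, 0), logic questions near (11, 10).
embeddings = [(0.0, 0.0), (2.0, 0.0), (1.0, 0.0), (10.0, 10.0), (12.0, 10.0), (11.0, 10.0)]
demos = pick_demonstrations(questions, embeddings, k=2)
print(demos)
```

Each selected question would then be answered with zero-shot CoT, and the resulting question and reasoning pairs would serve as the few-shot demonstrations.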

Program-of-Thought

Instead of natural language reasoning, generate executable code that solves the problem. The code is executed to produce the answer. This combines the reasoning capabilities of LLMs with the precision of program execution.
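A minimal sketch of the execute-the-program step is shown below. The `generated_code` string is hard-coded to stand in for model output, and `run_program_of_thought` is a hypothetical helper. Removing builtins from the namespace is only a light guard and not a security sandbox; untrusted model output should run in an isolated process or container.

```python
def run_program_of_thought(generated_code: str, result_var: str = "answer"):
    """Execute model-written code and read the result from a named variable."""
    namespace = {"__builtins__": {}}  # no builtins available; NOT a real sandbox
    exec(generated_code, namespace)
    return namespace[result_var]


# Stand-in for code an LLM might write for the train-distance question.
generated_code = """
speed_1, hours_1 = 60, 2.5
speed_2, hours_2 = 80, 1.5
answer = speed_1 * hours_1 + speed_2 * hours_2
"""

result = run_program_of_thought(generated_code)
print(result)  # 270.0
```

The arithmetic is carried out by the Python interpreter rather than by the model, which is where the precision gain comes from.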

Graph-of-Thought

Extends tree-of-thought by allowing reasoning paths to merge and form a graph structure. This enables sharing of intermediate results across different reasoning branches, improving efficiency and solution quality.

Evaluating CoT Quality

When evaluating CoT, assess both the reasoning process and the final answer. Key evaluation criteria include:

  • Logical coherence of the reasoning steps
  • Faithfulness to the provided context or problem
  • Completeness of the reasoning chain
  • Accuracy of the final answer
  • Conciseness of the reasoning process

