
In-Context Learning


In-Context Learning — Teaching LLMs New Tasks Without Training

In-context learning (ICL) is one of the most remarkable emergent capabilities of LLMs: the ability to learn new tasks from examples provided in the prompt. This guide explores what ICL is, why it works, and how to use it effectively.

  • No Training Required: Adapt models to new tasks purely through prompts
  • Bayesian Inference: One leading hypothesis treats ICL as implicit posterior inference over task hypotheses
  • Production Deployment: Balance example count, latency, and cost for real-world use

The most elegant learning happens without changing a single weight.

Definition: In-Context Learning (ICL)

In-context learning is the ability of a language model to learn new tasks from examples provided in the prompt, without any gradient updates to the model parameters. The model adapts its behavior based solely on the input context.

How ICL Works

The Bayesian Inference Hypothesis

ICL as Implicit Bayesian Inference

$$P(y_{\text{test}} \mid x_{\text{test}}, D) = \sum_{\theta} P(y_{\text{test}} \mid x_{\text{test}}, \theta)\, P(\theta \mid D)$$

Here,

  • $D$ = In-context examples $(x_1, y_1), \ldots, (x_n, y_n)$
  • $\theta$ = Implicit task hypothesis
  • $P(\theta \mid D)$ = Posterior over hypotheses given examples

Under this hypothesis, the model behaves as if it performs Bayesian inference over task hypotheses, using the in-context examples to update its beliefs about which task is being performed.
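
A toy numerical version of the posterior-predictive sum may make this concrete. The sketch below assumes just two hypothetical task hypotheses (a sentiment task and an "is it about movies" task) with hand-picked likelihoods. A real model holds no such explicit table, so every name and number here is illustrative only.

```python
from math import prod

# Hypothetical task hypotheses: each maps an input to P(label="Positive" | x, theta).
# All probabilities are made up for illustration.
HYPOTHESES = {
    "sentiment": {"great film": 0.9, "awful film": 0.1, "great match": 0.9},
    "is_about_movies": {"great film": 0.9, "awful film": 0.9, "great match": 0.1},
}
PRIOR = {"sentiment": 0.5, "is_about_movies": 0.5}


def likelihood(theta, x, y):
    """P(y | x, theta) for a binary label y in {"Positive", "Negative"}."""
    p_pos = HYPOTHESES[theta][x]
    return p_pos if y == "Positive" else 1.0 - p_pos


def posterior(examples):
    """P(theta | D) via Bayes' rule over the in-context examples D."""
    unnorm = {
        theta: PRIOR[theta] * prod(likelihood(theta, x, y) for x, y in examples)
        for theta in HYPOTHESES
    }
    z = sum(unnorm.values())
    return {theta: w / z for theta, w in unnorm.items()}


def predict(x_test, examples):
    """P(y_test="Positive" | x_test, D) = sum_theta P(y | x, theta) * P(theta | D)."""
    post = posterior(examples)
    return sum(likelihood(t, x_test, "Positive") * post[t] for t in HYPOTHESES)


demos = [("great film", "Positive"), ("awful film", "Negative")]
post = posterior(demos)
p_positive = predict("great match", demos)
print(post, p_positive)
```

The second demonstration is what separates the two hypotheses: only the sentiment task explains a negative label on a movie review, so the posterior shifts toward it and the prediction for the non-movie input follows.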

The Task Vector Hypothesis

Definition: Task Vectors

Task vectors are directions in activation space that encode the task defined by in-context examples. These vectors emerge from the attention mechanism and guide the model's predictions for new inputs.

The Grokking Hypothesis

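Definition: Grokking

Grokking is the phenomenon where a model generalizes abruptly long after it has overfit its training data. Applied to ICL, the hypothesis is that large models go through a similar transition during pre-training: they stop memorizing surface patterns and acquire a general procedure for learning from examples in the forward pass, which some accounts describe as implicit gradient descent. In that sense the model has "grokked" how to learn from examples.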

Impact of Example Ordering

The order of in-context examples significantly affects performance:

ICL Performance by Ordering

$$\text{Accuracy}(D_{\pi}) \neq \text{Accuracy}(D_{\sigma}) \quad \text{in general, for } \pi \neq \sigma$$

Here,

  • $D_{\pi}$ = Examples ordered by permutation $\pi$
  • $D_{\sigma}$ = Examples ordered by permutation $\sigma$

Ordering Strategies

  1. Random ordering: Baseline, moderate performance
  2. Similarity-based: Most similar examples last (best average)
  3. Label-balanced: Equal representation of each class
  4. Difficulty-based: Easy examples first, hard examples last

For classification tasks, placing the most similar example last often improves performance, likely because models weight the most recent context more heavily. For generation tasks, order tends to matter less.
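
Two of the ordering strategies above can be sketched in a few lines. This is a minimal illustration that assumes similarity scores to the test input are already computed. The function names `order_most_similar_last` and `interleave_by_label` are hypothetical and not part of any library.

```python
from collections import defaultdict
from itertools import zip_longest


def order_most_similar_last(examples, similarities):
    """Sort examples by ascending similarity so the closest one sits next to the query."""
    paired = sorted(zip(similarities, examples), key=lambda pair: pair[0])
    return [example for _, example in paired]


def interleave_by_label(examples):
    """Round-robin over labels so no class dominates a stretch of the prompt."""
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    rounds = zip_longest(*by_label.values())
    return [ex for group in rounds for ex in group if ex is not None]


examples = [
    ("Loved it", "Positive"),
    ("Great acting", "Positive"),
    ("Boring plot", "Negative"),
]
# Assumed similarity of each example to the test input (higher = more similar).
similarities = [0.2, 0.9, 0.5]

ordered = order_most_similar_last(examples, similarities)
balanced = interleave_by_label(examples)
print(ordered)
print(balanced)
```

Because ordering effects vary by model and task, it is worth measuring a few orderings on held-out data before settling on one.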

Impact of Example Selection

Example Selection Score

$$\text{score}(x_i) = \text{sim}(\text{embed}(x_i), \text{embed}(x_{\text{test}}))$$

Here,

  • $x_i$ = Candidate in-context example
  • $x_{\text{test}}$ = Test input
  • $\text{sim}$ = Similarity function (cosine or dot product)
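
As a worked instance of the score, the sketch below computes cosine similarity between a test embedding and three candidates. The vectors are hand-written toy values standing in for the output of a real embedding model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Toy embeddings, hand-written for illustration
x_test = [1.0, 0.0, 1.0]
candidates = {
    "candidate_a": [1.0, 0.0, 1.0],  # same direction as x_test
    "candidate_b": [0.0, 1.0, 0.0],  # orthogonal to x_test
    "candidate_c": [1.0, 1.0, 0.0],  # partially overlapping
}
scores = {name: cosine_similarity(vec, x_test) for name, vec in candidates.items()}
print(scores)
```

Cosine similarity ignores vector length, which is usually what you want when embeddings are not normalized. With unit-normalized embeddings, the dot product gives the same ranking.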

Selection Strategies

  • Random: No selection bias, but may include irrelevant examples
  • Top-k retrieval: Select k most similar examples from a pool
  • Diverse selection: Balance similarity with diversity
  • Label-aware: Ensure balanced label distribution in selected examples
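
The difference between top-k retrieval and diverse selection shows up when the pool contains near-duplicates. The sketch below uses greedy maximal marginal relevance (MMR) as one possible way to balance similarity with diversity. The source does not name a specific method, so MMR, the `trade_off` parameter, and the similarity values are all assumptions.

```python
def top_k(query_sims, k):
    """Indices of the k candidates most similar to the query."""
    ranked = sorted(range(len(query_sims)), key=lambda i: query_sims[i], reverse=True)
    return ranked[:k]


def mmr_select(query_sims, pairwise_sims, k, trade_off=0.5):
    """Greedy maximal-marginal-relevance selection.

    Each step picks the candidate maximizing
    trade_off * sim(candidate, query)
      - (1 - trade_off) * max sim(candidate, already selected).
    """
    selected = []
    remaining = list(range(len(query_sims)))
    while remaining and len(selected) < k:
        def mmr_score(i):
            redundancy = max((pairwise_sims[i][j] for j in selected), default=0.0)
            return trade_off * query_sims[i] - (1 - trade_off) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Assumed similarities: candidates 0 and 1 are near-duplicates of each other.
query_sims = [0.90, 0.88, 0.60]
pairwise_sims = [
    [1.00, 0.95, 0.10],
    [0.95, 1.00, 0.10],
    [0.10, 0.10, 1.00],
]

plain = top_k(query_sims, k=2)
diverse = mmr_select(query_sims, pairwise_sims, k=2)
print(plain, diverse)
```

Top-k spends both slots on the near-duplicate pair, while MMR keeps the best match and then prefers the less redundant candidate.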

ICL vs Fine-tuning

| Aspect | ICL | Fine-tuning |
| --- | --- | --- |
| Data required | 2-32 examples | 100-10,000+ examples |
| Compute | Forward pass only | Gradient updates |
| Task switching | Change prompt | Retrain model |
| Performance | 80-95% of fine-tuning | 100% (baseline) |
| Latency | Higher (longer prompts) | Lower (shorter prompts) |
| Knowledge access | Full pre-trained knowledge | May forget pre-trained knowledge |

ICL-Fine-tuning Tradeoff

$$\text{Use ICL if: } \frac{N_{\text{examples}}}{N_{\text{params}}} < \tau_{\text{icl}}$$

Here,

  • $N_{\text{examples}}$ = Number of labeled examples
  • $N_{\text{params}}$ = Number of model parameters
  • $\tau_{\text{icl}}$ = Threshold (typically 1e-4)
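
Plugging numbers into the rule shows what it implies in practice. The sketch below assumes a 7B-parameter model and the 1e-4 threshold quoted above. The helper name `prefers_icl` is hypothetical.

```python
def prefers_icl(n_examples, n_params, tau_icl=1e-4):
    """Apply the heuristic: use ICL when n_examples / n_params < tau_icl."""
    return n_examples / n_params < tau_icl


n_params = 7_000_000_000      # assumed: a 7B-parameter model
break_even = 1e-4 * n_params  # example count at which the ratio reaches the threshold

small_data = prefers_icl(32, n_params)         # 32 labeled examples
large_data = prefers_icl(1_000_000, n_params)  # one million labeled examples
print(small_data, large_data, break_even)
```

For a model of this size the break-even point sits at roughly 700,000 examples, far more than fit in a prompt, so the ratio is best read as a coarse guide. In practice the context window and per-token cost usually bind first.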

Practical ICL Implementation

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "meta-llama/Llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Few-shot examples
examples = [
    ("This movie was fantastic!", "Positive"),
    ("Terrible waste of time.", "Negative"),
    ("The food was okay.", "Neutral"),
]


def build_icl_prompt(test_input, examples):
    """Format the demonstrations and the test input as one prompt."""
    prompt = "Classify the sentiment of each review.\n\n"
    for text, label in examples:
        # Single-quoted f-strings so the literal double quotes are valid
        prompt += f'Review: "{text}" -> {label}\n'
    # No trailing space: a prompt ending in whitespace can hurt tokenization
    prompt += f'Review: "{test_input}" ->'
    return prompt


test = "Absolutely loved every minute of it!"
prompt = build_icl_prompt(test, examples)

inputs = tokenizer(prompt, return_tensors="pt")
# Greedy decoding; temperature=0.0 is not a valid sampling temperature
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=10, do_sample=False)
# Decode only the newly generated tokens, not the prompt
response = tokenizer.decode(
    output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)
print(f"Prediction: {response.strip()}")  # Expected: Positive
```

For more on prompting techniques, see our module on Prompt Engineering.

Practice Exercises

  1. Empirical: Test ICL on a 3-class classification task with 1, 2, 4, 8, and 16 examples. Plot accuracy vs number of examples.
  2. Ordering: Compare random, similarity-based, and reverse ordering of examples. Which is most robust?
  3. Selection: Implement a retrieval-based example selector using sentence embeddings. Compare with random selection.
  4. Theory: Explain why ICL emerges in autoregressive decoder-only models but is much weaker in encoder-only models like BERT.

Key Takeaways:

  • ICL enables learning from examples without gradient updates
  • One leading hypothesis is that the model performs implicit Bayesian inference over task hypotheses
  • Example ordering and selection significantly affect performance
  • ICL typically reaches 80-95% of fine-tuning performance with roughly 100x fewer examples
  • Retrieving similar examples and placing the most similar one last is generally a strong default
  • ICL and fine-tuning are complementary; use ICL when data is scarce

Theoretical Foundations

ICL as Gradient Descent

Recent research suggests that transformer attention can implicitly perform gradient descent during the forward pass. In this view, an attention layer acts like one gradient step on a linear regression fit to the in-context examples, updating the model's internal representations without any explicit parameter update.
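
The correspondence can be checked numerically in the simplest setting. The sketch below assumes a linear regression loss, weights initialized at zero, and unnormalized linear attention (no softmax). These simplifications are chosen for illustration and are not a claim about what trained transformers actually compute.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def gd_step_prediction(xs, ys, x_query, lr):
    """Predict with the weights after one gradient step from w = 0 on
    L(w) = 1/2 * sum_i (w . x_i - y_i)^2."""
    dim = len(x_query)
    w = [0.0] * dim
    # Gradient of L at w: sum_i (w . x_i - y_i) * x_i
    grad = [sum((dot(w, x) - y) * x[d] for x, y in zip(xs, ys)) for d in range(dim)]
    w = [w_d - lr * g_d for w_d, g_d in zip(w, grad)]
    return dot(w, x_query)


def linear_attention_prediction(xs, ys, x_query, lr):
    """Unnormalized linear attention: keys x_i, values y_i, query x_query."""
    return lr * sum(y * dot(x, x_query) for x, y in zip(xs, ys))


xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = [2.0, -1.0, 1.0]  # generated by the true weights [2, -1]
x_query = [1.0, 2.0]

gd_pred = gd_step_prediction(xs, ys, x_query, lr=0.1)
attn_pred = linear_attention_prediction(xs, ys, x_query, lr=0.1)
print(gd_pred, attn_pred)
```

Both functions return the same number because, starting from zero weights, one gradient step gives w = lr * sum_i y_i x_i, and the prediction w . x_query is exactly an attention-style weighted sum of the labels.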

Mechanistic Interpretability of ICL

Mechanistic interpretability studies have identified specific circuits associated with ICL. The induction head circuit, composed of two attention heads in successive layers, performs pattern completion: it finds an earlier occurrence of the current token and copies the token that followed it. This circuit emerges during training and appears to be an important mechanism for few-shot learning.
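
The behavior attributed to induction heads can be written as a plain lookup rule. The sketch below is a behavioral caricature on token lists, not a model of the attention computation itself. The function name `induction_predict` is hypothetical.

```python
def induction_predict(tokens):
    """Predict the next token by finding the most recent earlier occurrence
    of the current token and copying whatever followed it.

    Returns None when the current token has not appeared before.
    """
    current = tokens[-1]
    # Scan earlier positions from right to left, excluding the final token itself
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None


seen_before = induction_predict(["A", "B", "C", "A"])  # "A" was followed by "B"
never_seen = induction_predict(["A", "B", "C", "D"])   # "D" has no earlier match
print(seen_before, never_seen)
```

In the circuit description, the first head records which token preceded each position, and the second head uses that information to attend to the position right after the earlier match.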

Limitations of ICL

  1. Context window limits: ICL is constrained by the maximum sequence length. Long contexts increase latency and cost.
  2. Distribution shift: ICL performance degrades when test examples differ significantly from the pre-training distribution.
  3. Instability: Small changes in example ordering or selection can cause large performance swings.
  4. Limited complexity: ICL struggles with tasks requiring deep reasoning or memorization of complex patterns.

Advanced ICL Techniques

Retrieval-Augmented ICL

Instead of selecting examples randomly, use a retrieval system to find the most relevant in-context examples for each query. This combines the benefits of RAG with ICL, improving accuracy on diverse inputs.

Learned Prompting

Rather than selecting natural examples, learn continuous prompt embeddings that maximize task performance. This is the basis of prefix tuning and prompt tuning methods, which bridge the gap between ICL and fine-tuning.

Task-Aware ICL

Analyze the task structure and design ICL prompts that explicitly communicate the task type, input format, and output format. This reduces ambiguity and improves consistency across diverse inputs.
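
One way to make the task type, input format, and output format explicit is a fixed prompt template. The sketch below is a minimal example. The field names and wording are assumptions, and the best phrasing will depend on the model.

```python
def build_task_prompt(task_type, input_format, output_format, examples, test_input):
    """Assemble a prompt that states the task, the formats, and the examples."""
    lines = [
        f"Task: {task_type}",
        f"Input format: {input_format}",
        f"Output format: {output_format}",
        "",
    ]
    for text, label in examples:
        lines.append(f"Input: {text}")
        lines.append(f"Output: {label}")
        lines.append("")
    lines.append(f"Input: {test_input}")
    lines.append("Output:")
    return "\n".join(lines)


prompt = build_task_prompt(
    task_type="Sentiment classification",
    input_format="A single product or movie review",
    output_format="Exactly one of: Positive, Negative, Neutral",
    examples=[
        ("This movie was fantastic!", "Positive"),
        ("Terrible waste of time.", "Negative"),
    ],
    test_input="The food was okay.",
)
print(prompt)
```

Stating the allowed outputs up front also makes the response easier to parse and validate downstream.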

The most effective ICL systems combine retrieval (finding relevant examples), ordering (presenting them optimally), and calibration (adjusting for bias). Invest in all three components for best results.
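
The text does not specify a calibration method. One option, in the style of contextual calibration, is to measure the label probabilities the prompt produces for a content-free input such as "N/A" and divide that bias out. The sketch below assumes label probabilities are already available from the model. All numbers are invented for illustration.

```python
def calibrate(label_probs, content_free_probs):
    """Divide out the bias measured on a content-free input, then renormalize."""
    adjusted = {
        label: label_probs[label] / content_free_probs[label]
        for label in label_probs
    }
    total = sum(adjusted.values())
    return {label: value / total for label, value in adjusted.items()}


# Assumed model outputs: this prompt biases the model toward "Positive".
content_free = {"Positive": 0.7, "Negative": 0.3}  # probabilities for an input like "N/A"
raw = {"Positive": 0.6, "Negative": 0.4}           # probabilities for a real test input

calibrated = calibrate(raw, content_free)
prediction = max(calibrated, key=calibrated.get)
print(calibrated, prediction)
```

Here the raw prediction would be "Positive", but the test input scores lower on that label than a content-free input does, so the calibrated prediction flips.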

ICL in Production

When deploying ICL in production, consider:

  • Latency: Longer prompts mean slower inference. Balance example count with speed requirements.
  • Cost: API calls are priced by token count. Fewer, more relevant examples reduce cost.
  • Consistency: Use deterministic example selection for reproducible outputs.
  • Fallback: Have a fine-tuned model as fallback when ICL performance is insufficient.
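
The latency and cost points come down to a token budget. The sketch below trims a fixed, deterministically ordered example list to fit a budget and estimates input cost. The whitespace token counter is a crude stand-in for a real tokenizer, and the price is a hypothetical placeholder.

```python
def count_tokens(text):
    """Crude stand-in for a real tokenizer: whitespace-separated words."""
    return len(text.split())


def fit_examples(examples, test_input, max_prompt_tokens):
    """Keep examples in their given (deterministic) order until the budget is spent."""
    used = count_tokens(test_input)
    kept = []
    for text, label in examples:
        cost = count_tokens(text) + count_tokens(label)
        if used + cost > max_prompt_tokens:
            break
        kept.append((text, label))
        used += cost
    return kept, used


def estimate_cost(prompt_tokens, price_per_1k_tokens):
    """Input-side cost of one request at an assumed per-1k-token price."""
    return prompt_tokens / 1000 * price_per_1k_tokens


examples = [
    ("This movie was fantastic!", "Positive"),
    ("Terrible waste of time.", "Negative"),
    ("The food was okay.", "Neutral"),
]
kept, used = fit_examples(
    examples, "Absolutely loved every minute of it!", max_prompt_tokens=18
)
cost = estimate_cost(used, price_per_1k_tokens=0.5)  # hypothetical price
print(len(kept), used, cost)
```

Ordering the pool by relevance before trimming means the budget cuts the least useful examples first.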
