In-Context Learning — Teaching LLMs New Tasks Without Training

In-context learning is one of the most remarkable emergent capabilities of LLMs—the ability to learn new tasks from examples provided in the prompt. This guide explores what ICL is, why it works, and how to use it effectively.

No Training Required — Adapt models to new tasks purely through prompts
Bayesian Inference — ICL implicitly performs posterior inference over task hypotheses
Production Deployment — Balance example count, latency, and cost for real-world use

The most elegant learning happens without changing a single weight.

In-Context Learning

Definition: In-Context Learning (ICL)

In-context learning is the ability of a language model to learn new tasks from examples provided in the prompt, without any gradient updates to the model parameters. The model adapts its behavior based solely on the input context.

How ICL Works

The Bayesian Inference Hypothesis

ICL as Implicit Bayesian Inference

P(y_{\text{test}} | x_{\text{test}}, D) = \sum_{\theta} P(y_{\text{test}} | x_{\text{test}}, \theta) P(\theta | D)

Here,

$D$ = in-context examples $(x_1, y_1), \ldots, (x_n, y_n)$
$\theta$ = implicit task hypothesis
$P(\theta | D)$ = posterior over hypotheses given the examples

The model effectively performs Bayesian inference over task hypotheses, using the in-context examples to update its beliefs about which task is being performed.
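The posterior-predictive sum can be worked through by hand on a toy problem. The sketch below is only an illustration of the formula, not a description of what happens inside an LLM: the two task hypotheses (`parity` and `threshold`), the label names, and the 10% label-noise rate are all invented for the example.

```python
# Toy version of P(y_test | x_test, D) = sum_theta P(y_test | x_test, theta) P(theta | D).
# Hypotheses, labels, and the noise rate are illustrative assumptions.
HYPOTHESES = {
    "parity": lambda x: "A" if x % 2 == 0 else "B",   # label depends on even/odd
    "threshold": lambda x: "A" if x < 5 else "B",     # label depends on x < 5
}
NOISE = 0.1  # probability that an observed label disagrees with the hypothesis


def likelihood(theta, x, y):
    """P(y | x, theta): the hypothesis's label with prob 1 - NOISE, else NOISE."""
    return 1.0 - NOISE if HYPOTHESES[theta](x) == y else NOISE


def posterior(demos):
    """P(theta | D) under a uniform prior over the hypotheses."""
    unnormalized = {}
    for theta in HYPOTHESES:
        p = 1.0 / len(HYPOTHESES)
        for x, y in demos:
            p *= likelihood(theta, x, y)
        unnormalized[theta] = p
    z = sum(unnormalized.values())
    return {theta: p / z for theta, p in unnormalized.items()}


def predictive(x_test, y_test, demos):
    """Sum over hypotheses of P(y_test | x_test, theta) * P(theta | D)."""
    post = posterior(demos)
    return sum(likelihood(t, x_test, y_test) * post[t] for t in HYPOTHESES)


demos = [(2, "A"), (7, "B"), (4, "A"), (1, "B")]  # the in-context examples D
post = posterior(demos)
p_b = predictive(3, "B", demos)
print(post)  # parity is favored: only the last example contradicts "threshold"
print(p_b)
```

Three of the four examples are consistent with both hypotheses; the single example `(1, "B")` is what shifts the posterior toward `parity`, which in turn makes label `B` the likely prediction for the odd test input 3.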

The Task Vector Hypothesis

Definition: Task Vectors

Task vectors are directions in activation space that encode the task defined by in-context examples. These vectors emerge from the attention mechanism and guide the model's predictions for new inputs.

The Grokking Hypothesis

Definition: Grokking

Grokking is the phenomenon where a model suddenly generalizes long after it has overfit its training data. Applied to ICL, the hypothesis is that large models go through a similar transition during pre-training and come to implement a learning procedure, possibly one resembling gradient descent, within the forward pass, effectively "grokking" how to learn from examples.

Impact of Example Ordering

The order of in-context examples significantly affects performance:

ICL Performance by Ordering

\text{Accuracy}(D_{\pi}) \neq \text{Accuracy}(D_{\sigma}) \quad \text{in general, for } \pi \neq \sigma

Here,

$D_{\pi}$ = examples ordered by permutation $\pi$
$D_{\sigma}$ = examples ordered by permutation $\sigma$

Ordering Strategies

Random ordering: Baseline, moderate performance
Similarity-based: Most similar examples last (best on average)
Label-balanced: Equal representation of each class
Difficulty-based: Easy examples first, hard examples last

For classification tasks, placing the most similar example last, closest to the test input, tends to improve performance. For generation tasks, order usually matters less.
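Similarity-based ordering can be sketched in a few lines. This is a minimal illustration, assuming a toy bag-of-words `embed` function as a stand-in for a real sentence-embedding model; the function names and the sample reviews are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def order_most_similar_last(examples, test_input):
    """Sort examples by ascending similarity, so the closest one ends up last."""
    query = embed(test_input)
    return sorted(examples, key=lambda ex: cosine(embed(ex[0]), query))


examples = [
    ("the movie was great", "Positive"),
    ("the food was cold", "Negative"),
    ("service was slow", "Negative"),
]
ordered = order_most_similar_last(examples, "the movie was boring")
for text, label in ordered:
    print(f"{text} -> {label}")
```

With a real embedding model, only `embed` would change; the sort itself stays the same.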

Impact of Example Selection

Example Selection Score

\text{score}(x_i) = \text{sim}(\text{embed}(x_i), \text{embed}(x_{\text{test}}))

Here,

$x_i$ = candidate in-context example
$x_{\text{test}}$ = test input
$\text{sim}$ = similarity function (cosine or dot product)

Selection Strategies

Random: No selection bias, but may include irrelevant examples
Top-k retrieval: Select k most similar examples from a pool
Diverse selection: Balance similarity with diversity
Label-aware: Ensure balanced label distribution in selected examples
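The top-k and label-aware strategies above can be sketched as follows. This is an illustration under stated assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, `sim` is a plain dot product, and the pool of reviews is made up.

```python
from collections import Counter, defaultdict


def embed(text):
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def score(candidate_text, test_input):
    """score(x_i) = sim(embed(x_i), embed(x_test)), using a dot product as sim."""
    a, b = embed(candidate_text), embed(test_input)
    return sum(a[t] * b[t] for t in a)


def select_top_k(pool, test_input, k):
    """Top-k retrieval: the k highest-scoring examples from the pool."""
    return sorted(pool, key=lambda ex: score(ex[0], test_input), reverse=True)[:k]


def select_label_balanced(pool, test_input, per_label):
    """Label-aware selection: top-k retrieval run separately within each label."""
    by_label = defaultdict(list)
    for example in pool:
        by_label[example[1]].append(example)
    selected = []
    for label in sorted(by_label):
        selected += select_top_k(by_label[label], test_input, per_label)
    return selected


pool = [
    ("the plot was dull", "Negative"),
    ("the plot was clever", "Positive"),
    ("the plot was thin and dull", "Negative"),
    ("lovely soundtrack", "Positive"),
]
test_input = "the plot was dull and slow"
top2 = select_top_k(pool, test_input, k=2)
balanced = select_label_balanced(pool, test_input, per_label=1)
print(top2)      # both picks share one label
print(balanced)  # one pick per label
```

On this pool, plain top-k returns two examples with the same label, which is the failure mode that label-aware selection is meant to avoid.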

ICL vs Fine-tuning

Aspect	ICL	Fine-tuning
Data required	2-32 examples	100-10,000+ examples
Compute	Forward pass only	Gradient updates
Task switching	Change prompt	Retrain model
Performance	80-95% of fine-tuning	100% (baseline)
Latency	Higher (longer prompts)	Lower (shorter prompts)
Knowledge access	Full pre-trained knowledge	May forget pre-trained knowledge

ICL-Fine-tuning Tradeoff

\text{Use ICL if: } \frac{N_{\text{examples}}}{N_{\text{params}}} < \tau_{\text{icl}}

Here,

$N_{\text{examples}}$ = number of labeled examples
$N_{\text{params}}$ = number of model parameters
$\tau_{\text{icl}}$ = threshold (typically 1e-4)
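As a quick arithmetic check, the rule can be written as a one-line function. The function name `prefer_icl` and the 7-billion-parameter model size are assumptions chosen for illustration; the default threshold is the 1e-4 value quoted above.

```python
def prefer_icl(n_examples, n_params, tau_icl=1e-4):
    """Return True when N_examples / N_params falls below the threshold."""
    return n_examples / n_params < tau_icl


n_params = 7_000_000_000  # hypothetical 7B-parameter model

few = prefer_icl(32, n_params)           # ratio is about 4.6e-9
many = prefer_icl(1_000_000, n_params)   # ratio is about 1.4e-4
print(few, many)
```

Under these numbers the crossover sits at 700,000 labeled examples (7e9 times 1e-4), so the rule is best read as a rough heuristic alongside the other factors in the comparison table.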

Practical ICL Implementation

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "meta-llama/Llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Few-shot examples
examples = [
    ("This movie was fantastic!", "Positive"),
    ("Terrible waste of time.", "Negative"),
    ("The food was okay.", "Neutral"),
]


def build_icl_prompt(test_input, examples):
    """Format the labeled examples, then the unlabeled test input."""
    prompt = "Classify the sentiment of each review.\n\n"
    for text, label in examples:
        prompt += f'Review: "{text}" -> {label}\n'
    prompt += f'Review: "{test_input}" -> '
    return prompt


test = "Absolutely loved every minute of it!"
prompt = build_icl_prompt(test, examples)

inputs = tokenizer(prompt, return_tensors="pt")
# Greedy decoding (do_sample=False) is deterministic; temperature=0.0 is not
# a valid sampling temperature, so it is dropped here.
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=10, do_sample=False)
# Decode only the newly generated tokens, not the prompt.
response = tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(f"Prediction: {response.strip()}")  # Expected: Positive
```

For more on prompting techniques, see our module on Prompt Engineering.

Practice Exercises

Empirical: Test ICL on a 3-class classification task with 1, 2, 4, 8, and 16 examples. Plot accuracy vs number of examples.
Ordering: Compare random, similarity-based, and reverse ordering of examples. Which is most robust?
Selection: Implement a retrieval-based example selector using sentence embeddings. Compare with random selection.
Theory: Explain why ICL works for decoder-only models but not for encoder-only models like BERT.

Key Takeaways:

ICL enables learning from examples without gradient updates
One leading hypothesis is that the model performs implicit Bayesian inference over task hypotheses
Example ordering and selection significantly affect performance
ICL can reach 80-95% of fine-tuning performance with roughly 100x fewer examples
Similarity-based example selection (most similar last) is generally best
ICL and fine-tuning are complementary; use ICL when data is scarce

Theoretical Foundations

ICL as Gradient Descent

Recent research suggests that transformer attention can implicitly perform steps resembling gradient descent during the forward pass. In constructed and small-scale settings, an attention layer computes the same prediction as a gradient step on a linear regression over the in-context examples, updating the model's internal representations without any explicit parameter updates.
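The correspondence is easiest to see in a deliberately simplified case. The sketch below assumes linear regression, weights initialized to zero, and attention without softmax (often called linear attention); it says nothing about what a trained LLM actually computes. Function names and data are illustrative.

```python
def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def gd_one_step_prediction(xs, ys, x_query, lr):
    """Predict with the weights after one gradient step from w = 0
    on the loss 0.5 * sum_i (w . x_i - y_i)^2."""
    dim = len(x_query)
    # The gradient at w = 0 is -sum_i y_i * x_i, so w_1 = lr * sum_i y_i * x_i.
    w = [lr * sum(y * x[d] for x, y in zip(xs, ys)) for d in range(dim)]
    return dot(w, x_query)


def linear_attention_prediction(xs, ys, x_query, lr):
    """Softmax-free attention readout: keys x_i, values y_i, query x_query."""
    return lr * sum(y * dot(x, x_query) for x, y in zip(xs, ys))


xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # in-context inputs
ys = [2.0, -1.0, 1.0]                      # in-context targets
x_query = [0.5, 2.0]

gd_pred = gd_one_step_prediction(xs, ys, x_query, lr=0.1)
attn_pred = linear_attention_prediction(xs, ys, x_query, lr=0.1)
print(gd_pred, attn_pred)  # the two predictions agree up to floating-point error
```

Both functions compute the same sum in a different order: the gradient-descent view first aggregates the examples into weights, while the attention view first compares each example with the query.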

Mechanistic Interpretability of ICL

Mechanistic interpretability studies have identified specific circuits involved in ICL. The induction head circuit, composed of two attention heads in successive layers, performs pattern completion: when the current token has appeared earlier in the context, it attends to the token that followed that earlier occurrence and copies it. This circuit emerges during training and is thought to be important for few-shot learning.
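The behavior attributed to induction heads can be mimicked with a plain lookup. This is a behavioral caricature only, assuming exact token matching on a non-empty list; a real induction head works on learned representations, and the function name is hypothetical.

```python
def induction_predict(tokens):
    """Return the token that followed the most recent earlier occurrence
    of the final token, or None if the final token has not appeared before.
    Assumes `tokens` is non-empty."""
    current = tokens[-1]
    # Scan backwards over earlier positions, skipping the final token itself.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None


completed = induction_predict(["A", "B", "C", "A"])
unseen = induction_predict(["A", "B", "C", "D"])
print(completed, unseen)
```

Given the sequence A B C A, the lookup finds the earlier A and proposes B as the next token, which is the "[A][B] ... [A] -> [B]" pattern completion described above.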

Limitations of ICL

Context window limits: ICL is constrained by the maximum sequence length. Long contexts increase latency and cost.
Distribution shift: ICL performance degrades when test examples differ significantly from the pre-training distribution.
Instability: Small changes in example ordering or selection can cause large performance swings.
Limited complexity: ICL struggles with tasks requiring deep reasoning or memorization of complex patterns.

Advanced ICL Techniques

Retrieval-Augmented ICL

Instead of selecting examples randomly, use a retrieval system to find the most relevant in-context examples for each query. This combines the benefits of RAG with ICL, improving accuracy on diverse inputs.

Learned Prompting

Rather than selecting natural examples, learn continuous prompt embeddings that maximize task performance. This is the basis of prefix tuning and prompt tuning methods, which bridge the gap between ICL and fine-tuning.

Task-Aware ICL

Analyze the task structure and design ICL prompts that explicitly communicate the task type, input format, and output format. This reduces ambiguity and improves consistency across diverse inputs.
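One way to make the task type and formats explicit is a small prompt template. The sketch below is an assumption about what such a template might look like; the field names (`Task`, `Input format`, `Output format`) and the sample data are illustrative, not a fixed convention.

```python
def build_task_aware_prompt(task_type, input_format, output_format, examples, test_input):
    """Build a prompt that states the task and formats before the examples."""
    lines = [
        f"Task: {task_type}",
        f"Input format: {input_format}",
        f"Output format: {output_format}",
        "",
    ]
    for text, label in examples:
        lines += [f"Input: {text}", f"Output: {label}", ""]
    # End with the unlabeled test input so the model completes the output.
    lines += [f"Input: {test_input}", "Output:"]
    return "\n".join(lines)


prompt = build_task_aware_prompt(
    task_type="sentiment classification",
    input_format="one product review per line",
    output_format="exactly one of Positive, Negative, Neutral",
    examples=[("Great battery life.", "Positive"), ("Broke in a week.", "Negative")],
    test_input="Does what it says.",
)
print(prompt)
```

Stating the allowed outputs up front narrows what the model can plausibly generate, which is where most of the consistency gain would be expected to come from.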

The most effective ICL systems combine retrieval (finding relevant examples), ordering (presenting them optimally), and calibration (adjusting for bias). Invest in all three components for best results.
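For the calibration step, one approach sometimes called contextual calibration estimates the prompt's label bias by querying with a content-free input (for example "N/A") and then dividing that bias out. The sketch below assumes label probabilities have already been obtained from a model; all numbers are invented for illustration.

```python
def calibrate(label_probs, content_free_probs):
    """Divide each label probability by the bias estimate, then renormalize."""
    adjusted = {label: p / content_free_probs[label] for label, p in label_probs.items()}
    total = sum(adjusted.values())
    return {label: p / total for label, p in adjusted.items()}


# Hypothetical model outputs: the prompt favors "Positive" even for a
# content-free input, so raw predictions lean that way.
content_free = {"Positive": 0.7, "Negative": 0.3}
raw = {"Positive": 0.6, "Negative": 0.4}

calibrated = calibrate(raw, content_free)
prediction = max(calibrated, key=calibrated.get)
print(calibrated, prediction)
```

Here the raw output favors "Positive", but less strongly than the prompt's baseline bias does, so the calibrated prediction becomes "Negative".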

ICL in Production

When deploying ICL in production, consider:

Latency: Longer prompts mean slower inference. Balance example count with speed requirements.
Cost: API calls are priced by token count. Fewer, more relevant examples reduce cost.
Consistency: Use deterministic example selection for reproducible outputs.
Fallback: Have a fine-tuned model as fallback when ICL performance is insufficient.
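The latency and cost considerations above come down to a token budget. The sketch below is a rough illustration: it approximates token counts by whitespace-separated words (real tokenizers will differ), assumes the examples arrive sorted most relevant first, and uses a made-up price per 1,000 tokens.

```python
def estimate_tokens(text):
    """Crude token estimate: whitespace-separated words. Real tokenizers differ."""
    return len(text.split())


def fit_examples_to_budget(examples, test_input, max_prompt_tokens):
    """Keep examples in the given order until the next one would exceed the budget.
    Stopping at the first miss keeps the selection deterministic."""
    used = estimate_tokens(test_input)
    kept = []
    for text, label in examples:
        cost = estimate_tokens(text) + estimate_tokens(label)
        if used + cost > max_prompt_tokens:
            break
        kept.append((text, label))
        used += cost
    return kept, used


def estimate_cost(prompt_tokens, requests, price_per_1k_tokens):
    """Prompt-side cost for a number of requests at a hypothetical price."""
    return prompt_tokens / 1000 * requests * price_per_1k_tokens


examples = [
    ("great film", "Positive"),
    ("awful plot and slow pacing", "Negative"),
    ("fine", "Neutral"),
]
kept, used = fit_examples_to_budget(examples, "loved it", max_prompt_tokens=10)
cost = estimate_cost(used, requests=1000, price_per_1k_tokens=0.5)
print(kept, used, cost)
```

In a real deployment, `estimate_tokens` would be replaced by the tokenizer of the model being served, and the price would come from the provider's current rate card.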
