Prompt Engineering
Prompt engineering is the art and science of designing inputs that guide large language models toward desired outputs. Effective prompts dramatically improve model performance without any parameter updates.
Why Prompt Engineering Matters
The same model can produce vastly different outputs depending on how it is prompted. Prompt engineering bridges the gap between model capability and user intent.
| Approach | Description | Cost | Performance |
|---|---|---|---|
| Zero-shot | No examples provided | Lowest | Baseline |
| One-shot | Single example | Low | Improved |
| Few-shot | Multiple examples | Moderate | Good |
| Chain-of-thought | Step-by-step reasoning | Moderate | Excellent |
| Self-consistency | Multiple reasoning paths | High | Best |
Prompt Taxonomy
Zero-Shot Prompting
Zero-shot prompting relies entirely on the model's pre-trained knowledge with no task-specific examples.
# Zero-shot classification
def zero_shot_classify(text, labels, model):
prompt = f"""Classify the following text into one of these categories: {', '.join(labels)}.
Text: {text}
Category:"""
response = model.generate(prompt)
return response.strip()
# Zero-shot sentiment analysis
text = "The movie had stunning visuals but a predictable plot."
labels = ["positive", "negative", "neutral"]
result = zero_shot_classify(text, labels, llm)
print(result) # Expected: "neutral"
Zero-Shot Effectiveness: The probability of correct zero-shot prediction scales with:
where is the pre-training data distribution and is the target task distribution.
Few-Shot Prompting
Few-shot prompting provides task examples that demonstrate the desired input-output mapping.
# Few-shot prompt template
def create_few_shot_prompt(examples, query, task_description=""):
"""
Construct a few-shot prompt with task description and examples.
Args:
examples: list of (input, output) tuples
query: the target input to classify/generate
task_description: optional task explanation
"""
prompt = ""
if task_description:
prompt += f"{task_description}\n\n"
for i, (inp, out) in enumerate(examples, 1):
prompt += f"Example {i}:\nInput: {inp}\nOutput: {out}\n\n"
prompt += f"Input: {query}\nOutput:"
return prompt
# Sentiment analysis examples
examples = [
("This product is amazing! Best purchase ever.", "Positive"),
("Terrible quality. Broke after one day.", "Negative"),
("It works as described. Nothing special.", "Neutral"),
]
query = "Not bad, but I expected better for the price."
prompt = create_few_shot_prompt(
examples, query,
task_description="Classify each review as Positive, Negative, or Neutral."
)
print(prompt)
Example Selection Strategies
| Strategy | Description | Use Case |
|---|---|---|
| Random sampling | Randomly select from training data | General purpose |
| Stratified | Balanced representation of classes | Classification |
| Semantic similarity | Select examples similar to query | Domain-specific |
| Diversity-based | Maximize coverage of input space | Complex tasks |
| Hard negatives | Include challenging examples | Edge cases |
import numpy as np
from sentence_transformers import SentenceTransformer
class FewShotExampleSelector:
def __init__(self, examples, model_name="all-MiniLM-L6-v2"):
self.examples = examples
self.encoder = SentenceTransformer(model_name)
self.embeddings = self.encoder.encode([ex[0] for ex in examples])
def select(self, query, k=4, strategy="similarity"):
query_embedding = self.encoder.encode([query])
if strategy == "similarity":
# Select most similar examples
similarities = np.dot(self.embeddings, query_embedding.T).flatten()
indices = np.argsort(similarities)[-k:][::-1]
elif strategy == "diversity":
# Maximal marginal relevance for diversity
selected = []
candidates = list(range(len(self.examples)))
for _ in range(k):
best_idx = None
best_score = -np.inf
for idx in candidates:
sim_to_query = np.dot(self.embeddings[idx], query_embedding.T).item()
sim_to_selected = max(
[np.dot(self.embeddings[idx], self.embeddings[s]).item() for s in selected]
) if selected else 0
score = sim_to_query - 0.5 * sim_to_selected
if score > best_score:
best_score = score
best_idx = idx
selected.append(best_idx)
candidates.remove(best_idx)
indices = selected
return [self.examples[i] for i in indices]
# Usage
selector = FewShotExampleSelector(training_examples)
selected = selector.select("This phone has an incredible camera!", k=3)
Chain-of-Thought (CoT) Prompting
CoT prompting elicits step-by-step reasoning, dramatically improving performance on complex tasks.
DfCoT Reasoning Process
Given input , the model generates a reasoning chain before producing the final answer :
where is the probability of the reasoning chain given the input.
# Standard prompt vs Chain-of-Thought
standard_prompt = """Question: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
Answer: 11"""
cot_prompt = """Question: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
Let's think step by step:
1. Roger starts with 5 tennis balls.
2. He buys 2 cans, each containing 3 tennis balls.
3. Total new balls: 2 Γ 3 = 6 tennis balls.
4. Total balls: 5 + 6 = 11 tennis balls.
Answer: 11"""
Zero-Shot CoT
def zero_shot_cot(question, model):
"""Elicit reasoning with 'Let's think step by step'."""
# Step 1: Generate reasoning
reasoning_prompt = f"""Question: {question}
Let's think step by step:"""
reasoning = model.generate(reasoning_prompt)
# Step 2: Extract answer
answer_prompt = f"""Question: {question}
{reasoning}
Therefore, the answer is:"""
answer = model.generate(answer_prompt)
return {"reasoning": reasoning, "answer": answer}
question = "A train travels 60 mph for 2.5 hours. How far does it travel?"
result = zero_shot_cot(question, llm)
print(f"Reasoning: {result['reasoning']}")
print(f"Answer: {result['answer']}")
Self-Consistency
Self-consistency generates multiple reasoning paths and selects the most common answer through majority voting.
DfSelf-Consistency Decoding
Generate reasoning paths , each producing answer . The final answer is:
With temperature sampling to encourage diverse reasoning paths.
import torch
from collections import Counter
class SelfConsistencyDecoder:
def __init__(self, model, tokenizer, num_paths=5, temperature=0.7):
self.model = model
self.tokenizer = tokenizer
self.num_paths = num_paths
self.temperature = temperature
def decode(self, prompt):
"""Generate multiple reasoning paths and vote on answer."""
answers = []
reasoning_paths = []
for _ in range(self.num_paths):
# Generate with temperature for diversity
input_ids = self.tokenizer.encode(prompt, return_tensors="pt")
output = self.model.generate(
input_ids,
max_new_tokens=256,
temperature=self.temperature,
top_p=0.9,
do_sample=True
)
response = self.tokenizer.decode(output[0], skip_special_tokens=True)
answer = self.extract_answer(response)
answers.append(answer)
reasoning_paths.append(response)
# Majority voting
vote_counts = Counter(answers)
final_answer = vote_counts.most_common(1)[0][0]
confidence = vote_counts[final_answer] / self.num_paths
return {
"answer": final_answer,
"confidence": confidence,
"vote_distribution": dict(vote_counts),
"reasoning_paths": reasoning_paths
}
def extract_answer(self, response):
"""Extract the final answer from the response."""
lines = response.strip().split("\n")
for line in reversed(lines):
if "answer" in line.lower():
# Extract text after the last colon or "is"
if ":" in line:
return line.split(":")[-1].strip()
elif "is" in line:
return line.split("is")[-1].strip()
return lines[-1].strip()
ReAct (Reasoning + Acting)
ReAct interleaves reasoning traces with actions, enabling models to interact with external tools.
# ReAct prompt template
REACT_TEMPLATE = """Answer the following questions as best you can. You have access to the following tools:
{tools}
Use the following format:
Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question
Begin!
Question: {question}
Thought:"""
# Example ReAct interaction
react_example = """
Question: What is the capital of the country where the Eiffel Tower is located?
Thought: I need to find where the Eiffel Tower is located first.
Action: search
Action Input: Eiffel Tower location
Observation: The Eiffel Tower is located in Paris, France.
Thought: Now I know the country is France. I need to find the capital of France.
Action: search
Action Input: capital of France
Observation: The capital of France is Paris.
Thought: I now know the final answer.
Final Answer: Paris
"""
Prompt Optimization
DfAutomatic Prompt Optimization
Given a set of training examples , optimize prompt to maximize:
where is the space of possible prompts.
from typing import List, Dict
import random
class PromptOptimizer:
"""Simple prompt optimizer using beam search over prompt components."""
def __init__(self, model, eval_data):
self.model = model
self.eval_data = eval_data
self.templates = [
"Classify: {input}\nCategory:",
"What category does this belong to?\n{input}\nAnswer:",
"Task: Classify the following text.\nText: {input}\nClass:",
]
self.demonstration_sets = [...] # Pre-generated
def optimize(self, num_rounds=5, beam_width=3):
"""Find optimal prompt components."""
best_score = 0
best_template = None
best_demos = None
for template in self.templates:
for demo_set in self.demonstration_sets:
score = self.evaluate(template, demo_set)
if score > best_score:
best_score = score
best_template = template
best_demos = demo_set
return {
"template": best_template,
"demonstrations": best_demos,
"score": best_score
}
def evaluate(self, template, demos, k=50):
"""Evaluate prompt on held-out data."""
correct = 0
for x, y in self.eval_data[:k]:
prompt = self.format_prompt(template, demos, x)
pred = self.model.generate(prompt)
if self.match(pred, y):
correct += 1
return correct / min(k, len(self.eval_data))
Best Practices Summary
| Principle | Description |
|---|---|
| Be specific | Clear, unambiguous instructions |
| Provide examples | Few-shot demonstrations help |
| Structure output | Define expected format explicitly |
| Use delimiters | Separate instructions from content |
| Iterate | Test and refine prompts empirically |
| Chain reasoning | CoT for complex multi-step tasks |
| Self-consistency | Vote across multiple reasoning paths |
Key Takeaways
- Zero-shot works well when the model has strong task priors
- Few-shot examples should be diverse and representative
- Chain-of-thought dramatically improves reasoning tasks
- Self-consistency improves reliability through majority voting
- ReAct enables tool use and grounded reasoning
- Always test prompts systematically on held-out examples