Prompt Engineering

Prompt engineering is the art and science of designing inputs that guide large language models toward desired outputs. Well-designed prompts can substantially improve task performance without updating any model parameters.

Why Prompt Engineering Matters

The same model can produce vastly different outputs depending on how it is prompted. Prompt engineering bridges the gap between model capability and user intent.

Approach	Description	Relative inference cost	Typical performance (task-dependent)
Zero-shot	No examples provided	Lowest	Baseline
One-shot	Single example	Low	Improved
Few-shot	Multiple examples	Moderate	Good
Chain-of-thought	Step-by-step reasoning	Moderate	Excellent
Self-consistency	Multiple reasoning paths	High	Best

Prompt Taxonomy

Zero-Shot Prompting

Zero-shot prompting relies entirely on the model's pre-trained knowledge with no task-specific examples.

# Zero-shot classification
def zero_shot_classify(text, labels, model):
    prompt = f"""Classify the following text into one of these categories: {', '.join(labels)}.

Text: {text}

Category:"""

    response = model.generate(prompt)
    return response.strip()

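# Zero-shot sentiment analysis
# `llm` is assumed to be any client object exposing generate(prompt) -> str
text = "The movie had stunning visuals but a predictable plot."
labels = ["positive", "negative", "neutral"]

result = zero_shot_classify(text, labels, llm)
print(result)  # Plausibly "neutral" for this mixed review; the output depends on the model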
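Because a zero-shot response is free-form text, it may not match a label exactly (for example "Neutral." instead of "neutral"). The following is a minimal sketch of one way to map a response back onto the allowed label set. The helper name `normalize_label` and its fallback rule are illustrative assumptions, not part of any library.

```python
def normalize_label(response, labels, default=None):
    """Map a free-form model response onto one of the allowed labels."""
    cleaned = response.strip().lower()
    # Prefer an exact match after trimming surrounding punctuation.
    stripped = cleaned.strip(" .,:;!\"'")
    for label in labels:
        if stripped == label.lower():
            return label
    # Otherwise accept the label that appears earliest in the response.
    hits = [
        (cleaned.find(label.lower()), label)
        for label in labels
        if label.lower() in cleaned
    ]
    return min(hits)[1] if hits else default


labels = ["positive", "negative", "neutral"]
print(normalize_label(" Neutral.\n", labels))                              # neutral
print(normalize_label("I would say this is negative overall.", labels))    # negative
print(normalize_label("unsure", labels))                                   # None
```

Returning `default` instead of guessing makes unparseable responses visible, so they can be counted or retried.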

Zero-Shot Effectiveness: As a rough heuristic rather than a formal result, the probability of a correct zero-shot prediction grows with the fraction of the task's data that is also covered by pre-training:

P(\text{correct}) \propto \frac{|D_{\text{pretrain}} \cap D_{\text{task}}|}{|D_{\text{task}}|}

where $D_{\text{pretrain}}$ is the set of examples seen during pre-training and $D_{\text{task}}$ is the set of examples that make up the target task.

Few-Shot Prompting

Few-shot prompting provides task examples that demonstrate the desired input-output mapping.

# Few-shot prompt template
def create_few_shot_prompt(examples, query, task_description=""):
    """
    Construct a few-shot prompt with task description and examples.

    Args:
        examples: list of (input, output) tuples
        query: the target input to classify/generate
        task_description: optional task explanation
    """
    prompt = ""
    if task_description:
        prompt += f"{task_description}\n\n"

    for i, (inp, out) in enumerate(examples, 1):
        prompt += f"Example {i}:\nInput: {inp}\nOutput: {out}\n\n"

    prompt += f"Input: {query}\nOutput:"
    return prompt

# Sentiment analysis examples
examples = [
    ("This product is amazing! Best purchase ever.", "Positive"),
    ("Terrible quality. Broke after one day.", "Negative"),
    ("It works as described. Nothing special.", "Neutral"),
]

query = "Not bad, but I expected better for the price."
prompt = create_few_shot_prompt(
    examples, query,
    task_description="Classify each review as Positive, Negative, or Neutral."
)
print(prompt)

Example Selection Strategies

Strategy	Description	Use Case
Random sampling	Randomly select from training data	General purpose
Stratified	Balanced representation of classes	Classification
Semantic similarity	Select examples similar to query	Domain-specific
Diversity-based	Maximize coverage of input space	Complex tasks
Hard negatives	Include challenging examples	Edge cases
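The stratified strategy from the table can be sketched without any external dependencies. This is one possible implementation under stated assumptions: examples are `(input, label)` tuples, and the function name `stratified_select` and the round-robin rule are illustrative choices.

```python
import random
from collections import defaultdict


def stratified_select(examples, k, seed=0):
    """Pick up to k (input, label) examples, cycling over classes for balance."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for inp, label in examples:
        by_label[label].append((inp, label))
    # Shuffle within each class so the choice is not tied to dataset order.
    for group in by_label.values():
        rng.shuffle(group)

    selected = []
    labels = sorted(by_label)
    # Round-robin over classes until k examples are chosen or data runs out.
    while len(selected) < k and any(by_label[label] for label in labels):
        for label in labels:
            if by_label[label] and len(selected) < k:
                selected.append(by_label[label].pop())
    return selected


pool = [
    ("Great value.", "Positive"),
    ("Love it.", "Positive"),
    ("Works perfectly.", "Positive"),
    ("Stopped working.", "Negative"),
    ("It is okay.", "Neutral"),
]
picked = stratified_select(pool, k=3)
print(sorted(label for _, label in picked))  # ['Negative', 'Neutral', 'Positive']
```

Even though the pool is dominated by positive reviews, each class contributes one example before any class contributes a second.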

import numpy as np
from sentence_transformers import SentenceTransformer

class FewShotExampleSelector:
    def __init__(self, examples, model_name="all-MiniLM-L6-v2"):
        self.examples = examples
        self.encoder = SentenceTransformer(model_name)
        # Unit-normalize embeddings so that dot products equal cosine similarities
        self.embeddings = self.encoder.encode(
            [ex[0] for ex in examples], normalize_embeddings=True
        )

    def select(self, query, k=4, strategy="similarity"):
        # Never ask for more examples than are available
        k = min(k, len(self.examples))
        query_embedding = self.encoder.encode([query], normalize_embeddings=True)[0]
        # Cosine similarity of every example to the query
        sim_to_query = self.embeddings @ query_embedding

        if strategy == "similarity":
            # Select the k most similar examples, most similar first
            indices = np.argsort(sim_to_query)[-k:][::-1]

        elif strategy == "diversity":
            # Greedy maximal marginal relevance (MMR): reward similarity to the
            # query, penalize similarity to examples that are already selected
            selected = []
            candidates = list(range(len(self.examples)))

            for _ in range(k):
                best_idx = None
                best_score = -np.inf
                for idx in candidates:
                    sim_to_selected = max(
                        (float(self.embeddings[idx] @ self.embeddings[s]) for s in selected),
                        default=0.0,
                    )
                    score = sim_to_query[idx] - 0.5 * sim_to_selected
                    if score > best_score:
                        best_score = score
                        best_idx = idx
                selected.append(best_idx)
                candidates.remove(best_idx)
            indices = selected

        else:
            raise ValueError(f"Unknown strategy: {strategy!r}")

        return [self.examples[i] for i in indices]

# Usage (`training_examples` is assumed to be a list of (text, label) tuples)
selector = FewShotExampleSelector(training_examples)
selected = selector.select("This phone has an incredible camera!", k=3)

Chain-of-Thought (CoT) Prompting

CoT prompting elicits step-by-step reasoning before the final answer, which often substantially improves performance on multi-step tasks such as arithmetic and logical reasoning.

Definition: CoT Reasoning Process

Given input $x$, the model generates a reasoning chain $r = (r_1, r_2, \ldots, r_k)$ before producing the final answer $a$:

P(a | x) = \sum_{r} P(a | x, r) \cdot P(r | x)

where $P(r | x)$ is the probability of the reasoning chain given the input.
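To make the marginalization concrete, here is a small worked example. The three reasoning chains and all probabilities are invented toy numbers for illustration only; in practice the sum over all chains is intractable and is approximated by sampling.

```python
# Hypothetical toy distribution over three reasoning chains for one question.
# P(r | x): how likely the model is to produce each chain.
p_chain = {"r1": 0.5, "r2": 0.3, "r3": 0.2}

# P(a | x, r): answer distribution after following each chain.
p_answer_given_chain = {
    "r1": {"11": 0.9, "10": 0.1},
    "r2": {"11": 0.8, "10": 0.2},
    "r3": {"11": 0.1, "10": 0.9},
}


def marginal_answer_probs(p_chain, p_answer_given_chain):
    """Compute P(a | x) = sum over r of P(a | x, r) * P(r | x)."""
    totals = {}
    for chain, p_r in p_chain.items():
        for answer, p_a in p_answer_given_chain[chain].items():
            totals[answer] = totals.get(answer, 0.0) + p_a * p_r
    return totals


probs = marginal_answer_probs(p_chain, p_answer_given_chain)
print({answer: round(p, 2) for answer, p in probs.items()})  # {'11': 0.71, '10': 0.29}
```

For the answer "11" the sum is 0.9 × 0.5 + 0.8 × 0.3 + 0.1 × 0.2 = 0.71, so a mostly wrong chain (r3) lowers the total but does not override the two chains that agree.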

# Standard prompt vs Chain-of-Thought
standard_prompt = """Question: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?

Answer: 11"""

cot_prompt = """Question: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?

Let's think step by step:
1. Roger starts with 5 tennis balls.
2. He buys 2 cans, each containing 3 tennis balls.
3. Total new balls: 2 × 3 = 6 tennis balls.
4. Total balls: 5 + 6 = 11 tennis balls.

Answer: 11"""

Zero-Shot CoT

def zero_shot_cot(question, model):
    """Elicit reasoning with 'Let's think step by step'."""

    # Step 1: Generate reasoning
    reasoning_prompt = f"""Question: {question}

Let's think step by step:"""

    reasoning = model.generate(reasoning_prompt)

    # Step 2: Extract answer
    answer_prompt = f"""Question: {question}

{reasoning}

Therefore, the answer is:"""

    answer = model.generate(answer_prompt)
    return {"reasoning": reasoning, "answer": answer}

# `llm` is assumed to be any client object exposing generate(prompt) -> str
question = "A train travels 60 mph for 2.5 hours. How far does it travel?"
result = zero_shot_cot(question, llm)
print(f"Reasoning: {result['reasoning']}")
print(f"Answer: {result['answer']}")

Self-Consistency

Self-consistency generates multiple reasoning paths and selects the most common answer through majority voting.

Definition: Self-Consistency Decoding

Generate $M$ reasoning paths $\{r^{(1)}, r^{(2)}, \ldots, r^{(M)}\}$, each producing answer $a^{(i)}$. The final answer is:

a^* = \arg\max_{a} \sum_{i=1}^{M} \mathbb{1}[a^{(i)} = a]

The paths are sampled with temperature $T > 0$ to encourage diverse reasoning.

from collections import Counter

class SelfConsistencyDecoder:
    def __init__(self, model, tokenizer, num_paths=5, temperature=0.7):
        self.model = model
        self.tokenizer = tokenizer
        self.num_paths = num_paths
        self.temperature = temperature

    def decode(self, prompt):
        """Generate multiple reasoning paths and vote on answer."""
        answers = []
        reasoning_paths = []

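        # Tokenize once; the prompt is identical for every sampled path
        input_ids = self.tokenizer.encode(prompt, return_tensors="pt")
        prompt_length = input_ids.shape[1]

        for _ in range(self.num_paths):
            # Sample with temperature > 0 so the paths differ
            output = self.model.generate(
                input_ids,
                max_new_tokens=256,
                temperature=self.temperature,
                top_p=0.9,
                do_sample=True
            )

            # Decode only the newly generated tokens. This assumes a decoder-only
            # (causal) model, whose output sequence begins with the prompt tokens;
            # otherwise the prompt text could be mistaken for the answer.
            response = self.tokenizer.decode(
                output[0][prompt_length:], skip_special_tokens=True
            )
            answer = self.extract_answer(response)

            answers.append(answer)
            reasoning_paths.append(response)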

        # Majority voting
        vote_counts = Counter(answers)
        final_answer = vote_counts.most_common(1)[0][0]
        confidence = vote_counts[final_answer] / self.num_paths

        return {
            "answer": final_answer,
            "confidence": confidence,
            "vote_distribution": dict(vote_counts),
            "reasoning_paths": reasoning_paths
        }

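    def extract_answer(self, response):
        """Extract the final answer from the response."""
        lines = response.strip().split("\n")
        for line in reversed(lines):
            if "answer" in line.lower():
                # Take the text after the last colon, or after the last standalone
                # " is " (a bare "is" would also match inside words such as "this").
                # Trailing periods are dropped so "11." and "11" count as one vote.
                if ":" in line:
                    return line.split(":")[-1].strip().rstrip(".")
                elif " is " in line:
                    return line.split(" is ")[-1].strip().rstrip(".")
        # Fall back to the last line when no line mentions an answer
        return lines[-1].strip()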
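The voting step itself does not depend on any particular model library. As a minimal sketch, the function below takes any zero-argument callable that returns one sampled answer. The name `self_consistency` and the scripted stand-in for a model are assumptions made for illustration.

```python
from collections import Counter
from itertools import cycle


def self_consistency(sample_answer, num_paths=5):
    """Call sample_answer() num_paths times and majority-vote the results."""
    answers = [sample_answer() for _ in range(num_paths)]
    votes = Counter(answers)
    winner, count = votes.most_common(1)[0]
    return {"answer": winner, "confidence": count / num_paths, "votes": dict(votes)}


# Stand-in for a sampled model: replays a fixed sequence of final answers
# to "A train travels 60 mph for 2.5 hours. How far does it travel?"
scripted = cycle(["150", "150", "125", "150", "160"])
result = self_consistency(lambda: next(scripted), num_paths=5)
print(result)
# {'answer': '150', 'confidence': 0.6, 'votes': {'150': 3, '125': 1, '160': 1}}
```

The confidence value is only the vote share among the sampled paths; it is not a calibrated probability.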

ReAct (Reasoning + Acting)

ReAct interleaves reasoning traces with actions, enabling models to interact with external tools.

# ReAct prompt template
REACT_TEMPLATE = """Answer the following questions as best you can. You have access to the following tools:

{tools}

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin!

Question: {question}
Thought:"""

# Example ReAct interaction
react_example = """
Question: What is the capital of the country where the Eiffel Tower is located?
Thought: I need to find where the Eiffel Tower is located first.
Action: search
Action Input: Eiffel Tower location
Observation: The Eiffel Tower is located in Paris, France.
Thought: Now I know the country is France. I need to find the capital of France.
Action: search
Action Input: capital of France
Observation: The capital of France is Paris.
Thought: I now know the final answer.
Final Answer: Paris
"""
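The Thought/Action/Observation format above implies a control loop around the model: parse the action, run the tool, append the observation, and repeat. The following is a minimal sketch of such a loop, not a reference implementation. The function `run_react`, the `generate` callable, the `tools` dictionary, and the scripted model replies are all hypothetical stand-ins.

```python
import re


def run_react(question, generate, tools, max_steps=5):
    """Alternate model steps and tool calls until a Final Answer appears."""
    transcript = f"Question: {question}\nThought:"
    for _ in range(max_steps):
        step = generate(transcript).strip()
        transcript += f" {step}\n"

        final = re.search(r"Final Answer:\s*(.+)", step)
        if final:
            return final.group(1).strip(), transcript

        action = re.search(r"^Action:\s*(.+)$", step, re.MULTILINE)
        action_input = re.search(r"^Action Input:\s*(.+)$", step, re.MULTILINE)
        if not (action and action_input):
            break  # The model left the expected format; stop rather than guess.

        tool = tools.get(action.group(1).strip())
        observation = tool(action_input.group(1).strip()) if tool else "Unknown tool."
        transcript += f"Observation: {observation}\nThought:"
    return None, transcript


# Scripted stand-in for a model, replaying the example interaction above.
scripted_steps = iter([
    "I need to find where the Eiffel Tower is located first.\n"
    "Action: search\nAction Input: Eiffel Tower location",
    "Now I know the country is France. I need to find the capital of France.\n"
    "Action: search\nAction Input: capital of France",
    "I now know the final answer.\nFinal Answer: Paris",
])
facts = {
    "Eiffel Tower location": "The Eiffel Tower is located in Paris, France.",
    "capital of France": "The capital of France is Paris.",
}
tools = {"search": lambda query: facts.get(query, "No result.")}

answer, transcript = run_react(
    "What is the capital of the country where the Eiffel Tower is located?",
    lambda prompt: next(scripted_steps),
    tools,
)
print(answer)  # Paris
```

With a real model, generation would typically need to stop before the model writes its own "Observation:" line, so that observations come only from actual tool calls.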

Prompt Optimization

Definition: Automatic Prompt Optimization

Given a set of training examples $\mathcal{D} = \{(x_i, y_i)\}$, find the prompt $p^*$ that maximizes the log-likelihood of the target outputs:

p^* = \arg\max_{p \in \mathcal{P}} \sum_{(x_i, y_i) \in \mathcal{D}} \log P(y_i | p, x_i)

where $\mathcal{P}$ is the space of possible prompts.
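For a small, finite candidate set the objective can be evaluated directly. The sketch below uses invented values of $P(y_i | p, x_i)$ for two candidate prompts on three examples; in practice these would come from the model's token probabilities, or be replaced by an accuracy score when probabilities are unavailable.

```python
import math

# Hypothetical values of P(y_i | p, x_i) for two candidate prompts
# on the same three training examples.
likelihoods = {
    "Classify: {input}\nCategory:": [0.9, 0.6, 0.7],
    "Task: Classify the following text.\nText: {input}\nClass:": [0.8, 0.8, 0.8],
}


def log_likelihood(probs):
    """Sum of log P(y_i | p, x_i) over the dataset."""
    return sum(math.log(p) for p in probs)


scores = {prompt: log_likelihood(probs) for prompt, probs in likelihoods.items()}
best_prompt = max(scores, key=scores.get)
print(best_prompt.splitlines()[0])  # Task: Classify the following text.
```

The second prompt wins although the first has the single highest likelihood, because the sum of logs penalizes the weak 0.6 example more than the uniform 0.8 values.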

class PromptOptimizer:
    """Simple prompt optimizer using exhaustive search over templates and demonstration sets."""

    def __init__(self, model, eval_data):
        self.model = model
        self.eval_data = eval_data
        self.templates = [
            "Classify: {input}\nCategory:",
            "What category does this belong to?\n{input}\nAnswer:",
            "Task: Classify the following text.\nText: {input}\nClass:",
        ]
        self.demonstration_sets = [...]  # Pre-generated

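    def optimize(self, num_rounds=5, beam_width=3):
        """Find the best (template, demonstration set) pair.

        Every combination is evaluated exactly once, so num_rounds and
        beam_width are currently unused.
        """
        # Start below any valid accuracy so a result is returned even if all scores are 0
        best_score = float("-inf")
        best_template = None
        best_demos = None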

        for template in self.templates:
            for demo_set in self.demonstration_sets:
                score = self.evaluate(template, demo_set)
                if score > best_score:
                    best_score = score
                    best_template = template
                    best_demos = demo_set

        return {
            "template": best_template,
            "demonstrations": best_demos,
            "score": best_score
        }

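    def evaluate(self, template, demos, k=50):
        """Return accuracy of a prompt on the first k evaluation examples."""
        subset = self.eval_data[:k]
        if not subset:
            raise ValueError("eval_data is empty")
        correct = 0
        for x, y in subset:
            prompt = self.format_prompt(template, demos, x)
            pred = self.model.generate(prompt)
            if self.match(pred, y):
                correct += 1
        return correct / len(subset)

    # format_prompt and match are minimal assumed implementations;
    # demos is assumed to be a list of (input, output) pairs.
    def format_prompt(self, template, demos, x):
        """Render each demonstration with the template, then append the query."""
        shots = "".join(
            f"{template.format(input=d_in)} {d_out}\n\n" for d_in, d_out in demos
        )
        return shots + template.format(input=x)

    def match(self, pred, y):
        """Case-insensitive exact match after trimming whitespace."""
        return pred.strip().lower() == str(y).strip().lower()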

Best Practices Summary

Principle	Description
Be specific	Clear, unambiguous instructions
Provide examples	Few-shot demonstrations help
Structure output	Define expected format explicitly
Use delimiters	Separate instructions from content
Iterate	Test and refine prompts empirically
Chain reasoning	CoT for complex multi-step tasks
Self-consistency	Vote across multiple reasoning paths
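Two of the principles above, "Use delimiters" and "Structure output", can be illustrated together. This is a hedged sketch: the `<review>` tag, the JSON shape, and the function names `build_prompt` and `parse_response` are arbitrary choices made for the example.

```python
import json


def build_prompt(review):
    """Wrap the content in delimiters and request a fixed JSON shape."""
    return (
        "Classify the sentiment of the review between the <review> tags.\n"
        'Respond with JSON only, in the form {"label": "...", "reason": "..."}, '
        "where label is Positive, Negative, or Neutral.\n\n"
        f"<review>\n{review}\n</review>"
    )


def parse_response(raw, allowed=("Positive", "Negative", "Neutral")):
    """Return the parsed dict, or None if the response breaks the contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or data.get("label") not in allowed:
        return None
    return data


print(build_prompt("Not bad, but I expected better for the price."))
print(parse_response('{"label": "Neutral", "reason": "Mixed praise and criticism."}'))
print(parse_response("Neutral"))  # None: not valid JSON
```

Validating the response against the requested format turns silent formatting drift into an explicit failure that can be logged or retried.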

Key Takeaways

Zero-shot works well when the model has strong task priors
Few-shot examples should be diverse and representative
Chain-of-thought prompting often substantially improves performance on multi-step reasoning tasks
Self-consistency improves reliability through majority voting
ReAct enables tool use and grounded reasoning
Always test prompts systematically on held-out examples
