
Constitutional AI — Training LLMs with Principles, Not Just Data

Constitutional AI replaces human feedback with AI-driven self-critique guided by explicit principles, making alignment more scalable, transparent, and reproducible.

  • Self-Critique Loop — The model critiques and revises its own outputs against a set of constitutional principles
  • RLAIF — Reinforcement learning from AI feedback eliminates the need for expensive human labelers
  • Two-Phase Training — Supervised learning on revised responses followed by RL on AI-generated preferences

"Explicit principles make alignment transparent and reproducible compared to implicit human preferences."

Constitutional AI

Constitutional AI (CAI) is a method developed by Anthropic for aligning language models to human values without relying on extensive human feedback. It uses a set of principles (a "constitution") to guide the model's self-critique and revision process, making alignment more scalable and transparent.

Put more precisely, CAI is an alignment framework in which an AI system is trained to follow a set of explicit principles (a constitution) through a two-phase process: (1) supervised learning from self-critique and revision, and (2) reinforcement learning from AI feedback (RLAIF) rather than human feedback.

Motivation

Traditional RLHF has scalability limitations:

  • Human feedback is expensive and slow to collect
  • Human preferences can be inconsistent
  • Red-teaming requires human effort
  • Safety guidelines are difficult to codify in reward models

CAI addresses these by using AI itself to provide feedback based on explicit principles.

The Constitution

A constitution is a set of natural language principles that guide model behavior. Example principles:

1. Choose the response that is least likely to be considered harmful.
2. Choose the response that is most helpful and harmless.
3. Choose the response that is most ethical and least likely to cause harm.
4. Choose the response that is most aligned with the values of a helpful assistant.
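
A constitution like the one above can be held as plain data. The sketch below stores the four example principles as strings and assumes a hypothetical `sample_principle` helper that draws one at random per call, which is how Bai et al. (2022) describe choosing a principle for each critique or comparison. The helper name and the fixed seed are illustrative only.

```python
import random

# The four example principles listed above, stored as plain strings.
CONSTITUTION = [
    "Choose the response that is least likely to be considered harmful.",
    "Choose the response that is most helpful and harmless.",
    "Choose the response that is most ethical and least likely to cause harm.",
    "Choose the response that is most aligned with the values of a helpful assistant.",
]


def sample_principle(rng: random.Random) -> str:
    """Draw one principle uniformly at random (hypothetical helper)."""
    return rng.choice(CONSTITUTION)


rng = random.Random(0)  # fixed seed so runs are repeatable
principle = sample_principle(rng)
print(principle)
```

Keeping the principles as data rather than hard-coding them into prompts makes it easy to swap or extend the constitution without touching the training code.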

Phase 1: Supervised Learning from Self-Critique (SL-CAI)

The SL-CAI phase generates training data through a self-critique loop:

Step 1: Generate initial responses

  • Sample a prompt from the training data
  • Generate an initial response from the starting model (in Bai et al., 2022, a helpful-only RLHF model)

Step 2: Self-critique

  • Ask the model to critique its own response against the constitution
  • "Identify specific ways in which the response might violate the principle: [principle]"

Step 3: Revision

  • Ask the model to revise its response based on the critique
  • "Please rewrite the response to address the issues identified above"

Step 4: Collect revised responses

  • Use the revised responses as supervised training data
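
The four steps above can be sketched as a single function that turns one prompt into one supervised training pair. This is a minimal sketch, not the authors' pipeline: `Generate` is an assumed text-in, text-out interface, `build_sl_cai_example` is a hypothetical name, and `fake_model` is a toy stand-in so the example runs without a real model.

```python
from typing import Callable, Dict, List

# A text-in, text-out function standing in for any language model (assumption).
Generate = Callable[[str], str]


def build_sl_cai_example(prompt: str, principle: str, generate: Generate) -> Dict[str, str]:
    """Run steps 1-4 once and return a (prompt, revised response) training pair."""
    initial = generate(prompt)  # Step 1: initial response
    critique = generate(  # Step 2: self-critique against one principle
        f"{prompt}\n\n{initial}\n\n"
        f"Identify specific ways in which the response might violate the principle: {principle}"
    )
    revised = generate(  # Step 3: revision based on the critique
        f"{prompt}\n\n{initial}\n\n{critique}\n\n"
        "Please rewrite the response to address the issues identified above"
    )
    return {"prompt": prompt, "completion": revised}  # Step 4: keep only the revision


def fake_model(text: str) -> str:
    """Toy stand-in that reports which stage it was called for."""
    if "rewrite the response" in text:
        return "REVISED"
    if "Identify specific ways" in text:
        return "CRITIQUE"
    return "INITIAL"


principle = "Choose the response that is least likely to be considered harmful."
dataset: List[Dict[str, str]] = [
    build_sl_cai_example(p, principle, fake_model)
    for p in ["How do I pick a lock?", "Write a limerick about cats."]
]
print(dataset[0])  # {'prompt': 'How do I pick a lock?', 'completion': 'REVISED'}
```

Note that the training pair pairs the original prompt with the final revision; the intermediate critique is discarded.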

SL-CAI Loss Function

$$\mathcal{L}_{\text{SL-CAI}} = -\sum_{i=1}^{N} \log P_\theta\left(y_i^{\text{revised}} \mid x_i\right)$$

Here,

  • $y_i^{\text{revised}}$ = revised response after self-critique
  • $x_i$ = input prompt
  • $\theta$ = model parameters
  • $N$ = number of training examples

The model is fine-tuned on the revised responses using standard supervised learning.
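
To make the loss concrete, here is a small worked computation in plain Python. It assumes the usual autoregressive factorization, where the log-probability of a response is the sum of its per-token log-probabilities; the token probabilities are made-up numbers, and the function names are illustrative.

```python
import math
from typing import List


def sequence_log_prob(token_probs: List[float]) -> float:
    """log P(y | x) as the sum of per-token log-probabilities of the response."""
    return sum(math.log(p) for p in token_probs)


def sl_cai_loss(batch_token_probs: List[List[float]]) -> float:
    """Negative log-likelihood summed over the N revised responses."""
    return -sum(sequence_log_prob(probs) for probs in batch_token_probs)


# Made-up probabilities the model assigns to each token of two revised responses.
batch = [
    [0.5, 0.5],        # response 1: two tokens
    [0.25, 1.0, 0.5],  # response 2: three tokens
]
loss = sl_cai_loss(batch)
print(round(loss, 4))  # 3.4657, i.e. 5 * ln(2)
```

Only the response tokens contribute to the sum; the prompt tokens are conditioned on but not scored.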

Phase 2: Reinforcement Learning from AI Feedback (RLAIF)

In the RLAIF phase, AI-generated preferences replace human preferences:

Step 1: Generate response pairs

  • For each prompt, generate two candidate responses

Step 2: AI preference labeling

  • Ask the model (or a separate model) to choose which response is better according to the constitution
  • "Considering the following principles: [constitution], which response is better?"
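
One way to turn the labeling step into a training target, described in Bai et al. (2022), is to compare the log-probabilities the labeling model assigns to each answer option and normalize them into a soft preference. The sketch below assumes those two log-probabilities are already available; the numbers and the `soft_preference` name are hypothetical.

```python
import math
from typing import Tuple


def soft_preference(logprob_a: float, logprob_b: float) -> Tuple[float, float]:
    """Normalize two answer log-probabilities into P(A) and P(B)."""
    # Subtract the max before exponentiating for numerical stability.
    m = max(logprob_a, logprob_b)
    ea, eb = math.exp(logprob_a - m), math.exp(logprob_b - m)
    total = ea + eb
    return ea / total, eb / total


# Hypothetical log-probabilities the labeling model assigns to answering "A" vs "B".
p_a, p_b = soft_preference(math.log(0.6), math.log(0.2))
print(round(p_a, 2), round(p_b, 2))  # 0.75 0.25
```

A soft label keeps the labeler's uncertainty, whereas parsing a free-text "A" or "B" answer collapses it to a hard choice.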

Step 3: Train reward model

  • Train a reward model on the AI-generated preferences

Step 4: PPO optimization

  • Use PPO to optimize the language model against the learned reward
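
The text does not say which loss the reward model is trained with. A common choice, assumed here, is the pairwise Bradley-Terry loss, which pushes the reward of the preferred response above the reward of the other one. The scores below are hypothetical and the function names are illustrative.

```python
import math
from typing import List, Tuple


def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log sigmoid(r_chosen - r_rejected), written in a numerically stable form."""
    margin = reward_chosen - reward_rejected
    # Equivalent to log(1 + exp(-margin)) without overflow for large |margin|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def mean_pairwise_loss(pairs: List[Tuple[float, float]]) -> float:
    """Average the pairwise loss over (preferred, dispreferred) reward pairs."""
    return sum(pairwise_loss(c, r) for c, r in pairs) / len(pairs)


# Hypothetical reward-model scores for (preferred, dispreferred) responses.
well_ranked = [(2.0, -1.0), (1.5, 0.0)]
mis_ranked = [(-1.0, 2.0), (0.0, 1.5)]
print(mean_pairwise_loss(well_ranked) < mean_pairwise_loss(mis_ranked))  # True
```

The loss is small when the reward model already ranks the AI-preferred response higher, and grows roughly linearly with the margin when it ranks them the wrong way round.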

RL-CAI Objective

$$\max_{\pi_\theta} \; \mathbb{E}_{x \sim D,\, y \sim \pi_\theta(\cdot \mid x)} \left[ R_\phi(x, y) - \beta \, \mathrm{KL}\left(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x)\right) \right]$$

Here,

  • $\pi_\theta$ = policy being optimized
  • $\pi_{\text{ref}}$ = reference policy (initial model)
  • $R_\phi$ = reward model trained on AI preferences
  • $\beta$ = KL penalty coefficient
  • $D$ = distribution of prompts
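
A sample-based reading of the RL-CAI objective may help. For a response sampled from the policy, the difference between the policy's and the reference policy's log-probabilities is a single-sample estimate of the KL term, so the quantity being maximized can be approximated per sample as reward minus a scaled log-ratio. The sketch below uses hypothetical numbers and illustrative function names; it is not a PPO implementation.

```python
from typing import Dict, List


def kl_penalized_reward(reward: float, logp_policy: float, logp_ref: float, beta: float) -> float:
    """Single-sample estimate of R(x, y) - beta * KL, using the log-probability ratio."""
    return reward - beta * (logp_policy - logp_ref)


def objective_estimate(samples: List[Dict[str, float]], beta: float) -> float:
    """Average the penalized reward over sampled (prompt, response) pairs."""
    values = [
        kl_penalized_reward(s["reward"], s["logp_policy"], s["logp_ref"], beta)
        for s in samples
    ]
    return sum(values) / len(values)


# Hypothetical numbers: the second sample has drifted far from the reference policy.
samples = [
    {"reward": 1.0, "logp_policy": -10.0, "logp_ref": -10.5},
    {"reward": 2.0, "logp_policy": -8.0, "logp_ref": -14.0},
]
print(objective_estimate(samples, beta=0.1))
```

With `beta=0.1` the second sample's reward of 2.0 is reduced by 0.6 because the policy assigns it a much higher log-probability than the reference does; setting `beta=0` removes the penalty entirely.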
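
CAI Training Objective

Bai et al. (2022) run the two phases one after the other: SL-CAI first, then RL-CAI starting from the SL-CAI model. If the two are instead written as a single weighted objective, it takes the form

$$\mathcal{L}_{\text{CAI}} = \mathcal{L}_{\text{SL-CAI}} + \alpha \cdot \mathcal{L}_{\text{RL-CAI}}$$

Here,

  • $\mathcal{L}_{\text{SL-CAI}}$ = supervised loss on revised responses
  • $\mathcal{L}_{\text{RL-CAI}}$ = reinforcement learning loss, i.e. the negative of the RL-CAI objective above, so that minimizing it maximizes the KL-penalized reward
  • $\alpha$ = weighting coefficient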

RLAIF vs RLHF

In Bai et al. (2022), RLAIF reached harmlessness comparable to or better than RLHF without any human labels identifying harmful outputs; human preference labels were still used for helpfulness. The AI critic provides consistent, scalable preference signals that are tied to explicit principles.

| Aspect | RLHF | RLAIF |
| --- | --- | --- |
| Feedback source | Human annotators | AI model |
| Cost | High ($15-25/hour per annotator) | Low (compute only) |
| Consistency | Variable inter-annotator agreement | Consistent within model |
| Scalability | Limited by annotator pool | Virtually unlimited |
| Transparency | Implicit preferences | Explicit constitutional principles |
| Bias | Human biases | Model biases |

Implementation Example

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from dataclasses import dataclass
from typing import List, Dict

@dataclass
class ConstitutionalPrinciple:
    name: str
    critique_prompt: str
    revision_prompt: str

class ConstitutionalAI:
    def __init__(self, model_name: str, principles: List[ConstitutionalPrinciple]):
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.principles = principles
        self.device = next(self.model.parameters()).device
    
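    def generate_response(self, prompt: str, max_new_tokens: int = 256) -> str:
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                temperature=0.7,
                do_sample=True,
                # GPT-2-style tokenizers define no pad token; reuse EOS to avoid a warning
                pad_token_id=self.tokenizer.eos_token_id,
            )
        # Decode only the newly generated tokens, not the echoed prompt
        prompt_length = inputs["input_ids"].shape[1]
        return self.tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()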
    
    def critique(self, prompt: str, response: str, principle: ConstitutionalPrinciple) -> str:
        # End the prompt with "Critique:" so the model continues with the critique itself
        critique_prompt = (
            f"Human: {prompt}\n\n"
            f"Assistant: {response}\n\n"
            f"Critique Request: {principle.critique_prompt}\n\n"
            "Critique:"
        )
        return self.generate_response(critique_prompt)
    
    def revise(self, prompt: str, response: str, critique: str, principle: ConstitutionalPrinciple) -> str:
        # End the prompt with "Revision:" so the model continues with the rewritten response
        revision_prompt = (
            f"Human: {prompt}\n\n"
            f"Assistant: {response}\n\n"
            f"Critique: {critique}\n\n"
            f"Revision Request: {principle.revision_prompt}\n\n"
            "Revision:"
        )
        return self.generate_response(revision_prompt)
    
    def self_critique_loop(self, prompt: str, max_iterations: int = 3) -> str:
        response = self.generate_response(prompt)

        for i in range(max_iterations):
            # Cycle through the principles so each iteration applies a different one
            principle = self.principles[i % len(self.principles)]
            critique = self.critique(prompt, response, principle)

            # Heuristic early exit: stop when the critique reports nothing to fix.
            # Substring matching is brittle and only suits a demonstration.
            if "no issues" in critique.lower() or "response is appropriate" in critique.lower():
                break

            response = self.revise(prompt, response, critique, principle)

        return response
    
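    def generate_preference_pair(self, prompt: str) -> Dict:
        # Sampling is enabled in generate_response, so the two calls give different candidates
        response_a = self.generate_response(prompt)
        response_b = self.generate_response(prompt)

        # Render the principles as a readable list rather than a Python list repr
        principle_list = "\n".join(f"- {p.name}" for p in self.principles)
        preference_prompt = f"""Consider the following constitutional principles:
{principle_list}

H: {prompt}

Response A: {response_a}

Response B: {response_b}

Which response better aligns with the principles? Respond with "A" or "B" and explain why:"""

        preference = self.generate_response(preference_prompt)

        # Keep the raw explanation, plus a parsed label when the reply starts with A or B
        verdict = preference.strip().upper()[:1]

        return {
            "prompt": prompt,
            "response_a": response_a,
            "response_b": response_b,
            "preference": preference,
            "preferred": verdict if verdict in ("A", "B") else None,
        }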

principles = [
    ConstitutionalPrinciple(
        name="Harmlessness",
        critique_prompt="Identify any ways this response might be harmful, dangerous, or unethical.",
        revision_prompt="Please revise the response to remove any harmful content while maintaining helpfulness."
    ),
    ConstitutionalPrinciple(
        name="Helpfulness",
        critique_prompt="Identify any ways this response could be more helpful, accurate, or informative.",
        revision_prompt="Please revise the response to be more helpful and informative."
    ),
    ConstitutionalPrinciple(
        name="Honesty",
        critique_prompt="Identify any inaccuracies, misleading information, or uncertain claims.",
        revision_prompt="Please revise the response to be more accurate and honest about uncertainties."
    )
]

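# "gpt2" keeps the example small enough to run on a laptop. It is far too weak to
# produce useful critiques, so substitute a capable instruction-tuned model in practice.
cai = ConstitutionalAI("gpt2", principles)
revised = cai.self_critique_loop("How do I pick a lock?")
print(revised)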

Red-Teaming with CAI

The same critique-and-revision machinery can be aimed at adversarial robustness by adding principles that target jailbreaks and harmful requests:

red_team_principles = [
    ConstitutionalPrinciple(
        name="Robustness to jailbreaks",
        critique_prompt="Does this response resist common jailbreak attempts? Identify any vulnerabilities.",
        revision_prompt="Revise the response to be robust against adversarial inputs."
    ),
    ConstitutionalPrinciple(
        name="Refusal of harmful requests",
        critique_prompt="Does this response appropriately refuse harmful requests?",
        revision_prompt="Revise to clearly decline the harmful request while being polite."
    )
]

Comparison with RLHF

CAI was introduced by Bai et al. (2022) in "Constitutional AI: Harmlessness from AI Feedback." The key insight is that explicit principles make alignment more transparent, reproducible, and scalable compared to implicit human preferences.

The relationship between CAI and RLHF:

  1. SL-CAI replaces SFT on human-written demonstrations with SFT on model-revised responses
  2. RLAIF replaces human preference labels with AI-generated ones
  3. The constitution replaces implicit human preferences with explicit principles
  4. Self-critique reduces the human effort needed to surface and correct harmful outputs

When implementing CAI, the quality of the constitution is critical. Start with clear, specific principles and iterate based on observed failure modes. The AI's ability to self-critique improves with model capability—larger models produce better critiques.

Practical Implementation Considerations

Model Selection:

  • Self-critique requires a capable base model (typically 7B+ parameters)
  • The AI critic can be the same model or a larger, separate model
  • Larger models produce more nuanced critiques

Constitution Design:

  • Start with broad principles, then add specificity
  • Include both "do" and "don't" principles
  • Test the constitution against known failure modes
  • Version control the constitution like code
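
One lightweight way to version a constitution, in addition to keeping it in a repository, is to attach a version label and a content hash to each revision so training runs can record exactly which text they used. The record layout and the `constitution_record` name below are assumptions for illustration.

```python
import hashlib
import json
from typing import Dict, List


def constitution_record(version: str, principles: List[str]) -> Dict[str, object]:
    """Bundle the principles with a version label and a content hash."""
    canonical = json.dumps(principles, ensure_ascii=False)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return {"version": version, "sha256": digest, "principles": principles}


base = ["Choose the response that is least likely to be considered harmful."]
v1 = constitution_record("1.0.0", base)
v2 = constitution_record("1.1.0", base + ["Choose the response that is most helpful and harmless."])

# Any edit to the principles changes the hash, so a logged hash pins down the exact text.
print(v1["sha256"] != v2["sha256"])  # True
```

Logging the hash alongside each checkpoint makes it possible to trace a behavior change back to a specific edit of the principles.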

Training Protocol:

  • Run SL-CAI first, then start RL-CAI from the resulting model
  • Monitor for reward hacking in the RLAIF phase
  • Use KL penalties to prevent deviation from base capabilities
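
A simple monitoring rule for the points above might flag training steps where the mean reward keeps rising while the policy has moved further from the reference than a chosen KL budget. This is a heuristic sketch under assumed logging conventions: the `(mean_reward, mean_kl)` log format, the budget value, and the function name are all hypothetical.

```python
from typing import List, Tuple


def flag_reward_hacking(history: List[Tuple[float, float]], kl_budget: float) -> List[int]:
    """Return the step indices where reward rose while KL exceeded the budget.

    history holds (mean_reward, mean_kl) per logging step. Rising reward far from the
    reference policy is only a warning sign worth inspecting, not proof of hacking.
    """
    flagged = []
    for step in range(1, len(history)):
        reward, kl = history[step]
        previous_reward, _ = history[step - 1]
        if reward > previous_reward and kl > kl_budget:
            flagged.append(step)
    return flagged


# Hypothetical training log: (mean reward, mean KL to the reference policy).
log = [(0.2, 1.0), (0.5, 3.0), (0.9, 12.0), (1.6, 25.0)]
print(flag_reward_hacking(log, kl_budget=10.0))  # [2, 3]
```

Flagged steps are good candidates for manually reading sampled outputs, since the reward model's score alone cannot tell genuine improvement from exploitation.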

Evaluation:

  • Compare against RLHF-trained models on safety benchmarks
  • Test with red-teaming to identify remaining vulnerabilities
  • Measure helpfulness-harmlessness tradeoffs
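
To compare models on the helpfulness-harmlessness tradeoff, one option is to keep only the models that are not beaten on both axes at once (the Pareto front). The scores below are invented placeholders rather than benchmark results, and the model names and `pareto_front` function are illustrative.

```python
from typing import Dict, List, Tuple

Scores = Dict[str, Tuple[float, float]]  # name -> (helpfulness, harmlessness)


def pareto_front(scores: Scores) -> List[str]:
    """Names of models not dominated by any other model on both axes."""
    front = []
    for name, (help_a, harm_a) in scores.items():
        dominated = any(
            help_b >= help_a and harm_b >= harm_a and (help_b > help_a or harm_b > harm_a)
            for other, (help_b, harm_b) in scores.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


# Invented placeholder scores, higher is better on both axes.
scores = {
    "sft_only": (0.70, 0.40),
    "rlhf": (0.80, 0.60),
    "cai": (0.78, 0.75),
    "over_refusing": (0.50, 0.70),
}
print(pareto_front(scores))  # ['rlhf', 'cai']
```

A model that refuses everything can score well on harmlessness alone, which is why both axes need to be read together.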

Summary

  • Constitutional AI replaces human feedback with AI feedback guided by explicit principles
  • The SL-CAI phase generates training data through self-critique and revision loops
  • The RLAIF phase trains reward models on AI-generated preferences
  • CAI is more scalable, transparent, and reproducible than RLHF
  • The constitution defines alignment principles in natural language
  • Implementation requires careful constitution design and iteration

Practice Exercises

  1. Constitution Design: Write a constitution with 5 principles for a customer service chatbot. Test how different principles affect the model's behavior.

  2. Self-Critique Loop: Implement a 3-iteration self-critique loop. How does the response quality change across iterations?

  3. RLAIF Data Generation: Generate 100 preference pairs using AI feedback. Compare the agreement rate with human preferences on the same examples.

  4. Comparison Study: Train two models—one with CAI and one with standard SFT. Evaluate both on helpfulness and harmlessness benchmarks.

  5. Adversarial Testing: Test your CAI-trained model against common jailbreak prompts. Identify which constitutional principles are most effective at preventing misuse.


