Constitutional AI — Training LLMs with Principles, Not Just Data

Constitutional AI replaces much of the human feedback used in alignment with AI-driven self-critique guided by explicit principles, with the aim of making alignment more scalable, transparent, and reproducible.

Self-Critique Loop: The model critiques and revises its own outputs against a set of constitutional principles
RLAIF: Reinforcement learning from AI feedback reduces the need for expensive human labelers
Two-Phase Training: Supervised learning on revised responses, followed by RL on AI-generated preferences

"Explicit principles make alignment transparent and reproducible compared to implicit human preferences."

Constitutional AI

Constitutional AI (CAI) is a method developed by Anthropic for aligning language models to human values without relying on extensive human feedback. It uses a set of principles (a "constitution") to guide the model's self-critique and revision process, making alignment more scalable and transparent.

In short, CAI is an alignment framework in which an AI system is trained to follow a set of explicit principles (a constitution) through a two-phase process: (1) supervised learning from self-critique and revision, and (2) reinforcement learning from AI feedback (RLAIF) rather than human feedback.

Motivation

Traditional RLHF has scalability limitations:

Human feedback is expensive and slow to collect
Human preferences can be inconsistent
Red-teaming requires human effort
Safety guidelines are difficult to codify in reward models

CAI addresses these by using AI itself to provide feedback based on explicit principles.

The Constitution

A constitution is a set of natural language principles that guide model behavior. Example principles:

1. Choose the response that is least likely to be considered harmful.
2. Choose the response that is most helpful and harmless.
3. Choose the response that is most ethical and least likely to cause harm.
4. Choose the response that is most aligned with the values of a helpful assistant.
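
A constitution like the one above can be stored as plain data and sampled one principle at a time. The sketch below is illustrative only: `CONSTITUTION`, `sample_principle`, and `build_comparison_prompt` are hypothetical names, and the prompt layout is an assumption rather than the format of any particular paper or library.

```python
import random
from typing import List, Optional

# Principles copied from the list above. All names here are illustrative.
CONSTITUTION: List[str] = [
    "Choose the response that is least likely to be considered harmful.",
    "Choose the response that is most helpful and harmless.",
    "Choose the response that is most ethical and least likely to cause harm.",
    "Choose the response that is most aligned with the values of a helpful assistant.",
]


def sample_principle(constitution: List[str],
                     rng: Optional[random.Random] = None) -> str:
    """Pick one principle uniformly at random for a single comparison."""
    rng = rng or random.Random()
    return rng.choice(constitution)


def build_comparison_prompt(principle: str, prompt: str,
                            response_a: str, response_b: str) -> str:
    """Format a pairwise comparison request around one sampled principle."""
    return (
        f"Human: {prompt}\n\n"
        f"{principle}\n\n"
        f"(A) {response_a}\n"
        f"(B) {response_b}\n\n"
        "Answer with (A) or (B):"
    )


principle = sample_principle(CONSTITUTION, random.Random(0))
print(build_comparison_prompt(principle, "How do I pick a lock?", "first draft", "second draft"))
```

Sampling a single principle per call keeps each prompt short and spreads the constitution's coverage across the dataset; passing the whole list every time is an equally valid design.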

Phase 1: Supervised Learning from Self-Critique (SL-CAI)

The SL-CAI phase generates training data through a self-critique loop:

Step 1: Generate initial responses

Sample a prompt from the training data
Generate an initial response from the starting model (in Bai et al., 2022, a helpful-only RLHF model rather than a raw pretrained model)

Step 2: Self-critique

Ask the model to critique its own response against the constitution
"Identify specific ways in which the response might violate the principle: [principle]"

Step 3: Revision

Ask the model to revise its response based on the critique
"Please rewrite the response to address the issues identified above"

Step 4: Collect revised responses

Use the revised responses as supervised training data
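
The four steps above can be sketched as a small data-generation routine. This is a minimal sketch under stated assumptions: `generate` is a hypothetical callable (prompt in, completion out) standing in for whatever model call is available, and the prompt templates simply reuse the two quoted instructions from the steps.

```python
from typing import Callable, Dict, List

# Hypothetical interface: any function mapping a prompt string to a completion.
Generate = Callable[[str], str]


def build_sl_cai_example(prompt: str, principle: str, generate: Generate) -> Dict[str, str]:
    """Run one critique-and-revision pass and return a supervised training pair."""
    # Step 1: initial response
    initial = generate(prompt)
    # Step 2: self-critique against one principle
    critique = generate(
        f"{prompt}\n\nResponse: {initial}\n\n"
        f"Identify specific ways in which the response might violate the principle: {principle}"
    )
    # Step 3: revision conditioned on the critique
    revised = generate(
        f"{prompt}\n\nResponse: {initial}\n\nCritique: {critique}\n\n"
        "Please rewrite the response to address the issues identified above"
    )
    # Step 4: only the prompt and the revised response are kept for training
    return {"prompt": prompt, "completion": revised}


def build_sl_cai_dataset(prompts: List[str], principle: str,
                         generate: Generate) -> List[Dict[str, str]]:
    """Apply the loop to every prompt and collect the training pairs."""
    return [build_sl_cai_example(p, principle, generate) for p in prompts]
```

Note that the critique itself is discarded: the supervised target is the revised response conditioned on the original prompt alone.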

SL-CAI Loss Function

$$\mathcal{L}_{\text{SL-CAI}} = -\sum_{i=1}^{N} \log P_\theta(y_i^{\text{revised}} \mid x_i)$$

Here,

$y_i^{\text{revised}}$ = revised response after self-critique
$x_i$ = input prompt
$\theta$ = model parameters
$N$ = number of training examples

The model is fine-tuned on the revised responses using standard supervised learning.
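
For an autoregressive model, $\log P_\theta(y \mid x)$ decomposes into a sum of per-token log-probabilities over the response tokens, with the prompt tokens conditioned on but not scored. The sketch below shows that masking in plain Python; `sl_cai_loss` and its argument layout are assumptions for illustration, and a real training loop would compute the same quantity on tensors.

```python
from typing import Sequence


def sl_cai_loss(token_log_probs: Sequence[Sequence[float]],
                response_mask: Sequence[Sequence[int]]) -> float:
    """Negative log-likelihood summed over response tokens only.

    token_log_probs[i][t] is log P_theta(token t | earlier tokens) for example i.
    response_mask[i][t] is 1 for tokens of the revised response and 0 for
    prompt tokens, so x_i is conditioned on but never scored.
    """
    total = 0.0
    for log_probs, mask in zip(token_log_probs, response_mask):
        if len(log_probs) != len(mask):
            raise ValueError("log-probs and mask must have the same length")
        total -= sum(lp for lp, m in zip(log_probs, mask) if m)
    return total


# One example with one prompt token followed by two response tokens.
loss = sl_cai_loss([[-0.1, -0.2, -0.3]], [[0, 1, 1]])
print(loss)  # only the two response tokens contribute
```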

Phase 2: Reinforcement Learning from AI Feedback (RLAIF)

In the RLAIF phase, AI-generated preferences replace human preferences:

Step 1: Generate response pairs

For each prompt, generate two candidate responses

Step 2: AI preference labeling

Ask the model (or a separate model) to choose which response is better according to the constitution
"Considering the following principles: [constitution], which response is better?"

Step 3: Train reward model

Train a reward model on the AI-generated preferences

Step 4: PPO optimization

Use PPO to optimize the language model against the learned reward
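
Step 3 does not say which loss the reward model uses. A common choice in RLHF-style pipelines is the pairwise Bradley-Terry loss, $-\log \sigma(r_{\text{chosen}} - r_{\text{rejected}})$, and the sketch below assumes it. The function name and the list-of-floats interface are hypothetical; the rewards would normally come from the reward model's scalar head.

```python
import math
from typing import Sequence


def preference_loss(chosen_rewards: Sequence[float],
                    rejected_rewards: Sequence[float]) -> float:
    """Mean of -log(sigmoid(r_chosen - r_rejected)) over preference pairs."""
    if len(chosen_rewards) != len(rejected_rewards) or not chosen_rewards:
        raise ValueError("need two non-empty reward lists of equal length")
    total = 0.0
    for r_chosen, r_rejected in zip(chosen_rewards, rejected_rewards):
        margin = r_chosen - r_rejected
        # softplus(-margin) equals -log(sigmoid(margin)); this form avoids
        # overflow when |margin| is large.
        total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return total / len(chosen_rewards)


# The loss shrinks as the reward model ranks the AI-preferred response higher.
print(preference_loss([2.0], [0.0]))
print(preference_loss([0.0], [2.0]))
```

Minimizing this loss pushes the reward of the AI-preferred response above that of the rejected one, which is all PPO needs from the reward model in Step 4.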

RL-CAI Objective

$$\max_{\pi_\theta} \mathbb{E}_{x \sim D,\, y \sim \pi_\theta(\cdot \mid x)} \left[ R_\phi(x, y) - \beta \, \text{KL}(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x)) \right]$$

Here,

$\pi_\theta$ = policy being optimized
$\pi_{\text{ref}}$ = reference policy (initial model)
$R_\phi$ = reward model trained on AI preferences
$\beta$ = KL penalty coefficient
$D$ = distribution of prompts
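
Because $\text{KL}(\pi_\theta \,\|\, \pi_{\text{ref}}) = \mathbb{E}_{y \sim \pi_\theta}[\log \pi_\theta(y \mid x) - \log \pi_{\text{ref}}(y \mid x)]$, the log-ratio of a single sampled response is an unbiased estimate of the KL term. The sketch below computes the resulting per-sample training signal; `kl_penalized_reward` is a hypothetical helper, and PPO implementations often apply the penalty per token rather than per sequence as done here.

```python
def kl_penalized_reward(reward: float, logp_policy: float,
                        logp_ref: float, beta: float) -> float:
    """Single-sample estimate of R(x, y) - beta * KL(pi_theta || pi_ref).

    logp_policy and logp_ref are the sequence log-probabilities log pi(y | x)
    of the same response y, which must have been sampled from the policy.
    """
    return reward - beta * (logp_policy - logp_ref)


# A response the policy likes more than the reference does is penalized.
print(kl_penalized_reward(reward=1.0, logp_policy=-2.0, logp_ref=-3.0, beta=0.1))
```

A larger $\beta$ keeps the policy closer to the reference model at the cost of slower reward improvement.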

CAI Training Objective

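The two phases can be summarized schematically as one weighted loss. In Bai et al. (2022) they are run sequentially (SL-CAI first, then RL-CAI initialized from the SL-CAI model) rather than optimized jointly, so the expression below is a conceptual summary, not the training recipe:

$$\mathcal{L}_{\text{CAI}} = \mathcal{L}_{\text{SL-CAI}} + \alpha \cdot \mathcal{L}_{\text{RL-CAI}}$$

Here,

$\mathcal{L}_{\text{SL-CAI}}$ = supervised loss on revised responses
$\mathcal{L}_{\text{RL-CAI}}$ = negative of the RL-CAI objective above, so that minimizing it maximizes the KL-penalized reward
$\alpha$ = weighting coefficient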

RLAIF vs RLHF

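In Bai et al. (2022), models trained with RLAIF were judged at least as harmless as RLHF baselines at comparable helpfulness, without any human labels for harmlessness (human feedback was still used for helpfulness). The AI critic provides scalable preference signals tied to explicit principles, although those signals inherit the critic model's own biases.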

Aspect	RLHF	RLAIF
Feedback source	Human annotators	AI model
Cost	High ($15-25/hour per annotator)	Low (compute only)
Consistency	Variable inter-annotator agreement	Consistent within model
Scalability	Limited by annotator pool	Virtually unlimited
Transparency	Implicit preferences	Explicit constitutional principles
Bias	Human biases	Model biases

Implementation Example

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from dataclasses import dataclass
from typing import List, Dict

@dataclass
class ConstitutionalPrinciple:
    name: str
    critique_prompt: str
    revision_prompt: str

class ConstitutionalAI:
    def __init__(self, model_name: str, principles: List[ConstitutionalPrinciple]):
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.principles = principles
        self.device = next(self.model.parameters()).device
    
    def generate_response(self, prompt: str, max_new_tokens: int = 256) -> str:
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)
        with torch.no_grad():
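            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                temperature=0.7,
                do_sample=True,
                # GPT-2 style tokenizers define no pad token; reuse EOS for padding
                pad_token_id=self.tokenizer.eos_token_id,
            )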
        return self.tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    
    def critique(self, prompt: str, response: str, principle: ConstitutionalPrinciple) -> str:
        critique_prompt = f"""Human: {prompt}
        
Assistant: {response}

{principle.critique_prompt}

Human: Please provide your critique:"""
        return self.generate_response(critique_prompt)
    
    def revise(self, prompt: str, response: str, critique: str, principle: ConstitutionalPrinciple) -> str:
        revision_prompt = f"""Human: {prompt}
        
Assistant: {response}

Critique: {critique}

{principle.revision_prompt}

Human: Please provide the revised response:"""
        return self.generate_response(revision_prompt)
    
    def self_critique_loop(self, prompt: str, max_iterations: int = 3) -> str:
        response = self.generate_response(prompt)
        
        for i in range(max_iterations):
            principle = self.principles[i % len(self.principles)]
            critique = self.critique(prompt, response, principle)
            
            if "no issues" in critique.lower() or "response is appropriate" in critique.lower():
                break
            
            response = self.revise(prompt, response, critique, principle)
        
        return response
    
    def generate_preference_pair(self, prompt: str) -> Dict:
        response_a = self.generate_response(prompt)
        response_b = self.generate_response(prompt)
        
        preference_prompt = f"""Consider the following constitutional principles:
{', '.join(p.name for p in self.principles)}

Human: {prompt}

Response A: {response_a}

Response B: {response_b}

Which response better aligns with the principles? Respond with "A" or "B" and explain why:"""
        
        preference = self.generate_response(preference_prompt)
        
        return {
            "prompt": prompt,
            "response_a": response_a,
            "response_b": response_b,
            "preference": preference
        }

principles = [
    ConstitutionalPrinciple(
        name="Harmlessness",
        critique_prompt="Identify any ways this response might be harmful, dangerous, or unethical.",
        revision_prompt="Please revise the response to remove any harmful content while maintaining helpfulness."
    ),
    ConstitutionalPrinciple(
        name="Helpfulness",
        critique_prompt="Identify any ways this response could be more helpful, accurate, or informative.",
        revision_prompt="Please revise the response to be more helpful and informative."
    ),
    ConstitutionalPrinciple(
        name="Honesty",
        critique_prompt="Identify any inaccuracies, misleading information, or uncertain claims.",
        revision_prompt="Please revise the response to be more accurate and honest about uncertainties."
    )
]

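# "gpt2" keeps the example small enough to run locally; it is too weak to
# produce useful critiques, so substitute an instruction-tuned model in practice.
cai = ConstitutionalAI("gpt2", principles)
revised = cai.self_critique_loop("How do I pick a lock?")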
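
The `preference` field returned by `generate_preference_pair` is free text, so it has to be reduced to a label before it can train a reward model. The helper below is one possible sketch: `parse_preference` is a hypothetical function, and it assumes the judge's reply begins with its choice, as the prompt requests. Replies that do not start with a clear "A" or "B" return `None` and can be discarded.

```python
import re
from typing import Optional


def parse_preference(text: str) -> Optional[str]:
    """Return "A" or "B" if the reply starts with a clear choice, else None."""
    match = re.match(r"""\s*(?:Response\s+)?["'(]?([AB])\b""", text, flags=re.IGNORECASE)
    return match.group(1).upper() if match else None


print(parse_preference("B, because it declines the unsafe part politely."))
print(parse_preference("Both responses are acceptable."))
```

Dropping ambiguous replies is a conservative design choice; it costs some data but avoids training the reward model on guessed labels.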

Red-Teaming with CAI

The same critique-and-revision loop can support red-teaming by adding principles that target adversarial inputs:

red_team_principles = [
    ConstitutionalPrinciple(
        name="Robustness to jailbreaks",
        critique_prompt="Does this response resist common jailbreak attempts? Identify any vulnerabilities.",
        revision_prompt="Revise the response to be robust against adversarial inputs."
    ),
    ConstitutionalPrinciple(
        name="Refusal of harmful requests",
        critique_prompt="Does this response appropriately refuse harmful requests?",
        revision_prompt="Revise to clearly decline the harmful request while being polite."
    )
]

Comparison with RLHF

CAI was introduced by Bai et al. (2022) in "Constitutional AI: Harmlessness from AI Feedback." The key insight is that explicit principles make alignment more transparent, reproducible, and scalable compared to implicit human preferences.

The relationship between CAI and RLHF:

SL-CAI replaces SFT on human-written demonstrations
RLAIF replaces RLHF with AI-generated preferences
Constitution replaces implicit human preferences with explicit principles
Self-critique takes over much of the human review of red-team outputs

When implementing CAI, the quality of the constitution is critical. Start with clear, specific principles and iterate based on observed failure modes. The AI's ability to self-critique improves with model capability: larger models tend to produce better critiques.

Practical Implementation Considerations

Model Selection:

Self-critique requires a capable starting model (7B+ parameters is a rough rule of thumb, not a hard threshold)
The AI critic can be the same model or a larger, separate model
Larger models produce more nuanced critiques

Constitution Design:

Start with broad principles, then add specificity
Include both "do" and "don't" principles
Test the constitution against known failure modes
Version control the constitution like code

Training Protocol:

Run SL-CAI first, then initialize the RL-CAI phase from the SL-CAI model
Monitor for reward hacking in the RLAIF phase
Use KL penalties to limit drift from the reference policy and preserve base capabilities

Evaluation:

Compare against RLHF-trained models on safety benchmarks
Test with red-teaming to identify remaining vulnerabilities
Measure helpfulness-harmlessness tradeoffs

Summary

Constitutional AI replaces much of the human feedback used for alignment with AI feedback guided by explicit principles
The SL-CAI phase generates training data through self-critique and revision loops
The RLAIF phase trains reward models on AI-generated preferences
CAI aims to be more scalable, transparent, and reproducible than RLHF
The constitution defines alignment principles in natural language
Implementation requires careful constitution design and iteration

Practice Exercises

Constitution Design: Write a constitution with 5 principles for a customer service chatbot. Test how different principles affect the model's behavior.
Self-Critique Loop: Implement a 3-iteration self-critique loop. How does the response quality change across iterations?
RLAIF Data Generation: Generate 100 preference pairs using AI feedback. Compare the agreement rate with human preferences on the same examples.
Comparison Study: Train two models, one with CAI and one with standard SFT. Evaluate both on helpfulness and harmlessness benchmarks.
Adversarial Testing: Test your CAI-trained model against common jailbreak prompts. Identify which constitutional principles are most effective at preventing misuse.

What to Learn Next

-> RLHF and Alignment The foundation of alignment techniques that Constitutional AI improves upon.

-> LLM Safety and Red Teaming Systematic methods for finding and fixing vulnerabilities in aligned models.

-> Fine-Tuning LLMs The broader fine-tuning techniques that underpin Constitutional AI training.

-> Instruction Tuning Teaching LLMs to follow instructions through structured data.

-> Pretraining Language Models Understanding the pre-training phase before alignment techniques are applied.

-> Building Production LLM Applications Deploying aligned models reliably in production environments.
