LLM Multi-Modal: Vision-Language, Audio, and Multimodal Systems
Multimodal LLMs process and generate content across multiple data types, including text, images, audio, and video. Deploying them in production requires modality-specific encoders, a fusion strategy, and cross-modal alignment.
Multimodal Architecture
Multimodal Processing
1. Vision-Language Pipeline
```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict, List, Tuple


class Modality(Enum):
    TEXT = "text"
    IMAGE = "image"
    AUDIO = "audio"
    VIDEO = "video"


@dataclass
class MultimodalInput:
    modality: Modality
    data: Any
    metadata: Dict[str, Any]


class VisionLanguageModel:
    """Stub pipeline: encoders return placeholder values, not real features."""

    def __init__(self):
        self.image_encoder = "clip-vit-large"
        self.text_encoder = "llama-tokenizer"
        self.projection_dim = 1024
        self.max_image_tokens = 576

    def encode_image(self, image_path: str) -> Dict:
        # Placeholder output; nothing is read from image_path.
        # 576 = 24 x 24 patches for a 336 x 336 input at patch size 14.
        return {
            "embeddings": [0.1] * self.projection_dim,
            "num_patches": 576,
            "spatial_tokens": self.max_image_tokens,
        }

    def encode_text(self, text: str) -> Dict:
        # Whitespace splitting stands in for a real tokenizer.
        tokens = text.split()
        return {
            "token_ids": list(range(len(tokens))),
            "num_tokens": len(tokens),
        }

    def project_to_llm_space(self, image_embeddings: List[float],
                             text_embeddings: List[float]) -> List[float]:
        # Element-wise sum as a placeholder for a learned projection layer.
        # zip() stops at the shorter input, so the result may be truncated.
        return [a + b for a, b in zip(
            image_embeddings[:self.projection_dim],
            text_embeddings[:self.projection_dim],
        )]

    def generate_caption(self, image_path: str, max_tokens: int = 100) -> str:
        img_features = self.encode_image(image_path)
        return f"Caption for image with {img_features['num_patches']} patches"

    def visual_qa(self, image_path: str, question: str) -> str:
        img_features = self.encode_image(image_path)
        text_features = self.encode_text(question)
        # Token IDs stand in for text embeddings in this stub; a real model
        # would look up embedding vectors first. The result is not used yet.
        fused = self.project_to_llm_space(
            img_features["embeddings"],
            [float(t) for t in text_features["token_ids"]],
        )
        return f"Answer to: {question[:50]}..."

    def compute_image_token_budget(self, image_size: Tuple[int, int],
                                   patch_size: int = 14) -> int:
        # One token per non-overlapping patch; partial patches are dropped.
        h, w = image_size
        patches_h = h // patch_size
        patches_w = w // patch_size
        return patches_h * patches_w
```
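The patch arithmetic behind `compute_image_token_budget` can be checked in isolation. The sketch below is a standalone version under the same assumption of one token per non-overlapping patch; the function name `image_token_budget` and the two example resolutions are illustrative choices, not part of the pipeline above.

```python
from typing import Tuple


def image_token_budget(image_size: Tuple[int, int], patch_size: int = 14) -> int:
    """Number of non-overlapping patches (one token each) for an image."""
    if patch_size <= 0:
        raise ValueError("patch_size must be positive")
    height, width = image_size
    # Floor division: pixels that do not fill a whole patch are dropped.
    return (height // patch_size) * (width // patch_size)


budget_336 = image_token_budget((336, 336))  # 24 * 24 patches
budget_224 = image_token_budget((224, 224))  # 16 * 16 patches
print(budget_336, budget_224)
```

At patch size 14, a 336 x 336 input yields the 576 tokens used as `max_image_tokens`, while a 224 x 224 input yields 256, so input resolution directly sets how much of the context window an image consumes.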
2. Audio-Language Pipeline
```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class AudioInput:
    sample_rate: int
    duration_seconds: float
    channels: int


class AudioLanguageModel:
    """Stub pipeline: returns placeholder values, not real audio features."""

    def __init__(self):
        self.sample_rate = 16000   # Hz
        self.chunk_length = 30     # seconds per chunk
        self.embedding_dim = 1280

    def encode_audio(self, audio_path: str) -> Dict:
        # Placeholder output; nothing is read from audio_path.
        return {
            "embeddings": [0.1] * self.embedding_dim,
            "num_frames": 3000,
            "duration_ms": 5000,
        }

    def transcribe(self, audio_path: str) -> str:
        features = self.encode_audio(audio_path)
        return f"Transcription of audio ({features['duration_ms']}ms)"

    def audio_qa(self, audio_path: str, question: str) -> str:
        audio_features = self.encode_audio(audio_path)  # unused in this stub
        return f"Answer about audio: {question[:30]}..."

    def speech_to_speech(self, audio_path: str, instruction: str) -> str:
        return f"Transformed audio based on: {instruction[:30]}..."
```
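To make the relationship between chunk length, clip duration, and frame count concrete, here is a small sketch. It assumes 100 feature frames per second, which is what 3000 frames per 30-second chunk implies, and it assumes the final chunk is padded to full length. `audio_chunk_plan` and `frames_per_second` are hypothetical names, not part of the pipeline above.

```python
import math
from typing import Dict


def audio_chunk_plan(duration_seconds: float,
                     chunk_length: int = 30,
                     frames_per_second: int = 100) -> Dict[str, int]:
    """Estimate how many fixed-length chunks and frames an audio clip needs."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    # Assumption: the last chunk is padded, so the chunk count rounds up.
    num_chunks = math.ceil(duration_seconds / chunk_length)
    frames_per_chunk = chunk_length * frames_per_second
    return {
        "num_chunks": num_chunks,
        "frames_per_chunk": frames_per_chunk,
        "total_frames": num_chunks * frames_per_chunk,
    }


plan = audio_chunk_plan(75.0)
print(plan)
```

Under these assumptions a 75-second clip needs three chunks, so the encoder cost is the same as for a 90-second clip.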
Key Formulas
Cross-Modal Attention
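$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V
$$
Here,
- $Q$ = Query from target modality
- $K$ = Key from source modality
- $V$ = Value from source modality
- $d_k$ = Key dimension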
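Cross-modal attention is scaled dot-product attention in which the queries come from the target modality and the keys and values come from the source modality. The following is a minimal pure-Python sketch of softmax(QK^T / sqrt(d_k)) V for small matrices; the function names and the toy numbers are assumptions for illustration, and a real system would use a tensor library with learned projections.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(Q: Matrix, K: Matrix, V: Matrix) -> Matrix:
    """Queries from the target modality attend over source keys and values."""
    d_k = len(K[0])
    d_v = len(V[0])
    scale = math.sqrt(d_k)
    output = []
    for q in Q:
        # Scaled dot product of one query against every source key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        weights = softmax(scores)
        # Attention-weighted sum of the source values.
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(d_v)])
    return output


# Two text queries attending over three image patches (toy numbers).
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
attended = cross_modal_attention(Q, K, V)
print(attended)
```

Each output row is a convex combination of the source values, so the first query, which matches the first key most closely, pulls its output toward the first value row.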
Multimodal Fusion Score
$$
S_{\text{fusion}} = w_t S_t + w_i S_i + w_a S_a
$$
Here,
- $S_t$ = Text modality contribution
- $S_i$ = Image modality contribution
- $S_a$ = Audio modality contribution
- $w_t, w_i, w_a$ = Modality weights
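As a worked example, the sketch below treats the fusion score as a weighted sum of per-modality contributions. That form is an assumption based on the definitions above, and `fusion_score`, the dictionary keys, and all numeric values are hypothetical.

```python
from typing import Dict


def fusion_score(contributions: Dict[str, float],
                 weights: Dict[str, float]) -> float:
    """Weighted sum of per-modality contributions (assumed fusion form)."""
    missing = set(contributions) - set(weights)
    if missing:
        raise KeyError(f"no weight for modalities: {sorted(missing)}")
    return sum(weights[m] * contributions[m] for m in contributions)


# Toy values, chosen only for illustration.
contributions = {"text": 0.8, "image": 0.6, "audio": 0.4}
weights = {"text": 0.5, "image": 0.25, "audio": 0.25}
score = fusion_score(contributions, weights)
print(round(score, 3))
```

With these numbers the text term contributes 0.40, the image term 0.15, and the audio term 0.10, so raising a modality's weight shifts the score toward that modality's contribution.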
Modality Comparison
| Modality | Encoder | Token Count | Latency | Use Case |
|---|---|---|---|---|
| Text | LLaMA Tokenizer | Variable | Low | Standard LLM tasks |
| Image | CLIP ViT-L/14 | 576 tokens (336×336 input) | Medium | Image understanding |
| Audio | Whisper Encoder | ~3000 frames per 30 s chunk | Medium | Transcription, QA |
| Video | VideoCLIP | Variable | High | Video understanding |
Best Practices
- Align modalities using contrastive learning before fusion
- Use modality-specific tokenizers to preserve information
- Implement early fusion for tasks requiring deep cross-modal understanding
- Cache encoder outputs when processing the same input multiple times
- Monitor token budgets: a single image can consume hundreds of tokens (576 in the vision pipeline above), far more than a typical text prompt
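The caching practice above can be sketched as a small wrapper that keys encoder outputs by a hash of the raw input bytes. `EncoderCache` and `fake_image_encoder` are hypothetical names; the encoder here is a stand-in for an expensive forward pass, and the cache is unbounded, so a production version would need an eviction policy.

```python
import hashlib
from typing import Callable, Dict, List


class EncoderCache:
    """Caches encoder outputs keyed by a SHA-256 hash of the input bytes."""

    def __init__(self, encode_fn: Callable[[bytes], List[float]]):
        self._encode_fn = encode_fn
        self._store: Dict[str, List[float]] = {}
        self.hits = 0
        self.misses = 0

    def encode(self, data: bytes) -> List[float]:
        key = hashlib.sha256(data).hexdigest()
        if key in self._store:
            self.hits += 1
        else:
            # Only run the encoder for inputs that have not been seen before.
            self.misses += 1
            self._store[key] = self._encode_fn(data)
        return self._store[key]


def fake_image_encoder(data: bytes) -> List[float]:
    # Hypothetical stand-in for an expensive encoder forward pass.
    return [len(data) / 100.0] * 4


cache = EncoderCache(fake_image_encoder)
first = cache.encode(b"same image bytes")
second = cache.encode(b"same image bytes")  # served from the cache
other = cache.encode(b"different image")
print(cache.hits, cache.misses)
```

Hashing the content rather than the file path means that the same image uploaded under two names is still encoded only once.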
Production Considerations
Resource Requirements by Modality
| Modality | GPU Memory | CPU Overhead | Storage | Network |
|---|---|---|---|---|
| Text Only | 4-8 GB | Low | Minimal | Low |
| Text + Image | 16-32 GB | Medium | High (images) | Medium |
| Text + Audio | 8-16 GB | Medium | Medium | Medium |
| Text + Video | 32-80 GB | High | Very High | High |
Latency Breakdown
| Component | Text | Image | Audio | Video |
|---|---|---|---|---|
| Encoding | 5ms | 50ms | 100ms | 500ms |
| Projection | 1ms | 5ms | 5ms | 10ms |
| LLM Inference | 100ms | 120ms | 110ms | 200ms |
| Decoding | 50ms | 200ms | 150ms | 1000ms |
| Total | 156ms | 375ms | 365ms | 1710ms |
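The totals in the latency table are the sum of the four components per modality. The short sketch below recomputes them from the table's figures and reports which component dominates; the dictionary layout and function names are illustrative assumptions.

```python
from typing import Dict

# Per-component latency in milliseconds, copied from the table above.
LATENCY_MS: Dict[str, Dict[str, int]] = {
    "text": {"encoding": 5, "projection": 1, "llm_inference": 100, "decoding": 50},
    "image": {"encoding": 50, "projection": 5, "llm_inference": 120, "decoding": 200},
    "audio": {"encoding": 100, "projection": 5, "llm_inference": 110, "decoding": 150},
    "video": {"encoding": 500, "projection": 10, "llm_inference": 200, "decoding": 1000},
}


def total_latency_ms(modality: str) -> int:
    """End-to-end latency as the sum of all pipeline components."""
    return sum(LATENCY_MS[modality].values())


def dominant_component(modality: str) -> str:
    """Name of the component with the largest latency for a modality."""
    components = LATENCY_MS[modality]
    return max(components, key=components.get)


for name in LATENCY_MS:
    print(name, total_latency_ms(name), dominant_component(name))
```

In these figures LLM inference dominates only for text; for image, audio, and video the decoding step is the largest component, which suggests where optimization effort would pay off first.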