Multilingual NLP
Multilingual NLP enables language technology to work across hundreds of languages, addressing the fact that most of the world's languages lack sufficient resources for monolingual systems.
The Multilingual Challenge
| Challenge | Description | Impact |
|---|---|---|
| Data imbalance | English dominates training data | Poor low-resource performance |
| Script diversity | Different writing systems | Requires script-agnostic models |
| Morphological variation | Agglutinative vs isolating | Tokenization challenges |
| Linguistic distance | Diverse syntax and semantics | Transfer learning limits |
Multilingual Model Architecture
mBERT (Multilingual BERT)
mBERT is trained on Wikipedia text in 104 languages with a shared vocabulary.
DfCross-Lingual Transfer
Given a task in high-resource language and a target low-resource language , mBERT enables zero-shot transfer:
without any labeled data in , relying on shared multilingual representations.
from transformers import BertTokenizer, BertForSequenceClassification
import torch
class MultilingualClassifier:
def __init__(self, model_name="bert-base-multilingual-cased", num_labels=3):
self.tokenizer = BertTokenizer.from_pretrained(model_name)
self.model = BertForSequenceClassification.from_pretrained(
model_name, num_labels=num_labels
)
self.label_map = {0: "negative", 1: "neutral", 2: "positive"}
def predict(self, text, language=None):
inputs = self.tokenizer(
text,
return_tensors="pt",
padding=True,
truncation=True,
max_length=512,
)
with torch.no_grad():
outputs = self.model(**inputs)
probs = torch.softmax(outputs.logits, dim=-1)
predicted_class = probs.argmax(dim=-1).item()
confidence = probs[0, predicted_class].item()
return {
"text": text,
"language": language,
"label": self.label_map[predicted_class],
"confidence": confidence,
"probabilities": {
self.label_map[i]: probs[0, i].item()
for i in range(len(self.label_map))
},
}
# Zero-shot cross-lingual transfer
classifier = MultilingualClassifier()
# English training data
texts_en = [
("This product is amazing!", 2),
("Terrible experience.", 0),
("It's okay, nothing special.", 1),
]
# Test in different languages (zero-shot)
test_texts = [
("Ce produit est incroyable!", "French"),
("Dieses Produkt ist erstaunlich!", "German"),
("Este producto es increíble!", "Spanish"),
("この製品は素晴らしい!", "Japanese"),
]
for text, lang in test_texts:
result = classifier.predict(text, language=lang)
print(f"[{lang}] {result['label']} ({result['confidence']:.2f})")
XLM-RoBERTa
XLM-RoBERTa (XLM-R) is pre-trained on 2.5TB of CommonCrawl data across 100 languages, significantly outperforming mBERT.
| Model | Languages | Pre-training Data | XNLI Accuracy | Parameters |
|---|---|---|---|---|
| mBERT | 104 | Wikipedia | 72.1 | 110M |
| XLM-R Base | 100 | 2.5TB CC-100 | 77.8 | 270M |
| XLM-R Large | 100 | 2.5TB CC-100 | 83.7 | 550M |
DfXLM-R Pre-training Objective
XLM-R uses masked language modeling across all languages simultaneously:
The key difference from mBERT is the removal of the cross-lingual translation language modeling (TLM) objective, relying instead on large-scale monolingual data.
from transformers import XLMRobertaTokenizer, XLMRobertaForSequenceClassification
import torch
class XLMRClassifier:
def __init__(self, model_name="xlm-roberta-base", num_labels=6):
self.tokenizer = XLMRobertaTokenizer.from_pretrained(model_name)
self.model = XLMRobertaForSequenceClassification.from_pretrained(
model_name, num_labels=num_labels
)
self.languages = ["en", "de", "fr", "es", "zh", "ja", "ko", "ar", "hi", "ru"]
def predict_multilingual(self, texts_by_language):
"""Predict across multiple languages."""
results = {}
for lang, texts in texts_by_language.items():
lang_results = []
for text in texts:
inputs = self.tokenizer(
text,
return_tensors="pt",
padding=True,
truncation=True,
max_length=512,
)
with torch.no_grad():
outputs = self.model(**inputs)
probs = torch.softmax(outputs.logits, dim=-1)
lang_results.append({
"text": text,
"prediction": probs.argmax(dim=-1).item(),
"confidence": probs.max().item(),
})
results[lang] = lang_results
return results
def compute_zero_shot_transfer(self, labeled_data_en, unlabeled_data_target):
"""Evaluate zero-shot transfer from English to target language."""
# Fine-tune on English
self.train_on_english(labeled_data_en)
# Evaluate on target language (no target language training data)
predictions = self.predict_multilingual({target_lang: unlabeled_data_target})
# Compare with English baseline
return {
"transfer_gap": self.compute_gap(predictions),
"per_class_transfer": self.compute_per_class_gap(predictions),
}
# XLM-R fine-tuning for NER
from transformers import XLMRobertaForTokenClassification
class MultilingualNER:
def __init__(self, model_name="xlm-roberta-base", num_labels=9):
self.tokenizer = XLMRobertaTokenizer.from_pretrained(model_name)
self.model = XLMRobertaForTokenClassification.from_pretrained(
model_name, num_labels=num_labels
)
self.label_list = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]
def predict_entities(self, text, language="en"):
inputs = self.tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
offset_mapping = inputs.pop("offset_mapping")[0]
with torch.no_grad():
outputs = self.model(**inputs)
predictions = torch.argmax(outputs.logits, dim=-1)[0]
entities = []
current_entity = None
for i, (pred, offset) in enumerate(zip(predictions, offset_mapping)):
label = self.label_list[pred.item()]
if label.startswith("B-"):
if current_entity:
entities.append(current_entity)
current_entity = {
"type": label[2:],
"start": offset[0].item(),
"end": offset[1].item(),
"text": text[offset[0]:offset[1]],
}
elif label.startswith("I-") and current_entity and current_entity["type"] == label[2:]:
current_entity["end"] = offset[1].item()
current_entity["text"] = text[current_entity["start"]:offset[1].item()]
else:
if current_entity:
entities.append(current_entity)
current_entity = None
if current_entity:
entities.append(current_entity)
return entities
Cross-Lingual Transfer Strategies
| Strategy | Description | Performance |
|---|---|---|
| Zero-shot | Train on English, test on target | Baseline |
| Translate-train | Translate target data to English | Moderate improvement |
| Translate-test | Translate test data to English | Moderate improvement |
| Multi-task | Train jointly on multiple languages | Significant improvement |
| Adapter-based | Language-specific adapters | Best performance |
Language Family Transfer
| Source Languages | Target Language Family | Transfer Quality |
|---|---|---|
| English + German | Germanic | Excellent |
| English + French | Romance | Excellent |
| English + Chinese | Sino-Tibetan | Good |
| English + Arabic | Semitic | Good |
| English + Hindi | Indo-Aryan | Moderate |
Best Practices
- Use XLM-R over mBERT - It significantly outperforms mBERT on most multilingual tasks
- Include language tags - Explicit language identification helps model routing
- Balance training data - Oversample low-resource languages during training
- Evaluate on multiple languages - Ensure robust cross-lingual transfer
- Consider script differences - Some scripts require specialized tokenization
Key Takeaways
- XLM-R achieves near-state-of-art on multilingual benchmarks across 100 languages
- Zero-shot cross-lingual transfer enables NLP for low-resource languages without labeled data
- Language family proximity affects transfer quality between related languages
- Multi-task training on multiple languages improves overall performance
- Script diversity requires careful tokenizer design and preprocessing