
Multilingual NLP

Multilingual NLP enables language technology to work across hundreds of languages, addressing the fact that most of the world's languages lack sufficient resources for monolingual systems.

The Multilingual Challenge

| Challenge | Description | Impact |
|---|---|---|
| Data imbalance | English dominates training data | Poor low-resource performance |
| Script diversity | Different writing systems | Requires script-agnostic models |
| Morphological variation | Agglutinative vs isolating | Tokenization challenges |
| Linguistic distance | Diverse syntax and semantics | Transfer learning limits |
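To make the script-diversity row concrete, here is a minimal sketch that guesses the dominant writing system of a string using only the Python standard library. The function name `dominant_script` and the sample sentences are illustrative assumptions, and the heuristic (taking the first word of each character's Unicode name) is a rough approximation, not a full script detector.

```python
import unicodedata
from collections import Counter

def dominant_script(text: str) -> str:
    """Return the most common script among the letters in `text`.

    Heuristic: the first word of a character's Unicode name is usually its
    script, e.g. "LATIN SMALL LETTER A" or "HIRAGANA LETTER KO".
    """
    counts = Counter()
    for ch in text:
        if ch.isalpha():  # skip digits, punctuation, spaces, combining marks
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"

# Illustrative sentences in five different scripts
samples = {
    "English": "This product is amazing!",
    "Russian": "Этот продукт потрясающий!",
    "Arabic": "هذا المنتج رائع",
    "Hindi": "यह उत्पाद अद्भुत है",
    "Japanese": "この製品は素晴らしい!",
}

for language, text in samples.items():
    print(f"{language}: {dominant_script(text)}")
```

A multilingual tokenizer has to cover all of these scripts with one vocabulary, which is why shared subword vocabularies are central to the models in this lesson.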

Multilingual Model Architecture


mBERT (Multilingual BERT)

mBERT is trained on Wikipedia text in 104 languages with a single WordPiece vocabulary shared across all of them.

Definition: Cross-Lingual Transfer

Given labeled task data in a high-resource language $L_h$ and a low-resource target language $L_l$, an mBERT model fine-tuned on $L_h$ enables zero-shot transfer:

$$P(y \mid x_{L_l}) \approx P(y \mid \text{BERT}(x_{L_l}))$$

without any labeled data in $L_l$, relying on shared multilingual representations.

from transformers import BertTokenizer, BertForSequenceClassification
import torch

class MultilingualClassifier:
    def __init__(self, model_name="bert-base-multilingual-cased", num_labels=3):
        self.tokenizer = BertTokenizer.from_pretrained(model_name)
        self.model = BertForSequenceClassification.from_pretrained(
            model_name, num_labels=num_labels
        )
        self.label_map = {0: "negative", 1: "neutral", 2: "positive"}

    def predict(self, text, language=None):
        inputs = self.tokenizer(
            text,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=512,
        )

        with torch.no_grad():
            outputs = self.model(**inputs)
            probs = torch.softmax(outputs.logits, dim=-1)

        predicted_class = probs.argmax(dim=-1).item()
        confidence = probs[0, predicted_class].item()

        return {
            "text": text,
            "language": language,
            "label": self.label_map[predicted_class],
            "confidence": confidence,
            "probabilities": {
                self.label_map[i]: probs[0, i].item()
                for i in range(len(self.label_map))
            },
        }

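# Zero-shot cross-lingual transfer
# Note: the classification head is randomly initialized, so fine-tune on the
# English data below before trusting the predictions printed at the end.
classifier = MultilingualClassifier()

# English training data (for fine-tuning; the training loop is omitted here)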
texts_en = [
    ("This product is amazing!", 2),
    ("Terrible experience.", 0),
    ("It's okay, nothing special.", 1),
]

# Test in different languages (zero-shot)
test_texts = [
    ("Ce produit est incroyable!", "French"),
    ("Dieses Produkt ist erstaunlich!", "German"),
    ("Este producto es increíble!", "Spanish"),
    ("この製品は素晴らしい!", "Japanese"),
]

for text, lang in test_texts:
    result = classifier.predict(text, language=lang)
    print(f"[{lang}] {result['label']} ({result['confidence']:.2f})")

XLM-RoBERTa

XLM-RoBERTa (XLM-R) is pre-trained on 2.5TB of CommonCrawl data across 100 languages, significantly outperforming mBERT.

| Model | Languages | Pre-training Data | XNLI Accuracy (%) | Parameters |
|---|---|---|---|---|
| mBERT | 104 | Wikipedia | 72.1 | 110M |
| XLM-R Base | 100 | 2.5TB CC-100 | 77.8 | 270M |
| XLM-R Large | 100 | 2.5TB CC-100 | 83.7 | 550M |

Definition: XLM-R Pre-training Objective

XLM-R uses masked language modeling across all languages simultaneously:

$$\mathcal{L}_{\text{MLM}} = -\sum_{i \in \mathcal{M}} \log P(x_i \mid x_{\backslash \mathcal{M}}; \theta)$$

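XLM-R drops the translation language modeling (TLM) objective of its predecessor XLM, which requires parallel sentences, and relies instead on large-scale monolingual data. mBERT never used TLM; compared with mBERT, the key differences are the much larger CommonCrawl corpus and a larger shared SentencePiece vocabulary.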
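The masked set $\mathcal{M}$ in the objective can be illustrated with a small, framework-free sketch of how one training example might be corrupted. The 15% selection rate and the 80/10/10 replacement split follow the original BERT recipe and are assumptions here, as are the names `mask_tokens` and `MASK_TOKEN`; this is not XLM-R's actual data pipeline.

```python
import random

MASK_TOKEN = "<mask>"  # assumed mask symbol for this sketch

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Return (corrupted_tokens, labels) for one MLM training example.

    labels[i] holds the original token at positions selected for prediction
    (the set M in the loss) and None everywhere else.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected, so it contributes no loss
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK_TOKEN         # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the original token in place
    return corrupted, labels

tokens = ["▁Multi", "lingual", "▁models", "▁share", "▁one", "▁vocabulary"]
corrupted, labels = mask_tokens(tokens, vocab=tokens, mask_prob=0.15, seed=0)
print(corrupted)
print(labels)
```

The loss is then the negative log-probability of each non-`None` label given the corrupted sequence, summed over the selected positions.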

from transformers import XLMRobertaTokenizer, XLMRobertaForSequenceClassification
import torch

class XLMRClassifier:
    def __init__(self, model_name="xlm-roberta-base", num_labels=6):
        self.tokenizer = XLMRobertaTokenizer.from_pretrained(model_name)
        self.model = XLMRobertaForSequenceClassification.from_pretrained(
            model_name, num_labels=num_labels
        )
        self.languages = ["en", "de", "fr", "es", "zh", "ja", "ko", "ar", "hi", "ru"]

    def predict_multilingual(self, texts_by_language):
        """Predict across multiple languages."""
        results = {}

        for lang, texts in texts_by_language.items():
            lang_results = []
            for text in texts:
                inputs = self.tokenizer(
                    text,
                    return_tensors="pt",
                    padding=True,
                    truncation=True,
                    max_length=512,
                )

                with torch.no_grad():
                    outputs = self.model(**inputs)
                    probs = torch.softmax(outputs.logits, dim=-1)

                lang_results.append({
                    "text": text,
                    "prediction": probs.argmax(dim=-1).item(),
                    "confidence": probs.max().item(),
                })

            results[lang] = lang_results

        return results

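    def compute_zero_shot_transfer(self, eval_en, eval_target, target_lang):
        """Measure the zero-shot transfer gap from English to target_lang.

        Assumes self.model has already been fine-tuned on English data only.
        eval_en and eval_target are lists of (text, label_id) pairs; the target
        labels are used for evaluation only, never for training.
        """
        def accuracy(lang, pairs):
            texts = [text for text, _ in pairs]
            preds = self.predict_multilingual({lang: texts})[lang]
            correct = sum(p["prediction"] == label for p, (_, label) in zip(preds, pairs))
            return correct / len(pairs)

        acc_en = accuracy("en", eval_en)
        acc_target = accuracy(target_lang, eval_target)

        # The gap is how much accuracy is lost when moving off English
        return {
            "english_accuracy": acc_en,
            "target_accuracy": acc_target,
            "transfer_gap": acc_en - acc_target,
        }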

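# XLM-R fine-tuning for NER
import torch
from transformers import XLMRobertaForTokenClassification, XLMRobertaTokenizerFast

class MultilingualNER:
    def __init__(self, model_name="xlm-roberta-base", num_labels=9):
        # Only fast tokenizers support return_offsets_mapping, used in predict_entities
        self.tokenizer = XLMRobertaTokenizerFast.from_pretrained(model_name)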
        self.model = XLMRobertaForTokenClassification.from_pretrained(
            model_name, num_labels=num_labels
        )
        self.label_list = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

    def predict_entities(self, text, language="en"):
        inputs = self.tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
        offset_mapping = inputs.pop("offset_mapping")[0]

        with torch.no_grad():
            outputs = self.model(**inputs)
            predictions = torch.argmax(outputs.logits, dim=-1)[0]

        entities = []
        current_entity = None

        for pred, (start, end) in zip(predictions.tolist(), offset_mapping.tolist()):
            if start == end:
                # Special tokens such as <s> and </s> have an empty offset; skip them
                continue
            label = self.label_list[pred]

            if label.startswith("B-"):
                if current_entity:
                    entities.append(current_entity)
                current_entity = {
                    "type": label[2:],
                    "start": start,
                    "end": end,
                    "text": text[start:end],
                }
            elif label.startswith("I-") and current_entity and current_entity["type"] == label[2:]:
                # Extend the open entity to cover this token
                current_entity["end"] = end
                current_entity["text"] = text[current_entity["start"]:end]
            else:
                if current_entity:
                    entities.append(current_entity)
                    current_entity = None

        if current_entity:
            entities.append(current_entity)

        return entities

Cross-Lingual Transfer Strategies

| Strategy | Description | Performance |
|---|---|---|
| Zero-shot | Train on English, test on target | Baseline |
| Translate-train | Translate English training data into the target language, then train on it | Moderate improvement |
| Translate-test | Translate target-language test data into English | Moderate improvement |
| Multi-task | Train jointly on multiple languages | Significant improvement |
| Adapter-based | Language-specific adapters | Best performance |
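The difference between translate-train and translate-test is easiest to see in terms of which data gets translated. The sketch below is a minimal illustration under stated assumptions: `build_translate_train`, `build_translate_test`, and the lookup-table translator `toy_translate` are hypothetical names, and a real setup would call an actual machine translation system instead.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, int]                    # (text, label_id)
Translator = Callable[[str, str, str], str]  # (text, src_lang, tgt_lang) -> text

def build_translate_train(train_en: List[Example], translate: Translator,
                          target_lang: str) -> List[Example]:
    """Translate-train: translate English training texts into the target language.

    Labels carry over unchanged; the model is then fine-tuned on the result.
    """
    return [(translate(text, "en", target_lang), label) for text, label in train_en]

def build_translate_test(test_texts: List[str], translate: Translator,
                         source_lang: str) -> List[str]:
    """Translate-test: translate target-language test inputs into English.

    The translated inputs are scored by a model fine-tuned on English only.
    """
    return [translate(text, source_lang, "en") for text in test_texts]

# Hypothetical stand-in for a machine translation system: a tiny lookup table.
TOY_MT = {
    ("en", "fr"): {"Terrible experience.": "Expérience terrible."},
    ("fr", "en"): {"Ce produit est incroyable!": "This product is amazing!"},
}

def toy_translate(text: str, src_lang: str, tgt_lang: str) -> str:
    # Fall back to the untranslated text when the pair is not in the table.
    return TOY_MT.get((src_lang, tgt_lang), {}).get(text, text)

train_fr = build_translate_train([("Terrible experience.", 0)], toy_translate, "fr")
test_en = build_translate_test(["Ce produit est incroyable!"], toy_translate, "fr")
print(train_fr)
print(test_en)
```

Translate-train pays the translation cost once, at training time, while translate-test pays it on every inference request, and both inherit any errors made by the translation system.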

Language Family Transfer

| Source Languages | Target Language Family | Transfer Quality |
|---|---|---|
| English + German | Germanic | Excellent |
| English + French | Romance | Excellent |
| English + Chinese | Sino-Tibetan | Good |
| English + Arabic | Semitic | Good |
| English + Hindi | Indo-Aryan | Moderate |

Best Practices

  1. Use XLM-R over mBERT - It significantly outperforms mBERT on most multilingual tasks
  2. Include language tags - Recording each example's language helps with routing, sampling, and per-language evaluation (XLM-R itself takes no language ID as input)
  3. Balance training data - Oversample low-resource languages during training
  4. Evaluate on multiple languages - Ensure robust cross-lingual transfer
  5. Consider script differences - Some scripts require specialized tokenization
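One common way to oversample low-resource languages is exponent-smoothed sampling: each language's share of the data is raised to a power `alpha` below 1 and then renormalized. The sketch below assumes `alpha=0.3`, the value reported for XLM-R, as a default; the function name and the corpus sizes are hypothetical, and the best exponent is a tunable choice.

```python
def sampling_probabilities(sentence_counts, alpha=0.3):
    """Map per-language sentence counts to smoothed sampling probabilities.

    alpha=1.0 reproduces the raw data proportions; smaller values shift
    probability mass from high-resource to low-resource languages.
    """
    total = sum(sentence_counts.values())
    smoothed = {lang: (n / total) ** alpha for lang, n in sentence_counts.items()}
    norm = sum(smoothed.values())
    return {lang: weight / norm for lang, weight in smoothed.items()}

# Hypothetical corpus sizes (sentences per language)
counts = {"en": 1_000_000, "de": 200_000, "sw": 5_000}
probs = sampling_probabilities(counts)

total = sum(counts.values())
for lang in counts:
    print(f"{lang}: raw share {counts[lang] / total:.4f}, sampled share {probs[lang]:.4f}")
```

Pushing `alpha` too low makes the model see the same low-resource sentences many times, so the exponent trades off coverage of rare languages against overfitting to them.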

Key Takeaways

  • XLM-R achieves strong, near state-of-the-art results on multilingual benchmarks across 100 languages
  • Zero-shot cross-lingual transfer enables NLP for low-resource languages without labeled data
  • Transfer quality depends on language family proximity: closely related languages transfer best
  • Multi-task training on multiple languages improves overall performance
  • Script diversity requires careful tokenizer design and preprocessing
