Named Entity Recognition (NER)
NER identifies and classifies named entities in text into predefined categories such as person names, organizations, locations, dates, and monetary values. It's a foundational task for information extraction and knowledge graph construction.
BIO Tagging Scheme
BIO (Beginning-Inside-Outside) is the standard tagging scheme for NER:
| Tag | Meaning | Example |
|---|---|---|
| B-PER | Beginning of person | John |
| I-PER | Inside (continuation) | Smith |
| B-ORG | Beginning of organization | Google |
| I-ORG | Inside organization | Inc |
| B-LOC | Beginning of location | New |
| I-LOC | Inside location | York |
| O | Outside any entity | the, is, at |
# Example BIO tags
tokens = ["John", "Smith", "works", "at", "Google", "in", "New", "York"]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "O", "B-LOC", "I-LOC"]
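Reading entities back out of a BIO sequence is a small state machine, and a sketch may make the scheme more concrete. The helper below, `bio_to_spans`, is a hypothetical name introduced for illustration. It assumes that an `I-` tag with no matching open entity is treated as outside; other policies are possible.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collapse parallel token/tag lists into (entity_text, label) pairs."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    def flush() -> None:
        # Close the entity being built, if there is one.
        nonlocal current_tokens, current_label
        if current_tokens:
            spans.append((" ".join(current_tokens), current_label))
        current_tokens, current_label = [], None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity, even right after another one.
            flush()
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O", or an I- tag without a matching open entity (assumed outside).
            flush()
    flush()
    return spans


tokens = ["John", "Smith", "works", "at", "Google", "in", "New", "York"]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "O", "B-LOC", "I-LOC"]
print(bio_to_spans(tokens, tags))
# [('John Smith', 'PER'), ('Google', 'ORG'), ('New York', 'LOC')]
```

Joining tokens with a single space is a simplification; with a real tokenizer you would usually keep character offsets and slice the original text instead.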
NER with spaCy
import spacy
nlp = spacy.load("en_core_web_sm")
text = "Apple Inc. was founded by Steve Jobs in Cupertino, California in 1976."
doc = nlp(text)
for ent in doc.ents:
    print(f"{ent.text:20} {ent.label_:8} {spacy.explain(ent.label_)}")
# Apple Inc.           ORG      Companies, agencies
# Steve Jobs           PERSON   People, including fictional
# Cupertino            GPE      Countries, cities, states
# California           GPE      Countries, cities, states
# 1976                 DATE     Absolute or relative dates
Custom NER Training
import spacy
from spacy.training import Example

# Training data in spaCy format: (text, {"entities": [(start_char, end_char, label)]})
# Offsets are character positions and the end offset is exclusive.
TRAIN_DATA = [
    ("iPhone 15 Pro is amazing", {"entities": [(0, 13, "PRODUCT")]}),
    ("Samsung Galaxy S24 released", {"entities": [(0, 18, "PRODUCT")]}),
    ("Tim Cook announced the new MacBook", {"entities": [(0, 8, "PERSON"), (27, 34, "PRODUCT")]}),
]

# Initialize blank model
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")

# Add labels
for _, annotations in TRAIN_DATA:
    for _start, _end, label in annotations["entities"]:
        ner.add_label(label)

# Train (in spaCy v3, nlp.initialize() replaces the deprecated nlp.begin_training())
optimizer = nlp.initialize()
for epoch in range(20):
    losses = {}
    for text, annotations in TRAIN_DATA:
        doc = nlp.make_doc(text)
        example = Example.from_dict(doc, annotations)
        nlp.update([example], sgd=optimizer, losses=losses)
    if epoch % 5 == 0:
        print(f"Epoch {epoch}, Loss: {losses['ner']:.4f}")
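Character offsets are easy to get wrong by one or two positions, and a span that does not line up with the text cannot be learned properly. The check below is a heuristic sketch, not part of spaCy: `check_offsets` is a hypothetical helper that flags spans that fall outside the text, include surrounding whitespace, or cut through a word.

```python
from typing import List, Tuple

Entity = Tuple[int, int, str]  # (start_char, end_char, label), end exclusive


def check_offsets(text: str, entities: List[Entity]) -> List[Tuple[Entity, str]]:
    """Return (entity, reason) pairs for spans that look misaligned."""
    problems: List[Tuple[Entity, str]] = []
    for start, end, label in entities:
        entity = (start, end, label)
        if not (0 <= start < end <= len(text)):
            problems.append((entity, "out of range"))
            continue
        span = text[start:end]
        if span != span.strip():
            problems.append((entity, "leading or trailing whitespace"))
        elif (start > 0 and text[start - 1].isalnum()) or (
            end < len(text) and text[end].isalnum()
        ):
            # The character just outside the span is alphanumeric,
            # so the span probably starts or ends mid-word.
            problems.append((entity, "cuts through a word"))
    return problems


text = "iPhone 15 Pro is amazing"
print(check_offsets(text, [(0, 13, "PRODUCT")]))
# []
print(check_offsets(text, [(0, 14, "PRODUCT")]))
# [((0, 14, 'PRODUCT'), 'leading or trailing whitespace')]
```

Running a check like this over the training data before training catches annotation slips early; a quick `text[start:end]` printout per entity serves the same purpose for small datasets.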
CRF-Based NER
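A linear-chain Conditional Random Field (CRF) scores the whole tag sequence jointly instead of classifying each token in isolation, so it can model dependencies between neighboring tags (for example, that I-PER should follow B-PER or I-PER, never O). Each token is represented by a dictionary of hand-crafted features: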
# Feature extraction for CRF
def word2features(sent, i):
    """Build the feature dict for the token at position i in sent (a list of token strings)."""
    word = sent[i]
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],  # suffix
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
    }
    # Context features from the neighboring tokens, when they exist
    if i > 0:
        features['prev.word'] = sent[i-1].lower()
    if i < len(sent)-1:
        features['next.word'] = sent[i+1].lower()
    return features
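To show what modeling neighboring tags buys at prediction time, here is a minimal Viterbi decoding sketch. All scores are hand-set, hypothetical numbers; a trained CRF would learn the per-token (emission) scores from features like the ones above and the tag-to-tag (transition) scores from labeled data. The `"<START>"` marker and the names `viterbi`, `emissions`, and `transitions` are assumptions made for this example.

```python
from typing import Dict, List, Tuple

TAGS = ["O", "B-PER", "I-PER"]


def viterbi(
    emissions: List[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
    tags: List[str] = TAGS,
) -> List[str]:
    """Return the highest-scoring tag sequence under emission + transition scores."""
    # best[t][tag] = score of the best path that ends in `tag` at position t
    best = [{tag: emissions[0][tag] + transitions.get(("<START>", tag), 0.0) for tag in tags}]
    back: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for tag in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions.get((p, tag), 0.0))
            scores[tag] = best[-1][prev] + transitions.get((prev, tag), 0.0) + emissions[t][tag]
            pointers[tag] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical per-token scores for ["john", "smith", "works"]
emissions = [
    {"O": 1.0, "B-PER": 0.9, "I-PER": 0.2},
    {"O": 0.3, "B-PER": 0.4, "I-PER": 1.5},
    {"O": 2.0, "B-PER": 0.1, "I-PER": 0.1},
]
# Hypothetical transition scores: I-PER may not start a sequence or follow O.
transitions = {
    ("<START>", "I-PER"): -10.0,
    ("O", "I-PER"): -10.0,
    ("B-PER", "I-PER"): 1.0,
    ("I-PER", "I-PER"): 0.5,
}

greedy = [max(TAGS, key=lambda tag: scores[tag]) for scores in emissions]
decoded = viterbi(emissions, transitions)
print(greedy)   # ['O', 'I-PER', 'O']      (invalid: I-PER follows O)
print(decoded)  # ['B-PER', 'I-PER', 'O']  (transition scores repair the sequence)
```

Picking the best tag per token independently yields an invalid BIO sequence here, while joint decoding with transition scores recovers a well-formed one.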
Transformer-Based NER
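from transformers import pipeline

# No model is specified, so transformers falls back to its default NER checkpoint
# (and logs a warning); pass model="..." to pin a specific one.
# aggregation_strategy="simple" merges subword pieces into whole entities.
ner_pipeline = pipeline("ner", aggregation_strategy="simple")
text = "Elon Musk founded SpaceX in Hawthorne, California."
entities = ner_pipeline(text)
for ent in entities:
    print(f"{ent['word']:20} {ent['entity_group']:6} {ent['score']:.3f}")
# Example output (exact scores depend on the model):
# Elon Musk            PER    0.998
# SpaceX               ORG    0.995
# Hawthorne            LOC    0.989
# California           LOC    0.994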
NER Challenges
| Challenge | Example | Difficulty |
|---|---|---|
| Ambiguity | "Washington" (person vs state) | High |
| Nested entities | "New York City Police Department" | Medium |
| Abbreviations | "MIT", "NYC" | Medium |
| Domain-specific | Gene names, drug names | High |
| Multi-language | Varying entity formats | Very High |