Named Entity Recognition (NER)

NER identifies and classifies named entities in text into predefined categories such as person names, organizations, locations, dates, and monetary values. It's a foundational task for information extraction and knowledge graph construction.

BIO Tagging Scheme

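BIO (Beginning-Inside-Outside) is the most widely used tagging scheme for NER. Each token receives one tag that says whether it begins an entity, continues one, or lies outside any entity: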

Tag	Meaning	Example
B-PER	Beginning of person	John
I-PER	Inside person (continuation)	Smith
B-ORG	Beginning of organization	Google
I-ORG	Inside organization	Inc
B-LOC	Beginning of location	New
I-LOC	Inside location	York
O	Outside any entity	the, is, at

# Example BIO tags
tokens = ["John", "Smith", "works", "at", "Google", "in", "New", "York"]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "O", "B-LOC", "I-LOC"]
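
To make the scheme concrete, the tags can be decoded back into entity spans. The following is a minimal sketch, not a library function: `bio_to_spans` is a hypothetical helper name, and it assumes that a stray I- tag with no matching open entity is simply ignored.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, label) pairs."""
    spans = []
    current_tokens, current_label = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one first.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # An I- tag with the same label continues the open entity.
            current_tokens.append(token)
        else:
            # "O" closes the open entity; a stray I- tag is treated like "O".
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_label))
    return spans


tokens = ["John", "Smith", "works", "at", "Google", "in", "New", "York"]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "O", "B-LOC", "I-LOC"]
print(bio_to_spans(tokens, tags))
# [('John Smith', 'PER'), ('Google', 'ORG'), ('New York', 'LOC')]
```

The B- prefix is what lets two adjacent entities of the same type be told apart, which a plain inside/outside labeling could not do.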

NER with spaCy

import spacy

nlp = spacy.load("en_core_web_sm")

text = "Apple Inc. was founded by Steve Jobs in Cupertino, California in 1976."
doc = nlp(text)

for ent in doc.ents:
    print(f"{ent.text:20} {ent.label_:8} {spacy.explain(ent.label_)}")
# Example output (exact entities can vary with the model version):
# Apple Inc.           ORG      Companies, agencies, institutions, etc.
# Steve Jobs           PERSON   People, including fictional
# Cupertino            GPE      Countries, cities, states
# California           GPE      Countries, cities, states
# 1976                 DATE     Absolute or relative dates or periods

Custom NER Training

import spacy
from spacy.training import Example

# Training data in spaCy format: (text, {"entities": [(start, end, label)]})
# start and end are character offsets into the text; end is exclusive.
TRAIN_DATA = [
    ("iPhone 15 Pro is amazing", {"entities": [(0, 13, "PRODUCT")]}),
    ("Samsung Galaxy S24 released", {"entities": [(0, 18, "PRODUCT")]}),
    ("Tim Cook announced the new MacBook", {"entities": [(0, 8, "PERSON"), (27, 34, "PRODUCT")]}),
]

# Initialize blank model
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")

# Add labels
for _, annotations in TRAIN_DATA:
    for ent in annotations.get("entities"):
        ner.add_label(ent[2])

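# Train (in spaCy v3, nlp.initialize() replaces the deprecated nlp.begin_training())
optimizer = nlp.initialize()
for epoch in range(20):
    losses = {}  # reset so each printed value covers a single pass over the data
    for text, annotations in TRAIN_DATA:
        doc = nlp.make_doc(text)
        example = Example.from_dict(doc, annotations)
        nlp.update([example], sgd=optimizer, losses=losses)
    if epoch % 5 == 0:
        print(f"Epoch {epoch}, Loss: {losses['ner']:.4f}")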
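
Character offsets in the (start, end, label) format are easy to miscount by hand, and a span that does not line up with token boundaries cannot be used as a training annotation. One way to avoid this is to compute the offsets from the entity text. The sketch below is an assumption-laden illustration: `find_span` is a hypothetical helper, and it only locates the first occurrence of the phrase.

```python
def find_span(text, phrase, label):
    """Return (start, end, label) for the first occurrence of phrase in text."""
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in {text!r}")
    # end is exclusive, so text[start:end] == phrase
    return (start, start + len(phrase), label)


text = "Tim Cook announced the new MacBook"
entities = [
    find_span(text, "Tim Cook", "PERSON"),
    find_span(text, "MacBook", "PRODUCT"),
]
print(entities)
# [(0, 8, 'PERSON'), (27, 34, 'PRODUCT')]

# Sanity check: slicing with the offsets gives back exactly the entity text.
for start, end, label in entities:
    assert text[start:end] in ("Tim Cook", "MacBook")
```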

CRF-Based NER

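A linear-chain Conditional Random Field (CRF) scores an entire tag sequence jointly rather than tagging each token in isolation. This lets it model dependencies between neighboring tags, for example that I-PER should only follow B-PER or I-PER. Each token is described by a dictionary of hand-crafted features: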

# Feature extraction for CRF
def word2features(sent, i):
    """Build the feature dict for the token at index i of a tokenized sentence."""
    word = sent[i]
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
    }
    if i > 0:
        features['prev.word'] = sent[i-1].lower()
    if i < len(sent)-1:
        features['next.word'] = sent[i+1].lower()
    return features
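
To illustrate how tag dependencies affect the prediction, here is a minimal sketch of Viterbi decoding, the procedure a linear-chain CRF uses to pick the best tag sequence. Everything in it is assumed for illustration: the per-token scores in `emissions` are made up (in a trained CRF they would come from learned weights over features such as those above), and `transition_score` hard-codes a single rule instead of learned transition weights.

```python
import math

TAGS = ["O", "B-PER", "I-PER"]


def transition_score(prev_tag, tag):
    """Hypothetical transition score: I-PER may only follow B-PER or I-PER."""
    if tag == "I-PER" and prev_tag not in ("B-PER", "I-PER"):
        return -math.inf
    return 0.0


def viterbi(emissions):
    """Return the highest-scoring tag sequence for a list of {tag: score} dicts."""
    # best[t][tag] = (score of the best path ending in tag at position t, previous tag)
    # The sentence start is treated as if it followed an "O" tag.
    best = [{tag: (emissions[0][tag] + transition_score("O", tag), None) for tag in TAGS}]
    for t in range(1, len(emissions)):
        column = {}
        for tag in TAGS:
            prev_tag = max(TAGS, key=lambda p: best[t - 1][p][0] + transition_score(p, tag))
            score = best[t - 1][prev_tag][0] + transition_score(prev_tag, tag) + emissions[t][tag]
            column[tag] = (score, prev_tag)
        best.append(column)
    # Backtrack from the best final tag.
    tag = max(TAGS, key=lambda k: best[-1][k][0])
    path = [tag]
    for t in range(len(emissions) - 1, 0, -1):
        tag = best[t][tag][1]
        path.append(tag)
    return path[::-1]


# Made-up per-token scores for the sentence ["met", "John", "Smith"].
emissions = [
    {"O": 2.0, "B-PER": 0.2, "I-PER": 0.1},
    {"O": 0.5, "B-PER": 1.0, "I-PER": 1.2},
    {"O": 0.3, "B-PER": 0.4, "I-PER": 1.5},
]

greedy = [max(e, key=e.get) for e in emissions]
print(greedy)              # ['O', 'I-PER', 'I-PER']  (invalid: I-PER follows O)
print(viterbi(emissions))  # ['O', 'B-PER', 'I-PER']
```

Picking the best tag per token independently yields an invalid sequence here, while decoding over the whole sequence with transition scores recovers a well-formed one.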

Transformer-Based NER

from transformers import pipeline

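# No model is specified, so the library falls back to its default English NER
# checkpoint (downloaded on first use); pass model="..." to pin a specific one.
# aggregation_strategy="simple" merges word pieces into whole entity spans.
ner_pipeline = pipeline("ner", aggregation_strategy="simple")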

text = "Elon Musk founded SpaceX in Hawthorne, California."
entities = ner_pipeline(text)
for ent in entities:
    print(f"{ent['word']:20} {ent['entity_group']:6} {ent['score']:.3f}")
# Elon Musk            PER    0.998
# SpaceX               ORG    0.995
# Hawthorne            LOC    0.989
# California           LOC    0.994

NER Challenges

Challenge	Example	Difficulty
Ambiguity	"Washington" (person vs state)	High
Nested entities	"New York City Police Department"	Medium
Abbreviations	"MIT", "NYC"	Medium
Domain-specific	Gene names, drug names	High
Multi-language	Varying entity formats	Very High
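
The nested-entity row can be made concrete with character spans. A flat BIO sequence assigns one tag per token, so it cannot mark "New York City" as a location and the full phrase as an organization at the same time. The sketch below uses hypothetical annotations and a hypothetical helper, `is_nested`, to show the overlap.

```python
text = "New York City Police Department"

# Hypothetical annotations as (start, end, label) character spans, end exclusive.
spans = [(0, 13, "LOC"), (0, 31, "ORG")]


def is_nested(inner, outer):
    """True if inner lies entirely within outer and the two spans differ."""
    return outer[0] <= inner[0] and inner[1] <= outer[1] and inner[:2] != outer[:2]


for start, end, label in spans:
    print(repr(text[start:end]), label)
# 'New York City' LOC
# 'New York City Police Department' ORG

print(is_nested(spans[0], spans[1]))  # True
```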