Natural Language Processing — Teaching Computers to Read

NLP enables computers to understand, interpret, and generate human language — bridging the gap between raw text and actionable insights.

Text Preprocessing — tokenization, stemming, and lemmatization clean and normalize raw text
TF-IDF and Bag of Words — simple but effective vectorization methods for text classification
Word Embeddings — Word2Vec and GloVe capture semantic relationships between words in dense vector space

"Language is the house of being." — Martin Heidegger

NLP Fundamentals — Complete Guide

Mathematical Foundations

TF-IDF Formula

\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)

where:

\text{TF}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}

(term frequency)

\text{IDF}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|}

(inverse document frequency)
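As a worked illustration of the two definitions above, here is a minimal sketch that computes TF-IDF by hand. The toy corpus and the function names (`tf`, `idf`, `tf_idf`) are assumptions made for this example, and the code follows the plain formula without any smoothing.

```python
import math

def tf(term, doc_tokens):
    """Term frequency: count of the term divided by the document length."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus_tokens):
    """Inverse document frequency: log(N / number of documents containing the term).

    Raises ZeroDivisionError if the term appears in no document (no smoothing).
    """
    n_containing = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(len(corpus_tokens) / n_containing)

def tf_idf(term, doc_tokens, corpus_tokens):
    return tf(term, doc_tokens) * idf(term, corpus_tokens)

# Toy corpus (illustrative only), already tokenized and lowercased.
corpus = [
    ["i", "love", "ml"],
    ["i", "love", "dogs"],
    ["ml", "is", "great"],
]

score_i = tf_idf("i", corpus[0], corpus)        # "i" occurs in 2 of 3 documents
score_dogs = tf_idf("dogs", corpus[1], corpus)  # "dogs" occurs in 1 of 3 documents
print(score_i, score_dogs)
```

Both terms have the same term frequency (1/3) in their documents, so the rarer term "dogs" ends up with the higher score purely because of its IDF.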

Cosine Similarity (for embeddings)

\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{||\mathbf{a}|| \cdot ||\mathbf{b}||} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \sqrt{\sum_{i=1}^{n} b_i^2}}
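The formula above translates almost directly into code. The following is a small pure-Python sketch (the function name `cosine_similarity` and the example vectors are chosen for illustration); in practice a numerical library would usually be used instead.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])                # perpendicular vectors
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])                # opposite directions
print(same_direction, orthogonal, opposite)
```

The result depends only on the angle between the vectors, not on their lengths, which is why the first pair scores (up to rounding) 1 even though one vector is twice as long as the other.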

Word2Vec Skip-gram Objective

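\mathcal{L} = \sum_{t=1}^{T} \sum_{-c \leq j \leq c, j \neq 0} \log P(w_{t+j} \mid w_t)

(log-likelihood to maximize; $T$ is the number of tokens in the corpus and $c$ is the context window size)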
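The double sum in the skip-gram objective runs over (center word, context word) pairs. A rough sketch of how those pairs can be enumerated is shown below; the function name `skipgram_pairs` and the example sentence are assumptions for illustration, and the sketch covers only pair generation, not the training of the model.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for every position t and offset j != 0 with |j| <= window."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-window, window + 1):
            if j == 0:
                continue  # a word is not its own context
            pos = t + j
            if 0 <= pos < len(tokens):  # skip offsets that fall outside the sentence
                pairs.append((center, tokens[pos]))
    return pairs

pairs = skipgram_pairs(["i", "love", "machine", "learning"], window=1)
for center, context in pairs:
    print(center, "->", context)
```

Each printed pair corresponds to one $\log P(w_{t+j} \mid w_t)$ term in the sum.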

NLP Preprocessing Pipeline

The standard text preprocessing pipeline:

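Tokenization: Split text into tokens, e.g. "I love ML" -> ["I", "love", "ML"]
Lowercasing: Convert to lowercase, e.g. "I LOVE ML" -> "i love ml"
Stop word removal: Remove very common words, e.g. ["i", "love", "ml"] -> ["love", "ml"]
Stemming: Strip suffixes with simple rules, e.g. ["running", "runs"] -> ["run", "run"]; irregular forms such as "ran" are usually left unchanged
Lemmatization: Reduce to dictionary form using vocabulary and part of speech (usually more accurate than stemming), e.g. "ran" -> "run", "better" -> "good"
Vectorization: Convert tokens to numeric vectors (Bag of Words, TF-IDF, or embeddings)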
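The first three steps of the pipeline above can be sketched in a few lines of plain Python. The stop word list here is a tiny hand-picked set for illustration, and the regular expression is a simplistic tokenizer. Stemming and lemmatization are left out because they normally rely on a library such as NLTK or spaCy.

```python
import re

# Tiny illustrative stop word list; real lists contain far more entries.
STOP_WORDS = {"i", "a", "an", "the", "is", "and", "of", "to"}

def tokenize(text):
    """Split text into runs of letters, digits, and apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text)

def preprocess(text):
    tokens = tokenize(text)                                    # 1. tokenization
    tokens = [tok.lower() for tok in tokens]                   # 2. lowercasing
    tokens = [tok for tok in tokens if tok not in STOP_WORDS]  # 3. stop word removal
    return tokens

result = preprocess("I LOVE ML")
print(result)  # ['love', 'ml']
```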

Bag of Words and TF-IDF

Bag of Words

Bag of Words (BoW) counts word occurrences in each document:

Document	Text	Vector
Doc 1	"I love ML"	[1, 1, 1, 0]
Doc 2	"I love dogs"	[1, 1, 0, 1]

Vocabulary: [I, love, ML, dogs]
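The table above can be reproduced with a short sketch. The function name `bag_of_words` is chosen for this example, and whitespace splitting stands in for a real tokenizer.

```python
vocabulary = ["I", "love", "ML", "dogs"]

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    tokens = text.split()  # whitespace tokenization, enough for this toy example
    return [tokens.count(word) for word in vocabulary]

doc1 = bag_of_words("I love ML", vocabulary)
doc2 = bag_of_words("I love dogs", vocabulary)
print(doc1)  # [1, 1, 1, 0]
print(doc2)  # [1, 1, 0, 1]
```

Word order is discarded: "ML love I" would produce the same vector as "I love ML".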

TF-IDF

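\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)

Here,

$\text{TF}(t, d)$ = frequency of term $t$ in document $d$
$\text{IDF}(t, D)$ = inverse document frequency of $t$ across the corpus $D$ (rarer terms score higher)
$\text{TF-IDF}$ = low for terms that appear in most documents, high for terms that are frequent in one document but rare across the corpus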

Example: TF-IDF in Python

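from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["I love machine learning", "I love dogs", "Machine learning is great"]

# Defaults: lowercases text and drops single-character tokens such as "I".
# scikit-learn also smooths the IDF and L2-normalizes each row, so the values
# differ from the plain TF x IDF formula.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse matrix: one row per document
print(vectorizer.get_feature_names_out())  # column order of the matrix
print(X.toarray())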

Word Embeddings

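One-hot encoding: [0, 0, 1, 0, 0] (sparse; encodes only which word it is, so every pair of distinct words looks equally unrelated)

Embedding: [0.2, -0.5, 0.8, 0.1, 0.3] (dense; words used in similar contexts get nearby vectors)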
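To make the contrast concrete, the sketch below compares dot products for one-hot vectors and for dense vectors. The dense vectors are hand-written for illustration and are not trained embeddings; the helper names are likewise assumptions.

```python
def one_hot(index, size):
    """Return a vector of zeros with a single 1 at the given index."""
    vec = [0] * size
    vec[index] = 1
    return vec

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

vocab = ["cat", "dog", "car"]
one_hot_vectors = {word: one_hot(i, len(vocab)) for i, word in enumerate(vocab)}

# Hypothetical dense vectors, hand-written for illustration (not trained).
dense_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

# One-hot: every pair of distinct words has dot product 0.
print(dot(one_hot_vectors["cat"], one_hot_vectors["dog"]))
print(dot(one_hot_vectors["cat"], one_hot_vectors["car"]))

# Dense: related words can score higher than unrelated ones.
print(dot(dense_vectors["cat"], dense_vectors["dog"]))
print(dot(dense_vectors["cat"], dense_vectors["car"]))
```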

Word2Vec

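Word2Vec learns word vectors from local context windows, using one of two training setups:

Skip-gram: Predict the surrounding context words from the center word
CBOW (Continuous Bag of Words): Predict the center word from its surrounding context
The resulting vectors support analogies through vector arithmetic: king - man + woman ≈ queen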
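The analogy arithmetic can be illustrated with toy vectors. The 2-dimensional vectors below are hand-made so that one axis loosely stands for "royalty" and the other for "gender"; they are not real Word2Vec output, and the `analogy` helper is a simplified stand-in for what embedding libraries do.

```python
import math

# Hypothetical hand-made vectors: [royalty, gender]. Not trained embeddings.
vectors = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(positive, negative, vectors):
    """Add the positive vectors, subtract the negative ones, return the closest other word."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]
    excluded = set(positive) | set(negative)  # do not return one of the query words
    candidates = [w for w in vectors if w not in excluded]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

best = analogy(positive=["king", "woman"], negative=["man"], vectors=vectors)
print(best)  # queen
```

With real embeddings the match is approximate rather than exact, which is why the relation is written with ≈.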

GloVe

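GloVe (Global Vectors) learns embeddings from global word co-occurrence statistics collected over an entire corpus. Pre-trained GloVe vectors are publicly available, trained on corpora such as Wikipedia with Gigaword, Common Crawl, and Twitter.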
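The co-occurrence statistics that GloVe starts from can be sketched as a simple windowed count. This illustrates only the counting step under simplifying assumptions (plain counts, a symmetric window, a toy corpus); it is not the GloVe training procedure, and the function name is made up for this example.

```python
from collections import Counter

def cooccurrence_counts(sentences, window):
    """Count how often each ordered (word, neighbor) pair occurs within the window."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

# Toy corpus (illustrative only).
sentences = [["i", "love", "ml"], ["i", "love", "dogs"]]
counts = cooccurrence_counts(sentences, window=1)
for pair, n in sorted(counts.items()):
    print(pair, n)
```

Pairs that never appear within the window, such as ("i", "ml") with a window of 1, have a count of zero.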

Example: Word Embeddings with Gensim

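import gensim.downloader as api

# Load pre-trained Word2Vec vectors (KeyedVectors); the first call downloads
# a large file (over 1 GB) and caches it locally
model = api.load('word2vec-google-news-300')

# Cosine similarity between two word vectors
print(model.similarity('cat', 'dog'))  # about 0.76

# Analogy: king - man + woman; returns a ranked list of (word, similarity) pairs
print(model.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))
# [('queen', 0.71...)]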

N-grams and Local Word Order
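An n-gram is a sequence of n consecutive tokens. Unlike single-word features, n-grams keep some local word order, so "not good" can be treated as a unit rather than as two unrelated words. A minimal sketch, with a function name and example sentence chosen for illustration:

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["not", "good", "at", "all"]
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)   # [('not', 'good'), ('good', 'at'), ('at', 'all')]
print(trigrams)  # [('not', 'good', 'at'), ('good', 'at', 'all')]
```

In scikit-learn, the `ngram_range` parameter of `TfidfVectorizer` (for example `ngram_range=(1, 2)`) adds n-gram features alongside single words.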

Text Classification

Example: Text Classification Pipeline

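from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Toy training data for illustration only (1 = positive, 0 = negative);
# replace with a real labeled dataset
X_train_text = ["I loved this film", "A great, moving story",
                "Terrible acting and a dull plot", "I hated every minute"]
y_train = [1, 1, 0, 0]

# Pipeline: TF-IDF features feed a multinomial Naive Bayes classifier
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000)),
    ('clf', MultinomialNB())
])

# Train
pipeline.fit(X_train_text, y_train)

# Predict the label of unseen text
predictions = pipeline.predict(["This movie is great!"])
print(predictions)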

Key Takeaways

Summary: NLP Fundamentals

Tokenization is typically the first step in an NLP pipeline
TF-IDF is simple and effective for text classification
Word embeddings capture semantic meaning
Word2Vec and GloVe are embedding methods with widely used pre-trained vectors
Pre-trained models (BERT, GPT) achieve state-of-the-art results on many NLP tasks
Text preprocessing significantly impacts performance
N-grams capture local word order
Sentiment analysis is a common NLP task

What to Learn Next

-> Transformers Learn the self-attention architecture that revolutionized NLP and powers modern AI.

-> BERT Master encoder-only transformers for text classification, NER, and question answering.

-> GPT Architecture Understand decoder-only transformers that power autoregressive text generation.

-> Naive Bayes Learn the simple probabilistic classifier often used as a strong NLP baseline.

-> RNN and LSTM Explore sequential models that were the dominant NLP approach before transformers.

-> GANs Discover generative adversarial networks for text generation and style transfer.
