Natural Language Processing — Teaching Computers to Read
NLP enables computers to understand, interpret, and generate human language — bridging the gap between raw text and actionable insights.
- Text Preprocessing — tokenization, stemming, and lemmatization clean and normalize raw text
- TF-IDF and Bag of Words — simple but effective vectorization methods for text classification
- Word Embeddings — Word2Vec and GloVe capture semantic relationships between words in dense vector space
"Language is the house of being." — Martin Heidegger
NLP Fundamentals — Complete Guide
Mathematical Foundations
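TF-IDF Formula
In its standard form (libraries such as scikit-learn apply smoothed variants):
$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t), \qquad \text{idf}(t) = \log \frac{N}{\text{df}(t)}$$
where:
- $\text{tf}(t, d)$ measures how often term $t$ occurs in document $d$ (term frequency)
- $\text{idf}(t)$ down-weights terms that occur in many documents (inverse document frequency); $N$ is the number of documents and $\text{df}(t)$ is the number of documents containing $t$
Cosine Similarity (for embeddings)
$$\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|}$$
Word2Vec Skip-gram Objective
For a corpus of $T$ words $w_1, \dots, w_T$ and a context window of size $c$, skip-gram maximizes the average log-probability of the context words given each center word:
$$J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t)$$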
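Cosine similarity, listed above as the usual way to compare embeddings, can be sketched in a few lines of plain Python. The function name `cosine_similarity` and the three-dimensional vectors are illustrative assumptions, not real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional vectors, chosen only for illustration.
cat = [1.0, 2.0, 0.0]
dog = [2.0, 4.0, 0.0]   # same direction as cat, different length
car = [0.0, 0.0, 3.0]   # orthogonal to cat

sim_cat_dog = cosine_similarity(cat, dog)   # 1.0 up to floating-point rounding
sim_cat_car = cosine_similarity(cat, car)   # 0.0
```

Because the measure depends only on direction, `cat` and `dog` score as identical even though their lengths differ.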
NLP Preprocessing Pipeline
The standard text preprocessing pipeline:
- Tokenization: split text into tokens, e.g. "I love ML" -> ["I", "love", "ML"]
- Lowercasing: convert to lowercase, e.g. "I LOVE ML" -> "i love ml"
- Stop word removal: remove common words, e.g. ["i", "love", "ml"] -> ["love", "ml"]
- Stemming: strip suffixes to reach a root form, e.g. ["running", "runs"] -> ["run", "run"]
- Lemmatization: reduce to dictionary form, which also handles irregular words that suffix stripping misses, e.g. "ran" -> "run", "better" -> "good"
- Vectorization: convert text to numbers
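The first four steps of the pipeline can be sketched with the standard library alone. This is a toy illustration under stated assumptions: `STOP_WORDS` is a tiny made-up list, and `naive_stem` is a hypothetical suffix stripper, not a real stemmer such as Porter's. Lemmatization and vectorization are left out because they need a dictionary or a fitted vocabulary.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"i", "a", "an", "the", "is", "and"}

def tokenize(text):
    """Split text into runs of letters and digits."""
    return re.findall(r"[A-Za-z0-9]+", text)

def naive_stem(token):
    """Toy suffix stripper for illustration only."""
    for suffix in ("ning", "ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    tokens = tokenize(text)                              # tokenization
    tokens = [t.lower() for t in tokens]                 # lowercasing
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop word removal
    return [naive_stem(t) for t in tokens]               # stemming

result = preprocess("I LOVE ML")                              # ['love', 'ml']
stems = [naive_stem(t) for t in ["running", "runs", "ran"]]   # ['run', 'run', 'ran']
```

The irregular form "ran" passes through unchanged, which is the gap that lemmatization is meant to close.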
Bag of Words and TF-IDF
Bag of Words
Bag of Words (BoW) counts word occurrences in each document:
| Document | Text | Vector |
|---|---|---|
| Doc 1 | "I love ML" | [1, 1, 1, 0] |
| Doc 2 | "I love dogs" | [1, 1, 0, 1] |
Vocabulary: [I, love, ML, dogs]
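The table above can be reproduced with a short sketch. It assumes whitespace tokenization and a vocabulary ordered by first appearance; the helper name `bag_of_words` is illustrative.

```python
from collections import Counter

docs = {"Doc 1": "I love ML", "Doc 2": "I love dogs"}

# Build the vocabulary in order of first appearance across the documents.
vocabulary = []
for text in docs.values():
    for token in text.split():
        if token not in vocabulary:
            vocabulary.append(token)

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]

vectors = {name: bag_of_words(text, vocabulary) for name, text in docs.items()}
# vocabulary -> ['I', 'love', 'ML', 'dogs']
# vectors    -> {'Doc 1': [1, 1, 1, 0], 'Doc 2': [1, 1, 0, 1]}
```

Each vector position corresponds to one vocabulary word, so word order within a document is discarded.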
[Figure: Word Embeddings Space]
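TF-IDF
$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t)$$
Here,
- $\text{tf}(t, d)$ = term frequency of term $t$ in document $d$
- $\text{idf}(t)$ = inverse document frequency (rarer = more important)
- $\text{tf-idf}(t, d)$ = the resulting weight: common words get low values, rare words get high values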
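To make the weighting concrete, here is a hand-rolled sketch of the unsmoothed textbook definition. It assumes term frequency is the relative frequency within the document and that idf is log(N / df); the function names are illustrative, and `idf` should only be called for terms that occur in at least one document. scikit-learn's `TfidfVectorizer` uses a smoothed idf and L2 normalization by default, so its numbers will differ.

```python
import math

# Two tiny, already tokenized documents (illustrative data).
docs = [
    ["i", "love", "ml"],
    ["i", "love", "dogs"],
]

def tf(term, doc):
    """Relative frequency of the term within one document (an assumed convention)."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """log(N / df); the term must occur in at least one document."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

w_love = tf_idf("love", docs[0], docs)  # in every document: idf = log(1) = 0
w_ml = tf_idf("ml", docs[0], docs)      # in one of two documents: (1/3) * log(2)
```

A word shared by every document gets weight zero, while a word unique to one document keeps a positive weight.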
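Example: TF-IDF in Python
```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["I love machine learning", "I love dogs", "Machine learning is great"]

# Learn the vocabulary and IDF weights, then build the document-term matrix.
# The default tokenizer lowercases text and drops one-letter tokens such as "I".
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # column order of the matrix
print(X.toarray())                         # one row per document
```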
Word Embeddings
One-hot encoding: [0, 0, 1, 0, 0] — sparse, no meaning
Embedding: [0.2, -0.5, 0.8, 0.1, 0.3] — dense, captures meaning
Word2Vec
Word2Vec learns semantic relationships:
- Skip-gram: Predict context from word
- CBOW: Predict word from context
- Learns analogies: king - man + woman ≈ queen
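The skip-gram idea of predicting context from a word comes down to training on (center, context) pairs drawn from a sliding window. The sketch below only generates those pairs; it does not train a model. The name `skipgram_pairs` and the `window` parameter are assumptions for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every position in tokens."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["i", "love", "machine", "learning"]
pairs = skipgram_pairs(tokens, window=1)
# [('i', 'love'), ('love', 'i'), ('love', 'machine'),
#  ('machine', 'love'), ('machine', 'learning'), ('learning', 'machine')]
```

CBOW uses the same windows in the opposite direction: the context words together form the input and the center word is the target.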
GloVe
GloVe (Global Vectors) learns embeddings from global word co-occurrence statistics. Pre-trained GloVe vectors are available, including sets trained on corpora such as Wikipedia.
Example: Word Embeddings with Gensim
```python
import gensim.downloader as api

# Load pre-trained Word2Vec vectors (a large download on first use)
model = api.load('word2vec-google-news-300')

# Similarity between two words
print(model.similarity('cat', 'dog'))  # approximately 0.76

# Analogy: king - man + woman
print(model.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))
# approximately [('queen', 0.71)]
```
N-grams and Local Word Order
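An n-gram is a run of n consecutive tokens. Bag of Words over single words ignores order, while counting n-grams keeps some of it. A minimal sketch follows; the helper name `ngrams` and the example sentences are illustrative assumptions.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

a = "dog bites man".split()
b = "man bites dog".split()

# As unigram bags the two sentences are identical...
same_unigrams = sorted(a) == sorted(b)   # True

# ...but their bigrams differ, so local word order is preserved.
bigrams_a = ngrams(a, 2)   # [('dog', 'bites'), ('bites', 'man')]
bigrams_b = ngrams(b, 2)   # [('man', 'bites'), ('bites', 'dog')]
```

The vocabulary grows quickly with n, which is why bigrams and trigrams are the usual choices.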
Text Classification
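Example: Text Classification Pipeline
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy training data for illustration only; substitute a real labeled dataset
X_train_text = ["I loved this film", "Great acting and story",
                "Terrible plot", "I hated this movie"]
y_train = ["positive", "positive", "negative", "negative"]

# Pipeline: TF-IDF features followed by a Naive Bayes classifier
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000)),
    ('clf', MultinomialNB()),
])

# Train
pipeline.fit(X_train_text, y_train)

# Predict
predictions = pipeline.predict(["This movie is great!"])
print(predictions)
```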
Key Takeaways
Summary: NLP Fundamentals
- Tokenization is the first step in any NLP pipeline
- TF-IDF is simple and effective for text classification
- Word embeddings capture semantic meaning
- Word2Vec and GloVe are embedding methods with widely used pre-trained vectors
- Pre-trained models (BERT, GPT) achieve state-of-the-art results on many NLP tasks
- Text preprocessing significantly impacts performance
- N-grams capture local word order
- Sentiment analysis is a common NLP task
What to Learn Next
- Transformers: Learn the self-attention architecture that revolutionized NLP and powers modern AI.
- BERT: Master encoder-only transformers for text classification, NER, and question answering.
- GPT Architecture: Understand decoder-only transformers that power autoregressive text generation.
- Naive Bayes: Learn the simple probabilistic classifier often used as a strong NLP baseline.
- RNN and LSTM: Explore sequential models that were the dominant NLP approach before transformers.
- GANs: Discover generative adversarial networks for text generation and style transfer.