NLP Fundamentals — Text Processing, Embeddings and Classification

Natural Language Processing — Teaching Computers to Read

NLP enables computers to understand, interpret, and generate human language — bridging the gap between raw text and actionable insights.

  • Text Preprocessing — tokenization, stemming, and lemmatization clean and normalize raw text
  • TF-IDF and Bag of Words — simple but effective vectorization methods for text classification
  • Word Embeddings — Word2Vec and GloVe capture semantic relationships between words in dense vector space

"Language is the house of being." — Martin Heidegger

NLP Fundamentals — Complete Guide


Mathematical Foundations

TF-IDF Formula

$$\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)$$

where the term frequency is

$$\text{TF}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}$$

and the inverse document frequency is

$$\text{IDF}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|}$$

Here f_{t,d} is the number of times term t occurs in document d, and N is the number of documents in the corpus D.
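The formula above can be checked by hand on a tiny corpus. The following is a minimal sketch, assuming whitespace tokenization, the natural logarithm, and a term that occurs in at least one document; the helper names `tf`, `idf` and `tf_idf` are illustrative and not taken from any library.

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    """Relative frequency of `term` within one tokenized document."""
    counts = Counter(doc_tokens)
    return counts[term] / len(doc_tokens)


def idf(term, corpus_tokens):
    """log(N / number of documents that contain `term`), natural log assumed."""
    n_docs = len(corpus_tokens)
    doc_freq = sum(1 for doc in corpus_tokens if term in doc)
    if doc_freq == 0:
        raise ValueError(f"{term!r} does not occur in the corpus")
    return math.log(n_docs / doc_freq)


def tf_idf(term, doc_tokens, corpus_tokens):
    """Product of term frequency and inverse document frequency."""
    return tf(term, doc_tokens) * idf(term, corpus_tokens)


# Toy corpus, already lowercased and split on whitespace
corpus = [
    "i love machine learning".split(),
    "i love dogs".split(),
    "machine learning is great".split(),
]

# "dogs" appears in one document, "love" in two, so "dogs" scores higher
score_dogs = tf_idf("dogs", corpus[1], corpus)
score_love = tf_idf("love", corpus[1], corpus)
print(score_dogs, score_love)
```

Both terms have the same term frequency in the second document (1/3), so the difference in score comes entirely from the IDF factor.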

Cosine Similarity (for embeddings)

$$\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{||\mathbf{a}|| \cdot ||\mathbf{b}||} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \sqrt{\sum_{i=1}^{n} b_i^2}}$$
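As a rough illustration of the cosine formula, here is a small pure-Python sketch. The function name `cosine_similarity` is an illustrative choice, and raising an error for zero vectors is an assumption, since the formula is undefined in that case.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1, 2, 3], [2, 4, 6])  # parallel vectors
orthogonal = cosine_similarity([1, 0], [0, 1])            # perpendicular vectors
opposite = cosine_similarity([1, 0], [-1, 0])             # opposite directions
print(same_direction, orthogonal, opposite)
```

The value depends only on direction, not on vector length, which is why it is a common choice for comparing embeddings.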

Word2Vec Skip-gram Objective

$$\mathcal{L} = \sum_{t=1}^{T} \sum_{-c \leq j \leq c, j \neq 0} \log P(w_{t+j} | w_t)$$
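The double sum in the objective runs over (center word, context word) pairs. The sketch below only enumerates those pairs for a window size c; it does not train a model. The function name `skipgram_pairs` is illustrative, and skipping positions that fall outside the sentence is an assumption about boundary handling.

```python
def skipgram_pairs(tokens, window):
    """List (center, context) pairs for every position t and offset j != 0, |j| <= window."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-window, window + 1):
            if j == 0:
                continue  # a word is not its own context
            pos = t + j
            if 0 <= pos < len(tokens):  # ignore offsets outside the sentence
                pairs.append((center, tokens[pos]))
    return pairs


pairs = skipgram_pairs(["i", "love", "machine", "learning"], window=1)
print(pairs)
```

Each pair contributes one term log P(context | center) to the objective.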

NLP Preprocessing Pipeline

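[Figure: NLP Text Preprocessing Pipeline. Raw text "I LOVE ML!!!" -> tokenize ["I", "LOVE", "ML"] -> lowercase ["i", "love", "ml"] -> remove stop words (drops "i") ["love", "ml"] -> stem/lemmatize ["love", "ml"] -> vectorize (tokens to numbers)]

Tokenization Comparison

  • Word tokenization: "Don't" -> ["Do", "n't"], "New-York" -> ["New", "York"]. Simple and fast, but each word is treated as an unrelated unit and words outside the vocabulary cannot be represented.
  • Subword tokenization (BPE): "unhappiness" -> ["un", "happi", "ness"]. Handles rare words by composing them from known pieces. GPT models use BPE; BERT uses the closely related WordPiece.
  • Character tokenization: "Hello" -> ["H", "e", "l", "l", "o"]. No out-of-vocabulary (OOV) issues, but sequences become long and slow to process.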
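Word-level and character-level tokenization from the comparison above can be sketched with the standard library alone. The regular expression is an illustrative assumption that splits off "n't" in the Treebank style and keeps punctuation marks as separate tokens; real tokenizers handle many more cases.

```python
import re


def word_tokenize(text):
    """Split into words, a separate "n't" token, and single punctuation marks."""
    return re.findall(r"\w+(?=n't)|n't|\w+|[^\w\s]", text)


def char_tokenize(text):
    """One token per character."""
    return list(text)


contraction = word_tokenize("Don't")
sentence = word_tokenize("I love ML")
characters = char_tokenize("Hello")
print(contraction, sentence, characters)
```

Subword tokenizers such as BPE need a learned merge table, so they are not reproduced in this sketch.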

Definition: NLP Preprocessing Pipeline

The standard text preprocessing pipeline:

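  1. Tokenization: split text into tokens, e.g. "I love ML" -> ["I", "love", "ML"]
  2. Lowercasing: convert to lowercase, e.g. "I LOVE ML" -> "i love ml"
  3. Stop word removal: remove very common words, e.g. ["i", "love", "ml"] -> ["love", "ml"]
  4. Stemming: strip suffixes with heuristic rules, e.g. ["running", "runs"] -> ["run"]; irregular forms such as "ran" are usually left unchanged
  5. Lemmatization: reduce to the dictionary form using vocabulary and part of speech (usually more accurate than stemming), e.g. ["ran"] -> ["run"], ["better"] -> ["good"]
  6. Vectorization: convert tokens to numeric vectors (Bag of Words, TF-IDF, or embeddings)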
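Steps 1 to 4 can be sketched in a few lines of standard-library Python. This is a toy illustration under stated assumptions: `STOP_WORDS` is a tiny hand-picked list, and `crude_stem` is a hypothetical suffix stripper, not a real stemming algorithm such as Porter's. Lemmatization and vectorization are left out because they need a dictionary or a fitted vocabulary.

```python
import re

# Tiny illustrative stop word list; real lists contain a few hundred entries
STOP_WORDS = {"i", "a", "an", "the", "is", "are", "and", "or", "to", "of"}


def crude_stem(token):
    """Toy suffix stripper for illustration only (not the Porter algorithm)."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    tokens = re.findall(r"[A-Za-z]+", text)               # 1. tokenize (letters only)
    tokens = [t.lower() for t in tokens]                  # 2. lowercase
    tokens = [t for t in tokens if t not in STOP_WORDS]   # 3. remove stop words
    return [crude_stem(t) for t in tokens]                # 4. stem


result = preprocess("I LOVE ML!!!")
print(result)
print(preprocess("The dogs are barking"))
```

In practice a library such as NLTK or spaCy would replace the toy stop word list and stemmer.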

Bag of Words and TF-IDF

Definition: Bag of Words

Bag of Words (BoW) counts word occurrences in each document:

Document | Text          | Vector
Doc 1    | "I love ML"   | [1, 1, 1, 0]
Doc 2    | "I love dogs" | [1, 1, 0, 1]

Vocabulary: [I, love, ML, dogs]
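The two vectors above can be reproduced with a short sketch. It assumes whitespace tokenization and a vocabulary ordered by first appearance; the helper names `build_vocabulary` and `bag_of_words` are illustrative.

```python
def build_vocabulary(docs):
    """Collect unique tokens in order of first appearance."""
    vocab = []
    for doc in docs:
        for token in doc.split():
            if token not in vocab:
                vocab.append(token)
    return vocab


def bag_of_words(doc, vocab):
    """Count how often each vocabulary term occurs in the document."""
    tokens = doc.split()
    return [tokens.count(term) for term in vocab]


docs = ["I love ML", "I love dogs"]
vocab = build_vocabulary(docs)
vectors = [bag_of_words(doc, vocab) for doc in docs]
print(vocab)
print(vectors)
```

Word order plays no role in the result, which is the defining property of the Bag of Words representation.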

Word Embeddings Space

[Figure: Word Embeddings, Semantic Vector Space. Left panel: one-hot encoding (sparse), where king, queen, man, woman and cat are mutually orthogonal vectors with no semantic meaning. Right panel: dense, low-dimensional word embeddings, where king - man + woman ≈ queen.]

TF-IDF

$$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)$$

Here,

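  • TF: term frequency, how often the term occurs in the document
  • IDF: inverse document frequency; terms that appear in fewer documents get a larger weight
  • TF-IDF: low for terms that are common across the corpus, high for terms that are frequent in one document but rare elsewhere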

Example: TF-IDF in Python

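from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["I love machine learning", "I love dogs", "Machine learning is great"]

# Learn the vocabulary and IDF weights, then build the document-term matrix
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
# Column order of X; the default settings lowercase text and drop one-letter tokens such as "I"
print(vectorizer.get_feature_names_out())
# One row per document, one column per vocabulary term
print(X.toarray())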
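The numbers printed by scikit-learn will not match a hand calculation with the textbook formula, because `TfidfVectorizer` applies a smoothed variant by default. The following summarizes the default behavior (`smooth_idf=True`, `sublinear_tf=False`, `norm='l2'`) as described in the scikit-learn documentation; here df(t) denotes the number of documents containing t and v_d the unnormalized row for document d, both notational assumptions.

```latex
% Term frequency is the raw count of t in d
\text{tf}(t, d) = f_{t,d}

% Smoothed inverse document frequency (natural logarithm)
\text{idf}(t) = \ln \frac{1 + N}{1 + \text{df}(t)} + 1

% Each document row is then scaled to unit Euclidean length
\mathbf{v}_d \leftarrow \frac{\mathbf{v}_d}{\lVert \mathbf{v}_d \rVert_2}
```

The added 1 keeps terms that occur in every document from being zeroed out, and the row normalization makes documents of different lengths comparable.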

Word Embeddings

Definition: Word Embeddings

One-hot encoding: [0, 0, 1, 0, 0] — sparse, no meaning

Embedding: [0.2, -0.5, 0.8, 0.1, 0.3] — dense, captures meaning

Word2Vec

Word2Vec learns semantic relationships:

  • Skip-gram: Predict context from word
  • CBOW: Predict word from context
  • Learns analogies: king - man + woman ≈ queen
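The analogy arithmetic can be illustrated without a trained model. The sketch below uses hand-made two-dimensional vectors, which are an assumption for illustration and not learned embeddings: the first axis loosely stands for "royalty" and the second for "gender". The helper names `cosine` and `analogy` are illustrative.

```python
import math

# Hand-made toy vectors (axis 0: "royalty", axis 1: "gender"); NOT learned embeddings
toy_vectors = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "cat": (-1.0, 0.0),
}


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    target = tuple(x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c]))
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


answer = analogy("king", "man", "woman", toy_vectors)
print(answer)
```

Real Word2Vec vectors have hundreds of dimensions and the match is only approximate, but the lookup works the same way: compute the target vector, then search for its nearest neighbor.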

GloVe

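GloVe (Global Vectors) learns embeddings from global word-word co-occurrence statistics. Pre-trained vectors are available for several corpora, including Wikipedia combined with Gigaword, Common Crawl, and Twitter.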

Example: Word Embeddings with Gensim

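import gensim.downloader as api

# Load pre-trained Word2Vec vectors (large download on first use)
model = api.load('word2vec-google-news-300')

# Cosine similarity between two word vectors
print(model.similarity('cat', 'dog'))  # about 0.76

# Analogy: king - man + woman
print(model.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))
# top result is 'queen' (similarity about 0.71)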

N-grams and Local Word Order

N-grams: Capturing Word Order

  • Unigrams (n=1): "I love ML" -> ["I", "love", "ML"]. This is the Bag of Words approach: fast and simple, but word order is lost, so the "not" in "not good" is separated from "good".
  • Bigrams (n=2): "I love ML" -> ["I love", "love ML"]. Captures adjacent pairs, so "not good" becomes a single feature, at the cost of a larger vocabulary.
  • Trigrams (n=3): "I love ML" -> ["I love ML"]. Captures more phrase context, but the vocabulary becomes very large and the features sparse. Usually n=1 or n=2 works best.
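Extracting n-grams from a token list takes one sliding window. A minimal sketch, assuming the tokens are already split and that n-grams are joined with a single space; the function name `ngrams` is illustrative.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token windows, joined with spaces."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["I", "love", "ML"]
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(unigrams, bigrams, trigrams)
```

A sentence with k tokens yields k - n + 1 n-grams, and none when n exceeds k.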

Text Classification

Example: Text Classification Pipeline

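from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy training data for illustration only (1 = positive, 0 = negative)
X_train_text = ["I loved this movie", "What a great film", "Terrible acting", "I hated this movie"]
y_train = [1, 1, 0, 0]

# Pipeline: TF-IDF features followed by a Naive Bayes classifier
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000)),
    ('clf', MultinomialNB())
])

# Train
pipeline.fit(X_train_text, y_train)

# Predict the label of an unseen review
predictions = pipeline.predict(["This movie is great!"])
print(predictions)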

Key Takeaways

Summary: NLP Fundamentals

  • Tokenization is the first step in most NLP pipelines
  • TF-IDF is a simple and effective baseline for text classification
  • Word embeddings capture semantic meaning in dense vectors
  • Word2Vec and GloVe are embedding methods with widely available pre-trained vectors
  • Pre-trained transformer models (BERT, GPT) achieve state-of-the-art results on many NLP tasks
  • Text preprocessing significantly impacts performance
  • N-grams capture local word order
  • Sentiment analysis is a common NLP task

What to Learn Next

-> Transformers Learn the self-attention architecture that revolutionized NLP and powers modern AI.

-> BERT Master encoder-only transformers for text classification, NER, and question answering.

-> GPT Architecture Understand decoder-only transformers that power autoregressive text generation.

-> Naive Bayes Learn the simple probabilistic classifier often used as a strong NLP baseline.

-> RNN and LSTM Explore sequential models that were the dominant NLP approach before transformers.

-> GANs Discover generative adversarial networks for text generation and style transfer.
