Word Embeddings
Word embeddings are dense, continuous vector representations of words in a low-dimensional space. Unlike sparse bag-of-words (BoW) or TF-IDF vectors, embeddings capture semantic relationships: words with similar meanings have similar vectors.
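To make the sparse-versus-dense contrast concrete, here is a minimal sketch. The three-word vocabulary and the dense vectors are hypothetical values picked by hand for illustration; they are not learned embeddings.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

vocab = ["cat", "dog", "car"]

# Sparse one-hot vectors: one dimension per vocabulary word.
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# Hypothetical dense vectors, hand-picked for illustration (not learned).
dense = {
    "cat": np.array([0.9, 0.8, 0.1]),
    "dog": np.array([0.8, 0.9, 0.2]),
    "car": np.array([0.1, 0.2, 0.9]),
}

# Distinct one-hot vectors are orthogonal, so they carry no similarity signal.
print(cosine(one_hot["cat"], one_hot["dog"]))  # 0.0

# Dense vectors can place related words closer together than unrelated ones.
print(cosine(dense["cat"], dense["dog"]) > cosine(dense["cat"], dense["car"]))  # True
```

With one-hot vectors every pair of distinct words looks equally unrelated, which is the limitation that dense embeddings are meant to address.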
Skip-gram Objective
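The formula for this heading is not reproduced in this section. As a sketch, the skip-gram objective as it is commonly stated for Word2Vec maximizes the average log-probability of the context words around each center word. The notation is an assumption: $T$ is the number of tokens in the corpus, $m$ is the window size, $w_t$ is the word at position $t$, and $\theta$ denotes the model parameters.

$$
J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t; \theta)
$$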
Skip-gram Probability
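The formula for this heading is not reproduced in this section. In the usual full-softmax formulation, the probability of a context word $w_{t+j}$ given the center word $w_t$ is a softmax over dot products. The notation is an assumption: $v_w$ is the center ("input") vector of word $w$, $u_w$ is its context ("output") vector, and $V$ is the vocabulary.

$$
P(w_{t+j} \mid w_t; \theta) = \frac{\exp\left(u_{w_{t+j}}^{\top} v_{w_t}\right)}{\sum_{w \in V} \exp\left(u_{w}^{\top} v_{w_t}\right)}
$$

Because the denominator sums over the whole vocabulary, practical implementations typically approximate it, for example with negative sampling or hierarchical softmax.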
Word Embedding Methods Comparison
| Method | Training | Context Window | Captures Morphology | Year |
|---|---|---|---|---|
| Word2Vec (CBOW) | Shallow NN | Fixed window | No | 2013 |
| Word2Vec (Skip-gram) | Shallow NN | Fixed window | No | 2013 |
| GloVe | Matrix factorization | Global co-occurrence | No | 2014 |
| FastText | Shallow NN with character n-grams | Fixed window | Yes | 2017 |
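The "Captures Morphology" column comes down to subword units: FastText represents a word through its character n-grams, so words that share a stem or suffix share parameters. The helper below is a hypothetical sketch of n-gram extraction only, not FastText's own code. The `<` and `>` boundary markers follow the convention described for FastText, and the default of `n = 3` is an assumption chosen to keep the output short.

```python
def char_ngrams(word, n_min=3, n_max=3):
    """Return the character n-grams of `word`, with < and > marking word boundaries."""
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of length n across the marked word.
        grams.extend(marked[i:i + n] for i in range(len(marked) - n + 1))
    return grams

print(char_ngrams("where"))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

In a subword model, a word vector is then built from the vectors of these n-grams, which is why an unseen word can still receive a representation.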
Using Pre-trained Embeddings
import numpy as np
from gensim.models import KeyedVectors
# Load pre-trained Word2Vec
model = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True
)
# Similarity
print(model.similarity('king', 'queen')) # ~0.65
print(model.similarity('king', 'man')) # ~0.32
# Most similar
print(model.most_similar('computer', topn=5))
# [('computers', 0.87), ('software', 0.78), ...]
# Analogy: king - man + woman = queen
result = model.most_similar(
    positive=['king', 'woman'],
    negative=['man'],
    topn=1
)
print(result) # [('queen', 0.71)]
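The analogy query can also be sketched by hand, which shows what the vector arithmetic is doing. The 3-dimensional vectors below are hypothetical and hand-picked (the dimensions loosely mean royalty, male, female), and `analogy` is an illustrative helper; gensim's internal implementation may differ in its details.

```python
import numpy as np

# Hypothetical toy vectors: [royalty, male, female]. Not learned embeddings.
toy = {
    "king":   np.array([1.0, 1.0, 0.0]),
    "queen":  np.array([1.0, 0.0, 1.0]),
    "man":    np.array([0.0, 1.0, 0.0]),
    "woman":  np.array([0.0, 0.0, 1.0]),
    "prince": np.array([0.8, 1.0, 0.0]),
}

def analogy(positive, negative, vectors):
    """Return the word whose vector is closest (by cosine) to sum(positive) - sum(negative)."""
    unit = {w: v / np.linalg.norm(v) for w, v in vectors.items()}
    target = sum(unit[w] for w in positive) - sum(unit[w] for w in negative)
    target = target / np.linalg.norm(target)
    # Exclude the query words themselves, otherwise they tend to win.
    excluded = set(positive) | set(negative)
    scores = {w: float(np.dot(u, target)) for w, u in unit.items() if w not in excluded}
    return max(scores, key=scores.get)

print(analogy(positive=["king", "woman"], negative=["man"], vectors=toy))  # queen
```

Excluding the input words matters: without it, the nearest neighbor of the target vector is often one of the query words.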
Cosine Similarity for Embeddings
Cosine Similarity for Vectors
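The formula for this heading is not reproduced in this section. The standard definition of cosine similarity between two vectors $\mathbf{v}_1$ and $\mathbf{v}_2$, which is what the `cosine_sim` function computes, is:

$$
\cos(\mathbf{v}_1, \mathbf{v}_2) = \frac{\mathbf{v}_1 \cdot \mathbf{v}_2}{\lVert \mathbf{v}_1 \rVert \, \lVert \mathbf{v}_2 \rVert}
$$

The value lies in $[-1, 1]$ and depends only on the angle between the vectors, not on their lengths.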
def cosine_sim(v1, v2):
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
king = model['king']
queen = model['queen']
man = model['man']
print(f"king-queen: {cosine_sim(king, queen):.3f}")
print(f"king-man: {cosine_sim(king, man):.3f}")
Visualizing Embeddings
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
words = ['king', 'queen', 'man', 'woman', 'prince', 'princess',
         'cat', 'dog', 'fish', 'bird']
vectors = np.array([model[w] for w in words])
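# perplexity must be smaller than the number of samples (10 words here);
# the default of 30 raises an error in recent scikit-learn versions
tsne = TSNE(n_components=2, perplexity=5, random_state=42)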
coords = tsne.fit_transform(vectors)
plt.figure(figsize=(10, 8))
plt.scatter(coords[:, 0], coords[:, 1], c='red', alpha=0.7)
for i, word in enumerate(words):
    plt.annotate(word, (coords[i, 0], coords[i, 1]))
plt.title("Word Embeddings Visualization")
plt.show()
Properties of Word Embeddings
- Semantic similarity: Similar words have close vectors
- Linear relationships: king - man + woman ≈ queen
- Clustering: Semantically related words form clusters
- Compositionality: Word vectors can be combined (for example, averaged) to represent phrases
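As an illustration of compositionality, one simple and common baseline is to average the vectors of the words in a phrase. The sketch below uses a hypothetical two-word lookup table; `phrase_vector` is an illustrative helper, and averaging is only one of several possible ways to combine vectors.

```python
import numpy as np

def phrase_vector(tokens, vectors):
    """Average the vectors of the tokens that have an entry in `vectors`."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("no token in the phrase has a vector")
    return np.mean(known, axis=0)

# Hypothetical 2-d vectors, hand-picked for illustration.
toy = {"new": np.array([1.0, 0.0]), "york": np.array([0.0, 1.0])}

# Out-of-vocabulary tokens are skipped rather than raising a KeyError.
print(phrase_vector(["new", "york", "unknownword"], toy))  # [0.5 0.5]
```

Averaging ignores word order, so "dog bites man" and "man bites dog" receive the same vector; it is a baseline rather than a full model of phrase meaning.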