Semantic Search
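Semantic search retrieves documents by meaning rather than by exact keyword match. Queries and documents are encoded as dense vectors, and a similarity search (exact or approximate nearest neighbor) returns the documents whose vectors lie closest to the query vector.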
Sentence Embeddings
Sentence embeddings encode text into fixed-size dense vectors that capture semantic meaning.
from sentence_transformers import SentenceTransformer
import numpy as np
# Load pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')
# Encode documents
documents = [
"Machine learning is a subset of artificial intelligence.",
"Deep learning uses neural networks with many layers.",
"Natural language processing deals with text and speech.",
"Computer vision analyzes images and videos.",
]
# Generate embeddings
embeddings = model.encode(documents)
print(f"Embedding shape: {embeddings.shape}") # (4, 384)
# Compute similarity (dot product; this equals cosine similarity only when the embeddings are unit-length)
query = "What is AI?"
query_embedding = model.encode([query])
similarities = np.dot(embeddings, query_embedding.T).flatten()
# Rank documents from most to least similar
ranked_indices = np.argsort(similarities)[::-1]
for idx in ranked_indices:
    print(f"Score: {similarities[idx]:.4f} | {documents[idx]}")
Cosine Similarity
def cosine_similarity(a, b):
    # Cosine of the angle between two 1-D vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# Batch computation
def batch_cosine_similarity(query_emb, doc_embs):
    # Normalize each row to unit length (both inputs must be 2-D arrays)
    query_norm = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    doc_norm = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    # Result has shape (num_docs, num_queries)
    return np.dot(doc_norm, query_norm.T)
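For reference, the quantity both functions compute can be written out as below. The second line is a standard identity rather than something stated in the original text; it is included because the FAISS examples later in this section use L2 distance, and it shows that for unit-length vectors, ranking by squared L2 distance and ranking by cosine similarity produce the same order.

```latex
\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}

\lVert \mathbf{a} - \mathbf{b} \rVert^2 = 2 - 2\cos(\mathbf{a}, \mathbf{b})
\quad \text{when } \lVert \mathbf{a} \rVert = \lVert \mathbf{b} \rVert = 1
```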
FAISS (Facebook AI Similarity Search)
FAISS provides efficient similarity search and clustering of dense vectors.
import faiss
import numpy as np
# Generate random embeddings for demonstration
dimension = 384
num_documents = 100000
np.random.seed(42)
documents = np.random.random((num_documents, dimension)).astype('float32')
# Build FAISS index
# Option 1: Flat index (exact search)
index_flat = faiss.IndexFlatL2(dimension)
index_flat.add(documents)
# Option 2: IVF index (approximate search)
nlist = 100 # Number of clusters
quantizer = faiss.IndexFlatL2(dimension)
index_ivf = faiss.IndexIVFFlat(quantizer, dimension, nlist)
index_ivf.train(documents)
index_ivf.add(documents)
# Option 3: HNSW index (graph-based)
index_hnsw = faiss.IndexHNSWFlat(dimension, 32)
index_hnsw.hnsw.efConstruction = 64
index_hnsw.add(documents)
# Search
query = np.random.random((1, dimension)).astype('float32')
k = 5 # Number of nearest neighbors
# Flat search (exact)
distances, indices = index_flat.search(query, k)
print(f"Flat results: {indices[0]}, distances: {distances[0]}")
# IVF search (approximate)
index_ivf.nprobe = 10 # Number of clusters to search
distances, indices = index_ivf.search(query, k)
print(f"IVF results: {indices[0]}, distances: {distances[0]}")
# HNSW search
distances, indices = index_hnsw.search(query, k)
print(f"HNSW results: {indices[0]}, distances: {distances[0]}")
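To make concrete what "exact search" means for the flat index, here is a minimal pure-Python sketch of a brute-force nearest-neighbor scan. It is an illustration only, not how FAISS is implemented; the function name `exact_l2_search` and the toy vectors are made up for this example. Like `IndexFlatL2`, it reports squared Euclidean distances.

```python
def exact_l2_search(query, vectors, k):
    """Return (distances, indices) of the k vectors closest to query.

    Distances are squared Euclidean distances, sorted ascending.
    Hypothetical helper for illustration; not part of FAISS.
    """
    scored = []
    for idx, vec in enumerate(vectors):
        dist = sum((q - v) ** 2 for q, v in zip(query, vec))
        scored.append((dist, idx))
    scored.sort()  # smallest distance first; ties broken by index
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


# Toy 2-D vectors (invented for illustration)
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
distances, indices = exact_l2_search([0.9, 0.1], vectors, k=2)
print(indices, distances)
```

Every query compares against all stored vectors, which is why exact search scales linearly with collection size and why the IVF and HNSW indexes trade some accuracy for speed.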
Index Comparison
| Index Type | Build Time | Search Time | Memory | Accuracy |
|---|---|---|---|---|
| Flat (L2) | O(1) | O(n) | High | 100% |
| IVF | O(n) | O(n/k) | Medium | ~95% |
| HNSW | O(n log n) | O(log n) | High | ~98% |
| PQ | O(n) | O(n/k) | Low | ~90% |
| IVF-PQ | O(n) | O(n/k²) | Low | ~92% |
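If the Accuracy column is read as recall against exact search (an assumption; the table does not say how it was measured), the figure for a given dataset can be estimated by comparing an approximate index's results with those of a flat index. The sketch below uses a hypothetical helper `recall_at_k` and invented ID lists.

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k neighbors that the approximate search also returned."""
    exact = set(exact_ids)
    if not exact:
        return 0.0
    return len(exact & set(approx_ids)) / len(exact)


# Hypothetical result lists for one query (IDs invented for illustration)
exact_ids = [12, 7, 33, 91, 4]    # e.g. from a flat index
approx_ids = [12, 33, 7, 58, 4]   # e.g. from an IVF or HNSW index
print(f"recall@5 = {recall_at_k(approx_ids, exact_ids):.2f}")
```

Averaging this value over a sample of queries gives a dataset-specific accuracy estimate, which will generally differ from the rough percentages in the table and will change with parameters such as `nprobe`.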
FAISS Index Selection
def select_index(num_vectors, dimension, speed_priority=True):
    # Rule-of-thumb heuristic; `dimension` is accepted but not used here
    if num_vectors < 10000:
        return "IndexFlatL2"  # Exact search
    elif num_vectors < 1000000:
        if speed_priority:
            return "IndexHNSWFlat"  # Fast, good accuracy
        else:
            return "IndexIVFFlat"  # Balanced
    else:
        if speed_priority:
            return "IndexIVFPQ"  # Memory efficient
        else:
            return "IndexHNSWFlat"  # Good accuracy
# Example usage
index_type = select_index(500000, 768, speed_priority=True)
print(f"Recommended index: {index_type}")
Vector Databases
| Database | Type | Scaling | Features |
|---|---|---|---|
| FAISS | Library | Single machine | Fast, GPU support |
| Milvus | Distributed | Horizontal | Production-ready |
| Pinecone | Cloud | Managed | Easy to use |
| Weaviate | Self-hosted | Horizontal | GraphQL API |
| Qdrant | Self-hosted | Horizontal | Filtering |
| ChromaDB | Embedded | Single machine | Lightweight |
# Example with ChromaDB
import chromadb
client = chromadb.Client()
collection = client.create_collection("documents")
# Add documents
documents = [
"Machine learning algorithms learn patterns from data.",
"Deep neural networks have multiple hidden layers.",
"Natural language processing understands human language.",
]
ids = ["doc1", "doc2", "doc3"]
collection.add(
documents=documents,
ids=ids,
metadatas=[{"source": "textbook"} for _ in documents]
)
# Query
results = collection.query(
query_texts=["What is deep learning?"],
n_results=2
)
print(results["documents"])
Evaluation Metrics
| Metric | Description | Formula |
|---|---|---|
| Recall@k | Fraction of relevant docs in the top-k | \|relevant ∩ top-k\| / \|relevant\| |
| MRR | Mean Reciprocal Rank | (1/Q) Σ 1/rank_i |
| NDCG | Normalized Discounted Cumulative Gain | DCG / IDCG |
| MAP | Mean Average Precision | (1/Q) Σ AP_i |
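The formulas in the table can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: relevance is binary, NDCG uses the common log2(rank + 1) discount, average precision is divided by the total number of relevant documents, and the function names and query data are invented for the example.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Share of the relevant docs that appear in the top-k results of one query."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked, relevant):
    """Sum of precision at each rank holding a relevant doc, divided by len(relevant)."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def ndcg_at_k(ranked, relevant, k):
    """DCG / IDCG with binary gains and a log2(rank + 1) discount."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked[:k], start=1)
              if doc_id in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Two hypothetical queries: (ranked result IDs, set of relevant IDs)
queries = [
    (["d3", "d1", "d7", "d2"], {"d1", "d2"}),
    (["d5", "d9", "d4", "d8"], {"d5"}),
]
mrr = sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)
mean_ap = sum(average_precision(r, rel) for r, rel in queries) / len(queries)
print(f"MRR = {mrr:.3f}, MAP = {mean_ap:.3f}")
print(f"Recall@2 (query 1) = {recall_at_k(*queries[0], k=2):.2f}")
print(f"NDCG@4 (query 1) = {ndcg_at_k(*queries[0], k=4):.3f}")
```

MRR and MAP are averages over queries, so they are computed outside the per-query helpers; Recall@k and NDCG@k would be averaged the same way in a full evaluation.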
Semantic Search Ranking
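Semantic search tends to outperform keyword search when the query and the relevant documents use different vocabulary for the same concept. For example, "automobile" and "car" are semantically similar but lexically different, so a keyword match on one misses documents that contain only the other.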
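The vocabulary gap can be made visible with a simple token-overlap score. The sketch below uses Jaccard overlap as a stand-in for keyword matching (an assumption made for brevity; real keyword engines typically use weighted schemes such as BM25), and the example strings are invented.

```python
def jaccard_overlap(text_a, text_b):
    """Share of distinct lowercase tokens that the two texts have in common."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


query = "used automobile prices"
same_meaning = "second-hand car costs"          # relevant, but shares no tokens
off_topic = "automobile museum opening hours"   # shares one token, different intent

print(f"same meaning: {jaccard_overlap(query, same_meaning):.2f}")
print(f"off topic:    {jaccard_overlap(query, off_topic):.2f}")
```

The purely lexical score ranks the off-topic text above the paraphrase. An embedding model would be expected to reverse that order, although the actual scores depend on the model used.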