Semantic Search

Semantic search retrieves documents based on meaning rather than keyword matching. It represents queries and documents as dense vectors and finds the closest documents with efficient similarity search algorithms.

Sentence Embeddings

Sentence embeddings encode text into fixed-size dense vectors that capture semantic meaning.

from sentence_transformers import SentenceTransformer
import numpy as np

# Load pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode documents
documents = [
    "Machine learning is a subset of artificial intelligence.",
    "Deep learning uses neural networks with many layers.",
    "Natural language processing deals with text and speech.",
    "Computer vision analyzes images and videos.",
]

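# Generate embeddings (normalized to unit length so that dot product = cosine similarity)
embeddings = model.encode(documents, normalize_embeddings=True)
print(f"Embedding shape: {embeddings.shape}")  # (4, 384)

# Compute similarity
query = "What is AI?"
query_embedding = model.encode([query], normalize_embeddings=True)

# One score per document; flatten() turns the (4, 1) result into shape (4,)
similarities = np.dot(embeddings, query_embedding.T).flatten()
ranked_indices = np.argsort(similarities)[::-1]  # highest score first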

for idx in ranked_indices:
    print(f"Score: {similarities[idx]:.4f} | {documents[idx]}")

Cosine Similarity

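Definition (Cosine Similarity): the cosine of the angle between two vectors a and b, computed as cos(a, b) = (a · b) / (‖a‖ ‖b‖). It ranges from -1 (opposite directions) to 1 (same direction) and ignores vector magnitude.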

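def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Batch computation
def batch_cosine_similarity(query_emb, doc_embs):
    """Cosine similarity of every document against every query.

    Expects 2-D arrays: query_emb (n_queries, dim), doc_embs (n_docs, dim).
    Returns an array of shape (n_docs, n_queries).
    """
    # Normalize each row to unit length
    query_norm = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    doc_norm = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    return np.dot(doc_norm, query_norm.T)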
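
As a quick sanity check of the formula, the following sketch uses plain Python (no NumPy, so it runs anywhere) on made-up 3-dimensional vectors. It illustrates two properties: cosine similarity ignores vector length, and for unit-length vectors the dot product equals the cosine similarity. The helper names are illustrative and not part of any library.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def normalize(a):
    n = norm(a)
    return [x / n for x in a]

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a, twice the length
c = [3.0, 0.0, -1.0]  # orthogonal to a: 1*3 + 2*0 + 3*(-1) = 0

print(cosine(a, b))                      # approximately 1: same direction
print(cosine(a, c))                      # approximately 0: orthogonal
print(dot(normalize(a), normalize(b)))   # matches cosine(a, b)
```

This is why normalizing embeddings once at indexing time lets a plain dot product stand in for cosine similarity at query time.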

FAISS (Facebook AI Similarity Search)

FAISS provides efficient similarity search and clustering of dense vectors.

import faiss
import numpy as np

# Generate random embeddings for demonstration
dimension = 384
num_documents = 100000
np.random.seed(42)

documents = np.random.random((num_documents, dimension)).astype('float32')

# Build FAISS index
# Option 1: Flat index (exact search)
index_flat = faiss.IndexFlatL2(dimension)
index_flat.add(documents)

# Option 2: IVF index (approximate search)
nlist = 100  # Number of clusters
quantizer = faiss.IndexFlatL2(dimension)
index_ivf = faiss.IndexIVFFlat(quantizer, dimension, nlist)
index_ivf.train(documents)
index_ivf.add(documents)

# Option 3: HNSW index (graph-based)
index_hnsw = faiss.IndexHNSWFlat(dimension, 32)
index_hnsw.hnsw.efConstruction = 64
index_hnsw.add(documents)

# Search
query = np.random.random((1, dimension)).astype('float32')
k = 5  # Number of nearest neighbors

# Flat search (exact); IndexFlatL2 reports squared L2 distances
distances, indices = index_flat.search(query, k)
print(f"Flat results: {indices[0]}, distances: {distances[0]}")

# IVF search (approximate)
index_ivf.nprobe = 10  # Number of clusters to search
distances, indices = index_ivf.search(query, k)
print(f"IVF results: {indices[0]}, distances: {distances[0]}")

# HNSW search
distances, indices = index_hnsw.search(query, k)
print(f"HNSW results: {indices[0]}, distances: {distances[0]}")
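
To make clear what the flat index computes, here is a minimal pure-Python sketch of exact nearest-neighbor search. It mirrors IndexFlatL2 in one respect worth knowing: the reported distances are squared Euclidean distances, not their square roots. The function name brute_force_search and the toy vectors are made up for illustration.

```python
import heapq

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def brute_force_search(vectors, query, k):
    """Return (distances, indices) of the k vectors closest to query."""
    # Score every vector: this full scan is what makes exact search O(n).
    scored = ((squared_l2(v, query), i) for i, v in enumerate(vectors))
    top = heapq.nsmallest(k, scored)
    return [d for d, _ in top], [i for _, i in top]

vectors = [
    [0.0, 0.0],
    [1.0, 0.0],
    [0.0, 2.0],
    [3.0, 3.0],
]
query = [0.9, 0.1]

distances, indices = brute_force_search(vectors, query, k=2)
print(indices)    # [1, 0]
print(distances)  # roughly [0.02, 0.82]
```

The approximate indexes (IVF, HNSW) aim to return the same neighbors while skipping most of this scan.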

Index Comparison

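| Index Type | Build Time | Search Time | Memory | Accuracy |
|---|---|---|---|---|
| Flat (L2) | O(n), no training | O(n) | High | 100% |
| IVF | O(n) | O(n/k) | Medium | ~95% |
| HNSW | O(n log n) | O(log n) | High | ~98% |
| PQ | O(n) | O(n/k) | Low | ~90% |
| IVF-PQ | O(n) | O(n/k²) | Low | ~92% |

In the Search Time column, k is not the number of neighbors returned; it stands for the reduction factor gained from partitioning or compression (for IVF, roughly nlist / nprobe). All figures are rough rules of thumb: actual recall and speed depend on the data and on parameters such as nprobe and the HNSW search depth.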
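
The IVF row is easier to read with the mechanism in view: vectors are grouped into clusters once, and a query scans only the few clusters whose centroids are nearest. The following pure-Python sketch is a toy under simplifying assumptions: the centroids are given by hand instead of being learned with k-means as FAISS does during train(), and the class name ToyIVF is made up for illustration.

```python
def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


class ToyIVF:
    """Toy inverted-file index: vectors are bucketed by nearest centroid."""

    def __init__(self, centroids):
        self.centroids = centroids
        self.lists = [[] for _ in centroids]  # vector IDs per cluster
        self.vectors = []

    def _closest_centroids(self, v, n):
        order = sorted(range(len(self.centroids)),
                       key=lambda c: squared_l2(self.centroids[c], v))
        return order[:n]

    def add(self, v):
        # Assign the new vector's ID to the list of its nearest centroid.
        self.lists[self._closest_centroids(v, 1)[0]].append(len(self.vectors))
        self.vectors.append(v)

    def search(self, query, k, nprobe):
        # Scan only the nprobe clusters whose centroids are closest to the query.
        candidates = [i for c in self._closest_centroids(query, nprobe)
                      for i in self.lists[c]]
        ranked = sorted(candidates,
                        key=lambda i: squared_l2(self.vectors[i], query))
        return ranked[:k], len(candidates)


index = ToyIVF(centroids=[[0.0, 0.0], [10.0, 0.0]])
for v in [[1.0, 0.0], [4.9, 0.0], [9.0, 0.0], [10.0, 1.0]]:
    index.add(v)

query = [5.2, 0.0]
ids_1, scanned_1 = index.search(query, k=1, nprobe=1)
ids_2, scanned_2 = index.search(query, k=1, nprobe=2)
print(ids_1, scanned_1)  # [2] 2
print(ids_2, scanned_2)  # [1] 4
```

With nprobe=1 the query scans half the vectors and returns a neighbor from the wrong side of the cluster boundary; with nprobe=2 it scans everything and recovers the exact answer (vector 1). That is the speed/accuracy trade-off behind the approximate accuracy figures.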

FAISS Index Selection

def select_index(num_vectors, dimension, speed_priority=True):
    """Rule-of-thumb index choice by corpus size (dimension is currently unused)."""
    if num_vectors < 10000:
        return "IndexFlatL2"  # Exact search
    elif num_vectors < 1000000:
        if speed_priority:
            return "IndexHNSWFlat"  # Fast, good accuracy
        else:
            return "IndexIVFFlat"  # Balanced
    else:
        if speed_priority:
            return "IndexIVFPQ"  # Memory efficient
        else:
            return "IndexHNSWFlat"  # Good accuracy

# Example usage
index_type = select_index(500000, 768, speed_priority=True)
print(f"Recommended index: {index_type}")
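
The "memory efficient" label on IndexIVFPQ can be made concrete with a back-of-the-envelope estimate. The sketch below assumes float32 storage (4 bytes per dimension) for uncompressed vectors and a product quantizer that splits each vector into m sub-vectors encoded with 8 bits each. It ignores overhead such as centroids, codebooks, IDs, and HNSW graph links, so treat the numbers as lower bounds. The function names and the choice m = 48 are illustrative assumptions.

```python
def flat_bytes(num_vectors, dimension, bytes_per_float=4):
    """Raw storage for uncompressed float vectors."""
    return num_vectors * dimension * bytes_per_float

def pq_bytes(num_vectors, m, bits_per_code=8):
    """Storage for PQ codes: m codes per vector, bits_per_code bits each."""
    return num_vectors * m * bits_per_code // 8

n, d, m = 1_000_000, 384, 48  # m must divide d evenly

print(f"Flat: {flat_bytes(n, d) / 1e6:.0f} MB")  # Flat: 1536 MB
print(f"PQ:   {pq_bytes(n, m) / 1e6:.0f} MB")    # PQ:   48 MB
print(f"Compression: {flat_bytes(n, d) // pq_bytes(n, m)}x")  # Compression: 32x
```

The compression comes at the cost of approximate distances, which is why the PQ-based rows in the comparison show lower accuracy.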

Vector Databases

| Database | Type | Scaling | Features |
|---|---|---|---|
| FAISS | Library | Single machine | Fast, GPU support |
| Milvus | Distributed | Horizontal | Production-ready |
| Pinecone | Cloud | Managed | Easy to use |
| Weaviate | Self-hosted | Horizontal | GraphQL API |
| Qdrant | Self-hosted | Horizontal | Filtering |
| ChromaDB | Embedded | Single machine | Lightweight |

# Example with ChromaDB
import chromadb

client = chromadb.Client()
collection = client.create_collection("documents")

# Add documents
documents = [
    "Machine learning algorithms learn patterns from data.",
    "Deep neural networks have multiple hidden layers.",
    "Natural language processing understands human language.",
]
ids = ["doc1", "doc2", "doc3"]

collection.add(
    documents=documents,
    ids=ids,
    metadatas=[{"source": "textbook"} for _ in documents]
)

# Query
results = collection.query(
    query_texts=["What is deep learning?"],
    n_results=2
)
print(results["documents"])

Evaluation Metrics

| Metric | Description | Formula |
|---|---|---|
| Recall@k | Fraction of relevant docs retrieved in the top k | \|relevant ∩ top-k\| / \|relevant\| |
| MRR | Mean Reciprocal Rank | (1/Q) Σ 1/rank_i |
| NDCG | Normalized Discounted Cumulative Gain | DCG / IDCG |
| MAP | Mean Average Precision | (1/Q) Σ AP_i |
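
The first, second, and fourth formulas can be sketched in a few lines of plain Python. The document IDs and relevance judgments below are invented for illustration, and the function names are not from any library. Average precision is divided by the total number of relevant documents, which is the common convention; some implementations divide by the number of relevant documents actually retrieved instead.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def average_precision(ranked, relevant):
    """Mean of precision@rank over the ranks where a relevant doc appears."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

# Two hypothetical queries: (ranked result IDs, set of relevant IDs)
runs = [
    (["d3", "d1", "d7", "d2"], {"d1", "d2"}),
    (["d5", "d4", "d9", "d8"], {"d5"}),
]

mrr = sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
mean_ap = sum(average_precision(r, rel) for r, rel in runs) / len(runs)

print(f"Recall@2 (query 1): {recall_at_k(*runs[0], k=2):.2f}")  # 0.50
print(f"MRR: {mrr:.2f}")                                        # 0.75
print(f"MAP: {mean_ap:.2f}")                                    # 0.75
```

For query 1 the first relevant document sits at rank 2, so its reciprocal rank is 0.5; query 2 has its only relevant document at rank 1, which gives 1.0 and an MRR of 0.75.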

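Definition (NDCG@k): NDCG@k = DCG@k / IDCG@k, where DCG@k = Σ_{i=1..k} rel_i / log2(i + 1), rel_i is the graded relevance of the result at rank i, and IDCG@k is the DCG@k of the ideal ordering. For non-negative relevance grades the value lies between 0 and 1, with 1 meaning a perfect ranking.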
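
A minimal sketch of NDCG@k in plain Python, assuming the linear-gain form of DCG (rel_i / log2(i + 1)); another common variant uses 2^rel_i - 1 as the gain. As a simplification, the ideal ordering is computed by sorting the grades of the retrieved list itself, whereas a full evaluation would use every judged document for the query. The relevance grades are made up.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k graded relevance values."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))

def ndcg_at_k(gains, k):
    """DCG@k divided by the DCG@k of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0

# Graded relevance of the results in ranked order (3 = highly relevant, 0 = not)
gains = [3, 0, 2, 1]

print(f"DCG@3:  {dcg_at_k(gains, 3):.2f}")   # 3/1 + 0 + 2/2 = 4.00
print(f"NDCG@3: {ndcg_at_k(gains, 3):.2f}")
print(f"NDCG@4 of a perfect ranking: {ndcg_at_k([3, 2, 1, 0], 4):.2f}")  # 1.00
```

The irrelevant document at rank 2 pushes the relevant ones down the list, so NDCG@3 falls below 1 even though the top result is correct.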

Semantic Search Ranking

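Semantic search tends to outperform keyword search when the query and the relevant documents use different vocabulary for the same meaning. For example, "automobile" and "car" are semantically similar but lexically different, so a keyword matcher misses the connection while an embedding model places the two close together.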
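
The vocabulary-mismatch problem is easy to reproduce on the keyword side. The sketch below uses a deliberately naive lexical score (Jaccard overlap of lowercase tokens, chosen for illustration; real keyword engines typically use BM25 or similar) on made-up sentences. The synonym pair shares no tokens, so a purely lexical matcher scores it zero, whereas an embedding model would be expected to score it well above an unrelated sentence.

```python
def tokens(text):
    return set(text.lower().split())

def jaccard(a, b):
    """Token-set overlap: |A ∩ B| / |A ∪ B|."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

query = "cheap automobile insurance"
doc_synonym = "affordable car coverage"          # same meaning, no shared words
doc_lexical = "automobile museum opening hours"  # shared word, different topic

print(jaccard(query, doc_synonym))  # 0.0
print(jaccard(query, doc_lexical))  # 1 shared token out of 6, about 0.167
```

The lexical score ranks the off-topic museum page above the paraphrase, which is the failure mode that dense retrieval is meant to address.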
