Data Systems

Search Engines and Elasticsearch

Search is one of the most complex problems in system design. Master inverted indices, relevance scoring, analyzers, and the architecture behind Elasticsearch and Lucene.

Inverted Index — The data structure that makes full-text search possible
Relevance — BM25, TF-IDF, and custom scoring functions
Scalability — Sharding, replication, and distributed search

Search is not just finding—it's finding the most relevant result, fast.

How Search Engines Work

DfSearch Engine

A search engine is a system that stores data, indexes it for fast retrieval, and returns results ranked by relevance. Unlike database queries that match exact values, search engines handle fuzzy matching, partial words, synonyms, and relevance ranking. The core data structure is the inverted index.

The Inverted Index

DfInverted Index

An inverted index maps terms to the documents that contain them. For each term, it stores a posting list of document IDs and positions. This allows O(1) lookup of which documents contain a given term, enabling fast full-text search.

Relevance Scoring

DfTF-IDF

TF-IDF (Term Frequency-Inverse Document Frequency) scores a term's importance in a document relative to a collection. Terms that appear frequently in a document but rarely across the collection are most informative.

TF-IDF Score

TF-IDF(t, d, D) = TF(t, d) \times IDF(t, D)

Here,

$TF(t, d)$ =Term frequency of t in document d
$IDF(t, D)$ =Inverse document frequency of t in collection D
$TF$ =How often t appears in d (log-normalized)
$IDF$ =log(N/df_t) where N = total docs, df_t = docs containing t

TF-IDF Calculation

For a collection of 1,000,000 documents:

"the" appears in 900,000 documents → IDF = log(1M/900K) ≈ 0.046 (rarely informative)
"elasticsearch" appears in 5,000 documents → IDF = log(1M/5K) ≈ 5.30 (highly informative)

A document containing "elasticsearch" 5 times scores much higher than one containing "the" 5 times.

BM25 (Elasticsearch Default)

DfBM25

BM25 is an improvement over TF-IDF that handles term frequency saturation and document length normalization. It uses a saturation function that prevents a term from dominating the score just because it appears many times, and normalizes for document length.

Elasticsearch Architecture

Elasticsearch Key Concepts

Concept	Description
Index	Collection of documents (like a database table)
Shard	Partition of an index (horizontal scaling)
Replica	Copy of a shard (redundancy + read scaling)
Document	JSON object stored in an index
Mapping	Schema definition for an index
Analyzer	Tokenizes and normalizes text

Analyzers and Tokenization

DfAnalyzer

An analyzer processes text during indexing and search. It consists of a character filter, tokenizer, and token filter. The tokenizer splits text into terms (tokens), while filters modify terms (lowercase, remove stopwords, stem, synonym expansion).

Component	Purpose	Example
Character Filter	Pre-process text	Strip HTML tags
Tokenizer	Split text into terms	Standard tokenizer: "System Design" → ["System", "Design"]
Token Filter	Modify terms	Lowercase, stopword removal, stemming

Analyzer Pipeline

Input: "The Quick Brown Fox Jumps"

Standard Analyzer:

Character filter: (none)
Tokenizer: "The", "Quick", "Brown", "Fox", "Jumps"
Token filter: "quick", "brown", "fox", "jump" (lowercase + stem)

Result: ["quick", "brown", "fox", "jump"]

Search at Scale

Sharding Strategy

Shard Count

N_{shard} = \frac{S_{total}}{S_{shard}} \times R_{replication}

Here,

$N_{shard}$ =Total number of shards
$S_{total}$ =Total data size
$S_{shard}$ =Target shard size (typically 10-50GB)
$R_{replication}$ =Replication factor

Too many shards cause overhead; too few limit parallelism. A common mistake is creating too many small shards. Each shard has overhead (memory, file handles), so target 10-50GB per shard.

Query Execution

Scatter: Coordinating node sends query to all relevant shards
Gather: Each shard executes query locally, returns top results
Merge: Coordinating node merges results from all shards
Return: Final ranked results returned to client

Practice Exercises

Index Design: Design the Elasticsearch mapping for a product catalog with name, description, category, price, and tags. What analyzers would you use for each field?
Relevance Tuning: Given a search query "python programming book", how would you configure boosting to prioritize title matches over description matches?
Sharding Strategy: You have 500GB of log data to index in Elasticsearch. How many shards and replicas would you create? Justify your numbers.
Architecture Design: Design a search-as-you-type feature like Google autocomplete. What are the key components and how do you handle 10K QPS?

Key Takeaways:

Inverted indices map terms to documents for O(1) term lookup
BM25 scores relevance based on term frequency and document frequency
Analyzers process text through tokenization and filtering
Sharding distributes data across nodes; replicas provide redundancy
Elasticsearch uses scatter-gather for distributed search
Too many shards cause overhead; target 10-50GB per shard

What to Learn Next

-> NoSQL Deep Dive Document, key-value, column-family, and graph databases.

-> Kafka Deep Dive Event streaming, partitioning, and exactly-once semantics.

-> Redis Deep Dive Redis data structures, persistence, clustering, and use cases.

-> Search Autocomplete Design Building real-time search autocomplete systems.

-> Search Engine Design Designing a complete search engine like Google.

-> Caching Strategies Cache-aside, write-through, write-back, and cache invalidation.

Search Engines and Elasticsearch

Search Engines and Elasticsearch

How Search Engines Work

DfSearch Engine

The Inverted Index

DfInverted Index

Relevance Scoring

DfTF-IDF

TF-IDF Score

TF-IDF Calculation

BM25 (Elasticsearch Default)

DfBM25

Elasticsearch Architecture

Elasticsearch Key Concepts

Analyzers and Tokenization

DfAnalyzer

Analyzer Pipeline

Search at Scale

Sharding Strategy

Shard Count

Query Execution

Practice Exercises

What to Learn Next

Premium Content

Need Expert System Design Help?