Data Systems
Search Engines and Elasticsearch
Search is one of the most complex problems in system design. Master inverted indices, relevance scoring, analyzers, and the architecture behind Elasticsearch and Lucene.
- Inverted Index β The data structure that makes full-text search possible
- Relevance β BM25, TF-IDF, and custom scoring functions
- Scalability β Sharding, replication, and distributed search
Search is not just findingβit's finding the most relevant result, fast.
How Search Engines Work
DfSearch Engine
A search engine is a system that stores data, indexes it for fast retrieval, and returns results ranked by relevance. Unlike database queries that match exact values, search engines handle fuzzy matching, partial words, synonyms, and relevance ranking. The core data structure is the inverted index.
The Inverted Index
DfInverted Index
An inverted index maps terms to the documents that contain them. For each term, it stores a posting list of document IDs and positions. This allows O(1) lookup of which documents contain a given term, enabling fast full-text search.
Relevance Scoring
DfTF-IDF
TF-IDF (Term Frequency-Inverse Document Frequency) scores a term's importance in a document relative to a collection. Terms that appear frequently in a document but rarely across the collection are most informative.
TF-IDF Score
Here,
- =Term frequency of t in document d
- =Inverse document frequency of t in collection D
- =How often t appears in d (log-normalized)
- =log(N/df_t) where N = total docs, df_t = docs containing t
TF-IDF Calculation
For a collection of 1,000,000 documents:
- "the" appears in 900,000 documents β IDF = log(1M/900K) β 0.046 (rarely informative)
- "elasticsearch" appears in 5,000 documents β IDF = log(1M/5K) β 5.30 (highly informative)
A document containing "elasticsearch" 5 times scores much higher than one containing "the" 5 times.
BM25 (Elasticsearch Default)
DfBM25
BM25 is an improvement over TF-IDF that handles term frequency saturation and document length normalization. It uses a saturation function that prevents a term from dominating the score just because it appears many times, and normalizes for document length.
Elasticsearch Architecture
Elasticsearch Key Concepts
| Concept | Description |
|---|---|
| Index | Collection of documents (like a database table) |
| Shard | Partition of an index (horizontal scaling) |
| Replica | Copy of a shard (redundancy + read scaling) |
| Document | JSON object stored in an index |
| Mapping | Schema definition for an index |
| Analyzer | Tokenizes and normalizes text |
Analyzers and Tokenization
DfAnalyzer
An analyzer processes text during indexing and search. It consists of a character filter, tokenizer, and token filter. The tokenizer splits text into terms (tokens), while filters modify terms (lowercase, remove stopwords, stem, synonym expansion).
| Component | Purpose | Example |
|---|---|---|
| Character Filter | Pre-process text | Strip HTML tags |
| Tokenizer | Split text into terms | Standard tokenizer: "System Design" β ["System", "Design"] |
| Token Filter | Modify terms | Lowercase, stopword removal, stemming |
Analyzer Pipeline
Input: "The Quick Brown Fox Jumps"
Standard Analyzer:
- Character filter: (none)
- Tokenizer: "The", "Quick", "Brown", "Fox", "Jumps"
- Token filter: "quick", "brown", "fox", "jump" (lowercase + stem)
Result: ["quick", "brown", "fox", "jump"]
Search at Scale
Sharding Strategy
Shard Count
Here,
- =Total number of shards
- =Total data size
- =Target shard size (typically 10-50GB)
- =Replication factor
Too many shards cause overhead; too few limit parallelism. A common mistake is creating too many small shards. Each shard has overhead (memory, file handles), so target 10-50GB per shard.
Query Execution
- Scatter: Coordinating node sends query to all relevant shards
- Gather: Each shard executes query locally, returns top results
- Merge: Coordinating node merges results from all shards
- Return: Final ranked results returned to client
Practice Exercises
-
Index Design: Design the Elasticsearch mapping for a product catalog with name, description, category, price, and tags. What analyzers would you use for each field?
-
Relevance Tuning: Given a search query "python programming book", how would you configure boosting to prioritize title matches over description matches?
-
Sharding Strategy: You have 500GB of log data to index in Elasticsearch. How many shards and replicas would you create? Justify your numbers.
-
Architecture Design: Design a search-as-you-type feature like Google autocomplete. What are the key components and how do you handle 10K QPS?
Key Takeaways:
- Inverted indices map terms to documents for O(1) term lookup
- BM25 scores relevance based on term frequency and document frequency
- Analyzers process text through tokenization and filtering
- Sharding distributes data across nodes; replicas provide redundancy
- Elasticsearch uses scatter-gather for distributed search
- Too many shards cause overhead; target 10-50GB per shard
What to Learn Next
-> NoSQL Deep Dive Document, key-value, column-family, and graph databases.
-> Kafka Deep Dive Event streaming, partitioning, and exactly-once semantics.
-> Redis Deep Dive Redis data structures, persistence, clustering, and use cases.
-> Search Autocomplete Design Building real-time search autocomplete systems.
-> Search Engine Design Designing a complete search engine like Google.
-> Caching Strategies Cache-aside, write-through, write-back, and cache invalidation.