πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Search Engines and Elasticsearch

Data SystemsSearch Systems🟒 Free Lesson

Advertisement

Data Systems

Search Engines and Elasticsearch

Search is one of the most complex problems in system design. Master inverted indices, relevance scoring, analyzers, and the architecture behind Elasticsearch and Lucene.

  • Inverted Index β€” The data structure that makes full-text search possible
  • Relevance β€” BM25, TF-IDF, and custom scoring functions
  • Scalability β€” Sharding, replication, and distributed search

Search is not just findingβ€”it's finding the most relevant result, fast.

How Search Engines Work

DfSearch Engine

A search engine is a system that stores data, indexes it for fast retrieval, and returns results ranked by relevance. Unlike database queries that match exact values, search engines handle fuzzy matching, partial words, synonyms, and relevance ranking. The core data structure is the inverted index.

The Inverted Index

DfInverted Index

An inverted index maps terms to the documents that contain them. For each term, it stores a posting list of document IDs and positions. This allows O(1) lookup of which documents contain a given term, enabling fast full-text search.

DocumentsDoc1: "system design"Doc2: "design patterns"Doc3: "distributed systems"Inverted Index"system"β†’ [Doc1, Doc3]"design"β†’ [Doc1, Doc2]"distributed"β†’ [Doc3]"patterns"β†’ [Doc2]"scale"β†’ [Doc1, Doc2, Doc3]Query"design system"

Relevance Scoring

DfTF-IDF

TF-IDF (Term Frequency-Inverse Document Frequency) scores a term's importance in a document relative to a collection. Terms that appear frequently in a document but rarely across the collection are most informative.

TF-IDF Score

TFβˆ’IDF(t,d,D)=TF(t,d)Γ—IDF(t,D)TF-IDF(t, d, D) = TF(t, d) \times IDF(t, D)

Here,

  • TF(t,d)TF(t, d)=Term frequency of t in document d
  • IDF(t,D)IDF(t, D)=Inverse document frequency of t in collection D
  • TFTF=How often t appears in d (log-normalized)
  • IDFIDF=log(N/df_t) where N = total docs, df_t = docs containing t

TF-IDF Calculation

For a collection of 1,000,000 documents:

  • "the" appears in 900,000 documents β†’ IDF = log(1M/900K) β‰ˆ 0.046 (rarely informative)
  • "elasticsearch" appears in 5,000 documents β†’ IDF = log(1M/5K) β‰ˆ 5.30 (highly informative)

A document containing "elasticsearch" 5 times scores much higher than one containing "the" 5 times.

BM25 (Elasticsearch Default)

DfBM25

BM25 is an improvement over TF-IDF that handles term frequency saturation and document length normalization. It uses a saturation function that prevents a term from dominating the score just because it appears many times, and normalizes for document length.

Elasticsearch Architecture

ClientCoordinating NodeRoutes queries, merges resultsData Node 1Shard 0 (primary)Shard 1 (replica)Inverted indexData Node 2Shard 0 (replica)Shard 1 (primary)Inverted indexData Node 3Shard 0 (replica)Shard 1 (replica)Inverted indexMaster NodeCluster stateIndex managementShard allocation

Elasticsearch Key Concepts

ConceptDescription
IndexCollection of documents (like a database table)
ShardPartition of an index (horizontal scaling)
ReplicaCopy of a shard (redundancy + read scaling)
DocumentJSON object stored in an index
MappingSchema definition for an index
AnalyzerTokenizes and normalizes text

Analyzers and Tokenization

DfAnalyzer

An analyzer processes text during indexing and search. It consists of a character filter, tokenizer, and token filter. The tokenizer splits text into terms (tokens), while filters modify terms (lowercase, remove stopwords, stem, synonym expansion).

ComponentPurposeExample
Character FilterPre-process textStrip HTML tags
TokenizerSplit text into termsStandard tokenizer: "System Design" β†’ ["System", "Design"]
Token FilterModify termsLowercase, stopword removal, stemming

Analyzer Pipeline

Input: "The Quick Brown Fox Jumps"

Standard Analyzer:

  1. Character filter: (none)
  2. Tokenizer: "The", "Quick", "Brown", "Fox", "Jumps"
  3. Token filter: "quick", "brown", "fox", "jump" (lowercase + stem)

Result: ["quick", "brown", "fox", "jump"]

Search at Scale

Sharding Strategy

Shard Count

Nshard=StotalSshardΓ—RreplicationN_{shard} = \frac{S_{total}}{S_{shard}} \times R_{replication}

Here,

  • NshardN_{shard}=Total number of shards
  • StotalS_{total}=Total data size
  • SshardS_{shard}=Target shard size (typically 10-50GB)
  • RreplicationR_{replication}=Replication factor

Too many shards cause overhead; too few limit parallelism. A common mistake is creating too many small shards. Each shard has overhead (memory, file handles), so target 10-50GB per shard.

Query Execution

  1. Scatter: Coordinating node sends query to all relevant shards
  2. Gather: Each shard executes query locally, returns top results
  3. Merge: Coordinating node merges results from all shards
  4. Return: Final ranked results returned to client

Practice Exercises

  1. Index Design: Design the Elasticsearch mapping for a product catalog with name, description, category, price, and tags. What analyzers would you use for each field?

  2. Relevance Tuning: Given a search query "python programming book", how would you configure boosting to prioritize title matches over description matches?

  3. Sharding Strategy: You have 500GB of log data to index in Elasticsearch. How many shards and replicas would you create? Justify your numbers.

  4. Architecture Design: Design a search-as-you-type feature like Google autocomplete. What are the key components and how do you handle 10K QPS?

Key Takeaways:

  • Inverted indices map terms to documents for O(1) term lookup
  • BM25 scores relevance based on term frequency and document frequency
  • Analyzers process text through tokenization and filtering
  • Sharding distributes data across nodes; replicas provide redundancy
  • Elasticsearch uses scatter-gather for distributed search
  • Too many shards cause overhead; target 10-50GB per shard

What to Learn Next

-> NoSQL Deep Dive Document, key-value, column-family, and graph databases.

-> Kafka Deep Dive Event streaming, partitioning, and exactly-once semantics.

-> Redis Deep Dive Redis data structures, persistence, clustering, and use cases.

-> Search Autocomplete Design Building real-time search autocomplete systems.

-> Search Engine Design Designing a complete search engine like Google.

-> Caching Strategies Cache-aside, write-through, write-back, and cache invalidation.

⭐

Premium Content

Search Engines and Elasticsearch

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
πŸ’ΌInterview Prep
πŸ“œCertificates
🀝Community Access

Already a member? Log in

Need Expert System Design Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement