
Cassandra Deep Dive


Apache Cassandra is a distributed NoSQL database designed for massive write throughput. This lesson covers its masterless architecture, consistent hashing, tunable consistency, and query-first data modeling.

  • Masterless — Every node is equal; no single point of failure
  • Tunable — Choose consistency level per query
  • Linearly Scalable — Add nodes to increase capacity proportionally

Cassandra trades query flexibility (no JOINs, fixed access patterns) for availability and write performance.

Cassandra Architecture

Definition: Apache Cassandra

Apache Cassandra is a distributed, masterless, wide-column store designed for high write throughput and linear horizontal scalability. It uses consistent hashing for partitioning, configurable replication for fault tolerance, and tunable consistency levels per query. Every node in a Cassandra cluster is identical—there are no special roles.

[Figure: a six-node cluster spanning two data centers. Nodes A, B, and C are in Data Center 1 (DC1); nodes D, E, and F are in Data Center 2 (DC2).]

Consistent Hashing

Definition: Consistent Hashing

Consistent hashing maps both nodes and data to positions on a hash ring. Each data item is stored on the first node encountered when moving clockwise from the item's position on the ring. When a node is added or removed, only the keys in the ring segments adjacent to that node's positions move; all other keys stay where they are, which minimizes data movement.

Hash Ring Position

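$$\text{Position} = \text{hash}(\text{key}) \bmod 2^{127}$$

Here,

  • Position = Position on the hash ring
  • key = The partition key
  • $2^{127}$ = Range of the hash ring under the MD5-based RandomPartitioner. The default Murmur3Partitioner (since Cassandra 1.2) instead uses signed 64-bit tokens from $-2^{63}$ to $2^{63} - 1$.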

Cassandra uses virtual nodes (vnodes), where each physical node owns multiple positions (tokens) on the ring. This improves load distribution and makes adding or removing nodes more efficient. The vnode count is set by num_tokens; its default was 256 per physical node through Cassandra 3.x and is 16 from Cassandra 4.0 onward.
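The ring and vnode ideas above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Cassandra's implementation: the `HashRing` class, the `node#vnodeN` token naming, and the choice of 16 vnodes are all hypothetical, and MD5 is used only because the formula above uses it.

```python
import bisect
import hashlib


def ring_position(value: str) -> int:
    """Hash a string to a ring position in [0, 2**127) using MD5."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** 127)


class HashRing:
    """Toy consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, vnodes_per_node: int = 16):
        self.vnodes_per_node = vnodes_per_node
        self._tokens = []   # sorted ring positions
        self._owners = {}   # ring position -> physical node name

    def add_node(self, node: str) -> None:
        # Each physical node claims several positions on the ring.
        for i in range(self.vnodes_per_node):
            token = ring_position(f"{node}#vnode{i}")  # hypothetical naming
            self._owners[token] = node
            bisect.insort(self._tokens, token)

    def owner(self, key: str) -> str:
        """Return the node owning the first token clockwise from the key."""
        pos = ring_position(key)
        idx = bisect.bisect_left(self._tokens, pos) % len(self._tokens)
        return self._owners[self._tokens[idx]]


ring = HashRing()
for name in ("A", "B", "C"):
    ring.add_node(name)

keys = [f"user-{i}" for i in range(1000)]
before = {k: ring.owner(k) for k in keys}

ring.add_node("D")
after = {k: ring.owner(k) for k in keys}

# Keys whose owner changed when node D joined the ring.
moved = [k for k in keys if before[k] != after[k]]
```

In this sketch every key in `moved` ends up on the new node D, and the rest keep their original owner, which is the "minimal data movement" property described above.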

Query-First Data Modeling

Definition: Query-First Design

Query-first design is Cassandra's fundamental modeling approach: design tables to match your query patterns, not your data relationships. Unlike relational modeling (normalize first, optimize later), Cassandra requires you to know all queries upfront and create tables for each query pattern.

| Relational              | Cassandra                  |
|-------------------------|----------------------------|
| Normalize first         | Denormalize first          |
| JOINs at query time     | Pre-join via duplication   |
| Flexible queries        | Fixed query patterns       |
| Single table per entity | Multiple tables per entity |

Cassandra Query-First Modeling

Query 1: Get all posts by a user (sorted by time)
Query 2: Get all posts in a feed (sorted by time)
Query 3: Get all comments on a post

Table for Query 1: posts_by_user

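CREATE TABLE posts_by_user (
  user_id UUID,          -- partition key: the author
  created_at TIMESTAMP,  -- clustering column: sorts posts by time
  post_id UUID,          -- clustering column: keeps same-timestamp posts distinct
  content TEXT,
  PRIMARY KEY ((user_id), created_at, post_id)
) WITH CLUSTERING ORDER BY (created_at DESC, post_id DESC);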

Table for Query 2: feed_by_post

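CREATE TABLE feed_by_post (
  user_id UUID,          -- partition key: the owner of the feed (the reader)
  created_at TIMESTAMP,  -- clustering column: sorts the feed by time
  post_id UUID,          -- clustering column: keeps same-timestamp posts distinct
  content TEXT,
  PRIMARY KEY ((user_id), created_at, post_id)
) WITH CLUSTERING ORDER BY (created_at DESC, post_id DESC);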

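Each query has its own table, so the same post is written to several tables. Query 3 follows the same pattern with a comments table partitioned by post_id (not shown).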
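To make the duplication concrete, here is a minimal in-memory sketch in Python. It is not driver code and does not talk to Cassandra: each dict stands in for a table, and each dict key stands in for a partition key. The `publish_post` and `newest_first` helpers, the string IDs, and the idea that a post is copied into each follower's feed are assumptions made for illustration; the lesson does not say how the feed table is populated.

```python
from collections import defaultdict
from datetime import datetime, timezone
from uuid import uuid4

# One dict per table: partition key -> rows in that partition.
posts_by_user = defaultdict(list)
feed_by_post = defaultdict(list)


def publish_post(author_id, follower_ids, content, created_at):
    """Write the same post to every table that serves a query."""
    row = {"post_id": uuid4(), "created_at": created_at, "content": content}
    posts_by_user[author_id].append(row)       # serves Query 1
    for follower_id in follower_ids:           # serves Query 2 (assumed fan-out)
        feed_by_post[follower_id].append(row)
    return row["post_id"]


def newest_first(table, partition_key):
    """Read one partition, ordered like CLUSTERING ORDER BY (created_at DESC)."""
    rows = table.get(partition_key, [])
    return sorted(rows, key=lambda r: r["created_at"], reverse=True)


t1 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
t2 = datetime(2024, 1, 1, 13, 0, tzinfo=timezone.utc)
publish_post("alice", ["bob", "carol"], "first", t1)
publish_post("alice", ["bob"], "second", t2)

alice_posts = newest_first(posts_by_user, "alice")
bob_feed = newest_first(feed_by_post, "bob")
carol_feed = newest_first(feed_by_post, "carol")
```

Each read touches exactly one partition of one table, which is the point of the design: the cost of the extra writes buys reads that never need a JOIN.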

Consistency Levels

Definition: Tunable Consistency

Tunable consistency in Cassandra allows you to choose the consistency level per query. Higher consistency levels require more nodes to acknowledge reads/writes, providing stronger guarantees at the cost of higher latency and lower availability.

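| Level        | Description            | Replicas required            |
|--------------|------------------------|------------------------------|
| ONE          | Single replica responds | 1                           |
| QUORUM       | Majority of replicas   | floor(N/2) + 1               |
| ALL          | All replicas respond   | N                            |
| LOCAL_QUORUM | Majority in local DC   | floor(N_local/2) + 1         |
| EACH_QUORUM  | Majority in each DC    | floor(N_dc/2) + 1 per DC     |

N is the total replication factor across all data centers; N_local and N_dc are the replication factors of the local data center and of each individual data center.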
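The arithmetic in the table can be checked with a short Python sketch. The `required_acks` helper and the example topology (replication factor 3 in each of two data centers) are assumptions chosen for illustration; they are not part of any Cassandra API.

```python
def quorum(replicas: int) -> int:
    """Majority of a replica count: floor(replicas / 2) + 1."""
    return replicas // 2 + 1


def required_acks(level: str, rf_by_dc: dict, local_dc: str) -> int:
    """Replicas that must respond for a given consistency level."""
    total = sum(rf_by_dc.values())
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return quorum(total)
    if level == "ALL":
        return total
    if level == "LOCAL_QUORUM":
        return quorum(rf_by_dc[local_dc])
    if level == "EACH_QUORUM":
        # A majority is needed in every data center, so the counts add up.
        return sum(quorum(rf) for rf in rf_by_dc.values())
    raise ValueError(f"unsupported consistency level: {level}")


# Hypothetical topology: replication factor 3 in each of two data centers.
topology = {"DC1": 3, "DC2": 3}
levels = ("ONE", "QUORUM", "ALL", "LOCAL_QUORUM", "EACH_QUORUM")
acks = {level: required_acks(level, topology, local_dc="DC1") for level in levels}
```

With this topology, QUORUM needs 4 of the 6 replicas and therefore always crosses data centers, while LOCAL_QUORUM needs only 2 replicas in DC1.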

Consistency Level Math

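Reads are strongly consistent when

$$R + W > N$$

Here,

  • R = Number of replicas that must respond to a read
  • W = Number of replicas that must acknowledge a write
  • N = Replication factor

When the condition holds, every set of R replicas shares at least one replica with every set of W replicas, so a read always reaches a replica that holds the latest acknowledged write.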
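The overlap argument behind this condition can be verified by brute force for small clusters. The Python sketch below is only a check of the arithmetic; the function names are hypothetical, and replicas are modeled as plain integers.

```python
from itertools import combinations


def satisfies_condition(r: int, w: int, n: int) -> bool:
    """The rule of thumb: R + W > N."""
    return r + w > n


def always_overlaps(r: int, w: int, n: int) -> bool:
    """True if every possible read set shares a replica with every write set."""
    replicas = range(n)
    return all(
        bool(set(read_set) & set(write_set))
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


# Compare the formula with exhaustive enumeration for small replication factors.
results = {
    (r, w, n): (satisfies_condition(r, w, n), always_overlaps(r, w, n))
    for n in range(1, 6)
    for r in range(1, n + 1)
    for w in range(1, n + 1)
}
```

For every combination in `results` the two answers agree. For example, with N = 3, QUORUM reads and writes (R = W = 2) always overlap, while ONE and ONE (R = W = 1) can miss each other.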

Consistency Level Selection

Replication factor (RF) = 3 across 2 data centers (2 in DC1, 1 in DC2):

Use case: User session (high availability)
  • Write: ONE (fast writes to nearest node)
  • Read: ONE (fast reads from nearest node)
  • Trade-off: May read stale data

Use case: Financial transaction (strong consistency)
  • Write: QUORUM (2 of 3 replicas must acknowledge)
  • Read: QUORUM (2 of 3 replicas must respond)
  • Trade-off: Higher latency, lower availability

Use case: Global application (multi-DC)
  • Write: LOCAL_QUORUM (majority in local DC)
  • Read: LOCAL_QUORUM (majority in local DC)
  • Trade-off: Strong consistency within DC, eventual between DCs

Compaction

Definition: Compaction

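Compaction is the process of merging SSTables (sorted string tables) to reclaim space and improve read performance. Cassandra uses write-optimized storage (LSM trees): each write is appended to a commit log and applied to an in-memory memtable, and full memtables are flushed to disk as immutable SSTables. Because SSTables are never modified in place, updates and deletes accumulate as newer versions and tombstones; compaction merges SSTables, keeps the newest version of each row, and removes deleted or overwritten data.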
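The merge step can be illustrated with a simplified Python sketch. It assumes each SSTable is a dict mapping a key to a `(timestamp, value)` pair and uses `None` as a stand-in for a tombstone; both are modeling choices for this example, not Cassandra's on-disk format. It also drops tombstones immediately, whereas Cassandra retains them for a grace period (`gc_grace_seconds`) before purging.

```python
TOMBSTONE = None  # stand-in marker for a deleted value (assumption)


def compact(sstables):
    """Merge SSTables: the newest timestamp wins per key, tombstones are dropped."""
    latest = {}
    for table in sstables:
        for key, (timestamp, value) in table.items():
            if key not in latest or timestamp > latest[key][0]:
                latest[key] = (timestamp, value)
    # Emit keys in sorted order, skipping keys whose newest version is a delete.
    return {
        key: latest[key]
        for key in sorted(latest)
        if latest[key][1] is not TOMBSTONE
    }


older = {"a": (1, "v1"), "b": (1, "v1"), "c": (1, "v1")}
newer = {"a": (2, "v2"), "b": (3, TOMBSTONE)}

merged = compact([older, newer])
```

Here `a` keeps only its newer version, `b` disappears because its latest write is a delete, and `c` is carried over unchanged, so two SSTables holding five entries collapse into one holding two.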

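| Strategy    | Best For              | Trade-off                                                         |
|-------------|-----------------------|-------------------------------------------------------------------|
| Size-Tiered | Write-heavy workloads | More space during compaction                                      |
| Leveled     | Read-heavy workloads  | More I/O during compaction                                        |
| Time-Window | Time-series data      | Relies on TTL expiration; poor fit for updates or out-of-order writes |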

Practice Exercises

  1. Data Modeling: Design the Cassandra tables for a messaging app where users can: (a) get their message history, (b) get messages in a conversation, (c) search messages by sender.

  2. Consistency Design: For a ride-sharing app, design the consistency levels for: (a) updating driver location, (b) processing payments, (c) matching riders with drivers.

  3. Cluster Design: Design a Cassandra cluster for a global application with 100M daily active users across 3 regions. What replication factor, consistency level, and node count would you use?

  4. Operational Planning: Your Cassandra cluster is experiencing hot partitions. Identify the causes and design a solution.

Key Takeaways:

  • Cassandra uses masterless architecture with consistent hashing for distribution
  • Query-first data modeling requires designing tables for each query pattern
  • Tunable consistency levels balance availability and consistency per query
  • QUORUM writes + QUORUM reads with RF=3 give R + W = 4 > 3, so every read overlaps the latest acknowledged write
  • Compaction manages SSTables; choose strategy based on workload
  • Every node is equal—no single point of failure

What to Learn Next

-> NoSQL Deep Dive Document, key-value, column-family, and graph databases overview.

-> DynamoDB Deep Dive DynamoDB internals, partitioning, and global tables.

-> Time-Series Databases InfluxDB, TimescaleDB, and time-series data models.

-> Data Replication Sync vs async replication, leader election, and consistency.

-> Data Partitioning Sharding strategies, consistent hashing, and partition keys.

-> Choosing the Right Database Systematic framework for database selection.
