
Cassandra Deep Dive

Apache Cassandra is a distributed NoSQL database designed for high write throughput. This guide covers its masterless architecture, consistent hashing, tunable consistency, and query-first data modeling.

Masterless — Every node is equal; no single point of failure
Tunable — Choose consistency level per query
Linearly Scalable — Add nodes to increase capacity proportionally

Cassandra trades query flexibility (no JOINs, fixed access patterns) for availability and write performance.

Cassandra Architecture

Definition: Apache Cassandra

Apache Cassandra is a distributed, masterless, wide-column store designed for high write throughput and linear horizontal scalability. It uses consistent hashing for partitioning, configurable replication for fault tolerance, and tunable consistency levels per query. Every node in a Cassandra cluster is identical—there are no special roles.

Consistent Hashing

Definition: Consistent Hashing

Consistent hashing maps both nodes and data to positions on a hash ring. Each data item is stored on the first node encountered when moving clockwise on the ring. When nodes are added or removed, only nearby data items are redistributed, minimizing data movement.

Hash Ring Position

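$$Position = hash(key) \bmod 2^{127}$$

Here,

$Position$ = Position on the hash ring
$key$ = The partition key
$2^{127}$ = Size of the token range under the legacy RandomPartitioner (MD5-based). The default Murmur3Partitioner (since Cassandra 1.2) instead uses tokens from $-2^{63}$ to $2^{63}-1$.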
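
As a rough illustration of the ring lookup described above, the following Python sketch hashes keys with MD5, reduces them modulo 2^127, and walks clockwise to the first node. The `HashRing` class, the node names, and the key names are hypothetical, and each node gets a single token here. It is a simplified model under those assumptions, not Cassandra's partitioner code.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 127  # token range from the formula above


def token(value: str) -> int:
    """Map a string to a ring position: MD5 digest as an integer, mod 2^127."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


class HashRing:
    """Hypothetical ring with one token per node (no vnodes)."""

    def __init__(self, nodes):
        # Sorted list of (token, node) pairs.
        self._ring = sorted((token(n), n) for n in nodes)

    def add_node(self, node):
        bisect.insort(self._ring, (token(node), node))

    def owner(self, key: str) -> str:
        """Return the first node at or after the key's token, wrapping around."""
        tokens = [t for t, _ in self._ring]
        i = bisect.bisect_left(tokens, token(key))
        return self._ring[i % len(self._ring)][1]


keys = [f"user-{i}" for i in range(1000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.owner(k) for k in keys}

ring.add_node("node-d")
after = {k: ring.owner(k) for k in keys}

# Keys whose owner changed after the new node joined.
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved")
```

Every key in `moved` ends up on `node-d`: the new node takes over only the arc between its predecessor and its own token, and no key is reshuffled between the existing nodes.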

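Cassandra uses virtual nodes (vnodes), where each physical node owns multiple positions (tokens) on the ring. This improves load distribution and makes adding or removing nodes more efficient, because a joining node takes small token ranges from many peers instead of one large range from a single neighbor. The vnode count is set by num_tokens in cassandra.yaml; the default was 256 per node through Cassandra 3.x and is 16 from Cassandra 4.0 onward.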
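
To see why several ring positions per node even out the load, the sketch below counts how many keys each physical node owns for different token counts. The helper names, node names, and the idea of deriving vnode tokens by hashing `node#vnode-i` are assumptions made for illustration; real clusters assign tokens randomly or through a token allocation algorithm.

```python
import bisect
import hashlib
from collections import Counter


def token(value: str) -> int:
    """MD5-based ring position in [0, 2^127)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** 127)


def build_ring(nodes, num_tokens):
    """Give every physical node `num_tokens` positions on the ring."""
    return sorted(
        (token(f"{node}#vnode-{i}"), node)
        for node in nodes
        for i in range(num_tokens)
    )


def load_per_node(ring, keys):
    """Count how many keys each physical node owns."""
    tokens = [t for t, _ in ring]
    counts = Counter()
    for key in keys:
        # First ring position at or after the key's token, wrapping around.
        i = bisect.bisect_left(tokens, token(key)) % len(ring)
        counts[ring[i][1]] += 1
    return counts


nodes = ["node-a", "node-b", "node-c", "node-d"]
keys = [f"user-{i}" for i in range(10_000)]

for num_tokens in (1, 16, 256):
    counts = load_per_node(build_ring(nodes, num_tokens), keys)
    print(num_tokens, [counts[n] for n in nodes])
```

With one token per node the four counts usually differ widely; with more tokens per node they tend to move toward an even split of about 2,500 keys each.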

Query-First Data Modeling

Definition: Query-First Design

Query-first design is Cassandra's fundamental modeling approach: design tables to match your query patterns, not your data relationships. Unlike relational modeling (normalize first, optimize later), Cassandra requires you to know all queries upfront and create tables for each query pattern.

Relational	Cassandra
Normalize first	Denormalize first
JOINs at query time	Pre-join via duplication
Flexible queries	Fixed query patterns
Single table per entity	Multiple tables per entity

Cassandra Query-First Modeling

Query 1: Get all posts by a user (sorted by time)
Query 2: Get all posts in a feed (sorted by time)
Query 3: Get all comments on a post

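Table for Query 1: posts_by_user

CREATE TABLE posts_by_user (
  user_id UUID,          -- partition key: all of a user's posts live together
  created_at TIMESTAMP,  -- clustering column: sorts posts within the partition
  post_id UUID,          -- clustering column: keeps posts with equal timestamps distinct
  content TEXT,
  PRIMARY KEY ((user_id), created_at, post_id)
) WITH CLUSTERING ORDER BY (created_at DESC, post_id ASC);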

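Table for Query 2: posts_by_feed

CREATE TABLE posts_by_feed (
  feed_id UUID,          -- partition key (assumed feed identifier): one partition per feed
  created_at TIMESTAMP,  -- clustering column: sorts posts within the feed
  post_id UUID,          -- clustering column: keeps posts with equal timestamps distinct
  user_id UUID,          -- author, duplicated from posts_by_user
  content TEXT,
  PRIMARY KEY ((feed_id), created_at, post_id)
) WITH CLUSTERING ORDER BY (created_at DESC, post_id ASC);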

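Each query has its own table, so the same post is written to both posts_by_user and the feed table; Query 3 would get a third table partitioned by post_id. Data is duplicated across tables, and the application is responsible for keeping the copies in sync.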
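
The write path implied by this design can be sketched with a toy in-memory model: one logical post is written once per query table, and each read touches a single partition. The dictionaries, the `create_post` and `newest_first` helpers, and the `feed_id` key are hypothetical stand-ins for illustration, not driver or Cassandra APIs.

```python
import uuid
from collections import defaultdict
from datetime import datetime, timezone

# Each "table" maps a partition key to the rows stored in that partition.
posts_by_user = defaultdict(list)
posts_by_feed = defaultdict(list)  # hypothetical table keyed by feed_id


def create_post(user_id, feed_id, content, created_at):
    """Write the same post once per query pattern (denormalization)."""
    row = {
        "post_id": uuid.uuid4(),
        "user_id": user_id,
        "created_at": created_at,
        "content": content,
    }
    posts_by_user[user_id].append(row)
    posts_by_feed[feed_id].append(row)
    return row["post_id"]


def newest_first(table, partition_key):
    """Read one partition, ordered like CLUSTERING ORDER BY (created_at DESC)."""
    return sorted(table[partition_key], key=lambda r: r["created_at"], reverse=True)


alice = uuid.uuid4()
home_feed = uuid.uuid4()
create_post(alice, home_feed, "first", datetime(2024, 1, 1, tzinfo=timezone.utc))
create_post(alice, home_feed, "second", datetime(2024, 1, 2, tzinfo=timezone.utc))

print([r["content"] for r in newest_first(posts_by_user, alice)])
# ['second', 'first']
```

Both reads are single-partition lookups with no join; the cost is that every write has to update each table that serves a query.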

Consistency Levels

Definition: Tunable Consistency

Tunable consistency in Cassandra allows you to choose the consistency level per query. Higher consistency levels require more nodes to acknowledge reads/writes, providing stronger guarantees at the cost of higher latency and lower availability.

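Level	Description	Replicas Required
ONE	A single replica responds	1
QUORUM	Majority of all replicas	floor(N/2) + 1
ALL	All replicas respond	N
LOCAL_QUORUM	Majority of replicas in the local DC	floor(N_local/2) + 1
EACH_QUORUM	Majority of replicas in each DC	floor(N_dc/2) + 1 per DC

Here N is the replication factor, N_local is the number of replicas in the coordinator's data center, and N_dc is the number of replicas in a given data center.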

Consistency Level Math

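$$R + W > N$$

Here,

$R$ = Number of replicas that must respond to a read
$W$ = Number of replicas that must acknowledge a write
$N$ = Replication factor

When this inequality holds, every set of replicas read overlaps every set of replicas written in at least one replica, so a read sees the latest acknowledged write. When $R + W \le N$, a read may miss it and return stale data.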
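
The arithmetic behind this rule is small enough to check directly. The sketch below computes the quorum size for a replication factor and tests R + W > N for a few read/write combinations; the function names are illustrative, not part of any Cassandra API.

```python
def quorum(replicas: int) -> int:
    """Majority of `replicas`: floor(replicas / 2) + 1."""
    return replicas // 2 + 1


def reads_see_latest_write(r: int, w: int, n: int) -> bool:
    """True when every read set must overlap every write set (R + W > N)."""
    return r + w > n


RF = 3
levels = {"ONE": 1, "QUORUM": quorum(RF), "ALL": RF}

combos = [("ONE", "ONE"), ("QUORUM", "QUORUM"), ("ALL", "ONE"), ("ONE", "ALL")]
for write_cl, read_cl in combos:
    w, r = levels[write_cl], levels[read_cl]
    overlap = reads_see_latest_write(r, w, RF)
    print(f"write={write_cl} read={read_cl}: R+W={r + w}, overlap={overlap}")
```

With RF = 3, ONE/ONE gives 1 + 1 = 2, which does not exceed 3, while QUORUM/QUORUM gives 2 + 2 = 4 and does. Writing at ALL and reading at ONE (or the reverse) also satisfies the inequality, at the cost of failing whenever any single replica is down.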

Consistency Level Selection

Replication factor (RF) = 3 across 2 data centers (2 in DC1, 1 in DC2):

Use case: User session (high availability)
Write: ONE (fast writes to nearest node)
Read: ONE (fast reads from nearest node)
Trade-off: May read stale data

Use case: Financial transaction (strong consistency)
Write: QUORUM (2 of 3 replicas must acknowledge)
Read: QUORUM (2 of 3 replicas must respond)
Trade-off: Higher latency, lower availability

Use case: Global application (multi-DC)
Write: LOCAL_QUORUM (majority in local DC: 2 of 2 in DC1, 1 of 1 in DC2)
Read: LOCAL_QUORUM (majority in local DC)
Trade-off: Strong consistency within DC, eventual between DCs
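
For the two-DC layout in this example, the per-DC replica counts can be worked out as follows. This is a minimal sketch: the `replication` mapping mirrors the example above, and `required_acks` is a hypothetical helper rather than a driver function.

```python
def quorum(replicas: int) -> int:
    """Majority of `replicas`: floor(replicas / 2) + 1."""
    return replicas // 2 + 1


# Replicas per data center, matching the RF = 3 example above.
replication = {"DC1": 2, "DC2": 1}
total_rf = sum(replication.values())


def required_acks(level: str, local_dc: str) -> dict:
    """Replica acknowledgements needed, per DC, for a few consistency levels."""
    if level == "LOCAL_QUORUM":
        return {local_dc: quorum(replication[local_dc])}
    if level == "EACH_QUORUM":
        return {dc: quorum(rf) for dc, rf in replication.items()}
    if level == "QUORUM":
        return {"any": quorum(total_rf)}  # counted across all DCs
    if level == "ONE":
        return {"any": 1}
    raise ValueError(f"unsupported level: {level}")


for level in ("ONE", "QUORUM", "LOCAL_QUORUM", "EACH_QUORUM"):
    print(level, required_acks(level, local_dc="DC1"))
```

Note that with only two replicas in DC1, LOCAL_QUORUM there needs both of them, so a single node failure in DC1 blocks LOCAL_QUORUM requests coordinated in that DC. An odd replica count per DC, commonly 3, avoids this.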

Compaction

Definition: Compaction

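Compaction is the process of merging SSTables (sorted string tables) to reclaim space and improve read performance. Cassandra uses write-optimized storage (LSM trees): writes go to an in-memory memtable (and to the commit log for durability) and are later flushed to immutable SSTables on disk. Because SSTables are never modified in place, updates and deletes accumulate as newer versions and tombstones; compaction merges SSTables, keeps the newest version of each cell, and removes deleted or overwritten data.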

Strategy	Best For	Trade-off
Size-Tiered (STCS)	Write-heavy workloads	Needs extra free disk space during compaction
Leveled (LCS)	Read-heavy workloads	More I/O during compaction
Time-Window (TWCS)	Time-series data with TTLs	Works poorly with out-of-order writes or updates to old data
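
The core merge step shared by all of these strategies can be sketched in a few lines. The model below treats each SSTable as a dictionary of key to (write timestamp, value) and uses `None` as a tombstone; both are assumptions for illustration. It also drops tombstones immediately, whereas Cassandra retains them until gc_grace_seconds has elapsed.

```python
TOMBSTONE = None  # marker for a deleted key in this sketch


def compact(sstables):
    """Merge SSTables: keep the newest version of each key, drop deleted keys.

    Each SSTable is a dict mapping key -> (write_timestamp, value).
    """
    latest = {}
    for table in sstables:
        for key, (ts, value) in table.items():
            # Last write wins: the highest timestamp survives.
            if key not in latest or ts > latest[key][0]:
                latest[key] = (ts, value)
    # Emit keys in sorted order, as an SSTable would store them.
    return {k: latest[k] for k in sorted(latest) if latest[k][1] is not TOMBSTONE}


older = {"a": (1, "v1"), "b": (1, "v1"), "c": (1, "v1")}
newer = {"a": (2, "v2"), "b": (3, TOMBSTONE)}

print(compact([older, newer]))
# {'a': (2, 'v2'), 'c': (1, 'v1')}
```

Before compaction, a read for `a` has to consult both SSTables and reconcile timestamps; afterwards one SSTable holds the single surviving version, which is where the read-performance gain comes from.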

Practice Exercises

Data Modeling: Design the Cassandra tables for a messaging app where users can: (a) get their message history, (b) get messages in a conversation, (c) search messages by sender.
Consistency Design: For a ride-sharing app, design the consistency levels for: (a) updating driver location, (b) processing payments, (c) matching riders with drivers.
Cluster Design: Design a Cassandra cluster for a global application with 100M daily active users across 3 regions. What replication factor, consistency level, and node count would you use?
Operational Planning: Your Cassandra cluster is experiencing hot partitions. Identify the causes and design a solution.

Key Takeaways:

Cassandra uses masterless architecture with consistent hashing for distribution
Query-first data modeling requires designing tables for each query pattern
Tunable consistency levels balance availability and consistency per query
QUORUM writes plus QUORUM reads with RF=3 satisfy R + W > N (2 + 2 > 3), so reads see the latest acknowledged write
Compaction manages SSTables; choose strategy based on workload
Every node is equal—no single point of failure

What to Learn Next

-> NoSQL Deep Dive: Document, key-value, column-family, and graph databases overview.

-> DynamoDB Deep Dive: DynamoDB internals, partitioning, and global tables.

-> Time-Series Databases: InfluxDB, TimescaleDB, and time-series data models.

-> Data Replication: Sync vs async replication, leader election, and consistency.

-> Data Partitioning: Sharding strategies, consistent hashing, and partition keys.

-> Choosing the Right Database: Systematic framework for database selection.
