Data Systems

Time-Series Databases

Time-series data is one of the fastest growing data types. Master the specialized databases, data models, and optimization techniques for handling massive volumes of timestamped data.

Compression — Specialized algorithms reduce storage by 10-100x
Retention — Automatic data lifecycle management
Aggregation — Pre-computed rollups for fast queries

Time-series databases turn mountains of data into actionable insights.

What Is Time-Series Data?

DfTime-Series Data

Time-series data is a sequence of data points indexed by time. Each data point consists of a timestamp, a measurement (value), and optional metadata (tags/labels). Time-series data is append-heavy, rarely updated, and queried primarily by time ranges. Examples include metrics, IoT sensor readings, financial tick data, and application logs.

Time-Series Characteristics

Characteristic	Implication
Append-only	Data is written once, never modified
Time-ordered	Natural ordering by timestamp
High write volume	Millions of points per second
Time-bounded queries	Most queries filter by time range
Old data less valuable	Aggregation and downsampling needed
Real-time ingestion	Low-latency write path critical

Time-series databases exploit these characteristics for optimization. Unlike general-purpose databases, they can assume data is append-only and time-ordered, enabling aggressive compression and efficient time-range queries.

The Time-Series Data Model

Time-Series Compression

Time-series databases use specialized compression algorithms:

Compression Ratio

R_{compression} = \frac{S_{raw}}{S_{compressed}}

Here,

$R_{compression}$ =Compression ratio (higher is better)
$S_{raw}$ =Uncompressed data size
$S_{compressed}$ =Compressed data size

Algorithm	Technique	Typical Ratio
Delta encoding	Store differences between values	2-5x
Gorilla (Facebook)	XOR with previous value	10-20x
Dictionary encoding	Map repeated values to IDs	5-10x
Run-length encoding	Compress consecutive identical values	Variable

Gorilla Compression for Metrics

For CPU utilization data (0-100%), values are typically close to previous values:

Raw values: 65.2, 65.3, 65.1, 65.4, 65.2 Delta from previous: -, 0.1, -0.2, 0.3, -0.2

Deltas are small numbers that compress extremely well with XOR encoding. Typical compression ratio: 12x.

InfluxDB Architecture

DfInfluxDB

InfluxDB is an open-source time-series database designed for high-throughput ingestion and efficient querying of time-stamped data. It uses a custom storage engine (TSM) optimized for time-series workloads, with built-in downsampling, retention policies, and a powerful query language (Flux/InfluxQL).

Component	Purpose
TSM Engine	Time-structured merge tree for storage
Write Path	WAL + in-memory cache for fast writes
Query Engine	Flux query language for processing
Continuous Queries	Automatic downsampling and aggregation
Retention Policies	Automatic data expiration

TimescaleDB Architecture

DfTimescaleDB

TimescaleDB is a PostgreSQL extension for time-series data. It adds automatic time-based partitioning (hypertables), continuous aggregates, and time-series optimizations while maintaining full SQL compatibility. This allows you to use existing PostgreSQL tools and knowledge.

Hypertables

DfHypertable

A hypertable in TimescaleDB is a table that is automatically partitioned into chunks based on time. Each chunk is a separate PostgreSQL table, enabling efficient time-range queries and automatic data management. The partitioning is transparent to the user—queries work the same as on regular tables.

TimescaleDB Hypertable

-- Create a regular table
CREATE TABLE sensor_data (
  time TIMESTAMPTZ NOT NULL,
  sensor_id INT,
  temperature DOUBLE PRECISION,
  humidity DOUBLE PRECISION
);

-- Convert to hypertable (automatic time partitioning)
SELECT create_hypertable('sensor_data', 'time');

-- Query by time range (automatically prunes chunks)
SELECT time_bucket('1 hour', time), AVG(temperature)
FROM sensor_data
WHERE time > NOW() - INTERVAL '1 day'
GROUP BY time_bucket('1 hour', time);

Retention and Downsampling

DfDownsampling

Downsampling reduces data resolution by aggregating high-frequency data into lower-frequency summaries. For example, converting 1-second metrics to 1-minute averages reduces data volume by 60x while preserving the overall trend.

Downsampling Impact

V_{downsampled} = \frac{V_{original}}{N_{interval}}

Here,

$V_{downsampled}$ =Volume after downsampling
$V_{original}$ =Original data volume
$N_{interval}$ =Number of raw points per aggregated point

A typical retention strategy: keep raw data for 7 days, 1-minute aggregates for 30 days, 1-hour aggregates for 1 year, and daily aggregates indefinitely. This reduces storage by 99.9% while maintaining query performance for recent data.

Query Optimization Patterns

Pattern	Technique
Time-range pruning	Skip chunks outside query range
Pre-computed aggregates	Use materialized views for common queries
Continuous aggregates	Automatically update summaries on write
Partition elimination	Only scan relevant time partitions
Columnar storage	Efficient compression for analytics

Practice Exercises

Data Modeling: Design the schema for an IoT sensor data system that collects temperature, humidity, and pressure from 10,000 sensors every second. Include retention and downsampling strategies.
Compression Analysis: Given 10M data points per day with 8-byte values, estimate the storage needed with and without Gorilla compression (assume 12x ratio).
Query Design: Write a TimescaleDB query to find the top 10 sensors with the highest average temperature over the last 24 hours, bucketed by 5-minute intervals.
Architecture Decision: Compare InfluxDB and TimescaleDB for a monitoring platform. What are the trade-offs in terms of query language, storage efficiency, and operational complexity?

Key Takeaways:

Time-series data is append-only, time-ordered, and queried primarily by time range
Specialized compression (Gorilla, delta encoding) reduces storage 10-100x
Downsampling reduces data volume while preserving trends
InfluxDB provides a purpose-built time-series database
TimescaleDB extends PostgreSQL with time-series optimizations
Retention policies automate data lifecycle management

What to Learn Next

-> Cassandra Deep Dive Cassandra architecture, data modeling, and operational patterns.

-> Kafka Deep Dive Event streaming, partitioning, and exactly-once semantics.

-> Stream Processing Real-time data processing with Flink, Spark Streaming, and Kafka Streams.

-> Data Partitioning Sharding strategies, consistent hashing, and partition keys.

-> Choosing the Right Database Systematic framework for database selection.

-> DynamoDB Deep Dive DynamoDB internals, partitioning, and global tables.

Time-Series Databases

Time-Series Databases

What Is Time-Series Data?

DfTime-Series Data

Time-Series Characteristics

The Time-Series Data Model

Time-Series Compression

Compression Ratio

Gorilla Compression for Metrics

InfluxDB Architecture

DfInfluxDB

TimescaleDB Architecture

DfTimescaleDB

Hypertables

DfHypertable

TimescaleDB Hypertable

Retention and Downsampling

DfDownsampling

Downsampling Impact

Query Optimization Patterns

Practice Exercises

What to Learn Next

Premium Content

Need Expert System Design Help?