Data Systems
Time-Series Databases
Time-series data is one of the fastest growing data types. Master the specialized databases, data models, and optimization techniques for handling massive volumes of timestamped data.
- Compression β Specialized algorithms reduce storage by 10-100x
- Retention β Automatic data lifecycle management
- Aggregation β Pre-computed rollups for fast queries
Time-series databases turn mountains of data into actionable insights.
What Is Time-Series Data?
DfTime-Series Data
Time-series data is a sequence of data points indexed by time. Each data point consists of a timestamp, a measurement (value), and optional metadata (tags/labels). Time-series data is append-heavy, rarely updated, and queried primarily by time ranges. Examples include metrics, IoT sensor readings, financial tick data, and application logs.
Time-Series Characteristics
| Characteristic | Implication |
|---|---|
| Append-only | Data is written once, never modified |
| Time-ordered | Natural ordering by timestamp |
| High write volume | Millions of points per second |
| Time-bounded queries | Most queries filter by time range |
| Old data less valuable | Aggregation and downsampling needed |
| Real-time ingestion | Low-latency write path critical |
Time-series databases exploit these characteristics for optimization. Unlike general-purpose databases, they can assume data is append-only and time-ordered, enabling aggressive compression and efficient time-range queries.
The Time-Series Data Model
Time-Series Compression
Time-series databases use specialized compression algorithms:
Compression Ratio
Here,
- =Compression ratio (higher is better)
- =Uncompressed data size
- =Compressed data size
| Algorithm | Technique | Typical Ratio |
|---|---|---|
| Delta encoding | Store differences between values | 2-5x |
| Gorilla (Facebook) | XOR with previous value | 10-20x |
| Dictionary encoding | Map repeated values to IDs | 5-10x |
| Run-length encoding | Compress consecutive identical values | Variable |
Gorilla Compression for Metrics
For CPU utilization data (0-100%), values are typically close to previous values:
Raw values: 65.2, 65.3, 65.1, 65.4, 65.2 Delta from previous: -, 0.1, -0.2, 0.3, -0.2
Deltas are small numbers that compress extremely well with XOR encoding. Typical compression ratio: 12x.
InfluxDB Architecture
DfInfluxDB
InfluxDB is an open-source time-series database designed for high-throughput ingestion and efficient querying of time-stamped data. It uses a custom storage engine (TSM) optimized for time-series workloads, with built-in downsampling, retention policies, and a powerful query language (Flux/InfluxQL).
| Component | Purpose |
|---|---|
| TSM Engine | Time-structured merge tree for storage |
| Write Path | WAL + in-memory cache for fast writes |
| Query Engine | Flux query language for processing |
| Continuous Queries | Automatic downsampling and aggregation |
| Retention Policies | Automatic data expiration |
TimescaleDB Architecture
DfTimescaleDB
TimescaleDB is a PostgreSQL extension for time-series data. It adds automatic time-based partitioning (hypertables), continuous aggregates, and time-series optimizations while maintaining full SQL compatibility. This allows you to use existing PostgreSQL tools and knowledge.
Hypertables
DfHypertable
A hypertable in TimescaleDB is a table that is automatically partitioned into chunks based on time. Each chunk is a separate PostgreSQL table, enabling efficient time-range queries and automatic data management. The partitioning is transparent to the userβqueries work the same as on regular tables.
TimescaleDB Hypertable
-- Create a regular table
CREATE TABLE sensor_data (
time TIMESTAMPTZ NOT NULL,
sensor_id INT,
temperature DOUBLE PRECISION,
humidity DOUBLE PRECISION
);
-- Convert to hypertable (automatic time partitioning)
SELECT create_hypertable('sensor_data', 'time');
-- Query by time range (automatically prunes chunks)
SELECT time_bucket('1 hour', time), AVG(temperature)
FROM sensor_data
WHERE time > NOW() - INTERVAL '1 day'
GROUP BY time_bucket('1 hour', time);
Retention and Downsampling
DfDownsampling
Downsampling reduces data resolution by aggregating high-frequency data into lower-frequency summaries. For example, converting 1-second metrics to 1-minute averages reduces data volume by 60x while preserving the overall trend.
Downsampling Impact
Here,
- =Volume after downsampling
- =Original data volume
- =Number of raw points per aggregated point
A typical retention strategy: keep raw data for 7 days, 1-minute aggregates for 30 days, 1-hour aggregates for 1 year, and daily aggregates indefinitely. This reduces storage by 99.9% while maintaining query performance for recent data.
Query Optimization Patterns
| Pattern | Technique |
|---|---|
| Time-range pruning | Skip chunks outside query range |
| Pre-computed aggregates | Use materialized views for common queries |
| Continuous aggregates | Automatically update summaries on write |
| Partition elimination | Only scan relevant time partitions |
| Columnar storage | Efficient compression for analytics |
Practice Exercises
-
Data Modeling: Design the schema for an IoT sensor data system that collects temperature, humidity, and pressure from 10,000 sensors every second. Include retention and downsampling strategies.
-
Compression Analysis: Given 10M data points per day with 8-byte values, estimate the storage needed with and without Gorilla compression (assume 12x ratio).
-
Query Design: Write a TimescaleDB query to find the top 10 sensors with the highest average temperature over the last 24 hours, bucketed by 5-minute intervals.
-
Architecture Decision: Compare InfluxDB and TimescaleDB for a monitoring platform. What are the trade-offs in terms of query language, storage efficiency, and operational complexity?
Key Takeaways:
- Time-series data is append-only, time-ordered, and queried primarily by time range
- Specialized compression (Gorilla, delta encoding) reduces storage 10-100x
- Downsampling reduces data volume while preserving trends
- InfluxDB provides a purpose-built time-series database
- TimescaleDB extends PostgreSQL with time-series optimizations
- Retention policies automate data lifecycle management
What to Learn Next
-> Cassandra Deep Dive Cassandra architecture, data modeling, and operational patterns.
-> Kafka Deep Dive Event streaming, partitioning, and exactly-once semantics.
-> Stream Processing Real-time data processing with Flink, Spark Streaming, and Kafka Streams.
-> Data Partitioning Sharding strategies, consistent hashing, and partition keys.
-> Choosing the Right Database Systematic framework for database selection.
-> DynamoDB Deep Dive DynamoDB internals, partitioning, and global tables.