ClickHouse® is an analytical database designed for large datasets. That claim is worth unpacking, because most databases claim good performance until the data gets big enough to expose their architectural assumptions. ClickHouse's design choices are specifically oriented around what happens at scale: columnar storage, vectorized execution, sparse indexes, and pre-aggregation via materialized views. Each of those decisions compounds as data volume grows.
This post covers the schema patterns, query techniques, materialized view strategies, approximate aggregation functions, and data lifecycle patterns that make ClickHouse the right choice for large dataset workloads.
Why columnar storage changes the math at scale
Row-oriented databases store each row's fields together on disk. An aggregation query that touches 2 of 20 columns reads all 20 for every matching row. At 10 million rows that overhead is manageable. At 10 billion rows, it means reading roughly 10x more data from disk than the query actually needs.
ClickHouse stores each column separately. That same query reads only the 2 columns it needs, in compressed blocks. The ClickHouse engineering team has documented that their compression achieves 32x on real-world datasets (in their published benchmark using a 1B-row weather dataset with ZSTD), meaning each query reads 32x less raw data from disk before decompression and 10x less after compared to a row-oriented equivalent reading all columns.
The practical effect: queries that take minutes on a row-oriented database at 500M rows run in seconds on ClickHouse with the right sort key. Tinybird's own engineering documentation shows simple aggregation queries running in 20-30 seconds on traditional databases, and 250ms with ClickHouse pre-aggregation on a 500M-row dataset.
Designing sort keys for large datasets
The sort key determines which queries can skip data instead of scanning it. For large datasets, getting the sort key right is the most important schema decision. ClickHouse stores data in sorted order within each part and keeps a sparse primary index, so queries that filter by sort key columns skip the majority of granules entirely.
CREATE TABLE events
(
event_time DateTime64(3),
tenant_id LowCardinality(String),
event_type LowCardinality(String),
user_id String,
session_id String,
page_url String,
referrer LowCardinality(Nullable(String)),
duration_ms UInt32,
properties String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_time)
ORDER BY (tenant_id, event_type, toStartOfHour(event_time), user_id);
tenant_id first ensures multi-tenant workloads can skip all other tenants' granules entirely. event_type second means queries filtering to a specific event type (like page_view or purchase) skip irrelevant events. toStartOfHour(event_time) before user_id means time-range queries on event data read proportional amounts of data rather than scattering reads across the full dataset.
For a table with 100 billion rows, a query filtering on tenant_id + event_type + time range may touch less than 0.1% of total granules. That is what makes sub-second queries at scale possible.
Approximate aggregation for interactive queries
For counting 100 billion unique IDs, exact uniqExact() requires significant memory because it must track every distinct value seen. ClickHouse's uniq() uses HyperLogLog and returns an approximation with typical error under 2.3%, which is accurate enough for product analytics and costs a fraction of the memory:
-- Exact unique count: memory-intensive at scale
SELECT uniqExact(user_id) FROM events WHERE event_time >= today() - 30;
-- Approximate: sub-1% error, runs fast at any scale
SELECT uniq(user_id) FROM events WHERE event_time >= today() - 30;
-- For funnel analysis across billions of events
SELECT
uniq(user_id) AS visitors,
uniqIf(user_id, has(groupArray(event_type), 'signup')) AS signups,
uniqIf(user_id, has(groupArray(event_type), 'purchase')) AS purchasers
FROM events
WHERE event_time >= today() - 7;
Similarly, quantile() approximates percentiles using T-Digest, making p50/p95/p99 latency analysis feasible over tens of billions of rows without full sorts.
Materialized views for pre-aggregation
For analytics at scale, the most effective pattern for large datasets is to pre-aggregate on write. ClickHouse materialized views transform data as it arrives and write summaries to a separate table. Dashboard queries read the summary, not the raw table:
CREATE TABLE hourly_event_counts
(
hour DateTime,
tenant_id LowCardinality(String),
event_type LowCardinality(String),
user_count AggregateFunction(uniq, String),
event_count UInt64
)
ENGINE = AggregatingMergeTree
PARTITION BY toYYYYMM(hour)
ORDER BY (tenant_id, event_type, hour);
CREATE MATERIALIZED VIEW hourly_event_counts_mv TO hourly_event_counts AS
SELECT
toStartOfHour(event_time) AS hour,
tenant_id,
event_type,
uniqState(user_id) AS user_count,
count() AS event_count
FROM events
GROUP BY hour, tenant_id, event_type;
A dashboard query that previously scanned 100 billion raw events now reads from a table with roughly 8,760 rows per tenant per event type per year. The reduction in data read is proportional to the reduction in query time. Tinybird's engineering team has documented 20x-2000x data reduction from rollups at different granularities on a 500M-row dataset.
Secondary indexes for large sparse tables
ClickHouse's skip indexes allow granule-level skipping based on data statistics beyond the primary sort key:
-- Bloom filter index for high-cardinality string lookup
ALTER TABLE events ADD INDEX idx_user_id user_id TYPE bloom_filter GRANULARITY 1;
-- Min/max index for numeric range queries
ALTER TABLE events ADD INDEX idx_duration duration_ms TYPE minmax GRANULARITY 1;
-- Set index for low-cardinality string filtering
ALTER TABLE events ADD INDEX idx_referrer referrer TYPE set(100) GRANULARITY 1;
Skip indexes are most effective when the filter column correlates poorly with the sort key, meaning the sort key alone wouldn't skip those granules. For a table sorted by (tenant_id, event_type, hour, user_id), adding a bloom filter on session_id allows queries filtering by session to skip most granules without a full scan.
Load testing at large dataset scale
Before putting a large dataset workload in production, running load tests against your real-time data systems exposes the query patterns that degrade under concurrent access. The two failure modes to test for are: queries that are fast in isolation but slow under concurrency (memory pressure from parallel aggregations), and queries that scan too many granules because the sort key doesn't match the filter.
A useful diagnostic: system.query_log in ClickHouse records read_rows, read_bytes, and elapsed_ms for every query. Queries that read more than 10% of total table rows for a filtered result are candidates for sort key or index optimization.
-- Find queries reading excessive data
SELECT
query_kind,
left(query, 100) AS query_excerpt,
read_rows,
read_bytes,
elapsed_ms,
round(read_rows / (
SELECT count() FROM events
) * 100, 2) AS pct_table_scanned
FROM system.query_log
WHERE type = 'QueryFinish'
AND elapsed_ms > 1000
AND tables LIKE '%events%'
ORDER BY read_bytes DESC
LIMIT 20;
Partitioning strategy at large scale
Partitioning controls how ClickHouse organizes data on disk and which partitions a query must touch. At large dataset scale, partition design directly affects both query performance and maintenance operations.
Monthly partitioning by toYYYYMM(event_time) is the standard choice for time-series event data. It keeps each partition's data volume manageable (one month of events), allows dropping historical partitions efficiently for GDPR or retention purposes, and aligns with the natural time boundaries that most queries use.
For multi-tenant workloads where queries always filter by tenant_id, a composite partition key keeps tenant data co-located:
ENGINE = MergeTree
PARTITION BY (tenant_id, toYYYYMM(event_time))
ORDER BY (event_type, event_time, user_id);
This allows ALTER TABLE events DROP PARTITION ('tenant_123', 202501) to delete all data for a specific tenant in a specific month without a full table rewrite, which matters for GDPR erasure at scale.
Avoid over-partitioning. ClickHouse creates one directory per part and one directory per partition. A table with 100 partitions, each containing 100 active parts, creates 10,000 directories. Query planning overhead and metadata operations scale with part count. The ClickHouse documentation recommends keeping the total number of active parts below a few thousand for operational stability.
Data lifecycle and retention
Large datasets grow without bound unless retention is managed explicitly. ClickHouse's TTL rules handle tiered storage and deletion automatically:
ALTER TABLE events MODIFY TTL
event_time + INTERVAL 30 DAY TO VOLUME 'warm',
event_time + INTERVAL 1 YEAR TO VOLUME 'cold',
event_time + INTERVAL 3 YEAR DELETE;
Log analytics architectures for real-time systems use the same tiering pattern: hot data on NVMe SSD, warm data on standard SSD, cold data on object storage like S3. The raw events table shrinks to its hot tier size while the pre-aggregated rollup tables retain the full history compactly. Most dashboard queries against historical data touch only rollup tables.
Tinybird for large datasets
Tinybird is managed ClickHouse designed for teams who need large dataset performance without large dataset infrastructure overhead. The schema design, sort key optimization, materialized view management, and TTL policy execution described in this post are handled by Tinybird rather than by your engineering team.
For ingestion into large dataset workloads, the Events API accepts streaming writes that become queryable within seconds, and the Kafka connector handles high-throughput topics without requiring a separate consumer. Your table grows continuously, and ClickHouse's append-optimized MergeTree engine handles that growth without degrading query performance on the existing data.
On the query side, every SQL Pipe you define against a large Tinybird datasource becomes a parameterized HTTP endpoint. Outgrowing Postgres and moving OLAP workloads off Postgres is one of the most common patterns that brings teams to Tinybird: a Postgres instance that handles OLTP well starts slowing on analytical queries as the table passes 50 million rows, and the fix is to stream that data into Tinybird where the same queries run in milliseconds.
Resend, processing 100TB per month on Tinybird, measured 62ms p90 query latency in production without caching, starting from a Postgres instance that was timing out on simple aggregate queries. Tinybird is SOC 2 Type II certified. The pipeline from data source to queryable API endpoint takes hours to configure rather than weeks to build.
The design choices compound
Every ClickHouse decision described in this post, columnar storage, sort key alignment, materialized pre-aggregation, approximate functions, skip indexes, TTL tiering, works in isolation. But the performance gains are multiplicative. A query that benefits from a well-aligned sort key reads 1% of the granules. The same query reading only 3 out of 20 columns reads 15% of the data. The same query reading from a pre-aggregated rollup instead of the raw table reads 0.001% of the rows. Combined, sub-second query latency over hundreds of billions of rows is not an edge case. It is the expected outcome of applying the right schema from the start.
The inversion this creates is worth stating directly: with most databases, performance degrades as data volume grows, and the engineering response is workarounds. Pre-computed caches, replicas, row limits, sampling. With ClickHouse designed correctly, performance scales with hardware, and most queries stay fast without workarounds even as the table grows by an order of magnitude.
