You keep hearing about ClickHouse®. Your data team uses it. It shows up in engineering blog posts from companies like Cloudflare, Uber, and InMobi. You want to understand what it actually is and whether you should be using it.
This is the intro that assumes you know SQL and have used Postgres or MySQL, and explains ClickHouse from there.
What ClickHouse is
ClickHouse is "a high-performance, column-oriented SQL database management system (DBMS) for online analytical processing (OLAP)." It runs SQL. It stores data. It's free and open source. There's also a managed cloud version.
The things that make it different from Postgres:
- It stores data in columns, not rows. This is the fundamental change. Postgres stores each row together on disk. ClickHouse stores each column together. That physical layout change is responsible for most of the performance difference on analytical queries.
- It's optimized for reads over millions or billions of rows. Aggregations, GROUP BY, COUNT DISTINCT, time-series rollups, these run in milliseconds even on very large tables.
- It's not built for OLTP. Updates, deletes, and single-row lookups are not its strength. It's built for the query patterns typical in analytics, monitoring, and reporting.
Why columnar storage makes analytics fast
Here's the core idea.
In a row-oriented database, data is stored like this, each row consecutively on disk:
[id=1, name="Alice", plan="pro", revenue=150]
[id=2, name="Bob", plan="free", revenue=0 ]
[id=3, name="Carol", plan="pro", revenue=200]
When you run SELECT sum(revenue) FROM users WHERE plan = 'pro', Postgres reads all rows from disk, even though you only need two columns. The operating system reads data in fixed-size blocks (typically 4-8KB). Even if you only need plan and revenue, you're loading id and name too.
In a column-oriented database, storage looks like this, each column separately:
id: [1, 2, 3, ...]
name: ["Alice", "Bob", "Carol", ...]
plan: ["pro", "free", "pro", ...]
revenue: [150, 0, 200, ...]
The same SELECT sum(revenue) FROM users WHERE plan = 'pro' only needs to read the plan and revenue columns. The id and name columns are never touched. When a query only needs a few columns out of a large table, the database reads just those columns from disk, skipping all the rest.
On a table with 20 columns and 1 billion rows, a query that uses 3 columns reads roughly 15% of the data a row store would read. That's where the speed comes from.
Two more factors stack on top:
Vectorized execution. ClickHouse processes data in batches of values, "vectors", rather than row by row. Modern CPUs are optimized for this kind of batch arithmetic, and the result is very high throughput per core.
Compression. Columns store the same type of data sequentially. Similar values compress far better than mixed rows. Typical event data compresses 5-10x in ClickHouse, which means less disk I/O and faster scans.
When to use it vs. when to use Postgres
Use Postgres when:
- You need to update or delete individual rows frequently
- Your query patterns are mostly point lookups (fetch user by ID, fetch order by ID)
- You need strong ACID guarantees for financial or transactional data
- Your data volume is in the millions, not billions
Use ClickHouse when:
- You're running aggregations over time: total events, unique users, revenue by day
- You need dashboards or reports that feel live, not batch-computed
- Your event volume is growing fast (millions of rows per day or more)
- Postgres is visibly slow for analytics queries
The two are not mutually exclusive. Most production setups run Postgres for the application database and ClickHouse for the analytics layer. User accounts, orders, and transactions live in Postgres. Events, impressions, and metrics flow into ClickHouse. If you're coming from a managed warehouse and want to understand what you'd be giving up, ClickHouse vs BigQuery covers the tradeoffs in detail.
Your first ClickHouse table
ClickHouse tables look similar to SQL tables you've written before, but with some additions:
CREATE TABLE page_views
(
timestamp DateTime,
user_id String,
page_path LowCardinality(String),
referrer String,
duration_ms UInt32,
country LowCardinality(String)
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(timestamp)
ORDER BY (user_id, timestamp);
Three concepts here that don't exist in Postgres:
ENGINE = MergeTree: Every ClickHouse table has an engine that determines how data is stored, merged, and indexed. MergeTree is the standard choice. It supports sampling for approximate query processing, data replication for high availability, data partitioning for efficient data management, and primary key indexing for fast lookups.
PARTITION BY toYYYYMM(timestamp): Partitioning splits the table into separate physical directories by month. A query with WHERE timestamp >= '2026-05-01' only scans May's partition, ClickHouse completely skips prior months. For time-series data, this is a massive query speed improvement.
ORDER BY (user_id, timestamp): The sort key determines how data is physically ordered on disk within each partition. Queries that filter on user_id can skip large blocks of data that don't match. Put the columns you filter most often first in the sort key.
LowCardinality(String): This is a wrapper that dictionary-encodes columns with a small number of distinct values. page_path on a typical site has a few thousand distinct values, LowCardinality makes filtering and GROUP BY on those columns significantly faster, with better compression.
Writing data: three approaches
HTTP ingestion
ClickHouse exposes an HTTP interface. Insert JSON directly:
echo '{"timestamp":"2026-05-25T10:00:00","user_id":"usr_abc","page_path":"/pricing","duration_ms":1200,"country":"US"}' \
| curl -X POST 'http://localhost:8123/?query=INSERT+INTO+page_views+FORMAT+JSONEachRow' \
--data-binary @-
ClickHouse client library
The official @clickhouse/client package handles connection pooling and batching:
import { createClient } from "@clickhouse/client";
const client = createClient({ host: "http://localhost:8123" });
await client.insert({
table: "page_views",
values: [{ timestamp: new Date(), user_id: "usr_abc", page_path: "/pricing", duration_ms: 1200, country: "US" }],
format: "JSONEachRow",
});
Kafka (for high-volume streams)
At scale, the standard architecture is Kafka as the message bus, with ClickHouse consuming directly via its Kafka table engine. The standard three-part pattern (Kafka engine table, MergeTree destination, materialized view) is covered in Kafka to ClickHouse. Once the view exists, ClickHouse consumes automatically and persists rows with no consumer code needed in your application.
Your first queries
ClickHouse runs standard SQL. Basic queries work exactly as you'd expect:
-- Total page views and unique users, last 7 days
SELECT
toDate(timestamp) AS day,
count() AS views,
uniq(user_id) AS unique_users
FROM page_views
WHERE timestamp >= now() - INTERVAL 7 DAY
GROUP BY day
ORDER BY day;
uniq(user_id) uses a HyperLogLog approximation (~2% error). For exact distinct counts, use uniqExact(). For even faster approximation at scale, uniqHLL12() uses a smaller sketch with slightly less accuracy. This trade-off comes up a lot in ClickHouse: approximate functions are often 10-100x faster and the error is acceptable for most analytics use cases.
-- Top pages by bounce rate (single-page sessions)
SELECT
page_path,
count() AS total_sessions,
countIf(duration_ms < 5000) AS bounces,
round(countIf(duration_ms < 5000) /
nullIf(count(), 0) * 100, 1) AS bounce_rate_pct
FROM page_views
WHERE timestamp >= now() - INTERVAL 30 DAY
GROUP BY page_path
ORDER BY bounce_rate_pct DESC
LIMIT 20;
countIf(condition) is more efficient than COUNT(CASE WHEN ...). One pass over the data, two conditional counts.
Performance tips that matter early
Use PREWHERE for selective filters. PREWHERE is ClickHouse-specific. It reads the filter column first, identifies matching rows, then reads remaining columns only for those rows, much faster on large tables with selective conditions:
-- PREWHERE scans country column first, then reads only matching rows
SELECT page_path, count() FROM page_views
PREWHERE country = 'US'
WHERE timestamp >= now() - INTERVAL 7 DAY
GROUP BY page_path;
Filter on the sort key first. If your table is sorted by (user_id, timestamp), queries with WHERE user_id = 'x' can skip most of the table. Queries with only WHERE timestamp > ... can't skip as efficiently.
Avoid functions on filter columns. WHERE toDate(timestamp) = '2026-05-25' wraps the indexed column in a function, disabling the index. Write it as WHERE timestamp >= '2026-05-25' AND timestamp < '2026-05-26' instead.
*Don't SELECT . Always name the columns you need. Selecting unused columns reads data you'll throw away. On tables with many wide columns, this is a significant performance difference.
Pre-aggregation with materialized views
For very high-volume tables, pre-aggregating data at insert time is how you make dashboard queries stay fast as data grows. ClickHouse materialized views act as insert triggers, when rows arrive, the view transforms and aggregates them into a separate summary table automatically.
A daily summary that always reflects the latest data:
-- Summary table
CREATE TABLE page_views_daily
(
day Date,
page_path LowCardinality(String),
country LowCardinality(String),
views UInt64,
unique_users AggregateFunction(uniq, String)
)
ENGINE = AggregatingMergeTree
ORDER BY (day, page_path, country);
-- Materialized view: populates the summary on every insert
CREATE MATERIALIZED VIEW page_views_daily_mv
TO page_views_daily AS
SELECT
toDate(timestamp) AS day,
page_path,
country,
count() AS views,
uniqState(user_id) AS unique_users
FROM page_views
GROUP BY day, page_path, country;
Dashboard queries hit the daily summary instead of raw events. A query for a 90-day date range reads 90 rows per page/country combination, not millions of raw events.
ClickHouse without running a cluster
Installing and running ClickHouse locally for development is straightforward. Running it in production, managing replication, failover, scaling, backups, and schema migrations across nodes, is a different story.
Tinybird is managed ClickHouse with three additions that matter for most teams:
- Events API: HTTP endpoint for ingestion. POST JSON directly without needing the ClickHouse client library or managing batching. Handles 1K+ req/sec by default.
- SQL editor and Pipes: Write SQL in the browser, parameterize it, and publish as an HTTP endpoint. Your frontend calls
/campaign_metrics.json?days=7, no backend route needed. - No cluster management: Tinybird handles replication, scaling, backups, and ClickHouse version upgrades.
The page_views example above as a Tinybird workflow:
# Install CLI
pip install tinybird-cli
# Define the datasource (schema)
cat > page_views.datasource << 'EOF'
SCHEMA >
timestamp DateTime `json:$.timestamp`,
user_id String `json:$.user_id`,
page_path LowCardinality(String) `json:$.page_path`,
country LowCardinality(String) `json:$.country`,
duration_ms UInt32 `json:$.duration_ms`
ENGINE "MergeTree"
ENGINE_PARTITION_KEY "toYYYYMM(timestamp)"
ENGINE_SORTING_KEY "user_id, timestamp"
EOF
# Deploy the schema
tb deploy
# Send an event
curl -X POST "https://api.tinybird.co/v0/events?name=page_views" \
-H "Authorization: Bearer $TB_TOKEN" \
-d '{"timestamp":"2026-05-25T10:00:00","user_id":"u1","page_path":"/pricing","country":"US","duration_ms":1200}'
Then define the query as a Pipe and publish it as an endpoint, same ClickHouse SQL, no extra infrastructure.
What to try first
If you want to explore ClickHouse without running anything locally:
- ClickHouse Cloud free tier, managed ClickHouse with a SQL editor, $300 of trial credits.
- Tinybird free tier, managed ClickHouse with ingestion API and endpoint publishing. Free up to 1M rows/month.
- ClickHouse Playground, query public datasets with no signup, useful for getting comfortable with the SQL syntax.
The learning curve from Postgres is low if you already know SQL. The main adjustments are schema design (sort keys, partition keys, table engines) and learning when approximation functions (uniq, quantile) are the right trade-off. The ClickHouse course covers these concepts with progressively harder exercises if you want structured practice. Most engineers are productive within a day.
