---
title: "ClickHouse for beginners"
excerpt: "ClickHouse is a column-oriented database built for analytical queries. This is the introduction you'd want if you came from Postgres or MySQL: what it is, why it's fast, and how to start using it without running your own cluster."
authors: "Tinybird"
categories: "AI Resources"
createdOn: "2026-05-25 00:00:00"
publishedOn: "2026-05-25 00:00:00"
updatedOn: "2026-05-25 00:00:00"
status: "published"
---

You keep hearing about ClickHouse®. Your data team uses it. It shows up in engineering blog posts from companies like Cloudflare, Uber, and InMobi. You want to understand what it actually is and whether you should be using it.

This is the intro that assumes you know SQL and have used Postgres or MySQL, and explains ClickHouse from there.

## What ClickHouse is

ClickHouse is "a high-performance, column-oriented SQL database management system (DBMS) for online analytical processing (OLAP)." It runs SQL. It stores data. It's free and open source. There's also a managed cloud version.

The things that make it different from Postgres:

1. **It stores data in columns, not rows.** This is the fundamental change. Postgres stores each row together on disk. ClickHouse stores each column together. That physical layout change is responsible for most of the performance difference on analytical queries.
2. **It's optimized for reads over millions or billions of rows.** Aggregations, GROUP BY, COUNT DISTINCT, time-series rollups, these run in milliseconds even on very large tables.
3. **It's not built for OLTP.** Updates, deletes, and single-row lookups are not its strength. It's built for the query patterns typical in analytics, monitoring, and reporting.

## Why columnar storage makes analytics fast

Here's the core idea.

In a row-oriented database, data is stored like this, each row consecutively on disk:

```text
[id=1, name="Alice", plan="pro",    revenue=150]
[id=2, name="Bob",   plan="free",   revenue=0  ]
[id=3, name="Carol", plan="pro",    revenue=200]
```

When you run `SELECT sum(revenue) FROM users WHERE plan = 'pro'`, Postgres reads all rows from disk, even though you only need two columns. The operating system reads data in fixed-size blocks (typically 4-8KB). Even if you only need `plan` and `revenue`, you're loading `id` and `name` too.

In a column-oriented database, storage looks like this, each column separately:

```text
id:      [1, 2, 3, ...]
name:    ["Alice", "Bob", "Carol", ...]
plan:    ["pro", "free", "pro", ...]
revenue: [150, 0, 200, ...]
```

The same `SELECT sum(revenue) FROM users WHERE plan = 'pro'` only needs to read the `plan` and `revenue` columns. The `id` and `name` columns are never touched. When a query only needs a few columns out of a large table, the database reads just those columns from disk, skipping all the rest.

On a table with 20 columns and 1 billion rows, a query that uses 3 columns reads roughly 15% of the data a row store would read. That's where the speed comes from.

Two more factors stack on top:

**Vectorized execution.** ClickHouse processes data in batches of values, "vectors", rather than row by row. Modern CPUs are optimized for this kind of batch arithmetic, and the result is very high throughput per core.

**Compression.** Columns store the same type of data sequentially. Similar values compress far better than mixed rows. Typical event data compresses 5-10x in ClickHouse, which means less disk I/O and faster scans.

## When to use it vs. when to use Postgres

**Use Postgres when:**

- You need to update or delete individual rows frequently
- Your query patterns are mostly point lookups (fetch user by ID, fetch order by ID)
- You need strong ACID guarantees for financial or transactional data
- Your data volume is in the millions, not billions

**Use ClickHouse when:**

- You're running aggregations over time: total events, unique users, revenue by day
- You need dashboards or reports that feel live, not batch-computed
- Your event volume is growing fast (millions of rows per day or more)
- Postgres is visibly slow for analytics queries

The two are not mutually exclusive. Most production setups run Postgres for the application database and ClickHouse for the analytics layer. User accounts, orders, and transactions live in Postgres. Events, impressions, and metrics flow into ClickHouse. If you're coming from a managed warehouse and want to understand what you'd be giving up, [ClickHouse vs BigQuery](https://www.tinybird.co/blog/clickhouse-vs-bigquery-real-time-analytics) covers the tradeoffs in detail.

## Your first ClickHouse table

ClickHouse tables look similar to SQL tables you've written before, but with some additions:

```sql
CREATE TABLE page_views
(
    timestamp   DateTime,
    user_id     String,
    page_path   LowCardinality(String),
    referrer    String,
    duration_ms UInt32,
    country     LowCardinality(String)
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(timestamp)
ORDER BY (user_id, timestamp);
```

Three concepts here that don't exist in Postgres:

**`ENGINE = MergeTree`**: Every ClickHouse table has an engine that determines how data is stored, merged, and indexed. `MergeTree` is the standard choice. It supports sampling for approximate query processing, data replication for high availability, data partitioning for efficient data management, and primary key indexing for fast lookups.

**`PARTITION BY toYYYYMM(timestamp)`**: Partitioning splits the table into separate physical directories by month. A query with `WHERE timestamp >= '2026-05-01'` only scans May's partition, ClickHouse completely skips prior months. For time-series data, this is a massive query speed improvement.

**`ORDER BY (user_id, timestamp)`**: The sort key determines how data is physically ordered on disk within each partition. Queries that filter on `user_id` can skip large blocks of data that don't match. Put the columns you filter most often first in the sort key.

**`LowCardinality(String)`**: This is a wrapper that dictionary-encodes columns with a small number of distinct values. `page_path` on a typical site has a few thousand distinct values, LowCardinality makes filtering and GROUP BY on those columns significantly faster, with better compression.

## Writing data: three approaches

### HTTP ingestion

ClickHouse exposes an HTTP interface. Insert JSON directly:

```bash
echo '{"timestamp":"2026-05-25T10:00:00","user_id":"usr_abc","page_path":"/pricing","duration_ms":1200,"country":"US"}' \
  | curl -X POST 'http://localhost:8123/?query=INSERT+INTO+page_views+FORMAT+JSONEachRow' \
    --data-binary @-
```

### ClickHouse client library

The official `@clickhouse/client` package handles connection pooling and batching:

```typescript
import { createClient } from "@clickhouse/client";

const client = createClient({ host: "http://localhost:8123" });

await client.insert({
  table:  "page_views",
  values: [{ timestamp: new Date(), user_id: "usr_abc", page_path: "/pricing", duration_ms: 1200, country: "US" }],
  format: "JSONEachRow",
});
```

### Kafka (for high-volume streams)

At scale, the standard architecture is Kafka as the message bus, with ClickHouse consuming directly via its Kafka table engine. The standard three-part pattern (Kafka engine table, MergeTree destination, materialized view) is covered in [Kafka to ClickHouse](https://www.tinybird.co/blog/kafka-to-clickhouse-example). Once the view exists, ClickHouse consumes automatically and persists rows with no consumer code needed in your application.

## Your first queries

ClickHouse runs standard SQL. Basic queries work exactly as you'd expect:

```sql
-- Total page views and unique users, last 7 days
SELECT
    toDate(timestamp)   AS day,
    count()             AS views,
    uniq(user_id)       AS unique_users
FROM page_views
WHERE timestamp >= now() - INTERVAL 7 DAY
GROUP BY day
ORDER BY day;
```

`uniq(user_id)` uses a HyperLogLog approximation (~2% error). For exact distinct counts, use `uniqExact()`. For even faster approximation at scale, `uniqHLL12()` uses a smaller sketch with slightly less accuracy. This trade-off comes up a lot in ClickHouse: approximate functions are often 10-100x faster and the error is acceptable for most analytics use cases.

```sql
-- Top pages by bounce rate (single-page sessions)
SELECT
    page_path,
    count()                                             AS total_sessions,
    countIf(duration_ms < 5000)                         AS bounces,
    round(countIf(duration_ms < 5000) /
          nullIf(count(), 0) * 100, 1)                  AS bounce_rate_pct
FROM page_views
WHERE timestamp >= now() - INTERVAL 30 DAY
GROUP BY page_path
ORDER BY bounce_rate_pct DESC
LIMIT 20;
```

`countIf(condition)` is more efficient than `COUNT(CASE WHEN ...)`. One pass over the data, two conditional counts.

## Performance tips that matter early

**Use PREWHERE for selective filters.** `PREWHERE` is ClickHouse-specific. It reads the filter column first, identifies matching rows, then reads remaining columns only for those rows, much faster on large tables with selective conditions:

```sql
-- PREWHERE scans country column first, then reads only matching rows
SELECT page_path, count() FROM page_views
PREWHERE country = 'US'
WHERE timestamp >= now() - INTERVAL 7 DAY
GROUP BY page_path;
```

**Filter on the sort key first.** If your table is sorted by `(user_id, timestamp)`, queries with `WHERE user_id = 'x'` can skip most of the table. Queries with only `WHERE timestamp > ...` can't skip as efficiently.

**Avoid functions on filter columns.** `WHERE toDate(timestamp) = '2026-05-25'` wraps the indexed column in a function, disabling the index. Write it as `WHERE timestamp >= '2026-05-25' AND timestamp < '2026-05-26'` instead.

**Don't SELECT *.** Always name the columns you need. Selecting unused columns reads data you'll throw away. On tables with many wide columns, this is a significant performance difference.

## Pre-aggregation with materialized views

For very high-volume tables, pre-aggregating data at insert time is how you make dashboard queries stay fast as data grows. ClickHouse materialized views act as insert triggers, when rows arrive, the view transforms and aggregates them into a separate summary table automatically.

A daily summary that always reflects the latest data:

```sql
-- Summary table
CREATE TABLE page_views_daily
(
    day          Date,
    page_path    LowCardinality(String),
    country      LowCardinality(String),
    views        UInt64,
    unique_users AggregateFunction(uniq, String)
)
ENGINE = AggregatingMergeTree
ORDER BY (day, page_path, country);

-- Materialized view: populates the summary on every insert
CREATE MATERIALIZED VIEW page_views_daily_mv
TO page_views_daily AS
SELECT
    toDate(timestamp)  AS day,
    page_path,
    country,
    count()            AS views,
    uniqState(user_id) AS unique_users
FROM page_views
GROUP BY day, page_path, country;
```

Dashboard queries hit the daily summary instead of raw events. A query for a 90-day date range reads 90 rows per page/country combination, not millions of raw events.

## ClickHouse without running a cluster

Installing and running ClickHouse locally for development is straightforward. Running it in production, managing replication, failover, scaling, backups, and schema migrations across nodes, is a different story.

Tinybird is managed ClickHouse with three additions that matter for most teams:

1. **Events API**: HTTP endpoint for ingestion. POST JSON directly without needing the ClickHouse client library or managing batching. Handles 1K+ req/sec by default.
2. **SQL editor and Pipes**: Write SQL in the browser, parameterize it, and publish as an HTTP endpoint. Your frontend calls `/campaign_metrics.json?days=7`, no backend route needed.
3. **No cluster management**: Tinybird handles replication, scaling, backups, and ClickHouse version upgrades.

The `page_views` example above as a Tinybird workflow:

```bash
# Install CLI
pip install tinybird-cli

# Define the datasource (schema)
cat > page_views.datasource << 'EOF'
SCHEMA >
    timestamp   DateTime `json:$.timestamp`,
    user_id     String   `json:$.user_id`,
    page_path   LowCardinality(String) `json:$.page_path`,
    country     LowCardinality(String) `json:$.country`,
    duration_ms UInt32   `json:$.duration_ms`

ENGINE "MergeTree"
ENGINE_PARTITION_KEY "toYYYYMM(timestamp)"
ENGINE_SORTING_KEY "user_id, timestamp"
EOF

# Deploy the schema
tb deploy

# Send an event
curl -X POST "https://api.tinybird.co/v0/events?name=page_views" \
  -H "Authorization: Bearer $TB_TOKEN" \
  -d '{"timestamp":"2026-05-25T10:00:00","user_id":"u1","page_path":"/pricing","country":"US","duration_ms":1200}'
```

Then define the query as a Pipe and publish it as an endpoint, same ClickHouse SQL, no extra infrastructure.

## What to try first

If you want to explore ClickHouse without running anything locally:

1. **[ClickHouse Cloud free tier](https://clickhouse.cloud)**, managed ClickHouse with a SQL editor, $300 of trial credits.
2. **[Tinybird free tier](https://www.tinybird.co/pricing)**, managed ClickHouse with ingestion API and endpoint publishing. Free up to 1M rows/month.
3. **[ClickHouse Playground](https://play.clickhouse.com)**, query public datasets with no signup, useful for getting comfortable with the SQL syntax.

The learning curve from Postgres is low if you already know SQL. The main adjustments are schema design (sort keys, partition keys, table engines) and learning when approximation functions (`uniq`, `quantile`) are the right trade-off. The [ClickHouse course](https://www.tinybird.co/blog/clickhouse-course) covers these concepts with progressively harder exercises if you want structured practice. Most engineers are productive within a day.

{% cta
  title="Try ClickHouse without running a cluster"
  text="Tinybird is managed ClickHouse® with a SQL editor, ingestion API, and endpoint publishing. Free to start."
  button={href: "https://cloud.tinybird.co/signup", target: "_blank", text: "Try Tinybird free"}
/%}
