---
title: "ClickHouse for analysts"
excerpt: "ClickHouse runs standard SQL, returns aggregations over billions of rows in milliseconds, and doesn't require you to manage infrastructure. Here's what analysts need to know to get the most out of it, including how Tinybird makes the whole workflow faster."
authors: "Tinybird"
categories: "AI Resources"
createdOn: "2026-05-25 00:00:00"
publishedOn: "2026-05-25 00:00:00"
updatedOn: "2026-05-25 00:00:00"
status: "published"
---

ClickHouse® is a columnar database built for exactly the queries analysts run most: aggregations across large date ranges, GROUP BY across multiple dimensions, COUNT DISTINCT on high-cardinality columns, and time-series rollups. Queries that take minutes in Postgres take milliseconds here.

ClickHouse is "a high-performance, column-oriented SQL database management system (DBMS) for online analytical processing (OLAP)." Unlike row-store databases, where even a query touching two columns reads entire rows from disk, ClickHouse stores each column independently. A query that only needs `event_time`, `event_type`, and `user_id` reads exactly those three columns and nothing else.

It runs standard SQL. If you know SQL, you can query ClickHouse today. There are a few extensions worth knowing, a few patterns that unlock the most common analyst workflows, and a few gotchas that catch people coming from row-store databases. This post covers all three.

If you're using [Tinybird](https://www.tinybird.co/), managed ClickHouse with a SQL editor and API layer, every example here runs exactly as written in the Tinybird workspace.

## What ClickHouse is fast at

The queries that are painful in row-store databases are ClickHouse's native workload. If you're coming from a warehouse like BigQuery or Snowflake, the positioning is different, ClickHouse is optimized for [analytics at scale](https://www.tinybird.co/blog/analytics-at-scale) rather than scheduled batch queries:

```sql
-- Count distinct users who performed an action, by day, last 90 days
SELECT
    toDate(event_time)      AS day,
    event_type,
    count()                 AS events,
    uniq(user_id)           AS unique_users
FROM events
WHERE event_time >= now() - INTERVAL 90 DAY
GROUP BY day, event_type
ORDER BY day;
```

This query scans 90 days of event data across potentially billions of rows. In ClickHouse, with a well-designed schema, it returns in under a second. The reason is physical: because column values are stored sequentially on disk, only the four referenced columns are read. Everything else stays compressed and unread.

This block-wise storage alignment is specifically what makes analytical queries faster, no unnecessary data is loaded, compared to row-based storage, where entire rows including irrelevant columns are always read.

## The SQL you already know, plus some useful extensions

ClickHouse supports the SQL standard you're used to: `SELECT`, `WHERE`, `GROUP BY`, `ORDER BY`, `JOIN`, `HAVING`, window functions, CTEs. A few additions are worth learning early.

### `uniq()` instead of `COUNT(DISTINCT ...)`

`COUNT(DISTINCT col)` in standard SQL is exact but expensive at scale. ClickHouse's `uniq()` uses a HyperLogLog sketch, approximately 2% error, but orders of magnitude faster on large datasets. For even faster approximation with slightly less accuracy, `uniqHLL12()` uses a smaller sketch:

```sql
-- Approximate (fast, ~2% error)
SELECT uniq(user_id) AS unique_users FROM events;

-- Even faster approximation (useful on very large tables)
SELECT uniqHLL12(user_id) AS unique_users FROM events;

-- Exact (slower, uses more memory)
SELECT uniqExact(user_id) AS unique_users FROM events;
```

For most analyst use cases, understanding scale, spotting trends, comparing segments, the approximation is fine. Use `uniqExact()` when you need the precise number, such as billing reconciliation or compliance reporting.

### `countIf()` and `sumIf()` for conditional aggregation

Instead of multiple subqueries or CASE statements, use the `If` variants. One pass over the data, multiple conditional aggregations:

```sql
SELECT
    campaign_id,
    countIf(event_type = 'impression') AS impressions,
    countIf(event_type = 'click')      AS clicks,
    round(
        countIf(event_type = 'click') /
        nullIf(countIf(event_type = 'impression'), 0) * 100,
        2
    )                                  AS ctr_pct,
    sumIf(revenue, event_type = 'conversion') AS revenue
FROM ad_events
WHERE event_time >= now() - INTERVAL 7 DAY
GROUP BY campaign_id
ORDER BY impressions DESC;
```

`nullIf(x, 0)` prevents division by zero when a campaign has impressions but no clicks, returning NULL instead of an error.

### `PREWHERE` for selective filters

ClickHouse's `PREWHERE` is a performance optimization not available in standard SQL. It reads only the filter column first, identifies matching rows, and then reads the remaining columns only for those rows:

```sql
-- Standard WHERE: reads all columns for all rows, then filters
SELECT user_id, revenue FROM events
WHERE event_type = 'conversion';

-- PREWHERE: reads event_type first, then user_id/revenue only for matches
SELECT user_id, revenue FROM events
PREWHERE event_type = 'conversion';
```

On a table with many columns where the filter is highly selective, `PREWHERE` can cut I/O dramatically. On a table with many columns where the filter is highly selective, `PREWHERE` can cut I/O dramatically. In Tinybird, the query optimizer applies it automatically in many cases, but knowing the concept helps you write queries that the optimizer can work with.

### Date and time functions

ClickHouse has a rich set of date functions more ergonomic than standard SQL:

```sql
-- Truncate to time buckets (useful for GROUP BY)
toStartOfHour(event_time)
toStartOfDay(event_time)
toStartOfWeek(event_time)    -- Monday
toStartOfMonth(event_time)

-- Relative time filters
WHERE event_time >= now() - INTERVAL 7 DAY
WHERE event_time >= today() - 30
WHERE toDate(event_time) = yesterday()

-- Extract components
toDayOfWeek(event_time)   -- 1 = Monday, 7 = Sunday
toHour(event_time)        -- 0-23
toISOWeek(event_time)     -- ISO week number
```

### Array and JSON functions for semi-structured data

If your events store extra attributes as JSON or arrays:

```sql
-- Extract typed fields from JSON
SELECT
    JSONExtractString(properties, 'plan')    AS plan,
    JSONExtractInt(properties, 'seat_count') AS seats,
    JSONExtractFloat(properties, 'mrr')      AS mrr
FROM accounts;

-- Flatten arrays into rows (useful for tag analysis)
SELECT arrayJoin(tags) AS tag, count() AS cnt
FROM content
GROUP BY tag
ORDER BY cnt DESC;
```

In Tinybird, the workspace UI shows column types and lets you autocomplete JSON paths once your datasource is loaded, useful when exploring event schemas you didn't write yourself.

## Funnel analysis

Funnel queries are one of the most common analyst requests. ClickHouse handles them efficiently with `countIf` and the built-in `windowFunnel` function.

A simple step-by-step funnel:

```sql
SELECT
    countIf(event = 'signed_up')                  AS step_1_signup,
    countIf(event = 'onboarding_completed')        AS step_2_onboarding,
    countIf(event = 'first_query_run')             AS step_3_first_query,
    round(countIf(event = 'onboarding_completed') /
          nullIf(countIf(event = 'signed_up'), 0) * 100, 1) AS step1_to_step2_pct,
    round(countIf(event = 'first_query_run') /
          nullIf(countIf(event = 'onboarding_completed'), 0) * 100, 1) AS step2_to_step3_pct
FROM user_events
WHERE event_time >= now() - INTERVAL 30 DAY;
```

For strict ordered funnels, where a user must complete step N before counting for step N+1, use `windowFunnel`:

```sql
SELECT level, count() AS users
FROM (
    SELECT
        user_id,
        windowFunnel(7 * 86400)(   -- 7-day window in seconds
            event_time,
            event = 'signed_up',
            event = 'onboarding_completed',
            event = 'first_query_run',
            event = 'invited_teammate'
        ) AS level
    FROM user_events
    WHERE event_time >= now() - INTERVAL 90 DAY
    GROUP BY user_id
)
GROUP BY level
ORDER BY level;
```

`windowFunnel` returns the deepest step reached within the time window. In Tinybird, this becomes a Pipe endpoint your product team calls with `?days=30` or `?days=90`, no SQL access required on their side.

## Cohort retention analysis

Retention is a weekly or monthly analysis in most BI tools. In ClickHouse, a full cohort retention table runs in seconds even with years of history:

```sql
WITH
    cohorts AS (
        SELECT user_id, toStartOfWeek(min(event_time)) AS cohort_week
        FROM user_events
        WHERE event = 'signed_up'
        GROUP BY user_id
    ),
    activity AS (
        SELECT DISTINCT user_id, toStartOfWeek(event_time) AS active_week
        FROM user_events
        WHERE event_time >= now() - INTERVAL 52 WEEK
    )
SELECT
    c.cohort_week,
    dateDiff('week', c.cohort_week, a.active_week) AS weeks_since_signup,
    count(DISTINCT a.user_id)                       AS retained_users,
    count(DISTINCT c2.user_id)                      AS cohort_size,
    round(count(DISTINCT a.user_id) /
          nullIf(count(DISTINCT c2.user_id), 0) * 100, 1) AS retention_pct
FROM cohorts c
JOIN cohorts c2 ON c.cohort_week = c2.cohort_week
JOIN activity a ON c.user_id = a.user_id
WHERE dateDiff('week', c.cohort_week, a.active_week) BETWEEN 0 AND 12
GROUP BY c.cohort_week, weeks_since_signup
ORDER BY c.cohort_week, weeks_since_signup;
```

## Percentile and distribution analysis

Standard SQL's `AVG` hides a lot. For latency, revenue, and engagement, percentiles tell a more honest story:

```sql
SELECT
    endpoint,
    count()                         AS requests,
    avg(duration_ms)                AS avg_ms,
    quantile(0.50)(duration_ms)     AS p50_ms,
    quantile(0.90)(duration_ms)     AS p90_ms,
    quantile(0.99)(duration_ms)     AS p99_ms
FROM api_requests
WHERE timestamp >= now() - INTERVAL 24 HOUR
GROUP BY endpoint
ORDER BY p99_ms DESC;
```

`quantile()` uses T-Digest approximation. For exact percentiles in SLA reporting, use `quantileExact()`.

## Period-over-period comparison with `lagInFrame`

`lagInFrame` lets you access the previous period's value for period-over-period comparisons without a self-join:

```sql
SELECT
    day,
    revenue,
    lagInFrame(revenue, 1, 0) OVER (ORDER BY day) AS prev_day_revenue,
    round(
        (revenue - lagInFrame(revenue, 1, 0) OVER (ORDER BY day)) /
        nullIf(lagInFrame(revenue, 1, 0) OVER (ORDER BY day), 0) * 100,
        1
    ) AS pct_change
FROM daily_revenue
ORDER BY day;
```

This is cleaner than a self-join and significantly faster on large tables.

## Window functions for rankings and running totals

ClickHouse supports full SQL window function syntax:

```sql
-- 7-day rolling average revenue
SELECT
    day,
    revenue,
    round(avg(revenue) OVER (
        ORDER BY day
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ), 2) AS rolling_7d_avg
FROM daily_revenue
ORDER BY day;

-- Rank users by spend within each plan tier
SELECT
    user_id,
    plan,
    total_spend,
    rank() OVER (PARTITION BY plan ORDER BY total_spend DESC) AS spend_rank
FROM user_spend
WHERE month = '2026-05';
```

## Reading query performance

ClickHouse returns statistics with every result. In the Tinybird query editor, check `elapsed`, `rows_read`, and `bytes_read` at the bottom of every query run:

```json
{
  "statistics": {
    "elapsed": 0.043,
    "rows_read": 4820000,
    "bytes_read": 96400000
  }
}
```

43ms to scan 4.8 million rows. If `rows_read` is much higher than you expected, your query is scanning too much. Common causes:

- A function wraps an indexed column: `toDate(timestamp) = '2026-05-25'` disables the index. Rewrite as `timestamp >= '2026-05-25' AND timestamp < '2026-05-26'`
- The time filter column isn't first in the table's `ORDER BY` key
- No partition key, so a monthly filter still reads the whole table

## Sampling for fast exploration

When exploring a large dataset for the first time, use `SAMPLE` to get a representative subset without waiting for a full scan:

```sql
-- Query 1% of the data, scale results ×100
SELECT event_type, count() * 100 AS estimated_count
FROM events SAMPLE 0.01
WHERE event_time >= now() - INTERVAL 30 DAY
GROUP BY event_type
ORDER BY estimated_count DESC;
```

Results scale linearly. This is how you explore a 10-billion-row dataset in under a second before deciding which query to optimize. More patterns are in [how to make SQL database faster](https://www.tinybird.co/blog/how-to-make-sql-database-faster).

## Sharing analysis with your team via Tinybird

One of the recurring friction points for analysts is re-running the same query when stakeholders ask for slightly different cuts of the data. Tinybird solves this with Pipes: SQL queries you write once and publish as parameterized HTTP endpoints.

Write the cohort analysis query, publish it as an endpoint:

```sql
-- retention_by_cohort.pipe
NODE cohorts
SQL >
  SELECT user_id, toStartOfWeek(min(event_time)) AS cohort_week
  FROM user_events
  WHERE event = 'signed_up'
  GROUP BY user_id

NODE retention
SQL >
  %
  SELECT
      c.cohort_week,
      dateDiff('week', c.cohort_week, a.active_week) AS week_number,
      round(count(DISTINCT a.user_id) /
            nullIf(count(DISTINCT c2.user_id), 0) * 100, 1) AS retention_pct
  FROM cohorts c
  JOIN cohorts c2 ON c.cohort_week = c2.cohort_week
  JOIN (
      SELECT DISTINCT user_id, toStartOfWeek(event_time) AS active_week
      FROM user_events
      WHERE event_time >= now() - INTERVAL {{ Int32(weeks, 12) }} WEEK
  ) a ON c.user_id = a.user_id
  WHERE week_number BETWEEN 0 AND {{ Int32(max_week, 8) }}
  GROUP BY c.cohort_week, week_number
  ORDER BY c.cohort_week, week_number

TYPE ENDPOINT
```

Your product team calls `retention_by_cohort.json?weeks=24&max_week=12` themselves. Your dashboard connects to the same URL. You update the SQL once and everyone gets fresh results. No Slack message asking you to re-run the pivot table.

Tinybird also connects to BI tools, [Grafana](https://www.tinybird.co/blog/clickhouse-grafana-example), Metabase, Superset, via its ClickHouse-compatible query interface. If your team already has a dashboard tool, you can point it at Tinybird and get ClickHouse performance without changing workflows.

## What ClickHouse is NOT good at

**Point lookups are slower than Postgres.** Fetching a single row by primary key is what Postgres is optimized for. ClickHouse's strength is scanning millions of rows, not returning one.

**Updates and deletes are expensive.** ClickHouse is append-first. `ALTER TABLE ... DELETE` and `UPDATE` are asynchronous and heavyweight. Don't use it for data that changes frequently at the row level.

**JOINs on two large tables require care.** ClickHouse loads the right-side table into memory during a hash join. Filter aggressively before joining, and put the smaller table on the right:

```sql
SELECT e.user_id, u.plan
FROM events e
INNER JOIN (
    SELECT user_id, plan FROM users WHERE plan IN ('pro', 'enterprise')
) u ON e.user_id = u.user_id
WHERE e.event_time >= now() - INTERVAL 30 DAY;
```

{% cta
  title="Query your data at ClickHouse speed"
  text="Tinybird gives analysts a SQL editor backed by managed ClickHouse®. No cluster setup, no infra tickets. Share queries as live API endpoints your team can call directly."
  button={href: "https://cloud.tinybird.co/signup", target: "_blank", text: "Try Tinybird free"}
/%}
