---
title: "ClickHouse data scientists"
excerpt: "Data scientists need a database that handles large-scale feature computation, fast iteration in notebooks, and statistical queries over billions of rows without waiting minutes for results. ClickHouse fits that workflow better than most tools built for data science."
authors: "Tinybird"
categories: "AI Resources"
createdOn: "2026-06-03 00:00:00"
publishedOn: "2026-06-03 00:00:00"
updatedOn: "2026-06-03 00:00:00"
status: "published"
---

Most data science workflows hit database performance limits long before they hit algorithm limits. You're not waiting on the model. You're waiting on the query that computes the features the model needs, or the aggregation that validates the hypothesis you're testing.

ClickHouse® solves the database side of that problem. It's columnar, vectorized, and built for analytical queries over large datasets, which is exactly the shape of data science workloads: scan many rows, aggregate into a smaller result, iterate on the query until the output is useful.

This post covers how to connect ClickHouse to Python-based data science workflows, which built-in statistical functions are worth knowing, how to design feature tables for ML, patterns for fast exploratory analysis, and A/B test analysis.

## Python connectivity

The two main Python clients for ClickHouse are `clickhouse-driver` (binary protocol, fastest for large result sets) and `clickhouse-connect` (HTTP, easier to set up). See [ClickHouse Python example](https://www.tinybird.co/blog/clickhouse-python-example) for a comparison of both.

```python
# clickhouse-connect: recommended for data science workflows
import clickhouse_connect

client = clickhouse_connect.get_client(
    host='your-host',
    port=8443,
    username='default',
    password='your-password',
    secure=True
)

# Query directly to pandas DataFrame
df = client.query_df("""
    SELECT
        toDate(event_time)      AS day,
        user_id,
        count()                 AS sessions,
        sum(revenue)            AS total_revenue,
        avg(session_duration)   AS avg_duration_sec
    FROM user_events
    WHERE event_time >= now() - INTERVAL 90 DAY
    GROUP BY day, user_id
""")
```

`query_df` returns a pandas DataFrame directly. For large results that don't fit in memory, stream with `query_row_block_stream`:

```python
# Streaming for large result sets
with client.query_row_block_stream(
    "SELECT * FROM features WHERE feature_date = today()"
) as stream:
    for block in stream:
        df_chunk = pd.DataFrame(block, columns=stream.source.columns)
        process_chunk(df_chunk)
```

For production pipelines that bulk-export large result sets, `clickhouse-driver` using the native binary protocol is faster. For Jupyter notebooks and interactive work, `clickhouse-connect` is more convenient. Both support Apache Arrow format for zero-copy transfers to pandas or polars via `query_arrow`.

For [SQL and Python](https://www.tinybird.co/blog/sql-and-python) workflows more broadly, ClickHouse SQL is standard enough that most data scientists can adopt it without learning new syntax.

## Statistical functions built in

ClickHouse has a broad set of statistical functions that data scientists often implement manually in pandas. Running these in-database is almost always faster than pulling raw data and computing in Python.

**Pearson correlation and covariance:**

```sql
SELECT
    corr(feature_a, feature_b)              AS pearson_r,
    covarPop(feature_a, feature_b)          AS covariance,
    covarSamp(feature_a, feature_b)         AS sample_covariance
FROM ml_features
WHERE training_set = 1;
```

**Full distribution analysis in one query:**

```sql
SELECT
    avg(session_duration)                    AS mean,
    stddevPop(session_duration)              AS stddev,
    quantile(0.25)(session_duration)         AS p25,
    quantile(0.50)(session_duration)         AS p50,
    quantile(0.75)(session_duration)         AS p75,
    quantile(0.95)(session_duration)         AS p95,
    quantile(0.99)(session_duration)         AS p99,
    skewPop(session_duration)                AS skewness,
    kurtPop(session_duration)                AS kurtosis,
    min(session_duration)                    AS min_val,
    max(session_duration)                    AS max_val
FROM sessions
WHERE event_date >= today() - 30;
```

**Histogram for quick distribution checks:**

```sql
SELECT arrayJoin(histogram(50)(revenue)) AS bin
FROM orders
WHERE revenue > 0;
```

`histogram(50)(revenue)` returns 50 bins with lower bound, upper bound, and height. Useful for spotting skewed distributions or outliers before deciding how to scale a feature, without pulling the raw data into Python first.

**Linear regression coefficients:**

```sql
SELECT
    simpleLinearRegression(session_duration, revenue) AS (slope, intercept)
FROM user_sessions
WHERE session_duration > 0;
```

All of these run in-database, over billions of rows, in seconds. The equivalent pandas operation on a sampled CSV takes longer and requires you to manage the sampling.

## Feature computation for ML

Feature engineering at scale is where ClickHouse makes the biggest difference for data science workflows. Computing features at query time over billions of raw events is fast enough that you can iterate on feature definitions without a long feedback loop.

Tinybird describes this pattern in detail in [using Tinybird as a serverless online feature store](https://www.tinybird.co/blog/using-tinybird-as-a-serverless-online-feature-store). The core idea: ClickHouse computes features on the fly from raw event data, and a parameterized endpoint exposes them to your model serving layer.

A typical feature table for a user-level ML model:

```sql
CREATE TABLE user_features
(
    user_id             UInt64,
    feature_date        Date,
    sessions_7d         UInt32,
    sessions_30d        UInt32,
    avg_session_sec     Float32,
    total_revenue_30d   Decimal(12, 2),
    distinct_products   UInt32,
    days_since_last     UInt16,
    conversion_rate     Float32
)
ENGINE = ReplacingMergeTree(feature_date)
ORDER BY (user_id, feature_date);
```

Populate it with a materialized view that aggregates raw events at insert time:

```sql
CREATE MATERIALIZED VIEW user_features_mv TO user_features AS
SELECT
    user_id,
    today()                                                 AS feature_date,
    countIf(event_time >= now() - INTERVAL 7 DAY)          AS sessions_7d,
    countIf(event_time >= now() - INTERVAL 30 DAY)         AS sessions_30d,
    avg(session_duration)                                   AS avg_session_sec,
    sumIf(revenue, event_time >= now() - INTERVAL 30 DAY)  AS total_revenue_30d,
    uniq(product_id)                                        AS distinct_products,
    dateDiff('day', max(event_time), now())                 AS days_since_last,
    countIf(converted = 1) / count()                        AS conversion_rate
FROM user_events
GROUP BY user_id;
```

With this setup, your model training pipeline pulls the feature table directly rather than recomputing features from raw events on every run. Feature freshness is controlled by how often you refresh the view, not by how long the computation takes.

Understanding [what materialized views are and why they matter](https://www.tinybird.co/blog/what-are-materialized-views-and-why-do-they-matter-for-realtime) is key to using this pattern correctly. Unlike views in traditional databases, ClickHouse materialized views run at insert time and store results, so reads against them are always pre-computed.

## Sampling for fast exploration

At scale, you rarely need to query every row for exploratory work. ClickHouse's `SAMPLE` clause gives you a statistically representative subset:

```sql
-- Explore 1% of data for fast EDA
SELECT
    event_type,
    avg(revenue)            AS avg_revenue,
    stddevPop(revenue)      AS stddev,
    count() * 100           AS estimated_count
FROM events SAMPLE 0.01
WHERE event_date >= today() - 90
GROUP BY event_type
ORDER BY estimated_count DESC;
```

The sample fraction applies deterministically by row hash, so the same `SAMPLE 0.01` on the same table always returns the same rows. This is useful for reproducibility when you're validating a hypothesis before running the full query.

`LIMIT` is not sampling. `LIMIT 1000000` reads the first million rows (however they're stored), which is biased toward recently inserted data. `SAMPLE 0.001` reads a random 0.1% uniformly distributed across the entire dataset. For exploratory data analysis over a 500-billion-row dataset, the difference matters.

## A/B test analysis

ClickHouse handles the statistical significance calculations for A/B tests in-database, which is faster and less error-prone than exporting data to Python for analysis:

```sql
SELECT
    variant,
    count()                                             AS users,
    countIf(converted = 1)                              AS conversions,
    round(countIf(converted = 1) / count() * 100, 3)   AS cvr_pct,
    -- Approximate confidence interval using normal approximation
    round(
        1.96 * sqrt(
            (countIf(converted = 1) / count()) *
            (1 - countIf(converted = 1) / count()) /
            count()
        ) * 100, 3
    )                                                   AS margin_of_error_pct
FROM experiment_assignments
INNER JOIN conversions USING (user_id)
WHERE experiment_id = 'checkout_flow_v2'
  AND assignment_date >= '2026-05-01'
GROUP BY variant
ORDER BY variant;
```

For t-test style significance on continuous metrics (revenue, session duration), ClickHouse has `mannWhitneyUTest` and `welchTTest` aggregate functions:

```sql
SELECT
    welchTTest(0.05)(
        if(variant = 'control', revenue, NULL),
        if(variant = 'treatment', revenue, NULL)
    ) AS (p_value, t_statistic)
FROM experiment_results
WHERE experiment_id = 'pricing_change';
```

`welchTTest` returns the p-value and t-statistic. Run this query over tens of millions of user observations in seconds, rather than waiting for a pandas script to load the data.

## Window functions for time series analysis

Product and behavioral data science work frequently involves time series calculations: rolling averages, rank by cohort, moving sums. ClickHouse supports full window function syntax:

```sql
SELECT
    user_id,
    event_date,
    daily_revenue,
    avg(daily_revenue) OVER (
        PARTITION BY user_id
        ORDER BY event_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    )                   AS rolling_7d_avg,
    sum(daily_revenue) OVER (
        PARTITION BY user_id
        ORDER BY event_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    )                   AS cumulative_revenue,
    rank() OVER (
        PARTITION BY toStartOfMonth(event_date)
        ORDER BY daily_revenue DESC
    )                   AS daily_rank_in_month
FROM (
    SELECT
        user_id,
        toDate(event_time)  AS event_date,
        sum(revenue)        AS daily_revenue
    FROM orders
    GROUP BY user_id, event_date
)
ORDER BY user_id, event_date;
```

This runs in-database on your full dataset. The equivalent operation in pandas requires loading all user history into memory first, which at tens of millions of users is impractical.

## Cohort analysis

Cohort retention is a common data science output for measuring product stickiness. ClickHouse computes it in a single query:

```sql
SELECT
    toStartOfMonth(first_event)                          AS cohort_month,
    dateDiff('month', first_event, event_month)          AS months_since_first,
    uniq(user_id)                                        AS retained_users
FROM (
    SELECT
        user_id,
        toStartOfMonth(min(event_time)) OVER (PARTITION BY user_id) AS first_event,
        toStartOfMonth(event_time)                                   AS event_month
    FROM user_events
)
GROUP BY cohort_month, months_since_first
ORDER BY cohort_month, months_since_first;
```

For deeper analysis, ClickHouse handles [real-time analytics with billion rows at scale](https://www.tinybird.co/blog/real-time-analytics-with-billion-rows-at-scale), meaning cohort queries over full user histories run in seconds rather than the minutes they'd take in a traditional data warehouse.

## Tinybird for data scientists

For data scientists who want ClickHouse without running a cluster, Tinybird is managed ClickHouse with a [Python SDK](https://www.tinybird.co/blog/python-sdk) and HTTP API.

Connect your notebook via `clickhouse-connect` pointing at your Tinybird host, or use the Tinybird Python SDK to query endpoints. Features computed in Tinybird Pipes become parameterized HTTP endpoints your application can call, turning your feature engineering SQL into a production serving layer without rewriting anything. The same SQL you iterate on in a notebook becomes the endpoint your model server calls in production.

{% cta
  title="Query billions of rows from your notebook"
  text="Tinybird is managed ClickHouse with a Python SDK and HTTP API. Connect pandas, run SQL, iterate fast."
  button={href: "https://cloud.tinybird.co/signup", target: "_blank", text: "Try Tinybird free"}
/%}
