Most data science workflows hit database performance limits long before they hit algorithm limits. You're not waiting on the model. You're waiting on the query that computes the features the model needs, or the aggregation that validates the hypothesis you're testing.
ClickHouse® solves the database side of that problem. It's columnar, vectorized, and built for analytical queries over large datasets, which is exactly the shape of data science workloads: scan many rows, aggregate into a smaller result, iterate on the query until the output is useful.
This post covers how to connect ClickHouse to Python-based data science workflows, which built-in statistical functions are worth knowing, how to design feature tables for ML, patterns for fast exploratory analysis, and A/B test analysis.
Python connectivity
The two main Python clients for ClickHouse are clickhouse-driver (binary protocol, fastest for large result sets) and clickhouse-connect (HTTP, easier to set up). See ClickHouse Python example for a comparison of both.
# clickhouse-connect: recommended for data science workflows
import clickhouse_connect
client = clickhouse_connect.get_client(
host='your-host',
port=8443,
username='default',
password='your-password',
secure=True
)
# Query directly to pandas DataFrame
df = client.query_df("""
SELECT
toDate(event_time) AS day,
user_id,
count() AS sessions,
sum(revenue) AS total_revenue,
avg(session_duration) AS avg_duration_sec
FROM user_events
WHERE event_time >= now() - INTERVAL 90 DAY
GROUP BY day, user_id
""")
query_df returns a pandas DataFrame directly. For large results that don't fit in memory, stream with query_row_block_stream:
# Streaming for large result sets
with client.query_row_block_stream(
"SELECT * FROM features WHERE feature_date = today()"
) as stream:
for block in stream:
df_chunk = pd.DataFrame(block, columns=stream.source.columns)
process_chunk(df_chunk)
For production pipelines that bulk-export large result sets, clickhouse-driver using the native binary protocol is faster. For Jupyter notebooks and interactive work, clickhouse-connect is more convenient. Both support Apache Arrow format for zero-copy transfers to pandas or polars via query_arrow.
For SQL and Python workflows more broadly, ClickHouse SQL is standard enough that most data scientists can adopt it without learning new syntax.
Statistical functions built in
ClickHouse has a broad set of statistical functions that data scientists often implement manually in pandas. Running these in-database is almost always faster than pulling raw data and computing in Python.
Pearson correlation and covariance:
SELECT
corr(feature_a, feature_b) AS pearson_r,
covarPop(feature_a, feature_b) AS covariance,
covarSamp(feature_a, feature_b) AS sample_covariance
FROM ml_features
WHERE training_set = 1;
Full distribution analysis in one query:
SELECT
avg(session_duration) AS mean,
stddevPop(session_duration) AS stddev,
quantile(0.25)(session_duration) AS p25,
quantile(0.50)(session_duration) AS p50,
quantile(0.75)(session_duration) AS p75,
quantile(0.95)(session_duration) AS p95,
quantile(0.99)(session_duration) AS p99,
skewPop(session_duration) AS skewness,
kurtPop(session_duration) AS kurtosis,
min(session_duration) AS min_val,
max(session_duration) AS max_val
FROM sessions
WHERE event_date >= today() - 30;
Histogram for quick distribution checks:
SELECT arrayJoin(histogram(50)(revenue)) AS bin
FROM orders
WHERE revenue > 0;
histogram(50)(revenue) returns 50 bins with lower bound, upper bound, and height. Useful for spotting skewed distributions or outliers before deciding how to scale a feature, without pulling the raw data into Python first.
Linear regression coefficients:
SELECT
simpleLinearRegression(session_duration, revenue) AS (slope, intercept)
FROM user_sessions
WHERE session_duration > 0;
All of these run in-database, over billions of rows, in seconds. The equivalent pandas operation on a sampled CSV takes longer and requires you to manage the sampling.
Feature computation for ML
Feature engineering at scale is where ClickHouse makes the biggest difference for data science workflows. Computing features at query time over billions of raw events is fast enough that you can iterate on feature definitions without a long feedback loop.
Tinybird describes this pattern in detail in using Tinybird as a serverless online feature store. The core idea: ClickHouse computes features on the fly from raw event data, and a parameterized endpoint exposes them to your model serving layer.
A typical feature table for a user-level ML model:
CREATE TABLE user_features
(
user_id UInt64,
feature_date Date,
sessions_7d UInt32,
sessions_30d UInt32,
avg_session_sec Float32,
total_revenue_30d Decimal(12, 2),
distinct_products UInt32,
days_since_last UInt16,
conversion_rate Float32
)
ENGINE = ReplacingMergeTree(feature_date)
ORDER BY (user_id, feature_date);
Populate it with a materialized view that aggregates raw events at insert time:
CREATE MATERIALIZED VIEW user_features_mv TO user_features AS
SELECT
user_id,
today() AS feature_date,
countIf(event_time >= now() - INTERVAL 7 DAY) AS sessions_7d,
countIf(event_time >= now() - INTERVAL 30 DAY) AS sessions_30d,
avg(session_duration) AS avg_session_sec,
sumIf(revenue, event_time >= now() - INTERVAL 30 DAY) AS total_revenue_30d,
uniq(product_id) AS distinct_products,
dateDiff('day', max(event_time), now()) AS days_since_last,
countIf(converted = 1) / count() AS conversion_rate
FROM user_events
GROUP BY user_id;
With this setup, your model training pipeline pulls the feature table directly rather than recomputing features from raw events on every run. Feature freshness is controlled by how often you refresh the view, not by how long the computation takes.
Understanding what materialized views are and why they matter is key to using this pattern correctly. Unlike views in traditional databases, ClickHouse materialized views run at insert time and store results, so reads against them are always pre-computed.
Sampling for fast exploration
At scale, you rarely need to query every row for exploratory work. ClickHouse's SAMPLE clause gives you a statistically representative subset:
-- Explore 1% of data for fast EDA
SELECT
event_type,
avg(revenue) AS avg_revenue,
stddevPop(revenue) AS stddev,
count() * 100 AS estimated_count
FROM events SAMPLE 0.01
WHERE event_date >= today() - 90
GROUP BY event_type
ORDER BY estimated_count DESC;
The sample fraction applies deterministically by row hash, so the same SAMPLE 0.01 on the same table always returns the same rows. This is useful for reproducibility when you're validating a hypothesis before running the full query.
LIMIT is not sampling. LIMIT 1000000 reads the first million rows (however they're stored), which is biased toward recently inserted data. SAMPLE 0.001 reads a random 0.1% uniformly distributed across the entire dataset. For exploratory data analysis over a 500-billion-row dataset, the difference matters.
A/B test analysis
ClickHouse handles the statistical significance calculations for A/B tests in-database, which is faster and less error-prone than exporting data to Python for analysis:
SELECT
variant,
count() AS users,
countIf(converted = 1) AS conversions,
round(countIf(converted = 1) / count() * 100, 3) AS cvr_pct,
-- Approximate confidence interval using normal approximation
round(
1.96 * sqrt(
(countIf(converted = 1) / count()) *
(1 - countIf(converted = 1) / count()) /
count()
) * 100, 3
) AS margin_of_error_pct
FROM experiment_assignments
INNER JOIN conversions USING (user_id)
WHERE experiment_id = 'checkout_flow_v2'
AND assignment_date >= '2026-05-01'
GROUP BY variant
ORDER BY variant;
For t-test style significance on continuous metrics (revenue, session duration), ClickHouse has mannWhitneyUTest and welchTTest aggregate functions:
SELECT
welchTTest(0.05)(
if(variant = 'control', revenue, NULL),
if(variant = 'treatment', revenue, NULL)
) AS (p_value, t_statistic)
FROM experiment_results
WHERE experiment_id = 'pricing_change';
welchTTest returns the p-value and t-statistic. Run this query over tens of millions of user observations in seconds, rather than waiting for a pandas script to load the data.
Window functions for time series analysis
Product and behavioral data science work frequently involves time series calculations: rolling averages, rank by cohort, moving sums. ClickHouse supports full window function syntax:
SELECT
user_id,
event_date,
daily_revenue,
avg(daily_revenue) OVER (
PARTITION BY user_id
ORDER BY event_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS rolling_7d_avg,
sum(daily_revenue) OVER (
PARTITION BY user_id
ORDER BY event_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS cumulative_revenue,
rank() OVER (
PARTITION BY toStartOfMonth(event_date)
ORDER BY daily_revenue DESC
) AS daily_rank_in_month
FROM (
SELECT
user_id,
toDate(event_time) AS event_date,
sum(revenue) AS daily_revenue
FROM orders
GROUP BY user_id, event_date
)
ORDER BY user_id, event_date;
This runs in-database on your full dataset. The equivalent operation in pandas requires loading all user history into memory first, which at tens of millions of users is impractical.
Cohort analysis
Cohort retention is a common data science output for measuring product stickiness. ClickHouse computes it in a single query:
SELECT
toStartOfMonth(first_event) AS cohort_month,
dateDiff('month', first_event, event_month) AS months_since_first,
uniq(user_id) AS retained_users
FROM (
SELECT
user_id,
toStartOfMonth(min(event_time)) OVER (PARTITION BY user_id) AS first_event,
toStartOfMonth(event_time) AS event_month
FROM user_events
)
GROUP BY cohort_month, months_since_first
ORDER BY cohort_month, months_since_first;
For deeper analysis, ClickHouse handles real-time analytics with billion rows at scale, meaning cohort queries over full user histories run in seconds rather than the minutes they'd take in a traditional data warehouse.
Tinybird for data scientists
For data scientists who want ClickHouse without running a cluster, Tinybird is managed ClickHouse with a Python SDK and HTTP API.
Connect your notebook via clickhouse-connect pointing at your Tinybird host, or use the Tinybird Python SDK to query endpoints. Features computed in Tinybird Pipes become parameterized HTTP endpoints your application can call, turning your feature engineering SQL into a production serving layer without rewriting anything. The same SQL you iterate on in a notebook becomes the endpoint your model server calls in production.
