Machine learning pipelines have a data infrastructure problem that sits between training and serving. Training works on static snapshots. Inference needs features computed over live data, at low latency, for every prediction request. The gap between how features are computed for training and how they are served in production is the source of most ML feature skew and a large portion of ML infrastructure complexity.
ClickHouse® fits this pipeline as the analytical layer that computes features over event streams fast enough for online serving. It stores the raw events your models depend on, materializes feature aggregations incrementally as data arrives, and serves them via parameterized SQL endpoints fast enough to use at inference time.
This post covers the feature engineering patterns, online feature store architecture, training data extraction, model monitoring, and A/B experiment analytics that ML teams build with ClickHouse.
Event log as feature source
ML features for behavioral models are typically aggregations over time windows: how many purchases in the last 7 days, average session duration in the last 30 days, time since last login. These are computations over an event log, which is exactly what ClickHouse is designed for.
The raw event table stores the full behavioral history:
CREATE TABLE user_events
(
event_time DateTime64(3),
event_type LowCardinality(String),
user_id String,
session_id String,
item_id Nullable(String),
category LowCardinality(Nullable(String)),
amount Nullable(Float64),
duration_ms Nullable(UInt32),
source LowCardinality(String),
properties String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_time)
ORDER BY (user_id, event_time);
Feature computation over this table uses ClickHouse's aggregation functions with conditional logic, running across billions of rows in seconds:
SELECT
user_id,
countIf(event_type = 'purchase') AS purchases_30d,
countIf(event_type = 'view') AS views_30d,
countIf(event_type = 'session_start') AS sessions_30d,
sumIf(amount, event_type = 'purchase') AS spend_30d,
avg(duration_ms) AS avg_session_ms_30d,
uniq(category) AS distinct_categories_30d,
dateDiff('day', max(event_time), now()) AS days_since_last_event,
dateDiff('day',
minIf(event_time, event_type = 'purchase'),
now()
) AS days_since_first_purchase
FROM user_events
WHERE user_id = {{ String(user_id, required=True) }}
AND event_time >= now() - INTERVAL 30 DAY
GROUP BY user_id;
This query, running against a multi-billion row event table with user_id first in the sort key, returns in milliseconds. It is fast enough to call at inference time.
Materialized feature store
For features needed by many models at high request rates, pre-materializing them into an AggregatingMergeTree avoids recomputing the aggregation on every inference request. This is the serverless online feature store pattern:
CREATE TABLE user_features_1d
(
day Date,
user_id String,
purchases UInt32,
views UInt32,
sessions UInt32,
spend Float64,
avg_session_ms AggregateFunction(avg, UInt32),
distinct_cats AggregateFunction(uniq, String)
)
ENGINE = AggregatingMergeTree
PARTITION BY toYYYYMM(day)
ORDER BY (user_id, day);
CREATE MATERIALIZED VIEW user_features_1d_mv TO user_features_1d AS
SELECT
toStartOfDay(event_time) AS day,
user_id,
countIf(event_type = 'purchase') AS purchases,
countIf(event_type = 'view') AS views,
countIf(event_type = 'session_start') AS sessions,
sumIf(amount, event_type = 'purchase') AS spend,
avgState(duration_ms) AS avg_session_ms,
uniqState(category) AS distinct_cats
FROM user_events
GROUP BY day, user_id;
Querying the 30-day feature window from the pre-aggregated table reads 30 rows per user instead of scanning all their events:
SELECT
user_id,
sum(purchases) AS purchases_30d,
sum(views) AS views_30d,
sum(sessions) AS sessions_30d,
sum(spend) AS spend_30d,
avgMerge(avg_session_ms) AS avg_session_ms_30d,
uniqMerge(distinct_cats) AS distinct_cats_30d
FROM user_features_1d
WHERE user_id = {{ String(user_id, required=True) }}
AND day >= today() - 30
GROUP BY user_id;
Training data extraction
For batch training jobs, ClickHouse exports labeled training datasets efficiently by reading only the columns needed for the feature set:
-- Point-in-time correct feature extraction for training
-- Each row is a label event with features computed as of that event's time
SELECT
e.user_id,
e.event_time AS label_time,
e.amount AS label_amount,
h.purchases_7d,
h.views_7d,
h.sessions_7d,
h.spend_7d,
h.days_since_last_session
FROM user_events AS e
INNER JOIN (
SELECT
user_id,
event_time AS as_of_time,
countIf(
event_type = 'purchase'
AND event_time >= event_time - INTERVAL 7 DAY
AND event_time < event_time
) AS purchases_7d,
countIf(
event_type = 'view'
AND event_time >= event_time - INTERVAL 7 DAY
) AS views_7d,
countIf(event_type = 'session_start') AS sessions_7d,
sumIf(amount, event_type = 'purchase') AS spend_7d,
dateDiff('day', max(event_time), event_time) AS days_since_last_session
FROM user_events
WHERE event_time >= today() - 90
GROUP BY user_id, event_time
) AS h ON e.user_id = h.user_id AND e.event_time = h.as_of_time
WHERE e.event_type = 'purchase'
AND e.event_time BETWEEN '2026-01-01' AND '2026-06-01'
ORDER BY e.event_time;
ClickHouse exports to Parquet, CSV, or JSON natively. For Python-based training pipelines:
# Using clickhouse-connect for training data export
import clickhouse_connect
import pandas as pd
client = clickhouse_connect.get_client(host='your-host', ...)
df = client.query_df("""
SELECT user_id, purchases_30d, spend_30d, sessions_30d, label
FROM training_features
WHERE created_date >= '2026-01-01'
""")
The ClickHouse Python client and SQL and Python integration patterns cover the connection setup, query streaming for large datasets, and integration with pandas and Arrow for the ML toolchain.
Model monitoring and drift detection
Once a model is in production, monitoring feature distributions over time detects training-serving skew and data drift before model performance degrades:
SELECT
toStartOfWeek(event_time) AS week,
avg(
JSONExtractFloat(properties, 'feature_purchases_30d')
) AS avg_purchases_feature,
stddevPop(
JSONExtractFloat(properties, 'feature_purchases_30d')
) AS stddev_purchases_feature,
quantile(0.5)(
JSONExtractFloat(properties, 'feature_spend_30d')
) AS median_spend_feature,
quantile(0.95)(
JSONExtractFloat(properties, 'prediction_score')
) AS p95_score,
avg(JSONExtractFloat(properties, 'prediction_score')) AS avg_score
FROM inference_log
WHERE model_name = {{ String(model_name, 'churn_v3') }}
AND event_time >= today() - 90
GROUP BY week
ORDER BY week;
Week-over-week drift in feature distributions, before model performance metrics are available, is an early warning signal. ClickHouse handles the volume of inference logs (millions of predictions per day) without requiring a separate monitoring database.
Batch scoring and prediction pipelines
For models that score all users nightly rather than scoring on demand, ClickHouse exports the full feature set for batch inference efficiently. Rather than scanning the raw event table for each user independently, the pre-aggregated feature table reads at most 30 rows per user:
INSERT INTO batch_scoring_input
SELECT
user_id,
now() AS scoring_time,
sum(purchases) AS purchases_30d,
sum(views) AS views_30d,
sum(sessions) AS sessions_30d,
sum(spend) AS spend_30d,
avgMerge(avg_session_ms) AS avg_session_ms_30d,
uniqMerge(distinct_cats) AS distinct_cats_30d
FROM user_features_1d
WHERE day >= today() - 30
GROUP BY user_id
HAVING purchases_30d > 0 OR sessions_30d >= 3;
Predictions write back as events to Tinybird via the Events API, and a second materialized view tracks the score distribution over time for drift monitoring. The cycle from feature extraction to score write-back is fully SQL-driven and runs against the same datasource that serves online features, with no separate offline pipeline.
Feature importance and correlation analysis
ClickHouse supports statistical functions that are useful for quick feature evaluation before expensive model training runs. Computing pairwise correlations between candidate features and a target variable:
SELECT
corr(purchases_30d, label_converted) AS purchases_corr,
corr(sessions_30d, label_converted) AS sessions_corr,
corr(spend_30d, label_converted) AS spend_corr,
corr(days_since_last_event, label_converted) AS recency_corr,
corr(distinct_categories_30d, label_converted) AS diversity_corr
FROM training_features
WHERE training_split = 'train'
AND created_date >= today() - 90;
For computing feature distributions by label class, useful for detecting features with good class separation:
SELECT
label_converted,
avg(purchases_30d) AS avg_purchases,
stddevPop(purchases_30d) AS stddev_purchases,
quantile(0.5)(spend_30d) AS median_spend,
quantile(0.95)(spend_30d) AS p95_spend,
avg(sessions_30d) AS avg_sessions
FROM training_features
WHERE created_date >= today() - 30
GROUP BY label_converted;
These queries run against billions of rows in ClickHouse in seconds, making rapid feature iteration practical. Testing a new feature hypothesis, computing its correlation with the target across the full training set, costs a single SQL query rather than a full training run.
A/B experiment analytics
For measuring ML model improvements in production via real-time personalization experiments:
SELECT
experiment_id,
variant,
count() AS exposures,
countIf(converted) AS conversions,
round(
countIf(converted) / count() * 100, 2
) AS conversion_rate_pct,
avg(revenue) AS avg_revenue_per_user,
quantile(0.95)(revenue) AS p95_revenue
FROM experiment_events
WHERE experiment_id = {{ String(experiment_id, required=True) }}
AND event_time >= today() - 14
GROUP BY experiment_id, variant
ORDER BY variant;
Tinybird for ML pipelines
ML engineering teams at product companies face a consistent infrastructure challenge: the feature computation that works in a Jupyter notebook does not serve at inference latency without significant engineering investment. Building a feature store from scratch, with online serving, offline training exports, and real-time ingestion, is a multi-month project.
Tinybird is managed ClickHouse that acts as the serverless online feature store for ML teams who want feature serving without feature store infrastructure. The materialized view patterns in this post run inside Tinybird datasources, computing features incrementally as events arrive. The feature query becomes a Tinybird Pipe, published as an HTTP endpoint your inference service calls with a user_id parameter.
For vector search and embedding workloads, Tinybird has added native vector similarity functions that make nearest-neighbor retrieval feasible inside SQL Pipes, enabling semantic retrieval and approximate matching alongside traditional analytical features in the same query.
For Python SDK integration, the Tinybird Python client handles authentication, query parameterization, and response parsing in a few lines of code. Training data exports run as standard SQL against Tinybird datasources, with results streamed to pandas DataFrames or Arrow tables for direct use in PyTorch, scikit-learn, or XGBoost pipelines. The same datasource that serves online features serves offline training exports, eliminating the training-serving skew that comes from maintaining two separate feature computation paths.
On model deployment, the feature endpoint and inference log datasource live in the same Tinybird workspace. The ML team defines features in SQL Pipes, the inference service calls those Pipes at prediction time, and the inference log writes back to Tinybird via the Events API. Monitoring queries run against the inference log. The entire ML data loop, from raw events through feature serving through model monitoring, stays in one platform.
Resend, processing 100TB per month on Tinybird, measured 62ms p90 query latency in production without caching. Feature endpoints for online inference run inside that latency envelope. Your model's feature fetch adds under 100ms to inference latency. Tinybird is SOC 2 Type II certified. The pipeline from raw events to a serving feature API takes hours to build rather than months.
The same data, twice
Most ML infrastructure complexity comes from having two paths through the same data: one for training, one for serving. Training reads from a data warehouse or object store. Serving reads from a Redis cache or a feature store. When the paths diverge, features drift. Debugging production model degradation means reconciling two separate computation environments.
ClickHouse collapses those two paths. The same materialized view that computes training features computes serving features. The same SQL that generates training exports generates online feature responses. There is no synchronization problem because there is no synchronization. The data is in one place. The queries run in one place. The gap between what the model was trained on and what it receives at inference time narrows to the materialization latency of the view, typically seconds.
