Kafka can produce millions of events per second, but those events only become useful once you can consume and query them. ClickHouse® is a popular database for analyzing Kafka topic streams, and there are several ways to get those streams into ClickHouse®: directly, through its built-in Kafka table engine, or through managed connectors such as Tinybird's.
This guide walks through complete examples of connecting Kafka to ClickHouse®, from basic table setup to production-ready streaming pipelines with materialized views and API endpoints.
What is the ClickHouse® Kafka engine?
The Kafka table engine is a built-in ClickHouse® feature that reads streaming data directly from Apache Kafka topics. It acts as a consumer that continuously pulls messages from Kafka and makes them queryable in ClickHouse® without needing separate ETL tools, batch loading scripts, or complex stream processors like Flink, which add unnecessary complexity for most teams.
Unlike batch ingestion, which loads data at scheduled intervals, the Kafka engine provides continuous data flow. Messages arrive in ClickHouse® as soon as they're published to Kafka, so your analytics reflect what's happening right now rather than what happened hours ago.
Here's how it works: you create a special table type that connects to your Kafka cluster and subscribes to one or more topics. When you query this table, ClickHouse® reads the latest messages from Kafka. The data isn't stored in this table permanently though, so you'll typically use a materialized view to move it into a MergeTree table for long-term storage.
Prerequisites for a Kafka-to-ClickHouse® pipeline
Before connecting Kafka to ClickHouse®, you'll want a few components in place. These requirements make sure your pipeline can establish connections, authenticate properly, and handle data flow between systems.
Kafka broker
You'll want a running Kafka instance with network accessibility from your ClickHouse® server. The Kafka broker handles message storage and delivery, and you'll want permissions to create topics and produce or consume messages from them.
ClickHouse® server or Tinybird workspace
You can use either a self-hosted ClickHouse® installation or a managed ClickHouse® service like Tinybird. Self-hosting gives you complete control but requires expertise in distributed systems, storage optimization, and performance tuning. Tinybird provides a managed ClickHouse® service that handles infrastructure automatically, letting you focus on building data pipelines rather than managing clusters.
Network and auth requirements
Your ClickHouse® server requires network access to your Kafka brokers, which might mean configuring firewall rules or security groups. You'll also want Kafka connection strings, authentication credentials (if your cluster uses SASL or SSL), and any consumer group configurations your organization requires.
Quickstart example: streaming JSON from Kafka to ClickHouse®
This example walks through the complete workflow for streaming JSON data from a Kafka topic into a ClickHouse® table using the Kafka table engine. You'll create a Kafka topic, define the necessary ClickHouse® tables, and verify that data flows correctly through the pipeline.
1. Create a Kafka topic
First, create a Kafka topic to hold your streaming data. This command creates a topic called user_events with a single partition:
kafka-topics.sh --create --topic user_events \
--bootstrap-server localhost:9092 \
--partitions 1 \
--replication-factor 1
2. Create the Kafka engine table
The Kafka engine table acts as a consumer that reads from your topic. This table definition specifies the Kafka broker address, topic name, consumer group, and message format:
CREATE TABLE user_events_kafka (
user_id String,
event_type String,
timestamp DateTime64(3),
properties String
)
ENGINE = Kafka
SETTINGS
kafka_broker_list = 'localhost:9092',
kafka_topic_list = 'user_events',
kafka_group_name = 'clickhouse_consumer',
kafka_format = 'JSONEachRow';
The JSONEachRow format expects one JSON object per line, which is how most Kafka producers send data.
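To see what the Kafka engine does with a JSONEachRow payload, here is a rough Python sketch that parses one JSON object per line into rows matching the table's columns (sample values are hypothetical; this mirrors the format, not ClickHouse®'s internals):

```python
import json

# Columns from the user_events_kafka table definition above
SCHEMA = ("user_id", "event_type", "timestamp", "properties")

def parse_json_each_row(raw):
    """Parse a JSONEachRow payload (one JSON object per line) into rows,
    roughly mirroring what kafka_format = 'JSONEachRow' expects."""
    rows = []
    for line in raw.strip().splitlines():
        obj = json.loads(line)
        rows.append(tuple(obj[col] for col in SCHEMA))
    return rows

raw = (
    '{"user_id":"user_123","event_type":"page_view",'
    '"timestamp":"2024-01-15 10:30:00.000","properties":"{}"}\n'
    '{"user_id":"user_456","event_type":"click",'
    '"timestamp":"2024-01-15 10:31:00.000","properties":"{}"}'
)
rows = parse_json_each_row(raw)
```

A message that isn't valid JSON, or that omits a schema field, would fail to parse here just as it would fail ingestion in ClickHouse® without settings like input_format_skip_unknown_fields.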
3. Create the target MergeTree table
Data from the Kafka engine table requires a permanent home. Create a MergeTree table with the same schema to store your events:
CREATE TABLE user_events (
user_id String,
event_type String,
timestamp DateTime64(3),
properties String
)
ENGINE = MergeTree()
ORDER BY (event_type, timestamp);
The ORDER BY clause determines how ClickHouse® sorts and stores data on disk, which affects query performance. Ordering by event_type and timestamp works well for queries that filter by event type and time range.
4. Insert sample messages
Push test messages to your Kafka topic using the Kafka console producer:
echo '{"user_id":"user_123","event_type":"page_view","timestamp":"2024-01-15 10:30:00.000","properties":"{}"}' | \
kafka-console-producer.sh --topic user_events --bootstrap-server localhost:9092
You can send multiple messages by repeating this command or by reading from a file containing one JSON object per line.
5. Verify the data
Query the Kafka engine table to see the latest messages. This query reads directly from Kafka without storing anything:
SELECT * FROM user_events_kafka LIMIT 5;
The Kafka engine table only shows messages that haven't been consumed yet, and reading from it advances the consumer's offset. To keep the data, attach a materialized view that copies each batch into the MergeTree table as it arrives (for example, CREATE MATERIALIZED VIEW user_events_mv TO user_events AS SELECT * FROM user_events_kafka). The managed approaches below handle this step for you.
Alternative approach: Stream with HTTP
If you don't need the full complexity of Kafka but still want streaming ingestion, Tinybird's Events API provides a lightweight HTTP-based alternative. Instead of managing Kafka brokers, topics, and consumer groups, you can stream data directly to ClickHouse® using standard HTTP POST requests.
The Events API works well for applications that generate events at moderate volumes (thousands to hundreds of thousands of events per second) where the overhead of maintaining Kafka infrastructure isn't justified. You get the same real-time analytics capabilities without the operational complexity.
To use the Events API, you define your schema with JSON path mappings, then send NDJSON (newline-delimited JSON) to the auto-generated endpoint:
curl -X POST 'https://api.tinybird.co/v0/events?name=user_events' \
-H "Authorization: Bearer YOUR_TOKEN" \
-d $'{"user_id": "123", "event_type": "page_view", "timestamp": "2024-01-15 10:30:00"}\n{"user_id": "456", "event_type": "click", "timestamp": "2024-01-15 10:31:00"}'
The Events API handles load balancing, batching, and optimized writes to ClickHouse® automatically. Your application code stays simple while Tinybird manages the complexities of efficient ingestion and merge optimization behind the scenes.
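The curl call above translates directly into application code. A minimal Python sketch using only the standard library, with the endpoint name and token taken from the example (YOUR_TOKEN is a placeholder, and send_events requires network access and a valid token to actually run):

```python
import json
import urllib.request

# Endpoint and token from the curl example above; the token is a placeholder
TINYBIRD_URL = "https://api.tinybird.co/v0/events?name=user_events"
TOKEN = "YOUR_TOKEN"

def build_ndjson(events):
    """Newline-delimited JSON: the body format the Events API expects."""
    return "\n".join(json.dumps(e) for e in events)

def send_events(events):
    """POST a batch of events to the Events API (needs a real token)."""
    req = urllib.request.Request(
        TINYBIRD_URL,
        data=build_ndjson(events).encode("utf-8"),
        headers={"Authorization": f"Bearer {TOKEN}"},
        method="POST",
    )
    return urllib.request.urlopen(req)

body = build_ndjson([
    {"user_id": "123", "event_type": "page_view", "timestamp": "2024-01-15 10:30:00"},
    {"user_id": "456", "event_type": "click", "timestamp": "2024-01-15 10:31:00"},
])
```

Batching several events per request, as build_ndjson does, is generally cheaper than one HTTP call per event.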
Alternative approach: Managed ingestion with Tinybird
Tinybird offers an alternative approach that removes infrastructure management while providing the same Kafka integration capabilities. You define your data sources and streaming connectors as code, then deploy them to a managed ClickHouse® environment.
1. Install Tinybird CLI
Install the Tinybird CLI and authenticate with your account. If you don't have an account yet, the login process will prompt you to create one:
curl -L tinybird.co | sh
tb login
2. Create a Kafka connection
Before creating a data source, you need to establish a connection to your Kafka cluster. The CLI provides a guided process that prompts you for the necessary connection details:
tb connection create kafka
This command starts an interactive wizard that asks for:
- Connection name: a friendly identifier for your Kafka cluster (e.g., my_kafka_connection)
- Bootstrap servers: your Kafka broker addresses and ports (e.g., localhost:9092 or broker1:9092,broker2:9092)
- Security protocol: choose PLAINTEXT for local development or SASL_SSL for production clusters with authentication
- SASL mechanism: if using SASL_SSL, select PLAIN, SCRAM-SHA-256, SCRAM-SHA-512, or OAUTHBEARER
- Authentication credentials: for SASL/PLAIN or SCRAM, provide your Kafka key (username) and secret (password)
The CLI creates a .connection file in your project's connections/ directory that looks like this:
TYPE kafka
KAFKA_BOOTSTRAP_SERVERS {{ tb_secret("KAFKA_PROD_SERVER", "localhost:9092") }}
KAFKA_SECURITY_PROTOCOL {{ tb_secret("KAFKA_PROD_SECURITY_PROTOCOL", "PLAINTEXT") }}
KAFKA_SASL_MECHANISM PLAIN
KAFKA_KEY {{ tb_secret("KAFKA_KEY", "key") }}
KAFKA_SECRET {{ tb_secret("KAFKA_SECRET", "secret") }}
The tb_secret() function references secrets stored securely in Tinybird rather than hardcoding credentials in your config files. You can set secrets using:
tb secret set KAFKA_KEY your_kafka_username
tb secret set KAFKA_SECRET your_kafka_password
Before deploying, validate your connection configuration locally:
tb deploy --check
This command verifies that Tinybird can reach your Kafka brokers and authenticate successfully without actually deploying anything.
3. Define the data source file
Create a .datasource file that specifies your Kafka connection details and schema. This file defines how Tinybird connects to Kafka and what the incoming data looks like:
SCHEMA >
`user_id` String `json:$.user_id`,
`event_type` String `json:$.event_type`,
`timestamp` DateTime64(3) `json:$.timestamp`,
`properties` String `json:$.properties`
ENGINE "MergeTree"
ENGINE_SORTING_KEY "event_type, timestamp"
KAFKA_CONNECTION_NAME my_kafka_connection
KAFKA_TOPIC user_events
KAFKA_GROUP_ID tinybird_consumer
KAFKA_AUTO_OFFSET_RESET earliest
The schema maps JSON fields from your Kafka messages to ClickHouse® columns using the json:$.field_name syntax. The KAFKA_CONNECTION_NAME must match the name you used when creating the connection in step 2.
The Tinybird syntax abstracts away some complexity you'd encounter with raw ClickHouse® SQL while still giving you control over table engines and sorting keys.
4. Deploy the data source
Use tb deploy to deploy your data source configuration to Tinybird's managed infrastructure. This command creates the necessary tables and starts consuming from your Kafka topic:
# Deploy to cloud
tb --cloud deploy
# Drop --cloud flag to deploy to Tinybird Local
Tinybird handles the underlying ClickHouse® cluster, materialized views, and consumer management automatically. You don't manually create Kafka engine tables or set up materialized views; instead, Tinybird generates the appropriate ClickHouse® objects based on your .datasource file.
5. Query and verify data
Test your data pipeline by querying it through the Tinybird CLI. This command runs queries against your Tinybird workspace and returns results in your terminal:
tb sql "SELECT event_type, count() as event_count FROM user_events GROUP BY event_type"
The Tinybird local dev experience provides a faster feedback loop than connecting to ClickHouse® directly, especially when you're iterating on queries or troubleshooting data issues. You can also monitor Kafka ingestion status using the service data sources:
tb sql "SELECT * FROM tinybird.kafka_ops_log WHERE datasource_id = 'user_events' ORDER BY timestamp DESC LIMIT 10"
This query shows recent Kafka operations including lag, processed messages, and any errors or warnings.
Creating a materialized view to parse Kafka data
In Tinybird, materialized views use pipes instead of raw ClickHouse® CREATE MATERIALIZED VIEW statements. A materialized pipe transforms data as it arrives and writes it to a destination data source, handling the entire ingestion-to-storage workflow.
You need two components: a destination data source and a materialized pipe that populates it.
Define the destination data source
First, create a .datasource file for your target table where transformed data will be stored:
DESCRIPTION >
Parsed user events with extracted properties
SCHEMA >
`user_id` String,
`event_type` String,
`timestamp` DateTime64(3),
`page_url` String,
`session_duration` Int32
ENGINE "MergeTree"
ENGINE_PARTITION_KEY "toYYYYMM(timestamp)"
ENGINE_SORTING_KEY "event_type, timestamp"
Save this as datasources/user_events_parsed.datasource. The partition key organizes data by month, which speeds up time-based queries by letting ClickHouse® skip entire months of data when filtering by timestamp.
Create the materialized pipe
Next, create a .pipe file that defines how to transform data from your Kafka source to the destination:
DESCRIPTION >
Materialized pipe to parse JSON properties from Kafka events
NODE parse_events
SQL >
SELECT
user_id,
event_type,
timestamp,
JSONExtractString(properties, 'page_url') as page_url,
JSONExtractInt(properties, 'session_duration') as session_duration
FROM user_events
TYPE materialized
DATASOURCE user_events_parsed
Save this as pipes/user_events_mv.pipe. The materialized pipe runs automatically as new data arrives in the source user_events data source. Transforming data during ingestion means you parse JSON once when it arrives, not every time you query it.
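The transformation the pipe performs can be mimicked in plain Python to see what JSONExtractString and JSONExtractInt pull out of the properties column (sample values are hypothetical):

```python
import json

def parse_event(row):
    """Mimic the pipe's SELECT: extract page_url and session_duration from
    the JSON-encoded properties string, with the same defaults ClickHouse®'s
    JSONExtractString ('') and JSONExtractInt (0) use for missing keys."""
    props = json.loads(row["properties"] or "{}")
    return {
        "user_id": row["user_id"],
        "event_type": row["event_type"],
        "timestamp": row["timestamp"],
        "page_url": str(props.get("page_url", "")),
        "session_duration": int(props.get("session_duration", 0)),
    }

event = parse_event({
    "user_id": "user_123",
    "event_type": "page_view",
    "timestamp": "2024-01-15 10:30:00.000",
    "properties": '{"page_url": "/pricing", "session_duration": 42}',
})
```

In the pipeline this parsing happens once per row at ingestion time rather than on every query, which is the point of materializing the transformation.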
Deploy both resources
Deploy the destination data source and materialized pipe together:
tb deploy
Tinybird creates the target table and starts the materialization process automatically. The materialized pipe continuously processes new data from your Kafka stream and writes transformed results to user_events_parsed.
Incremental aggregations with state functions
For pre-aggregated metrics, use AggregatingMergeTree with state functions to maintain incremental aggregates. Define the destination data source with aggregate function columns:
SCHEMA >
`hour` DateTime,
`event_type` String,
`event_count` AggregateFunction(count, UInt64),
`unique_users` AggregateFunction(uniq, String)
ENGINE "AggregatingMergeTree"
ENGINE_PARTITION_KEY "toYYYYMM(hour)"
ENGINE_SORTING_KEY "hour, event_type"
Then create a materialized pipe that uses state functions:
NODE hourly_aggregation
SQL >
SELECT
toStartOfHour(timestamp) AS hour,
event_type,
countState() AS event_count,
uniqState(user_id) AS unique_users
FROM user_events
GROUP BY hour, event_type
TYPE materialized
DATASOURCE user_events_hourly
Query the aggregated data using merge functions to combine the pre-computed states:
SELECT
hour,
event_type,
countMerge(event_count) as total_events,
uniqMerge(unique_users) as unique_users
FROM user_events_hourly
GROUP BY hour, event_type
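The state/merge split can be sketched in Python: each inserted block produces a partial state, and querying merges states across blocks, which is what countMerge and uniqMerge do with the stored countState and uniqState values. (ClickHouse®'s uniq is approximate; an exact set is used here for illustration.)

```python
def agg_state(events):
    """Build a partial aggregate state for one inserted block:
    (event count, set of distinct user_ids) -- like countState/uniqState."""
    return (len(events), {e["user_id"] for e in events})

def agg_merge(states):
    """Combine partial states at query time -- like countMerge/uniqMerge."""
    total, users = 0, set()
    for count, seen in states:
        total += count
        users |= seen
    return total, len(users)

# Two inserted blocks for the same (hour, event_type) group
block1 = agg_state([{"user_id": "a"}, {"user_id": "b"}])
block2 = agg_state([{"user_id": "b"}, {"user_id": "c"}])
total_events, unique_users = agg_merge([block1, block2])
```

Note that unique_users is 3, not 4: states merge correctly across blocks, which is why plain counts can't be summed for distinct-count metrics but states can.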
Exposing real-time queries as RESTful APIs
Once data flows from Kafka into ClickHouse®, you'll want to query it from your applications. Tinybird turns SQL queries into HTTP API endpoints with authentication and parameter handling built in.
Create a .pipe file that defines your query with parameters. This example creates an API that returns event counts filtered by event type and time range:
TOKEN events_api_read READ
NODE event_counts
SQL >
%
SELECT
event_type,
count() as event_count
FROM user_events
WHERE timestamp >= {{DateTime(start_time, '2024-01-01 00:00:00')}}
AND timestamp < {{DateTime(end_time, '2024-01-02 00:00:00')}}
{% if defined(event_type) %}
AND event_type = {{String(event_type)}}
{% end %}
GROUP BY event_type
ORDER BY event_count DESC
TYPE endpoint
The double curly braces define query parameters with type validation and default values. The if defined() syntax makes parameters optional, which gives API consumers flexibility in how they filter data.
Deploy your pipe to get a hosted API endpoint:
tb --cloud deploy
After deploying, you'll get a URL like https://api.tinybird.co/v0/pipes/event_counts.json?start_time=2024-01-15T00:00:00&token=p.ey.... The token authenticates requests without requiring you to set up OAuth, API gateways, or other authentication infrastructure.
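From application code, calling the endpoint is a plain HTTP GET. A small Python sketch that builds the request URL (the token value is hypothetical; parameter names must match the pipe's placeholders):

```python
from urllib.parse import urlencode

# Endpoint URL shape from the deploy output above
BASE = "https://api.tinybird.co/v0/pipes/event_counts.json"

def endpoint_url(token, **params):
    """Build the endpoint URL; keyword names must match the pipe's
    {{DateTime(...)}} / {{String(...)}} parameter names."""
    query = {**params, "token": token}
    return f"{BASE}?{urlencode(query)}"

url = endpoint_url(
    "p.ey_example",  # hypothetical token
    start_time="2024-01-15 00:00:00",
    end_time="2024-01-16 00:00:00",
    event_type="page_view",
)
```

Omitting event_type from the call simply drops that filter, thanks to the if defined() block in the pipe.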
ClickHouse® Apache Kafka integration best practices
Running Kafka to ClickHouse® pipelines in production requires attention to batching, parallelism, and monitoring. These settings affect both throughput and reliability.
ClickHouse® batches messages before inserting them into tables, which amortizes the overhead of each insert operation. The kafka_max_block_size setting controls batch size:
ENGINE = Kafka
SETTINGS
kafka_broker_list = 'localhost:9092',
kafka_topic_list = 'user_events',
kafka_group_name = 'clickhouse_consumer',
kafka_format = 'JSONEachRow',
kafka_max_block_size = 10000,
kafka_poll_timeout_ms = 5000;
Larger batches improve throughput but increase latency—messages wait longer before being inserted. A kafka_max_block_size of 10,000 messages with a 5-second poll timeout balances these tradeoffs for most use cases.
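The latency side of the tradeoff is easy to estimate: a block flushes when it fills or when the poll timeout fires, whichever comes first. A back-of-envelope sketch with hypothetical message rates:

```python
def worst_case_flush_seconds(max_block_size, poll_timeout_ms, msgs_per_second):
    """Estimate how long messages wait before a block is flushed:
    the block fills (max_block_size / rate) or the poll timeout elapses,
    whichever happens first."""
    time_to_fill = max_block_size / msgs_per_second
    return min(time_to_fill, poll_timeout_ms / 1000)

# With the settings above and a hypothetical 50k msg/s topic,
# blocks fill in 0.2s -- well under the 5s timeout.
fast = worst_case_flush_seconds(10_000, 5_000, 50_000)

# On a quiet topic (100 msg/s), the 5s timeout bounds the wait instead.
slow = worst_case_flush_seconds(10_000, 5_000, 100)
```

In other words, the poll timeout caps latency on quiet topics, while block size governs throughput on busy ones.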
ClickHouse® can run multiple consumer threads to read from different Kafka partitions simultaneously. Set kafka_num_consumers to match the number of partitions in your Kafka topic:
ENGINE = Kafka
SETTINGS
kafka_num_consumers = 4;
More consumers than partitions provides no benefit, since Kafka assigns one partition per consumer within a group.
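This assignment rule is why extra consumers sit idle. A rough Python sketch of distributing partitions across a consumer group (the real assignment strategy is Kafka's, but the arithmetic is the same):

```python
def assign_partitions(num_partitions, num_consumers):
    """Spread partitions round-robin across consumers; each partition
    goes to exactly one consumer, so consumers beyond the partition
    count receive nothing."""
    assignment = {c: [] for c in range(num_consumers)}
    for p in range(num_partitions):
        assignment[p % num_consumers].append(p)
    return assignment

# 4 partitions, 4 consumers: one partition each (the recommended setup).
balanced = assign_partitions(4, 4)

# 4 partitions, 6 consumers: two consumers are assigned nothing.
overprovisioned = assign_partitions(4, 6)
idle = [c for c, parts in overprovisioned.items() if not parts]
```

To scale beyond the current consumer count, you'd increase the topic's partition count first, then raise kafka_num_consumers to match.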
Consumer lag measures how far behind your ClickHouse® consumers are from the latest messages in Kafka. High lag indicates your consumers can't keep up with the message rate. Monitor lag using Kafka's consumer group commands:
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
--group clickhouse_consumer \
--describe
This shows the current offset, log end offset, and lag for each partition. If lag grows consistently, you're not processing messages fast enough and risk falling further behind over time. For insights on managing high-throughput ingestion and backpressure in production, see our experience operating large ClickHouse® clusters.
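Lag per partition is simply the log end offset minus the committed offset; summing across partitions gives the backlog for the whole consumer group. A small sketch with hypothetical offsets of the kind the describe command reports:

```python
def total_lag(partitions):
    """Sum per-partition lag: log end offset minus committed offset."""
    return sum(end - committed for committed, end in partitions)

# (current offset, log end offset) per partition -- hypothetical values
# as reported by kafka-consumer-groups.sh --describe
sample = [(1_000, 1_050), (980, 1_200), (1_100, 1_100)]
lag = total_lag(sample)
```

Tracking this number over time is what matters: a steady value means consumers are keeping pace, while a growing one means throughput needs to increase.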
Why use a managed ClickHouse® service like Tinybird
Self-hosting ClickHouse® gives you complete control but requires expertise in distributed systems, storage optimization, and performance tuning. You'll handle cluster setup, replication configuration, backup strategies, and scaling as your data grows.
Tinybird abstracts infrastructure concerns by providing a fully managed ClickHouse® platform designed for developers who want to integrate real-time analytics into their applications. The platform handles infrastructure scaling, query optimization, and operational monitoring automatically.
Beyond managed infrastructure, Tinybird adds developer-focused tooling:
- CLI for local development: Test queries and pipelines on your machine before deploying to production
- Built-in API generation: Turn SQL queries into authenticated HTTP endpoints without writing backend code
- Version control integration: Define data pipelines as code and deploy them through CI/CD workflows
If you're building a product that requires real-time analytics powered by ClickHouse®, sign up for a free Tinybird account and deploy your first Kafka pipeline in minutes rather than configuring clusters and infrastructure yourself.
Frequently asked questions about Kafka to ClickHouse®
Can ClickHouse® Kafka engine handle Avro or Protobuf formats?
Yes, ClickHouse® supports multiple serialization formats beyond JSON, including Avro and Protobuf. You specify the format in the kafka_format setting when creating your Kafka engine table. For Avro, use kafka_format = 'Avro' (messages carry their own schema) or 'AvroConfluent' to resolve schemas from a schema registry. For Protobuf, use kafka_format = 'Protobuf' along with a format schema file that defines your message structure.
How are Kafka partitions distributed across ClickHouse® consumers?
The best practice is to set kafka_num_consumers equal to the number of partitions in your Kafka topic. Kafka's consumer group protocol assigns partitions to consumers automatically, ensuring each partition is read by exactly one consumer. If you have fewer consumers than partitions, some consumers will handle multiple partitions. If you have more consumers than partitions, the extra consumers will sit idle.
What causes Kafka to ClickHouse® pipeline failures?
Network timeouts between ClickHouse® and Kafka brokers are the most frequent issue, especially in cloud environments with restrictive security groups. Schema mismatches cause failures when message formats don't match your table definition; use input_format_skip_unknown_fields to handle extra fields gracefully. Insufficient consumer capacity leads to growing lag and eventual pipeline stalls when message rates exceed processing throughput.
