ClickHouse® + Elasticsearch — 3 Ways to Connect
These are the main options for building an Elasticsearch to ClickHouse® pipeline:
- Tinybird
- ClickHouse® Cloud + ClickPipes (Kafka source)
- Self-managed or custom (Kafka bridge, Logstash, or batch via scroll API)
Elasticsearch is a distributed search and analytics engine used for full-text search, log analytics, observability, and APM. Many teams want a copy of that data in ClickHouse® for columnar analytical queries, reporting, and real-time dashboards at scale.
An Elasticsearch to ClickHouse® setup uses Logstash, Kafka bridging, or the Elasticsearch scroll API to move data into ClickHouse® for real-time analytics without putting analytical load on Elasticsearch.
Below we compare all three options in depth—architecture, real configuration examples, trade-offs, and when to use each.
Looking for minimal ops and instant APIs?
Tinybird combines managed ingestion from your Logstash→Kafka or Logstash→HTTP path, managed ClickHouse®, and one-click API publishing from SQL—no Kafka engine to operate yourself.
Three ways to integrate Elasticsearch with ClickHouse®
Below are the three options to connect Elasticsearch to ClickHouse®, in order.
Option 1: Tinybird — managed ClickHouse® with API layer
Tinybird is a real-time data platform built on ClickHouse®. It combines ingestion, storage, and API publishing in one product.
How it works: configure a Logstash pipeline with an HTTP or Kafka output that forwards documents to Tinybird. The http output sends documents directly to the Tinybird Events API. The kafka output writes to a topic consumed by the Tinybird Kafka connector.
Data lands in Tinybird's ClickHouse®-backed data sources. You define Pipes (SQL) and publish them as REST endpoints.
Logstash pipeline → Tinybird Events API:
```conf
input {
  elasticsearch {
    hosts => ["https://localhost:9200"]
    index => "logs-*"
    user => "elastic"
    password => "${ES_PASSWORD}"
    schedule => "* * * * *"
    query => '{"query": {"range": {"@timestamp": {"gte": "now-1m"}}}}'
  }
}

filter {
  mutate {
    rename => { "@timestamp" => "ts" }
    remove_field => ["@version", "tags"]
  }
}

output {
  http {
    url => "https://api.tinybird.co/v0/events?name=es_logs"
    http_method => "post"
    headers => { "Authorization" => "Bearer ${TB_TOKEN}" }
    format => "json_batch"
    codec => "json"
  }
}
```
When Tinybird fits:
- You want to get Elasticsearch data into ClickHouse® with minimal ops
- You need APIs and real-time dashboards from the same data
- You prefer an Elasticsearch to ClickHouse® pipeline with an API layer built in
Prerequisites: Logstash 8.x with the elasticsearch input and either http or kafka output. Data flows: Elasticsearch → Logstash → Kafka or HTTP → Tinybird. You run Elasticsearch and Logstash; Tinybird runs ingestion, storage, and API publishing.
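If you want to smoke-test the Events API leg without standing up Logstash, the same path can be exercised in a few lines of Python. This is a minimal sketch, not a production ingester: the `es_logs` data source name and the `TB_TOKEN` environment variable are assumptions matching the config above, and the Events API accepts newline-delimited JSON.

```python
import json
import os
import urllib.request

def to_ndjson(rows):
    """Serialize a batch of dicts as newline-delimited JSON, the shape the Events API ingests."""
    return "\n".join(json.dumps(r) for r in rows).encode("utf-8")

def send_to_tinybird(rows, datasource="es_logs"):
    # Assumes TB_TOKEN holds a Tinybird token with append scope for the data source
    req = urllib.request.Request(
        f"https://api.tinybird.co/v0/events?name={datasource}",
        data=to_ndjson(rows),
        headers={"Authorization": f"Bearer {os.environ['TB_TOKEN']}"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

if __name__ == "__main__":
    batch = [{"ts": "2026-01-01 00:00:00", "level": "info", "message": "hello"}]
    print(to_ndjson(batch).decode())
```

In production you would batch a few hundred rows per request rather than posting one document at a time.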
Option 2: ClickHouse® Cloud + ClickPipes (Kafka source)
ClickHouse® Cloud's ClickPipes supports Kafka as a data source. There is no native Elasticsearch connector.
How it works: configure a Logstash pipeline with a Kafka output that writes documents to a topic. Then create a Kafka ClickPipe in the ClickHouse® Cloud console pointing to your broker and topic.
Logstash pipeline → Kafka:
```conf
input {
  elasticsearch {
    hosts => ["https://localhost:9200"]
    index => "logs-*"
    schedule => "* * * * *"
  }
}

output {
  kafka {
    bootstrap_servers => "kafka:9092"
    topic_id => "es-logs"
    codec => "json"
  }
}
```
ClickHouse® Cloud destination table:
```sql
CREATE TABLE es_logs
(
    ts DateTime,
    index_name LowCardinality(String),
    level LowCardinality(String),
    service LowCardinality(String),
    message String,
    doc_id String
)
ENGINE = ReplacingMergeTree()
PARTITION BY toYYYYMM(ts)
ORDER BY (index_name, level, ts, doc_id);
```
When it fits:
- You want managed ClickHouse® with a Kafka-based ingestion path
- You're already on ClickHouse® Cloud and can configure Logstash→Kafka
- You'll add your own API or BI layer on top of ClickHouse® Cloud
Prerequisites: Logstash with a Kafka output. Data flows: Elasticsearch → Logstash → Kafka → ClickPipes → ClickHouse® Cloud.
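If Logstash is not an option, the Elasticsearch → Kafka leg can also be scripted. A sketch, assuming the kafka-python package, the broker address, and the topic name from the config above; the message-shaping function is the part the ClickPipe schema depends on, so the producer call is isolated behind a function you only call when a broker is reachable:

```python
import json

def doc_to_message(doc):
    """Flatten an Elasticsearch hit into the JSON message the Kafka ClickPipe consumes."""
    src = doc.get("_source", {})
    return json.dumps({
        "ts": src.get("@timestamp"),
        "index_name": doc.get("_index", ""),
        "level": src.get("level", ""),
        "service": src.get("service", ""),
        "message": src.get("message", ""),
        "doc_id": doc.get("_id", ""),
    }).encode("utf-8")

def publish(docs, topic="es-logs", bootstrap="kafka:9092"):
    # Requires the kafka-python package and a reachable broker
    from kafka import KafkaProducer
    producer = KafkaProducer(bootstrap_servers=bootstrap)
    for doc in docs:
        producer.send(topic, doc_to_message(doc))
    producer.flush()
```

Field names here must match the ClickHouse® Cloud destination table column for column, since ClickPipes maps JSON keys to columns.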
Option 3: Self-managed or custom (Kafka, Logstash, or batch scroll API)
With self-managed ClickHouse®, the streaming pattern uses Logstash → Kafka → Kafka table engine → materialized view → MergeTree.
ClickHouse® Kafka table engine + materialized view:
```sql
-- Kafka engine reads from the topic
CREATE TABLE es_logs_kafka
(
    ts DateTime,
    index_name String,
    level String,
    service String,
    message String,
    doc_id String
)
ENGINE = Kafka
SETTINGS
    kafka_broker_list = 'kafka:9092',
    kafka_topic_list = 'es-logs',
    kafka_group_name = 'clickhouse_es_consumer',
    kafka_format = 'JSONEachRow';

-- Target MergeTree table
CREATE TABLE es_logs
(
    ts DateTime,
    index_name LowCardinality(String),
    level LowCardinality(String),
    service LowCardinality(String),
    message String,
    doc_id String
)
ENGINE = ReplacingMergeTree()
PARTITION BY toYYYYMM(ts)
ORDER BY (index_name, level, ts, doc_id);

-- Materialized view moves rows from the Kafka engine into MergeTree
CREATE MATERIALIZED VIEW es_logs_mv TO es_logs AS
SELECT * FROM es_logs_kafka;
```
Batch sync via the scroll API (for initial load or periodic sync):
```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan
import clickhouse_driver

es = Elasticsearch("https://localhost:9200", basic_auth=("elastic", "password"))
ch = clickhouse_driver.Client("localhost")

rows = []
for doc in scan(es, index="logs-*", query={"query": {"match_all": {}}}):
    src = doc["_source"]
    rows.append({
        "ts": src.get("@timestamp"),
        "index_name": doc["_index"],
        "level": src.get("level", ""),
        "service": src.get("service", ""),
        "message": src.get("message", ""),
        "doc_id": doc["_id"],
    })
    if len(rows) >= 1000:  # flush in batches of 1000
        ch.execute("INSERT INTO es_logs VALUES", rows)
        rows.clear()

if rows:  # flush the remainder
    ch.execute("INSERT INTO es_logs VALUES", rows)
```
When it fits:
- You already run ClickHouse® and Kafka and want full control
- You have data-engineering capacity to manage the full stack
- Batch scroll export works when sub-minute freshness is not required
Decision framework: which option fits your situation
The right choice depends on four variables: freshness requirement, team capacity, whether you need an API layer, and cost tolerance.
| Situation | Recommended option |
|---|---|
| Real-time analytics, need REST APIs, minimal ops | Tinybird |
| Already on ClickHouse® Cloud, have Kafka + Logstash | ClickPipes + Kafka |
| Self-managed ClickHouse® + Kafka, full control | Self-managed |
| One-time migration or periodic batch sync | Batch scroll API |
| Observability platform with long-retention analytics | Tinybird or self-managed |
Choose Tinybird when you need real-time data ingestion from Elasticsearch and REST APIs from the same data—without operating ClickHouse® infrastructure.
Choose ClickPipes when you're already on ClickHouse® Cloud and have Logstash→Kafka running. You get managed ClickHouse® ingestion but build your own serving layer.
Choose self-managed when your team is comfortable operating Logstash, Kafka, and ClickHouse® and needs full schema and configuration control.
Summary table
| Option | Ingestion path | API layer | Ops burden | Kafka required |
|---|---|---|---|---|
| Tinybird | Logstash HTTP or Kafka | Built in (Pipes) | Low | No |
| ClickHouse® Cloud ClickPipes | Logstash → Kafka | Build your own | Medium | Yes |
| Self-managed | Logstash → Kafka or batch scroll | Build your own | High | Depends |
What is Elasticsearch and why integrate it with ClickHouse®?
Elasticsearch as the data source
Elasticsearch is a distributed search engine built on Lucene. It provides an HTTP REST API, inverted-index storage, and a flexible JSON document model. Teams use it for full-text search, log ingestion, observability (APM traces, metrics), and Kibana dashboards.
For analytical workloads at scale—long-retention analytics, aggregations across millions of documents, or user-facing analytics—running heavy Elasticsearch aggregations becomes expensive and impacts search performance.
Sending a copy of that data to ClickHouse® gives you a dedicated analytical store: columnar, optimized for large scans and low latency aggregations, without affecting Elasticsearch query performance.
How to get data out of Elasticsearch
Elasticsearch does not have a native ClickHouse® output. Data moves out via Logstash pipelines (streaming) or the scroll API (batch).
The elasticsearch input plugin supports scheduled queries with query DSL. The kafka output writes documents to Kafka topics. The http output sends documents to HTTP endpoints (e.g. Tinybird Events API).
For batch: the scroll API or point-in-time + search_after supports paginated export by time range. The Python elasticsearch.helpers.scan() helper abstracts the scroll protocol, processing documents in configurable batch sizes.
Why route Elasticsearch data to ClickHouse®
An Elasticsearch to ClickHouse® pipeline lets you run analytical queries over Elasticsearch data at scale without impacting search performance or scaling Elasticsearch for analytical loads.
ClickHouse®'s columnar storage achieves 10x–100x compression compared to Elasticsearch's inverted-index model for typical log data. Aggregation queries that take seconds in Elasticsearch often run in milliseconds in ClickHouse®.
Schema and pipeline design
Mapping Elasticsearch documents to ClickHouse® tables
Elasticsearch document data typically includes @timestamp, _index, level, service, message, and nested JSON fields.
A well-designed ClickHouse® target schema:
```sql
CREATE TABLE es_logs
(
    ts DateTime,
    index_name LowCardinality(String),  -- Elasticsearch index name
    level LowCardinality(String),       -- log/error/warn
    service LowCardinality(String),     -- source service
    message String,
    doc_id String,                      -- Elasticsearch _id for dedup
    extra String                        -- JSON for variable fields
)
ENGINE = ReplacingMergeTree()
PARTITION BY toYYYYMM(ts)
ORDER BY (index_name, service, level, ts, doc_id);
```
Use LowCardinality(String) for low-cardinality fields (level, index, service) to improve compression and filtering speed. Use ReplacingMergeTree with doc_id for deduplication.
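ReplacingMergeTree's collapse behavior is easy to misread, so here is a small Python sketch of what a SELECT ... FINAL read returns when there is no version column: among rows sharing the same ORDER BY key, the last-inserted row wins. This only simulates the semantics; real deduplication happens inside ClickHouse® during background merges.

```python
def replacing_final(rows, key_cols):
    """Simulate SELECT ... FINAL on a ReplacingMergeTree without a version column:
    for duplicate ORDER BY keys, the last-inserted row survives."""
    latest = {}
    for row in rows:  # insertion order stands in for ClickHouse part order
        latest[tuple(row[c] for c in key_cols)] = row
    return list(latest.values())

rows = [
    {"index_name": "logs", "level": "error", "ts": "t1", "doc_id": "a", "message": "v1"},
    {"index_name": "logs", "level": "error", "ts": "t1", "doc_id": "a", "message": "v2"},  # retry duplicate
    {"index_name": "logs", "level": "info", "ts": "t2", "doc_id": "b", "message": "ok"},
]
deduped = replacing_final(rows, ["index_name", "level", "ts", "doc_id"])
```

The retried document collapses to a single row, which is why including doc_id in the ORDER BY key makes at-least-once delivery safe.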
Handling large indices and schema drift
For batch scroll of large indices, chunk by time range using search_after with a point-in-time (PIT):
```python
# Open a point-in-time so pagination sees a consistent snapshot
pit = es.open_point_in_time(index="logs-*", keep_alive="5m")
pit_id = pit["id"]

search_after = None
while True:
    body = {
        "query": {"range": {"@timestamp": {"gte": "2026-01-01", "lt": "2026-02-01"}}},
        "sort": [{"@timestamp": "asc"}, {"_id": "asc"}],
        "size": 1000,
        "pit": {"id": pit_id, "keep_alive": "5m"},
    }
    if search_after:
        body["search_after"] = search_after
    result = es.search(body=body)
    hits = result["hits"]["hits"]
    if not hits:
        break
    # process hits...
    search_after = hits[-1]["sort"]

# Release the point-in-time when done
es.close_point_in_time(id=pit_id)
```
Schema drift: new Elasticsearch fields require updating the ClickHouse® schema. Plan for nullable columns (Nullable(String)) for optional fields, or a catch-all extra String JSON column.
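One way to make the catch-all extra column concrete: split each document into the fixed columns the table knows about and serialize everything else as a JSON string. A sketch; the KNOWN field set is an assumption matching the schema above, and new fields survive in extra until you promote them to real columns.

```python
import json

# Fields with dedicated columns in the es_logs schema (assumption)
KNOWN = {"@timestamp", "level", "service", "message"}

def split_row(doc):
    """Map known fields to typed columns; stash unanticipated fields in `extra` as JSON."""
    src = doc["_source"]
    return {
        "ts": src.get("@timestamp"),
        "index_name": doc["_index"],
        "level": src.get("level", ""),
        "service": src.get("service", ""),
        "message": src.get("message", ""),
        "doc_id": doc["_id"],
        "extra": json.dumps({k: v for k, v in src.items() if k not in KNOWN}),
    }
```

At query time the stashed fields stay reachable via ClickHouse® JSON functions such as JSONExtractString(extra, 'trace_id').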
Failure modes to plan for
- Duplicates from retries: Logstash at-least-once delivery can re-send documents. Use _id as the deduplication key with ReplacingMergeTree.
- Pipeline lag: monitor Kafka consumer lag and define a freshness SLA. Alert when lag exceeds your threshold.
- Large index scans: for batch scroll, avoid unbounded queries. Always filter by time range and use search_after + PIT for large indices.
- Elasticsearch query load: Logstash input queries add load to Elasticsearch. Schedule during off-peak hours or use a dedicated replica node.
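The freshness-SLA check above reduces to comparing the newest ts in ClickHouse® against the wall clock. A sketch of the decision logic only; the 120-second SLA and the query that would fetch max_ts are assumptions:

```python
from datetime import datetime, timedelta, timezone

def freshness_breach(max_ts, sla_seconds=120, now=None):
    """Return True when the newest ingested row is older than the freshness SLA."""
    now = now or datetime.now(timezone.utc)
    return (now - max_ts) > timedelta(seconds=sla_seconds)

# In production max_ts would come from something like:
#   clickhouse_driver.Client("localhost").execute("SELECT max(ts) FROM es_logs")[0][0]
```

Run it on a schedule and page when it returns True; the same function works for Kafka consumer-lag timestamps.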
Why ClickHouse® for Elasticsearch analytics
ClickHouse® is a columnar OLAP database built for analytical queries over large volumes. MergeTree tables and vectorized execution deliver sub-second queries on billions of rows.
ClickHouse®'s columnar compression reduces I/O for log and event analytics by 10x–100x compared to Elasticsearch's inverted-index model. For the same log data, ClickHouse® stores more data, queries faster, and costs less for analytical workloads.
An Elasticsearch to ClickHouse® setup fits real-time analytics, long-retention log analytics, and user-facing analytics, without scaling Elasticsearch to meet analytical demand.
Why Tinybird is the best Elasticsearch to ClickHouse® option
Most teams don't need to operate ClickHouse® infrastructure—they need fast analytics on Elasticsearch data exposed as APIs.
Tinybird is purpose-built for this. You configure a Logstash pipeline that sends data to Tinybird's Events API or Kafka connector. Tinybird handles ingestion, storage, and API publishing in one product.
You avoid operating the ClickHouse® Kafka engine and the overhead of maintaining a separate API layer. Define Pipes in SQL, publish as REST endpoints, and serve dashboards or product features with sub-100ms latency and automatic scaling.
The Logstash http output replaces the need for Kafka entirely: documents go directly from Elasticsearch → Logstash → Tinybird.
Frequently Asked Questions (FAQs)
Does ClickHouse® Cloud support Elasticsearch natively?
ClickHouse® Cloud does not have a native Elasticsearch connector in ClickPipes. You use the Kafka data source: configure a Logstash pipeline with a Kafka output, then create a Kafka ClickPipe to load into ClickHouse® Cloud.
You operate the Elasticsearch → Logstash → Kafka path; ClickHouse® Cloud ingests from Kafka.
Can I use Tinybird for Elasticsearch to ClickHouse® without Kafka?
Yes. Use the Logstash HTTP output to send documents directly to the Tinybird Events API. No Kafka cluster required.
Tinybird stores data in ClickHouse®-backed data sources and lets you publish Pipes as REST APIs. You own the Logstash pipeline; Tinybird is the destination and API layer.
How do I handle document deduplication when integrating Elasticsearch with ClickHouse®?
Use the Elasticsearch _id field as your deduplication key in ClickHouse®. Configure ReplacingMergeTree where the version column is derived from a monotonic timestamp or update counter.
With Logstash at-least-once delivery, the same document may arrive more than once. ReplacingMergeTree collapses duplicates during background merges. Use FINAL at query time for strong consistency.
What is the difference between Logstash streaming and scroll API batch sync?
Logstash streaming sends documents as they are indexed in Elasticsearch—near real-time (seconds latency). This keeps ClickHouse® data fresh but adds continuous load to Elasticsearch.
Scroll API batch sync exports documents on a schedule (e.g. hourly or daily). Latency equals your schedule interval. Use batch sync for one-time migrations or when freshness requirements are relaxed.
Is an Elasticsearch to ClickHouse® pipeline good for long-retention log analytics?
Yes—this is one of the primary motivations. Elasticsearch is expensive for long-retention: storage and compute scale linearly with data volume, and hot-tier costs are high.
Move aged logs to ClickHouse® where columnar compression reduces storage by 10x and analytical query costs drop significantly. Keep Elasticsearch for recent logs (hot tier, search) and ClickHouse® for historical analytics and reporting.
How does ClickHouse® compare to Elasticsearch for aggregation queries?
ClickHouse® significantly outperforms Elasticsearch for analytical aggregations. Columnar storage means queries only read the columns they need. Elasticsearch's document-oriented inverted index reads full documents and aggregates at query time.
For GROUP BY aggregations, time-series rollups, and large-range scans, ClickHouse® typically runs 10x–100x faster than Elasticsearch at equivalent data volumes.
