Apache Iceberg has become the open table format of choice for building data lakes. ClickHouse® is the de facto standard for real-time analytics. Combining them unlocks a powerful pattern: query your Iceberg tables directly from ClickHouse® without moving data or managing complex ETL pipelines.
But there's a catch. Making it fast requires understanding how ClickHouse® reads Iceberg metadata, which table functions and engines to use, and how to structure your tables for acceptable query latency.
In this guide, we'll walk through exactly how to integrate Iceberg with ClickHouse, explore the tradeoffs, and show you when this pattern makes sense for your architecture.
What you'll learn
- How Iceberg table functions (iceberg()) and engines (Iceberg*) work in ClickHouse
- The tradeoffs between different integration approaches
- Real-world schema evolution patterns in Iceberg tables
- Optimization strategies for acceptable query performance
- When to query Iceberg directly vs. copying data into ClickHouse
What is Apache Iceberg?
If you're already familiar with Iceberg, skip to the next section. If not, here's the TL;DR:
Apache Iceberg is an open table format for large-scale analytics. It was originally built at Netflix to solve the limitations of older formats like Hive. The key innovation: ACID transactions + schema evolution without rewriting your data.
Key features:
- ACID compliance — Transactions, atomicity, isolation across distributed writes
- Schema evolution — Add, rename, reorder, or remove columns without downtime or data rewrites
- Hidden partitioning — Partition pruning handled automatically; no "partition folder" directory structures
- Compatibility — Read with Spark, Trino, Flink, DuckDB, ClickHouse, and many others
- Snapshot isolation — Time-travel queries; read historical versions of your data
The philosophy: Write once, query many times. You write data once (e.g., via Kafka, Spark, or an ETL job), and then query it through any compatible engine without duplication.
In practice, many data teams use Iceberg as their source of truth, and then query it through different engines depending on their use case (real-time analytics, batch processing, ML pipelines, etc.).
How ClickHouse® reads Iceberg tables
ClickHouse® supports reading Iceberg tables via two mechanisms:
1. Table Functions (Recommended for read-only)
Table functions let you query Iceberg tables directly without creating a persistent table:
```sql
SELECT count() FROM iceberg(
    's3://my-bucket/iceberg/db/my_table',
    'aws_access_key_id',
    'aws_secret_access_key'
)
WHERE event_date > '2026-01-01'
```
When to use: Ad-hoc queries, one-off analysis, or when you want zero management overhead.
Pros:
- No table creation needed
- Always reads fresh metadata
- Great for exploration
Cons:
- Every query reads Iceberg metadata (overhead)
- Query latency is higher than against materialized tables
- Not suitable for high-frequency API endpoints
2. Table Engines (Iceberg*)
Table engines create a persistent table mapping to an Iceberg table:
```sql
CREATE TABLE my_iceberg_table
ENGINE = IcebergS3(
    's3://my-bucket/iceberg/db/my_table',
    'aws_access_key_id',
    'aws_secret_access_key'
)
```
When to use: When you want a long-lived reference to an Iceberg table and plan frequent queries.
Pros:
- Persistent table reference in ClickHouse®
- Can be queried repeatedly with metadata caching
- Simpler than table functions for frequent access
Cons:
- Still read-only (no writes through ClickHouse®)
- Metadata reading is still slower than native ClickHouse® tables
- Incomplete coverage of some Iceberg features (e.g., certain partition transforms)
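Once created, the engine-backed table is queried like any other ClickHouse table. This sketch reuses my_iceberg_table from the example above and assumes an event_date column:

```sql
-- Query the engine-backed table like a regular table;
-- ClickHouse resolves the Iceberg metadata behind the scenes
SELECT count()
FROM my_iceberg_table
WHERE event_date > '2026-01-01'
```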
Comparison
| Aspect | Table Function | Table Engine |
|---|---|---|
| Syntax | iceberg(...) in FROM clause | CREATE TABLE ... ENGINE = Iceberg* |
| Persistence | No | Yes |
| Metadata caching | Limited | Better |
| Best for | Ad-hoc queries | Frequent queries |
| Write support | No | No |
| Schema evolution | Supported | Supported |
Real-world example: Querying a data lake
Let's say you have an Iceberg table in S3 with e-commerce events. You want to query it from ClickHouse®.
Setting up credentials
First, store your S3 credentials in a named collection so they don't have to appear in every query:

```sql
CREATE NAMED COLLECTION aws_secret AS
    access_key_id = 'YOUR_KEY',
    secret_access_key = 'YOUR_SECRET'
```
(In production, use environment variables or a secrets manager instead.)
Approach 1: Direct table function query
```sql
SELECT
    toDate(event_time) AS date,
    count() AS event_count,
    uniq(user_id) AS unique_users
FROM icebergS3(aws_secret, url = 's3://ecommerce-lake/events/events_v2')
WHERE event_time > now() - INTERVAL 7 DAY
GROUP BY date
ORDER BY date DESC
```
Query latency: ~2-5 seconds (depends on Iceberg table size and partitioning)
Why it's slow: ClickHouse® must read Iceberg metadata, scan Parquet files, and decompress data, all in one request.
Approach 2: Materialized table + periodic refresh
For better performance, copy Iceberg data into a native ClickHouse® table:
```sql
-- Create native ClickHouse table
CREATE TABLE events_local (
    event_id UUID,
    user_id UUID,
    event_time DateTime,
    event_type String,
    product_id UInt64,
    revenue Decimal(10, 2)
)
ENGINE = MergeTree()
ORDER BY (event_time, user_id)
PARTITION BY toDate(event_time)
```
Then set up a scheduled copy to refresh it incrementally:
```sql
-- Scheduled copy (e.g., every hour)
INSERT INTO events_local
SELECT * FROM iceberg(
    's3://ecommerce-lake/events/events_v2',
    'YOUR_KEY',
    'YOUR_SECRET'
)
WHERE event_time > (
    SELECT max(event_time) FROM events_local
)
```
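If you'd rather not run an external scheduler, recent ClickHouse versions can drive this loop themselves with a refreshable materialized view. A sketch, assuming refreshable materialized views are available on your version; the view name and credentials are placeholders:

```sql
-- Re-runs the incremental SELECT every hour and appends new rows
CREATE MATERIALIZED VIEW events_copy_job
REFRESH EVERY 1 HOUR APPEND TO events_local
AS SELECT * FROM iceberg(
    's3://ecommerce-lake/events/events_v2',
    'YOUR_KEY',
    'YOUR_SECRET'
)
WHERE event_time > (SELECT max(event_time) FROM events_local)
```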
Query latency on local table: ~100ms or less
Tradeoff: You now have a copy of data. More storage, but vastly better query performance.
Schema evolution: How to handle changing Iceberg schemas
One of Iceberg's superpowers is schema evolution. Say your events table adds a new field. Iceberg handles it gracefully.
Example: Adding a new column
In Iceberg (via Spark or another writer):
```sql
-- Writer: add a new column via Spark/Trino/etc.
ALTER TABLE events ADD COLUMN session_id STRING
```
Reading with schema evolution in ClickHouse®
ClickHouse® automatically detects new columns:
```sql
-- This query will see the new column
SELECT * FROM iceberg(...)
-- New column (session_id) is automatically included
```
Important: ClickHouse® applies schema evolution rules:
- New columns appear as Nullable; rows written before the change read as NULL
- Column removals are supported
- Column reordering is supported
- Type changes (int → long, float → double) are supported
```sql
-- If a column type changed, you can cast explicitly:
SELECT
    CAST(product_id AS UInt64) AS product_id_new,
    event_type
FROM iceberg(...)
```
Optimization strategies for acceptable performance
Querying Iceberg directly from ClickHouse® works, but it's slower than querying native ClickHouse® tables. Here's how to optimize:
1. Partition your Iceberg tables
Partitioning reduces the number of files ClickHouse has to scan. Instead of reading 1000 files, read 50.
```sql
-- Bad: no partitioning
CREATE TABLE events (
    event_id STRING,
    event_time TIMESTAMP
)
USING ICEBERG

-- Better: partition by day
CREATE TABLE events (
    event_id STRING,
    event_time TIMESTAMP
)
USING ICEBERG
PARTITIONED BY (days(event_time))
```
With daily partitioning, a query filtering on event_time > '2026-01-01' will skip unrelated partitions entirely.
2. Sort within partitions
Even within a partition, sorting matters. Store related data together:
```sql
CREATE TABLE events (
    event_id STRING,
    user_id STRING,
    event_time TIMESTAMP
)
USING ICEBERG
PARTITIONED BY (days(event_time))

-- Iceberg DDL has no CLUSTER BY; set a write sort order instead
ALTER TABLE events WRITE ORDERED BY user_id
```

With files sorted and clustered by user_id, queries filtering by user read far fewer files.
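With the data clustered by user_id inside daily partitions, a user-scoped query from ClickHouse can skip files whose min/max statistics exclude that user. A sketch; the user ID and credentials are placeholders:

```sql
-- File-level min/max stats on the sort column let the reader skip files
SELECT count()
FROM iceberg(
    's3://ecommerce-lake/events/events_v2',
    'YOUR_KEY',
    'YOUR_SECRET'
)
WHERE user_id = 'some-user-id'
  AND event_time > now() - INTERVAL 1 DAY
```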
3. Compact small files
Streaming writes to Iceberg often create small files (the "small files problem"). Compact them periodically:
```sql
-- In Spark: compact a table's data files
CALL catalog_name.system.rewrite_data_files('db.events')
```
Fewer, larger files mean faster scans in ClickHouse.
4. Use column selection
Always project only the columns you need:
```sql
-- Bad: SELECT *
SELECT * FROM iceberg(...)

-- Better: select specific columns
SELECT user_id, event_type, event_time FROM iceberg(...)
```
ClickHouse (and the Parquet files) only read the columns you select.
5. Precompute aggregations with materialized views
For frequently accessed aggregates, use a materialized view:
```sql
-- Refreshable materialized view: re-runs the query on a schedule
CREATE MATERIALIZED VIEW events_by_day
REFRESH EVERY 1 HOUR
ENGINE = MergeTree
ORDER BY date
AS SELECT
    toDate(event_time) AS date,
    count() AS event_count
FROM iceberg(...)
GROUP BY date
```

Note that a regular ClickHouse® materialized view would not update here, because it only triggers on inserts into a local source table; the REFRESH clause instead re-runs the query on a schedule, keeping the aggregate at most an hour behind the lake.
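Dashboards and APIs can then read the small precomputed table instead of touching the lake:

```sql
-- Reads only the aggregate rows; no Iceberg metadata involved
SELECT date, event_count
FROM events_by_day
WHERE date >= today() - 30
ORDER BY date
```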
When to use direct Iceberg queries vs. copying data
Query Iceberg directly if:
- You need occasional access (ad-hoc analytics, exploratory queries)
- Query latency of 2-5 seconds is acceptable
- Data freshness is critical (no copy lag)
- You want to avoid storage duplication
Copy to ClickHouse® if:
- You need sub-100ms query latency (user-facing APIs)
- You query frequently (dashboards, real-time insights)
- You're building real-time applications
- You can tolerate a small copy lag (minutes to hours)
Hybrid approach (recommended):
- Keep Iceberg as your source of truth and data lake
- Copy frequently-accessed data into ClickHouse® for fast APIs
- Use direct Iceberg queries for occasional, large analytical workloads
- This is the pattern Tinybird uses for real-time analytics over Iceberg
Real-world tradeoffs: The performance cliff
Querying Iceberg through ClickHouse® hits a performance cliff once scans take more than a second or two.
Why?
- Metadata overhead — ClickHouse® must read Iceberg manifest files (can be tens of MB)
- Small files — Iceberg partitions often create hundreds of small Parquet files
- Decompression — Decompressing Parquet is slower than scanning native ClickHouse® MergeTree
Real numbers:
- Querying 1 day of Iceberg data: ~500ms
- Querying 1 week of Iceberg data: ~2-3 seconds
- Querying 1 month of Iceberg data: ~10+ seconds (unacceptable for APIs)
This is why production systems copy data from Iceberg into ClickHouse® for real-time access, and reserve direct Iceberg queries for occasional analytical workloads.
Getting started
Prerequisites
- ClickHouse® instance (23.12+)
- S3 bucket with Iceberg tables
- AWS credentials
Step 1: Create a test table function
```sql
SELECT count() FROM iceberg(
    's3://your-bucket/path/to/iceberg/table',
    'YOUR_AWS_KEY',
    'YOUR_AWS_SECRET'
)
```
If you don't have Iceberg tables yet, you can use public datasets like GitHub Archive or create one via Spark.
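To create one via Spark, a minimal Spark SQL sketch; the catalog name demo, the table, and the sample row are all placeholders:

```sql
-- Create and seed a tiny Iceberg table to point ClickHouse at
CREATE TABLE demo.db.test_events (
    event_id STRING,
    event_time TIMESTAMP
)
USING ICEBERG;

INSERT INTO demo.db.test_events
VALUES ('evt-1', TIMESTAMP '2026-01-01 00:00:00')
```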
Step 2: Query it
```sql
SELECT * FROM iceberg(...) LIMIT 10
```
Step 3: For production, decide on your approach
- Direct queries: Use table functions, add monitoring for slow queries
- Copy approach: Create a scheduled INSERT job to refresh data
- Hybrid: Use Tinybird or similar to automate the copy + materialization layer
Conclusion
Apache Iceberg + ClickHouse® is a powerful combination for data lakes. You get:
- Iceberg's ACID guarantees, schema evolution, and broad compatibility
- ClickHouse®'s sub-100ms query performance for real-time analytics
The tradeoff: Direct Iceberg queries are slower than native ClickHouse®. For production real-time APIs, you'll want to copy data into ClickHouse® and accept a small freshness delay.
The sweet spot for most teams: Keep Iceberg as your data lake and source of truth, and use ClickHouse® (with periodic copies) as the query engine for real-time applications.
Ready to try it? Here are next steps:
- Explore ClickHouse®'s Iceberg docs
- Check out Apache Iceberg documentation
- Learn how to build real-time analytics on Apache Iceberg with Tinybird
