---
title: "Apache Iceberg with ClickHouse: Integration & Optimization"
excerpt: "Learn how to read and query Apache Iceberg tables directly from ClickHouse, with optimization tips and real-world examples for building scalable data lakes."
authors: "Tinybird"
categories: "AI Resources"
createdOn: "2026-02-04 00:00:00"
publishedOn: "2026-02-04 00:00:00"
updatedOn: "2026-02-04 00:00:00"
status: "published"
---

Apache Iceberg has become the open table format of choice for building data lakes. ClickHouse® is the de facto standard for real-time analytics. Combining them unlocks a powerful pattern: **query your Iceberg tables directly from ClickHouse® without moving data or managing complex ETL pipelines.**

But there's a catch. Making it fast requires understanding how ClickHouse® reads Iceberg metadata, which table functions and engines to use, and how to structure your tables for acceptable query latency.

In this guide, we'll walk through exactly how to integrate Iceberg with ClickHouse, explore the tradeoffs, and show you when this pattern makes sense for your architecture.

## What you'll learn

- How Iceberg table functions (`iceberg()`) and engines (`Iceberg*`) work in ClickHouse
- The tradeoffs between different integration approaches
- Real-world schema evolution patterns in Iceberg tables
- Optimization strategies for acceptable query performance
- When to query Iceberg directly vs. copying data into ClickHouse

---

## What is Apache Iceberg?

If you're already familiar with Iceberg, skip to the next section. If not, here's the TL;DR:

**Apache Iceberg** is an open table format for large-scale analytics. It was originally built at Netflix to solve the limitations of older formats like Hive. The key innovation: **ACID transactions + schema evolution** without rewriting your data.

Key features:

- **ACID compliance** — Transactions, atomicity, isolation across distributed writes
- **Schema evolution** — Add, rename, reorder, or remove columns without downtime or data rewrites
- **Hidden partitioning** — Partition pruning handled automatically; no "partition folder" directory structures
- **Compatibility** — Read with Spark, Trino, Flink, DuckDB, ClickHouse, and many others
- **Snapshot isolation** — Time-travel queries; read historical versions of your data

The philosophy: **Write once, query many times.** You write data once (e.g., via Kafka, Spark, or an ETL), and then query it through any compatible engine without duplication.

In practice, many data teams use Iceberg as their **source of truth**, and then query it through different engines depending on their use case (real-time analytics, batch processing, ML pipelines, etc.).

---

## How ClickHouse® reads Iceberg tables

ClickHouse® supports reading Iceberg tables via two mechanisms:

### 1. Table Functions (Recommended for read-only)

Table functions let you query Iceberg tables directly without creating a persistent table:

```sql
SELECT count() FROM iceberg(
  's3://my-bucket/iceberg/db/my_table',
  aws_access_key_id,
  aws_secret_access_key
)
WHERE event_date > '2026-01-01'
```

**When to use:** Ad-hoc queries, one-off analysis, or when you want zero management overhead.

**Pros:**
- No table creation needed
- Always reads fresh metadata
- Great for exploration

**Cons:**
- Every query reads Iceberg metadata (overhead)
- Query latency is higher than against materialized tables
- Not suitable for high-frequency API endpoints

### 2. Table Engines (`Iceberg*`)

Table engines create a persistent table mapping to an Iceberg table:

```sql
CREATE TABLE my_iceberg_table
ENGINE = IcebergS3(
  's3://my-bucket/iceberg/db/my_table',
  aws_access_key_id,
  aws_secret_access_key
)
```

**When to use:** When you want a long-lived reference to an Iceberg table and plan frequent queries.

**Pros:**
- Persistent table reference in ClickHouse®
- Can be queried repeatedly with metadata caching
- Simpler than table functions for frequent access

**Cons:**
- Still read-only (no writes through ClickHouse®)
- Metadata reading is still slower than native ClickHouse® tables
- Limited support for all Iceberg features (e.g., some transformations)

### Comparison

| Aspect | Table Function | Table Engine |
|--------|---------------|--------------|
| Syntax | `iceberg(...)` in FROM clause | `CREATE TABLE ... ENGINE = Iceberg*` |
| Persistence | No | Yes |
| Metadata caching | Limited | Better |
| Best for | Ad-hoc queries | Frequent queries |
| Write support | No | No |
| Schema evolution | Supported | Supported |

---

## Real-world example: Querying a data lake

Let's say you have an Iceberg table in S3 with e-commerce events. You want to query it from ClickHouse®.

### Setting up credentials

First, store your S3 credentials securely:

```sql
CREATE SECRET aws_secret (
  access_key_id = 'YOUR_KEY',
  secret_access_key = 'YOUR_SECRET'
)
```

(In production, use environment variables or a secrets manager instead.)

### Approach 1: Direct table function query

```sql
SELECT
  toDate(event_time) as date,
  count() as event_count,
  uniq(user_id) as unique_users
FROM iceberg(
  's3://ecommerce-lake/events/events_v2',
  (SELECT access_key_id FROM aws_secret),
  (SELECT secret_access_key FROM aws_secret)
)
WHERE event_time > now() - interval 7 day
GROUP BY date
ORDER BY date DESC
```

**Query latency:** ~2-5 seconds (depends on Iceberg table size and partitioning)

**Why it's slow:** ClickHouse® must read Iceberg metadata, scan Parquet files, and decompress data. All in one request.

### Approach 2: Materialized table + periodic refresh

For better performance, copy Iceberg data into a native ClickHouse® table:

```sql
-- Create native ClickHouse table
CREATE TABLE events_local (
  event_id UUID,
  user_id UUID,
  event_time DateTime,
  event_type String,
  product_id UInt64,
  revenue Decimal(10, 2)
)
ENGINE = MergeTree()
ORDER BY (event_time, user_id)
PARTITION BY toDate(event_time)
```

Then create a **copy pipe** to refresh incrementally:

```sql
-- Scheduled copy (e.g., every hour)
INSERT INTO events_local
SELECT * FROM iceberg(
  's3://ecommerce-lake/events/events_v2',
  aws_key,
  aws_secret
)
WHERE event_time > (
  SELECT max(event_time) FROM events_local
)
```

**Query latency on local table:** ~100ms or less

**Tradeoff:** You now have a copy of data. More storage, but vastly better query performance.

---

## Schema evolution: How to handle changing Iceberg schemas

One of Iceberg's superpowers is schema evolution. Say your events table adds a new field. Iceberg handles it gracefully.

### Example: Adding a new column

In Iceberg (via Spark or another writer):

```sql
-- Writer: Add a new column via Spark/Trino/etc.
ALTER TABLE events ADD COLUMN session_id UUID
```

### Reading with schema evolution in ClickHouse®

ClickHouse® automatically detects new columns:

```sql
-- This query will see the new column
SELECT * FROM iceberg(...)
-- New column (session_id) is automatically included
```

**Important:** ClickHouse® applies schema evolution rules:

- New columns are nullable or have defaults
- Column removals are supported
- Column reordering is supported
- Type changes (int → long, float → double) are supported

```sql
-- If column type changed, ClickHouse handles it:
SELECT
  CAST(product_id AS UInt64) as product_id_new,
  event_type
FROM iceberg(...)
```

---

## Optimization strategies for acceptable performance

Querying Iceberg directly from ClickHouse® works, but it's slower than querying native ClickHouse® tables. Here's how to optimize:

### 1. Partition your Iceberg tables

Partitioning reduces the number of files ClickHouse has to scan. Instead of reading 1000 files, read 50.

```sql
-- Bad: No partitioning
CREATE TABLE events (
  event_id UUID,
  event_time DateTime
)
USING ICEBERG

-- Better: Partition by date
CREATE TABLE events (
  event_id UUID,
  event_time DateTime
)
USING ICEBERG
PARTITIONED BY (date(event_time))
```

With daily partitioning, a query filtering on `event_time > '2026-01-01'` will skip unrelated partitions entirely.

### 2. Sort within partitions

Even within a partition, sorting matters. Store related data together:

```sql
CREATE TABLE events (
  event_id UUID,
  user_id UUID,
  event_time DateTime
)
USING ICEBERG
PARTITIONED BY (date(event_time))
CLUSTER BY (user_id)
```

Clustered by `user_id` means queries filtering by user read far fewer files.

### 3. Compact small files

Streaming writes to Iceberg often create small files (the "small files problem"). Compact them periodically:

```sql
-- In Spark, compact partitions
CALL system.rewrite_data_files('events')
```

Fewer, larger files mean faster scans in ClickHouse.

### 4. Use column selection

Always project only the columns you need:

```sql
-- Bad: SELECT *
SELECT * FROM iceberg(...)

-- Better: Select specific columns
SELECT user_id, event_type, event_time FROM iceberg(...)
```

ClickHouse (and the Parquet files) only read the columns you select.

### 5. Precompute aggregations with materialized views

For frequently accessed aggregates, use a materialized view:

```sql
CREATE MATERIALIZED VIEW events_by_day AS
SELECT
  toDate(event_time) as date,
  count() as event_count
FROM iceberg(...)
GROUP BY date
```

This keeps aggregates up-to-date as new data arrives.

---

## When to use direct Iceberg queries vs. copying data

**Query Iceberg directly if:**
- You need occasional access (ad-hoc analytics, exploratory queries)
- Query latency <2-5 seconds is acceptable
- Data freshness is critical (no copy lag)
- You want to avoid storage duplication

**Copy to ClickHouse® if:**
- You need sub-100ms query latency (user-facing APIs)
- You query frequently (dashboards, real-time insights)
- You're building real-time applications
- You can tolerate a small copy lag (minutes to hours)

**Hybrid approach (recommended):**
- Keep Iceberg as your **source of truth** and data lake
- Copy frequently-accessed data into ClickHouse® for fast APIs
- Use direct Iceberg queries for occasional, large analytical workloads
- This is the pattern Tinybird uses for real-time analytics over Iceberg

---

## Real-world tradeoffs: The performance cliff

Querying Iceberg through ClickHouse® hits a performance cliff around **1-2 seconds of data scan time**.

Why?

1. **Metadata overhead** — ClickHouse® must read Iceberg manifest files (can be tens of MB)
2. **Small files** — Iceberg partitions often create hundreds of small Parquet files
3. **Decompression** — Decompressing Parquet is slower than scanning native ClickHouse® MergeTree

**Real numbers:**
- Querying 1 day of Iceberg data: ~500ms
- Querying 1 week of Iceberg data: ~2-3 seconds
- Querying 1 month of Iceberg data: ~10+ seconds (unacceptable for APIs)

This is why production systems copy data from Iceberg into ClickHouse® for real-time access, and reserve direct Iceberg queries for occasional analytical workloads.

---

## Getting started

### Prerequisites

- ClickHouse® instance (23.12+)
- S3 bucket with Iceberg tables
- AWS credentials

### Step 1: Create a test table function

```sql
SELECT count() FROM iceberg(
  's3://your-bucket/path/to/iceberg/table',
  'YOUR_AWS_KEY',
  'YOUR_AWS_SECRET'
)
```

If you don't have Iceberg tables yet, you can use public datasets like GitHub Archive or create one via Spark.

### Step 2: Query it

```sql
SELECT * FROM iceberg(...) LIMIT 10
```

### Step 3: For production, decide on your approach

- **Direct queries:** Use table functions, add monitoring for slow queries
- **Copy approach:** Create a scheduled INSERT job to refresh data
- **Hybrid:** Use Tinybird or similar to automate the copy + materialization layer

---

## Conclusion

Apache Iceberg + ClickHouse® is a powerful combination for data lakes. You get:

- Iceberg's ACID guarantees, schema evolution, and broad compatibility
- ClickHouse®'s sub-100ms query performance for real-time analytics

The tradeoff: Direct Iceberg queries are slower than native ClickHouse®. For production real-time APIs, you'll want to copy data into ClickHouse® and accept a small freshness delay.

The sweet spot for most teams: Keep Iceberg as your **data lake and source of truth**, and use ClickHouse® (with periodic copies) as the **query engine for real-time applications**.

Ready to try it? Here are next steps:

- Explore [ClickHouse®'s Iceberg docs](https://clickhouse.com/docs/engines/table-engines/integrations/iceberg)
- Check out [Apache Iceberg documentation](https://iceberg.apache.org/)
- Learn how to build [real-time analytics on Apache Iceberg with Tinybird](https://www.tinybird.co/blog/real-time-analytics-on-apache-iceberg-with-tinybird)

---

## Related posts

- [Real-Time Analytics on Apache Iceberg with Tinybird](https://www.tinybird.co/blog/real-time-analytics-on-apache-iceberg-with-tinybird)
- [Optimizing Apache Iceberg tables for real-time analytics](https://www.tinybird.co/blog/optimizing-apache-iceberg-tables-for-real-time-analytics)
- [Tinybird vs ClickHouse: What's the difference?](https://www.tinybird.co/blog/tinybird-vs-clickhouse)
