Apache Iceberg has all the features needed for high-performance analytics, but success depends on how you use them.
While Iceberg excels at analytical workloads, adapting it for real-time analytics requires understanding the performance characteristics and trade-offs.
The most common mistakes engineers make are:
- Ignoring the fundamentals: Proper partitioning and sorting can make queries 1000x faster
- Optimizing before understanding: Always analyze query patterns first, then optimize
- Assuming more features = better performance: Each optimization has trade-offs
- Not understanding the streaming challenges: Small files, metadata explosion, and compaction overhead
Let's explore Iceberg's partitioning, sorting, and compaction features to build high-performance real-time analytics systems.
1. Partitioning
Partitioning is a way to organize your data into logical groups. Think of partitioning as "Which folder should I look in?"
- ❌ Without partitioning, every query scans all files.
- ✅ Partition by your most common query filters.
Some practical examples:
Basic time-based partitioning
-- Partition by day for time-based queries
CREATE TABLE ecommerce_events (
event_id BIGINT,
user_id BIGINT,
event_time TIMESTAMP,
event_type STRING,
product_id BIGINT,
region STRING,
session_id STRING,
revenue DECIMAL(10,2)
) USING ICEBERG
PARTITIONED BY (days(event_time));
Directory structure
ecommerce_events/
├── event_time_day=2024-01-15/
│   └── data files
├── event_time_day=2024-01-16/
│   └── data files
└── event_time_day=2024-01-17/
    └── data files
Sample query
SELECT COUNT(*) FROM ecommerce_events
WHERE event_time = '2024-01-15';
-- Files read: Only event_time_day=2024-01-15 ✅
Query plan
-- Query plan shows partition filters:
== Physical Plan ==
... PartitionFilters: [isnotnull(event_time_day), (event_time_day = 2024-01-15)]
Multi-dimensional partitioning
-- Partition by time AND region for better pruning
CREATE TABLE ecommerce_events (
event_id BIGINT,
user_id BIGINT,
event_time TIMESTAMP,
event_type STRING,
product_id BIGINT,
region STRING,
session_id STRING,
revenue DECIMAL(10,2)
) USING ICEBERG
PARTITIONED BY (
days(event_time),
region,
bucket(32, user_id) -- Hash bucketing for load distribution
);
Directory structure
ecommerce_events/
├── event_time_day=2024-01-15/
│   ├── region=US/
│   │   ├── user_id_bucket=0/
│   │   │   ├── part-00000.parquet
│   │   │   └── part-00001.parquet
│   │   ├── user_id_bucket=1/
│   │   │   └── part-00002.parquet
│   │   └── ... (buckets 2-31)
│   └── region=EU/
│       ├── user_id_bucket=0/
│       └── ... (buckets 1-31)
├── event_time_day=2024-01-16/
│   ├── region=US/
│   │   ├── user_id_bucket=0/
│   │   └── ... (buckets 1-31)
│   ├── region=EU/
│   └── region=APAC/
└── event_time_day=2024-01-17/
    ├── region=US/
    ├── region=EU/
    └── region=APAC/
Sample query
SELECT COUNT(*) FROM ecommerce_events
WHERE event_time = '2024-01-15' AND region = 'US';
-- Files read: Only event_time_day=2024-01-15/region=US/ ✅
Query plan
...
PartitionFilters: [
isnotnull(event_time_day#999),
(event_time_day#999 = 2024-01-15),  ← Time partition elimination
isnotnull(region#789),
(region#789 = US)  ← Region partition elimination
]
...
PartitionsRead: 32 (out of 15,360 total partitions)
FilesRead: 64 (out of 245,760 total files)
Always define partitions based on actual query patterns. Some examples:
- For time-series queries:
PARTITIONED BY (
days(event_time), -- Primary: time filtering (days or months depending on volume)
bucket(8, user_id) -- Secondary: load balancing only
)
- For multi-tenant queries:
PARTITIONED BY (
tenant_id, -- Primary: perfect isolation
days(event_time), -- Secondary: time pruning
bucket(4, user_id) -- Tertiary: small buckets
)
- For region-heavy queries:
PARTITIONED BY (
region, -- Primary: geographic filtering
days(event_time), -- Secondary: time pruning
bucket(16, user_id) -- Tertiary: parallelism
)
Rules of thumb when partitioning Iceberg tables
- Maximum 2-3 partition columns
- 10-100 files per partition
- 1GB-100GB total size per partition
- Monitor and evolve partitions when required
Anti-patterns
Too many small partitions:
-- BAD: Creates tiny partitions
PARTITIONED BY (
hours(event_time), -- 24 partitions per day
region, -- × 10 regions
event_type, -- × 20 event types
user_segment -- × 5 segments
)
-- Result: 24,000 tiny partitions per day
High cardinality partitions:
-- BAD: Partition explosion
PARTITIONED BY (
user_id, -- Millions of partitions
session_id -- Even more partitions
)
-- Result: Metadata larger than data
Ignoring query patterns:
-- BAD: Partitioned by write pattern, not read pattern
PARTITIONED BY (
ingestion_batch_id -- How data arrives
)
-- But queries filter by:
WHERE event_time > '...' AND region = '...' -- Different columns!
When to repartition
Inspect your table metadata and query patterns regularly so you know when the partition spec needs to evolve.
-- Partition health check
WITH partition_stats AS (
SELECT
partition,
COUNT(*) as file_count,
SUM(file_size_in_bytes) as partition_bytes,
AVG(file_size_in_bytes) as avg_file_bytes
FROM table_name.files
GROUP BY partition
)
SELECT
COUNT(*) as total_partitions,
AVG(file_count) as avg_files_per_partition,
MAX(file_count) as max_files_per_partition,
AVG(partition_bytes) / (1024*1024*1024) as avg_partition_gb,
COUNT(CASE WHEN file_count > 1000 THEN 1 END) as problematic_partitions
FROM partition_stats;
- Too many files per partition: > 1,000 files → add bucketing
- Too few files per partition: < 5 files → reduce time granularity
- Skewed partitions: one partition > 10× the average → add sub-partitioning
- Query performance degraded → re-align partitioning with query patterns
When any of these shows up, evolve the partition spec, as shown in the sketch below.
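Iceberg supports in-place partition evolution, so you can apply these fixes without rewriting the table up front. A minimal sketch, assuming the Iceberg Spark SQL extensions are enabled and reusing the field names from the examples above; only new data follows the new spec, and old files keep their layout until they are rewritten:
-- Add hash bucketing to break up skewed or oversized partitions (new writes only)
ALTER TABLE db.ecommerce_events ADD PARTITION FIELD bucket(16, user_id);
-- Coarsen the time granularity if daily partitions are too small
ALTER TABLE db.ecommerce_events REPLACE PARTITION FIELD event_time_day WITH months(event_time);
-- Optionally rewrite old files so they follow the current spec
CALL catalog.system.rewrite_data_files(table => 'db.ecommerce_events', strategy => 'binpack');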
2. Sorting
Sorting is a way to organize your data into a specific order. Think of sorting as a way to "skip files within partitions and blocks within files".
- ❌ Even within partitions, data might be randomly distributed across files.
- ✅ Sort data within files to create useful min/max statistics.
Regular sorting
-- Sort by primary access pattern
-- This creates a strict ordering: first by user_id, then by event_time
CALL catalog.system.rewrite_data_files(
  table => 'db.ecommerce_events',
  strategy => 'sort',
  sort_order => 'user_id, event_time'
);
File layout after regular sorting:
| user_id | event_time |
|---------|------------|
| 1       | 10:00      |
| 1       | 10:10      |
| 1       | 10:30      |
| 2       | 10:15      |
| 2       | 10:25      |
| 2       | 10:35      |
| 3       | 10:05      |
| 3       | 10:20      |
All user_id records are grouped together, and within each user the rows are ordered by event_time.
- Primary sort key (user_id): [██████████] → excellent clustering
- Secondary sort key (event_time): [█████     ] → scattered across the file
When to use regular sorting:
- Single primary access pattern (e.g., 90% of queries filter by event_time)
SELECT * FROM ecommerce_events WHERE event_time BETWEEN '2024-01-01' AND '2024-01-02';
- Hierarchical access (tenant β date β user)
SELECT * FROM multi_tenant_events
WHERE tenant_id = 'company_a' AND date >= '2024-01-01' AND user_id = 12345;
- Time-series with mostly recent data access
sort_order => 'timestamp DESC' -- Most recent data first
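If a single pattern really dominates, you can also persist the sort order as the table's write order, so that compliant writers sort data on insert. A minimal sketch, assuming the Iceberg Spark SQL extensions are enabled; it only affects new writes, so existing files still need rewrite_data_files:
-- Persist a write order on the table (most recent data first, then by user)
ALTER TABLE db.ecommerce_events WRITE ORDERED BY event_time DESC, user_id;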
Z-Ordering for multi-dimensional queries
Use Z-ordering for queries that filter on multiple dimensions. Data gets interleaved so that locality is preserved across all Z-order columns.
CALL catalog.system.rewrite_data_files(
  table => 'db.ecommerce_events',
  strategy => 'sort',
  sort_order => 'zorder(user_id, event_time)'
);
File layout after z-ordering:
| user_id | event_time |
|---------|------------|
| 1       | 10:00      |
| 2       | 10:15      |
| 1       | 10:10      |
| 3       | 10:05      |
| 1       | 10:30      |
| 2       | 10:25      |
| 3       | 10:20      |
| 2       | 10:35      |
All sort keys: [█████████ ] → balanced clustering
When to use z-ordering:
- Queries filter on 2-4 columns together
SELECT * FROM ecommerce_events WHERE user_id = 12345 AND event_time > '2024-01-01';
SELECT * FROM ecommerce_events WHERE user_id BETWEEN 1000 AND 2000 AND event_time BETWEEN '2024-01-01' AND '2024-01-31';
- Range queries on multiple columns (such as geospatial data)
SELECT * FROM locations WHERE latitude BETWEEN 40.0 AND 41.0 AND longitude BETWEEN -74.0 AND -73.0;
- High-cardinality combinations
SELECT * FROM purchases
WHERE user_id IN (1,2,3) AND product_id IN (100,200) AND timestamp > '2024-01-01';
Z-ordering trades some single-dimension performance for better multi-dimensional performance, while regular sorting optimizes for the primary sort column at the expense of secondary columns.
File-Level statistics impact
-- Before sorting: user_id randomly distributed
-- File 1: user_id range [1-1000, 5000-6000] (fragmented)
-- File 2: user_id range [2000-3000, 500-800] (fragmented)
-- After sorting: clean ranges
-- File 1: user_id range [1-1000] β min=1, max=1000
-- File 2: user_id range [1001-2000] β min=1001, max=2000
SELECT * FROM ecommerce_events
WHERE user_id = 150 AND event_time >= '2024-01-15';
-- Only reads File 1 (150 is between 1-1000) β
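You can check whether sorting actually produced tight ranges by querying the files metadata table. A rough sketch; the readable_metrics column is available in recent Iceberg releases, while older versions only expose binary lower_bounds/upper_bounds maps:
-- Inspect per-file statistics to confirm clean user_id ranges
SELECT
  file_path,
  record_count,
  readable_metrics.user_id.lower_bound AS user_id_min,
  readable_metrics.user_id.upper_bound AS user_id_max
FROM db.ecommerce_events.files
ORDER BY user_id_min;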
3. Compaction
Iceberg partitions are composed of immutable files, so frequent writes can create many small files, hurting query performance.
Compaction is a process, typically run in the background, that merges several small files into bigger ones.
The process may look simple: just run the compaction procedure:
-- Compact all small files
CALL catalog.system.rewrite_data_files(
  table => 'db.ecommerce_events',
  strategy => 'binpack',
  options => map('target-file-size-bytes', '268435456') -- 256MB
);
But in reality, this is costly. Instead, you want to run compaction only over the necessary files (unless you want to rewrite a full table due to some schema evolution):
-- Compact only files from the last hour
CALL catalog.system.rewrite_data_files(
table => 'db.ecommerce_events',
strategy => 'binpack',
where => 'event_time >= current_timestamp() - INTERVAL 1 HOUR',
options => map(
'target-file-size-bytes', '268435456',
'min-input-files', '5', -- Only compact if at least 5 files
'max-concurrent-file-group-rewrites', '5' -- Parallelism level
)
);
In real-world workloads you need to be smarter and customize your compaction logic so it runs only when needed:
-- Check if compaction is needed
WITH small_files_check AS (
SELECT
COUNT(*) as small_file_count,
AVG(file_size_in_bytes) as avg_size_bytes
FROM ecommerce_events.files
WHERE file_size_in_bytes < 67108864 -- < 64MB
AND partition LIKE '%2024-01-15%' -- Today's data
)
SELECT
CASE
WHEN small_file_count >= 10 THEN 'COMPACT_RECOMMENDED'
ELSE 'NO_ACTION_NEEDED'
END as recommendation
FROM small_files_check;
Metadata optimization
An explosion of metadata files can hurt performance. Two good practices here:
- Expire snapshots, and
- Compact manifest files
-- Enable snapshot expiration for faster metadata reads
ALTER TABLE ecommerce_events SET TBLPROPERTIES (
'history.expire.min-snapshots-to-keep' = '5',
'history.expire.max-snapshot-age-ms' = '86400000' -- 1 day
);
-- Compact manifests regularly
CALL catalog.system.rewrite_manifests('db.ecommerce_events');
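Note that the table properties above define the retention policy, but most engines only apply it when a maintenance operation runs; in Spark you typically trigger the cleanup explicitly. A sketch using the standard maintenance procedures:
-- Expire old snapshots (keeps time travel history bounded)
CALL catalog.system.expire_snapshots(
  table => 'db.ecommerce_events',
  older_than => TIMESTAMP '2024-01-15 00:00:00',
  retain_last => 5
);
-- Clean up files no longer referenced by any snapshot
CALL catalog.system.remove_orphan_files(table => 'db.ecommerce_events');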
The 6-Level Data Elimination Stack
When you combine partitioning, sorting, and compaction, Iceberg creates a powerful 6-level data elimination system.
Let's trace through how a single query gets optimized:
SELECT user_id, event_type
FROM ecommerce_events
WHERE event_time = '2024-01-15' AND user_id = 150;
Level 1: Partition pruning
- Iceberg skips entire directories based on partition columns
- Only reads partitions where event_time = '2024-01-15'
-- Table structure:
ecommerce_events/
├── event_time_day=2024-01-14/  ❌ SKIP (date < 2024-01-15)
├── event_time_day=2024-01-15/  ✅ READ (matches filter)
├── event_time_day=2024-01-16/  ❌ SKIP (not needed)
└── event_time_day=2024-01-17/  ❌ SKIP (not needed)
-- Result: Skip entire directories based on partition columns
-- Data eliminated: 75% (3 out of 4 days)
Level 2: File elimination
- Within event_time_day=2024-01-15/, skip files whose user_id range doesn't include 150
- Uses file-level statistics (min/max values) to skip files
-- Within event_time_day=2024-01-15/:
├── file_001.parquet → user_id range [1-1000]    ✅ READ (150 is in range)
├── file_002.parquet → user_id range [1001-2000]  ❌ SKIP (150 < 1001)
├── file_003.parquet → user_id range [2001-3000]  ❌ SKIP (150 < 2001)
└── file_004.parquet → user_id range [3001-4000]  ❌ SKIP (150 < 3001)
-- Result: Skip files using min/max statistics
-- Data eliminated: 75% more (3 out of 4 files)
Level 3: Row group elimination (inside parquet)
- Within the chosen file, skip row groups whose user_id min/max doesn't include 150
- Uses Parquet column statistics
-- Within file_001.parquet:
├── Row Group 1 → user_id range [1-500]      ✅ READ (150 is in range)
├── Row Group 2 → user_id range [501-1000]   ❌ SKIP (150 < 501)
└── Row Group 3 → user_id range [1001-1500]  ❌ SKIP (150 < 1001)
-- Result: Skip row groups using Parquet statistics
-- Data eliminated: 66% more (2 out of 3 row groups)
Level 4: Column elimination (projection pushdown)
- Only read the user_id and event_type column chunks, skip the others
-- Row Group 1 contains columns:
├── user_id column chunk     ✅ READ (needed for SELECT)
├── event_type column chunk  ✅ READ (needed for SELECT)
├── event_time column chunk  ❌ SKIP (only used in WHERE, already filtered)
├── product_id column chunk  ❌ SKIP (not needed)
├── revenue column chunk     ❌ SKIP (not needed)
└── session_id column chunk  ❌ SKIP (not needed)
-- Result: Only deserialize needed columns
-- Data eliminated: 66% more (4 out of 6 columns)
Level 5: Page elimination (within row groups)
- Within row groups, skip pages whose user_id range doesn't include 150
-- Within user_id column chunk:
├── Page 1 → user_id range [1-100]    ❌ SKIP (150 > 100)
├── Page 2 → user_id range [101-200]  ✅ READ (150 is in range)
├── Page 3 → user_id range [201-300]  ❌ SKIP (150 < 201)
└── Page 4 → user_id range [301-400]  ❌ SKIP (150 < 301)
-- Result: Skip pages within column chunks
-- Data eliminated: 75% more (3 out of 4 pages)
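How much Levels 3 and 5 can skip depends on how the Parquet files were written: row group and page sizes are table write properties. A sketch with values close to the defaults, which you can tune per workload; smaller units give finer-grained skipping at the cost of more metadata and worse compression:
-- Tune Parquet row group and page sizes (values shown are roughly the defaults)
ALTER TABLE ecommerce_events SET TBLPROPERTIES (
  'write.parquet.row-group-size-bytes' = '134217728', -- 128MB row groups
  'write.parquet.page-size-bytes' = '1048576'         -- 1MB pages
);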
Level 6: Row filtering (post-deserialization)
Important: Unlike the previous 5 levels, this is filtering, not elimination.
- The entire page must be read and deserialized from storage
- Row-level filtering happens in memory after reading the page
-- Within Page 2 (user_id range [101-200]):
├── Read entire page from storage → ALL rows [101-200] deserialized
├── Apply filter in memory: user_id = 150
└── Return only matching row(s)
-- Result: I/O saved = 0% (entire page must be read)
-- Processing saved = ~99% (only matching rows processed further)
Performance impact
Here's the cumulative effect of the 6-level optimization:
| Level | Elimination Target | Data Remaining | Cumulative I/O Savings |
|---|---|---|---|
| Original Dataset | - | 1TB (100%) | - |
| 1. Partition | Skip directories | 250GB (25%) | 75% saved |
| 2. File | Skip files | 62GB (6.25%) | 94% saved |
| 3. Row Group | Skip row groups | 21GB (2.1%) | 98% saved |
| 4. Column | Skip columns | 7GB (0.7%) | 99.3% saved |
| 5. Page | Skip pages | 1.7GB (0.17%) | 99.8% saved |
| 6. Row | Filter rows | 1.7GB (0.17%) | 99.8% I/O saved |
Levels 1-5: True data elimination
- These levels skip reading data entirely from storage
- Provide both I/O savings and processing savings
- This is where the massive performance gains come from
Level 6: In-memory filtering
- Still reads the data from storage (no additional I/O savings)
- Filters out unwanted rows in memory after deserialization
- Provides CPU/memory savings but not I/O savings
- This is traditional SQL row-level filtering
Final Result: Read 1.7GB instead of 1TB - that's 99.8% less I/O!
When Iceberg isn't enough
While the optimizations above work well for analytical workloads, streaming and real-time analytics expose fundamental limitations in Iceberg's architecture.
Streaming writes to Iceberg create several fundamental challenges due to its architecture based on immutable snapshots.
Challenge 1: Small file explosion
See this sample code:
# Streaming scenario: new ecommerce_events every second
# (checkpoint location below is an illustrative placeholder)
stream.writeStream \
    .format("iceberg") \
    .outputMode("append") \
    .trigger(processingTime="1 second") \
    .option("checkpointLocation", "s3://bucket/checkpoints/ecommerce_events") \
    .toTable("db.ecommerce_events")
# Results in:
# ecommerce_events/
# ├── file_001.parquet (1MB - 1 second of data)
# ├── file_002.parquet (1MB - next second)
# ├── file_003.parquet (1MB - next second)
# └── ... (86,400 tiny files per day!)
Since Iceberg files are immutable, each micro-batch creates new files instead of appending to existing ones.
Optimal file size is 128MB-1GB, but streaming creates 1-10MB files - 100x more files than optimal.
Challenge 2: Metadata explosion
New metadata files (a snapshot and its manifests) are written on every commit.
// After 1 hour of streaming (3,600 commits):
{
"snapshots": [
{"snapshot-id": 1, "manifest-list": "snap-001.avro"},
{"snapshot-id": 2, "manifest-list": "snap-002.avro"},
// ... 3,598 more snapshots
{"snapshot-id": 3600, "manifest-list": "snap-3600.avro"}
]
}
Table metadata grows linearly with commits. Query planning becomes slower than actual queries as it scans thousands of manifests.
After one hour a simple query like this must open thousands of small files:
SELECT COUNT(*) FROM ecommerce_events WHERE date = '2024-01-15';
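You can watch this growth directly in the metadata tables; a quick sketch for tracking snapshot and manifest counts:
-- How much metadata has streaming accumulated?
SELECT COUNT(*) AS snapshot_count FROM db.ecommerce_events.snapshots;
SELECT COUNT(*) AS manifest_count FROM db.ecommerce_events.manifests;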
Challenge 3: The compaction performance window
Compaction addresses the small-file and metadata-explosion problems, but in streaming scenarios it introduces new ones, such as windows of degraded performance.
-- Timeline of degraded performance:
-- 10:00 - Stream starts writing small files
-- 10:30 - 1,800 small files created (30 min × 1/sec)
-- 10:30 - Background compaction starts
-- 10:35 - Compaction completes, creates 2 large files
-- Query at 10:29: Reads 1,800 small files (SLOW - 45 seconds)
-- Query at 10:36: Reads 2 large files (FAST - 3 seconds)
-- Problem: 30-minute window of degraded performance!
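One way to shrink (not eliminate) this window is to compact small recent slices more often and let the rewrite commit file groups as it goes. A sketch reusing the earlier options; the partial-progress options are supported by recent rewrite_data_files implementations:
-- Compact only the last few minutes, committing progress incrementally
CALL catalog.system.rewrite_data_files(
  table => 'db.ecommerce_events',
  strategy => 'binpack',
  where => 'event_time >= current_timestamp() - INTERVAL 10 MINUTES',
  options => map(
    'partial-progress.enabled', 'true',   -- commit finished file groups early
    'partial-progress.max-commits', '10',
    'min-input-files', '5'
  )
);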
Challenge 4: Concurrent writer conflicts
Scaling writes implies several concurrent workers writing to the same Iceberg table.
Due to the optimistic nature of concurrency in Iceberg, multiple streaming jobs writing to the same table may conflict, which can lead to exponential backoff retries and delays.
# Job 1: Processing user ecommerce_events
# Job 2: Processing system ecommerce_events
# Job 3: Processing audit ecommerce_events
# All trying to commit simultaneously:
# Commit 1: snapshot-1234 ✅ SUCCESS
# Commit 2: snapshot-1234 ❌ CONFLICT! (optimistic concurrency)
# Commit 3: snapshot-1234 ❌ CONFLICT!
# Jobs 2 & 3 must retry, causing delays and wasted work
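Iceberg retries failed commits automatically, and the retry behavior is tunable per table; raising the limits softens, but does not remove, the contention problem. A sketch with commonly adjusted settings:
-- Give concurrent streaming writers more room to retry conflicting commits
ALTER TABLE ecommerce_events SET TBLPROPERTIES (
  'commit.retry.num-retries' = '10',   -- default is 4
  'commit.retry.min-wait-ms' = '100',
  'commit.retry.max-wait-ms' = '60000'
);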
Challenge 5: Real-time aggregations and multiple indices
Real applications have multiple query patterns that need different sort orders:
-- Query Pattern 1: User analytics (90% of queries)
SELECT * FROM ecommerce_events WHERE user_id = 12345 AND event_time > '2024-01-01';
-- Needs: ORDER BY user_id, event_time
-- Query Pattern 2: Product analytics (5% of queries)
SELECT * FROM ecommerce_events WHERE product_id = 789 AND event_time > '2024-01-01';
-- Needs: ORDER BY product_id, event_time
-- Query Pattern 3: Geographic analysis (5% of queries)
SELECT * FROM ecommerce_events WHERE region = 'US' AND event_type = 'purchase';
-- Needs: ORDER BY region, event_type
and pre-computed aggregations:
-- Users want real-time dashboards showing:
SELECT
region,
event_type,
COUNT(*) as ecommerce_events,
SUM(revenue) as total_revenue,
COUNT(DISTINCT user_id) as unique_users
FROM ecommerce_events
WHERE event_time >= current_timestamp() - INTERVAL 1 HOUR
GROUP BY region, event_type;
Iceberg approach: Run the aggregation query each time
-- Manual pre-aggregation with Iceberg
CREATE TABLE hourly_region_stats AS
SELECT
date_trunc('hour', event_time) as hour,
region,
event_type,
COUNT(*) as ecommerce_events,
SUM(revenue) as total_revenue
FROM ecommerce_events
GROUP BY date_trunc('hour', event_time), region, event_type;
-- But now you need to:
-- 1. Keep it in sync with source table
-- 2. Handle late-arriving data
-- 3. Manage incremental updates
-- 4. Deal with duplicate processing
Specialized Real-Time Analytics Platforms
Apache Iceberg provides foundational features for high-performance and scalable analytics workloads, but real-time, user-facing analytics require specialized tools.
Use Iceberg when
- Batch analytics with infrequent writes
- Data lake scenarios with complex ETL
- Schema evolution and time travel are critical
- Query latency of 5-30 seconds is acceptable
Use specialized platforms when
- Real-time apps need high concurrency and consistently sub-second queries
- High-frequency streaming writes (>1,000 events/sec)
- Multiple query patterns need different indexes
- Pre-aggregations must update incrementally
- Developer-friendly workflows are a priority
┌───────────────┐      ┌──────────────────────────┐                ┌───────────────────────┐
│               │      │                          │                │                       │
│    Events     │      │         Tinybird         │   iceberg()    │        Iceberg        │
│    stream     │─────>│  (Real-time analytics)   │<───────────────│  (Long-term storage)  │
│    (Kafka)    │      │                          │   function     │                       │
└───────────────┘      └────────────┬─────────────┘                └───────────────────────┘
                                     │
                                     │ real-time API
                                     v
                        ┌──────────────────────────┐
                        │                          │
                        │        Real-time         │
                        │       application        │
                        │                          │
                        └──────────────────────────┘
Tinybird is a real-time analytics platform that can ingest both event streams and Iceberg tables. To learn more about this pattern, you can read this Iceberg + Redpanda + Tinybird blog post, which explains how to combine native Kafka streaming ingestion and incremental materialized views to serve sub-second public APIs while using Apache Iceberg as durable storage for analytical workloads.