Apache Iceberg has all the features needed for high-performance analytics, but success depends on how you use them.
While Iceberg excels at analytical workloads, adapting it for real-time analytics requires understanding the performance characteristics and trade-offs.
The most common mistakes engineers make are:
- Ignoring the fundamentals: Proper partitioning and sorting can make queries 1000x faster
- Optimizing before understanding: Always analyze query patterns first, then optimize
- Assuming more features = better performance: Each optimization has trade-offs
- Not understanding the streaming challenges: Small files, metadata explosion, and compaction overhead
Let's explore Iceberg's partitioning, sorting, and compaction features to build high-performance real-time analytics systems.
1. Partitioning
Partitioning is a way to organize your data into logical groups. Think of partitioning as "Which folder should I look in?"
- ❌ Without partitioning, every query scans all files.
- ✅ Partition by your most common query filters.
Some practical examples:
Basic time-based partitioning
-- Partition by day for time-based queries
CREATE TABLE ecommerce_events (
event_id BIGINT,
user_id BIGINT,
event_time TIMESTAMP,
event_type STRING,
product_id BIGINT,
region STRING,
session_id STRING,
revenue DECIMAL(10,2)
) USING ICEBERG
PARTITIONED BY (days(event_time));
Directory structure
ecommerce_events/
├── event_time_day=2024-01-15/
│   └── data files
├── event_time_day=2024-01-16/
│   └── data files
└── event_time_day=2024-01-17/
    └── data files
Sample query
SELECT COUNT(*) FROM ecommerce_events
WHERE event_time = '2024-01-15';
-- Files read: Only event_time_day=2024-01-15 ✅
Query plan
-- Query plan shows partition filters:
== Physical Plan ==
... PartitionFilters: [isnotnull(event_time_day), (event_time_day = 2024-01-15)]
Multi-dimensional partitioning
-- Partition by time AND region for better pruning
CREATE TABLE ecommerce_events (
event_id BIGINT,
user_id BIGINT,
event_time TIMESTAMP,
event_type STRING,
product_id BIGINT,
region STRING,
session_id STRING,
revenue DECIMAL(10,2)
) USING ICEBERG
PARTITIONED BY (
days(event_time),
region,
bucket(32, user_id) -- Hash bucketing for load distribution
);
Directory structure
ecommerce_events/
├── event_time_day=2024-01-15/
│   ├── region=US/
│   │   ├── user_id_bucket=0/
│   │   │   ├── part-00000.parquet
│   │   │   └── part-00001.parquet
│   │   ├── user_id_bucket=1/
│   │   │   └── part-00002.parquet
│   │   └── ... (buckets 2-31)
│   └── region=EU/
│       ├── user_id_bucket=0/
│       └── ... (buckets 1-31)
├── event_time_day=2024-01-16/
│   ├── region=US/
│   │   ├── user_id_bucket=0/
│   │   └── ... (buckets 1-31)
│   ├── region=EU/
│   └── region=APAC/
└── event_time_day=2024-01-17/
    ├── region=US/
    ├── region=EU/
    └── region=APAC/
Sample query
SELECT COUNT(*) FROM ecommerce_events
WHERE event_time = '2024-01-15' AND region = 'US';
-- Files read: Only event_time_day=2024-01-15/region=US/ ✅
Query plan
...
PartitionFilters: [
isnotnull(event_time_day#999),
(event_time_day#999 = 2024-01-15),  ← Time partition elimination
isnotnull(region#789),
(region#789 = US)  ← Region partition elimination
]
...
PartitionsRead: 32 (out of 15,360 total partitions)
FilesRead: 64 (out of 245,760 total files)
Always define partitions based on actual query patterns. Some examples:
- For time-series queries:
PARTITIONED BY (
days(event_time), -- Primary: time filtering (days or months depending on volume)
bucket(8, user_id) -- Secondary: load balancing only
)
- For multi-tenant queries:
PARTITIONED BY (
tenant_id, -- Primary: perfect isolation
days(event_time), -- Secondary: time pruning
bucket(4, user_id) -- Tertiary: small buckets
)
- For region-heavy queries:
PARTITIONED BY (
region, -- Primary: geographic filtering
days(event_time), -- Secondary: time pruning
bucket(16, user_id) -- Tertiary: parallelism
)
Rules of thumb when partitioning Iceberg tables
- Maximum 2-3 partition columns
- 10-100 files per partition
- 1GB-100GB total size per partition
- Monitor and evolve partitions when required
Anti-patterns
Too many small partitions:
-- BAD: Creates tiny partitions
PARTITIONED BY (
hours(event_time), -- 24 partitions per day
region, -- × 10 regions
event_type, -- × 20 event types
user_segment -- × 5 segments
)
-- Result: 24,000 tiny partitions per day
High cardinality partitions:
-- BAD: Partition explosion
PARTITIONED BY (
user_id, -- Millions of partitions
session_id -- Even more partitions
)
-- Result: Metadata larger than data
Ignoring query patterns:
-- BAD: Partitioned by write pattern, not read pattern
PARTITIONED BY (
ingestion_batch_id -- How data arrives
)
-- But queries filter by:
WHERE event_time > '...' AND region = '...' -- Different columns!
When to repartition
Inspect your table metadata and query patterns regularly so you know when the partition spec needs to evolve.
-- Partition health check
WITH partition_stats AS (
SELECT
partition,
COUNT(*) as file_count,
SUM(file_size_in_bytes) as partition_bytes,
AVG(file_size_in_bytes) as avg_file_bytes
FROM table_name.files
GROUP BY partition
)
SELECT
COUNT(*) as total_partitions,
AVG(file_count) as avg_files_per_partition,
MAX(file_count) as max_files_per_partition,
AVG(partition_bytes) / (1024*1024*1024) as avg_partition_gb,
COUNT(CASE WHEN file_count > 1000 THEN 1 END) as problematic_partitions
FROM partition_stats;
- Too many files per partition: > 1,000 files → add bucketing
- Too few files per partition: < 5 files → reduce time granularity
- Skewed partitions: one partition > 10× the average → add sub-partitioning
- Query performance degraded → re-align partitioning with query patterns
When any of these shows up, evolve the partition spec, as shown in the sketch below.
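Iceberg supports in-place partition evolution, so you can apply these fixes without rewriting the table up front. A minimal sketch, assuming the Iceberg Spark SQL extensions are enabled and reusing the field names from the examples above; only new data follows the new spec, and old files keep their layout until they are rewritten:
-- Add hash bucketing to break up skewed or oversized partitions (new writes only)
ALTER TABLE db.ecommerce_events ADD PARTITION FIELD bucket(16, user_id);
-- Coarsen the time granularity if daily partitions are too small
ALTER TABLE db.ecommerce_events REPLACE PARTITION FIELD event_time_day WITH months(event_time);
-- Optionally rewrite old files so they follow the current spec
CALL catalog.system.rewrite_data_files(table => 'db.ecommerce_events', strategy => 'binpack');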
2. Sorting
Sorting is a way to organize your data into a specific order. Think of sorting as a way to "skip files within partitions and blocks within files".
- ❌ Even within partitions, data might be randomly distributed across files.
- ✅ Sort data within files to create useful min/max statistics.
Regular sorting
-- Sort by primary access pattern
-- This creates a strict ordering: first by user_id, then by event_time
CALL catalog.system.rewrite_data_files(
  table => 'db.ecommerce_events',
  strategy => 'sort',
  sort_order => 'user_id, event_time'
);
File layout after regular sorting:
| user_id | event_time |
|---------|------------|
| 1       | 10:00      |
| 1       | 10:10      |
| 1       | 10:30      |
| 2       | 10:15      |
| 2       | 10:25      |
| 2       | 10:35      |
| 3       | 10:05      |
| 3       | 10:20      |
All user_id records are grouped together, and within each user the rows are ordered by event_time.
- Primary sort key (user_id): [██████████] → excellent clustering
- Secondary sort key (event_time): [█████     ] → scattered across the file
When to use regular sorting:
- Single primary access pattern (e.g., 90% of queries filter by event_time)
SELECT * FROM ecommerce_events WHERE event_time BETWEEN '2024-01-01' AND '2024-01-02';
- Hierarchical access (tenant β date β user)
SELECT * FROM multi_tenant_events
WHERE tenant_id = 'company_a' AND date >= '2024-01-01' AND user_id = 12345;
- Time-series with mostly recent data access
sort_order => 'timestamp DESC' -- Most recent data first
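If a single pattern really dominates, you can also persist the sort order as the table's write order, so that compliant writers sort data on insert. A minimal sketch, assuming the Iceberg Spark SQL extensions are enabled; it only affects new writes, so existing files still need rewrite_data_files:
-- Persist a write order on the table (most recent data first, then by user)
ALTER TABLE db.ecommerce_events WRITE ORDERED BY event_time DESC, user_id;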
Z-Ordering for multi-dimensional queries
Use Z-ordering for queries that filter on multiple dimensions. Data gets interleaved so that locality is preserved across all Z-order columns.
CALL catalog.system.rewrite_data_files(
  table => 'db.ecommerce_events',
  strategy => 'sort',
  sort_order => 'zorder(user_id, event_time)'
);
File layout after z-ordering:
| user_id | event_time |
|---------|------------|
| 1       | 10:00      |
| 2       | 10:15      |
| 1       | 10:10      |
| 3       | 10:05      |
| 1       | 10:30      |
| 2       | 10:25      |
| 3       | 10:20      |
| 2       | 10:35      |
All sort keys: [█████████ ] → balanced clustering
When to use z-ordering:
- Queries filter on 2-4 columns together
SELECT * FROM ecommerce_events WHERE user_id = 12345 AND event_time > '2024-01-01';
SELECT * FROM ecommerce_events WHERE user_id BETWEEN 1000 AND 2000 AND event_time BETWEEN '2024-01-01' AND '2024-01-31';
- Range queries on multiple columns (such as geospatial data)
SELECT * FROM locations WHERE latitude BETWEEN 40.0 AND 41.0 AND longitude BETWEEN -74.0 AND -73.0;
- High-cardinality combinations
SELECT * FROM purchases
WHERE user_id IN (1,2,3) AND product_id IN (100,200) AND timestamp > '2024-01-01';
Z-ordering trades some single-dimension performance for better multi-dimensional performance, while regular sorting optimizes for the primary sort column at the expense of secondary columns.
File-Level statistics impact
-- Before sorting: user_id randomly distributed
-- File 1: user_id range [1-1000, 5000-6000] (fragmented)
-- File 2: user_id range [2000-3000, 500-800] (fragmented)
-- After sorting: clean ranges
-- File 1: user_id range [1-1000] β min=1, max=1000
-- File 2: user_id range [1001-2000] β min=1001, max=2000
SELECT * FROM ecommerce_events
WHERE user_id = 150 AND event_time >= '2024-01-15';
-- Only reads File 1 (150 is between 1-1000) β
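You can check whether sorting actually produced tight ranges by querying the files metadata table. A rough sketch; the readable_metrics column is available in recent Iceberg releases, while older versions only expose binary lower_bounds/upper_bounds maps:
-- Inspect per-file statistics to confirm clean user_id ranges
SELECT
  file_path,
  record_count,
  readable_metrics.user_id.lower_bound AS user_id_min,
  readable_metrics.user_id.upper_bound AS user_id_max
FROM db.ecommerce_events.files
ORDER BY user_id_min;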
3. Compaction
Iceberg partitions are composed of immutable files, so frequent writes can create many small files, hurting query performance.
Compaction is a process, typically run in the background, that merges several small files into bigger ones.
The process may look simple: just run the compaction procedure:
-- Compact all small files
CALL catalog.system.rewrite_data_files(
  table => 'db.ecommerce_events',
  strategy => 'binpack',
  options => map('target-file-size-bytes', '268435456') -- 256MB
);
But in reality, this is costly. Instead, you want to run compaction only over the necessary files (unless you want to rewrite a full table due to some schema evolution):
-- Compact only files from the last hour
CALL catalog.system.rewrite_data_files(
table => 'db.ecommerce_events',
strategy => 'binpack',
where => 'event_time >= current_timestamp() - INTERVAL 1 HOUR',
options => map(
'target-file-size-bytes', '268435456',
'min-input-files', '5', -- Only compact if at least 5 files
'max-concurrent-file-group-rewrites', '5' -- Parallelism level
)
);
In real-world workloads you need to be smarter and customize your compaction logic so it runs only when needed:
-- Check if compaction is needed
WITH small_files_check AS (
SELECT
COUNT(*) as small_file_count,
AVG(file_size_in_bytes) as avg_size_bytes
FROM ecommerce_events.files
WHERE file_size_in_bytes < 67108864 -- < 64MB
AND partition LIKE '%2024-01-15%' -- Today's data
)
SELECT
CASE
WHEN small_file_count >= 10 THEN 'COMPACT_RECOMMENDED'
ELSE 'NO_ACTION_NEEDED'
END as recommendation
FROM small_files_check;
Metadata optimization
An explosion of metadata files can hurt performance. Two good practices here:
- Expire snapshots, and
- Compact manifest files
-- Enable snapshot expiration for faster metadata reads
ALTER TABLE ecommerce_events SET TBLPROPERTIES (
'history.expire.min-snapshots-to-keep' = '5',
'history.expire.max-snapshot-age-ms' = '86400000' -- 1 day
);
-- Compact manifests regularly
CALL catalog.system.rewrite_manifests('db.ecommerce_events');
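Note that the table properties above define the retention policy, but most engines only apply it when a maintenance operation runs; in Spark you typically trigger the cleanup explicitly. A sketch using the standard maintenance procedures:
-- Expire old snapshots (keeps time travel history bounded)
CALL catalog.system.expire_snapshots(
  table => 'db.ecommerce_events',
  older_than => TIMESTAMP '2024-01-15 00:00:00',
  retain_last => 5
);
-- Clean up files no longer referenced by any snapshot
CALL catalog.system.remove_orphan_files(table => 'db.ecommerce_events');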
The 6-Level Data Elimination Stack
When you combine partitioning, sorting, and compaction, Iceberg creates a powerful 6-level data elimination system.
Let's trace through how a single query gets optimized:
SELECT user_id, event_type
FROM ecommerce_events
WHERE event_time = '2024-01-15' AND user_id = 150;
Level 1: Partition pruning
- Iceberg skips entire directories based on partition columns
- Only reads partitions where event_time = '2024-01-15'
-- Table structure:
ecommerce_events/
├── event_time_day=2024-01-14/  ❌ SKIP (date < 2024-01-15)
├── event_time_day=2024-01-15/  ✅ READ (matches filter)
├── event_time_day=2024-01-16/  ❌ SKIP (not needed)
└── event_time_day=2024-01-17/  ❌ SKIP (not needed)
-- Result: Skip entire directories based on partition columns
-- Data eliminated: 75% (3 out of 4 days)
Level 2: File elimination
- Within event_time_day=2024-01-15/, skip files whose user_id range doesn't include 150
- Uses file-level statistics (min/max values) to skip files
-- Within event_time_day=2024-01-15/:
├── file_001.parquet → user_id range [1-1000]    ✅ READ (150 is in range)
├── file_002.parquet → user_id range [1001-2000]  ❌ SKIP (150 < 1001)
├── file_003.parquet → user_id range [2001-3000]  ❌ SKIP (150 < 2001)
└── file_004.parquet → user_id range [3001-4000]  ❌ SKIP (150 < 3001)
-- Result: Skip files using min/max statistics
-- Data eliminated: 75% more (3 out of 4 files)
Level 3: Row group elimination (inside parquet)
- Within the chosen file, skip row groups whose user_id min/max doesn't include 150
- Uses Parquet column statistics
-- Within file_001.parquet:
├── Row Group 1 → user_id range [1-500]      ✅ READ (150 is in range)
├── Row Group 2 → user_id range [501-1000]   ❌ SKIP (150 < 501)
└── Row Group 3 → user_id range [1001-1500]  ❌ SKIP (150 < 1001)
-- Result: Skip row groups using Parquet statistics
-- Data eliminated: 66% more (2 out of 3 row groups)
Level 4: Column elimination (projection pushdown)
- Only read the user_id and event_type column chunks, skip the others
-- Row Group 1 contains columns:
├── user_id column chunk     ✅ READ (needed for SELECT)
├── event_type column chunk  ✅ READ (needed for SELECT)
├── event_time column chunk  ❌ SKIP (only used in WHERE, already filtered)
├── product_id column chunk  ❌ SKIP (not needed)
├── revenue column chunk     ❌ SKIP (not needed)
└── session_id column chunk  ❌ SKIP (not needed)
-- Result: Only deserialize needed columns
-- Data eliminated: 66% more (4 out of 6 columns)
Level 5: Page elimination (within row groups)
- Within row groups, skip pages whose user_id range doesn't include 150
-- Within user_id column chunk:
├── Page 1 → user_id range [1-100]    ❌ SKIP (150 > 100)
├── Page 2 → user_id range [101-200]  ✅ READ (150 is in range)
├── Page 3 → user_id range [201-300]  ❌ SKIP (150 < 201)
└── Page 4 → user_id range [301-400]  ❌ SKIP (150 < 301)
-- Result: Skip pages within column chunks
-- Data eliminated: 75% more (3 out of 4 pages)
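How much Levels 3 and 5 can skip depends on how the Parquet files were written: row group and page sizes are table write properties. A sketch with values close to the defaults, which you can tune per workload; smaller units give finer-grained skipping at the cost of more metadata and worse compression:
-- Tune Parquet row group and page sizes (values shown are roughly the defaults)
ALTER TABLE ecommerce_events SET TBLPROPERTIES (
  'write.parquet.row-group-size-bytes' = '134217728', -- 128MB row groups
  'write.parquet.page-size-bytes' = '1048576'         -- 1MB pages
);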
Level 6: Row filtering (post-deserialization)
Important: Unlike the previous 5 levels, this is filtering, not elimination.
- The entire page must be read and deserialized from storage
- Row-level filtering happens in memory after reading the page
-- Within Page 2 (user_id range [101-200]):
├── Read entire page from storage → ALL rows [101-200] deserialized
├── Apply filter in memory: user_id = 150
└── Return only matching row(s)
-- Result: I/O saved = 0% (entire page must be read)
-- Processing saved = ~99% (only matching rows processed further)
Performance impact
Here's the cumulative effect of the 6-level optimization:
| Level | Elimination Target | Data Remaining | Cumulative I/O Savings |
|---|---|---|---|
| Original Dataset | - | 1TB (100%) | - |
| 1. Partition | Skip directories | 250GB (25%) | 75% saved |
| 2. File | Skip files | 62GB (6.25%) | 94% saved |
| 3. Row Group | Skip row groups | 21GB (2.1%) | 98% saved |
| 4. Column | Skip columns | 7GB (0.7%) | 99.3% saved |
| 5. Page | Skip pages | 1.7GB (0.17%) | 99.8% saved |
| 6. Row | Filter rows | 1.7GB (0.17%) | 99.8% I/O saved |
Levels 1-5: True data elimination
- These levels skip reading data entirely from storage
- Provide both I/O savings and processing savings
- This is where the massive performance gains come from
Level 6: In-memory filtering
- Still reads the data from storage (no additional I/O savings)
- Filters out unwanted rows in memory after deserialization
- Provides CPU/memory savings but not I/O savings
- This is traditional SQL row-level filtering
Final Result: Read 1.7GB instead of 1TB - that's 99.8% less I/O!
When Iceberg isn't enough
While the optimizations above work well for analytical workloads, streaming and real-time analytics expose fundamental limitations in Iceberg's architecture.
Streaming writes to Iceberg create several fundamental challenges due to its architecture based on immutable snapshots.
Challenge 1: Small file explosion
See this sample code:
# Streaming scenario: new ecommerce_events every second
# (checkpoint location below is an illustrative placeholder)
stream.writeStream \
    .format("iceberg") \
    .outputMode("append") \
    .trigger(processingTime="1 second") \
    .option("checkpointLocation", "s3://bucket/checkpoints/ecommerce_events") \
    .toTable("db.ecommerce_events")
# Results in:
# ecommerce_events/
# ├── file_001.parquet (1MB - 1 second of data)
# ├── file_002.parquet (1MB - next second)
# ├── file_003.parquet (1MB - next second)
# └── ... (86,400 tiny files per day!)
Since Iceberg files are immutable, each micro-batch creates new files instead of appending to existing ones.
Optimal file size is 128MB-1GB, but streaming creates 1-10MB files - 100x more files than optimal.
Challenge 2: Metadata explosion
New metadata files (a snapshot and its manifests) are written on every commit.
// After 1 hour of streaming (3,600 commits):
{
"snapshots": [
{"snapshot-id": 1, "manifest-list": "snap-001.avro"},
{"snapshot-id": 2, "manifest-list": "snap-002.avro"},
// ... 3,598 more snapshots
{"snapshot-id": 3600, "manifest-list": "snap-3600.avro"}
]
}
Table metadata grows linearly with commits. Query planning becomes slower than actual queries as it scans thousands of manifests.
After one hour a simple query like this must open thousands of small files:
SELECT COUNT(*) FROM ecommerce_events WHERE date = '2024-01-15';
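You can watch this growth directly in the metadata tables; a quick sketch for tracking snapshot and manifest counts:
-- How much metadata has streaming accumulated?
SELECT COUNT(*) AS snapshot_count FROM db.ecommerce_events.snapshots;
SELECT COUNT(*) AS manifest_count FROM db.ecommerce_events.manifests;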
Challenge 3: The compaction performance window
Compaction addresses the small-file and metadata-explosion problems, but in streaming scenarios it introduces new ones, such as windows of degraded performance.
-- Timeline of degraded performance:
-- 10:00 - Stream starts writing small files
-- 10:30 - 1,800 small files created (30 min × 1/sec)
-- 10:30 - Background compaction starts
-- 10:35 - Compaction completes, creates 2 large files
-- Query at 10:29: Reads 1,800 small files (SLOW - 45 seconds)
-- Query at 10:36: Reads 2 large files (FAST - 3 seconds)
-- Problem: 30-minute window of degraded performance!
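One way to shrink (not eliminate) this window is to compact small recent slices more often and let the rewrite commit file groups as it goes. A sketch reusing the earlier options; the partial-progress options are supported by recent rewrite_data_files implementations:
-- Compact only the last few minutes, committing progress incrementally
CALL catalog.system.rewrite_data_files(
  table => 'db.ecommerce_events',
  strategy => 'binpack',
  where => 'event_time >= current_timestamp() - INTERVAL 10 MINUTES',
  options => map(
    'partial-progress.enabled', 'true',   -- commit finished file groups early
    'partial-progress.max-commits', '10',
    'min-input-files', '5'
  )
);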
Challenge 4: Concurrent writer conflicts
Scaling writes implies several concurrent workers writing to the same Iceberg table.
Due to the optimistic nature of concurrency in Iceberg, multiple streaming jobs writing to the same table may conflict, which can lead to exponential backoff retries and delays.
# Job 1: Processing user ecommerce_events
# Job 2: Processing system ecommerce_events
# Job 3: Processing audit ecommerce_events
# All trying to commit simultaneously:
# Commit 1: snapshot-1234 ✅ SUCCESS
# Commit 2: snapshot-1234 ❌ CONFLICT! (optimistic concurrency)
# Commit 3: snapshot-1234 ❌ CONFLICT!
# Jobs 2 & 3 must retry, causing delays and wasted work
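Iceberg retries failed commits automatically, and the retry behavior is tunable per table; raising the limits softens, but does not remove, the contention problem. A sketch with commonly adjusted settings:
-- Give concurrent streaming writers more room to retry conflicting commits
ALTER TABLE ecommerce_events SET TBLPROPERTIES (
  'commit.retry.num-retries' = '10',   -- default is 4
  'commit.retry.min-wait-ms' = '100',
  'commit.retry.max-wait-ms' = '60000'
);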
Challenge 5: Real-time aggregations and multiple indices
Real applications have multiple query patterns that need different sort orders:
-- Query Pattern 1: User analytics (90% of queries)
SELECT * FROM ecommerce_events WHERE user_id = 12345 AND event_time > '2024-01-01';
-- Needs: ORDER BY user_id, event_time
-- Query Pattern 2: Product analytics (5% of queries)
SELECT * FROM ecommerce_events WHERE product_id = 789 AND event_time > '2024-01-01';
-- Needs: ORDER BY product_id, event_time
-- Query Pattern 3: Geographic analysis (5% of queries)
SELECT * FROM ecommerce_events WHERE region = 'US' AND event_type = 'purchase';
-- Needs: ORDER BY region, event_type
and pre-computed aggregations:
-- Users want real-time dashboards showing:
SELECT
region,
event_type,
COUNT(*) as ecommerce_events,
SUM(revenue) as total_revenue,
COUNT(DISTINCT user_id) as unique_users
FROM ecommerce_events
WHERE event_time >= current_timestamp() - INTERVAL 1 HOUR
GROUP BY region, event_type;
Iceberg approach: Run the aggregation query each time
-- Manual pre-aggregation with Iceberg
CREATE TABLE hourly_region_stats AS
SELECT
date_trunc('hour', event_time) as hour,
region,
event_type,
COUNT(*) as ecommerce_events,
SUM(revenue) as total_revenue
FROM ecommerce_events
GROUP BY date_trunc('hour', event_time), region, event_type;
-- But now you need to:
-- 1. Keep it in sync with source table
-- 2. Handle late-arriving data
-- 3. Manage incremental updates
-- 4. Deal with duplicate processing
Specialized Real-Time Analytics Platforms
Apache Iceberg provides foundational features for high-performance and scalable analytics workloads, but real-time, user-facing analytics require specialized tools.
Use Iceberg when
- Batch analytics with infrequent writes
- Data lake scenarios with complex ETL
- Schema evolution and time travel are critical
- Query latency of 5-30 seconds is acceptable
Use specialized platforms when
- Real-time apps need high concurrency and consistently sub-second queries
- High-frequency streaming writes (>1,000 events/sec)
- Multiple query patterns need different indexes
- Pre-aggregations must update incrementally
- Developer-friendly workflows are a priority
┌───────────────┐      ┌──────────────────────────┐                ┌───────────────────────┐
│               │      │                          │                │                       │
│    Events     │      │         Tinybird         │   iceberg()    │        Iceberg        │
│    stream     │─────>│  (Real-time analytics)   │<───────────────│  (Long-term storage)  │
│    (Kafka)    │      │                          │   function     │                       │
└───────────────┘      └────────────┬─────────────┘                └───────────────────────┘
                                     │
                                     │ real-time API
                                     v
                        ┌──────────────────────────┐
                        │                          │
                        │        Real-time         │
                        │       application        │
                        │                          │
                        └──────────────────────────┘
Tinybird is a real-time analytics platform that can ingest both event streams and Iceberg tables. To learn more about this pattern, you can read this Iceberg + Redpanda + Tinybird blog post, which explains how to combine native Kafka streaming ingestion and incremental materialized views to serve sub-second public APIs while using Apache Iceberg as durable storage for analytical workloads.