May 20, 2025

Real-Time Analytics on Apache Iceberg with Tinybird

Learn how to build real-time analytics APIs that scale over your Iceberg tables
Alberto Romeu, Backend Developer
Victor Ramirez, Software Engineer

We recently added experimental support for Apache Iceberg in Tinybird (in private beta). To test its capabilities in a real-world scenario, we chose the GitHub Archive dataset, a publicly available stream of GitHub event data.

We chose this dataset, well known among developers, to answer one question:

Can you analyze real-time GitHub activity at scale, using Apache Iceberg as your source of truth, and Tinybird as the engine for building APIs and insights?

Short answer: yes. But we had to take a few detours to make it fast.

What is Apache Iceberg?

Apache Iceberg is a high-performance open table format designed for large-scale analytics datasets. It was originally developed at Netflix to overcome limitations of Hive and other legacy table formats. Iceberg provides:

  • ACID compliance for big data
  • Schema evolution and partition evolution
  • Hidden partitioning and snapshot isolation
  • Compatibility with multiple engines (e.g. Spark, Trino, Flink, ClickHouse)

One of Iceberg's most compelling promises is write once, use many times. In traditional architectures, each system (analytics, ML, search, etc.) often requires its own copy of the data. With petabytes of data, that duplication can become costly and operationally painful. Iceberg changes that: you can write your data once (e.g. via Kafka) and serve it through many engines without moving it again (with caveats).

That's the promise, but as you'll see, Iceberg alone doesn't always deliver for real-time or low-latency scenarios. It's still incredibly valuable as a central data lake format, especially when paired with systems like Tinybird that can ingest and expose the data efficiently.


Initial approach: Direct queries on Iceberg

We started with a naive but attractive solution: query the Iceberg table directly from a pipe using the iceberg() function. This gives you instant access to the data without ingestion or duplication.

DESCRIPTION >
  most liked repos in the last day

NODE endpoint
SQL >
  %
  SELECT count() AS c, repo_name
  FROM iceberg('s3://tinybird-gharchive/iceberg/db/github_events', {{tb_secret('aws_access_key_id')}}, {{tb_secret('aws_secret_access_key')}})
  WHERE created_at > now() - interval 1 day
  AND type = 'WatchEvent'
  GROUP BY repo_name
  ORDER BY c DESC

TYPE ENDPOINT

This worked, but barely. Query latency was high and unpredictable out of the box. It wasn't viable for real-time analytics or user-facing APIs, for a few reasons:

  1. Iceberg support in ClickHouse is still not suited for real-time APIs.
  2. While the Apache Iceberg table format works well for data warehouse-style workloads, building real-time applications is a different beast.
  3. It requires knowledge of how to properly sort and partition tables, compact parts, or use aggregation indices in Iceberg, which is not trivial.

This approach could be good enough for small-scale projects, or for cases where you don't need true sub-second latency in your APIs, but for most real-time analytics needs, it won't work.

Final approach: Copy + Materialize + API

We shifted to a hybrid architecture, where:

  • Iceberg remains the source of truth.
  • Tinybird handles data synchronization from Iceberg, transformation, and serving via fast HTTP endpoints.

Tinybird has three types of pipes to enable this process.

Copy pipes, when used with the iceberg() table function, let you synchronize data from an Iceberg table to Tinybird. You can do one-off copy runs for backfilling and initial loads, or run them on a schedule to synchronize only the deltas.

Materialized views are a very convenient way to transform data for real-time analytics. They work as insert triggers, so whenever a copy pipe syncs data from Iceberg to a Tinybird data source, the new data is incrementally aggregated together with the historic data.

Finally, endpoint pipes expose your materialized aggregates as APIs, allowing scalable, low-latency, and secured access from your application code.

(Diagram: Real-Time Analytics on Apache Iceberg)

Here's a breakdown of how the pieces of this architecture work together.

Step 1: Copy from Iceberg into Tinybird

We built two copy pipes that synchronize data from the Iceberg table:

  • cp_github_events_historic: pulls historical ranges of events.
  • cp_github_events_rt: runs every 15 minutes to pull new events.

These populate two data sources:

  • github_events: long-term, durable events.
  • github_events_rt: short-term, real-time events with a TTL (sketched below).
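
For reference, a data source with a TTL in Tinybird is just a schema plus engine settings. A minimal sketch of what github_events_rt could look like (the exact schema, sorting key, and the 7-day TTL are assumptions, not the project's actual definition):

DESCRIPTION >
    Short-lived landing table for recent GitHub events synced from Iceberg

SCHEMA >
    `event_type` String,
    `actor_login` String,
    `repo_name` String,
    `created_at` DateTime64(3),
    `action` String

ENGINE "MergeTree"
ENGINE_SORTING_KEY "created_at, repo_name"
ENGINE_TTL "toDateTime(created_at) + INTERVAL 7 DAY"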

The historic copy pipe:

DESCRIPTION >
    Copy pipe that exports GitHub events filtered by date range to a target datasource

NODE github_events_filtered
SQL >
    %
    SELECT
        type as event_type,
        actor.login as actor_login,
        repo.name as repo_name,
        assumeNotNull(toDateTime64(created_at, 3)) as created_at,
        action as action
    FROM iceberg(
        's3://tinybird-gharchive/iceberg/db/github_events',
        {{tb_secret('aws_access_key_id')}},
        {{tb_secret('aws_secret_access_key')}}
    )
    WHERE
    created_at > {{DateTime(from_date)}}
    AND created_at <= {{DateTime(to_date)}}

TYPE COPY
TARGET_DATASOURCE github_events

You use this pipe for backfilling purposes:

tb --cloud copy run cp_github_events_historic --param from_date='2020-01-01 00:00:00' --param to_date='2025-05-14 00:00:00' --wait

The incremental copy pipe:
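
The exact definition isn't reproduced here; the following is a sketch of what it could look like, assuming a 15-minute cron schedule and a 30-minute lookback window to tolerate late-arriving events (the node name and filter are illustrative):

DESCRIPTION >
    Copy pipe that periodically syncs recent GitHub events from Iceberg

NODE github_events_recent
SQL >
    %
    SELECT
        type as event_type,
        actor.login as actor_login,
        repo.name as repo_name,
        assumeNotNull(toDateTime64(created_at, 3)) as created_at,
        action as action
    FROM iceberg(
        's3://tinybird-gharchive/iceberg/db/github_events',
        {{tb_secret('aws_access_key_id')}},
        {{tb_secret('aws_secret_access_key')}}
    )
    WHERE created_at > now() - interval 30 minute

TYPE COPY
TARGET_DATASOURCE github_events_rt
COPY_SCHEDULE */15 * * * *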


This pipe runs on schedule, syncing new data from Iceberg to Tinybird.

Step 2: Materialize for fast queries

We use materialized views to pre-aggregate GitHub events by day, repo, and type. Materialized views work as insert triggers that incrementally perform aggregations without having to recompute all historical data each time. This step is key for low-latency endpoints.

Materialization on github_events:

DESCRIPTION >
    Materialize GitHub Watch events (stars) count by repository and date

NODE watch_events_count
SQL >
    SELECT
        toDate(created_at) as date,
        countState() as count,
        uniqExactState(actor_login) as actor_count,
        repo_name,
        event_type
    FROM github_events
    GROUP BY date, repo_name, event_type

TYPE MATERIALIZED
DATASOURCE github_events_by_day

The same logic is applied to github_events_rt.
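
For this incremental aggregation to work, the target data source must use an AggregatingMergeTree engine with AggregateFunction columns matching the -State combinators above. A sketch of what github_events_by_day could look like (the sorting key is an assumption):

DESCRIPTION >
    Daily rollup of GitHub events by repository and event type

SCHEMA >
    `date` Date,
    `repo_name` String,
    `event_type` String,
    `count` AggregateFunction(count),
    `actor_count` AggregateFunction(uniqExact, String)

ENGINE "AggregatingMergeTree"
ENGINE_SORTING_KEY "date, repo_name, event_type"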

Step 3: Build endpoints

Once the data is flowing and aggregated, building fast APIs is trivial.

For example, an endpoint to fetch top-starred repos in a given period using the previously materialized data:

TOKEN read_token READ

NODE endpoint
SQL >
  %
  SELECT
    repo_name,
    countMerge(count) AS stars
  FROM github_events_by_day
  WHERE
    event_type = 'WatchEvent'
    AND date BETWEEN {{Date(start_date)}} AND {{Date(end_date)}}
  GROUP BY repo_name
  ORDER BY stars DESC
  LIMIT {{Int32(limit, 10)}}

TYPE ENDPOINT

This returns a response in milliseconds and can power multi-tenant dashboards or user-facing applications with high-concurrency, low-latency requirements.
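
Once published, the endpoint is a plain HTTP API. Assuming the pipe is named top_repos_by_stars (a hypothetical name; your API host may also differ by region), calling it looks like this:

curl "https://api.tinybird.co/v0/pipes/top_repos_by_stars.json?start_date=2025-05-01&end_date=2025-05-14&limit=10&token=$TB_READ_TOKEN"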

The full data flow

(Diagram: Real-Time Analytics on Apache Iceberg)

Here's the complete lineage of the project:

  • Copy from Iceberg into Tinybird (historic + realtime)
  • Materialized views for github_events_by_day
  • API endpoints for top repositories and other queries

This structure decouples ingestion and analytics while preserving freshness and speed.

Why this works

  • You keep Apache Iceberg as your durable, cloud-native source of truth.
  • Tinybird gives you interactive speed and automatic API generation.
  • Copy pipes let you incrementally load what you need, without overloading query engines.

You get a system that is scalable, real-time, and cost-efficient. And more importantly, it integrates well with developers' workflows.

Next steps

There are several things we plan to explore further:

  • Handle schema evolution dynamically
  • Merge historic and realtime events while deduplicating overlaps
  • Explore an alternative event-sourcing architecture where Kafka persists data to Iceberg while Tinybird consumes directly from Kafka topics for real-time applications.

Try it yourself

Apache Iceberg support is now in private beta. Join the Tinybird community or get in touch to request access to the private beta or tell us about your use case.

We've open-sourced the full Tinybird project for this setup. You can clone it, configure your Iceberg bucket, and run it locally or in the cloud.
