Our friends at Estuary have just released Dekaf, a Kafka protocol-compatible interface to Estuary Flow. This is great news for Tinybird users; you can now use Tinybird's existing Kafka Connector to consume CDC streams from the many, many data sources that Estuary integrates with!
Estuary's data sources cover a huge range of the data ecosystem, from the world's most popular database, PostgreSQL, to NoSQL stores like MongoDB and warehouses like Snowflake and BigQuery. There are too many to list here, so check out Estuary's connector page to find the connector you need!
Together, Estuary and Tinybird offer an incredible combination of price, performance, and productivity. Both platforms solve the developer experience problems common across the data ecosystem without compromising the computational efficiency of the underlying tech. In Tinybird's case, that's because we use the best-in-breed, open source database ClickHouse®. For Estuary, it's their open source Flow.
Like many, we had a warehouse-centric data platform that just couldn't support the latency and concurrency requirements for user-facing analytics at any reasonable cost. When we moved those use cases to Tinybird, we shipped them 5x faster and at a 10x lower cost than our prior approach.
Guy Needham, Staff Engineer at Canva
You can sign up for Tinybird and sign up for Estuary for free. No credit card is required. Need help? Join the Tinybird Slack Community and the Estuary Slack Community to get support with both products and the integration.
Continue reading to learn more about using Tinybird and Estuary together.
How it works
Estuary has released Dekaf, a new way to integrate with Estuary Flow that presents a Kafka protocol-compatible API, allowing any existing Kafka consumer to consume messages from Estuary Flow as if it were a Kafka broker. Thanks to Tinybird's existing Kafka Connector, that means Tinybird can now connect to Estuary Flow directly.
Tinybird provides various out-of-the-box Connectors to ingest data, with the most popular being Kafka, S3, and HTTP. While Tinybird also supports first-party integrations with tools like DynamoDB, Postgres, and Snowflake, it doesn't currently support the same breadth of data source integrations offered by Estuary. That's why we're so excited not only about the integration with Estuary Flow, but also about how it's built.
Dekaf is yet another example of the Kafka protocol becoming bigger than Kafka itself (something similar can also be said about the S3 API and the glut of S3-compatible storage systems available today). This has some pretty nice knock-on benefits. For you, it makes your data pipelines more consistent and easier to integrate with your existing stack, and it reduces vendor lock-in, as you can take advantage of the Kafka API's wide support in data tooling.
For Tinybird and Estuary, it means neither of our engineering teams needs to spend time building vendor-specific API integrations. Instead, we can focus on building the best real-time analytics platform, and Estuary can focus on solving real-time data capture.
We're huge fans of how Estuary is solving that problem. Real-time ETLs allow you to maintain data freshness in your pipeline and eliminate stale data with diminished value. When data arrives faster, it gets processed faster. When it gets processed faster, it provides answers faster. Fast data begets fast and responsive features and services that increase customer trust and adoption, turning your data into a revenue generator (instead of a cost center).
Get started with CDC using Tinybird and Estuary Flow's Dekaf
This example will walk you through setting up a PostgreSQL Change Data Capture pipeline that streams data into Tinybird in real time using Estuary Flow and Dekaf. You can also read more on the Tinybird docs, or see Estuary's docs.
Step 1: Create PostgreSQL Capture in Estuary Flow
With your PostgreSQL running and configured for CDC, create a capture in Estuary Flow.
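Estuary's PostgreSQL connector docs cover the exact prerequisites, but the database-side prep typically looks something like the sketch below (the role, password, and publication names are illustrative, and managed services like RDS expose wal_level through parameter groups instead of ALTER SYSTEM):

```sql
-- Enable logical replication (requires a server restart on self-hosted Postgres)
ALTER SYSTEM SET wal_level = logical;

-- A dedicated capture role with replication privileges (name is illustrative)
CREATE USER flow_capture WITH REPLICATION PASSWORD 'secret';
GRANT SELECT ON ALL TABLES IN SCHEMA public TO flow_capture;

-- A publication covering the tables you want to sync
CREATE PUBLICATION flow_publication FOR TABLE public.users;
```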
Configure the Endpoint by entering your Postgres server address, user, password, and database.
Select the tables that you want to sync by defining collections.
Press Save & Publish to initialize the connector. This will kick off a backfill first, then automatically switch into an incremental continuous change data capture mode.
Step 2: Create Dekaf Materialization in Estuary Flow
In Estuary, create a new Dekaf materialization to use for the Tinybird connection. You can create it from the Estuary destinations tab. For detailed instructions, see the Tinybird Dekaf Estuary docs page.
When creating the Dekaf materialization, Estuary will provide you with an auth token. Save this token along with your materialization task name (in the format YOUR-ORG/YOUR-PREFIX/YOUR-MATERIALIZATION), as you'll need both for the Tinybird connection.
Step 3: Install and Set up Tinybird
Before connecting to Estuary Dekaf, you need to install the Tinybird CLI and authenticate with your account. This guide uses the CLI for a hands-on technical workflow.
Install Tinybird CLI
First, install the Tinybird CLI on your machine:
```bash
curl -L tinybird.co | sh
```

This installs the Tinybird CLI tool and sets up Tinybird Local for local development. For more installation options, see the Tinybird installation guide.
Authenticate with Tinybird
Next, authenticate with your Tinybird account:
```bash
tb login
```

This command opens a browser window where you can sign in to Tinybird Cloud. If you don't have an account yet, you can create one during this process. After signing in, create a new workspace or select an existing one.
For a complete quick start guide, see Get started with Tinybird.
Step 4: Connect Tinybird to Estuary Dekaf
With CDC events being captured in Estuary Flow and exposed via Dekaf, your next step is connecting Tinybird to Estuary Dekaf. This is quite simple using the Tinybird Kafka Connector, which enables Tinybird to securely consume messages from Estuary Dekaf as if it were a Kafka broker.
The Kafka Connector is fully managed and requires no additional tooling. Simply connect Tinybird to Estuary Dekaf, choose a collection (topic), and Tinybird will automatically begin consuming messages. As part of the ingestion process, Tinybird will extract JSON event objects with attributes that are parsed and stored in its underlying real-time database.
Create a Kafka connection
First, create a connection to Estuary Dekaf using the Tinybird CLI. You'll need the materialization task name and auth token from Step 2.
Run the following command to start the interactive wizard:
```bash
tb connection create kafka
```

The wizard will prompt you to enter:

- A name for your connection (e.g., `estuary_dekaf`)
- The bootstrap server: `dekaf.estuary-data.com`
- Security protocol: `SASL_SSL`
- SASL mechanism: `PLAIN`
- SASL username: Your materialization task name (e.g., `YOUR-ORG/YOUR-PREFIX/YOUR-MATERIALIZATION`)
- SASL password: The auth token from your Estuary Dekaf materialization
If you're using Schema Registry (recommended for Avro messages), the wizard will also prompt you for:
- Schema Registry URL: `https://dekaf.estuary-data.com`
- Schema Registry username: Same as your materialization task name
- Schema Registry password: Same as your auth token
Create a Kafka Data Source
Now, create a Data Source that will consume messages from your Estuary collection. You can use the guided CLI process or create the files manually.
Option 1: Use the guided CLI process (recommended)
Run the following command to start the guided process:
```bash
tb datasource create --kafka
```

The CLI will prompt you to:

- Select or enter the connection name (use the name you created above, e.g., `estuary_dekaf`)
- Enter the Kafka topic name (this is the Estuary collection name you want to ingest)
- Enter a consumer group ID (use a unique name, e.g., `estuary_cdc_consumer`)
- Choose the offset reset behavior (`earliest` to read from the beginning, or `latest` to read only new messages)
Option 2: Manually create the Data Source files
Alternatively, you can create the connection and Data Source files manually. First, if you haven't already, create the connection file connections/estuary_dekaf.connection.
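The exact contents depend on your project, but a minimal sketch using Tinybird's Kafka connection settings with the Dekaf values from Step 2 might look like this (the :9092 port and the ESTUARY_AUTH_TOKEN secret name are assumptions; store the token as a Tinybird secret rather than hardcoding it):

```tinybird
TYPE kafka

KAFKA_BOOTSTRAP_SERVERS dekaf.estuary-data.com:9092
KAFKA_SECURITY_PROTOCOL SASL_SSL
KAFKA_SASL_MECHANISM PLAIN
KAFKA_KEY YOUR-ORG/YOUR-PREFIX/YOUR-MATERIALIZATION
KAFKA_SECRET {{ tb_secret("ESTUARY_AUTH_TOKEN") }}
```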
If you're using Schema Registry for Avro messages, add these settings to your connection file:
```tinybird
KAFKA_SCHEMA_REGISTRY_URL https://dekaf.estuary-data.com
KAFKA_SCHEMA_REGISTRY_USERNAME YOUR-ORG/YOUR-PREFIX/YOUR-MATERIALIZATION
KAFKA_SCHEMA_REGISTRY_PASSWORD {{ tb_secret("ESTUARY_AUTH_TOKEN") }}
```

Then, create a Data Source file (e.g., datasources/postgres_cdc.datasource) that references this connection. Here's an example that defines a Tinybird Data Source to hold the change events from your PostgreSQL table. In your case, the SCHEMA should match the data in your Estuary collection. Use JSONPath expressions to extract specific fields from the CDC events into separate columns:
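As a sketch (the columns are illustrative and mirror the example response later in this post; the KAFKA_* settings reuse the connection, collection, and consumer group names from the guided flow above):

```tinybird
SCHEMA >
    `id` Int64 `json:$.id`,
    `name` String `json:$.name`,
    `email` String `json:$.email`,
    `created_at` DateTime `json:$.created_at`,
    `updated_at` DateTime `json:$.updated_at`

ENGINE "MergeTree"
ENGINE_SORTING_KEY "id, updated_at"

KAFKA_CONNECTION_NAME estuary_dekaf
KAFKA_TOPIC your_estuary_collection_name
KAFKA_GROUP_ID estuary_cdc_consumer
KAFKA_AUTO_OFFSET_RESET earliest
```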
Replace your_estuary_collection_name with the actual Estuary collection name. Adjust the schema fields to match the structure of your source data. The __timestamp column is automatically added by Tinybird and represents when the event was ingested.
Deploy the Data Source
After creating your connection and Data Source files, deploy them to Tinybird Cloud:
```bash
tb --cloud deploy
```

You can also validate the setup before deploying by running:

```bash
tb --cloud deploy --check
```

This will verify that Tinybird can connect to Estuary Dekaf with the provided credentials.
Once deployed, Tinybird will automatically begin consuming messages from your Estuary collection, and you'll start seeing CDC events stream into your Data Source as changes are made to the source data system.
Step 5: Handle Deduplication for CDC at Scale
When implementing CDC at scale, deduplication is essential. CDC streams can produce duplicate events due to network retries, connector restarts, or Kafka consumer rebalancing. Without proper deduplication, you'll have inconsistent analytics and incorrect aggregations.
There are several strategies for handling deduplication in Tinybird:
- ReplacingMergeTree Engine: Use Tinybird's `ReplacingMergeTree` engine to automatically deduplicate rows based on a primary key. This is ideal when you have a unique identifier (like a document ID) and a timestamp or version field that indicates the latest state.
- Lambda Architecture: Implement a Lambda Architecture pattern where you maintain both a real-time stream and a batch layer. The batch layer provides the source of truth, while the stream layer provides low-latency updates. This approach is particularly effective for handling late-arriving data and ensuring eventual consistency.
- Query-time Deduplication: Use SQL functions like `argMax` to deduplicate at query time (see the sketch below). This approach is flexible but can impact query performance on large datasets.
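As a rough illustration of the query-time approach, and assuming the postgres_cdc Data Source sketched above, an argMax query that collapses CDC events to the latest state per id might look like this:

```sql
-- Keep only the latest version of each row, using updated_at as the version column
SELECT
    id,
    argMax(name, updated_at) AS name,
    argMax(email, updated_at) AS email,
    max(updated_at) AS updated_at
FROM postgres_cdc
GROUP BY id
```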
For detailed guidance on implementing these strategies, see:
- Deduplication Strategies: Comprehensive guide on deduplication techniques in Tinybird
- Lambda Architecture: Guide on implementing Lambda Architecture patterns for real-time analytics
Step 6: Start building real-time analytics with Tinybird
Now your CDC data pipeline should be up and running, capturing changes from your source database, streaming them through Estuary Flow and Dekaf, and sinking them into an analytical datastore on Tinybird's real-time data platform.
You can now query, shape, join, and enrich your CDC data with SQL Pipes and instantly publish your transformations as high-concurrency, low-latency APIs to power your next use case.
For example, create a Pipe file (e.g., pipes/get_cdc_changes.pipe) to query your CDC data:
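A minimal sketch, reusing the postgres_cdc Data Source and columns assumed earlier (the node name is illustrative, and TYPE endpoint publishes the node as an API Endpoint):

```tinybird
NODE get_changes
SQL >
    SELECT id, name, email, created_at, updated_at
    FROM postgres_cdc
    ORDER BY updated_at DESC
    LIMIT 100

TYPE endpoint
```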
Deploy your Pipe to Tinybird Cloud:
```bash
tb --cloud deploy
```

After deployment, Tinybird automatically creates API endpoints for your Pipes. You can access your endpoint using a Tinybird auth token. Here's an example of how to call the endpoint:
```bash
curl "https://api.tinybird.co/v0/pipes/get_cdc_changes.json?token=YOUR_TOKEN"
```

The endpoint returns data in JSON format by default. You can also request other formats:

- `.csv` for CSV format
- `.ndjson` for newline-delimited JSON
- `.parquet` for Parquet format
Example response:
```json
{
  "data": [
    {
      "id": 1,
      "name": "John Doe",
      "email": "john.doe@example.com",
      "created_at": "2024-09-23 10:30:00",
      "updated_at": "2024-09-23 10:30:00"
    }
  ],
  "rows": 1,
  "statistics": {
    "elapsed": 0.001,
    "rows_read": 1,
    "bytes_read": 256
  }
}
```

A simpler architecture for CDC and real-time analytics
In the last year, Kafka-compatible layers like Estuary's Dekaf have quietly changed how teams think about CDC pipelines. Instead of standing up and operating full Kafka clusters, you can treat Estuary Flow as a Kafka-compatible endpoint and let it fan in changes from hundreds of databases and SaaS tools into a single stream that Tinybird can consume. This pattern keeps the familiar Kafka API for your downstream tools while offloading the hard parts of capture, transformation, and delivery to Flow.
On the Tinybird side, this fits perfectly with its connector and real-time analytics model. Estuary provides a low-latency CDC feed from sources like PostgreSQL, MySQL, MongoDB, BigQuery, and many others, and Tinybird turns that feed into queryable tables, SQL-based transformations, and HTTP APIs that can power dashboards or user-facing product features with fresh data in milliseconds.
The result is a simpler architecture for real-time analytics: no custom ETL, no manual schema juggling, and far less infra to run, but still the ability to ship things like personalization, fraud checks, or cart abandonment alerts backed directly by CDC events.
Get started with Tinybird
Tinybird is the real-time platform for user-facing analytics. Unify your data from hundreds of sources (thanks, Estuary 😉), query it with SQL, and integrate it into your application via highly scalable, hosted REST API Endpoints.
You can sign up for Tinybird for free, with no time limit or credit card required. Need help? Join our Slack Community or check out the Tinybird docs for more resources.
