Kafka connectors can fail in ways that aren't always obvious. These problems are common to any Kafka to ClickHouse® deployment, but they're especially frustrating when you're managing infrastructure yourself.
After supporting hundreds of production deployments, we've found that most issues fall into four categories. We've designed Tinybird's Kafka connector to handle these problems from a developer experience perspective, with built-in solutions that prevent or quickly diagnose each failure mode.
This guide covers the common problems and how Tinybird addresses them.
1. Connection and Authentication Failures
Problem: Connection and authentication failures are common when setting up Kafka ingestion. Issues like using internal broker addresses instead of advertised listeners, SASL mechanism mismatches, or firewall rules blocking broker ports can cause hours of debugging.
How Tinybird solves it:
Tinybird's connection validation helps you catch these issues immediately. The tb connection data command validates connectivity, authentication and message consumption in one step:
tb connection data <connection_name>
If it fails, you'll see exactly where the problem is, whether it's the broker address, authentication method, or network connectivity. This eliminates the guesswork that comes with managing Kafka consumers yourself.
The CLI also guides you through connection setup with interactive prompts, reducing configuration errors. For AWS MSK, Confluent Cloud, and self-hosted clusters, Tinybird handles the connection details so you don't have to manage security groups, endpoints, or SASL mechanisms manually.
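Once a connection exists, pointing a Data Source at a topic comes down to a handful of settings in the .datasource file. Here's a minimal sketch; the connection name, topic, and consumer group below are placeholders for your own values:
SCHEMA >
`data` String `json:$`
KAFKA_CONNECTION_NAME my_kafka_connection
KAFKA_TOPIC orders
KAFKA_GROUP_ID orders_consumer
KAFKA_AUTO_OFFSET_RESET earliest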
2. Consumer Lag
Problem: Consumer lag is a constant challenge with Kafka connectors. When lag grows, data arrives late and dashboards show stale data. Managing consumer scaling, partition assignment and throughput optimization requires constant attention.
How Tinybird solves it:
Tinybird's serverless architecture automatically scales consumers based on load. You don't need to manage consumer groups, partition assignment, or scaling logic; the infrastructure handles it.
Built-in monitoring through kafka_ops_log gives you visibility into lag, throughput, and partition performance:
SELECT
datasource_id,
topic,
partition,
lag,
timestamp
FROM tinybird.kafka_ops_log
WHERE timestamp > now() - INTERVAL 1 hour
AND partition >= 0
AND msg_type = 'info'
ORDER BY timestamp DESC
LIMIT 1 BY datasource_id, topic, partition
The connector also optimizes for performance automatically. It handles schema parsing efficiently and provides guidance on Materialized View optimization. When you see lag, the monitoring data shows exactly where the bottleneck is, whether it's schema parsing, Materialized Views, or partition distribution.
You can set up alerts on kafka_ops_log for error rates and processing stalls, but the autoscaling infrastructure usually handles lag before it becomes a problem.
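For example, an error-rate check can be a simple query on the same Service Data Source. This is a sketch: it assumes msg_type = 'error' marks failed batches, mirroring the msg_type = 'info' filter above:
SELECT
datasource_id,
topic,
count() AS error_count
FROM tinybird.kafka_ops_log
WHERE timestamp > now() - INTERVAL 10 minute
AND msg_type = 'error'
GROUP BY datasource_id, topic
ORDER BY error_count DESC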
3. Schema Evolution Issues
Problem: Schema evolution is one of the trickiest aspects of building data pipelines. Message structures change and suddenly ingestion breaks. The worst part? It often fails silently, sending problematic messages to quarantine without obvious errors.
How Tinybird solves it:
Tinybird's branching feature lets you test schema changes safely with production data before deploying. You can evolve schemas without breaking production:
SCHEMA >
`order_id` String `json:$.order_id`,
`customer_id` String `json:$.customer_id`,
`order_total` Float64 `json:$.order_total`,
`payment_method` Nullable(String) `json:$.payment_method`, -- New field, nullable
`data` String `json:$`
The FORWARD_QUERY feature automatically migrates existing data when you add new fields or change types. This eliminates the manual backfill work that usually comes with schema evolution.
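As a sketch, the forward query for the payment_method field added above can simply backfill it as NULL for existing rows; the exact SELECT depends on your current columns:
FORWARD_QUERY >
SELECT order_id, customer_id, order_total, CAST(NULL AS Nullable(String)) AS payment_method, data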
Tinybird also provides clear guidance on schema management, when to use Nullable() vs DEFAULT, how to handle missing fields and best practices for Schema Registry compatibility. The kafka_ops_log Service Data Source surfaces deserialization warnings immediately, so you know exactly what's wrong instead of guessing.
For detailed schema evolution strategies, see the schema management guide.
4. Message Size Limits
Problem: Oversized messages get quarantined, but the pipeline appears to work. You only discover missing data later when queries return incomplete results.
How Tinybird solves it:
Tinybird automatically quarantines messages exceeding 10 MB, but unlike self-managed solutions, you get immediate visibility into what's being quarantined:
SELECT
timestamp,
length(__value) AS message_size_bytes,
length(__value) / 1024 / 1024 AS message_size_mb,
msg
FROM your_datasource_quarantine
WHERE timestamp > now() - INTERVAL 1 hour
ORDER BY message_size_bytes DESC
LIMIT 100
The quarantine system preserves the problematic messages so you can analyze them and fix the root cause. You can also set up alerts on quarantine rates to catch oversized messages early.
Tinybird's documentation provides clear guidance on message size optimization, when to enable Kafka compression, how to split large messages and best practices for schema design. This helps you prevent the problem rather than just detecting it.
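On the producer side, two standard Kafka client settings cover most of this; the values here are illustrative and should be tuned to your workload:
compression.type=zstd
max.request.size=1048576
compression.type compresses batches before they reach the broker, and max.request.size (1 MB by default) keeps producer requests well under the 10 MB quarantine threshold.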
Prevention Best Practices
Monitor proactively:
- Set up alerts for consumer lag thresholds (alert at 50k+ messages; see the sketch after this list)
- Track error rates in kafka_ops_log
- Monitor message size distribution to catch oversized messages early
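A lag check against the 50k threshold could be a scheduled query along these lines, reusing the columns from the kafka_ops_log queries earlier in this guide:
SELECT
datasource_id,
topic,
partition,
max(lag) AS current_lag
FROM tinybird.kafka_ops_log
WHERE timestamp > now() - INTERVAL 5 minute
AND partition >= 0
GROUP BY datasource_id, topic, partition
HAVING current_lag > 50000
ORDER BY current_lag DESC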
For comprehensive monitoring queries and alerting setup, see the Kafka monitoring guide.
Use explicit schemas:
- Define schemas upfront instead of schemaless parsing
- Use appropriate data types (DateTime for timestamps, not String; see the snippet below)
- Make new fields nullable during schema evolution
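As a quick illustration, a timestamp column typed as DateTime in the .datasource schema rather than left as a raw string (the JSON paths are placeholders):
SCHEMA >
`event_time` DateTime `json:$.event_time`,
`order_id` String `json:$.order_id`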
For detailed schema management strategies, see the schema management guide.
Optimize Materialized Views:
- Avoid cascading MVs from the same source
- Add time-based filters to reduce data volume
- Simplify aggregations where possible
Design for even distribution:
- Use hash-based partition keys (user_id, session_id), not time-based keys
- Monitor partition-level metrics regularly
- Adjust partition count based on throughput needs
For partition optimization strategies, see the partitioning strategies guide.
Test connectivity regularly:
- Use tb connection data to verify connections
- Monitor authentication errors
- Check SSL certificate validity before expiration
Building New Pipelines and Next Steps
Most pipeline failures are preventable with the right monitoring and schema design. The key is catching issues early and understanding the common failure modes.
If you're building a new pipeline, consider using Tinybird's serverless Kafka connector to avoid these common issues. It handles:
- Automatic consumer scaling based on message throughput
- Built-in monitoring through the kafka_ops_log Service Data Source
- Schema evolution tools with branches and FORWARD_QUERY
- Quarantine handling for problematic messages
- Connection management with validation and troubleshooting
This eliminates the need to manage Kafka consumers, ClickHouse parts and monitoring infrastructure yourself.
Additional resources:
- Troubleshooting guide for specific error messages
- Monitoring guide for tracking consumer lag
- Performance optimization guide for throughput tuning
Ready to build reliable pipelines? Sign up for Tinybird and get started with our Kafka connector today. The free Build plan includes everything you need to get started.
