Kafka connector troubleshooting guide¶
This guide helps you diagnose and resolve common issues with Tinybird's Kafka connector. Use the tinybird.kafka_ops_log Service Data Source to monitor errors and warnings in real time.
For setup instructions and configuration details, see the Kafka connector documentation.
Quick error lookup¶
Use this table to quickly find errors and their solutions. Errors may appear in kafka_ops_log (Kafka connector operations) or datasources_ops_log (Data Source ingestion operations).
| Error message / symptom | Category | Log source | Solution link |
|---|---|---|---|
| Connection timeout or broker unreachable | Connectivity | kafka_ops_log | Connection timeout |
| Authentication failed | Authentication | kafka_ops_log, datasources_ops_log | Authentication failed |
| SSL handshake failed | SSL/TLS | kafka_ops_log | SSL certificate validation |
| Schema Registry connection failed | Deserialization | kafka_ops_log | Schema Registry |
| Deserialization failed - Avro | Deserialization | kafka_ops_log | Avro deserialization |
| Deserialization failed - JSON | Deserialization | kafka_ops_log | JSON deserialization |
| Offset commit failed | Consumer group | kafka_ops_log | Offset commit |
| Consumer lag continuously increasing | Performance | kafka_ops_log | Consumer lag |
| Schema mismatch or type conversion failed | Schema | kafka_ops_log | Schema mismatch |
| Materialized View errors | Schema | kafka_ops_log, datasources_ops_log | Materialized View errors |
| Low throughput or processing stall | Performance | kafka_ops_log | Low throughput |
| Uneven partition processing | Performance | kafka_ops_log | Uneven partitions |
| Message too large | Message size | kafka_ops_log | Message size |
| Compressed message handling | Message format | kafka_ops_log | Compression |
| Unknown topic or partition | Kafka | datasources_ops_log | Unknown topic |
| Group authorization failed | Authorization | datasources_ops_log | Group authorization |
| Topic authorization failed | Authorization | datasources_ops_log | Topic authorization |
| Unknown partition | Kafka | datasources_ops_log | Unknown partition |
| Table in readonly mode | Data Source | datasources_ops_log | Readonly mode |
| Timeout or memory limit exceeded | Resource | datasources_ops_log | Timeout |
How to diagnose errors¶
Use both kafka_ops_log and datasources_ops_log to diagnose Kafka connector issues:
Check Kafka connector operations¶
Query recent errors and warnings from kafka_ops_log:
```sql
SELECT
    timestamp,
    datasource_id,
    topic,
    partition,
    msg_type,
    msg,
    lag
FROM tinybird.kafka_ops_log
WHERE timestamp > now() - INTERVAL 1 hour
    AND msg_type IN ('warning', 'error')
ORDER BY timestamp DESC
```
Check Data Source ingestion errors¶
Query errors from datasources_ops_log to see issues during data ingestion:
```sql
SELECT
    timestamp,
    datasource_id,
    event_type,
    result,
    error,
    elapsed_time
FROM tinybird.datasources_ops_log
WHERE timestamp > now() - INTERVAL 1 hour
    AND result = 'error'
    AND event_type LIKE '%kafka%'
ORDER BY timestamp DESC
```
This shows errors that occur during the actual data processing phase, even when the Kafka connection itself might be working.
Set up automated monitoring: Connect these diagnostic queries to your monitoring and alerting tools. Query the ClickHouse® HTTP interface directly from tools like Grafana, Datadog, PagerDuty, and Slack. Alternatively, create API endpoints from these queries, or export them in Prometheus format for Prometheus-compatible tools. Configure your tools to poll these queries periodically and trigger alerts when errors are detected.
For detailed monitoring queries, see Monitor Kafka connectors.
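For example, an alerting tool could poll a query like the following and fire whenever a connector has logged errors recently. This is a minimal sketch, assuming a 15-minute polling window and an alert on any non-zero error count; adjust the window and threshold to your workload:

```sql
-- errors and warnings per connector in the last 15 minutes
SELECT
    datasource_id,
    topic,
    countIf(msg_type = 'error') AS errors,
    countIf(msg_type = 'warning') AS warnings
FROM tinybird.kafka_ops_log
WHERE timestamp > now() - INTERVAL 15 minute
GROUP BY datasource_id, topic
HAVING errors > 0
```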
Connectivity errors¶
Error: Connection timeout or broker unreachable¶
Symptoms:
- No messages are being processed
- Errors in kafka_ops_log with messages like "Connection timeout" or "Broker unreachable"
- High lag values that continue to increase
Root causes:
- Incorrect KAFKA_BOOTSTRAP_SERVERS configuration
- Network connectivity issues between Tinybird and your Kafka cluster
- Firewall or security group rules blocking access
- Kafka broker is down or unreachable
Solutions:
Verify bootstrap servers configuration:
- Check that KAFKA_BOOTSTRAP_SERVERS in your .connection file includes the correct host and port
- Ensure you're using the advertised listeners address, not the internal broker address
- For multiple brokers, use comma-separated values: broker1:9092,broker2:9092,broker3:9092
- For cloud providers, verify you're using the public endpoint provided by your Kafka service
Test connectivity:
```
tb connection data <connection_name>
```
This command allows you to select a topic and consumer group ID, then returns preview data. This validates that Tinybird can reach your Kafka broker, authenticate, and consume messages.
Check network configuration:
- Verify firewall rules allow outbound connections from Tinybird to your Kafka cluster
- For AWS MSK, ensure security groups allow inbound traffic on the Kafka port
- For Confluent Cloud, verify network access settings
- For PrivateLink setups (Enterprise), verify the PrivateLink connection is active
Verify security protocol:
- Ensure KAFKA_SECURITY_PROTOCOL matches your Kafka cluster configuration
- For most cloud providers, use SASL_SSL
- For local development, you may use PLAINTEXT
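To confirm whether the connector is currently hitting connectivity problems, you can filter kafka_ops_log for connection-related messages. This is a minimal sketch, assuming the error text appears in the msg column; adjust the patterns to the exact messages you see:

```sql
SELECT
    timestamp,
    datasource_id,
    topic,
    partition,
    msg
FROM tinybird.kafka_ops_log
WHERE timestamp > now() - INTERVAL 1 hour
    AND msg_type = 'error'
    AND (msg ILIKE '%timeout%' OR msg ILIKE '%unreachable%' OR msg ILIKE '%connect%')
ORDER BY timestamp DESC
LIMIT 50
```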
For vendor-specific network configuration help, see the setup guide for your Kafka provider, such as AWS MSK or Confluent Cloud.
Error: Authentication failed¶
Symptoms:
- Errors in kafka_ops_log with "Authentication failed" or "SASL authentication error"
- Connection check fails with authentication errors
Root causes:
- Incorrect KAFKA_KEY or KAFKA_SECRET credentials
- Wrong KAFKA_SASL_MECHANISM configuration
- Expired credentials or tokens
- For AWS MSK with OAuthBearer, incorrect IAM role configuration
Solutions:
- Verify credentials:

```
tb [--cloud] secret get KAFKA_KEY
tb [--cloud] secret get KAFKA_SECRET
```

Ensure the secrets match your Kafka cluster credentials.
Check SASL mechanism:
- Verify KAFKA_SASL_MECHANISM matches your Kafka cluster (PLAIN, SCRAM-SHA-256, SCRAM-SHA-512, or OAUTHBEARER)
- For Confluent Cloud, typically use PLAIN
- For AWS MSK with IAM, use OAUTHBEARER with KAFKA_SASL_OAUTHBEARER_METHOD AWS
- For Redpanda, check your cluster's configured SASL mechanism
For AWS MSK OAuthBearer:
- Verify the IAM role ARN is correct: tb [--cloud] secret get AWS_ROLE_ARN
- Check that the IAM role has the correct trust policy allowing Tinybird to assume the role
- Verify the external ID matches between your connection configuration and IAM trust policy
- Ensure the IAM role has the required Kafka cluster permissions (see AWS IAM permissions)
Rotate credentials if needed:
- If credentials have expired, update them using tb secret set
- Redeploy your connection after updating secrets
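To check whether authentication failures are still occurring, or stopped after a credential rotation, you can filter kafka_ops_log for authentication-related messages. A minimal sketch, assuming the failure text appears in the msg column:

```sql
SELECT
    timestamp,
    datasource_id,
    topic,
    msg
FROM tinybird.kafka_ops_log
WHERE timestamp > now() - INTERVAL 6 hour
    AND msg_type = 'error'
    AND (msg ILIKE '%auth%' OR msg ILIKE '%sasl%')
ORDER BY timestamp DESC
```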
For detailed authentication setup, see:
- AWS MSK setup guide for IAM authentication
- Confluent Cloud setup guide for API key authentication
Error: SSL/TLS certificate validation failed¶
Symptoms:
- Errors mentioning "SSL handshake failed" or "certificate validation error"
- Connection failures when using the SASL_SSL security protocol
Root causes:
- Missing or incorrect CA certificate
- Self-signed certificate not provided
- Certificate expired or invalid
Solutions:
- Provide CA certificate:
```
tb [--cloud] secret set --multiline KAFKA_SSL_CA_PEM
```
Paste your CA certificate in PEM format.
- Add certificate to connection file:
```
KAFKA_SSL_CA_PEM >
    {{ tb_secret("KAFKA_SSL_CA_PEM") }}
```
Note: This is a multiline setting.
- Verify certificate format:
- Ensure the certificate is in PEM format (starts with -----BEGIN CERTIFICATE-----)
- Include the full certificate chain if required
- For Aiven Kafka, download the CA certificate from the Aiven console
Deserialization errors¶
Error: Schema Registry connection failed¶
Symptoms:
- Errors in kafka_ops_log mentioning "Schema Registry" or "Failed to fetch schema"
- Messages not being ingested when using Avro or JSON with schema
Root causes:
- Incorrect KAFKA_SCHEMA_REGISTRY_URL configuration
- Missing or incorrect Schema Registry credentials
- Schema Registry is unreachable
- Schema not found in Schema Registry
Solutions:
Verify Schema Registry URL:
- Check that KAFKA_SCHEMA_REGISTRY_URL in your .connection file is correct
- For Basic Auth, use the format https://<username>:<password>@<registry_host>
- Ensure the URL is accessible from Tinybird's network
Check schema exists:
- Verify the schema exists in your Schema Registry for the topic
- Ensure the schema subject name matches your topic naming convention
- For Confluent Schema Registry, check subject names like {topic-name}-value or {topic-name}-key
Test Schema Registry access:
- Use curl or similar tool to verify Schema Registry is reachable
- Verify credentials work with Schema Registry API
For more information on schema management, see the schema management guide.
Error: Deserialization failed - Avro¶
Symptoms:
- Warnings in kafka_ops_log with "Deserialization failed" or "Avro parsing error"
- Messages sent to the Quarantine Data Source
- processed_messages > committed_messages in monitoring queries
Root causes:
- Schema mismatch between message and Schema Registry
- Schema evolution incompatibility
- Incorrect KAFKA_VALUE_FORMAT or KAFKA_KEY_FORMAT configuration
- Corrupted message data
Solutions:
Verify format configuration:
- Ensure KAFKA_VALUE_FORMAT is set to avro for Avro messages
- Ensure KAFKA_KEY_FORMAT is set to avro if keys are Avro-encoded
- Verify KAFKA_SCHEMA_REGISTRY_URL is configured
Check schema compatibility:
- Verify the message schema matches the schema in Schema Registry
- Check for schema evolution issues (backward/forward compatibility)
- Review Quarantine Data Source to see the actual message that failed
Inspect quarantined messages:
```sql
SELECT *
FROM your_datasource_quarantine
WHERE timestamp > now() - INTERVAL 1 hour
ORDER BY timestamp DESC
LIMIT 100
```
This helps you see the actual message content and identify the issue.
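To spot patterns instead of reading rows one by one, you can group quarantined rows by their error. This is a minimal sketch, assuming your quarantine Data Source exposes the standard c__error and insertion_date columns:

```sql
SELECT
    arrayJoin(c__error) AS error,   -- assumes the standard quarantine c__error column
    count() AS occurrences
FROM your_datasource_quarantine
WHERE insertion_date > now() - INTERVAL 1 hour
GROUP BY error
ORDER BY occurrences DESC
LIMIT 20
```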
- Schema evolution:
- Ensure schema changes are backward compatible
- Consider using schema versioning strategies
- Test schema changes in a development environment first
For detailed schema evolution guidance, see the schema management guide.
Error: Deserialization failed - JSON¶
Symptoms:
- Warnings in kafka_ops_log with "JSON parsing error" or "Invalid JSON"
- Messages in the Quarantine Data Source
- Low success rate in throughput monitoring
Root causes:
- Invalid JSON format in message payload
- Schema mismatch with JSONPath expressions
- Missing required fields in JSON
- Incorrect KAFKA_VALUE_FORMAT configuration
Solutions:
Verify JSON format:
- Check that messages are valid JSON
- Use a JSON validator to test sample messages
- Review Quarantine Data Source for examples of failed messages
Check JSONPath expressions:
- Verify JSONPath expressions in your Data Source schema match the message structure
- Test JSONPath expressions with sample messages
- Use json:$ to store the entire message if you're unsure of the structure
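Before changing JSONPath expressions, you can test how a sample message parses using tb sql. This is a minimal sketch; the payload and field names are hypothetical placeholders for your own message structure:

```sql
-- hypothetical sample message; replace with a real payload from your topic
WITH '{"user_id": "abc-123", "amount": 42.5}' AS raw
SELECT
    JSONHas(raw, 'user_id') AS has_user_id,        -- confirm the field exists before mapping it
    JSONExtractString(raw, 'user_id') AS user_id,
    JSONExtractFloat(raw, 'amount') AS amount
```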
Handle missing fields:
- Use nullable types for optional fields: Nullable(String)
- Provide default values in JSONPath: json:$.field DEFAULT ''
- Consider using a schemaless approach with data String json:$ and extract fields later
Verify format configuration:
- Use json_without_schema for plain JSON messages
- Use json_with_schema only if you're using Schema Registry for JSON schemas
Offset and consumer group errors¶
Error: Offset commit failed or consumer group conflict¶
Symptoms:
- Data Source only receives messages from the last committed offset
- Multiple Data Sources competing for the same consumer group
- Errors about offset commit failures
Root causes:
- Multiple Data Sources using the same KAFKA_TOPIC and KAFKA_GROUP_ID combination
- Consumer group already in use by another app
- Offset reset behavior not working as expected
Solutions:
Use unique consumer group IDs:
- Each Data Source must use a unique KAFKA_GROUP_ID for the same topic
- Use environment-specific group IDs: {{ tb_secret("KAFKA_GROUP_ID", "prod-group") }}
- For testing, use unique group IDs to avoid conflicts
Check for duplicate configurations:
```sql
SELECT
    topic,
    count(DISTINCT datasource_id) AS datasource_count
FROM tinybird.kafka_ops_log
WHERE timestamp > now() - INTERVAL 24 hour
GROUP BY topic
HAVING datasource_count > 1
```
This helps identify if multiple Data Sources are consuming from the same topic.
Reset offset behavior:
- KAFKA_AUTO_OFFSET_RESET=earliest only works for new consumer groups
- If a consumer group already has committed offsets, it resumes from the last committed offset
- To start from the beginning, use a new KAFKA_GROUP_ID or reset offsets in your Kafka cluster
Best practices:
- Use different KAFKA_GROUP_ID values for development, staging, and production
- Document which consumer groups are in use
- Monitor consumer group activity in your Kafka cluster
For managing consumer groups across environments, see the CI/CD and version control guide.
Error: Consumer lag continuously increasing¶
Symptoms:
- Lag values in kafka_ops_log keep growing
- Messages are not being processed fast enough
- Throughput is lower than message production rate
Root causes:
- Message production rate exceeds processing capacity
- Consumer autoscaling not keeping up with load
- Network latency or connectivity issues
- Data Source schema or Materialized View performance issues
Solutions:
- Monitor lag trends:
```sql
SELECT
    datasource_id,
    topic,
    partition,
    max(lag) AS current_lag,
    avg(lag) AS avg_lag
FROM tinybird.kafka_ops_log
WHERE timestamp > now() - INTERVAL 1 hour
    AND partition >= 0
GROUP BY datasource_id, topic, partition
ORDER BY current_lag DESC
```
Verify autoscaling:
- Tinybird's serverless Kafka connector automatically scales consumers
- Monitor kafka_ops_log to see partition assignment changes
- If lag continues to increase, there may be a bottleneck in your Data Source or Materialized Views
Check Data Source performance:
- Review Materialized View queries that trigger on append
- Optimize complex Materialized View queries that may slow down ingestion
- Check for schema issues causing slow parsing
Analyze throughput:
```sql
SELECT
    datasource_id,
    topic,
    sum(processed_messages) AS processed,
    sum(committed_messages) AS committed,
    (sum(committed_messages) * 100.0 / sum(processed_messages)) AS success_rate
FROM tinybird.kafka_ops_log
WHERE timestamp > now() - INTERVAL 1 hour
GROUP BY datasource_id, topic
```
Low success rates indicate processing issues.
Review partitioning strategy:
- Check if partition distribution is even
- Review partition key design if lag is uneven across partitions
- See the partitioning strategies guide for optimization tips
Contact support:
- If lag continues to increase despite autoscaling, contact Tinybird support
- Provide kafka_ops_log queries showing the issue
- Include information about message production rates
For performance optimization strategies, see the performance optimization guide.
Data quality and schema errors¶
Error: Schema mismatch or type conversion failed¶
Symptoms:
- Warnings in kafka_ops_log about type mismatches
- Messages in the Quarantine Data Source
- Low committed_messages compared to processed_messages
Root causes:
- Data type mismatch between message and Data Source schema
- Missing required fields
- Invalid data formats (for example, date strings that can't be parsed)
- JSONPath expressions not matching message structure
Solutions:
Review schema definition:
- Verify column types match the data in messages
- Use appropriate ClickHouse® types (for example, DateTime for timestamps, Int64 for large integers)
- Check for nullable vs non-nullable field requirements
Test with sample messages:
- Use tb sql to test JSONPath expressions with sample data
- Verify date/time formats can be parsed correctly
- Check numeric formats and precision
Handle data quality issues:
- Use nullable types for fields that may be missing: Nullable(String)
- Provide default values: json:$.field DEFAULT 0
- Use type conversion functions if needed: toDateTime(JSONExtractString(data, 'timestamp'))
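As a quick sanity check before changing the schema, you can try the conversions against a sample value with tb sql. This is a minimal sketch with a hypothetical payload; the OrNull/OrZero variants return a fallback instead of failing the row:

```sql
-- hypothetical payload; replace with a value copied from the quarantine Data Source
WITH '{"ts": "2024-05-01 10:30:00", "count": "oops"}' AS raw
SELECT
    parseDateTimeBestEffortOrNull(JSONExtractString(raw, 'ts')) AS parsed_ts,
    toInt64OrZero(JSONExtractString(raw, 'count')) AS safe_count
```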
Inspect quarantined data:
- Regularly check Quarantine Data Source for patterns
- Identify common data quality issues
- Update schema or data producers to fix root causes
For detailed schema management guidance, see the schema management guide.
Error: Materialized View errors¶
Symptoms:
- Warnings in kafka_ops_log mentioning Materialized View errors
- Data ingested but Materialized Views not updating
- Errors in Materialized View queries affecting ingestion
Root causes:
- Materialized View query errors
- Schema changes breaking Materialized View queries
- Resource constraints (memory, CPU)
- Circular dependencies between Materialized Views
Solutions:
Check Materialized View queries:
- Review Materialized View pipe definitions
- Test Materialized View queries independently
- Verify queries work with the current Data Source schema
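To test the transformation logic outside the Materialized View, you can run an equivalent aggregation manually against the landing Data Source. A minimal sketch under assumptions: kafka_events is a hypothetical landing Data Source name, and the aggregation stands in for your own pipe's SELECT:

```sql
-- hypothetical landing Data Source and aggregation; substitute your pipe's query
SELECT
    toStartOfMinute(timestamp) AS minute,
    count() AS events
FROM kafka_events
WHERE timestamp > now() - INTERVAL 10 minute
GROUP BY minute
ORDER BY minute DESC
```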
Monitor Materialized View impact:
- Monitor overall ingestion throughput in kafka_ops_log to see if Materialized Views are slowing down ingestion
- Check for errors in Materialized View queries that may be blocking ingestion
- Review Materialized View query complexity and execution time
Optimize Materialized View queries:
- Simplify complex aggregations
- Add appropriate filters to reduce data volume
- Consider breaking complex Materialized Views into multiple steps
Handle schema evolution:
- Update Materialized View queries when Data Source schema changes
- Test Materialized View changes in development first
- Use FORWARD_QUERY to provide default values for new columns
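To see whether Materialized Views are adding latency to Kafka ingestion, you can chart how long each ingestion operation takes over time. A minimal sketch against datasources_ops_log, assuming Kafka appends carry a kafka-related event_type as in the earlier diagnostic query:

```sql
SELECT
    toStartOfHour(timestamp) AS hour,
    datasource_id,
    avg(elapsed_time) AS avg_elapsed,
    max(elapsed_time) AS max_elapsed
FROM tinybird.datasources_ops_log
WHERE timestamp > now() - INTERVAL 24 hour
    AND event_type LIKE '%kafka%'
GROUP BY hour, datasource_id
ORDER BY hour DESC, avg_elapsed DESC
```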
Data Source operation errors¶
These errors occur during the data ingestion phase and are logged in datasources_ops_log. They represent problems that happen during actual data processing, even when the Kafka connection itself might be working.
Error: Unknown topic or partition¶
Error message:
- "KafkaError[UNKNOWN_TOPIC_OR_PART]: Broker: Unknown topic or partition"
Symptoms:
- Errors in datasources_ops_log with "Unknown topic or partition"
- No messages being ingested
- Topic name errors
Root causes:
- Topic doesn't exist in Kafka cluster
- Topic was deleted
- Topic name typo in configuration
- Topic retention policies caused data deletion
Solutions:
Verify topic exists:
- Check your Kafka cluster to confirm the topic exists
- Use Kafka tools: kafka-topics.sh --list --bootstrap-server <server>
- Verify the topic name matches exactly (case-sensitive)
Check topic configuration:
- Ensure topic hasn't been deleted
- Verify topic retention policies haven't removed all data
- Check if topic was renamed
Verify Data Source configuration:
```
cat <datasource_name>.datasource
```
Check that KAFKA_TOPIC matches the actual topic name.
- Create topic if needed:
- If topic doesn't exist, create it in your Kafka cluster
- Ensure proper replication factor and partitions
- Redeploy the Data Source after creating the topic
Note: If the error specifically mentions a partition (not the topic), see Unknown partition in the following section for partition-specific troubleshooting.
Monitor topic errors:
```sql
SELECT
    toStartOfHour(timestamp) AS hour,
    datasource_id,
    error,
    count(*) AS error_count
FROM tinybird.datasources_ops_log
WHERE timestamp > now() - INTERVAL 24 hour
    AND result = 'error'
    AND error LIKE '%UNKNOWN_TOPIC%'
GROUP BY hour, datasource_id, error
ORDER BY error_count DESC
```
Error: Authentication failure (during ingestion)¶
Error message:
- "KafkaError[_AUTHENTICATION]: Local: Authentication failure"
Symptoms:
- Errors in datasources_ops_log with authentication failures
- Connection works initially but fails during ingestion
- Credentials expired or rotated during operation
Root causes:
- SASL credentials expired or invalid during ingestion
- SSL certificates expired
- Authentication settings changed on Kafka broker
- Credentials rotated but not updated in Tinybird
- Token-based authentication expired mid-operation
Solutions:
- Verify credentials:
```
tb [--cloud] secret get KAFKA_KEY
tb [--cloud] secret get KAFKA_SECRET
```
Ensure secrets match your Kafka cluster credentials.
Check for credential expiration:
- Some credentials have expiration dates
- Rotate credentials if they've expired
- Update secrets and redeploy connection
- For token-based auth, ensure tokens are refreshed before expiration
Verify SSL certificates:
- Check certificate expiration dates
- Update certificates if expired
- Verify certificate format is correct
Test connection:
```
tb connection data <connection_name>
```
This validates authentication is working.
- Check for intermittent auth failures:
- Monitor datasources_ops_log for authentication error patterns
- If errors occur periodically, credentials may be expiring
- Set up credential rotation before expiration
Note: This error occurs during data ingestion, not during initial connection. If you see authentication errors during connection setup, see Authentication failed in the Connectivity errors section.
Monitor authentication errors:
```sql
SELECT
    toStartOfHour(timestamp) AS hour,
    datasource_id,
    error,
    count(*) AS error_count
FROM tinybird.datasources_ops_log
WHERE timestamp > now() - INTERVAL 24 hour
    AND result = 'error'
    AND error LIKE '%AUTHENTICATION%'
GROUP BY hour, datasource_id, error
ORDER BY error_count DESC
```
Error: Group authorization failed¶
Error message:
- "KafkaError[GROUP_AUTHORIZATION_FAILED]: Broker: Group authorization failed"
Symptoms:
- Errors in datasources_ops_log with "Group authorization failed"
- Consumer group lacks permissions
- ACLs not configured correctly
Root causes:
- Consumer group lacks proper authorization
- Kafka ACLs not configured for the consumer group
- Consumer group name doesn't match ACL configuration
- Permissions changed on Kafka cluster
Solutions:
Check Kafka ACLs:
- Verify consumer group has read permissions
- Check ACLs for the specific consumer group ID
- Ensure ACLs allow operations on the topic
Verify consumer group ID:
- Check the KAFKA_GROUP_ID in your Data Source configuration
- Ensure it matches what's configured in Kafka ACLs
- Use consistent naming across environments
Update ACLs:
- Grant necessary permissions to the consumer group
- Ensure group has access to read from the topic
- Verify group can commit offsets
Test with different group ID:
- Try a different consumer group ID temporarily
- If it works, the issue is with ACLs for the original group
- Update ACLs for the original group
Monitor authorization errors:
```sql
SELECT
    toStartOfHour(timestamp) AS hour,
    datasource_id,
    error,
    count(*) AS error_count
FROM tinybird.datasources_ops_log
WHERE timestamp > now() - INTERVAL 24 hour
    AND result = 'error'
    AND error LIKE '%GROUP_AUTHORIZATION%'
GROUP BY hour, datasource_id, error
ORDER BY error_count DESC
```
Error: Topic authorization failed¶
Error message:
- "KafkaError[TOPIC_AUTHORIZATION_FAILED]: Broker: Topic authorization failed"
Symptoms:
- Errors in datasources_ops_log with "Topic authorization failed"
- Cannot read from the topic
- ACLs not configured for topic access
Root causes:
- Kafka client lacks permission to read from topic
- Topic ACLs not configured
- Permissions changed on Kafka cluster
- Credentials don't have topic access
Solutions:
Check topic ACLs:
- Verify credentials have read permissions on the topic
- Check ACLs for the specific topic name
- Ensure ACLs allow consumer operations
Verify credentials:
- Ensure credentials have proper topic access
- Check if topic permissions have changed
- Update credentials if needed
Update ACLs:
- Grant read permissions to the topic
- Ensure consumer group has topic access
- Verify ACLs are applied correctly
Test connection:
```
tb connection data <connection_name>
```
Select the topic to verify access.
Monitor topic authorization errors:
```sql
SELECT
    toStartOfHour(timestamp) AS hour,
    datasource_id,
    error,
    count(*) AS error_count
FROM tinybird.datasources_ops_log
WHERE timestamp > now() - INTERVAL 24 hour
    AND result = 'error'
    AND error LIKE '%TOPIC_AUTHORIZATION%'
GROUP BY hour, datasource_id, error
ORDER BY error_count DESC
```
Error: Unknown partition¶
Error message:
- "KafkaError[_UNKNOWN_PARTITION]: Local: Unknown partition"
Symptoms:
- Errors in datasources_ops_log with "Unknown partition" (note: different from "Unknown topic or partition")
- Specific partition no longer available
- Topic reconfiguration issues
- Partition-specific errors
Root causes:
- Partition no longer exists in topic (topic was reconfigured)
- Topic reconfiguration changed partition count
- Broker failures affecting specific partition availability
- Partition replication issues
- Partition was deleted or reassigned
Solutions:
Check topic configuration:
- Verify current partition count for the topic
- Check if topic was reconfigured (partitions added/removed)
- Ensure partition assignments are correct
- Compare current partition count with what the connector expects
Check broker health:
- Verify all brokers are healthy
- Check for broker failures that might affect specific partitions
- Ensure partition replication is working
- Review partition leader assignments
Review partition assignments:
- Check if partition assignments changed
- Verify replication factors are correct
- Consider rebalancing if needed
- Check if partitions were reassigned to different brokers
Monitor partition availability:
- Use kafka_ops_log to see which partitions are being accessed
- Check for partition-specific errors
- Identify which specific partition is causing issues
- Contact support if partitions are consistently unavailable
Note: This error is different from "Unknown topic or partition" - this specifically indicates a partition issue when the topic exists. If you see "UNKNOWN_TOPIC_OR_PART", see Unknown topic or partition in the preceding section.
Monitor partition errors:
```sql
SELECT
    toStartOfHour(timestamp) AS hour,
    datasource_id,
    error,
    count(*) AS error_count
FROM tinybird.datasources_ops_log
WHERE timestamp > now() - INTERVAL 24 hour
    AND result = 'error'
    AND error LIKE '%UNKNOWN_PARTITION%'
    AND error NOT LIKE '%UNKNOWN_TOPIC%'
GROUP BY hour, datasource_id, error
ORDER BY error_count DESC
```
Error: Table in readonly mode¶
Error message:
- "Table is in readonly mode: replica_path=..."
Symptoms:
- Errors in datasources_ops_log with "readonly mode"
- Data Source temporarily unavailable for writes
- Replication or maintenance in progress
Root causes:
- ClickHouse® table in readonly mode during replication
- Ongoing maintenance operations
- ClickHouse® cluster issues
- Replication lag or issues
Solutions:
Wait for table to become writable:
- This is often a transient state
- Wait a few minutes and check again
- Monitor datasources_ops_log for resolution
Check ClickHouse® cluster:
- Verify cluster health
- Check for ongoing maintenance
Monitor for resolution:
```sql
SELECT
    toStartOfHour(timestamp) AS hour,
    datasource_id,
    error,
    count(*) AS occurrence_count
FROM tinybird.datasources_ops_log
WHERE timestamp > now() - INTERVAL 1 hour
    AND result = 'error'
    AND error LIKE '%readonly%'
GROUP BY hour, datasource_id, error
ORDER BY hour DESC
```
- Contact support:
- If the issue persists for an extended period
- Provide datasources_ops_log queries showing the issue
- Include timestamps and Data Source IDs
Note: Readonly mode errors are typically transient and resolve automatically. If they persist, contact Tinybird support.
Error: Timeout or memory limit exceeded¶
Error message:
- "memory limit exceeded: would use ... GiB"
- "Waiting timeout for memo"
- Timeout errors during ingestion
Symptoms:
- Errors in datasources_ops_log with timeout or memory errors
- Large messages or complex transformations
- Resource constraints
Root causes:
- Message size too large
- Complex Materialized View queries consuming too much memory
- High message throughput
- Resource constraints
Solutions:
Reduce message size:
- Use Kafka compression
- Split large messages into smaller chunks
- Move large data to external storage
Optimize Materialized View queries:
- Simplify complex aggregations
- Add filters to reduce data volume
- Break complex transformations into multiple steps
Monitor memory usage:
```sql
SELECT
    timestamp,
    datasource_id,
    error,
    elapsed_time
FROM tinybird.datasources_ops_log
WHERE timestamp > now() - INTERVAL 24 hour
    AND result = 'error'
    AND (error LIKE '%memory%' OR error LIKE '%timeout%')
ORDER BY timestamp DESC
```
Optimize transformations:
- Reduce data processed per operation
- Use more efficient query patterns
- Consider batching operations
Contact support:
- If memory issues persist
- Discuss resource requirements
- Consider plan upgrades if needed
For more information on handling large messages, see the message size handling guide.
Performance and throughput issues¶
Error: Low throughput or processing stall¶
Symptoms:
- processed_messages is zero or low
- No recent activity in kafka_ops_log
- Data Source not receiving new messages
Root causes:
- Kafka topic has no new messages
- Consumer has stopped or crashed
- Network connectivity issues
- Configuration errors preventing consumption
Solutions:
Verify topic has messages:
- Check your Kafka cluster to verify messages are being produced
- Use Kafka tools to verify topic has new messages
- Check producer metrics
Check connector activity:
```sql
SELECT
    datasource_id,
    topic,
    max(timestamp) AS last_activity,
    dateDiff('minute', max(timestamp), now()) AS minutes_since_activity
FROM tinybird.kafka_ops_log
WHERE timestamp > now() - INTERVAL 7 day
GROUP BY datasource_id, topic
HAVING minutes_since_activity > 60
```
Verify configuration:
- Run tb connection data <connection_name> to test the connection and preview data
- Verify all required settings are present
- Check for typos in topic names or connection names
Check for errors:
- Review recent errors in kafka_ops_log
- Check the Quarantine Data Source for issues
- Review datasources_ops_log for Data Source operation errors
Error: Uneven partition processing¶
Symptoms:
- Some partitions have high lag while others have low lag
- Uneven message distribution across partitions
- Some partitions processing faster than others
Root causes:
- Uneven message distribution in Kafka topic
- Partition key design causing hot partitions
- Consumer assignment imbalance
- Different message sizes across partitions
Solutions:
- Analyze partition distribution:
```sql
SELECT
    datasource_id,
    topic,
    partition,
    max(lag) AS max_lag,
    avg(lag) AS avg_lag,
    sum(processed_messages) AS total_messages
FROM tinybird.kafka_ops_log
WHERE timestamp > now() - INTERVAL 24 hour
    AND partition >= 0
GROUP BY datasource_id, topic, partition
ORDER BY max_lag DESC
```
Review partition key strategy:
- Ensure partition keys distribute messages evenly
- Avoid using keys that create hot partitions
- Consider using random keys if even distribution is needed
Monitor autoscaling:
- Tinybird's connector automatically balances partition assignment
- Monitor kafka_ops_log to see partition assignment changes
- High lag should trigger additional consumer instances
Optimize at producer level:
- Review Kafka producer configuration
- Adjust partition key strategy if needed
- Consider increasing topic partitions if needed
For detailed partitioning strategies, see the partitioning strategies guide.
Compression and message format errors¶
Error: Compressed message handling¶
Symptoms:
- Messages ingested as raw bytes instead of decompressed content
- Warnings about message format
Root causes:
- Messages compressed before being sent to Kafka producer
- Kafka compression not configured correctly
- Message format not recognized
Solutions:
Understand compression types:
- Kafka compression (configured in producer): Automatically decompressed by Kafka consumer
- App-level compression (compressed before producing): Not automatically decompressed
Use Kafka compression:
- Configure the Kafka producer with compression.type=gzip (or snappy, lz4)
- The Kafka consumer automatically decompresses these messages
- Messages arrive in Tinybird already decompressed
Handle app-level compression:
- If you compress messages before sending to Kafka, you need to handle decompression
- Consider storing compressed messages and decompressing in Materialized Views
- Or change producer to use Kafka compression instead
Verify message format:
- Check that KAFKA_VALUE_FORMAT matches your message format
- For JSON, use json_without_schema or json_with_schema
- For Avro, use avro with Schema Registry configured
Message size errors¶
Error: Message too large or quarantined due to size¶
Symptoms:
- Messages sent to Quarantine Data Source
- Errors about message size limits
- Large messages not being ingested
Root causes:
- Message exceeds Tinybird's 10 MB default limit
- Large payloads causing memory issues
- Compression not reducing message size effectively
Solutions:
Check message size:
- Review Quarantine Data Source for size-related errors
- Verify message sizes in your Kafka topic
- Use Kafka tools to inspect message sizes
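If your Data Source stores the raw payload in a single column (for example, a schemaless data String json:$ column), you can also estimate message sizes from ingested rows. A minimal sketch, assuming hypothetical data and timestamp columns:

```sql
-- assumes a hypothetical raw "data" column and a "timestamp" column in your Data Source
SELECT
    max(length(data)) AS max_bytes,
    quantile(0.99)(length(data)) AS p99_bytes
FROM your_datasource
WHERE timestamp > now() - INTERVAL 1 hour
```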
Implement compression:
- Use Kafka compression to reduce message size
- Consider compressing large payloads before producing to Kafka
Split large messages:
- Break large messages into smaller chunks
- Use message headers to track message parts
- Reassemble in Materialized Views if needed
Alternative approaches:
- Store large payloads in object storage (S3, GCS) and reference them in Kafka messages
- Use external storage for large binary data
For detailed guidance on handling large messages, see the message size handling guide.
Quarantine Data Source issues¶
Understanding Quarantine Data Source¶
When messages fail to ingest into your main Data Source, they are automatically sent to a Quarantine Data Source. This prevents data loss and allows you to inspect problematic messages.
Common reasons for quarantine:
- Schema mismatches
- Invalid data formats
- Type conversion errors
- Missing required fields
- Deserialization failures
- Message size limits exceeded
How to inspect quarantined messages:
```sql
SELECT *
FROM your_datasource_quarantine
WHERE timestamp > now() - INTERVAL 24 hour
ORDER BY timestamp DESC
LIMIT 100
```
How to resolve:
- Identify patterns in quarantined messages
- Fix schema or data quality issues
- Update Data Source schema if needed
- Fix data producers to send correct formats
- Consider reprocessing quarantined messages after fixes
For more information, see Quarantine Data Sources.
Getting help¶
If you've tried the preceding solutions and still experience issues:
Collect diagnostic information:
- Recent errors from kafka_ops_log
- Recent errors from datasources_ops_log
- Connection configuration (without secrets)
- Data Source schema
- Sample of problematic messages (if available)
Check monitoring:
- Review Kafka monitoring guide
- Check Service Data Sources for additional context
- Query both kafka_ops_log and datasources_ops_log for the complete picture
Review related guides:
- Performance optimization guide for throughput issues
- Schema management guide for schema-related problems
Contact support:
- Provide error messages and timestamps from both logs
- Include relevant queries from kafka_ops_log and datasources_ops_log
- Share configuration details (sanitized)
- Describe steps to reproduce the issue
Prevention best practices¶
- Use unique consumer group IDs for each Data Source and environment
- Test schema changes in development before deploying to production
- Monitor both kafka_ops_log and datasources_ops_log regularly to catch issues early
- Set up automated alerts for high lag or error rates in both logs using monitoring tools (see the lag-threshold sketch after this list)
- Review Quarantine Data Source periodically to identify data quality issues
- Test connections using tb connection data <connection_name> to preview data before deploying
- Document consumer group usage to avoid conflicts
- Test with sample messages before connecting production topics
- Use environment-specific configurations for development, staging, and production
- Keep credentials secure using Tinybird secrets, never hardcode them
- Regularly review Kafka ACLs to ensure proper permissions
- Monitor for missing tables and recreate Data Sources if accidentally deleted
- Verify topic and partition availability before deploying connectors
- Optimize Materialized View queries to prevent timeout and memory errors
- Set up monitoring for authorization errors to catch permission issues early
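For the lag alert mentioned above, a monitoring tool could poll a query like the following and fire when any topic exceeds a threshold. This is a minimal sketch; the 15-minute window and 100000-message threshold are arbitrary examples to adjust for your workload:

```sql
SELECT
    datasource_id,
    topic,
    max(lag) AS max_lag
FROM tinybird.kafka_ops_log
WHERE timestamp > now() - INTERVAL 15 minute
    AND partition >= 0
GROUP BY datasource_id, topic
HAVING max_lag > 100000
```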
Integrate with your monitoring stack: Connect the monitoring queries in this guide to your existing monitoring tools. Query the ClickHouse® HTTP interface directly from Grafana, Datadog, PagerDuty, Slack, and other alerting systems. You can also create API endpoints from these queries, or export them in Prometheus format for Prometheus-compatible tools. This enables proactive monitoring and automated alerting for your Kafka connectors.
For comprehensive monitoring queries and alerts, see Monitor Kafka connectors.