---
title: Kafka connector troubleshooting guide
meta:
  description: Comprehensive troubleshooting guide for common Kafka connector errors, including connectivity issues, deserialization failures, offset conflicts, and performance problems.
---

# Kafka connector troubleshooting guide

This guide helps you diagnose and resolve common issues with Tinybird's Kafka connector. Use the `tinybird.kafka_ops_log` [Service Data Source](/forward/monitoring/service-datasources#tinybird-kafka-ops-log) to monitor errors and warnings in real time.

For setup instructions and configuration details, see the [Kafka connector documentation](/forward/get-data-in/connectors/kafka).

## Quick error lookup

Use this table to quickly find errors and their solutions. Errors may appear in `kafka_ops_log` (Kafka connector operations) or `datasources_ops_log` (Data Source ingestion operations).

{% table %}
* Error message / symptom
* Category
* Log source
* Solution link
---
* Connection timeout or broker unreachable
* Connectivity
* kafka_ops_log
* [Connection timeout](#error-connection-timeout-or-broker-unreachable)
---
* Authentication failed
* Authentication
* kafka_ops_log, datasources_ops_log
* [Authentication failed](#error-authentication-failed)
---
* Authentication failure during ingestion
* Authentication
* datasources_ops_log
* [Authentication failure during ingestion](#error-authentication-failure-during-ingestion)
---
* SSL handshake failed
* SSL/TLS
* kafka_ops_log
* [SSL certificate validation](#error-ssltls-certificate-validation-failed)
---
* Schema Registry connection failed
* Deserialization
* kafka_ops_log
* [Schema Registry](#error-schema-registry-connection-failed)
---
* Deserialization failed - Avro
* Deserialization
* kafka_ops_log
* [Avro deserialization](#error-deserialization-failed---avro)
---
* Deserialization failed - JSON
* Deserialization
* kafka_ops_log
* [JSON deserialization](#error-deserialization-failed---json)
---
* Offset commit failed
* Consumer group
* kafka_ops_log
* [Offset commit](#error-offset-commit-failed-or-consumer-group-conflict)
---
* Consumer lag continuously increasing
* Performance
* kafka_ops_log
* [Consumer lag](#error-consumer-lag-continuously-increasing)
---
* Schema mismatch or type conversion failed
* Schema
* kafka_ops_log
* [Schema mismatch](#error-schema-mismatch-or-type-conversion-failed)
---
* Materialized View errors
* Schema
* kafka_ops_log, datasources_ops_log
* [Materialized View errors](#error-materialized-view-errors)
---
* Low throughput or processing stall
* Performance
* kafka_ops_log
* [Low throughput](#error-low-throughput-or-processing-stall)
---
* Uneven partition processing
* Performance
* kafka_ops_log
* [Uneven partitions](#error-uneven-partition-processing)
---
* Message too large
* Message size
* kafka_ops_log
* [Message size](#error-message-too-large-or-quarantined-due-to-size)
---
* Compressed message handling
* Message format
* kafka_ops_log
* [Compression](#error-compressed-message-handling)
---
* Unknown topic or partition
* Kafka
* datasources_ops_log
* [Unknown topic](#error-unknown-topic-or-partition)
---
* Group authorization failed
* Authorization
* datasources_ops_log
* [Group authorization](#error-group-authorization-failed)
---
* Topic authorization failed
* Authorization
* datasources_ops_log
* [Topic authorization](#error-topic-authorization-failed)
---
* Unknown partition
* Kafka
* datasources_ops_log
* [Unknown partition](#error-unknown-partition)
---
* Table in readonly mode
* Data Source
* datasources_ops_log
* [Readonly mode](#error-table-in-readonly-mode)
---
* Timeout or memory limit exceeded
* Resource
* datasources_ops_log
* [Timeout](#error-timeout-or-memory-limit-exceeded)
{% /table %}

## How to diagnose errors

Use both `kafka_ops_log` and `datasources_ops_log` to diagnose Kafka connector issues:

### Check Kafka connector operations

Query recent errors and warnings from `kafka_ops_log`:

```sql
SELECT
    timestamp,
    datasource_id,
    topic,
    partition,
    msg_type,
    msg,
    lag
FROM tinybird.kafka_ops_log
WHERE timestamp > now() - INTERVAL 1 hour
  AND msg_type IN ('warning', 'error')
ORDER BY timestamp DESC
```

### Check Data Source ingestion errors

Query errors from `datasources_ops_log` to see issues during data ingestion:

```sql
SELECT
    timestamp,
    datasource_id,
    event_type,
    result,
    error,
    elapsed_time
FROM tinybird.datasources_ops_log
WHERE timestamp > now() - INTERVAL 1 hour
  AND result = 'error'
  AND event_type LIKE '%kafka%'
ORDER BY timestamp DESC
```

This shows errors that occur during the actual data processing phase, even when the Kafka connection itself might be working.

{% callout type="info" %}
**Set up automated monitoring**: Connect these diagnostic queries to your monitoring and alerting tools. Query the [ClickHouse® HTTP interface](/forward/work-with-data/publish-data/clickhouse-interface) directly from tools like Grafana, Datadog, PagerDuty, and Slack. Alternatively, create API endpoints from these queries, or export them in [Prometheus format](/forward/work-with-data/publish-data/guides/consume-api-endpoints-in-prometheus-format) for Prometheus-compatible tools. Configure your tools to poll these queries periodically and trigger alerts when errors are detected.
{% /callout %}

For detailed monitoring queries, see [Monitor Kafka connectors](/forward/monitoring/kafka-clickhouse-monitoring).

## Connectivity errors

### Error: Connection timeout or broker unreachable

**Symptoms:**
- No messages are being processed
- Errors in `kafka_ops_log` with messages like "Connection timeout" or "Broker unreachable"
- High lag values that continue to increase

**Root causes:**
1. Incorrect `KAFKA_BOOTSTRAP_SERVERS` configuration
2. Network connectivity issues between Tinybird and your Kafka cluster
3. Firewall or security group rules blocking access
4. Kafka broker is down or unreachable

**Solutions:**
1. **Verify bootstrap servers configuration:**
   - Check that `KAFKA_BOOTSTRAP_SERVERS` in your `.connection` file includes the correct host and port
   - Ensure you're using the advertised listeners address, not the internal broker address
   - For multiple brokers, use comma-separated values: `broker1:9092,broker2:9092,broker3:9092`
   - For cloud providers, verify you're using the public endpoint provided by your Kafka service
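
   As a reference, here's a minimal sketch of a Kafka `.connection` file for a SASL/SSL cluster; the broker addresses are placeholders and the secret names are illustrative:
```tb
TYPE kafka

KAFKA_BOOTSTRAP_SERVERS broker1:9092,broker2:9092,broker3:9092
KAFKA_SECURITY_PROTOCOL SASL_SSL
KAFKA_SASL_MECHANISM PLAIN
KAFKA_KEY {{ tb_secret("KAFKA_KEY") }}
KAFKA_SECRET {{ tb_secret("KAFKA_SECRET") }}
```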

2. **Test connectivity:**
```bash
tb connection data <connection_name>
```
   This command allows you to select a topic and consumer group ID, then returns preview data. This validates that Tinybird can reach your Kafka broker, authenticate, and consume messages.

3. **Check network configuration:**
   - Verify firewall rules allow outbound connections from Tinybird to your Kafka cluster
   - For AWS MSK, ensure security groups allow inbound traffic on the Kafka port
   - For Confluent Cloud, verify network access settings
   - For PrivateLink setups (Enterprise), verify the PrivateLink connection is active

4. **Verify security protocol:**
   - Ensure `KAFKA_SECURITY_PROTOCOL` matches your Kafka cluster configuration
   - For most cloud providers, use `SASL_SSL`
   - For local development, you may use `PLAINTEXT`

For vendor-specific network configuration help, see:
- [Confluent Cloud setup guide](guides/confluent-cloud-setup)
- [AWS MSK setup guide](guides/aws-msk-setup)

### Error: Authentication failed

**Symptoms:**
- Errors in `kafka_ops_log` with "Authentication failed" or "SASL authentication error"
- Connection check fails with authentication errors

**Root causes:**
1. Incorrect `KAFKA_KEY` or `KAFKA_SECRET` credentials
2. Wrong `KAFKA_SASL_MECHANISM` configuration
3. Expired credentials or tokens
4. For AWS MSK with OAuthBearer, incorrect IAM role configuration

**Solutions:**
1. **Verify credentials:**
```bash
tb [--cloud] secret get KAFKA_KEY
tb [--cloud] secret get KAFKA_SECRET
```
   Ensure the secrets match your Kafka cluster credentials.

2. **Check SASL mechanism:**
   - Verify `KAFKA_SASL_MECHANISM` matches your Kafka cluster (PLAIN, SCRAM-SHA-256, SCRAM-SHA-512, or OAUTHBEARER)
   - For Confluent Cloud, typically use `PLAIN`
   - For AWS MSK with IAM, use `OAUTHBEARER` with `KAFKA_SASL_OAUTHBEARER_METHOD AWS`
   - For Redpanda, check your cluster's configured SASL mechanism

3. **For AWS MSK OAuthBearer:**
   - Verify the IAM role ARN is correct: `tb [--cloud] secret get AWS_ROLE_ARN`
   - Check that the IAM role has the correct trust policy allowing Tinybird to assume the role
   - Verify the external ID matches between your connection configuration and IAM trust policy
   - Ensure the IAM role has the required Kafka cluster permissions (see [AWS IAM permissions](index#aws-iam-permissions))
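
   As a sketch, the SASL-related settings for MSK with IAM authentication combine the values mentioned above in the `.connection` file:
```tb
KAFKA_SECURITY_PROTOCOL SASL_SSL
KAFKA_SASL_MECHANISM OAUTHBEARER
KAFKA_SASL_OAUTHBEARER_METHOD AWS
```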

4. **Rotate credentials if needed:**
   - If credentials have expired, update them using `tb secret set`
   - Redeploy your connection after updating secrets

For detailed authentication setup, see:
- [AWS MSK setup guide](guides/aws-msk-setup) for IAM authentication
- [Confluent Cloud setup guide](guides/confluent-cloud-setup) for API key authentication

### Error: SSL/TLS certificate validation failed

**Symptoms:**
- Errors mentioning "SSL handshake failed" or "certificate validation error"
- Connection failures when using `SASL_SSL` security protocol

**Root causes:**
1. Missing or incorrect CA certificate
2. Self-signed certificate not provided
3. Certificate expired or invalid

**Solutions:**
1. **Provide CA certificate:**
```bash
tb [--cloud] secret set --multiline KAFKA_SSL_CA_PEM
```
   Paste your CA certificate in PEM format.

2. **Add certificate to connection file:**
```tb
KAFKA_SSL_CA_PEM >
   {{ tb_secret("KAFKA_SSL_CA_PEM") }}
```
   Note: This is a [multiline setting](/forward/dev-reference/datafiles#multiple-lines).

3. **Verify certificate format:**
   - Ensure the certificate is in PEM format (starts with `-----BEGIN CERTIFICATE-----`)
   - Include the full certificate chain if required
   - For Aiven Kafka, download the CA certificate from the Aiven console
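
   To check a PEM file locally before storing it as a secret, `openssl` can print its subject and validity window (assuming the certificate is saved as `ca.pem`):
```bash
# Prints the certificate subject plus notBefore/notAfter dates
openssl x509 -in ca.pem -noout -subject -dates
```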

## Deserialization errors

### Error: Schema Registry connection failed

**Symptoms:**
- Errors in `kafka_ops_log` mentioning "Schema Registry" or "Failed to fetch schema"
- Messages not being ingested when using Avro or JSON with schema

**Root causes:**
1. Incorrect `KAFKA_SCHEMA_REGISTRY_URL` configuration
2. Missing or incorrect Schema Registry credentials
3. Schema Registry is unreachable
4. Schema not found in Schema Registry

**Solutions:**
1. **Verify Schema Registry URL:**
   - Check that `KAFKA_SCHEMA_REGISTRY_URL` in your `.connection` file is correct
   - For Basic Auth, use format: `https://<username>:<password>@<registry_host>`
   - Ensure the URL is accessible from Tinybird's network

2. **Check schema exists:**
   - Verify the schema exists in your Schema Registry for the topic
   - Ensure the schema subject name matches your topic naming convention
   - For Confluent Schema Registry, check subject names like `{topic-name}-value` or `{topic-name}-key`

3. **Test Schema Registry access:**
   - Use curl or similar tool to verify Schema Registry is reachable
   - Verify credentials work with Schema Registry API
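
   For example, with Basic Auth credentials and placeholder hostnames, you can list registered subjects and fetch the latest schema for a topic's value subject:
```bash
# List registered subjects
curl -u <username>:<password> https://<registry_host>/subjects

# Fetch the latest schema registered for a topic's value subject
curl -u <username>:<password> https://<registry_host>/subjects/<topic-name>-value/versions/latest
```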

For more information on schema management, see the [schema management guide](guides/schema-management).

### Error: Deserialization failed - Avro

**Symptoms:**
- Warnings in `kafka_ops_log` with "Deserialization failed" or "Avro parsing error"
- Messages sent to Quarantine Data Source
- `processed_messages` > `committed_messages` in monitoring queries

**Root causes:**
1. Schema mismatch between message and Schema Registry
2. Schema evolution incompatibility
3. Incorrect `KAFKA_VALUE_FORMAT` or `KAFKA_KEY_FORMAT` configuration
4. Corrupted message data

**Solutions:**
1. **Verify format configuration:**
   - Ensure `KAFKA_VALUE_FORMAT` is set to `avro` for Avro messages
   - Ensure `KAFKA_KEY_FORMAT` is set to `avro` if keys are Avro-encoded
   - Verify `KAFKA_SCHEMA_REGISTRY_URL` is configured

2. **Check schema compatibility:**
   - Verify the message schema matches the schema in Schema Registry
   - Check for schema evolution issues (backward/forward compatibility)
   - Review Quarantine Data Source to see the actual message that failed

3. **Inspect quarantined messages:**
```sql
SELECT *
FROM your_datasource_quarantine
WHERE timestamp > now() - INTERVAL 1 hour
ORDER BY timestamp DESC
LIMIT 100
```
   This helps you see the actual message content and identify the issue.

4. **Schema evolution:**
   - Ensure schema changes are backward compatible
   - Consider using schema versioning strategies
   - Test schema changes in a development environment first

For detailed schema evolution guidance, see the [schema management guide](guides/schema-management).

### Error: Deserialization failed - JSON

**Symptoms:**
- Warnings in `kafka_ops_log` with "JSON parsing error" or "Invalid JSON"
- Messages in Quarantine Data Source
- Low success rate in throughput monitoring

**Root causes:**
1. Invalid JSON format in message payload
2. Schema mismatch with JSONPath expressions
3. Missing required fields in JSON
4. Incorrect `KAFKA_VALUE_FORMAT` configuration

**Solutions:**
1. **Verify JSON format:**
   - Check that messages are valid JSON
   - Use a JSON validator to test sample messages
   - Review Quarantine Data Source for examples of failed messages

2. **Check JSONPath expressions:**
   - Verify JSONPath expressions in your Data Source schema match the message structure
   - Test JSONPath expressions with sample messages
   - Use `json:$` to store the entire message if you're unsure of the structure

3. **Handle missing fields:**
   - Use nullable types for optional fields: `Nullable(String)`
   - Provide default values in JSONPath: `json:$.field DEFAULT ''`
   - Consider using a schemaless approach with `data String json:$` and extract fields later

4. **Verify format configuration:**
   - Use `json_without_schema` for plain JSON messages
   - Use `json_with_schema` only if you're using Schema Registry for JSON schemas
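
As a sketch, a Data Source schema can combine typed JSONPath columns with a catch-all column that stores the full message; all names here are illustrative, and `KAFKA_CONNECTION_NAME` must reference an existing connection:

```tb
SCHEMA >
    `timestamp` DateTime `json:$.timestamp`,
    `user_id` Nullable(String) `json:$.user_id`,
    `data` String `json:$`

KAFKA_CONNECTION_NAME my_kafka
KAFKA_TOPIC events
KAFKA_GROUP_ID {{ tb_secret("KAFKA_GROUP_ID", "events-dev") }}
```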

## Offset and consumer group errors

### Error: Offset commit failed or consumer group conflict

**Symptoms:**
- Data Source only receives messages from the last committed offset
- Multiple Data Sources competing for the same consumer group
- Errors about offset commit failures

**Root causes:**
1. Multiple Data Sources using the same `KAFKA_TOPIC` and `KAFKA_GROUP_ID` combination
2. Consumer group already in use by another app
3. Offset reset behavior not working as expected

**Solutions:**
1. **Use unique consumer group IDs:**
   - Each Data Source must use a unique `KAFKA_GROUP_ID` for the same topic
   - Use environment-specific group IDs: `{{ tb_secret("KAFKA_GROUP_ID", "prod-group") }}`
   - For testing, use unique group IDs to avoid conflicts

2. **Check for duplicate configurations:**
```sql
SELECT
      topic,
      count(DISTINCT datasource_id) as datasource_count
FROM tinybird.kafka_ops_log
WHERE timestamp > now() - INTERVAL 24 hour
GROUP BY topic
HAVING datasource_count > 1
```
   This helps identify if multiple Data Sources are consuming from the same topic.

3. **Reset offset behavior:**
   - `KAFKA_AUTO_OFFSET_RESET=earliest` only works for new consumer groups
   - If a consumer group already has committed offsets, it resumes from the last committed offset
   - To start from the beginning, use a new `KAFKA_GROUP_ID` or reset offsets in your Kafka cluster
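
   To rewind an existing consumer group instead of creating a new one, Kafka's own tooling can reset committed offsets; run it against your cluster while no consumer in the group is active:
```bash
# Reset the group's committed offsets for a topic back to the earliest available message
kafka-consumer-groups.sh --bootstrap-server <server> \
  --group <group_id> --topic <topic_name> \
  --reset-offsets --to-earliest --execute
```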

4. **Best practices:**
   - Use different `KAFKA_GROUP_ID` values for development, staging, and production
   - Document which consumer groups are in use
   - Monitor consumer group activity in your Kafka cluster

For managing consumer groups across environments, see the [CI/CD and version control guide](guides/cicd-version-control).

### Error: Consumer lag continuously increasing

**Symptoms:**
- Lag values in `kafka_ops_log` keep growing
- Messages are not being processed fast enough
- Throughput is lower than message production rate

**Root causes:**
1. Message production rate exceeds processing capacity
2. Consumer autoscaling not keeping up with load
3. Network latency or connectivity issues
4. Data Source schema or Materialized View performance issues

**Solutions:**
1. **Monitor lag trends:**
```sql
SELECT
      datasource_id,
      topic,
      partition,
      max(lag) as current_lag,
      avg(lag) as avg_lag
FROM tinybird.kafka_ops_log
WHERE timestamp > now() - INTERVAL 1 hour
   AND partition >= 0
GROUP BY datasource_id, topic, partition
ORDER BY current_lag DESC
```

2. **Verify autoscaling:**
   - Tinybird's serverless Kafka connector automatically scales consumers
   - Monitor `kafka_ops_log` to see partition assignment changes
   - If lag continues to increase, there may be a bottleneck in your Data Source or Materialized Views

3. **Check Data Source performance:**
   - Review Materialized View queries that trigger on append
   - Optimize complex Materialized View queries that may slow down ingestion
   - Check for schema issues causing slow parsing

4. **Analyze throughput:**
```sql
SELECT
      datasource_id,
      topic,
      sum(processed_messages) as processed,
      sum(committed_messages) as committed,
      (sum(committed_messages) * 100.0 / sum(processed_messages)) as success_rate
FROM tinybird.kafka_ops_log
WHERE timestamp > now() - INTERVAL 1 hour
GROUP BY datasource_id, topic
```
   Low success rates indicate processing issues.

5. **Review partitioning strategy:**
   - Check if partition distribution is even
   - Review partition key design if lag is uneven across partitions
   - See the [partitioning strategies guide](guides/partitioning-strategies) for optimization tips

6. **Contact support:**
   - If lag continues to increase despite autoscaling, contact Tinybird support
   - Provide `kafka_ops_log` queries showing the issue
   - Include information about message production rates

For performance optimization strategies, see the [performance optimization guide](guides/performance-optimization).

## Data quality and schema errors

### Error: Schema mismatch or type conversion failed

**Symptoms:**
- Warnings in `kafka_ops_log` about type mismatches
- Messages in Quarantine Data Source
- Low `committed_messages` compared to `processed_messages`

**Root causes:**
1. Data type mismatch between message and Data Source schema
2. Missing required fields
3. Invalid data formats (for example, date strings that can't be parsed)
4. JSONPath expressions not matching message structure

**Solutions:**
1. **Review schema definition:**
   - Verify column types match the data in messages
   - Use appropriate ClickHouse® types (for example, `DateTime` for timestamps, `Int64` for large integers)
   - Check for nullable vs non-nullable field requirements

2. **Test with sample messages:**
   - Use `tb sql` to test JSONPath expressions with sample data
   - Verify date/time formats can be parsed correctly
   - Check numeric formats and precision
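
   For example, a quick check with hypothetical sample values, using standard ClickHouse® parsing functions:
```bash
# Confirm that sample strings parse into the target column types
tb sql "SELECT parseDateTimeBestEffort('2024-01-15T10:30:00Z') AS ts, toInt64('42') AS big_int"
```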

3. **Handle data quality issues:**
   - Use nullable types for fields that may be missing: `Nullable(String)`
   - Provide default values: `json:$.field DEFAULT 0`
   - Use type conversion functions if needed: `toDateTime(JSONExtractString(data, 'timestamp'))`

4. **Inspect quarantined data:**
   - Regularly check Quarantine Data Source for patterns
   - Identify common data quality issues
   - Update schema or data producers to fix root causes

For detailed schema management guidance, see the [schema management guide](guides/schema-management).

### Error: Materialized View errors

**Symptoms:**
- Warnings in `kafka_ops_log` mentioning Materialized View errors
- Data ingested but Materialized Views not updating
- Errors in Materialized View queries, which can also block ingestion

**Root causes:**
1. Materialized View query errors
2. Schema changes breaking Materialized View queries
3. Resource constraints (memory, CPU)
4. Circular dependencies between Materialized Views

**Solutions:**
1. **Check Materialized View queries:**
   - Review Materialized View pipe definitions
   - Test Materialized View queries independently
   - Verify queries work with the current Data Source schema

2. **Monitor Materialized View impact:**
   - Monitor overall ingestion throughput in `kafka_ops_log` to see if Materialized Views are slowing down ingestion
   - Check for errors in Materialized View queries that may be blocking ingestion
   - Review Materialized View query complexity and execution time

3. **Optimize Materialized View queries:**
   - Simplify complex aggregations
   - Add appropriate filters to reduce data volume
   - Consider breaking complex Materialized Views into multiple steps

4. **Handle schema evolution:**
   - Update Materialized View queries when Data Source schema changes
   - Test Materialized View changes in development first
   - Use `FORWARD_QUERY` to provide default values for new columns
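
As a simplified sketch of the last point, assuming a new `note` column was added to an existing schema, a `FORWARD_QUERY` can supply a value for rows written before the change:

```tb
SCHEMA >
    `id` Int64 `json:$.id`,
    `note` Nullable(String) `json:$.note`

FORWARD_QUERY >
    SELECT id, NULL AS note
```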

## Data Source operation errors

These errors occur during the data ingestion phase and are logged in `datasources_ops_log`. They represent problems that happen during actual data processing, even when the Kafka connection itself might be working.

### Error: Unknown topic or partition

**Error message:**
- "KafkaError[UNKNOWN_TOPIC_OR_PART]: Broker: Unknown topic or partition"

**Symptoms:**
- Errors in `datasources_ops_log` with "Unknown topic or partition"
- No messages being ingested
- Topic name errors

**Root causes:**
1. Topic doesn't exist in Kafka cluster
2. Topic was deleted
3. Topic name typo in configuration
4. Topic retention policies caused data deletion

**Solutions:**
1. **Verify topic exists:**
   - Check your Kafka cluster to confirm the topic exists
   - Use Kafka tools: `kafka-topics.sh --list --bootstrap-server <server>`
   - Verify topic name matches exactly (case-sensitive)
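
   For example, to confirm the topic exists and inspect its partition count and replication:
```bash
# Describe the topic; fails if the topic doesn't exist
kafka-topics.sh --describe --topic <topic_name> --bootstrap-server <server>
```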

2. **Check topic configuration:**
   - Ensure topic hasn't been deleted
   - Verify topic retention policies haven't removed all data
   - Check if topic was renamed

3. **Verify Data Source configuration:**
```bash
cat <datasource_name>.datasource
```
   Check that `KAFKA_TOPIC` matches the actual topic name.

4. **Create topic if needed:**
   - If topic doesn't exist, create it in your Kafka cluster
   - Ensure proper replication factor and partitions
   - Redeploy the Data Source after creating the topic

**Note:** If the error specifically mentions a partition (not the topic), see [Unknown partition](#error-unknown-partition) in the following section for partition-specific troubleshooting.

**Monitor topic errors:**
```sql
SELECT
    datasource_id,
    error,
    count(*) as error_count,
    max(timestamp) as last_seen
FROM tinybird.datasources_ops_log
WHERE timestamp > now() - INTERVAL 24 hour
  AND result = 'error'
  AND error LIKE '%UNKNOWN_TOPIC%'
GROUP BY datasource_id, error
ORDER BY error_count DESC
```

### Error: Authentication failure (during ingestion)

**Error message:**
- "KafkaError[_AUTHENTICATION]: Local: Authentication failure"

**Symptoms:**
- Errors in `datasources_ops_log` with authentication failures
- Connection works initially but fails during ingestion
- Credentials expired or rotated during operation

**Root causes:**
1. SASL credentials expired or invalid during ingestion
2. SSL certificates expired
3. Authentication settings changed on Kafka broker
4. Credentials rotated but not updated in Tinybird
5. Token-based authentication expired mid-operation

**Solutions:**
1. **Verify credentials:**
```bash
tb [--cloud] secret get KAFKA_KEY
tb [--cloud] secret get KAFKA_SECRET
```
   Ensure secrets match your Kafka cluster credentials.

2. **Check for credential expiration:**
   - Some credentials have expiration dates
   - Rotate credentials if they've expired
   - Update secrets and redeploy connection
   - For token-based auth, ensure tokens are refreshed before expiration

3. **Verify SSL certificates:**
   - Check certificate expiration dates
   - Update certificates if expired
   - Verify certificate format is correct

4. **Test connection:**
```bash
tb connection data <connection_name>
```
   This validates authentication is working.

5. **Check for intermittent auth failures:**
   - Monitor `datasources_ops_log` for authentication error patterns
   - If errors occur periodically, credentials may be expiring
   - Set up credential rotation before expiration

**Note:** This error occurs during data ingestion, not during initial connection. If you see authentication errors during connection setup, see [Authentication failed](#error-authentication-failed) in the Connectivity errors section.

**Monitor authentication errors:**
```sql
SELECT
    datasource_id,
    error,
    count(*) as error_count,
    max(timestamp) as last_seen
FROM tinybird.datasources_ops_log
WHERE timestamp > now() - INTERVAL 24 hour
  AND result = 'error'
  AND error LIKE '%AUTHENTICATION%'
GROUP BY datasource_id, error
ORDER BY error_count DESC
```

### Error: Group authorization failed

**Error message:**
- "KafkaError[GROUP_AUTHORIZATION_FAILED]: Broker: Group authorization failed"

**Symptoms:**
- Errors in `datasources_ops_log` with "Group authorization failed"
- Consumer group lacks permissions
- ACLs not configured correctly

**Root causes:**
1. Consumer group lacks proper authorization
2. Kafka ACLs not configured for the consumer group
3. Consumer group name doesn't match ACL configuration
4. Permissions changed on Kafka cluster

**Solutions:**
1. **Check Kafka ACLs:**
   - Verify consumer group has read permissions
   - Check ACLs for the specific consumer group ID
   - Ensure ACLs allow operations on the topic

2. **Verify consumer group ID:**
   - Check the `KAFKA_GROUP_ID` in your Data Source configuration
   - Ensure it matches what's configured in Kafka ACLs
   - Use consistent naming across environments

3. **Update ACLs:**
   - Grant necessary permissions to the consumer group
   - Ensure group has access to read from the topic
   - Verify group can commit offsets
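
   As a sketch using Kafka's standard ACL tooling (the principal format depends on your cluster's authorizer and auth setup):
```bash
# Grant the consumer group read access
kafka-acls.sh --bootstrap-server <server> --add \
  --allow-principal User:<user> \
  --operation Read --group <group_id>
```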

4. **Test with different group ID:**
   - Try a different consumer group ID temporarily
   - If it works, the issue is with ACLs for the original group
   - Update ACLs for the original group

**Monitor authorization errors:**
```sql
SELECT
    datasource_id,
    error,
    count(*) as error_count,
    max(timestamp) as last_seen
FROM tinybird.datasources_ops_log
WHERE timestamp > now() - INTERVAL 24 hour
  AND result = 'error'
  AND error LIKE '%GROUP_AUTHORIZATION%'
GROUP BY datasource_id, error
ORDER BY error_count DESC
```

### Error: Topic authorization failed

**Error message:**
- "KafkaError[TOPIC_AUTHORIZATION_FAILED]: Broker: Topic authorization failed"

**Symptoms:**
- Errors in `datasources_ops_log` with "Topic authorization failed"
- Cannot read from topic
- ACLs not configured for topic access

**Root causes:**
1. Kafka client lacks permission to read from topic
2. Topic ACLs not configured
3. Permissions changed on Kafka cluster
4. Credentials don't have topic access

**Solutions:**
1. **Check topic ACLs:**
   - Verify credentials have read permissions on the topic
   - Check ACLs for the specific topic name
   - Ensure ACLs allow consumer operations

2. **Verify credentials:**
   - Ensure credentials have proper topic access
   - Check if topic permissions have changed
   - Update credentials if needed

3. **Update ACLs:**
   - Grant read permissions to the topic
   - Ensure consumer group has topic access
   - Verify ACLs are applied correctly
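
   As a sketch using Kafka's standard ACL tooling (adjust the principal for your auth setup):
```bash
# Grant read access on the topic
kafka-acls.sh --bootstrap-server <server> --add \
  --allow-principal User:<user> \
  --operation Read --topic <topic_name>
```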

4. **Test connection:**
```bash
tb connection data <connection_name>
```
   Select the topic to verify access.

**Monitor topic authorization errors:**
```sql
SELECT
    datasource_id,
    error,
    count(*) as error_count,
    max(timestamp) as last_seen
FROM tinybird.datasources_ops_log
WHERE timestamp > now() - INTERVAL 24 hour
  AND result = 'error'
  AND error LIKE '%TOPIC_AUTHORIZATION%'
GROUP BY datasource_id, error
ORDER BY error_count DESC
```

### Error: Unknown partition

**Error message:**
- "KafkaError[_UNKNOWN_PARTITION]: Local: Unknown partition"

**Symptoms:**
- Errors in `datasources_ops_log` with "Unknown partition" (note: different from "Unknown topic or partition")
- Specific partition no longer available
- Topic reconfiguration issues
- Partition-specific errors

**Root causes:**
1. Partition no longer exists in topic (topic was reconfigured)
2. Topic reconfiguration changed partition count
3. Broker failures affecting specific partition availability
4. Partition replication issues
5. Partition was deleted or reassigned

**Solutions:**
1. **Check topic configuration:**
   - Verify current partition count for the topic
   - Check if topic was reconfigured (partitions added/removed)
   - Ensure partition assignments are correct
   - Compare current partition count with what the connector expects

2. **Check broker health:**
   - Verify all brokers are healthy
   - Check for broker failures that might affect specific partitions
   - Ensure partition replication is working
   - Review partition leader assignments

3. **Review partition assignments:**
   - Check if partition assignments changed
   - Verify replication factors are correct
   - Consider rebalancing if needed
   - Check if partitions were reassigned to different brokers

4. **Monitor partition availability:**
   - Use `kafka_ops_log` to see which partitions are being accessed
   - Check for partition-specific errors
   - Identify which specific partition is causing issues
   - Contact support if partitions are consistently unavailable

**Note:** This error is different from "Unknown topic or partition" - this specifically indicates a partition issue when the topic exists. If you see "UNKNOWN_TOPIC_OR_PART", see [Unknown topic or partition](#error-unknown-topic-or-partition) in the preceding section.

**Monitor partition errors:**
```sql
SELECT
    datasource_id,
    error,
    count(*) as error_count,
    max(timestamp) as last_seen
FROM tinybird.datasources_ops_log
WHERE timestamp > now() - INTERVAL 24 hour
  AND result = 'error'
  AND error LIKE '%UNKNOWN_PARTITION%'
  AND error NOT LIKE '%UNKNOWN_TOPIC%'
GROUP BY datasource_id, error
ORDER BY error_count DESC
```

### Error: Table in readonly mode

**Error message:**
- "Table is in readonly mode: replica_path=..."

**Symptoms:**
- Errors in `datasources_ops_log` with "readonly mode"
- Data Source temporarily unavailable for writes
- Replication or maintenance in progress

**Root causes:**
1. ClickHouse® table in readonly mode during replication
2. Ongoing maintenance operations
3. ClickHouse® cluster issues
4. Replication lag or issues

**Solutions:**
1. **Wait for table to become writable:**
   - This is often a transient state
   - Wait a few minutes and check again
   - Monitor `datasources_ops_log` for resolution

2. **Check ClickHouse® cluster:**
   - Verify cluster health
   - Check for ongoing maintenance

3. **Monitor for resolution:**
```sql
SELECT
      datasource_id,
      error,
      count(*) as occurrence_count,
      max(timestamp) as last_seen
FROM tinybird.datasources_ops_log
WHERE timestamp > now() - INTERVAL 1 hour
   AND result = 'error'
   AND error LIKE '%readonly%'
GROUP BY datasource_id, error
ORDER BY last_seen DESC
```

4. **Contact support:**
   - If issue persists for extended period
   - Provide `datasources_ops_log` queries showing the issue
   - Include timestamps and Data Source IDs

**Note:** Readonly mode errors are typically transient and resolve automatically. If they persist, contact Tinybird support.

### Error: Timeout or memory limit exceeded

**Error message:**
- "memory limit exceeded: would use ... GiB"
- "Waiting timeout for memo"
- Timeout errors during ingestion

**Symptoms:**
- Errors in `datasources_ops_log` with timeout or memory errors
- Large messages or complex transformations
- Resource constraints

**Root causes:**
1. Message size too large
2. Complex Materialized View queries consuming too much memory
3. High message throughput
4. Resource constraints

**Solutions:**
1. **Reduce message size:**
   - Use Kafka compression
   - Split large messages into smaller chunks
   - Move large data to external storage

2. **Optimize Materialized View queries:**
   - Simplify complex aggregations
   - Add filters to reduce data volume
   - Break complex transformations into multiple steps

3. **Monitor memory usage:**
```sql
SELECT
      timestamp,
      datasource_id,
      error,
      elapsed_time
FROM tinybird.datasources_ops_log
WHERE timestamp > now() - INTERVAL 24 hour
   AND result = 'error'
   AND (error LIKE '%memory%' OR error LIKE '%timeout%')
ORDER BY timestamp DESC
```

4. **Optimize transformations:**
   - Reduce data processed per operation
   - Use more efficient query patterns
   - Consider batching operations

5. **Contact support:**
   - If memory issues persist
   - Discuss resource requirements
   - Consider plan upgrades if needed

For more information on handling large messages, see the [message size handling guide](guides/message-size-handling).

## Performance and throughput issues

### Error: Low throughput or processing stall

**Symptoms:**
- `processed_messages` is zero or low
- No recent activity in `kafka_ops_log`
- Data Source not receiving new messages

**Root causes:**
1. Kafka topic has no new messages
2. Consumer has stopped or crashed
3. Network connectivity issues
4. Configuration errors preventing consumption

**Solutions:**
1. **Verify topic has messages:**
   - Check your Kafka cluster to verify messages are being produced
   - Use Kafka tools to verify topic has new messages
   - Check producer metrics

2. **Check connector activity:**
```sql
SELECT
      datasource_id,
      topic,
      max(timestamp) as last_activity,
      dateDiff('minute', max(timestamp), now()) as minutes_since_activity
FROM tinybird.kafka_ops_log
WHERE timestamp > now() - INTERVAL 7 day
GROUP BY datasource_id, topic
HAVING last_activity < now() - INTERVAL 1 hour
```

3. **Verify configuration:**
   - Run `tb connection data <connection_name>` to test the connection and preview data
   - Verify all required settings are present
   - Check for typos in topic names or connection names

4. **Check for errors:**
   - Review recent errors in `kafka_ops_log`
   - Check Quarantine Data Source for issues
   - Review `datasources_ops_log` for Data Source operation errors

### Error: Uneven partition processing

**Symptoms:**
- Some partitions have high lag while others have low lag
- Uneven message distribution across partitions
- Some partitions processing faster than others

**Root causes:**
1. Uneven message distribution in Kafka topic
2. Partition key design causing hot partitions
3. Consumer assignment imbalance
4. Different message sizes across partitions

**Solutions:**
1. **Analyze partition distribution:**
```sql
SELECT
      datasource_id,
      topic,
      partition,
      max(lag) as max_lag,
      avg(lag) as avg_lag,
      sum(processed_messages) as total_messages
FROM tinybird.kafka_ops_log
WHERE timestamp > now() - INTERVAL 24 hour
   AND partition >= 0
GROUP BY datasource_id, topic, partition
ORDER BY max_lag DESC
```

2. **Review partition key strategy:**
   - Ensure partition keys distribute messages evenly
   - Avoid using keys that create hot partitions
   - Consider using random keys if even distribution is needed

3. **Monitor autoscaling:**
   - Tinybird's connector automatically balances partition assignment
   - Monitor `kafka_ops_log` to see partition assignment changes
   - High lag should trigger additional consumer instances

4. **Optimize at producer level:**
   - Review Kafka producer configuration
   - Adjust partition key strategy if needed
   - Consider increasing topic partitions if needed

For detailed partitioning strategies, see the [partitioning strategies guide](guides/partitioning-strategies).

## Compression and message format errors

### Error: Compressed message handling

**Symptoms:**
- Messages ingested as raw bytes instead of decompressed content
- Warnings about message format

**Root causes:**
1. Messages compressed before being sent to Kafka producer
2. Kafka compression not configured correctly
3. Message format not recognized

**Solutions:**
1. **Understand compression types:**
   - Kafka compression (configured in producer): Automatically decompressed by Kafka consumer
   - App-level compression (compressed before producing): Not automatically decompressed

2. **Use Kafka compression:**
   - Configure Kafka producer with `compression.type=gzip` (or snappy, lz4)
   - Kafka consumer automatically decompresses these messages
   - Messages arrive in Tinybird already decompressed

3. **Handle app-level compression:**
   - If you compress messages before sending to Kafka, you need to handle decompression
   - Consider storing compressed messages and decompressing in Materialized Views
   - Or change producer to use Kafka compression instead

4. **Verify message format:**
   - Check that `KAFKA_VALUE_FORMAT` matches your message format
   - For JSON, use `json_without_schema` or `json_with_schema`
   - For Avro, use `avro` with Schema Registry configured

## Message size errors

### Error: Message too large or quarantined due to size

**Symptoms:**
- Messages sent to Quarantine Data Source
- Errors about message size limits
- Large messages not being ingested

**Root causes:**
1. Message exceeds Tinybird's 10 MB default limit
2. Large payloads causing memory issues
3. Compression not reducing message size effectively

**Solutions:**
1. **Check message size:**
   - Review Quarantine Data Source for size-related errors
   - Verify message sizes in your Kafka topic
   - Use Kafka tools to inspect message sizes

2. **Implement compression:**
   - Use Kafka compression to reduce message size
   - Consider compressing large payloads before producing to Kafka

3. **Split large messages:**
   - Break large messages into smaller chunks
   - Use message headers to track message parts
   - Reassemble in Materialized Views if needed

4. **Alternative approaches:**
   - Store large payloads in object storage (S3, GCS) and reference them in Kafka messages
   - Use external storage for large binary data

For detailed guidance on handling large messages, see the [message size handling guide](guides/message-size-handling).

## Quarantine Data Source issues

### Understanding Quarantine Data Source

When messages fail to ingest into your main Data Source, they are automatically sent to a Quarantine Data Source. This prevents data loss and allows you to inspect problematic messages.

**Common reasons for quarantine:**
- Schema mismatches
- Invalid data formats
- Type conversion errors
- Missing required fields
- Deserialization failures
- Message size limits exceeded

**How to inspect quarantined messages:**
```sql
SELECT *
FROM your_datasource_quarantine
WHERE timestamp > now() - INTERVAL 24 hour
ORDER BY timestamp DESC
LIMIT 100
```
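
To surface patterns instead of reading rows one by one, aggregate the error metadata; this sketch assumes the standard quarantine metadata columns `c__error` and `insertion_date`:

```sql
SELECT
    arrayJoin(c__error) as error_message,
    count() as occurrences
FROM your_datasource_quarantine
WHERE insertion_date > now() - INTERVAL 24 hour
GROUP BY error_message
ORDER BY occurrences DESC
LIMIT 20
```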

**How to resolve:**
1. Identify patterns in quarantined messages
2. Fix schema or data quality issues
3. Update Data Source schema if needed
4. Fix data producers to send correct formats
5. Consider reprocessing quarantined messages after fixes

For more information, see [Quarantine Data Sources](/forward/get-data-in/quarantine).

## Getting help

If you've tried the preceding solutions and still experience issues:

1. **Collect diagnostic information:**
   - Recent errors from `kafka_ops_log`
   - Recent errors from `datasources_ops_log`
   - Connection configuration (without secrets)
   - Data Source schema
   - Sample of problematic messages (if available)

2. **Check monitoring:**
   - Review [Kafka monitoring guide](/forward/monitoring/kafka-clickhouse-monitoring)
   - Check Service Data Sources for additional context
   - Query both `kafka_ops_log` and `datasources_ops_log` for complete picture

3. **Review related guides:**
   - [Performance optimization guide](guides/performance-optimization) for throughput issues
   - [Schema management guide](guides/schema-management) for schema-related problems

4. **Contact support:**
   - Provide error messages and timestamps from both logs
   - Include relevant queries from `kafka_ops_log` and `datasources_ops_log`
   - Share configuration details (sanitized)
   - Describe steps to reproduce the issue

## Prevention best practices

1. **Use unique consumer group IDs** for each Data Source and environment
2. **Test schema changes** in development before deploying to production
3. **Monitor both `kafka_ops_log` and `datasources_ops_log`** regularly to catch issues early
4. **Set up automated alerts** for high lag or error rates in both logs using monitoring tools
5. **Review Quarantine Data Source** periodically to identify data quality issues
6. **Test connections** using `tb connection data <connection_name>` to preview data before deploying
7. **Document consumer group usage** to avoid conflicts
8. **Test with sample messages** before connecting production topics
9. **Use environment-specific configurations** for development, staging, and production
10. **Keep credentials secure** using Tinybird secrets, never hardcode them
11. **Regularly review Kafka ACLs** to ensure proper permissions
12. **Monitor for missing tables** and recreate Data Sources if accidentally deleted
13. **Verify topic and partition availability** before deploying connectors
14. **Optimize Materialized View queries** to prevent timeout and memory errors
15. **Set up monitoring for authorization errors** to catch permission issues early

{% callout type="tip" %}
**Integrate with your monitoring stack**: Connect the monitoring queries in this guide to your existing monitoring tools. Query the [ClickHouse® HTTP interface](/forward/work-with-data/publish-data/clickhouse-interface) directly from Grafana, Datadog, PagerDuty, Slack, and other alerting systems. You can also create API endpoints from these queries, or export them in [Prometheus format](/forward/work-with-data/publish-data/guides/consume-api-endpoints-in-prometheus-format) for Prometheus-compatible tools. This enables proactive monitoring and automated alerting for your Kafka connectors.
{% /callout %}

For comprehensive monitoring queries and alerts, see [Monitor Kafka connectors](/forward/monitoring/kafka-clickhouse-monitoring).
