Message size handling

This guide covers how to handle large Kafka messages in Tinybird, including the per-message size limit and strategies for reducing or working around oversized messages.

Message size limits

Tinybird has a default message size limit of 10 MB per message. Messages exceeding this limit are automatically sent to the Quarantine Data Source.
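
If you control the producer, you can catch oversized messages before they ever reach Kafka. The sketch below is a minimal example: the 10 MB constant mirrors the limit above, the helper names are illustrative, and the producer is assumed to use a JSON value_serializer like the one in the compression example later in this guide.

import json

MAX_MESSAGE_BYTES = 10 * 1024 * 1024  # Tinybird's default per-message limit

def encoded_size(payload: dict) -> int:
    """Return the size in bytes of the payload as it will be sent to Kafka."""
    return len(json.dumps(payload).encode('utf-8'))

def send_if_small_enough(producer, topic, payload: dict):
    """Send only if the serialized payload stays under the limit."""
    size = encoded_size(payload)
    if size > MAX_MESSAGE_BYTES:
        raise ValueError(f'Payload is {size} bytes, over the {MAX_MESSAGE_BYTES} byte limit')
    producer.send(topic, value=payload)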

Checking message sizes

Check quarantined messages for size-related issues:

SELECT
    timestamp,
    length(__value) as message_size_bytes,
    length(__value) / 1024 / 1024 as message_size_mb,
    msg
FROM your_datasource_quarantine
WHERE timestamp > now() - INTERVAL 1 hour
ORDER BY message_size_bytes DESC
LIMIT 100

Strategies for handling large messages

Option 1: Compression

Use Kafka compression to reduce message size:

Producer configuration:

from kafka import KafkaProducer
import json

producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    compression_type='gzip',  # or 'snappy', 'lz4'
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

Compression types:

  • gzip - Best compression, higher CPU
  • snappy - Good balance
  • lz4 - Fast, lower compression

Option 2: Split large messages

Break large messages into smaller chunks on the producer side, then reassemble in a Materialized View if needed.
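
A minimal chunking sketch, assuming the kafka-python producer from the compression example. The field names (message_id, chunk_index, total_chunks) are illustrative keys that a Materialized View could later group on to reassemble the payload:

import json
import uuid
from kafka import KafkaProducer

CHUNK_SIZE = 512 * 1024  # stay well under the 10 MB limit, leaving room for JSON overhead

producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

def send_in_chunks(topic, payload: str):
    """Split a large string payload into chunks that share a message_id."""
    message_id = str(uuid.uuid4())
    # Character-based slicing; multi-byte characters make the byte size slightly larger
    chunks = [payload[i:i + CHUNK_SIZE] for i in range(0, len(payload), CHUNK_SIZE)]
    for index, chunk in enumerate(chunks):
        producer.send(topic, value={
            'message_id': message_id,      # reassembly key
            'chunk_index': index,
            'total_chunks': len(chunks),
            'data': chunk,
        })
    producer.flush()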

Option 3: External storage

Store large payloads in object storage (S3, GCS) and send only references in Kafka:

# Upload to S3, send reference in Kafka
message = {
    'message_id': message_id,
    's3_key': s3_key,
    'metadata': {...}
}
producer.send('topic', value=message)
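
For completeness, here is one way the upload step might look using boto3. The bucket name, key scheme, and helper name are illustrative, and the producer is assumed to use a JSON value_serializer as in the earlier example:

import json
import uuid
import boto3
from kafka import KafkaProducer

s3 = boto3.client('s3')
producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

def send_reference(topic, large_payload: bytes, metadata: dict):
    """Upload the payload to S3 and publish only a small reference message."""
    message_id = str(uuid.uuid4())
    s3_key = f'payloads/{message_id}.json'  # illustrative key scheme
    s3.put_object(Bucket='my-bucket', Key=s3_key, Body=large_payload)
    producer.send(topic, value={
        'message_id': message_id,
        's3_key': s3_key,
        'metadata': {**metadata, 'size_bytes': len(large_payload)},
    })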

Option 4: Schema optimization

Reduce message size by storing only necessary data and using references for large content:

{
  "user_id": "123",
  "profile_summary": "key points only",
  "full_profile_s3_key": "s3://bucket/profiles/123.json"
}

Troubleshooting quarantined messages

Find messages that were quarantined for exceeding the size limit:

SELECT
    timestamp,
    length(__value) as message_size,
    length(__value) / 1024 / 1024 as size_mb,
    msg
FROM your_datasource_quarantine
WHERE timestamp > now() - INTERVAL 24 hour
  AND length(__value) > 10 * 1024 * 1024  -- Over 10 MB
ORDER BY message_size DESC

Extract useful data from quarantined messages

Even if the full message is too large, you can extract metadata:

SELECT
    timestamp,
    JSONExtractString(__value, 'message_id') as message_id,
    JSONExtractString(__value, 'user_id') as user_id,
    length(__value) as original_size
FROM your_datasource_quarantine
WHERE timestamp > now() - INTERVAL 24 hour

Monitoring message sizes

Track message size distribution

SELECT
    quantile(0.5)(message_size) as median_size,
    quantile(0.95)(message_size) as p95_size,
    quantile(0.99)(message_size) as p99_size,
    max(message_size) as max_size
FROM (
    SELECT length(__value) as message_size
    FROM your_datasource
    WHERE timestamp > now() - INTERVAL 1 hour
)

Alert on large messages

SELECT
    timestamp,
    length(__value) as message_size,
    length(__value) / 1024 / 1024 as size_mb
FROM your_datasource
WHERE length(__value) > 8 * 1024 * 1024  -- Over 8MB
  AND timestamp > now() - INTERVAL 1 hour
ORDER BY message_size DESC

Best practices

  1. Target size: Keep messages under 1 MB when possible
  2. Use Kafka compression for large messages
  3. Store only necessary data in Kafka messages
  4. Use references for large binary data (S3, GCS)
  5. Monitor message sizes regularly to catch issues early

Common issues and solutions

Issue: Messages consistently over 10 MB

Solutions:

  1. Implement Kafka compression
  2. Split messages into chunks
  3. Move large data to external storage
  4. Optimize schema to reduce size

Issue: Compression not helping

Solutions:

  1. Check whether the data is already compressed (for example, images or archives)
  2. Try a different compression type
  3. Verify compression is enabled in the producer configuration
  4. Consider whether the data is compressible at all (text compresses well, already-compressed binary does not); see the sketch below for a quick check
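
A quick, stdlib-only way to estimate the compressibility of a sample payload. This uses gzip as a rough indicator (snappy and lz4 require extra packages); the sample payload and function name are illustrative:

import gzip
import json

def compression_ratio(payload: dict) -> float:
    """Return compressed size / original size for a sample payload (lower is better)."""
    raw = json.dumps(payload).encode('utf-8')
    compressed = gzip.compress(raw)
    return len(compressed) / len(raw)

sample = {'user_id': '123', 'profile_summary': 'key points only ' * 100}
print(f'gzip ratio: {compression_ratio(sample):.2f}')  # a ratio close to 1.0 means compression won't help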