CANNOT_READ_ALL_DATA ClickHouse error

This error occurs when ClickHouse (and, by extension, Tinybird) cannot read all the data from a source during an import, processing, or transfer operation. It typically points to a corrupted data source, an incomplete transfer, a malformed data format, or a data stream that was interrupted unexpectedly.

What causes this error

You'll typically see it when:

  • The data source is corrupted or incomplete
  • The data transfer is interrupted, for example by a network failure
  • The data format is malformed or changes unexpectedly mid-import
  • ClickHouse lacks permission to read the data
  • The data source becomes unavailable while it's being read
  • The file system has errors
  • Memory constraints interrupt data processing

This error often indicates data source issues rather than ClickHouse problems. Check the data source integrity and connectivity.
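
If the source is remote, a quick reachability check before importing can rule out connectivity problems. A minimal Python sketch; the URL is a hypothetical example:

Connectivity check
import requests

# Hypothetical remote source feeding the import
url = 'https://example.com/data/events.csv'

# Some servers reject HEAD; fall back to a small ranged GET if needed
resp = requests.head(url, timeout=10)
resp.raise_for_status()
print('Reachable, Content-Length:', resp.headers.get('Content-Length'))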

Example errors

Fails: corrupted data source
INSERT INTO events FROM INFILE '/path/to/corrupted_file.csv' FORMAT CSV
-- Error: CANNOT_READ_ALL_DATA
Fails: incomplete data transfer
-- When the data source becomes unavailable mid-read
INSERT INTO users FROM INFILE '/path/to/remote_data.csv' FORMAT CSV
-- Error: CANNOT_READ_ALL_DATA
Fails: malformed data format
INSERT INTO metrics FROM INFILE '/path/to/malformed_data.json' FORMAT JSONEachRow
-- Error: CANNOT_READ_ALL_DATA
Fails: interrupted data stream
-- When streaming data is interrupted
INSERT INTO logs FROM INFILE '/path/to/streaming_data.csv' FORMAT CSV
-- Error: CANNOT_READ_ALL_DATA

How to fix it

Check data source integrity

Verify the data source is complete and accessible:

Check data source
# From a shell (Linux):
#   ls -la /path/to/your_data_file
#   file /path/to/your_data_file
#   md5sum /path/to/your_data_file

# Or from Python:
import os

file_path = '/path/to/your_data_file'
if os.path.exists(file_path):
    print(f"File exists, size: {os.path.getsize(file_path)} bytes")
else:
    print("File does not exist")

Verify data format

Check if the data format is correct:

Verify format
# Validate the CSV structure before importing
import csv

try:
    with open('/path/to/your_file.csv', newline='') as f:
        reader = csv.reader(f)
        header = next(reader)
        print(f"Header: {header}")
        for i, row in enumerate(reader):
            if i >= 5:  # inspect only the first few rows
                break
            print(f"Row {i}: {row}")
except Exception as e:
    print(f"Format error: {e}")

Check permissions

Verify access permissions to the data source:

Check permissions
# From a shell (Linux):
#   ls -la /path/to/your_data_file

# Or from Python:
import os

file_path = '/path/to/your_data_file'
if os.access(file_path, os.R_OK):
    print("File is readable")
else:
    print("File is not readable")

Retry the operation

Attempt the data reading operation again:

Retry operation
-- Try reading the data again
INSERT INTO events FROM INFILE '/path/to/your_data_file.csv'
SETTINGS
    input_format_allow_errors_num = 10,
    input_format_allow_errors_ratio = 0.1
FORMAT CSV
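
If the failure is transient (a network blip, or a source that is briefly unavailable), wrapping the import in simple retry logic with backoff often resolves it. A minimal sketch; run_import is a hypothetical function that executes the INSERT above:

Retry with backoff
import time

def retry(operation, attempts=3, backoff_seconds=2.0):
    """Call operation(), retrying with exponential backoff on failure."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as e:
            if attempt == attempts:
                raise  # give up after the last attempt
            wait = backoff_seconds * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({e}); retrying in {wait:.0f}s")
            time.sleep(wait)

# Usage:
# retry(lambda: run_import('/path/to/your_data_file.csv'))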

Common patterns and solutions

Data source validation

Implement validation before processing:

Data validation
# Validate a data source before processing
import os

def validate_data_source(file_path):
    """Return (is_valid, message) for a local data file."""
    try:
        if not os.path.exists(file_path):
            return False, "File does not exist"

        if os.path.getsize(file_path) == 0:
            return False, "File is empty"

        if not os.access(file_path, os.R_OK):
            return False, "File is not readable"

        validate_file_format(file_path)  # format-specific check, defined elsewhere

        return True, "Data source is valid"
    except Exception as e:
        return False, f"Validation error: {e}"

Error-tolerant import

Use error-tolerant import settings:

Error-tolerant import
-- Import with error tolerance
INSERT INTO events FROM INFILE '/path/to/your_data_file.csv'
SETTINGS
    input_format_allow_errors_num = 100,        -- Allow up to 100 parse errors
    input_format_allow_errors_ratio = 0.05,     -- Allow up to 5% of rows to fail
    input_format_skip_unknown_fields = 1,       -- Skip unknown fields
    input_format_null_as_default = 1            -- Use column defaults for NULLs
FORMAT CSV

Incremental processing

Process data in smaller chunks:

Incremental processing
# Process a large file in batches rather than in one read
def process_in_chunks(file_path, chunk_size=10000):
    with open(file_path) as f:
        f.readline()  # skip header

        chunk = []
        for line in f:
            chunk.append(line)
            if len(chunk) >= chunk_size:
                process_chunk(chunk)  # insert this batch, defined elsewhere
                chunk = []

        if chunk:  # flush the remaining lines
            process_chunk(chunk)

Data source monitoring

Monitor data source health:

Data source monitoring
# Track basic health signals for a file-based data source
import os
import time

class DataSourceMonitor:
    def __init__(self, data_source_path):
        self.data_source_path = data_source_path
        self.last_check = None
        self.last_size = 0

    def check_health(self):
        """Return (is_healthy, message); flags missing, empty, or shrinking files."""
        try:
            if not os.path.exists(self.data_source_path):
                return False, "Data source not found"

            current_size = os.path.getsize(self.data_source_path)
            if current_size == 0:
                return False, "Data source is empty"

            if self.last_size > 0 and current_size < self.last_size:
                return False, "Data source size decreased"

            self.last_size = current_size
            self.last_check = time.time()

            return True, "Data source is healthy"
        except OSError as e:
            return False, f"Health check failed: {e}"

Tinybird-specific notes

In Tinybird, CANNOT_READ_ALL_DATA errors often occur when:

  • Data Source imports are interrupted
  • External data sources become unavailable
  • Pipe transformations encounter data issues
  • Data streaming operations are interrupted
  • External API endpoints fail

To debug in Tinybird:

  1. Check Data Source import status
  2. Verify external data source connectivity
  3. Review Pipe transformation logs
  4. Check external API endpoint health

In Tinybird, use the Data Source preview to validate data before processing it in Pipes.
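
To check import status programmatically, you can query the Jobs API. A hedged sketch, assuming an auth token in the TB_TOKEN environment variable; the host and response fields may vary by region and API version:

Check import jobs
import os
import requests

TB_HOST = 'https://api.tinybird.co'  # adjust for your region
token = os.environ['TB_TOKEN']

resp = requests.get(f'{TB_HOST}/v0/jobs',
                    headers={'Authorization': f'Bearer {token}'})
resp.raise_for_status()

for job in resp.json().get('jobs', []):
    # Failed import jobs often point at interrupted Data Source ingests
    print(job.get('kind'), job.get('status'), job.get('id'))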

Best practices

Data source management

  • Validate data sources before processing
  • Implement data source health monitoring
  • Use error-tolerant import settings
  • Monitor data source availability

Error handling

  • Implement retry logic for failed operations
  • Log data source issues for debugging
  • Provide clear error messages
  • Handle partial data failures gracefully

Data quality

  • Validate data format before processing
  • Check for data corruption indicators (see the sketch after this list)
  • Implement data quality checks
  • Monitor import success rates
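
One cheap corruption indicator for CSV sources is a row whose column count differs from the header's. A minimal sketch of such a check:

Column-count check
import csv

def column_count_is_consistent(file_path):
    """Return True if every CSV row is as wide as the header."""
    with open(file_path, newline='') as f:
        reader = csv.reader(f)
        width = len(next(reader))  # header width; raises on an empty file
        return all(len(row) == width for row in reader)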

Configuration options

Import settings

Import configuration
-- Configure import behavior
SET input_format_allow_errors_num = 100;
SET input_format_allow_errors_ratio = 0.05;
SET input_format_skip_unknown_fields = 1;
SET input_format_null_as_default = 1;

Data processing settings

Data processing configuration
-- Configure data processing
SET max_insert_block_size = 1000000;
SET min_insert_block_size_rows = 1000;
SET min_insert_block_size_bytes = 1000000;

Error handling settings

Error handling configuration
-- Record parse errors instead of failing the whole import
SET input_format_record_errors_file_path = '/path/to/parse_errors.txt';
SET errors_output_format = 'CSV';

Alternative solutions

Use streaming import

Import data in streaming mode:

Streaming import
-- Use streaming import for large data sources
INSERT INTO events FROM INFILE '/path/to/your_data_file.csv'
SETTINGS
    input_format_parallel_parsing = 1,
    max_insert_block_size = 100000,
    min_insert_block_size_rows = 1000
FORMAT CSV

Implement data source fallback

Use fallback data sources:

Data source fallback
# Try the primary source first, then each fallback in order
def get_data_with_fallback(primary_source, fallback_sources):
    try:
        return read_data_source(primary_source)  # defined elsewhere
    except CannotReadAllData:  # your wrapper for this error
        for fallback_source in fallback_sources:
            try:
                return read_data_source(fallback_source)
            except CannotReadAllData:
                continue

        raise RuntimeError("All data sources failed")

Use data validation

Implement data validation before processing:

Data validation
# Validate the source, then process it; returns True on success
import logging

logger = logging.getLogger(__name__)

def validate_and_process_data(file_path):
    # validate_data_source() is defined in the validation example above
    is_valid, message = validate_data_source(file_path)
    if not is_valid:
        logger.error(f"Data source validation failed: {message}")
        return False

    try:
        process_data_file(file_path)  # defined elsewhere
        return True
    except CannotReadAllData as e:  # your wrapper for this error
        logger.error(f"Data processing failed: {e}")
        return False

Monitoring and prevention

Data source monitoring

Data source tracking
# Log every data source operation and record it as a metric
import logging

logger = logging.getLogger(__name__)

def track_data_source_operation(operation, data_source, status, error=None):
    logger.info(f"Data source operation: {operation} on {data_source}")
    logger.info(f"Status: {status}")

    if error:
        logger.error(f"Data source error: {error}")

    # increment_counter() is your metrics client, defined elsewhere
    increment_counter('data_source_operations', {
        'operation': operation,
        'data_source': data_source,
        'status': status,
        'error': error,
    })

Data quality monitoring

Data quality tracking
# Score data sources on completeness, consistency, and validity
import time

class DataQualityMonitor:
    def __init__(self):
        self.quality_checks = []

    def check_data_quality(self, data_source):
        try:
            # The three check_* helpers return scores in [0, 1]; defined elsewhere
            completeness = self.check_completeness(data_source)
            consistency = self.check_consistency(data_source)
            validity = self.check_validity(data_source)

            quality_status = {
                'data_source': data_source,
                'completeness': completeness,
                'consistency': consistency,
                'validity': validity,
                'quality_score': (completeness + consistency + validity) / 3,
                'timestamp': time.time(),
            }
        except Exception as e:
            quality_status = {
                'data_source': data_source,
                'error': str(e),
                'timestamp': time.time(),
            }

        self.quality_checks.append(quality_status)
        return quality_status

Proactive monitoring

Proactive monitoring
# Alert on unhealthy or empty data sources; helpers are defined elsewhere
def monitor_data_sources():
    for data_source in get_data_source_list():
        try:
            is_healthy, message = check_data_source_health(data_source)
            if not is_healthy:
                send_alert(f"Data source {data_source} is unhealthy: {message}")

            if get_data_source_size(data_source) == 0:
                send_alert(f"Data source {data_source} is empty")

        except Exception as e:
            send_alert(f"Failed to monitor data source {data_source}: {e}")
