
Cloud Storage
Cloud storage (S3, GCS) is the industry standard for data exchange and interoperability. It is well suited to cost optimization, serves as a data lake, and enables self-service analytics across data engineering and software engineering teams. Ingesting from cloud storage requires handling initial loads, event-driven updates, and the quirks of the S3 API.
Cloud storage (S3, GCS) provides durable, inexpensive, and highly available storage. Open table formats such as Apache Iceberg are cloud-storage native, storing both metadata and data directly in cloud storage. ClickHouse has its own custom data format optimized for analytical queries (as described throughout this course) and can also store data in cloud storage.
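ClickHouse can also query such open table formats in place. The following is a minimal sketch, assuming a ClickHouse version that ships the iceberg table function; the bucket URL and credentials are hypothetical placeholders:

```sql
-- Count the rows of an Apache Iceberg table stored entirely in S3.
-- The bucket URL and credentials below are hypothetical placeholders.
SELECT count()
FROM iceberg(
    'https://my-bucket.s3.amazonaws.com/warehouse/events/',
    'AWS_ACCESS_KEY_ID',
    'AWS_SECRET_ACCESS_KEY'
);
```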
In this chapter, we focus on cloud storage as a cost-efficient, interoperable layer for ingestion. It serves as a central repository of raw data (a data lake) that enables data exchange between services and teams, which makes it a natural source for loading external data into ClickHouse.
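The simplest way to tap that layer from ClickHouse is the s3 table function, which reads files in a bucket ad hoc, with no table definition required. A minimal sketch, assuming a hypothetical bucket URL, credentials, and column names:

```sql
-- Inspect a Parquet file in S3 directly; the schema is inferred.
-- The bucket URL, credentials, and columns are hypothetical placeholders.
SELECT user_id, event_time
FROM s3(
    'https://my-bucket.s3.amazonaws.com/raw/events.parquet',
    'AWS_ACCESS_KEY_ID',
    'AWS_SECRET_ACCESS_KEY',
    'Parquet'
)
LIMIT 10;
```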
Batch Ingestion: Cloud Storage
Cloud storage ingestion falls under the batch (or "pseudo-streaming") ingestion category. It typically involves large files sitting in cloud storage or behind HTTP servers (see the loading sketch after this list):
- S3, GCS buckets
- Large data files
- Periodic dumps
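A typical initial load pulls a whole set of dump files in one statement by using a glob pattern in the s3 table function. This is a minimal sketch, assuming a hypothetical bucket layout of daily Parquet dumps and a two-column schema:

```sql
-- Target table for the raw dumps.
CREATE TABLE events
(
    user_id    UInt64,
    event_time DateTime
)
ENGINE = MergeTree
ORDER BY (user_id, event_time);

-- Initial load: the glob pattern matches every daily dump at once.
-- Bucket path, credentials, and schema are hypothetical placeholders.
INSERT INTO events
SELECT user_id, event_time
FROM s3(
    'https://my-bucket.s3.amazonaws.com/dumps/2024-*.parquet',
    'AWS_ACCESS_KEY_ID',
    'AWS_SECRET_ACCESS_KEY',
    'Parquet'
);
```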
Characteristics: