
Cloud Storage

TL;DR

Cloud storage (S3, GCS) is the industry standard for data exchange and interoperability. It is cost-efficient, serves as a data lake, and enables self-service analytics between data engineering and software engineering teams. Ingesting from cloud storage requires handling initial loads, event-driven updates, and the limitations of S3 list APIs (for example, paginated object listings).

Cloud storage (S3, GCS) provides durable, inexpensive, and highly available storage. Open table formats such as Apache Iceberg are cloud storage native, storing both metadata and data directly in cloud storage. ClickHouse has its own custom data format optimized for analytical queries (as described throughout this course) and can also store data in cloud storage.
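As a sketch of what this interoperability looks like in practice, ClickHouse can query files sitting in cloud storage directly with its `s3` table function, without loading them first. The bucket, path, and column names below are hypothetical:

```sql
-- Query a Parquet file in place, straight from cloud storage.
-- Bucket, path, and credentials are placeholders.
SELECT count(), avg(price)
FROM s3(
    'https://example-bucket.s3.amazonaws.com/sales/2024/data.parquet',
    'AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY',
    'Parquet'
);
```

Querying in place is convenient for ad-hoc exploration, but for repeated analytical queries it is usually faster to ingest the data into ClickHouse's native format.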

In this chapter, we focus on using cloud storage as a cost-efficient, interoperable layer for ingesting external data sources. Cloud storage serves as a central repository of raw data (a data lake) that enables data exchange between different services and teams, making it ideal for ingesting data from external sources into ClickHouse.
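A common ingestion pattern is to copy data from the raw cloud storage layer into a native MergeTree table with `INSERT ... SELECT`. A minimal sketch, assuming a public bucket (the table schema and bucket path are hypothetical):

```sql
-- Destination table in ClickHouse's native MergeTree format.
CREATE TABLE events
(
    event_time DateTime,
    user_id    UInt64,
    action     String
)
ENGINE = MergeTree
ORDER BY (user_id, event_time);

-- Initial load: read all raw Parquet files from the data lake into the table.
INSERT INTO events
SELECT event_time, user_id, action
FROM s3(
    'https://example-bucket.s3.amazonaws.com/raw/events/*.parquet',
    'Parquet'
);
```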

Batch Ingestion: Cloud Storage

Cloud storage ingestion falls under the batch (or "pseudo-streaming") ingestion type. This pattern involves large files stored in cloud storage or served over HTTP:

  • S3, GCS buckets
  • Large data files
  • Periodic dumps

Characteristics:

Tinybird is not affiliated with, associated with, or sponsored by ClickHouse, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc.

Cloud Storage | ClickHouse for Developers