
Cloud Storage

TL;DR

Cloud storage (S3, GCS) is the industry standard for data exchange and interoperability. It is cost-efficient, serves as a data lake, and enables self-service analytics between data engineering and software engineering teams. Ingesting from cloud storage requires handling initial loads, event-driven updates, and the limitations of S3 list APIs (for example, paginated object listings).

Cloud storage (S3, GCS) provides durable, inexpensive, and highly available storage. Open table formats such as Apache Iceberg are cloud storage native, storing both metadata and data directly in cloud storage. ClickHouse has its own custom data format optimized for analytical queries (as described throughout this course) and can also store data in cloud storage.
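As a sketch of what this interoperability looks like in practice, ClickHouse can query files sitting in cloud storage directly with its `s3` table function, without loading them first. The bucket, path, and column names below are hypothetical:

```sql
-- Query a Parquet file in place, straight from cloud storage.
-- Bucket, path, and credentials are placeholders.
SELECT count(), avg(price)
FROM s3(
    'https://example-bucket.s3.amazonaws.com/sales/2024/data.parquet',
    'AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY',
    'Parquet'
);
```

Querying in place is convenient for ad-hoc exploration, but for repeated analytical queries it is usually faster to ingest the data into ClickHouse's native format.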

In this chapter, we focus on using cloud storage as a cost-efficient, interoperable layer for ingesting external data sources. Cloud storage serves as a central repository of raw data (a data lake) that enables data exchange between different services and teams, making it ideal for ingesting data from external sources into ClickHouse.
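A common ingestion pattern is to copy data from the raw cloud storage layer into a native MergeTree table with `INSERT ... SELECT`. A minimal sketch, assuming a public bucket (the table schema and bucket path are hypothetical):

```sql
-- Destination table in ClickHouse's native MergeTree format.
CREATE TABLE events
(
    event_time DateTime,
    user_id    UInt64,
    action     String
)
ENGINE = MergeTree
ORDER BY (user_id, event_time);

-- Initial load: read all raw Parquet files from the data lake into the table.
INSERT INTO events
SELECT event_time, user_id, action
FROM s3(
    'https://example-bucket.s3.amazonaws.com/raw/events/*.parquet',
    'Parquet'
);
```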

Batch Ingestion: Cloud Storage

Cloud storage ingestion falls under the batch (or "pseudo-streaming") ingestion type. This pattern involves large files stored in cloud storage or served over HTTP:

  • S3, GCS buckets
  • Large data files
  • Periodic dumps

Characteristics:

Tinybird is not affiliated with, associated with, or sponsored by ClickHouse, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc.

Cloud Storage | ClickHouse for Developers