Ingest from Google GCS

Intermediate

In this guide you'll learn how to automatically synchronize all the CSV files in a Google GCS bucket to a Tinybird Data Source.

Perform a one-off load

Often when you're building a new use case with Tinybird, you'll want to load historical data that comes from another system. This is usually called 'seeding' or 'backfilling'.

A very common pattern for exporting historical data is to dump CSV files into a Google Cloud Storage (GCS) bucket. Once you have your CSV files, you need to ingest them into Tinybird.

You can append these files to a Data Source in Tinybird using the Data Sources API.

Let's assume we have a set of CSV files in our GCS bucket:
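For example, listing the bucket with gsutil might show something like this (the bucket and file names here are illustrative):

gsutil ls gs://my-gcs-bucket/
gs://my-gcs-bucket/events_0.csv
gs://my-gcs-bucket/events_1.csv
gs://my-gcs-bucket/events_2.csv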

To ingest a single file, generate a signed URL in Google Cloud and send that URL to the Data Sources API using the append mode flag.

For example:

curl -H "Authorization: Bearer <your_auth_token>" \
   -X POST "https://api.tinybird.co/v0/datasources?name=<my_data_source_name>&mode=append" \
   --data-urlencode "url=<my_gcs_file_http_url>"
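
You can create the signed URL with gsutil signurl and a service account key. A minimal sketch (the key path, duration, and file name are illustrative; the last column of the command's output is the HTTPS URL to pass to the Data Sources API):

gsutil signurl -d 1h /path/to/service-account.json gs://my-gcs-bucket/events_0.csv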

However, if you want to send many files, you probably don't want to write each cURL command manually. Instead, you can use a simple script that iterates over the files in the bucket and generates the cURL commands automatically.

This script requires the gsutil tool and assumes you have already created your Tinybird Data Source.

You can use the gsutil tool to list the files in the bucket, extract the name of each CSV file, and create a signed URL for it. Then you can generate a cURL command to send each signed URL to Tinybird.

To avoid hitting API rate limits, add a delay of 15 seconds between requests.

Here's an example script in bash:
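The script below is a minimal sketch: it assumes the Data Source already exists and that the variables described after it are set for your account (the exact gsutil signurl flags may vary between versions):

#!/usr/bin/env bash
set -euo pipefail

# Tinybird API host and auth token
TB_HOST="https://api.tinybird.co"
TB_TOKEN="<your_auth_token>"
# GCS bucket containing the CSV files, and the target Data Source
BUCKET="gs://my-gcs-bucket"
DESTINATION_DATA_SOURCE="events"
# Service account key used by gsutil signurl, and the bucket's region
GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
REGION="europe-west3"

for file in $(gsutil ls "$BUCKET/*.csv"); do
  echo "Signing and appending $file"

  # Create a signed URL valid for 1 hour.
  # The signed URL is the last column of the last line of the output.
  SIGNED_URL=$(gsutil signurl -r "$REGION" -d 1h "$GOOGLE_APPLICATION_CREDENTIALS" "$file" | tail -n 1 | awk '{print $NF}')

  # Append the file to the Data Source through the Data Sources API
  curl -H "Authorization: Bearer $TB_TOKEN" \
    -X POST "$TB_HOST/v0/datasources?name=$DESTINATION_DATA_SOURCE&mode=append" \
    --data-urlencode "url=$SIGNED_URL"

  # Wait between requests to avoid hitting API rate limits
  sleep 15
done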

The script uses the following variables:

  • TB_HOST as the corresponding URL for your region: https://api.us-east.tinybird.co for US accounts and https://api.tinybird.co for accounts in the rest of the world.
  • TB_TOKEN as a Tinybird auth token with the DATASOURCE:CREATE or DATASOURCE:APPEND scope. See the Tokens API for more information.
  • BUCKET as the GCS URI of the bucket containing the events CSV files.
  • DESTINATION_DATA_SOURCE as the name of the Data Source in Tinybird, in this case events.
  • GOOGLE_APPLICATION_CREDENTIALS as the local path of a Google Cloud service account JSON file.
  • REGION as the Google Cloud region name.

Automatically sync files with Google Cloud Functions

The previous scenario covered a one-time dump of CSV files from a bucket into Tinybird. A different and more interesting scenario is appending to a Data Source each time a new CSV file is dropped into a bucket.

That way, your ETL process can export data from your Data Warehouse (such as Snowflake or BigQuery) or any other origin, and those files are synchronized to Tinybird without any manual work.

This can be achieved using Google Cloud Functions.

Imagine you have a GCS bucket named gs://automatic-ingestion-poc/ and each time you put a CSV there you want to sync it automatically to an events Data Source previously created in Tinybird.

This is how you do it:

  • Clone this GitHub repository
  • Install and configure the gcloud command line tool.
  • Run cp .env.yaml.sample .env.yaml and set the TB_HOST and TB_TOKEN variables.
  • Run:
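
The deploy step boils down to deploying a Cloud Function that is triggered when an object is created in the bucket. A sketch of the equivalent gcloud invocation (the function name, runtime, and bucket here are illustrative; the repository's own deploy command may differ):

TB_FUNCTION_NAME=gcs-to-tinybird
BUCKET_NAME=automatic-ingestion-poc

gcloud functions deploy "$TB_FUNCTION_NAME" \
  --runtime python39 \
  --trigger-resource "$BUCKET_NAME" \
  --trigger-event google.storage.object.finalize \
  --env-vars-file .env.yaml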

It deploys a Google Cloud Function named TB_FUNCTION_NAME to your Google Cloud account. The function listens for new files in the BUCKET_NAME provided (in our case automatic-ingestion-poc) and automatically appends them to the Tinybird Data Source described by the FILE_REGEXP environment variable.

Cloud function to sync a GCS bucket to Tinybird

Now you can drop CSV files into the configured bucket:
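For example, copying a file with gsutil is enough to trigger the function (the file name here follows the naming pattern described below):

gsutil cp events_20210125000000.csv gs://automatic-ingestion-poc/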

Drop files to a GCS bucket and check the datasources_ops_log

A nice pattern is to name the CSV files datasourcename_YYYYMMDDHHMMSS.csv so they are automatically appended to the datasourcename Data Source in Tinybird. For instance, events_20210125000000.csv will be appended to the events Data Source.
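
To confirm that each file was appended, you can inspect the datasources_ops_log Service Data Source, for example through the Query API. A sketch (adjust the host and token for your account):

curl -H "Authorization: Bearer <your_auth_token>" \
  -G "https://api.tinybird.co/v0/sql" \
  --data-urlencode "q=SELECT timestamp, datasource_name, event_type, result FROM tinybird.datasources_ops_log ORDER BY timestamp DESC LIMIT 10"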