Syncing data from S3 or GCS buckets

Intermediate

In this guide you’ll learn how to automatically synchronize all the CSV files in an AWS S3 or Google Cloud Storage bucket to a Tinybird Data Source.

One-time dump

Sometimes you have a bunch of CSV files in a bucket that you want to append to one or more Tinybird Data Sources just once. This scenario can easily be scripted using bash, the AWS or GCS command line tools, and the Tinybird Data Sources API.

In the intro to ingesting data guide, you saw how to programmatically ingest data into the events Data Source in Tinybird.

We have this set of events files in a bucket:
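For illustration, suppose the bucket contains files named like this (your file names will differ):

```
events_0.csv
events_1.csv
events_2.csv
```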

Let’s see how to ingest them into a single Data Source.

From an S3 bucket

You can use a bash script like the one shown below. First, set these variables:

  • TB_HOST as the API URL for your region. We currently have https://api.us-east.tinybird.co for US accounts and https://api.tinybird.co for accounts in the rest of the world.

  • TB_TOKEN as a Tinybird Auth Token with the DATASOURCE:CREATE or DATASOURCE:APPEND scope. See the Token API for more information.

  • BUCKET as the S3 URI of the bucket containing the events CSV files.

  • DESTINATION_DATA_SOURCE as the name of the Data Source in Tinybird, in this case events.

It uses the aws CLI to list the files in the bucket, extract the name of each CSV file, create a signed URL, and append it to the DESTINATION_DATA_SOURCE.

To avoid hitting API rate limits, it sleeps for 15 seconds between requests.

Putting it all together:
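A minimal sketch, assuming the four variables above, CSV files at the root of the bucket, and presigned URLs that only need to live for an hour (adapt the expiry, paths, and error handling to your setup):

```bash
#!/usr/bin/env bash
# Sketch: append every CSV file in an S3 bucket to a Tinybird Data Source.
# Adjust the variables below to your own setup.

TB_HOST="https://api.tinybird.co"
TB_TOKEN="<token-with-DATASOURCE:APPEND-scope>"
BUCKET="s3://your-bucket"
DESTINATION_DATA_SOURCE="events"

# List the objects in the bucket and keep only the CSV file names
for file in $(aws s3 ls "$BUCKET/" | awk '{print $4}' | grep '\.csv$'); do
  # Create a signed URL (valid for 1 hour) so Tinybird can fetch the file
  signed_url=$(aws s3 presign "$BUCKET/$file" --expires-in 3600)

  # Append the file to the destination Data Source through the Data Sources API
  curl -s -X POST \
    -H "Authorization: Bearer $TB_TOKEN" \
    "$TB_HOST/v0/datasources?name=$DESTINATION_DATA_SOURCE&mode=append" \
    --data-urlencode "url=$signed_url"

  # Throttle requests to avoid hitting API rate limits
  sleep 15
done
```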

This script requires the AWS command line tool to be installed and configured, and the DESTINATION_DATA_SOURCE to have been previously created in your Tinybird account.

From a Google Cloud Storage bucket

You can use a bash script like the one shown below. First, set these variables:

  • TB_HOST as the API URL for your region. We currently have https://api.us-east.tinybird.co for US accounts and https://api.tinybird.co for accounts in the rest of the world.

  • TB_TOKEN as a Tinybird Auth Token with the DATASOURCE:CREATE or DATASOURCE:APPEND scope. See the Token API for more information.

  • BUCKET as the GCS URI of the bucket containing the events CSV files.

  • DESTINATION_DATA_SOURCE as the name of the Data Source in Tinybird, in this case events.

  • GOOGLE_APPLICATION_CREDENTIALS as the local path of a Google Cloud service account JSON file.

  • REGION as the Google Cloud region name.

It uses the gsutil CLI to list the files in the bucket, extract the name of each CSV file, create a signed URL, and append it to the DESTINATION_DATA_SOURCE.

To avoid hitting API rate limits, it sleeps for 15 seconds between requests.

Again, putting it all together:
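A minimal sketch, assuming the variables above, that gsutil can sign URLs with the given service account key, and that one-hour signed URLs are enough (the signurl output parsing is intentionally simplistic):

```bash
#!/usr/bin/env bash
# Sketch: append every CSV file in a GCS bucket to a Tinybird Data Source.
# Adjust the variables below to your own setup.

TB_HOST="https://api.tinybird.co"
TB_TOKEN="<token-with-DATASOURCE:APPEND-scope>"
BUCKET="gs://your-bucket"
DESTINATION_DATA_SOURCE="events"
GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
REGION="europe-west3"

# List every CSV file in the bucket
for uri in $(gsutil ls "$BUCKET/*.csv"); do
  # Create a signed URL (valid for 1 hour) so Tinybird can fetch the file.
  # signurl prints a small table; the signed URL is the last field of the last line.
  signed_url=$(gsutil signurl -d 1h -r "$REGION" "$GOOGLE_APPLICATION_CREDENTIALS" "$uri" | tail -n 1 | awk '{print $NF}')

  # Append the file to the destination Data Source through the Data Sources API
  curl -s -X POST \
    -H "Authorization: Bearer $TB_TOKEN" \
    "$TB_HOST/v0/datasources?name=$DESTINATION_DATA_SOURCE&mode=append" \
    --data-urlencode "url=$signed_url"

  # Throttle requests to avoid hitting API rate limits
  sleep 15
done
```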

This script requires the gsutil command line tool to be installed and configured, and the DESTINATION_DATA_SOURCE to have been previously created in your Tinybird account.

Syncing CSV files automatically with cloud functions

The scenario described above is a one-time dump of the CSV files in a bucket to Tinybird. A different, more interesting scenario is appending to a Data Source each time a new CSV file is dropped into a bucket.

That way your ETL process can export data from your Data Warehouse (such as Snowflake or BigQuery) or any other origin, and you can forget about synchronizing those files to Tinybird.

This can be achieved using AWS Lambda functions or Google Cloud Functions.

Syncing CSV files from S3 to Tinybird with Lambda functions

Imagine you have an S3 bucket named s3://automatic-ingestion-poc/ and each time you put a CSV there you want to sync it automatically to an events Data Source previously created in Tinybird. This is how you do it:

  • Clone this GitHub repository

  • Install the AWS CLI and run aws configure. Provide the region as well, since the script requires it.

  • Now run cp .env_sample .env and set the TB_HOST, TB_TOKEN, and S3_BUCKET variables (in our case, S3_BUCKET=automatic-ingestion-poc); see the example after this list.

  • Run the ./run.sh script. It deploys a Lambda function named TB_FUNCTION_NAME to your AWS account, which listens for new files in the S3_BUCKET and automatically appends them to the Tinybird Data Source described by the FILE_REGEXP environment variable.
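For reference, the .env ends up with values along these lines (the token is a placeholder; use your own):

```bash
# .env (placeholder values)
TB_HOST=https://api.tinybird.co
TB_TOKEN=<your_tinybird_token>
S3_BUCKET=automatic-ingestion-poc
```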

It creates a Lambda function in your AWS account:

Lambda function to sync an S3 bucket to Tinybird

Each time you drop a new CSV file, it is automatically ingested into the destination Data Source. In this case, the Lambda function is configured to expect CSV files that follow the naming convention {DESTINATION_DATA_SOURCE}_**.csv. For instance, if you drop a file named events_0.csv, it'll be appended to the events Data Source. You can modify this behaviour with the FILE_REGEXP environment variable or by modifying the code of the Lambda function.
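For example, an upload that follows that convention (assuming the bucket and file names from above):

```bash
# The "events_" prefix routes the file to the "events" Data Source
aws s3 cp events_0.csv s3://automatic-ingestion-poc/events_0.csv
```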

Drop files to an S3 bucket and check the datasources_ops_log
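One way to check the datasources_ops_log outside the UI is through the Query API; a minimal sketch, assuming your Token is allowed to query service Data Sources:

```bash
# List the most recent operations on your Data Sources
curl -s -G "https://api.tinybird.co/v0/sql" \
  -H "Authorization: Bearer $TB_TOKEN" \
  --data-urlencode "q=SELECT * FROM tinybird.datasources_ops_log ORDER BY timestamp DESC LIMIT 10"
```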

Syncing CSV files from GCS to Tinybird with cloud functions

Imagine you have a GCS bucket named gs://automatic-ingestion-poc/ and each time you put a CSV there you want to sync it automatically to an events Data Source previously created in Tinybird.

This is how you do it:

  • Clone this GitHub repository

  • Install and configure the gcloud command line tool.

  • Now run cp .env.yaml.sample .env.yaml and set the TB_HOST and TB_TOKEN variables

  • Run:
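The exact command may differ; conceptually it is a gcloud functions deploy with a Cloud Storage trigger, along these lines (function name, runtime, and region below are placeholders):

```bash
# Deploy a Cloud Function that fires whenever an object is created in the bucket
gcloud functions deploy "$TB_FUNCTION_NAME" \
  --runtime python39 \
  --region europe-west3 \
  --trigger-resource automatic-ingestion-poc \
  --trigger-event google.storage.object.finalize \
  --env-vars-file .env.yaml
```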

It deploys a Google Cloud Function named TB_FUNCTION_NAME to your Google Cloud account, which listens for new files in the BUCKET_NAME provided (in our case, automatic-ingestion-poc) and automatically appends them to the Tinybird Data Source described by the FILE_REGEXP environment variable.

Cloud function to sync a GCS bucket to Tinybird

Now you can drop CSV files into the configured bucket:

Drop files to a GCS bucket and check the datasources_ops_log

A nice pattern is naming the CSV files datasourcename_YYYYMMDDHHMMSS.csv so they are automatically appended to the datasourcename Data Source in Tinybird. For instance, events_20210125000000.csv will be appended to the events Data Source.
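For example, a hypothetical export uploaded with a timestamped name:

```bash
# The timestamped name routes the file to the "events" Data Source
gsutil cp export.csv "gs://automatic-ingestion-poc/events_$(date -u +%Y%m%d%H%M%S).csv"
```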