S3 Connector

The S3 Connector allows you to ingest files from your Amazon S3 bucket into Tinybird. You can choose to load a full bucket, or to load files that match a pattern.

The S3 Connector is fully managed and requires no additional tooling. You can choose to execute the S3 Connector manually or automatically, and all scheduling is handled by Tinybird.

Supported file types

The S3 Connector supports the following file types:

File type | Accepted extensions | Compression formats supported
CSV | .csv, .csv.gz | gzip
NDJSON | .ndjson, .ndjson.gz | gzip
Parquet | .parquet, .parquet.gz | snappy, gzip, lzo, brotli, lz4, zstd
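
As a rough illustration, object keys can be checked against the accepted extensions before they are uploaded to the bucket. The sketch below is a hypothetical helper: the names ACCEPTED_EXTENSIONS and detect_file_type are not part of Tinybird, and it assumes that matching is case-sensitive and that every extension starts with a dot.

```python
from typing import Optional

# Accepted extensions per file type, taken from the table above.
# Assumption: every extension starts with a dot and matching is case-sensitive.
ACCEPTED_EXTENSIONS = {
    "CSV": (".csv", ".csv.gz"),
    "NDJSON": (".ndjson", ".ndjson.gz"),
    "Parquet": (".parquet", ".parquet.gz"),
}


def detect_file_type(object_key: str) -> Optional[str]:
    """Return the file type for an S3 object key, or None if unsupported."""
    for file_type, extensions in ACCEPTED_EXTENSIONS.items():
        # str.endswith accepts a tuple of candidate suffixes.
        if object_key.endswith(extensions):
            return file_type
    return None
```

For example, detect_file_type("events/part-0.csv.gz") returns "CSV", while a key ending in .txt returns None.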

Set up

The setup process can be done using the UI or the CLI. It involves configuring both Tinybird and AWS:

  1. Create a new Data Source in Tinybird
  2. Create the AWS S3 connection
  3. Configure the scheduling options and path/file names
  4. Start ingesting

Prerequisites

To use the Tinybird S3 Connector feature, you should be familiar with Amazon S3 buckets and have the necessary permissions to set up a new policy and role in AWS.

Required IAM permissions

As part of the setup process below, the S3 Connector requires certain permissions to access objects in your Amazon S3 bucket. The IAM Role needs the following permissions:

  • s3:GetObject
  • s3:ListBucket
  • s3:ListAllMyBuckets

An example AWS access policy looks like this (replace <bucket_name> with the name of your bucket):

Note: These policies are only examples. Get the actual JSON policies using the API, CLI, or UI, as explained below.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::<bucket_name>",
                "arn:aws:s3:::<bucket_name>/*"
            ],
            "Effect": "Allow"
        },
        {
            "Sid": "Statement1",
            "Effect": "Allow",
            "Action": [
                "s3:ListAllMyBuckets"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}
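
If you prefer to script the substitution of the bucket name, the example policy can be generated with a few lines of Python. This is only a sketch that reproduces the example above; the function name build_access_policy and the bucket name my-bucket are placeholders, and the policy Tinybird provides during setup remains the one to use.

```python
import json


def build_access_policy(bucket_name: str) -> dict:
    """Fill a bucket name into the example read policy shown above."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                # The bucket itself (for listing) and every object in it.
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",
                    f"arn:aws:s3:::{bucket_name}/*",
                ],
            },
            {
                "Sid": "Statement1",
                "Effect": "Allow",
                "Action": ["s3:ListAllMyBuckets"],
                "Resource": ["*"],
            },
        ],
    }


if __name__ == "__main__":
    # "my-bucket" is a placeholder bucket name.
    print(json.dumps(build_access_policy("my-bucket"), indent=4))
```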

And the trust policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "sts:AssumeRole",
            "Principal": {
                "AWS": "arn:aws:iam::473819111111111:root"
            },
            "Condition": {
                "StringEquals": {
                    "sts:ExternalId": "ab3caaaa-01aa-4b95-bad3-fff9b2ac789f8a9"
                }
            }
        }
    ]
}
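
Before pasting a trust policy into AWS, it can help to confirm that it has the same shape as the example above. The following sketch is a hypothetical check (check_trust_policy is not a Tinybird or AWS function), and it assumes each statement uses a single "sts:AssumeRole" action string and a StringEquals condition, exactly as in the example.

```python
import json


def check_trust_policy(policy_json: str) -> list:
    """Return a list of problems found in a trust policy document."""
    problems = []
    policy = json.loads(policy_json)
    statements = policy.get("Statement", [])
    if not statements:
        problems.append("policy has no statements")
    for index, statement in enumerate(statements):
        if statement.get("Effect") != "Allow":
            problems.append(f"statement {index}: Effect is not Allow")
        if statement.get("Action") != "sts:AssumeRole":
            problems.append(f"statement {index}: Action is not sts:AssumeRole")
        if "AWS" not in statement.get("Principal", {}):
            problems.append(f"statement {index}: no AWS principal")
        # The external ID ties the role to one specific connection.
        condition = statement.get("Condition", {}).get("StringEquals", {})
        if "sts:ExternalId" not in condition:
            problems.append(f"statement {index}: no sts:ExternalId condition")
    return problems
```

An empty list means the document matches the expected shape; it does not confirm that the principal or external ID values are the ones Tinybird generated for you.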

Load files from an S3 bucket using the UI

1. Create a new Data Source

Open the Tinybird UI and add a new Data Source. In the Data Sources modal, select the Amazon S3 option. Select "New Connection" and click "Next". Enter the bucket name, select the region, and click "Continue".

2. Create the AWS S3 connection

In the next screen, follow the 4-step instructions:

  1. Open the AWS console and navigate to IAM.
  2. Create and name the policy using the provided copyable option.
  3. Create and name the role with the trust policy using the provided copyable option.
  4. Select "Connect".

You’ll need the role’s ARN (Amazon Resource Name) to create the connection. To save yourself from having to come back and look for it, go to IAM > Roles and use the search box to find the role you just created. Select it to open the role details, including the role's ARN, and copy it somewhere you can easily find it again. It looks something like arn:aws:iam::111111111111:role/my-awesome-role.

Finally, paste in the connection name and the role's ARN.
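
A copied ARN is easy to truncate or mix up with a policy ARN, so a quick format check can save a failed connection attempt. The sketch below is illustrative only: parse_role_arn and RoleArn are hypothetical names, and the pattern assumes the standard "aws" partition and a 12-digit account ID.

```python
import re
from typing import NamedTuple


class RoleArn(NamedTuple):
    account_id: str
    role_name: str  # includes the role path, if the role has one


# arn:aws:iam::<12-digit account ID>:role/<role name>
# Assumption: the standard "aws" partition (not aws-cn or aws-us-gov).
_ROLE_ARN_PATTERN = re.compile(r"^arn:aws:iam::(\d{12}):role/(.+)$")


def parse_role_arn(arn: str) -> RoleArn:
    """Split an IAM role ARN into its account ID and role name."""
    match = _ROLE_ARN_PATTERN.match(arn.strip())
    if match is None:
        raise ValueError(f"not an IAM role ARN: {arn!r}")
    return RoleArn(account_id=match.group(1), role_name=match.group(2))
```

Calling parse_role_arn on the example ARN above yields the account ID 111111111111 and the role name my-awesome-role; a policy ARN or a truncated string raises ValueError.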

3. Choose data

Choose the data you wish to ingest and select "Next".

4. Preview and create

The next screen shows a preview of the incoming data. Here, you can review and modify any of the incoming columns: adjust their names, change their types, or delete them entirely. You can also configure the name of the Data Source. After reviewing your incoming data, select "Create Data Source".

You're done 🎉! On the Data Source details page, you can see the sync history in the tracker chart and the current status of the connection.

Load files from an S3 bucket using the CLI

You need to create a connection before you can load files from Amazon S3 into Tinybird using the CLI. Creating a connection grants your Tinybird Workspace the appropriate permissions to view files in Amazon S3.

Authenticate your CLI and switch to the desired Workspace.

Creating a connection for the S3 Connector with the following steps requires Tinybird CLI version 3.8.3 or higher.

Steps:

  1. Run the tb connection create s3_iamrole --policy read command. The --policy flag lets you switch between write (sink) and read (ingest) policies.
  2. To move to the next step, type y.
  3. Copy the suggested policy and replace the bucket placeholder <bucket> with your bucket name.
  4. In AWS, create a new policy in IAM > Policies (JSON) using the copied text.
  5. Go to the next step in the CLI and copy the next policy.
  6. In AWS, navigate to IAM > Roles and create a new role, using the text you copied as its custom trust policy. At the next step, attach the policy you created in step 4.
  7. Go to the next step in the CLI, which asks for the full ARN (Amazon Resource Name) of the role you just created. To find it, go to IAM > Roles and use the search box to find the role. Select it to open the role details, including the role's ARN. Copy it and paste it into the CLI when requested. It looks something like arn:aws:iam::111111111111:role/my-awesome-role.
  8. Enter the region of the bucket, such as us-east-1.
  9. Finally, provide a name for your connection in Tinybird.

A new s3_ingest.connection file will be created in your project files.

Note: At the moment, the .connection file is not used and cannot be pushed to Tinybird. It is safe to delete this file. A future release will allow you to push this file to Tinybird to automate creation of connections, similar to Kafka connections.

Now that your connection is created, you can create a Data Source to configure the import of files from Amazon S3.

The Amazon S3 import is configured using the following options, which can be added at the end of your .datasource file:

  • IMPORT_SERVICE: name of the import service to use, in this case, s3_iamrole.
  • IMPORT_SCHEDULE: either @auto to sync once per minute, or @on-demand to only execute manually (UTC).
  • IMPORT_STRATEGY: the strategy used to import data, only APPEND is supported.
  • IMPORT_BUCKET_URI: a full bucket path, including the s3:// protocol, bucket name, object path, and an optional pattern to match against object keys. For example, s3://my-bucket/my-path discovers all files in the bucket my-bucket under the prefix /my-path. You can use patterns in the path to filter objects. For example, ending the path with *.csv matches all objects that end with the .csv suffix.
  • IMPORT_CONNECTION_NAME: the name of the S3 connection to use.

Note: For IMPORT_STRATEGY only APPEND is supported today. When new files are discovered, data from new files will be appended to any previous data in the Data Source. Replacing data is not supported.
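
To get a feel for which object keys a given IMPORT_BUCKET_URI would select, the prefix and suffix examples above can be approximated locally. This is a sketch under stated assumptions, not Tinybird's matching logic: it treats a URI without wildcards as a plain key prefix and otherwise applies shell-style matching from Python's fnmatch module, where * also spans / characters. The function names are hypothetical.

```python
from fnmatch import fnmatchcase


def split_bucket_uri(uri: str) -> tuple:
    """Split an s3:// URI into (bucket, key_pattern)."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"expected an s3:// URI, got {uri!r}")
    bucket, _, key_pattern = uri[len(prefix):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name in {uri!r}")
    return bucket, key_pattern


def uri_matches_key(uri: str, object_key: str) -> bool:
    """Approximate whether an object key is selected by a bucket URI."""
    _, pattern = split_bucket_uri(uri)
    if not any(char in pattern for char in "*?["):
        # No wildcard: treat the path as a key prefix.
        return object_key.startswith(pattern)
    # Wildcards: shell-style, case-sensitive match on the whole key.
    return fnmatchcase(object_key, pattern)
```

With these assumptions, s3://my-bucket/my-path selects my-path/file.csv, and s3://my-bucket/*.csv selects data/2024/events.csv but not events.ndjson.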

For example:

s3.datasource file
DESCRIPTION >
    Analytics events landing data source

SCHEMA >
    `timestamp` DateTime `json:$.timestamp`,
    `session_id` String `json:$.session_id`,
    `action` LowCardinality(String) `json:$.action`,
    `version` LowCardinality(String) `json:$.version`,
    `payload` String `json:$.payload`

ENGINE "MergeTree"
ENGINE_PARTITION_KEY "toYYYYMM(timestamp)"
ENGINE_SORTING_KEY "timestamp"
ENGINE_TTL "timestamp + toIntervalDay(60)"

IMPORT_SERVICE s3_iamrole
IMPORT_CONNECTION_NAME connection_name
IMPORT_BUCKET_URI s3://my-bucket/*.ndjson
IMPORT_SCHEDULE @auto
IMPORT_STRATEGY APPEND
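
Before pushing, the import settings in a .datasource file can be sanity-checked against the option descriptions above. The sketch below is a hypothetical local check, not part of the Tinybird CLI, and it assumes all five IMPORT_ options are required and written one per line as "NAME value".

```python
REQUIRED_SETTINGS = (
    "IMPORT_SERVICE",
    "IMPORT_CONNECTION_NAME",
    "IMPORT_BUCKET_URI",
    "IMPORT_SCHEDULE",
    "IMPORT_STRATEGY",
)


def parse_import_settings(datasource_text: str) -> dict:
    """Collect the IMPORT_* lines of a .datasource file into a dict."""
    settings = {}
    for line in datasource_text.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) == 2 and parts[0].startswith("IMPORT_"):
            settings[parts[0]] = parts[1].strip()
    return settings


def check_import_settings(settings: dict) -> list:
    """Return a list of problems; an empty list means nothing was flagged."""
    problems = [f"missing {name}" for name in REQUIRED_SETTINGS if name not in settings]
    # Value checks only apply to settings that are present.
    if settings.get("IMPORT_SERVICE", "s3_iamrole") != "s3_iamrole":
        problems.append("IMPORT_SERVICE should be s3_iamrole")
    if settings.get("IMPORT_SCHEDULE", "@auto") not in ("@auto", "@on-demand"):
        problems.append("IMPORT_SCHEDULE must be @auto or @on-demand")
    if settings.get("IMPORT_STRATEGY", "APPEND") != "APPEND":
        problems.append("IMPORT_STRATEGY must be APPEND")
    if not settings.get("IMPORT_BUCKET_URI", "s3://").startswith("s3://"):
        problems.append("IMPORT_BUCKET_URI must start with s3://")
    return problems
```

Reading the file with open("s3.datasource").read() and passing the result through both functions returns an empty list for the example above.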

With your connection created and Data Source defined, you can now push your project to Tinybird using:

tb push

Schema evolution

The S3 Connector supports adding new columns to the schema of the Data Source via the CLI.

Changes that are not backwards compatible, such as dropping, renaming, or changing the type of columns, are not supported. Rows from files that no longer match the schema are sent to the quarantine Data Source.

At the moment, to make this kind of change to an S3 Data Source, you need to create a new Data Source with the new schema.

Note: When new files are discovered, data from new files is appended to any previous data in the Data Source. Replacing data is not supported.

Iterating an S3 Data Source

If you're using Branches, take into account that connections can only be created in the main Workspace and through the CLI (UI support for this is currently limited).

First, create the connection:

tb auth # use the main Workspace admin Token
tb connection create s3_iamrole

Then, to iterate an S3 Data Source through a Branch, create the Data Source using a connection that already exists. The S3 Connector doesn't ingest any data in the Branch, because it is not configured to work in Branches. To test it in CI, you can append the files directly to the Data Source.

After you've merged it and are running CD checks, run tb datasource sync <datasource_name> to force the sync in the main Workspace.

Limits

There are some limits applied to the S3 Connector when using the auto mode:

  • Automatic imports run once every minute.
  • Each run imports at most 5 files. If there are more than 5 new files, the remaining files are left for the next run.

If you are regularly exceeding 5 files per minute, this limit can be adjusted. Contact us in our Slack community or email us at support@tinybird.co.

When using on-demand, these limits do not apply. A manual execution of the S3 Connector syncs all new files available since the last run.
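
The auto-mode limits translate into simple arithmetic for how long a backlog takes to clear. The sketch below assumes the default limits stated above (5 files per run, one run per minute) and that no new files arrive in the meantime; minutes_to_drain is a hypothetical helper name.

```python
import math

FILES_PER_RUN = 5         # default limit per automatic run
RUN_INTERVAL_MINUTES = 1  # automatic imports run once every minute


def minutes_to_drain(backlog_files: int, files_per_run: int = FILES_PER_RUN) -> int:
    """Minutes of automatic runs needed to import a backlog of new files."""
    if backlog_files < 0:
        raise ValueError("backlog_files must not be negative")
    runs = math.ceil(backlog_files / files_per_run)
    return runs * RUN_INTERVAL_MINUTES
```

For instance, a backlog of 12 files needs 3 runs (5 + 5 + 2), so roughly 3 minutes in auto mode.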

There's also a limit for the maximum file size per type:

File type | Max file size
CSV | 10 GB for the Free plan, 32 GB for Pro and Enterprise
NDJSON | 10 GB for the Free plan, 32 GB for Pro and Enterprise
Parquet | 1 GB
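
These size limits can be checked locally before a file is placed in the bucket. The following is a sketch only: the lookup table mirrors the limits above, the plan keys and function name are hypothetical, and it assumes decimal gigabytes (10^9 bytes), which the limits above do not specify.

```python
GB = 10**9  # assumption: decimal gigabytes

# Maximum file size in bytes per file type and plan, from the table above.
MAX_FILE_SIZE_BYTES = {
    "CSV": {"free": 10 * GB, "pro": 32 * GB, "enterprise": 32 * GB},
    "NDJSON": {"free": 10 * GB, "pro": 32 * GB, "enterprise": 32 * GB},
    "Parquet": {"free": 1 * GB, "pro": 1 * GB, "enterprise": 1 * GB},
}


def fits_size_limit(file_type: str, plan: str, size_bytes: int) -> bool:
    """Return True if a file of this type and size is within the plan's limit."""
    return size_bytes <= MAX_FILE_SIZE_BYTES[file_type][plan]
```

Under these assumptions, an 11 GB CSV file is too large for the Free plan but fits within the Pro limit, and a 2 GB Parquet file exceeds the limit on every plan.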

Check the limits page for limits on ingestion, queries, API Endpoints, and more.