S3 Connector

The S3 Connector allows you to ingest files from your Amazon S3 bucket into Tinybird.

You can choose to load a full bucket, or to load files that match a pattern.

The S3 Connector is fully managed and requires no additional tooling. You can choose to execute the S3 Connector manually or automatically, and all scheduling is handled by Tinybird.

Loading files from an S3 bucket

Loading files from an S3 bucket in the UI

Open the Tinybird UI and add a new Data Source by clicking the + icon next to the Data Sources section in the left-hand navigation bar (see Mark 1 below).

../_images/ingest-s3-connector-sync-first-table-ui-1.png

You will see the Data Sources modal appear. In the modal, click on the Amazon S3 box (see Mark 1 below).

../_images/ingest-s3-connector-sync-first-table-ui-2.png

On the next screen, enter your AWS details. You will need to generate Access Keys for an AWS IAM user; each Access Key has an Access Key ID and a Secret Access Key. Paste your Access Key ID into the Access Key ID box (see Mark 1 below) and your Secret Access Key into the Secret Access Key box (see Mark 2 below). Lastly, select the AWS region in which your S3 bucket is located (see Mark 3 below). Click Connect (see Mark 4 below) when done.

../_images/ingest-s3-connector-sync-first-table-ui-3.png

The next screen will show you a summary of your connection. Here, you can give your connection a memorable name to identify it in the Tinybird UI (see Mark 1 below). After entering a name, click Next (see Mark 2 below).

../_images/ingest-s3-connector-sync-first-table-ui-4.png

Now you can configure the bucket path used to discover files in S3. In the Bucket path text box (see Mark 1 below), enter a full bucket path, including the s3:// protocol, bucket name, object path, and an optional pattern to match against object keys. For example, s3://my-bucket/my-path discovers all files in the bucket my-bucket under the prefix /my-path. You can use patterns in the path to filter objects; for example, ending the path with *.csv matches all objects that end with the .csv suffix. Click the Preview button (see Mark 2 below) to run a test discovery and review the list of files that are returned. When you're done, click Next (see Mark 3 below).

../_images/ingest-s3-connector-sync-first-table-ui-5.png
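The pattern in the bucket path behaves like shell-style globbing on object keys. As a rough illustration only (not Tinybird's actual implementation), Python's fnmatch module shows how a *.csv suffix pattern filters hypothetical keys under the my-path prefix:

```python
from fnmatch import fnmatch

# Hypothetical object keys under the prefix my-path/
keys = [
    "my-path/events-2023-01.csv",
    "my-path/events-2023-02.csv",
    "my-path/readme.txt",
]

# A bucket path like s3://my-bucket/my-path/*.csv keeps only the .csv objects
matched = [k for k in keys if fnmatch(k, "my-path/*.csv")]
print(matched)  # only the two .csv keys remain
```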

On the next screen, select how frequently Tinybird scans for new files. Select Auto (see Mark 1 below) to scan for new files automatically once every minute, or On demand (see Mark 2 below) to scan only when manually executed. Click Next (see Mark 3 below).

Note: When new files are discovered, data from new files will be appended to any previous data in the Data Source. Replacing data is not supported.

../_images/ingest-s3-connector-sync-first-table-ui-6.png

Finally, you will see a preview of the incoming data. On this screen, you can review and modify any of the incoming columns, adjusting their names, changing their types, or deleting them entirely (see Mark 1 below). You can also configure the name of the Data Source (see Mark 2 below). When done, click Create Data Source (see Mark 3 below).

../_images/ingest-s3-connector-sync-first-table-ui-7.png

You’re done! On the Data Source details page, you can see the sync history in the tracker chart (see Mark 1 below) and the current status of the connection (see Mark 2 below).

Note: For IMPORT_STRATEGY, only APPEND is supported today.

../_images/ingest-s3-connector-sync-first-table-ui-8.png

Loading files from an S3 bucket in the CLI

To load files from an S3 bucket into Tinybird using the CLI, you first need to create a connection. Creating a connection grants your Tinybird Workspace the appropriate permissions to view files in S3.

Authenticate your CLI, and switch to the desired Workspace. Then run:

tb connection create s3

After running this command, you will be prompted to enter your S3 details. You will need to generate Access Keys for an AWS IAM user. Each Access Key has an Access Key ID and a Secret Access Key. Paste your Access Key ID when prompted for the Key. Paste your Secret Access Key when prompted for the Secret. When prompted for the S3 region, enter the region identifier string, e.g. eu-west-3. Lastly, enter a memorable name for the connection.

A new s3.connection file will be created in your project files.

Note: At the moment, the .connection file is not used and cannot be pushed to Tinybird. It is safe to delete this file. A future release will allow you to push this file to Tinybird to automate creation of connections, similar to Kafka connections.

Now that your connection is created, you can create a Data Source to configure the import of files from S3.

The S3 import is configured using the following options, which can be added at the end of your .datasource file:

  • IMPORT_SERVICE: name of the import service to use, in this case, s3

  • IMPORT_SCHEDULE: either @auto to sync once per minute, or @on-demand to only execute manually

  • IMPORT_STRATEGY: the strategy used to import data, only APPEND is supported

  • IMPORT_BUCKET_URI: a full bucket path, including the s3:// protocol, bucket name, object path and an optional pattern to match against object keys. For example, s3://my-bucket/my-path would discover all files in the bucket my-bucket under the prefix /my-path. You can use patterns in the path to filter objects, for example, ending the path with *.csv will match all objects that end with the .csv suffix.

  • IMPORT_CONNECTION_NAME: the name of the S3 connection to use

Note: For IMPORT_STRATEGY, only APPEND is supported today. When new files are discovered, data from new files will be appended to any previous data in the Data Source. Replacing data is not supported.

For example:

s3.datasource file
DESCRIPTION >
    Analytics events landing data source

SCHEMA >
    `timestamp` DateTime `json:$.timestamp`,
    `session_id` String `json:$.session_id`,
    `action` LowCardinality(String) `json:$.action`,
    `version` LowCardinality(String) `json:$.version`,
    `payload` String `json:$.payload`

ENGINE "MergeTree"
ENGINE_PARTITION_KEY "toYYYYMM(timestamp)"
ENGINE_SORTING_KEY "timestamp"
ENGINE_TTL "timestamp + toIntervalDay(60)"

IMPORT_SERVICE s3
IMPORT_CONNECTION_NAME connection_name
IMPORT_BUCKET_URI s3://my-bucket/*.csv
IMPORT_SCHEDULE @auto
IMPORT_STRATEGY APPEND

With your connection created and Data Source defined, you can now push your project to Tinybird using:

tb push

Supported file types

The S3 Connector supports the following file types:

  • CSV

  • NDJSON

  • Parquet

Required IAM permissions

The S3 Connector requires certain permissions to access objects in your Amazon S3 bucket. The IAM user whose Access Keys you provide needs the following permissions:

  • s3:GetObject

  • s3:ListBucket

  • s3:ListAllMyBuckets
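A minimal IAM policy granting these permissions might look like the following sketch. The bucket name my-bucket is a placeholder; scope the Resource entries to your own bucket:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-bucket",
        "arn:aws:s3:::my-bucket/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": "s3:ListAllMyBuckets",
      "Resource": "*"
    }
  ]
}
```

Note that s3:GetObject applies to objects (the /* resource), while s3:ListBucket applies to the bucket itself.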

Schema Evolution

When the S3 Connector first runs, it selects one file from the initial load and uses it to infer the schema of the Data Source. The selected file is denoted by a blue “Schema reference” bubble (see Mark 1 below).

../_images/ingest-s3-connector-schema-evolution-1.png

The S3 Connector supports automatic creation of new columns. This means that, if a new file contains a new column that has not been seen before, the next sync job will automatically add it to the Tinybird Data Source.

Non-backwards-compatible changes, such as dropping, renaming, or changing the type of columns, are not supported; any rows from these files will be sent to the Quarantine Data Source.
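The rule above can be sketched as follows. This is an illustrative model only; the classify function is hypothetical and not part of Tinybird. Columns not seen before are additive, while a dropped, renamed, or retyped column is non-backwards compatible and routes rows to quarantine:

```python
def classify(reference: dict, incoming: dict) -> str:
    """Illustrative sketch (not Tinybird's implementation) of the schema rule.

    reference: {column: type} inferred from the "Schema reference" file.
    incoming:  {column: type} inferred from a newly discovered file.
    """
    for col, col_type in reference.items():
        # A dropped, renamed, or retyped column is non-backwards compatible
        if incoming.get(col) != col_type:
            return "quarantine"
    # Any extra columns in the new file are added automatically
    return "append"

print(classify({"id": "String"}, {"id": "String", "extra": "Int64"}))  # append
print(classify({"id": "String"}, {"id": "Int64"}))                     # quarantine
```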

Limits

There are some limits applied to the S3 Connector when using the auto mode:

  • Automatic execution of imports runs once every minute

  • Each run will import at most 5 files. If there are more than 5 new files, they will be left for the next run.

If you are regularly exceeding 5 files per minute, this limit can be adjusted. Please contact us in our Slack community or email us at support@tinybird.co.
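Given these defaults, you can estimate how long a backlog takes to drain in auto mode: divide the number of pending files by five and round up, since each run imports at most five files and runs once per minute. A quick sketch with a hypothetical backlog:

```python
import math

files_pending = 12   # hypothetical backlog of new files
files_per_run = 5    # auto-mode limit per run

# Runs needed to drain the backlog; at one run per minute this is
# also roughly the number of minutes until all files are imported
runs = math.ceil(files_pending / files_per_run)
print(runs)  # 3
```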

When using on demand, these limits do not apply; a manual execution of the S3 Connector syncs all new files available since the last run.