S3 Connector¶
The S3 Connector allows you to ingest files from your Amazon S3 bucket into Tinybird. You can choose to load a full bucket, or to load files that match a pattern.
The S3 Connector is fully managed and requires no additional tooling. You can choose to execute the S3 Connector manually or automatically, and all scheduling is handled by Tinybird.
Supported file types¶
The S3 Connector supports the following file types:
- CSV
- NDJSON
- Parquet
Set up¶
The setup process can be done using the UI or the CLI. It involves configuring both Tinybird and AWS:
- Create a new Data Source in Tinybird
- Create the AWS S3 connection
- Configure the scheduling options and path/file names
- Start ingesting
Prerequisites¶
To use the Tinybird S3 Connector feature, you should be familiar with Amazon S3 buckets and have the necessary permissions to set up a new policy and role in AWS.
Required IAM permissions¶
As part of the setup process below, the S3 Connector requires certain permissions to access objects in your Amazon S3 bucket. The IAM Role needs the following permissions:
- s3:GetObject
- s3:ListBucket
- s3:ListAllMyBuckets
An example AWS Access Policy would look like this (with bucket name replaced):
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::<bucket_name>",
        "arn:aws:s3:::<bucket_name>/*"
      ],
      "Effect": "Allow"
    },
    {
      "Sid": "Statement1",
      "Effect": "Allow",
      "Action": [
        "s3:ListAllMyBuckets"
      ],
      "Resource": [
        "*"
      ]
    }
  ]
}
```
And the trust policy:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "sts:AssumeRole",
      "Principal": {
        "AWS": "arn:aws:iam::473819123789074:root"
      },
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "ab3caaaa-01aa-4b95-bad3-fff9b2ac789f8a9"
        }
      }
    }
  ]
}
```
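If you prefer to generate the access policy programmatically rather than editing the placeholder by hand, a minimal Python sketch could look like the following (the bucket name is an example; the policy body mirrors the access policy shown above):

```python
import json

def s3_access_policy(bucket_name: str) -> str:
    """Build the IAM access policy JSON for the S3 Connector,
    substituting the given bucket name into the Resource ARNs."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",
                    f"arn:aws:s3:::{bucket_name}/*",
                ],
                "Effect": "Allow",
            },
            {
                "Sid": "Statement1",
                "Effect": "Allow",
                "Action": ["s3:ListAllMyBuckets"],
                "Resource": ["*"],
            },
        ],
    }
    return json.dumps(policy, indent=2)

print(s3_access_policy("my-bucket"))
```

You can paste the printed JSON directly into the IAM policy editor in the AWS console.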
Load files from an S3 bucket using the UI¶
Step 1: Create a new Data Source¶
Open the Tinybird UI and add a new Data Source. In the Data Sources modal, select the Amazon S3 option. Select "New Connection" and click "Next". Enter the bucket name, select the region, and click "Continue".
Step 2: Create the AWS S3 connection¶
In the next screen, follow the 4-step instructions:
- Open the AWS console and navigate to IAM.
- Create and name the policy using the provided copyable option.
- Create and name the role with the trust policy using the provided copyable option.
- Select "Connect".
You’ll need the role’s ARN (Amazon Resource Name) to create the connection in the next step. To save having to come back and look for it, go to IAM > Roles and search for the role you just created. Select it to open the role details, which include the role's ARN. Copy it somewhere you can find it easily. It looks something like `arn:aws:iam::111111111111:role/my-awesome-role`.
- Paste in the connection name and ARN.
Step 3: Choose data¶
Choose the data you wish to ingest and select "Next".
Step 4: Preview and create¶
The next screen shows a preview of the incoming data. Here, you can review & modify any of the incoming columns, adjusting their names, changing their types or deleting them entirely. You can also configure the name of the Data Source. After reviewing your incoming data, select "Create Data Source".
You're done 🎉 ! On the Data Source details page, you can see the sync history in the tracker chart and the current status of the connection.
Load files from an S3 bucket using the CLI¶
You need to create a connection before you can load files from Amazon S3 into Tinybird using the CLI. Creating a connection grants your Tinybird Workspace the appropriate permissions to view files in Amazon S3.
Authenticate your CLI and switch to the desired Workspace.
To create a connection for the Tinybird S3 Connector following these steps, you need CLI version 3.8.3 or higher.
Steps:
- Run the `tb connection create s3_iamrole --policy read` command. The `--policy` flag lets you switch between write (sink) and read (ingest) policies.
- To move to the next step, type `y`.
- Copy the suggested policy and replace the `<bucket>` placeholder with your bucket name.
- In AWS, create a new policy in IAM > Policies (JSON) using the copied text.
- Go to the next step in the CLI and copy the next policy.
- In AWS, navigate to IAM > Roles and create a new role using the copied custom trust policy. In the next step, attach the policy you created earlier.
- Go to the next step in the CLI and copy the full ARN (Amazon Resource Name) of the role you just created. Go to IAM > Roles and search for the role, then select it to open the role details, including the role's ARN. Copy it and paste it into the CLI when requested. It looks something like `arn:aws:iam::111111111111:role/my-awesome-role`.
- Enter the region of the bucket, such as `us-east-1`.
- Finally, provide a name for your connection in Tinybird.
A new `s3_ingest.connection` file will be created in your project files.
Note: At the moment, the `.connection` file is not used and cannot be pushed to Tinybird. It is safe to delete this file. A future release will allow you to push this file to Tinybird to automate the creation of connections, similar to Kafka connections.
Now that your connection is created, you can create a Data Source to configure the import of files from Amazon S3.
The Amazon S3 import is configured using the following options, which can be added at the end of your `.datasource` file:
- `IMPORT_SERVICE`: name of the import service to use; in this case, `s3_iamrole`.
- `IMPORT_SCHEDULE`: either `@auto` to sync once per minute, or `@on-demand` to only execute manually (UTC).
- `IMPORT_STRATEGY`: the strategy used to import data; only `APPEND` is supported.
- `IMPORT_BUCKET_URI`: a full bucket path, including the `s3://` protocol, bucket name, object path, and an optional pattern to match against object keys. For example, `s3://my-bucket/my-path` would discover all files in the bucket `my-bucket` under the prefix `/my-path`. You can use patterns in the path to filter objects; for example, ending the path with `*.csv` will match all objects that end with the `.csv` suffix.
- `IMPORT_CONNECTION_NAME`: the name of the S3 connection to use.
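To illustrate how such a pattern narrows down the matched objects, here is a small Python sketch using the standard `fnmatch` module. This assumes glob-style `*` matching, as in the `*.csv` example above; the object keys are hypothetical and Tinybird's exact matching rules may differ in detail:

```python
from fnmatch import fnmatch

# Hypothetical object keys in a bucket named my-bucket.
keys = [
    "my-path/events-2024-01.csv",
    "my-path/events-2024-02.csv",
    "my-path/metadata.json",
    "other-path/events.csv",
]

# Pattern taken from an IMPORT_BUCKET_URI such as
# s3://my-bucket/my-path/*.csv (the bucket prefix is stripped).
pattern = "my-path/*.csv"

# Only CSV objects under the my-path prefix match the pattern.
matched = [k for k in keys if fnmatch(k, pattern)]
print(matched)
```

Here, only the two CSV files under `my-path/` match; the JSON file and the objects under other prefixes are ignored.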
Note: For `IMPORT_STRATEGY`, only `APPEND` is supported today. When new files are discovered, data from new files will be appended to any previous data in the Data Source. Replacing data is not supported.
For example:
s3.datasource file

```
DESCRIPTION >
    Analytics events landing data source

SCHEMA >
    `timestamp` DateTime `json:$.timestamp`,
    `session_id` String `json:$.session_id`,
    `action` LowCardinality(String) `json:$.action`,
    `version` LowCardinality(String) `json:$.version`,
    `payload` String `json:$.payload`

ENGINE "MergeTree"
ENGINE_PARTITION_KEY "toYYYYMM(timestamp)"
ENGINE_SORTING_KEY "timestamp"
ENGINE_TTL "timestamp + toIntervalDay(60)"

IMPORT_SERVICE s3_iamrole
IMPORT_CONNECTION_NAME connection_name
IMPORT_BUCKET_URI s3://my-bucket/*.csv
IMPORT_SCHEDULE @auto
IMPORT_STRATEGY APPEND
```
With your connection created and Data Source defined, you can now push your project to Tinybird using:
```
tb push
```
Schema evolution¶
When the S3 Connector first runs, it selects 1 file from the initial load and uses this to infer the schema of the Data Source. The file it selects is denoted by a blue "Schema reference" bubble (see Mark 1 below).
The S3 Connector supports automatic creation of new columns. This means that, if a new file contains a new column that has not been seen before, the next sync job will automatically add it to the Tinybird Data Source.
Non-backwards compatible changes, such as dropping, renaming, or changing the type of columns, are not supported and any rows from these files are sent to the Quarantine Data Source.
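The idea behind automatic column creation can be sketched as follows. This is illustrative only, not Tinybird's implementation: it compares the keys of an incoming NDJSON record against the known schema and treats unseen keys as new columns.

```python
import json

# Columns already present in the Data Source (hypothetical schema).
known_columns = {"timestamp", "session_id", "action"}

# A newly discovered file contains a record with an extra field.
record = json.loads(
    '{"timestamp": "2024-01-01 00:00:00", '
    '"session_id": "abc", "action": "click", '
    '"country": "ES"}'
)

# Keys not in the known schema would be added as new columns
# on the next sync job.
new_columns = set(record) - known_columns
print(new_columns)
```

A record that instead dropped `session_id` or changed its type would be a non-backwards compatible change, and its rows would go to the Quarantine Data Source as described above.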
Note: When new files are discovered, data from new files is appended to any previous data in the Data Source. Replacing data is not supported.
Limits¶
There are some limits applied to the S3 Connector when using the `@auto` mode:
- Automatic execution of imports runs once every 1 minute.
- Each run will import at most 5 files. If there are more than 5 new files, they will be left for the next run.
If you are regularly exceeding 5 files per minute, this limit can be adjusted. Contact us in our Slack community or email us at support@tinybird.co.
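As a back-of-the-envelope check of whether the default limits fit your volume, you can work out the maximum throughput of `@auto` mode (illustrative arithmetic only):

```python
# Default @auto limits: one run per minute, at most 5 files per run.
files_per_run = 5
runs_per_hour = 60

# Maximum number of files the connector can ingest per hour.
max_files_per_hour = files_per_run * runs_per_hour
print(max_files_per_hour)
```

If your bucket regularly receives new files faster than this, contact support to adjust the limit or use `@on-demand` syncs.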
When using `@on-demand`, these limits do not apply. A manual execution of the S3 Connector will sync all new files available since the last run.