Data Sources

What is a Data Source?

Data Sources make it super easy to bring your data into Tinybird. Think of it like a table in a database, but with a little extra on top.

When you ingest data, it is written to a Data Source. You can then write SQL to query data from a Data Source.

A Data Source combines the connection to external data with the table that data is written to.

What should I use Data Sources for?

You ingest your data into a Data Source and build your queries against it.

If your event data lives in a Kafka topic, for instance, you can create a Data Source that connects directly to Kafka and writes the events to Tinybird. You can then create a Pipe to query your fresh event data.

A Data Source can also be the result of materializing a SQL query through a Pipe.

Creating Data Sources

Creating Data Sources in the UI

In your workspace, you’ll find the Data Sources section at the bottom of the left side navigation.

Click the Plus (+) icon to add a new Data Source (see Mark 1 below).

../_images/concepts-data-sources-creating-data-source-1.png

Events API

In the Data Source window, click on the Events API tab (see Mark 1 below). You can switch between code snippets for different languages (see Mark 2 below). Use the Copy snippet button to copy the desired snippet to your clipboard (see Mark 3 below).

The Events API does not require you to create a Data Source upfront; you can add the copied snippet directly into your application, and Tinybird will automatically create the Data Source for you when data is received.

../_images/concepts-data-sources-events-api-1.png
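
For reference, here is a minimal example of sending a single JSON event to the Events API with curl. The Data Source name, Token, and JSON fields are placeholders; substitute your own, and note that the API host may differ depending on your region:

curl \
  -X POST 'https://api.tinybird.co/v0/events?name=events_example' \
  -H 'Authorization: Bearer <your_token>' \
  -d '{"timestamp":"2022-01-01T00:00:00Z","event":"page_view"}'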

Kafka

In the Data Source window, click on the Kafka tab (see Mark 1 below). Enter your connection details in the New connection form (see Mark 2 below).

When you are finished configuring the connection, click Connect to finish (see Mark 3 below).

../_images/concepts-data-sources-kafka-connection-1.png

In the next screen, you can select which Topic to consume from (see Mark 1 below) and configure the Consumer Group name (see Mark 2 below).

When you are finished configuring the Topic consumer, click Connect to finish (see Mark 3 below).

../_images/concepts-data-sources-kafka-connection-2.png

In the last screen, you can choose whether the consumer should start from the Earliest or Latest offset (see Mark 1 below). You can also see a preview of the schema & data (see Mark 2 below).

Click Continue (see Mark 3 below) to start importing the data.

../_images/concepts-data-sources-kafka-connection-3.png

Remote URL

In the Data Source window, click on the Remote URL tab (see Mark 1 below). In the text box, you can enter a URL to a remote file available over HTTP (see Mark 2 below).

When you are finished entering the URL, click Add to finish (see Mark 3 below).

../_images/concepts-data-sources-remote-url-1.png

On the next screen you can give the Data Source a name & description (see Mark 1 below). You can also see a preview of the schema & data (see Mark 2 below).

Click Continue (see Mark 3 below) to start importing the data.

../_images/concepts-data-sources-remote-url-schema-preview-1.png

Local File

In the Data Source window, click on the Local file tab (see Mark 1 below). Click the Choose a CSV, NDJSON or Parquet file to upload text (see Mark 2 below) to open a file selector, then choose the file you want to upload.

When you are finished selecting a file, click Add to finish (see Mark 3 below).

../_images/concepts-data-sources-local-file-1.png

On the next screen you can give the Data Source a name & description (see Mark 1 below). You can also see a preview of the schema & data (see Mark 2 below).

Click Continue (see Mark 3 below) to start importing the data.

../_images/concepts-data-sources-remote-url-schema-preview-1.png

Creating Data Sources in the CLI

Data Source operations are performed using the tb datasource commands.

Events API

The Events API does not require you to create a Data Source upfront; Tinybird will automatically create the Data Source for you when data is received.
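
Once data has been sent, you can verify that the Data Source was created automatically by listing the Data Sources in your Workspace:

tb datasource ls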

Kafka

To create a Kafka Data Source from the CLI, you must first create a Kafka connection:

tb connection create kafka --bootstrap-server HOST:PORT --key KEY --secret SECRET --connection-name CONNECTION_NAME

You can then interactively create the Data Source using the connection. You will be prompted to enter the consumer details:

tb datasource connect CONNECTION_NAME DATASOURCE_NAME

Kafka topic:
Kafka group:
Kafka doesn't seem to have prior commits on this topic and group ID
Setting auto.offset.reset is required. Valid values:
latest          Skip earlier messages and ingest only new messages
earliest        Start ingestion from the first message
Kafka auto.offset.reset config:
Proceed? [y/N]:

You can also do this non-interactively:

tb datasource connect CONNECTION_NAME DATASOURCE_NAME --topic TOPIC --group GROUP --auto-offset-reset OFFSET
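
Putting it all together, a hypothetical end-to-end example (the server, credentials, topic, and names below are placeholders):

tb connection create kafka --bootstrap-server my-kafka:9092 --key MY_KEY --secret MY_SECRET --connection-name my_kafka
tb datasource connect my_kafka kafka_events --topic events --group tinybird_consumer --auto-offset-reset latest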

Remote URL

If you have a remote file available over HTTP, you can create a new Data Source and import the file into it with the following command:

tb datasource append DATA_SOURCE_NAME URL

Alternatively, if you want to generate a .datasource file to version control your new Data Source, you can instead use this command:

tb datasource generate URL

After creating the .datasource file, you will need to push it to Tinybird:

tb push DATA_SOURCE_FILE
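
For example, for a hypothetical remote CSV file (the URL is a placeholder, and the generated .datasource file is assumed to take its name from the remote file):

tb datasource generate https://example.com/events.csv
tb push events.datasource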

Local File

If you have a local file, you can create a new Data Source and import the file into it with the following command:

tb datasource append DATA_SOURCE_NAME FILE_PATH
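
For example, to create a Data Source named events from a hypothetical local file events.csv:

tb datasource append events events.csv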

Alternatively, if you want to generate a .datasource file to version control your new Data Source, you can instead use this command:

tb datasource generate FILE_PATH

After creating the .datasource file, you will need to push it to Tinybird:

tb push DATA_SOURCE_FILE

Setting Data Source TTL

You can apply a TTL (Time To Live) to a Data Source in Tinybird. A TTL allows you to define how long data should be stored for.

For example, you might define a TTL of 7 days, meaning that any data older than 7 days is deleted automatically.

You must define the TTL at the time of creating the Data Source, and your data must have a column whose type can represent a date. Valid types are any of the Date or Int types.

Setting Data Source TTL in the UI

This section describes setting the TTL when creating a new Data Source in the Tinybird UI.

When creating your new Data Source, you can select a TTL on the Schema preview modal (see Mark 1 below). You must select a column that represents a date (see Mark 2 below).

If you are using the Tinybird Events API & want to use a TTL, you must create the Data Source with a TTL before sending data.

../_images/concepts-data-sources-create-ds-with-ttl-1.png

After selecting a column, you can then define the TTL period in days (see Mark 1 below).

../_images/concepts-data-sources-create-ds-with-ttl-2.png

Alternatively, if you need to apply a transformation to the date column, or want to use more complex logic, you can select the Use custom SQL option (see Mark 1 below).

../_images/concepts-data-sources-create-ds-with-ttl-3.png

You can then enter some custom SQL to define your TTL (see Mark 1 below).

../_images/concepts-data-sources-create-ds-with-ttl-4.png
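
For example, a hypothetical TTL expression could transform a String column into a date before adding the interval, such as toDateTime(created_at) + toIntervalDay(30), where created_at is a column in your Data Source.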

Setting Data Source TTL in the CLI

This section describes setting the TTL when creating a new Data Source in the CLI.

When creating a new Data Source, you can add a TTL to the .datasource file.

At the end of a .datasource file you will find the Engine settings. Add a new setting called ENGINE_TTL and enter your TTL string enclosed in double quotes (").

SCHEMA >
    `date` DateTime,
    `product_id` String,
    `user_id` Int64,
    `event` String,
    `extra_data` String

ENGINE "MergeTree"
ENGINE_PARTITION_KEY "toYear(date)"
ENGINE_SORTING_KEY "date, user_id, event, extra_data"
ENGINE_TTL "date + toIntervalDay(90)"
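
After adding ENGINE_TTL, push the .datasource file to Tinybird as usual (the file name here is a placeholder):

tb push my_datasource.datasource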

Changing Data Source TTL

It is possible to modify the TTL of an existing Data Source. You can add a TTL if one was not specified previously, or update an existing TTL.

Changing Data Source TTL in the UI

This section describes changing the TTL of an existing Data Source in the Tinybird UI.

First, navigate to the Data Source details page by clicking on the Data Source whose TTL you wish to change (see Mark 1 below). Then, click on the Schema tab (see Mark 2 below). You’ll find the Data Source’s TTL at the bottom of the right-hand column; click the TTL text (see Mark 3 below).

../_images/concepts-data-sources-modify-ttl-1.png

A dialog window will open. Click into the dropdown menu (see Mark 1 below) to show the available fields to use for the TTL. Click on an item from the dropdown to select it as the field for the TTL (see Mark 2 below).

../_images/concepts-data-sources-modify-ttl-2.png

With the field selected, you can change what the TTL interval will be (see Mark 1 below). When you are finished, click Save (see Mark 2 below).

../_images/concepts-data-sources-modify-ttl-3.png

Finally, you will see the updated TTL value in the Data Source’s Schema page (see Mark 1 below).

../_images/concepts-data-sources-modify-ttl-4.png

Changing Data Source TTL in the CLI

This section describes changing the TTL of an existing Data Source in the CLI.

At the end of a .datasource file you will find the Engine settings.

If no TTL has been applied, add a new setting called ENGINE_TTL and enter your TTL string enclosed in double quotes ("). If a TTL has already been applied, modify the existing TTL string between the double quotes (").

The ENGINE_TTL setting looks like this:

ENGINE_TTL "date + toIntervalDay(90)"

When finished modifying the .datasource file, you must push the changes to Tinybird using the CLI:

tb push DATA_SOURCE_FILE -f

Data Sources supported ingestion methods

  • Kafka

  • Events API

  • Local files

  • Remote files reachable through a URL

Data Sources supported data formats

Tinybird supports the following data formats for ingestion:

  • CSV

  • NDJSON

  • Parquet

The Quarantine Data Source

Every Data Source you create in your Workspace has an associated quarantine Data Source. If you send rows that don’t fit the Data Source schema, they are automatically sent to the quarantine table instead, so the whole ingestion process doesn’t fail, and you can review quarantined rows later or perform operations on them using Pipes. This makes quarantine a great source of information for fixing your origin source, or a very powerful way to make the needed changes on the fly during the ingestion process.

By convention, the quarantine Data Source is named {datasource_name}_quarantine.

The quarantine Data Source schema contains the columns of the original row, plus some extra columns (c__error_column, c__error, c__import_id, and insertion_date) with information about the issues that sent the row to quarantine.
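
For example, assuming a Data Source named events, you could inspect the most recent quarantined rows from the CLI using these columns (a sketch; adjust the names to your own Data Source):

tb sql "SELECT insertion_date, c__error_column, c__error FROM events_quarantine ORDER BY insertion_date DESC LIMIT 10"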

See the Quarantine Guide for practical examples on using the Quarantine Data Source.