Data Sources

What's a Data Source?

Data Sources make it super easy to bring your data into Tinybird. They combine the functionality of accessing external data and writing that data to a table. They function like database tables, but with a little extra on top.

When you ingest data in Tinybird, it gets written to a Data Source. You then write SQL to query data from the Data Source.
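
For example, once data arrives in a Data Source, you can query it with plain SQL. A minimal sketch, assuming a Data Source named events with an event column:

SELECT event, count() AS total
FROM events
GROUP BY event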

What should I use Data Sources for?

You ingest your data into a Data Source, and build your queries against the Data Source.

For instance, if your event data lives in a Kafka topic, you can create a Data Source that connects directly to Kafka and writes the events to Tinybird. You can then create a Pipe to query your fresh event data.

A Data Source can also be the result of materializing a SQL query through a Pipe.

There is a limit of 100 Data Sources per Workspace.

Creating Data Sources

Using the UI

In your Workspace, locate the Data Sources section at the bottom of the left side navigation.

Click the Plus (+) icon to add a new Data Source (see Mark 1 below).

image

Events API

In the Data Source window, click on the Events API selector (see Mark 1 below).

image

You can switch between multiple different code snippets for different languages (see Mark 1 below). Use the Copy snippet button to copy the desired snippet to your clipboard (see Mark 2 below).

image

The Events API does not require you to create a Data Source upfront; you can add the copied snippet directly to your application, and Tinybird will automatically create the Data Source for you when data is received.

You can choose to create a Data Source schema upfront, and then start sending data to it over the Events API. This gives you greater control over the data types and sorting keys used by the Data Source.
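
As a rough sketch of what such a snippet looks like, here is a single JSON event sent with curl. The Data Source name (events_example) and columns are hypothetical, and $TB_TOKEN stands for a Token with ingest permissions:

curl -X POST "https://api.tinybird.co/v0/events?name=events_example" \
  -H "Authorization: Bearer $TB_TOKEN" \
  -d '{"date": "2023-01-01 00:00:00", "event": "signup", "user_id": 123}'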

Kafka

In the new Data Source modal, click on the Kafka connector (see Mark 1 below).

image

Enter your connection details in the form. When you are finished configuring the connection, click Next (see Mark 1 below).

image

In the next screen, you can select which Topic to consume from (see Mark 1 below) and configure the Consumer Group name (see Mark 2 below).

When you are finished configuring the Topic consumer, click Next (see Mark 3 below).

image

You can now choose where the consumer should start from, either the Earliest or Latest offset. Make your selection, then click Next (see Mark 1 below).

image

You will see a preview of the schema & data (see Mark 1 below). Click Create Data Source (see Mark 2 below) to finish and start importing the data.

image

Remote URL

In the new Data Source modal, click on the Remote URL connector (see Mark 1 below).

image

On the next screen, select the format of your data (see Mark 1 below). Then, enter the URL to a remote file available over HTTP(S) (see Mark 2 below). Click Add (see Mark 3 below).

image

You will now see a preview of the schema & data (see Mark 1 below). Click Create Data Source (see Mark 2 below) to start importing the data.

image

Local File

In the new Data Source modal, click on the File Upload connector (see Mark 1 below).

image

On the next screen, click into the Upload box (see Mark 1 below) to open a file selector and choose a file you want to upload. You can also drag & drop a file onto the box.

image

You'll see a confirmation of the file name (see Mark 1 below). If it looks correct, click Add (see Mark 2 below).

image

You will now see a preview of the schema & data (see Mark 1 below). Click Create Data Source (see Mark 2 below) to start importing the data.

image

Creating Data Sources in the CLI

Data Source operations are performed using the tb datasource commands.

Events API

The Events API does not require you to create a Data Source upfront; Tinybird will automatically create the Data Source for you when data is received.

Kafka

To create a Kafka Data Source from the CLI, you must first create a Kafka connection:

tb connection create kafka --bootstrap-servers HOST:PORT --key KEY --secret SECRET --connection-name CONNECTION_NAME
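
For example, with hypothetical connection details:

tb connection create kafka --bootstrap-servers kafka.example.com:9092 --key MY_KEY --secret MY_SECRET --connection-name my_kafka_connection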

You can then interactively create the Data Source using the connection. You will be prompted to enter the consumer details:

tb datasource connect CONNECTION_NAME DATASOURCE_NAME

Kafka topic:
Kafka group:
Kafka doesn't seem to have prior commits on this topic and group ID
Setting auto.offset.reset is required. Valid values:
latest          Skip earlier messages and ingest only new messages
earliest        Start ingestion from the first message
Kafka auto.offset.reset config:
Proceed? [y/N]:

You can also do this non-interactively:

tb datasource connect CONNECTION_NAME DATASOURCE_NAME --topic TOPIC --group GROUP --auto-offset-reset OFFSET
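
For example, using the hypothetical connection from above:

tb datasource connect my_kafka_connection kafka_events --topic events --group tinybird_ingest --auto-offset-reset latest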

Remote URL

If you have a remote file available over HTTP, you can create a new Data Source and import the file with the following command:

tb datasource append DATA_SOURCE_NAME URL

Alternatively, if you want to generate a .datasource file to version control your new Data Source, you can instead use this command:

tb datasource generate URL

After creating the .datasource file, you will need to push it to Tinybird:

tb push DATA_SOURCE_FILE
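
For example, a sketch of the full flow with a hypothetical remote CSV file, assuming the generated .datasource file is named after the remote file:

tb datasource generate https://example.com/events.csv
tb push events.datasource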

Local File

If you have a local file available, you can create a new Data Source and import the file with the following command:

tb datasource append DATA_SOURCE_NAME FILE_PATH

Alternatively, if you want to generate a .datasource file to version control your new Data Source, you can instead use this command:

tb datasource generate FILE_PATH

After creating the .datasource file, you will need to push it to Tinybird:

tb push DATA_SOURCE_FILE

Setting Data Source TTL

You can apply a TTL (Time To Live) to a Data Source in Tinybird. A TTL defines how long data should be stored.

For example, you might define a TTL of 7 days, meaning that any data older than 7 days is deleted automatically.

To define a TTL at the time of creating the Data Source, your data must have a column whose type can represent a date. Valid types are any of the Date or Int types.
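
For example, a 7-day TTL on a DateTime column named date would look like this in a .datasource file (the full file format is shown in the CLI section below):

ENGINE_TTL "date + toIntervalDay(7)"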

Setting Data Source TTL in the UI

This section describes setting the TTL when creating a new Data Source in the Tinybird UI.

If you are using the Tinybird Events API & want to use a TTL, you must create the Data Source with a TTL before sending data.

When creating your new Data Source, click the Advanced Settings tab (see Mark 1 below). Click the TTL dropdown (see Mark 2 below). You must select a column that represents a date (see Mark 3 below).

image

After selecting a column, you can then define the TTL period in days (see Mark 1 below).

image

Alternatively, if you need to apply a transformation to the date column, or want to use more complex logic, you can select the Code editor tab (see Mark 1 below). You can then enter custom SQL to define your TTL (see Mark 2 below).

image

Setting Data Source TTL in the CLI

This section describes setting the TTL when creating a new Data Source in the CLI.

When creating a new Data Source, you can add a TTL to the .datasource file.

At the end of a .datasource file you will find the Engine settings. Add a new setting called ENGINE_TTL and enter your TTL string enclosed in double quotes (").

SCHEMA >
    `date` DateTime,
    `product_id` String,
    `user_id` Int64,
    `event` String,
    `extra_data` String

ENGINE "MergeTree"
ENGINE_PARTITION_KEY "toYear(date)"
ENGINE_SORTING_KEY "date, user_id, event, extra_data"
ENGINE_TTL "date + toIntervalDay(90)"

Changing Data Source TTL

It is possible to modify the TTL of an existing Data Source. You can add a TTL if one was not specified previously, or update an existing TTL.

Changing Data Source TTL in the UI

This section describes changing the TTL of an existing Data Source in the Tinybird UI.

First, navigate to the Data Source details page by clicking on the Data Source whose TTL you wish to change (see Mark 1 below). Then, click on the Schema tab (see Mark 2 below). You'll find the Data Source's TTL at the bottom of the right-hand column; click the TTL text (see Mark 3 below).

image

A dialog window will open. Click into the dropdown menu (see Mark 1 below) to show the available fields to use for the TTL. Click on an item from the dropdown to select it as the field for the TTL (see Mark 2 below).

image

With the field selected, you can change the TTL interval (see Mark 1 below). When you are finished, click Save (see Mark 2 below).

image

Finally, you will see the updated TTL value in the Data Source's Schema page (see Mark 1 below).

image

Changing Data Source TTL in the CLI

This section describes changing the TTL of an existing Data Source in the CLI.

At the end of a .datasource file you will find the Engine settings.

If no TTL has been applied, add a new setting called ENGINE_TTL and enter your TTL string enclosed in double quotes ("). If a TTL has already been applied, modify the existing TTL string between the double quotes (").

The ENGINE_TTL setting looks like this:

ENGINE_TTL "date + toIntervalDay(90)"

When finished modifying the .datasource file, you must push the changes to Tinybird using the CLI:

tb push DATA_SOURCE_FILE -f

Supported Engines & Settings

We do not recommend changing these settings unless you are familiar with ClickHouse and understand their impact. If you're unsure, contact us in our Slack community or email us at support@tinybird.co.

Tinybird uses ClickHouse as the underlying storage engine. ClickHouse features different strategies to store data; these strategies define not only where and how the data is stored, but also what kind of data access, queries, and availability your data has. In ClickHouse terms, a Tinybird Data Source uses a Table Engine that determines those factors.

Tinybird allows you to choose the ClickHouse Table Engine for your Data Source. To make things simpler, the API hides some of the complexity around ClickHouse's Table Engines. For instance, it simplifies the replication configuration, so you don't have to worry about it.

Currently, Tinybird supports the following Engines:

  • MergeTree
  • ReplacingMergeTree
  • SummingMergeTree
  • AggregatingMergeTree
  • CollapsingMergeTree
  • VersionedCollapsingMergeTree
  • Null

If you need to use any other Table Engine, contact us in our Slack community or email us at support@tinybird.co.

When using the Data Sources API, you can use the engine parameter to specify the name of any of the available engines, e.g. engine=ReplacingMergeTree. To specify the engine parameters and the engine options, you can use as many engine_* request parameters as needed. These settings can also be configured in .datasource files.

The supported parameters for each engine are:

Engine parameters

ENGINE                         SIGNATURE         PARAMETER        DESCRIPTION
ReplacingMergeTree             ([ver])           engine_ver       Optional. The column with version.
SummingMergeTree               ([columns])       engine_columns   Optional. The names of columns where values will be summarized.
CollapsingMergeTree            (sign)            engine_sign      Name of the column for computing the state.
VersionedCollapsingMergeTree   (sign, version)   engine_sign      Name of the column for computing the state.
                                                 engine_version   Name of the column with the version of the object state.
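
Putting these together, here is a hedged sketch of creating a ReplacingMergeTree Data Source through the Data Sources API; the name, schema, and column choices are hypothetical:

curl -X POST "https://api.tinybird.co/v0/datasources" \
  -H "Authorization: Bearer $TB_TOKEN" \
  --data-urlencode "name=events_dedup" \
  --data-urlencode "schema=user_id Int64, event String, updated_at DateTime" \
  --data-urlencode "engine=ReplacingMergeTree" \
  --data-urlencode "engine_sorting_key=user_id" \
  --data-urlencode "engine_ver=updated_at"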

The engine options, in particular the MergeTree engine options, match ClickHouse terminology: engine_partition_key, engine_sorting_key, engine_primary_key, engine_sampling_key, engine_ttl and engine_settings. Check the ClickHouse documentation for a detailed explanation of the different engine options available.

If engine_partition_key is left empty or not passed as a parameter, the underlying Data Source will not have any partition unless there is a Date column, in which case the Data Source will be partitioned by year. If you want to create a Data Source with no partitions, send engine_partition_key=tuple().

engine_settings allows for fine-grained control over the parameters of the underlying Table Engine. In general, we do not recommend changing these settings unless you are absolutely sure about their impact.

The supported engine settings are:

  • index_granularity
  • merge_with_ttl_timeout
  • ttl_only_drop_parts
  • min_bytes_for_wide_part
  • min_rows_for_wide_part

See the Data Sources API for examples of creating Data Sources with custom engine settings using the Tinybird REST API.

Supported data types

The supported ClickHouse data types are:

  • Int8, Int16, Int32, Int64, Int128, Int256
  • UInt8, UInt16, UInt32, UInt64, UInt128, UInt256
  • Float32, Float64
  • String
  • UUID
  • Date, Date32
  • DateTime([TZ]), DateTime64(P, [TZ])
  • Array(T)

Supported data formats

Tinybird supports ingesting data in CSV, NDJSON, and Parquet formats.

The Quarantine Data Source

Every Data Source you create in your Workspace has an associated quarantine Data Source. If you send rows that don't fit the Data Source schema, they are automatically sent to the quarantine table. This way, the whole ingestion process doesn't fail, and you can review quarantined rows later or perform operations on them using Pipes. Quarantine is a great source of information for fixing your origin source, or a powerful way to make the needed changes on the fly during the ingestion process.

By convention, the quarantine Data Source is named {datasource_name}_quarantine.

The quarantine Data Source schema contains the columns of the original row, plus some extra columns (c__error_column, c__error, c__import_id, and insertion_date) with information about the issues that sent the row to quarantine.
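
For example, a quick way to inspect the most recent quarantined rows, assuming a Data Source named events:

SELECT insertion_date, c__error_column, c__error
FROM events_quarantine
ORDER BY insertion_date DESC
LIMIT 10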

See the Quarantine Guide for practical examples on using the Quarantine Data Source.

Partitioning

Partitioning is configured using the ENGINE_PARTITION_KEY setting.

Partitioning is not intended to speed up SELECT queries; we recommend you experiment with more efficient sorting keys (ENGINE_SORTING_KEY) for that. Partitions are intended for data manipulation. A bad partition key (one that creates too many partitions) can negatively impact query performance.

Some options to consider when choosing a partition key:

  • Leave ENGINE_PARTITION_KEY empty. If the table is small (a few GB), or you aren't sure what the best partition key should be, leave it empty. The data will be placed in a single partition.
  • Use a date column. Depending on the filter, you can choose more or less granularity based on your needs. toYYYYMM(date_column) or toYear(date_column) are usually good default choices.

If you have questions about choosing a partition key, contact us in our Slack community or email us at support@tinybird.co.

Examples

Using an empty tuple to create a single partition:

ENGINE_PARTITION_KEY "tuple()"

Using a Date column to create monthly partitions:

ENGINE_PARTITION_KEY "toYYYYMM(date_column)"

Using a column to partition by event types:

ENGINE_PARTITION_KEY "event_type % 8"

FAQs

Does Tinybird support updates and deletes?

Yes, but proceed with caution. Deletes and replaces are not part of ClickHouse, so we built on top. See this guide. Depending on the frequency needed, you may want to convert these upserts and deletes into an append problem that will be easier to solve with deduplication.