Data Sources API¶
The Data Sources API is a convenient way to import files into your Tinybird Data Sources. With this API you can use files to create new Data Sources, and append data to, or replace data from, an existing Data Source.
This page gives examples of how to use the Data Sources API to perform various tasks. These examples are a baseline that cover most common scenarios, but feel free to adapt the flow of them for your own use case. For more information, see the Data Sources API Reference docs or contact us in our Slack community or email us at support@tinybird.co.
New to Data Sources? Read more about the concept of Data Sources
Import a file into a new Data Source¶
If you do not already have a Data Source, you can import a file and Tinybird can create the Data Source for you. This operation supports CSV, NDJSON, and Parquet files, and you can create a Data Source from local or remote files.
Automatic schema inference is supported for CSV files, but is not supported for NDJSON or Parquet files.
CSV files¶
CSVs must follow these requirements:
- One line per row
- Comma-separated
Tinybird supports Gzip compressed CSVs (csv.gz files).
The Data Sources API automatically detects and optimizes your column types, so you don't need to manually define a schema. You can use the type_guessing=false
parameter to force Tinybird to use String
for every column.
CSV headers are optional. When creating a Data Source from a CSV file, if your file contains a header row, Tinybird uses this to name your columns. If no header is present, your columns are given default names with an incrementing number. When appending a CSV file to an existing Data Source, if your file has a header, Tinybird uses the headers to identify the columns. If no header is present, Tinybird uses the order of columns. If you can guarantee that the order of columns in the CSV file are always the same, it is safe to omit the header line.
For example, to create a new Data Source from a local file using cURL:
Creating a Data Source from a local CSV file
curl \ -H "Authorization: Bearer <DATASOURCES:CREATE token>" \ -X POST "https://api.tinybird.co/v0/datasources?name=my_datasource_name" \ -F csv=@local_file.csv
Or, from a remote file:
Creating a Data Source from a remote CSV file
curl \ -H "Authorization: Bearer <DATASOURCES:CREATE token>" \ -X POST "https://api.tinybird.co/v0/datasources?name=my_datasource_name" \ -d url='https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2018-12.csv'
When importing a remote file from a URL, the response will contain the details of an import Job. To see the status of the import, you must use the Jobs API.
NDJSON & Parquet files¶
The Data Sources API does not support automatic schema inference of NDJSON and Parquet files. You must specify the schema
parameter with a valid schema to parse the files.
The schema for both NDJSON and Parquet files uses JSONPaths to identify columns in the data. You can add default values to the schema.
For example, assume your NDJSON (or Parquet) data looks like this:
Simple NDJSON data example
{ "id": 123, "name": "Al Brown"}
Your schema definition would need to provide the JSONPath expressions to identify the columns id
and name
:
You can use the Analyze API to automatically generate a schema definition from a file.
Simple NDJSON schema definition
id Int32 `json:$.id`, name String `json:$.name`
To create a new Data Source from a local file using cURL, you must URL encode the Schema as a query parameter. For example:
These examples use NDJSON. To use Parquet, adjust the format
parameter to format=parquet
.
Creating a Data Source from a local NDJSON file
curl \ -H "Authorization: Bearer <DATASOURCES:CREATE token>" \ -X POST "https://api.tinybird.co/v0/datasources?name=events&mode=create&format=ndjson&schema=id%20Int32%20%60json%3A%24.id%60%2C%20name%20String%20%60json%3A%24.name%60" \ -F ndjson=@local_file.ndjson
Or, from a remote file:
Note the escape characters in this example are only required due to backticks in cURL.
Creating a Data Source from a remote NDJSON file
curl \ -H "Authorization: Bearer <DATASOURCES:CREATE token>" \ -X POST "https://api.tinybird.co/v0/datasources?name=events&mode=create&format=ndjson" \ --data-urlencode "schema=id Int32 \`json:$.id\`, name String \`json:$.name\`" \ -d url='http://example.com/file.json'
When importing a remote file from a URL, the response will contain the details of an import Job. To see the status of the import, you must the Jobs API.
To add default values to the schema, use the DEFAULT
parameter after the JSONPath expressions. For example:
Simple NDJSON schema definition with default values
id Int32 `json:$.id` DEFAULT 1, name String `json:$.name` DEFAULT 'Unknown'
Append a file into an existing Data Source¶
If you already have a Data Source, you can append the contents of a file to the existing data.
This operation supports CSV, NDJSON and Parquet files.
You can append data from local, or remote, files.
When appending CSV files, you can improve performance by excluding the CSV Header line. However, in this case, you must ensure the CSV columns are ordered. If you cannot guarantee the order of column in your CSV, include the CSV Header.
For example, to append data into an existing Data Source from a local file using cURL:
Appending data to a Data Source from a local CSV file
curl \ -H "Authorization: Bearer <DATASOURCES:APPEND token>" \ -X POST "https://api.tinybird.co/v0/datasources?mode=append&name=my_datasource_name" \ -F csv=@local_file.csv
Or, from a remote file:
Appending data to a Data Source from a remote CSV file
curl \ -H "Authorization: Bearer <DATASOURCES:APPEND token>" \ -X POST "https://api.tinybird.co/v0/datasources?mode=append&name=my_datasource_name" \ -d url='https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2018-12.csv'
If the Data Source has dependent Materialized Views, data is appended in cascade.
Replace data in an existing Data Source with a file¶
If you already have a Data Source, you can replace existing data with the contents of a file. You can choose to replace all data, or a selection of data.
This operation supports CSV, NDJSON and Parquet files.
You can replace with data from local or remote files.
For example, to replace all of the data in a Data Source with data from a local file using cURL:
Replacing a Data Source from a URL
curl \ -H "Authorization: Bearer <DATASOURCES:CREATE token>" \ -X POST "https://api.tinybird.co/v0/datasources?mode=replace&name=data_source_name&format=csv" \ -F csv=@local_file.csv
Or, from a remote file:
Replacing a Data Source from a URL
curl \ -H "Authorization: Bearer <DATASOURCES:CREATE token>" \ -X POST "https://api.tinybird.co/v0/datasources?mode=replace&name=data_source_name&format=csv" \ --data-urlencode "url=http://example.com/file.csv"
Rather than replacing all data, you can also replace specific partitions of data. This operation is atomic.
To do this, use the replace_condition
parameter. This parameter defines the filter that is applied, where all matching rows will be deleted before finally ingesting the new file. Only the rows matching the condition will be ingested.
Note that if the source file contains rows that do not match the filter, these rows are ignored.
Replacements are made by partition, so it is mandatory that the replace_condition
filters on the partition key of the Data Source.
To replace filtered data in a Data Source with data from a local file using cURL, you must URL encode the replace_condition
as a query parameter. For example:
Replace filtered data in a Data Source with data from a local file
curl \ -H "Authorization: Bearer <DATASOURCES:CREATE token>" \ -X POST "https://api.tinybird.co/v0/datasources?mode=replace&name=data_source_name&format=csv&replace_condition=my_partition_key%20%3E%20123" \ -F csv=@local_file.csv
Or, from a remote file:
Replace filtered data in a Data Source with data from a remote file
curl \ -H "Authorization: Bearer <DATASOURCES:CREATE token>" \ -X POST "https://api.tinybird.co/v0/datasources?mode=replace&name=data_source_name&format=csv" \ -d replace_condition='my_partition_key > 123' \ --data-urlencode "url=http://example.com/file.csv"
All the dependencies of the Data Source (e.g. Materialized Views) are recomputed so your data will be consistent after the replacement. If you have n-level dependencies they are also updated by this operation. Taking the example A --> B --> C, if you replace data in A, Data Sources B and C will automatically be updated accordingly. You need to take into account that the Partition Key of Data Source C must also be compatible with Data Source A.
You can find more example in this guide.
Replacements are atomic but we cannot assure data consistency if you continue appending data to any related Data Source at the same time the replacement takes place. The new incoming data will be discarded. This is an important limitation and we are working to provide you with a more flexible alternative solution.
If the Data Source has dependent Materialized Views, data will be replaced in cascade.
Creating an empty Data Source from a schema¶
To create an empty Data Source, you must pass a schema
with your desired column names and types and leave the url
parameter empty.
When you want to have more granular control about the Data Source schema, you can manually create the Data Source with a specified schema.
For example, to create an empty Data Source with a set schema using cURL:
Create an empty Data Source with a set schema
curl \ -H "Authorization: Bearer <DATASOURCES:CREATE token>" \ -X POST "https://api.tinybird.co/v0/datasources?name=stocks" \ -d "schema=symbol String, date Date, close Float32"
Generate schemas with the Analyze API¶
The Analyze API can analyze a given NDJSON or Parquet file to produce a valid schema. The column names, types, and JSONPaths are inferred from the file.
For example, to analyze a local NDJSON file using cURL:
analyze a NDJSON file to get a valid schema
curl \ -H "Authorization: Bearer <DATASOURCES:CREATE token>" \ -X POST "https://api.tinybird.co/v0/analyze" \ -F "ndjson=@local_file_path"
The response contains a schema
field that can be used to create your Data Source. For example:
Successful analyze response
{ "analysis": { "columns": [{ "path": "$.a_nested_array.nested_array[:]", "recommended_type": "Array(Int16)", "present_pct": 3, "name": "a_nested_array_nested_array" }, { "path": "$.an_array[:]", "recommended_type": "Array(Int16)", "present_pct": 3, "name": "an_array" }, { "path": "$.field", "recommended_type": "String", "present_pct": 1, "name": "field" }, { "path": "$.nested.nested_field", "recommended_type": "String", "present_pct": 1, "name": "nested_nested_field" } ], "schema": "a_nested_array_nested_array Array(Int16) `json:$.a_nested_array.nested_array[:]`, an_array Array(Int16) `json:$.an_array[:]`, field String `json:$.field`, nested_nested_field String `json:$.nested.nested_field`" }, "preview": { "meta": [{ "name": "a_nested_array_nested_array", "type": "Array(Int16)" }, { "name": "an_array", "type": "Array(Int16)" }, { "name": "field", "type": "String" }, { "name": "nested_nested_field", "type": "String" } ], "data": [{ "a_nested_array_nested_array": [ 1, 2, 3 ], "an_array": [ 1, 2, 3 ], "field": "test", "nested_nested_field": "bla" }], "rows": 1, "statistics": { "elapsed": 0.00032175, "rows_read": 2, "bytes_read": 142 } } }
Error handling¶
In many cases, an error will return an HTTP Error code, e.g. HTTP 4xx
or HTTP 5xx
.
However, if the imported file is valid, but some rows failed to ingest due to an incompatible schema, you will still receive an HTTP 200
. In this case, the Response body contains two keys invalid_lines
and quarantine_rows
. These keys tell you how many rows failed to ingest. Additionally, an error
key will be present with an error message.
Successful ingestion with errors
{ "import_id": "e9ae235f-f139-43a6-7ad5-a1e17c0071c2", "datasource": { "id": "t_0ab7a11969fa4f67985cec481f71a5c2", "name": "your_datasource_name", "cluster": null, "tags": {}, "created_at": "2019-03-12 17:45:04", "updated_at": "2019-03-12 17:45:04", "statistics": { "bytes": 1397, "row_count": 4 }, "replicated": false, "version": 0, "project": null, "used_by": [] }, "error": "There was an error with file contents: 2 rows in quarantine and 2 invalid lines", "quarantine_rows": 2, "invalid_lines": 2 }