Data Sources API¶
The Data Sources API is a convenient way to import files into your Tinybird Data Sources. With this API you can use files to create new Data Sources, and append data to, or replace data from, an existing Data Source.
On this page you will find examples of how to use the Data Sources API to perform various tasks. You can find the Data Sources API Reference here.
New to Data Sources? Read more about them here.
Importing a file into a new Data Source¶
If you do not already have a Data Source, you can import a file and Tinybird can create the Data Source for you.
This operation supports CSV, NDJSON and Parquet files.
You can create a Data Source from local or remote files.
Automatic schema inference is supported for CSV files, but not for NDJSON or Parquet files.
CSV files¶
CSVs must follow these requirements:
One line per row
Comma-separated
Tinybird supports Gzip compressed CSVs (csv.gz files).
The Data Sources API will automatically detect and optimize your column types, so you do not need to manually define a schema. You can use the parameter type_guessing=false to force Tinybird to use String for every column.
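For example, a minimal sketch of creating a Data Source with type guessing disabled (the Data Source and file names are illustrative):
curl \
-H "Authorization: Bearer <DATASOURCES:CREATE token>" \
-X POST "https://api.tinybird.co/v0/datasources?name=my_datasource_name&type_guessing=false" \
-F csv=@local_file.csv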
CSV headers are optional. When creating a Data Source from a CSV file that contains a header row, Tinybird uses the header to name your columns; if no header is present, columns are given default names with an incrementing number. When appending a CSV file to an existing Data Source, Tinybird uses the header to identify the columns; if no header is present, Tinybird relies on the order of the columns. If you can guarantee that the order of columns in the CSV file is always the same, it is safe to omit the header line.
For example, to create a new Data Source from a local file using cURL:
curl \
-H "Authorization: Bearer <DATASOURCES:CREATE token>" \
-X POST "https://api.tinybird.co/v0/datasources?name=my_datasource_name" \
-F csv=@local_file.csv
Or, from a remote file:
curl \
-H "Authorization: Bearer <DATASOURCES:CREATE token>" \
-X POST "https://api.tinybird.co/v0/datasources?name=my_datasource_name" \
-d url='https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2018-12.csv'
When importing a remote file from a URL, the response will contain the details of an import Job. To see the status of the import, you must use the Jobs API.
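For example, a sketch of checking the status of the import Job (assuming the job id returned in the import response; see the Jobs API Reference for the full response format):
curl \
-H "Authorization: Bearer <your token>" \
"https://api.tinybird.co/v0/jobs/<job_id>"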
NDJSON & Parquet files¶
The Data Sources API does not support automatic schema inference for NDJSON and Parquet files. You must specify the schema parameter with a valid schema to parse the files.
The schema for both NDJSON and Parquet files uses JSONPaths to identify columns in the data.
For example, assume your NDJSON (or Parquet) data looks like this:
{ "id": 123, "name": "Al Brown"}
Your schema definition would need to provide the JSONPath expressions to identify the columns id and name:
id Int32 `json:$.id`,
name String `json:$.name`
Default values are not yet supported for JSONPath columns.
You can use the Analyze API to automatically generate a schema definition from a file (see "Generating schemas with the Analyze API" below).
To create a new Data Source from a local file using cURL, you must URL encode the schema as a query parameter. For example:
These examples use NDJSON; to use Parquet, simply change the format parameter to format=parquet.
curl \
-H "Authorization: Bearer <DATASOURCES:CREATE token>" \
-X POST "https://api.tinybird.co/v0/datasources?name=events&mode=create&format=ndjson&schema=id%20Int32%20%60json%3A%24.id%60%2C%20name%20String%20%60json%3A%24.name%60" \
-F ndjson=@local_file.ndjson
Or, from a remote file:
Note: the escape characters in this example are only required to pass the literal backticks through the shell.
curl \
-H "Authorization: Bearer <DATASOURCES:CREATE token>" \
-X POST "https://api.tinybird.co/v0/datasources?name=events&mode=create&format=ndjson" \
--data-urlencode "schema=id Int32 \`json:$.id\`, name String \`json:$.name\`" \
-d url='http://example.com/file.json'
When importing a remote file from a URL, the response will contain the details of an import Job. To see the status of the import, you must use the Jobs API, as shown above.
Appending a file into an existing Data Source¶
If you already have a Data Source, you can append the contents of a file to the existing data.
This operation supports CSV, NDJSON and Parquet files.
You can append data from local or remote files.
When appending CSV files, you can improve performance by excluding the CSV header line. In that case, you must ensure the columns in the file are in the same order as the columns in the Data Source. If you cannot guarantee the order of the columns in your CSV, include the header.
For example, to append data into an existing Data Source from a local file using cURL:
curl \
-H "Authorization: Bearer <DATASOURCES:APPEND token>" \
-X POST "https://api.tinybird.co/v0/datasources?mode=append&name=my_datasource_name" \
-F csv=@local_file.csv
Or, from a remote file:
curl \
-H "Authorization: Bearer <DATASOURCES:APPEND token>" \
-X POST "https://api.tinybird.co/v0/datasources?mode=append&name=my_datasource_name" \
-d url='https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2018-12.csv'
If the Data Source has dependent Materialized Views, data will be appended in cascade.
Replacing data in an existing Data Source with a file¶
If you already have a Data Source, you can replace existing data with the contents of a file. You can choose to replace all data, or a selection of data.
This operation supports CSV, NDJSON and Parquet files.
You can replace with data from local or remote files.
For example, to replace all of the data in a Data Source with data from a local file using cURL:
curl \
-H "Authorization: Bearer <DATASOURCES:CREATE token>" \
-X POST "https://api.tinybird.co/v0/datasources?mode=replace&name=data_source_name&format=csv" \
-F csv=@local_file.csv
Or, from a remote file:
curl \
-H "Authorization: Bearer <DATASOURCES:CREATE token>" \
-X POST "https://api.tinybird.co/v0/datasources?mode=replace&name=data_source_name&format=csv" \
--data-urlencode "url=http://example.com/file.csv"
Rather than replacing all data, you can also replace specific partitions of data. This operation is atomic.
To do this, use the replace_condition parameter. This parameter defines a filter: all existing rows matching the condition are deleted before the new file is ingested, and only the rows in the file that match the condition are ingested. Rows in the source file that do not match the filter are ignored.
Replacements are made by partition, so it is mandatory that the replace_condition filters on the partition key of the Data Source.
To replace filtered data in a Data Source with data from a local file using cURL, you must URL encode the replace_condition as a query parameter. For example:
curl \
-H "Authorization: Bearer <DATASOURCES:CREATE token>" \
-X POST "https://api.tinybird.co/v0/datasources?mode=replace&name=data_source_name&format=csv&replace_condition=my_partition_key%20%3E%20123" \
-F csv=@local_file.csv
Or, from a remote file:
curl \
-H "Authorization: Bearer <DATASOURCES:CREATE token>" \
-X POST "https://api.tinybird.co/v0/datasources?mode=replace&name=data_source_name&format=csv" \
-d replace_condition='my_partition_key > 123' \
--data-urlencode "url=http://example.com/file.csv"
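For example, if your Data Source were partitioned by a date column, a sketch of replacing a single month of data could look like this (the column name date and the date boundaries are illustrative):
curl \
-H "Authorization: Bearer <DATASOURCES:CREATE token>" \
-X POST "https://api.tinybird.co/v0/datasources?mode=replace&name=data_source_name&format=csv" \
--data-urlencode "replace_condition=date >= '2018-12-01' and date < '2019-01-01'" \
--data-urlencode "url=http://example.com/file.csv"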
All the dependencies of the Data Source (e.g. Materialized Views) will be recomputed, so your data remains consistent after the replacement. If you have n-level dependencies, they will also be updated by this operation. Taking the example A -> B -> C, if you replace data in A, Data Sources B and C will be updated automatically. Keep in mind that the partition key of Data Source C must also be compatible with Data Source A.
You can find more examples in this guide.
Replacements are atomic, but data consistency cannot be guaranteed if you continue appending data to any related Data Source while the replacement takes place: the new incoming data will be discarded. This is an important limitation, and we are working to provide a more flexible alternative.
If the Data Source has dependent Materialized Views, data will be replaced in cascade.
Creating an empty Data Source from a schema¶
When you want more granular control over the Data Source schema, you can create the Data Source manually with a specified schema. To create an empty Data Source, pass a schema parameter with your desired column names and types and leave the url parameter empty.
For example, to create an empty Data Source with a set schema using cURL:
curl \
-H "Authorization: Bearer <DATASOURCES:CREATE token>" \
-X POST "https://api.tinybird.co/v0/datasources?name=stocks" \
-d "schema=symbol String, date Date, close Float32"
Generating schemas with the Analyze API¶
The Analyze API can analyze a given NDJSON or Parquet file to produce a valid schema. The column names, types and JSONPaths will be inferred from the file.
For example, to analyze a local NDJSON file using cURL:
curl \
-H "Authorization: Bearer <DATASOURCES:CREATE token>" \
-X POST "https://api.tinybird.co/v0/analyze" \
-F "ndjson=@local_file_path"
The response will contain a schema field that can be used to create your Data Source. For example:
{
"analysis": {
"columns": [{
"path": "$.a_nested_array.nested_array[:]",
"recommended_type": "Array(Int16)",
"present_pct": 3,
"name": "a_nested_array_nested_array"
},
{
"path": "$.an_array[:]",
"recommended_type": "Array(Int16)",
"present_pct": 3,
"name": "an_array"
},
{
"path": "$.field",
"recommended_type": "String",
"present_pct": 1,
"name": "field"
},
{
"path": "$.nested.nested_field",
"recommended_type": "String",
"present_pct": 1,
"name": "nested_nested_field"
}
],
"schema": "a_nested_array_nested_array Array(Int16) `json:$.a_nested_array.nested_array[:]`, an_array Array(Int16) `json:$.an_array[:]`, field String `json:$.field`, nested_nested_field String `json:$.nested.nested_field`"
},
"preview": {
"meta": [{
"name": "a_nested_array_nested_array",
"type": "Array(Int16)"
},
{
"name": "an_array",
"type": "Array(Int16)"
},
{
"name": "field",
"type": "String"
},
{
"name": "nested_nested_field",
"type": "String"
}
],
"data": [{
"a_nested_array_nested_array": [
1,
2,
3
],
"an_array": [
1,
2,
3
],
"field": "test",
"nested_nested_field": "bla"
}],
"rows": 1,
"statistics": {
"elapsed": 0.00032175,
"rows_read": 2,
"bytes_read": 142
}
}
}
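As an illustrative sketch, you can chain both calls together: extract the schema field from the analysis (here with jq, which is assumed to be installed) and pass it when creating a Data Source from a remote file:
# Analyze the file and keep only the recommended schema
SCHEMA=$(curl \
-H "Authorization: Bearer <DATASOURCES:CREATE token>" \
-X POST "https://api.tinybird.co/v0/analyze" \
-F "ndjson=@local_file_path" | jq -r '.analysis.schema')
# Create the Data Source using the inferred schema
curl \
-H "Authorization: Bearer <DATASOURCES:CREATE token>" \
-X POST "https://api.tinybird.co/v0/datasources?name=events&mode=create&format=ndjson" \
--data-urlencode "schema=$SCHEMA" \
-d url='http://example.com/file.ndjson'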
Handling errors¶
In many cases, an error will return an HTTP error code, e.g. HTTP 4xx or HTTP 5xx.
However, if the imported file is valid but some rows fail to ingest due to an incompatible schema, you will still receive an HTTP 200. In this case, the response body contains two keys, invalid_lines and quarantine_rows, which tell you how many rows failed to ingest. Additionally, an error key will be present with an error message.
{
"import_id": "e9ae235f-f139-43a6-7ad5-a1e17c0071c2",
"datasource": {
"id": "t_0ab7a11969fa4f67985cec481f71a5c2",
"name": "your_datasource_name",
"cluster": null,
"tags": {},
"created_at": "2019-03-12 17:45:04",
"updated_at": "2019-03-12 17:45:04",
"statistics": {
"bytes": 1397,
"row_count": 4
},
"replicated": false,
"version": 0,
"project": null,
"used_by": []
},
"error": "There was an error with file contents: 2 rows in quarantine and 2 invalid lines",
"quarantine_rows": 2,
"invalid_lines": 2
}
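Rows that fail to ingest are sent to the quarantine Data Source, named after the original with a _quarantine suffix. As a sketch, you can inspect them with the Query API (the Data Source name is illustrative):
curl \
-H "Authorization: Bearer <your token>" \
-G "https://api.tinybird.co/v0/sql" \
--data-urlencode "q=SELECT * FROM your_datasource_name_quarantine LIMIT 10"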