Using Quarantine Data Sources¶
The quarantine Data Source¶
Every Data Source you create in your account has an associated quarantine Data Source. When the Data Sources API finds rows in your CSV that don’t fit the Data Source schema, they are automatically sent to the quarantine table. This way, the whole ingestion process doesn’t fail, and you can review quarantined rows later or operate on them in real time using Pipes. This is a great source of information for fixing your origin ETL, or a very powerful way to make the needed changes on the fly during the ingestion process.
By convention, the quarantine Data Source is named after the original Data Source, with a `_quarantine` suffix.
Auth tokens over quarantine data sources:
Adding a particular Data Source scope to an Auth Token also grants the same permissions on its associated quarantine Data Source. See the Auth Token API for more information about authorization.
A practical example¶
Under the hood, the Data Source schema would be auto-guessed and defined as follows:
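The exact schema depends on your CSV. As an illustrative sketch, assuming the example file contains a `date` and a `user_id` column (hypothetical names used throughout this example), the auto-guessed schema in `.datasource` format might look like:

```
SCHEMA >
    `date` Date,
    `user_id` Int64
```

With these strict types, any row whose `date` doesn’t parse as a `Date`, or whose `user_id` isn’t an integer, is sent to quarantine.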
Now, we will do an incremental update by appending a few more rows to our Data Source with a new CSV file. Some rows in the file contain wrong data, so they fail and end up in quarantine.
Check out the operations log tab:
Within the Data Source modal window you can also find the Operations Log, where you can see details about every operation performed on that Data Source. In the top-right of the page you can switch between a 5-minute, 7-day and 30-day view of the log.
As the UI shows, we have 4 rows in quarantine for a variety of reasons. Let’s look at the quarantine table in detail and see how to fix them.
In the Error view you will see a summary of all the errors and the number of occurrences for each of those, so you can prioritize fixing the most common ones. In the Row view, you will see all the rows that have failed, so you can further investigate why.
Common fixes for rows in quarantine¶
As mentioned above, there are three ways of fixing quarantine errors. First, you can always use the Quarantine Data Source to detect potential problems when generating the data, and then fix those in the origin.
On other occasions, especially when you don’t control how the data is generated, it is better to change the original schema to support rows with different values. If you can’t do that easily, the best approach is to use Pipes to generate intermediate views of your data, consolidating both the rows that would go to quarantine and the rows that go straight to your Data Source.
Changing the original Data Schema¶
As you can see in the quarantine table, all the problems were due to wrong data types. The simplest solution would be to modify the schema generation at ingest time and recreate the original Data Source. For the errors we have, we can change the schema using the CLI or the User Interface to something like:
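For instance, using the `date` and `user_id` columns from this example, a relaxed schema in `.datasource` format could be:

```
SCHEMA >
    `date` String,
    `user_id` Nullable(String)
```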
As you can see:
- By setting the `date` column to `String`, we can accommodate dates in other formats that we could eventually fix.
- By changing the `user_id` column to `Nullable(String)`, we would support null values and non-numeric formats in our Data Source.
Fixing errors on-the-fly using Pipes and Materialized views¶
This is one of the most powerful capabilities of Tinybird. When, for whatever reason, you aren’t able to influence the way data is generated, or you just can’t recreate the Data Sources easily, you can apply a transformation to the erroring columns at ingestion time and materialize the results into another Data Source.
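As a sketch of this approach, a Pipe node like the following could cast relaxed `String` columns back to their intended types, using ClickHouse’s `OrNull` conversion functions so unparseable values become `NULL` instead of failing. This assumes a hypothetical Data Source named `events` where `date` is a `String` and `user_id` is a `Nullable(String)`:

```sql
SELECT
    -- parse dates written in a variety of formats; NULL if unparseable
    parseDateTimeBestEffortOrNull(date) AS date,
    -- cast numeric strings to Int64; NULL for non-numeric values
    toInt64OrNull(user_id) AS user_id
FROM events
```

You would then materialize the output of this node into another Data Source with the strict types, so your endpoints query clean, typed data.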
It’s always better to fix data at the source:
Data Sources with wrong rows that end up in the quarantine table take more time to ingest than Data Sources with no errors, so for optimal ingestion performance it’s best to fix all the quarantine errors you find during the development phase of your endpoints.