Fine-tuning your CSV files for fast ingestion
CSV (comma-separated values) is one of the most widely used formats out there. However, it is commonly used in different ways. Despite its name, some people do not use commas. Other people use escape values in different ways or are unsure about using headers.
At Tinybird we will always do our best to understand your files and our system is smart enough to handle many scenarios.
But as important as our resilience is speed. And following a certain approach might speed your CSV processing up to 10X faster.
Happy & fast ingestion
By default Tinybird processes the CSV file assuming your files follow the most common standard. In this case, RFC4180. Key points in that standard include:
- Separate values with commas.
- Each record is a line (with CRLF as the line break). But the last line may or may not have a line break.
- First line as a header is optional. (Though not using one is faster in Tinybird.)
- Double quotes are optional but using them means you can escape values (for example, if your content has commas or line breaks).
Use two double quotes, like this:
instead of using the backslash ``\`` as an escape character
- Fields containing line breaks, double quotes and commas should be enclosed in double quotes.
- Double quotes can also be escaped by using another double quote ("aaa","b""bb","ccc")
In addition to the previous points, we also recommend:
- Format ``DateTime`` columns as ``YYYY-MM-DD HH:MM:SS`` and ``Date`` columns as ``YYYY-MM-DD``.
- Send the encoding in the ``charset`` part of the ``content-type`` header, if it's different to UTF-8. The expectation is UTF-8, so it should look like this ``Content-Type: text/html; charset=utf-8``.
- You can set values as ``null`` in different ways, for example, "", "" (empty space), \\N and “\\N”.
- If you use a delimiter other than a comma, explicitly define it with the API parameter ``dialect_delimiter``.
- If you use an escape character other than a “, explicitly define it with the API parameter ``dialect_escapechar``.
- If you have no option but to use a different line break character, explicitly define it with the API parameter ``dialect_new_line``.
For more information, check the Data Sources API docs.
Appending data? Keep it simple
Once the Data Source schema has been created, for further requests we recommend no longer including the header. Just keep data in the same order for performance reasons.
However, if the header is included and it contains all the names present in the Data Source schema the ingestion will still work. Even if the columns follow a different order to the initial creation.
If your data does not comply with the format and conventions above we will still do our best to understand all those things. From detecting a column, to casting the different value types, characters, etc.
But speed wins so keep our tips in mind to go up to 10X faster when ingesting data.