How to ingest NDJSON data

Intermediate

In this guide you'll learn how to ingest unstructured data, like NDJSON, into Tinybird.

A typical scenario consists of having a document-based database, using nested records in your data warehouse, or generating events in JSON format from a web application.

For cases like this, we used to export the JSON objects as Strings in a CSV file, ingest them into Tinybird, and then use the built-in JSON functions to prepare the data for real-time analytics as it was being ingested.

But this is no longer needed: Tinybird accepts NDJSON imports by default!

NDJSON and JSON support:

Although Tinybird allows you to ingest .json and .ndjson files, it only accepts Newline Delimited JSON as content: each line must be a valid JSON object, and every line has to end with \n. The API will return an error if any line isn't a valid JSON value.
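
For example, a valid two-line NDJSON payload looks like the snippet below; the field names are illustrative, not taken from the sample file used later in this guide.

Valid NDJSON content
{"date": "2021-01-01 10:00:00", "event": "page_view", "city": "Madrid"}
{"date": "2021-01-01 10:00:01", "event": "add_to_cart", "city": "Berlin", "product_id": "0001"}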

Ingesting to Tinybird

As an example for this guide, we will use this 100k-row NDJSON file, which contains events from an ecommerce website with different properties.

With the API

Ingesting NDJSON files using the API is similar to ingesting CSV files. There are only two differences to handle in the query parameters:

  • format: it has to be "ndjson"
  • schema: as with CSV, you provide the name and the type for every column, but here you also need to include another property, the `jsonpath` (see the JSONPath syntax). Example: "schema=event_name String `json:$.event.name`"

You can let Tinybird guess the schema by first calling the Analyze API. It's a very handy way to avoid having to remember the schema and jsonpath syntax: simply send a sample of your file and Tinybird will guess what's inside (columns, types, schema, a preview, etc.).

Just take the schema attribute in the response and either use it right away in the next API request to create the Data Source or modify it as you wish: rename columns, change types, remove columns, etc.
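
As a sketch, calling the Analyze API from Python could look like the example below. The endpoint, the url parameter, and the Bearer token header follow Tinybird's API reference at the time of writing; the file URL and token are placeholders you need to replace with your own values.

Analyze API request (Python, illustrative)
import requests

TOKEN = "<your auth token>"  # placeholder: use a token allowed to create Data Sources
NDJSON_URL = "https://example.com/events_100k.ndjson"  # placeholder for the sample file linked above

# Send the file URL to the Analyze API so Tinybird can guess columns, types and jsonpaths
response = requests.post(
    "https://api.tinybird.co/v0/analyze",
    headers={"Authorization": f"Bearer {TOKEN}"},
    data={"url": NDJSON_URL},
)

# The response includes a "schema" attribute ready to be reused in the Data Sources API
print(response.json())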

Now that we've analyzed the file, let's create the Data Source. In the example below, you will ingest the 100k-row NDJSON file taking only 3 columns from it: date, event, and product_id. The jsonpath allows Tinybird to match each Data Source column with the corresponding JSON property path.
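
Here is a hedged sketch of that request in Python. The Data Source name (events_example), the column types, and the jsonpaths are assumptions for illustration: reuse the schema the Analyze API returned for your file, and replace the token and file URL placeholders with your own values.

Create the Data Source (Python, illustrative)
import requests

TOKEN = "<your auth token>"  # placeholder: replace with your auth token
NDJSON_URL = "https://example.com/events_100k.ndjson"  # placeholder for the sample file linked above

# Only three columns are ingested; each one maps to a JSON property through its jsonpath.
# Names, types and jsonpaths are illustrative: use what the Analyze API returned for your file.
schema = (
    "date DateTime `json:$.date`, "
    "event String `json:$.event`, "
    "product_id String `json:$.product_id`"
)

response = requests.post(
    "https://api.tinybird.co/v0/datasources",
    headers={"Authorization": f"Bearer {TOKEN}"},
    data={
        "mode": "create",
        "name": "events_example",  # hypothetical Data Source name
        "format": "ndjson",
        "schema": schema,
        "url": NDJSON_URL,
    },
)
print(response.json())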

With the Command Line Interface

Nothing changes in the CLI when ingesting an NDJSON file. Just run the same command you are used to running with CSV files.

Once it finishes, it will automatically generate a .datasource file with all the columns, their inferred types, and their jsonpaths.
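
For reference, a generated .datasource file looks roughly like the sketch below; the exact columns, types, and jsonpaths depend on your file, so treat these as illustrative.

Generated .datasource file (illustrative)
SCHEMA >
    `date` DateTime `json:$.date`,
    `event` String `json:$.event`,
    `product_id` String `json:$.product_id`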

Then you can push that .datasource to Tinybird and start using it in your Pipes or append new data to it.

With the User Interface

To create a new Data Source from an NDJSON file, you just need to go to your dashboard and click on the Add Data Source button.

The import modal will allow you to select an NDJSON formatted file (with either .ndjson or .json extension) from your computer or provide a URL such as the one we are using for this guide.

Import modal

You will see a preview of the schema & data (see Mark 1 below). Click Create Data Source (see Mark 2 below) to start importing the data.

Preview step

JSON tree view:

You can preview your JSON data with the tree view available in this step. Just click on the icon on the top right of the table.

Once your data is imported, you will have a Data Source with your JSON data structured in columns, which are easy to transform and consume in any Pipe.

Data Source created with data

Ingest just the columns you need:

After exploring your data, always remember to create a Data Source that only has the columns needed for your analyses. That will make your ingestion, your materializations, and your real-time data project faster.

Dealing with new JSON fields

One of the features included with the new NDJSON import is the automatic detection of new JSON properties when new data comes in.

Using the Data Source we imported in the previous section, we will add a new property with the origin country of the event, complementing the city. Let's append new JSON data with the extra property (example file).
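
If you prefer appending with the API instead of the UI, a sketch like the one below should work. The Data Source name and the file URL are placeholders, and the mode=append parameter follows the Data Sources API reference; check the current docs for the exact parameters your account supports.

Append new data (Python, illustrative)
import requests

TOKEN = "<your auth token>"  # placeholder: replace with your auth token
NEW_NDJSON_URL = "https://example.com/events_with_country.ndjson"  # placeholder for the example file with the extra property

# Append the new events to the Data Source created earlier; some rows now carry an extra "country" property
response = requests.post(
    "https://api.tinybird.co/v0/datasources",
    headers={"Authorization": f"Bearer {TOKEN}"},
    data={
        "mode": "append",
        "name": "events_example",  # the Data Source created earlier in this guide
        "format": "ndjson",
        "url": NEW_NDJSON_URL,
    },
)
print(response.json())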

After finishing the import, open the Data Source modal and check that a new blue banner appears, warning you about the new properties detected in the last ingestion.

Automatically suggesting new columns

Once you accept viewing those new columns, the application will let you add them and change their types and names, as it did in the preview step of the import.

Accepting new columns

From now on, whenever you append new data where the new column is defined and has a value, it will appear in the Data Source and will be available to be consumed from your Pipes.

New column receiving data

New columns are detected automatically. If you ingest data periodically into your NDJSON Data Source (from a file or a Kafka connection) and new columns start coming in, you will see a blue dot on the Data Source icon in the sidebar. Click on the Data Source to view the new columns and add them to the schema, following the steps above.

New columns detected, notified by a blue dot

JSONPaths

When creating a Data Source using NDJSON/Parquet data, for each column in the schema you have to provide a JSONPath using the JSONPath syntax.

This is easy for simple schemas, but it can get complex if you have nested fields and arrays.

For example, given this NDJSON object:

{ "field": "test", "nested": { "nested_field": "bla" }, "an_array": [1, 2, 3], "a_nested_array": { "nested_array": [1, 2, 3] } }

The schema would be something like this:

schema with jsonpath
a_nested_array_nested_array Array(Int16) `json:$.a_nested_array.nested_array[:]`,
an_array Array(Int16) `json:$.an_array[:]`,
field String `json:$.field`,
nested_nested_field String `json:$.nested.nested_field`

Our JSONPath syntax support has some limitations: it supports nested objects at multiple levels, but it only supports nested arrays at the first level, as in the example above. To ingest and transform more complex JSON objects, use the root object JSONPath syntax described in the next section.

JSONPaths and the root object

Defining a column as "column_name String `json:$`" in the Data Source schema will ingest each line in the NDJSON file as a String in the column_name column.

This is very handy in a few scenarios.

When you have nested arrays, such as polygons:

Nested arrays
{
   "id": 49518,
   "polygon": [
      [
         [30.471785843000134, -1.066836591999916],
         [30.463855835000118, -1.075127054999925],
         [30.456156047000093, -1.086082457999908],
         [30.453003785000135, -1.097347919999962],
         [30.456311076000134, -1.108096617999891],
         [30.471785843000134, -1.066836591999916]
      ]
   ]
}

You can parse the id and store the whole JSON string in the root column, so you can extract the polygon later with JSON functions.

schema definition
id String `json:$.id`,
root String `json:$`

When you have complex objects:

Complex JSON objects
{
   "elem": {
      "payments": [
            {
               "users": [
                  {
                        "user_id": "Admin_XXXXXXXXX",
                        "value": 4
                  }
               ]
            }
      ]
   }
}

Or if you have variable schema ("schemaless") events:

Schemaless events
{
   "user_id": "1",
   "data": {
      "whatever": "bla",
      "whatever2": "bla"
   }
}

{
   "user_id": "1",
   "data": [1, 2, 3]
}

You can simply put the whole event in the root column and parse it as needed:

schema definition
root String `json:$`