Ingestion from a notebook


Many people like to work in Jupyter notebooks, including hosted ones such as Colab. Data can be ingested into Tinybird Data Sources from a notebook in several ways. This Colab notebook uses recent updates to Wikipedia as example data to show the ways to ingest data into Tinybird from a pandas DataFrame, plus a bonus option: streaming events directly to Tinybird.

Using the REST API

A straightforward approach is to write the contents of the DataFrame to a file and ingest the file using the REST API.
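As a minimal sketch of the file-based route, the helper below posts a CSV file to the Data Sources endpoint as multipart form data. It assumes the api.tinybird.co host (your region may use a different one), the third-party requests package, and a token that is allowed to create Data Sources. The names datasources_url and ingest_csv_file are hypothetical and are not taken from the notebook.

```python
import urllib.parse

# Assumption: the default API host; some regions use a different host.
TINYBIRD_HOST = "https://api.tinybird.co"


def datasources_url(name, mode="create"):
    """Build the Data Sources API URL for a Data Source name and mode."""
    query = urllib.parse.urlencode({"name": name, "mode": mode})
    return f"{TINYBIRD_HOST}/v0/datasources?{query}"


def ingest_csv_file(path, name, token, mode="create"):
    """Upload a local CSV file as multipart form data (hypothetical helper)."""
    import requests  # third-party; imported here so the URL helper stays stdlib-only

    with open(path, "rb") as csv_file:
        response = requests.post(
            datasources_url(name, mode),
            headers={"Authorization": f"Bearer {token}"},
            files={"csv": csv_file},
        )
    response.raise_for_status()
    return response.json()
```

In a notebook this could follow df_wiki.to_csv("wiki_api.csv", index=False) with ingest_csv_file("wiki_api.csv", "wiki_api", token), where the file and Data Source names are illustrative.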

Alternatively, the data can be kept in memory, converted to an array and ingested using the function ingest_from_array in this notebook.
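The in-memory route can be sketched roughly as follows. This is not the notebook's ingest_from_array function, only an illustration of the same idea under stated assumptions: rows are serialized to CSV text in a buffer and posted in append mode to a Data Source that already exists. The helper names, the default host, and the use of the requests package are assumptions.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize a list of rows to CSV text without touching the disk."""
    buffer = io.StringIO()
    # Quote every non-numeric field so commas inside strings stay in one column.
    writer = csv.writer(buffer, quoting=csv.QUOTE_NONNUMERIC)
    writer.writerows(rows)
    return buffer.getvalue()


def ingest_rows(rows, name, token, host="https://api.tinybird.co"):
    """Append in-memory rows to an existing Data Source (hypothetical helper)."""
    import requests  # third-party

    response = requests.post(
        f"{host}/v0/datasources",
        params={"name": name, "mode": "append"},
        headers={"Authorization": f"Bearer {token}"},
        files={"csv": rows_to_csv(rows)},
    )
    response.raise_for_status()
    return response.json()
```

A DataFrame can be turned into such a list of rows with df_wiki.values.tolist(); missing values may need to be handled before serializing.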

Through the UI

The DataFrame can also be written to a CSV, NDJSON, or Parquet file and uploaded through the UI as a local file.

From the CLI

From within the notebook, you can install the Tinybird CLI (!pip install tinybird-cli) and run tb commands. You can either add the token for your workspace using !tb auth, or set token = 'your token' in a cell and pass it to each command as $token.

Write the DataFrame to a CSV file.

df_wiki.to_csv("wiki_cli_csv.csv", index=False)

The schema for the Data Source can be generated from the CSV file

!tb --token=$token datasource generate wiki_cli_csv.csv

or defined directly in a file.
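For the second option, a minimal sketch of writing the schema file from Python is shown below. The column names and types are hypothetical placeholders, not the columns of df_wiki, and the sketch assumes that a SCHEMA section alone is enough for a basic Data Source definition.

```python
def render_datasource(columns):
    """Render a minimal .datasource body from {column name: column type}."""
    column_lines = [
        f"    `{name}` {column_type}" for name, column_type in columns.items()
    ]
    return "SCHEMA >\n" + ",\n".join(column_lines) + "\n"


def write_datasource(path, columns):
    """Write the rendered schema to a .datasource file."""
    with open(path, "w", encoding="utf-8") as datafile:
        datafile.write(render_datasource(columns))


# Hypothetical columns; replace them with the columns of your own CSV file.
example_columns = {"id": "Int64", "title": "String", "timestamp": "DateTime"}
print(render_datasource(example_columns))
```

Calling write_datasource("wiki_cli_csv.datasource", example_columns) would produce a file with the same name as the one generated by the CLI.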

Then push the schema to Tinybird to create an empty Data Source

!tb --token=$token push wiki_cli_csv.datasource

and append the contents of the CSV file.

!tb --token=$token datasource append wiki_cli_csv wiki_cli_csv.csv

Following the same pattern, the DataFrame can be written to an NDJSON file, with one JSON object per line.

df_wiki.to_json("wiki_cli_ndjson.ndjson", orient="records", lines=True, force_ascii=False)

The schema for the Data Source can be generated from the NDJSON file

!tb --token=$token datasource generate wiki_cli_ndjson.ndjson

or defined directly in a file.

The schema is then pushed to Tinybird to create an empty Data Source and the contents of the file appended.

!tb --token=$token push wiki_cli_ndjson.datasource

!tb --token=$token datasource append wiki_cli_ndjson wiki_cli_ndjson.ndjson

Streaming with high-frequency ingestion

Here events are streamed directly to the Data Source from the Wikipedia stream using high-frequency ingestion. The data is not first written to a pandas DataFrame.

With mode='create', the data types are inferred from the incoming data. For this example data, the inferred schema marks too few columns as Nullable, so some rows go into quarantine. Exploring the automatically created Data Source in the UI and then defining the schema directly, with the extra Nullable columns, solves this issue.
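A schema defined by hand for NDJSON events might be rendered along the following lines. This is only a sketch: the column list is illustrative, the choice of which columns are Nullable is a guess rather than the result of the exploration described above, and the helper names are hypothetical. Each column carries a JSONPath so that nested keys such as those under meta map to their own columns.

```python
def ndjson_column(name, column_type, json_path, nullable=False):
    """Render one schema line with a JSONPath, optionally wrapped in Nullable()."""
    if nullable:
        column_type = f"Nullable({column_type})"
    return f"    `{name}` {column_type} `json:{json_path}`"


def render_ndjson_datasource(columns):
    """Render a schema from (name, type, json_path, nullable) tuples."""
    lines = [ndjson_column(*column) for column in columns]
    return "SCHEMA >\n" + ",\n".join(lines) + "\n"


# Illustrative columns only; check your own data to decide what must be Nullable.
hfi_columns = [
    ("meta_domain", "String", "$.meta.domain", False),
    ("type", "String", "$.type", False),
    ("title", "String", "$.title", True),
    ("comment", "String", "$.comment", True),
]
print(render_ndjson_datasource(hfi_columns))
```

The printed text could be saved as wiki_hfi.datasource before pushing it.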

The schema is pushed to Tinybird.

!tb --token=$token push wiki_hfi.datasource

Note that here the ‘meta’ dictionary is split into many columns, one per key. This differs from the code that built the DataFrame earlier in this example notebook: there, the keys of ‘meta’ became the DataFrame index for each event, and df[df.index=='domain'] selected a single row per event.

The Python code in the notebook sends each event to Tinybird individually but excludes events of type ‘log’. In your own code, make sure you write only the data you need to Tinybird. If you don't need it, don't write it.
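A filter-and-send loop of this kind can be sketched as below. It is not the notebook's code. It assumes the Events API endpoint at api.tinybird.co/v0/events, a token with append permission, and events that arrive as Python dictionaries; keep_event, send_event, and stream_events are hypothetical names.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the default Events API endpoint; your region may differ.
EVENTS_URL = "https://api.tinybird.co/v0/events"


def keep_event(event):
    """Return True for events worth storing; 'log' events are skipped."""
    return event.get("type") != "log"


def send_event(event, datasource, token):
    """POST a single event as one NDJSON line."""
    url = f"{EVENTS_URL}?{urllib.parse.urlencode({'name': datasource})}"
    body = (json.dumps(event) + "\n").encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Authorization": f"Bearer {token}"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def stream_events(events, datasource, token, send=send_event):
    """Send every kept event individually and return how many were sent."""
    sent = 0
    for event in events:
        if keep_event(event):
            send(event, datasource, token)
            sent += 1
    return sent
```

Passing send as a parameter keeps the filter logic testable without network access; in the notebook, events would be the iterator over the Wikipedia stream.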

The full range of ingestion options is available from within a notebook. You can mix and match how you ingest data as your projects develop. The full code for these examples is in this Colab notebook.