Intro to transforming data


Tinybird transforms your data on-the-fly, either at ingestion or at request time. This guide summarizes how to do it using Pipes and SQL queries.

If you’ve followed the Intro to ingesting data guide, you should have two Data Sources already in your account. Now you’re ready to create Pipes and start exploring and transforming your raw data.

What’s a Pipe?

If you’ve used Jupyter Notebooks, you’ll be familiar with Pipes. A Pipe is a series of transformation nodes, where you can work with your data using SQL and see the results iteratively, and in real time. Tinybird is built on ClickHouse, so all the power of SQL in ClickHouse is available to you.

Each Pipe can have multiple transformation nodes that behave similarly to cells in a notebook. Nodes in a Pipe can refer to the results of previous nodes (think of them as subqueries) or query raw Data Sources directly.

Pipes are a great way to keep your queries clean and short so that they’re easy to maintain and debug. They allow for a smoother experience when working with (big) data—they also help you identify the parts of your queries that perform worst—and, ultimately, make the process much more enjoyable.

You can create new Pipes from the Data Source modal window, the sidebar and your Dashboard view.

Click the + button in the Pipes section of the sidebar to create a new one, and give it a name like ecommerce_example.

Getting started pipe:

By default, you’ll have a ‘Getting Started Pipe’ in your account where we introduce you to some analyses you can make on the NYC Taxi Trip Duration dataset using Tinybird, and to some of the functions available for you to use. You can edit the code in every node, run it and see the results immediately.

Exploring data with the UI

Most of the SQL syntax you already know also works with Tinybird. To get started, create an empty Pipe by clicking on the small “+” sidebar button, and rename the Pipe to “Ecommerce”. After doing so, let’s write a first query to explore what’s in the products Data Source.
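A first exploratory query can be as simple as the following (a sketch, assuming the Data Source is named products as in the ingestion guide):

```sql
-- Peek at a few rows of the products Data Source
SELECT *
FROM products
LIMIT 10
```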

Clicking the run button shows the results of your query below, together with some very useful information: the rows you have processed, the memory you’ve used and the duration of the query.

Apart from the query performance information, you can see quick actions for your transformation node right below your query results.

Now, let’s create another transformation node and do exactly the same with the events Data Source.
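In a new node, the same kind of exploratory query works against events (again a sketch, assuming that Data Source name):

```sql
-- Peek at a few rows of the events Data Source
SELECT *
FROM events
LIMIT 10
```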

Under normal circumstances, you’ll see that querying both Data Sources takes less than 100 ms. Given how fast it is, we recommend spending some time running simple filter or aggregation queries to get a better feel for the shape of your data, such as counting the number of events per day:
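A query like this would do it (assuming events has a DateTime column named date; adjust the column name to your schema):

```sql
-- Count events per day, assuming a DateTime column named `date`
SELECT
    toDate(date) AS day,
    count() AS events_count
FROM events
GROUP BY day
ORDER BY day
```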

or ranking the different user_ids by number of events:
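For example (assuming a user_id column in events):

```sql
-- Rank users by number of events, assuming a `user_id` column
SELECT
    user_id,
    count() AS events_count
FROM events
GROUP BY user_id
ORDER BY events_count DESC
LIMIT 10
```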

Joining Data Sources

Joining Data Sources at scale is one of the most challenging operations in analytical platforms, and Tinybird excels at that. Let’s now enrich the events data with info about the product related to each event by performing a simple JOIN.
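The JOIN might look like this sketch, assuming events has a product_id column that matches an id column in products (the column names here are illustrative):

```sql
-- Enrich each event with the product it refers to
SELECT
    e.*,
    p.title,
    p.price
FROM events AS e
JOIN products AS p ON e.product_id = p.id
```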

As you can see, joining 100M events with 2M products takes around a second or two in Tinybird. To make this query about 100X faster, learn how to use the Join engine in this guide.

Other types of joins:

ClickHouse supports all standard SQL JOIN types.
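For instance, a LEFT JOIN keeps every event even when no matching product exists (same assumed column names as above):

```sql
-- Events with no matching product get empty/default product columns
SELECT
    e.user_id,
    p.title
FROM events AS e
LEFT JOIN products AS p ON e.product_id = p.id
```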

Calculated columns

ClickHouse comes with lots of powerful functions that will let us transform data in multiple ways. With Tinybird, you can use them all, so we encourage you to check out the ClickHouse docs on everything that’s available for you to use. Let’s see a few below.

JSON functions

As you might have seen, the events Data Source has a column containing JSON data for locations and prices. We can easily extract the price and city values into their own columns using the JSONExtract functions.
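A sketch of that extraction, assuming the JSON lives in a String column named extra_data (the column name and keys are illustrative):

```sql
-- Pull `price` and `city` out of a JSON String column,
-- e.g. '{"city": "Madrid", "price": 19.9}'
SELECT
    JSONExtractFloat(extra_data, 'price') AS price,
    JSONExtractString(extra_data, 'city') AS city
FROM events
```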

Working with JSON columns:

ClickHouse comes with a set of functions for working with JSON data, which is especially useful when working with Data Streams or events data.

Date functions

ClickHouse also comes with an extensive list of functions to transform dates and times. As you have seen above, extracting the date part from a DateTime column is easy and fast, and very useful when filtering by time:
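For example, a time filter over the last week might look like this (again assuming a DateTime column named date in events):

```sql
-- Count events from the last 7 days
SELECT count() AS recent_events
FROM events
WHERE toDate(date) >= today() - 7
```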

Enriching data on ingestion

Operations like these can be done on the fly with decent performance, as you can see, but doing them at ingestion time can speed up queries, especially ones that read lots of data. To do this, you can use Materialized Views to calculate data on ingestion, as we explain in this guide.