Simple Time Series Prediction Modeling Using Tinybird

Time series predictions are one of the most common use cases you can find. Predicting the future, enables you to get ready for it (and act accordingly) so, as you would expect, it is something every company would love to do. Good news is that there are many methods to do it: from sophisticated Machine Learning algorithms or advanced forecasting libraries like prophet to simpler approaches based on simpler statistical foundations, like the one we describe below.

In this case, we are going to use the famous NYC taxi dataset to predict the number of pick-ups for a specific day around the same area.

Importing the data

The data consists of a set of CSV files with "yellow and green taxi trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts".

First thing, we will load the data from 2017 and 2018 using the Tinybird Datasource API. We will be using pick-ups in 2017 to "train" the model and the ones in 2018 to validate it. This will load around 220M records in your account.

Remember that you'll need to grab your DATASOURCE:CREATE from your tokens page

The method

The simplest method we can use is to assume that the number of pickups is exactly the same than one year ago, but taking into account the average growth for the year, in other words:

So let's start by predicting the total pick-ups number for August.

Calculating the average growth

It can be done per day, per week, per month or using any other time range, but we will try with a month. It’s a good balance between having enough samples to avoid outliers, and not too many that it does not pick up the tendency.

Some notes: Instead of using the pipe name, the underscore (_) symbol is used. In Tinybird _ is a shortcut to avoid typing pipe names all the time. It substitutes the pipe name used when instantiating the nyc_taxi_17_18_pipe object. This also lets you reuse queries when changing pipe names, for example, when changing the environment from staging to production. toYYYYMM function returns an integer of a datetime. It could be replaced by tpep_pickup_datetime between ‘2018-03-01’ and ‘2018-03-31’ but it’s shorter and we don’t have to deal with different month lengths.

As a side note, we always run benchmarks and examples on the smallest machine we can get from our infrastructure provider. Better than showing how fast Tinybird can be, we prefer to show you the baseline, which we can always scale up linearly.

console.log(`this query took ${res.statistics.elapsed} seconds on the smallest tinybird account`)

Fetching previous year info

As mentioned above, we are going to calculate the number of pick-ups based on the previous year's data. To get the same day, but one year before, we will use the addYears function:

In order to match the day of the week, we are going to use a nice trick from Xoel López. It involves getting the date 364 days before, instead of 365 (this will not work on a leap year but let's simplify to check if the model works).

Now, let's evaluate how our model performs by calculating it for every day and comparing the predicted value with the actual one (2018).

Not bad! almost everything under 10% can be considered pretty good taking into account the simplicity of our model.

The final function

Next, once we have checked that the model works, we should create a function to predict data for a given day. It would look like

Some comments: In this case we are using addDays(date, -364) instead of toDayOfYear((tpep_pickup_datetime)) - 1, simpler and more elegant. In order to avoid problems when there is no data, we added throwIf(prev_count == 0) which raises and exception and stops the query. We can also use first(sql) instead of json(sql) to fetch the first row. Lots of stuff can be done in order to speed up the query, but we will talk about that in a future post.

Quick wins to improve the model

There are some quick wins that could be added, for example:

Take into account special dates like July, 4th or Dec, 25th. As outliers, they deserve some special treatment. Actually, it would be better to analyze when the model is not working properly, analyze why and fix the model for that.
Calculate the growth ratio using all the data up to date, instead of just using the last month.

Other interesting analysis

Knowing the number of pickups is not really interesting, it would have more value if we would do a more specific analysis like calculating the number of pick-ups per taxi zone per hour, the estimated fare rate per taxi zone per hour, or the estimated number of people who will be taking a taxi in an area at a specific time. Give them a try! We will cover those in following blog posts.