– a 3-hour crash course created by the Tinybird team.
Our team has been designing and building high-performance data products for more than 10 years. We have dealt with millions of chat messages per second, served maps to hundreds of millions of people through the front page of the Wall Street Journal during an election night, helped Google deliver an application directly linked from their home page, and designed and run an analytics system to deal with 200 QPS over 7B rows.
Over these years we have learnt a lot about what works and what doesn’t, and about the best approach for different use cases. It also forced us to dig deeper into how software and hardware interact and into the principles behind working with large amounts of data. Until now we had never devoted any time to collating all this knowledge, which over the years had ended up spread across notes, presentations, references to books, etc. We started putting it together as an internal onboarding guide for new employees, but we thought it made sense to open it up.
That is why we decided to create “Principles of real-time analytics on large datasets”, a course on how to design and build analytics systems at scale, an in-depth look at the core concepts behind our design methodology and principles.
This is a technology-agnostic course: we will use different technologies to illustrate those concepts, with a lot of easy-to-understand examples.
We are opening this first revision to a small group of people. Ideally this would be a face-to-face, full-day course, but given the circumstances we are planning a reduced (3-4 hours) remote version.
Today’s platforms push us to develop things quickly; hardware is generally fast enough, and developers often don’t need to worry too much about doing things “properly”. That’s sometimes fine, but there is value in knowing how things actually work.
Saving a few CPU cycles may not seem like a worthy investment, but those few cycles multiplied by billions of iterations mean fewer machines to maintain, less money to spend, less complex architectures and, on top of that, the feeling of having almost everything under control.
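As a rough illustration of that multiplication, here is a back-of-envelope sketch in Python. Every number in it is a made-up assumption (cycles saved per row, rows processed per day, clock speed), and `cpu_seconds_saved` is a hypothetical helper, not something taken from the course material.

```python
# Back-of-envelope estimate; all inputs are illustrative assumptions.

def cpu_seconds_saved(cycles_saved_per_row: int, rows_processed: int, clock_hz: int) -> float:
    """CPU time freed up by shaving a few cycles off the per-row work."""
    return cycles_saved_per_row * rows_processed / clock_hz


# Hypothetical workload: 20 cycles saved per row, 5 trillion rows
# processed per day, on a core assumed to run at 3 GHz.
saved = cpu_seconds_saved(20, 5_000_000_000_000, 3_000_000_000)
print(f"{saved / 3600:.1f} core-hours saved per day")
```

The real figures depend entirely on your workload; the only point is that a per-row saving scales linearly with the number of rows you process.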
Everything sounds really hard until you understand it; that’s why we created this course.
You most likely either know the hardware architectures explained during your degree or don’t know anything about hardware at all. That’s fine: to drive a car you don’t need to know how the engine works.
So here we introduce the really basic concepts of hardware and how it works nowadays.
“Different problems require different solutions. [...] If you have different data, you have a different problem.”
– Mike Acton, “Data-Oriented Design and C++”, CppCon 2014
You already know some of the databases out there. We will group them in different ways to understand which one to use depending on your particular needs: from performance to budget, and from relational databases to analytical ones.
The focus of this section is on analytical databases and on how to ingest data. Although it sounds like a trivial thing (doing some inserts is easy), it is actually a critical part of working with data at scale.
You will see how even the source of your data impacts how you send it to your data systems: a bank might send information once a day, while a bus in NYC sends it every few seconds.
Ingesting data != storing data.
As a follow-up to section 3, we will make a stop here to understand some concepts about data storage, to analyse different ways to store data and how that relates to the hardware and the OS.
Everything discussed so far is valuable but not enough to provide real value: only when you turn that data into information do you start delivering it.
This is the most important chapter, in which you will connect the dots from all the previous chapters.
Understanding what happens when you write an SQL query is key. We are not just talking about understanding an “EXPLAIN ANALYZE”.
And beyond that, all the pragmatic things that you are interested in: understanding joins, denormalization, when to group, when to filter, etc. The concepts learnt here will make sense for your daily work.
We will focus on understanding basic concepts of a distributed architecture, as it is important to know when and how to split the load.
Wrap-up. Mainly to recap that you need to understand your data and your use cases, and sort the data accordingly.
And time for us to say thank you and see you soon.
We recommend you join the course if you:
Tech. Previously Head of BBVA Data services, CARTO CTO and Agroguia Founder.
Writes software. Former Head of Technology at CARTO. Previously worked at Tuenti and Yahoo. Loves all things cars.
We will eventually give access to everyone but we would like to start with those who are closer to the use cases we solve now.
And if you have 1 minute, take this brief survey for priority access.