Principles of real-time analytics on large datasets

Understanding what it takes to query billion-row datasets in under 100 ms

– a 3-hour crash course created by the Tinybird team.

Access

~3 hours of Video Content

Free edition

Get your Course Now

About the course

Our team has been designing and building high-performance data products for more than 10 years. We have dealt with millions of chat messages per second, served maps to hundreds of millions of people through the front page of the Wall Street Journal on an election night, helped Google deliver an application linked directly from their home page, and designed and run an analytics system handling 200 QPS over 7B rows.

During all these years we learned a lot about what works and what doesn't, and about the best approach for different use cases. It forced us to dig deeper into how software and hardware interact and into the principles behind working with large amounts of data. Until now we had never devoted any time to collating all this knowledge, which over the years we had spread across notes, presentations, references to books, and so on. We started putting it together as an internal onboarding guide for new employees, but we thought it made sense to open it up.

That is why we decided to create “Principles of real-time analytics on large datasets”, a course on how to design and build analytics systems at scale and an in-depth look at the core concepts behind our design methodology and principles.

This is a technology-agnostic course: we will use different technologies to illustrate these concepts with plenty of easy-to-understand examples.

We are opening this first revision to a reduced group of people. Ideally this would be a face-to-face, full-day course, but given the circumstances we are planning a shorter (3-4 hour) remote version.

Course outline

  • 0 - Why this course?

    Today’s platforms push us to develop things quickly: hardware is generally fast enough, and developers often don’t need to worry too much about doing things “properly”. That is sometimes fine, but there is value in knowing how things actually work.

    Saving a few CPU cycles does not seem like a worthy investment, but those few cycles multiplied by billions of iterations mean fewer machines to maintain, less money to spend, less complex architectures and, on top of that, the feeling of having almost everything under control.

    Everything sounds really hard until you understand it; that’s why we created this course.

  • 1 - Intro to modern computer hardware

    You most likely either remember the hardware architectures explained during your degree or don’t know anything about hardware at all. That’s fine: to drive a car you don’t need to know how the engine works.

    So let’s introduce some really basic concepts about hardware and how it works nowadays.

    • Different kinds of data storage
    • Memory speed
    • CPU speed
    • The OS
    • The cloud
  • 2 - Using the right database for the job

    "Different problems require different solutions.[...] If you have different data, you have a different problem.”

    – CppCon 2014: Mike Acton Data-Oriented Design and C++

    You already know some of the databases out there; we will group them in different ways to understand when to use which, depending on your particular needs: from performance to budget, and from relational databases to analytical ones.

  • 3 - Source data and ingestion

    The focus of this section is on analytical databases and on how to ingest data. Although it sounds like a trivial thing (doing some inserts is easy), it is actually a critical part of working with data at scale.

    You will see how even the source of your data impacts how you send it to your data systems: a bank might send information once a day, while a bus in NYC sends it every few seconds.

    • Different sources
    • Formats: not every format is the right tool for the job
    • Different kinds of datasets
    • Ingestion patterns: incremental, replace, partial replace, streaming (a short SQL sketch of these follows this list)
    • Other general recommendations
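
    The items above cover batch and streaming flows. As a rough, hedged illustration of the batch patterns, here is a minimal SQL sketch (table and column names are hypothetical, and the partition swap uses ClickHouse syntax purely as an example):

      -- Incremental: append new rows as they arrive (e.g. NYC bus positions).
      INSERT INTO bus_positions (bus_id, lat, lon, reported_at)
      VALUES (42, 40.7128, -74.0060, now());

      -- Replace: rebuild the whole dataset from a periodic export (e.g. a daily bank file).
      TRUNCATE TABLE balances;
      INSERT INTO balances SELECT * FROM balances_daily_export;

      -- Partial replace: swap only the affected chunk of data (ClickHouse-style partition swap).
      ALTER TABLE bus_positions REPLACE PARTITION '2020-05-01' FROM bus_positions_staging;

      -- Streaming ingestion usually arrives through a broker (e.g. Kafka) rather than plain SQL inserts.
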
  • 4 - Data Storage

    Ingesting data != storing data.

    As a follow-up to section 3, we will stop here to understand some concepts about data storage: different ways to store data and how they relate to the hardware and the OS. (A small example table definition follows the list below.)

    • Schemas
    • Stop to think about network/HD speed vs CPU
    • Compression techniques & types: making it easy with codecs
    • Picking the right data types, with a special mention of NoSQL
    • Indices for large datasets
    • A small intro to data views
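
    As a small taste of how these pieces fit together, here is a hedged sketch of a table definition in ClickHouse syntax (our assumption for the example; the schema, codecs and sorting key are illustrative, not a recommendation):

      CREATE TABLE bus_positions
      (
          -- Pick the narrowest type that fits the data.
          bus_id      UInt32,
          route       LowCardinality(String),        -- dictionary-encodes repeated values
          passengers  UInt8,
          reported_at DateTime CODEC(Delta, LZ4),    -- timestamps compress well as deltas
          lat         Float32,
          lon         Float32
      )
      ENGINE = MergeTree
      PARTITION BY toYYYYMM(reported_at)  -- coarse partitions help with lifecycle and replace operations
      ORDER BY (route, reported_at);      -- the sorting key doubles as a sparse primary index
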
  • 5 - Querying

    Everything discussed so far is valuable but not enough on its own: only when you turn that data into information do you start delivering real value.

    This is the most important chapter: here you will connect the dots from all the previous chapters.

    Understanding what happens when you write an SQL query is key. We are not just talking about understanding an “EXPLAIN ANALYZE”. 

    And beyond that, all the pragmatic things that you are interested in: understanding joins, denormalization, when to group, when to filter, etc. The concepts learnt here will make sense for your daily work.

    • General concepts.
      How analytics databases work, data ordering tricks and tips, calculating query times and memory scanning techniques.
    • Writing queries.
      Querying just the data you need; joins on-the-fly vs denormalization; lightweight and indexed operations first; leveraging statistics to speed things up; what can and cannot be parallelized; using the right algorithm and the best function for each job; and trading memory for CPU.
    • View generation.
      Why we should do it (saving money and speeding things up), denormalization on ingest, view granularity, and incremental view generation. Special kinds of views. (A short example follows below.)
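
    To make the pre-aggregation idea concrete, here is a hedged, ClickHouse-flavored sketch over the same hypothetical table: the first query scans a month of raw rows, while the materialized view lets the same question be answered from far fewer, already-aggregated rows.

      -- Raw query: filter early and read only the columns you need.
      SELECT route, count() AS trips
      FROM bus_positions
      WHERE reported_at >= '2020-05-01' AND reported_at < '2020-06-01'
      GROUP BY route;

      -- Pre-aggregate on ingest so the question above becomes a much smaller scan.
      CREATE MATERIALIZED VIEW trips_per_route_daily
      ENGINE = SummingMergeTree
      ORDER BY (route, day)
      AS SELECT route, toDate(reported_at) AS day, count() AS trips
         FROM bus_positions
         GROUP BY route, day;

      SELECT route, sum(trips) AS trips
      FROM trips_per_route_daily
      WHERE day >= '2020-05-01' AND day < '2020-06-01'
      GROUP BY route;
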
  • 6 - How to size a cluster

    We will focus on understanding basic concepts of a distributed architecture, as it is important to know when and how to split the load.

    • Sharding & replication
    • Ingestion vs Querying
    • SLA
  • 7 - Misc and takeaways

    Wrap-up. Mainly to recap that you need to understand your data and your use cases, and sort your data accordingly.

    And time for us to say thank you and see you soon.

Requirements

We recommend you join the course if you:

  • Deal with large quantities of data in your daily job
  • Are familiar with SQL (you don't need to be an expert)
  • Are a developer, DBA, or in any other data-related role
  • Are curious about how things work :)

About the course team

Javier Santana

Tech. Previously Head of BBVA Data services, CARTO CTO and Agroguia Founder.

Raul Ochoa

Writes software. Former Head of Technology at CARTO. Previously worked at Tuenti and Yahoo. Loves all things cars.