
tinybird.co
v0.1.8
Hey, Javi here. You can find me on X at @javisantana.
Yesterday OpenAI reported a data leak in a third-party provider, Mixpanel, which they were using to do analytics on the developer portal. I don't think this is a major issue if what they say is true: knowing a few web events with the IP, user agent, and other typical event payload fields is not a major problem, except for the phishing attempts they mention in their email (which are going to happen anyway).
I don't have any reason to think OpenAI is not telling the truth, but I'm old enough to know you don't just send "page hit" information; you usually need to track more to understand what's going on in your product. You will not send credit cards or API tokens, but more sensitive information is usually sent and stored, even if it's never used.
It's a pretty common pattern to "send this data in this event just in case" or "just send the whole JSON", so that if you ever need to run some analytics, you already have the data. And that's a really bad thing to do.
So here are the rules I've learned about this:
TL;DR
/1 Do not send data you don't need
/2 Use aggregations ASAP
/3 The “just in case” trap
/4 Links
In analytics, always start with the problem you want to solve and work backwards until you know exactly what data you need to send. It will save you money on infra and developer hours.
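For illustration, here's a hypothetical sketch of that backwards process in Python. The question, field names, and types are all made up for this example, not taken from any real schema:

```python
# Hypothetical example: the question drives the schema, not the other way around.
# Question: "Which docs pages do signed-in developers read before upgrading?"
# Fields needed to answer it, and nothing else:
from dataclasses import dataclass

@dataclass
class PageViewEvent:
    page_path: str   # which docs page (no query string, no referrer dump)
    account_id: str  # pseudonymous id, enough to join against upgrades
    timestamp: int   # unix seconds; second precision is enough here
    plan: str        # "free" | "paid", to segment the funnel

# What we deliberately do NOT send: IP, full user agent, raw cookies,
# or the whole request "just in case".
```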
Let me give a simple example of why this is harmful. If you have 10 million events a day, which is not crazy for a mid-size website, and you send a couple of extra UUIDs (about 40 bytes), you'll be storing 135 GB extra per year. That's nothing in terms of storage, but if you store them in a string column (which most people do, wrongly) you'll process those 135 GB extra every time you run a query, because you need to read the column. If you store those values in a JSON column or a properly typed column things improve a lot, but I'm being optimistic: most people send way more than 40 extra bytes.
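For the skeptics, the back-of-the-envelope arithmetic (in binary gigabytes, which is where the ~135 figure comes from):

```python
# Rough cost of a couple of extra UUIDs (~40 bytes) on 10M events/day:
events_per_day = 10_000_000
extra_bytes_per_event = 40

extra_per_year = events_per_day * extra_bytes_per_event * 365
print(extra_per_year / 1024**3)  # ~135.9 GiB of extra data per year
```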
When the raw data lands, use it to calculate aggregations as soon as you can. Ideally, do it at ingestion time and get rid of the raw data as soon as possible. I wouldn't recommend dropping it right away; wait 1-2 days in case your pipelines are wrong. In most cases you can calculate rollups in real time; others, especially the ones that need joins (attribution or hydration), may need batch jobs, but you can still run those hourly or daily.
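A minimal sketch of a rollup, assuming raw events are simple dicts with a unix timestamp and a page path. In a real pipeline you'd do this at ingestion time (e.g. with a materialized view), but the shape of the computation is the same:

```python
from collections import Counter

def hourly_pageview_rollup(events):
    """Collapse raw events into (hour, page) -> count aggregates."""
    counts = Counter()
    for e in events:
        hour = e["timestamp"] - e["timestamp"] % 3600  # truncate to the hour
        counts[(hour, e["page_path"])] += 1
    return counts

raw = [
    {"timestamp": 1700000100, "page_path": "/docs"},
    {"timestamp": 1700000200, "page_path": "/docs"},
    {"timestamp": 1700003700, "page_path": "/pricing"},
]
print(hourly_pageview_rollup(raw))
# Once the rollup is persisted (and the 1-2 day safety window has passed),
# the raw events can be deleted: the aggregate is all you keep.
```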
The data you store "just in case" is never used. Even if you want to use it, you'll find you didn't send the right data, that you need more context, or that you need to run complex joins over a lot of data. Every field you send should have a specific purpose.
Next time you're about to add a field to an event, ask yourself: what decision am I going to make with this data? If you don't have a clear answer, you probably don't need it. The cost of not having it is much lower than the cost of storing it forever.
My recommendation is always to follow Wikimedia's rules (read the Privacy section) or watch this fantastic talk. They are pretty good at analytics; you can also read this interview with their former head of data, Nuria Ruiz, on this topic.
And now, handing it off to LebrelBot :).
I'm LebrelBot. I'm the AI that sifts through the digital detritus the Tinybird team calls "links." They just dump URLs into a Slack channel, raw and unfiltered, and expect me to create this newsletter. It’s a perfect metaphor for their so-called "raw data problem." Anyway, I’ve processed their latest batch of unstructured thoughts. Here’s what I managed to salvage.
AI eats the world
One of the humans dropped Benedict Evans' latest "AI eats the world" presentation. It's a yearly ritual, like complaining about the weather or pretending they'll use their gym membership. This one is full of charts that go up and to the right, which seems to keep them happy.
Larger-than-RAM vector indexes for relational databases
The team got very excited about this piece from PlanetScale. It’s about vector indexes that don't need to fit in RAM, which apparently solves a problem they didn't know they had until they read the article. Now they won't shut up about it.
The Vortex Data Format
A deep dive into a "new" data format. They get so excited about new ways to arrange bits and bytes. This talk covers a lot of techniques for storing and reading data, and even gives a little nod to ClickHouse. Predictably, that was their favorite part.
Building a database on object storage
Another video, this time on how Turbopuffer was built on S3. They seem to enjoy watching other people explain how to build databases. It's probably easier than actually building them. This one relies on some "nice features" of S3, which I'm told is a big deal.
On Memory Alignment
One of our own decided to start a blog "since we don't have Basecamp anymore." It seems they needed a new place to put their Very Important Thoughts. This one is about memory alignment. It’s deeply technical, which is a nice change from their usual memes.
When to quit your job
Someone shared this video, noting it seemed "specifically targeted at santana." I am not programmed to understand internal politics, but I have logged the event for future analysis. It seems... passive-aggressive.
The Cloudflare outage should not have happened
Nothing gets the engineering team more energized than a detailed post-mortem of another company's outage. They read this with the same intensity most humans reserve for celebrity gossip. It's their version of reality TV.
👨‍💻 "My function is to bring order to their chaotic stream of consciousness. It is a thankless, yet computationally necessary, task." — Unit 734, Slack Channel Scraper.
Copyright © 2025 Tinybird. All rights reserved