What is a data product?

I recently stumbled across a guy named Dustin Phillips on YouTube. He’s the current drummer for The Ataris and an insanely talented musician. He publishes shorts in which he records each track of a famous song and packages them all into beautiful, mobile-optimized video mosaics, so you can visually engage with each track as part of the larger musical experience (and so he can showcase his incredible musical range).

Here he is covering a portion of Jimmy Eat World's The Sweetness.

I’ve recently been debating the concept of a “data product” with some of my colleagues, and have been grasping for a metaphor that would help communicate the finer points. Finally, with a little inspiration from Mr. Phillips, I believe I have it:

A ‘data product’ is like a song.

To produce a compelling and marketable song, musicians bring their talents together to first record a bunch of individual tracks: drums, bass, guitar, vocals, cowbell, padding, and so on.

On their own, each track does not have much value, but together they create a song. Of course, the initial sound is pretty raw, akin to a live performance, and so there’s usually more work to do.

Mixing, editing, and mastering are integral to producing a studio-quality song. Sound techs modify the raw tracks by cleaning up background noise, equalizing, and making all the tracks fit together within an audio-spectral landscape.

[FL Studio screenshot] Raw tracks come together with editing, mixing, and mastering to make a song.

When the final song is published, it’s accompanied by some information that helps describe it: Title, credits, length, content warnings, genre, etc. This makes it easy for listeners to search for and find the song on media platforms. It might be published as a single, or packaged together with other songs to produce an album.

Anybody can buy the song or album through various channels: as a CD or vinyl in a shop, as a digital download from iTunes, or as a stream on Spotify. The band doesn’t (necessarily) care about your media preference. They record, mix, and master the song once so it can be consumed over and over again through a variety of media and by a variety of consumers.

So why does a band go to a studio to produce a song? Well, generally because they want to sell it. They want to fill the tour van with gas and buy that custom shop Les Paul, and recording a digital track is a lot easier (and less mundane) than selling merch at every tour stop. The song is a product, intended to be packaged and distributed en masse. Certainly, recording a song in a professional studio is very expensive, but if it’s recorded well, distributed well, and received well, the band earns high multiples on its one-time investment.

A data product is exactly like a song.

To create a data product, data producers (backend developers, data engineers) first gather raw data sources. These could be time-series streams, dimensional tables, metadata, etc., and they are akin to “tracks”.

Like solo tracks, raw data sources aren’t always very valuable on their own. But we can multiply their value by joining them with a lookup table (enriching), getting rid of junk values (cleaning), or turning them into a more usable format (transforming).
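To make the cleaning, enriching, and transforming steps a bit more concrete, here is a minimal sketch in plain Python. Everything in it is made up for illustration: the `raw_plays` rows, the `song_lookup` table, and the `clean`, `enrich`, and `transform` helpers are hypothetical names, not part of any real pipeline or library.

```python
# Hypothetical raw "tracks": a stream of plays and a lookup table of songs.
raw_plays = [
    {"song_id": "s1", "listener": "ana", "seconds": "212"},
    {"song_id": "s2", "listener": "ben", "seconds": ""},     # junk: missing duration
    {"song_id": "s9", "listener": "cy", "seconds": "180"},   # junk: unknown song id
    {"song_id": "s1", "listener": "dee", "seconds": "95"},
]
song_lookup = {
    "s1": {"title": "The Sweetness", "artist": "Jimmy Eat World"},
    "s2": {"title": "Example Song", "artist": "Example Band"},
}


def clean(rows):
    """Cleaning: drop rows whose duration is missing or not a whole number."""
    return [row for row in rows if row["seconds"].isdigit()]


def enrich(rows, lookup):
    """Enriching: join each row with the lookup table, skipping unknown ids."""
    return [
        {**row, **lookup[row["song_id"]]}
        for row in rows
        if row["song_id"] in lookup
    ]


def transform(rows):
    """Transforming: replace the string 'seconds' with numeric 'minutes'."""
    result = []
    for row in rows:
        converted = {key: value for key, value in row.items() if key != "seconds"}
        converted["minutes"] = round(int(row["seconds"]) / 60, 2)
        result.append(converted)
    return result


# The three steps chained together produce one usable table.
plays = transform(enrich(clean(raw_plays), song_lookup))
```

In this toy run, two of the four raw rows survive: the one with the missing duration and the one with the unknown song id are dropped. A real pipeline would likely log or quarantine those rows rather than silently discard them.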

Once data producers are happy with the “tracks”, they bring them together as a single table that tells the full story. The table is the “song”.

The table is also accompanied by information that describes it and makes it discoverable: schema, table name, who built it, who owns it, when it was updated, the domain of the data, what the sources are, etc. It also might be packaged with other data products to produce a data set, just like an album of songs.
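As a rough sketch of what that descriptive information could look like, here is a small Python record plus a toy search over a catalog. The `DataProductInfo` class, its field names, the `search` helper, and all the sample values are assumptions made up for this example; real data catalogs have their own formats.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DataProductInfo:
    """Hypothetical 'liner notes' for a data product."""
    name: str
    owner: str
    built_by: str
    domain: str
    updated: date
    schema: dict          # column name -> type name
    sources: tuple = ()   # upstream raw "tracks"


# A made-up catalog with two entries.
catalog = [
    DataProductInfo(
        name="song_plays",
        owner="music-analytics",
        built_by="data-engineering",
        domain="music",
        updated=date(2024, 1, 15),
        schema={"title": "string", "listener": "string", "minutes": "float"},
        sources=("raw_plays", "song_lookup"),
    ),
    DataProductInfo(
        name="tour_merch_sales",
        owner="finance",
        built_by="data-engineering",
        domain="retail",
        updated=date(2024, 2, 1),
        schema={"item": "string", "amount": "float"},
        sources=("pos_exports",),
    ),
]


def search(entries, term):
    """Return the names of products whose name or domain contains the term."""
    needle = term.lower()
    return [
        entry.name
        for entry in entries
        if needle in entry.name.lower() or needle in entry.domain.lower()
    ]


matches = search(catalog, "music")
```

The point is only that the description travels with the table, so a consumer can find `song_plays` by searching for "music" without knowing who built it.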

Like musicians in a band, data producers create these tables to “sell” them. The table is the product. Data consumers (developers, analysts, executives) have different channels to “buy” the table: a JDBC connection, an API endpoint, a Kafka stream of changes, a CSV file. It doesn’t matter how it’s consumed; it’s a single product. Data producers create a data product once so that data consumers and the organization at large can earn high multiples on the one-time investment of creating it.
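Here is a minimal sketch of the "one product, many channels" idea using only the Python standard library. The `table` rows and the `as_csv` and `as_json` helpers are hypothetical; a real setup would serve the same table over JDBC, an API, or a Kafka stream rather than in-memory strings.

```python
import csv
import io
import json

# One hypothetical data product: built once, as a list of rows.
table = [
    {"title": "The Sweetness", "listener": "ana", "minutes": 3.53},
    {"title": "The Sweetness", "listener": "dee", "minutes": 1.58},
]


def as_csv(rows):
    """Channel 1: render the table as CSV text for a file download."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def as_json(rows):
    """Channel 2: render the same table as JSON for an API response."""
    return json.dumps(rows)


csv_payload = as_csv(table)
json_payload = as_json(table)
```

Both payloads come from the same `table`, so consumers on different channels see the same rows. Only the packaging changes, much like a CD and a stream of the same master recording.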

Why productize data?

Just like the individual talents of musicians in a band, data is an asset with latent value. And also like musical talents, when data comes together with other data harmoniously, the end product is more valuable than the sum of its parts.

But recording a quality song in a professional studio is expensive, and so is creating a data product.

You need good data engineers, good tooling, and plenty of time. If you do it right, there’s a significant upfront cost to create data that data consumers can reliably and repeatably use. If you take shortcuts, you’ll end up with a bad product with negative value. Still, taking the time and effort to produce a valuable data product makes it discoverable and consumable by the people who could benefit from it, both within and outside your company.

Discoverable data is better than custom data

When you want music at your house party, do you hire a band? Or do you just connect your iPhone to the stereo and stream your favorite tunes? Ninety-nine times out of a hundred you’ll choose the latter. And when you choose the former, ninety-nine times out of a hundred you wish you hadn’t. These days you’ll spend a couple grand for a half-decent cover band when you could spend $15 a month to stream the real thing a thousand times over. Plus you don’t take the risk that the drummer will puke on the rug.

The same holds for data products. In nearly every scenario, a custom data pipeline isn’t needed. Building the same data pipeline again and again with minor tweaks does not add value; it is pure cost. And because it burdens resource-constrained engineers, all too often it isn’t even done well.

A discoverable data product is just as valuable as, if not more than, a custom pipeline, and you’ll spend a lot less consuming it. Not to mention the benefit of consistent experience and little risk of a Kafka stream puking on the rug…

By productizing data, we provide a consistent, reliable, repeatable data experience for all of our data consumers. We build once, vastly reducing cost, and we spend saved time on the things that add value: quality, availability, discoverability, reliability, and Les Pauls.

It’s worth remembering that this is a journey, and no one has perfected it yet. You won’t always get it right the first time (how many remixes of songs are out there?). Just like the best musicians release remastered compilations of their greatest hits, the best data engineers listen to their consumers, take feedback, iterate, and improve their data products until they go platinum.
