Apr 21, 2023

Generate mock data schemas with GPT

Today, we announce the release of our in-product, AI-powered demo data generator. Building off our recent release of Mockingbird, Tinybird now makes it possible to generate mock data schemas based on GPT prompts. Write your first mock data prompt today in the Tinybird UI.
Kike Alonso
Product Manager

If you’re a developer working with data, it can be hard to build new features, test your assumptions, and validate your code unless you have access to the kind of data you expect to deal with in production.

But working with production data can be tough. Maybe the data has personally identifiable information (PII) and you can’t use it for compliance or security reasons. Perhaps the data is inaccessible, or not in the required format. Maybe you need streaming data, but all you have is a CSV file.

But still, you need data to build.

That's where mock data streams come in. Mock data streams are generated synthetically, and they mimic the characteristics of the real-world event data you’d expect to deal with in a production environment. Mock data streams have a lot of benefits: They’re usually free, always safe, and can typically be generated in whatever format you need.

But generating mock data streams for your data project can be a real pain. If you’re like me, your GitHub account is riddled with one-off Python or bash scripts full of duplicated code to generate mock data streams for your latest data project.

Yesterday, we launched Mockingbird, the free, open source library to generate mock data streams for your data projects. Mockingbird makes it easy to define a data schema in JSON, preview your schema, and start sending data to Tinybird (or external destinations). We’ve even included some sample schemas to help you get started. You can read more about the Mockingbird launch here.

Mockingbird is great for generating mock data, and it’s great for testing your real-time use cases in Tinybird. With the mock real-time data generated by Mockingbird, you can more confidently build things like build web analytics, real-time personalization, smart inventory management, and much more. Tinybird then makes it easy to turn your real-time data into REST APIs so that even the most complex use cases are faster to build.

Mockingbird lets you stream mock data to Tinybird, but you have to manually build custom schemas. Wouldn't it be nice to let generative AI do it for you?

When you build production-grade real-time applications in Tinybird, however, you might need something more than just the schema templates available in Mockingbird. You can always create a custom schema, but that can be time-consuming.

Wouldn’t it be nice if some generative AI could do it for you?

AI presents a tremendous opportunity for developers and data teams

Artificial intelligence (AI) has long been a mainstay in the data landscape. It’s the foundation of data science efforts and has contributed significantly to the ways that organizations analyze, process, and understand the data they generate.

Beyond data science, however, generative AI now allows us to build new things simply by writing natural language prompts. AI has proliferated widely across the entire tech landscape, and data generation is no exception. AI can and should be used to generate the mock data we all need to build new features with data.

We can use tools like OpenAI to help us generate mock data streams. Through careful prompt engineering, we can quickly produce data schemas that match real-world conditions. All we need to do is describe those real-world patterns and structures to the AI, and it can use the prompt to generate structured or unstructured mock data schemas that replicate those real-world patterns and structures.

We can then use these AI-generated schemas to create mock data streams for testing, validation, and development.

Announcing Tinybird’s AI-powered synthetic data generator

Today, Tinybird announces AI-powered Demo Data, a new feature in the Tinybird UI to generate mock data schemas using generative AI.

Generate mock data schemas with a natural language prompt.

With AI-powered Demo Data, you can quickly spin up mock data streams for your Tinybird projects using nothing but a simple natural language prompt.

With AI-powered Demo Data, you can quickly spin up mock data streams for your Tinybird projects using nothing but a simple natural language prompt.

This new feature builds on our release of Mockingbird, and takes it a step further by shifting the burden of schema generation off of your plate and onto AI. Now, instead of building your schemas by hand or using one of our pre-existing schema templates, you can generate mock data streams based on whatever schema you dare to prompt.

Just imagine it, prompt it, and stream it.

Read on for more information about how we built AI-powered Demo Data. In the meantime, you can start using it by signing up for Tinybird. AI-powered Demo Data is available in all pricing plans, including our extremely generous Free Tier. You can also join our community Slack and ask any additional questions you may have about AI-powered Demo Data, Tinybird, or real-time data.

How we built AI-powered Demo Data

This new feature combines the library made available through Mockingbird with OpenAI’s new API.

AI-Powered Demo Data uses Mockingbird and OpenAI's GPT-3.5 to generate a schema for a Tinybird Data Source, then stream mock data to the data source based on that schema.

To generate an effective prompt for GPT, we fuse the submitted description in the Tinybird UI with a set of internal prompts. Combined, these form a prompt that we then send to GPT 3.5 using the OpenAI API. The prompt describes to GPT the exact schema that Tinybird would need to generate a new Data Source, including the number of columns and the Data Types for those columns.

Once the schema is confirmed, we use the Mockingbird library to generate a new Data Source and stream mock event data to it using the Tinybird Events API.

How to use the AI-powered Demo Data

To start generating mock data in Tinybird using an AI prompt, simply add a new Data Source in the Tinybird UI as you always would.

However, instead of selecting an external Data Source, scroll down to “Use one of our streaming data samples”, select “Create your own”, and write your prompt.

When you click "Add new Data Source" in Tinybird, you'll now see the option to "Create your own" from a GPT prompt.
Simply provide a prompt to generate a schema...

Tinybird will then use GPT to generate a table schema for your Data Source based on your prompt. Once you’re happy with it, click “Confirm and ingest”, and start streaming into your new Data Source for 10 minutes.

Tinybird automatically generates mock data schemas for testing using AI

Best practices for writing prompts to generate mock data schemas

To generate mock data with Tinybird's AI-powered Demo Data generator, you’ll need to write a prompt that gives a short description of the Data Source schema that you want to generate.

The prompt should include information about the kind of data you want to generate, the columns you want to include, the data types you’d like the columns to utilize, and any other relevant details that describe the real-world conditions you’re trying to replicate.

Your prompt should include information about the kind of data you want to generate, the columns you want to include, the data types you’d like the columns to utilize, and any other relevant details that describe the real-world conditions you’re trying to replicate. 

Here are some examples of prompts you could use with the actual results generated in Tinybird:

  • Generate a schema of customer transactions for an e-commerce website. Include columns for the customer's name, email, order number, product name, quantity, and price. The order number should be a UInt32 and the price should be a Float.
Sometimes it helps to explicitly define data types if you need that precision.
  • Create a schema for a weather application that includes 10 columns of various weather data.
You can tell Tinybird exactly how many columns you want in your schma.
  • Generate a schema of tweets related to Tinybird’s Mockingbird launch. Include 20 columns based on the Twitter API Tweet fields.
Tinybird will even generate real-world schemas. Above, the schema includes actual tweet fields from the Twitter API.
  • Generate a schema of financial transactions to be used for fraud detection that can detect potential fraud by location, time of day, and purchase amount among other commonly used columns used for financial transactions.
Or, you can just be conversational with it, and allow Tinybird to do the heavy lifting.

Bear in mind that GPT can “have a mind of its own”, and some efforts to generate a viable schema will fail. This is a beta feature, and we're open to feedback! If you have any issues, questions, or concerns, please let us know in our Slack Community. We value your feedback as we continue to unlock the possibilities available to us through AI.

Start generating mock data with AI today

Tinybird’s AI-powered Demo Data generator is available today. Login to Tinybird, add a Data Source, and start generating AI-powered demo data for your next project.

If you’re not yet a Tinybird customer, you can sign up here. The Tinybird Build Plan is free forever, with no time restrictions, no credit card required, and generous limits. If you need a little more, use the code TINYGPT for $300 off a Pro subscription.

Also, feel free to join the Tinybird Community on Slack and ask us questions or request additional features.

And, if you’re keen to learn more about the AI-powered synthetic data generator, join our Release Round-up at the end of this week. We’ll cover all the new features released this week plus we’ll give away some amazing Tinybird swag from the new Tinybird Shop. You can sign up to be notified when the Release Round-up starts.

Do you like this post?

Related posts

A free and open source mock data generator for your next data project
Real-time data platforms: An introduction
Real-Time Data Ingestion: The Foundation for Real-time Analytics
Real-time streaming data architectures that scale
Jul 21, 2023
Automating data workflows with plaintext files and Git
Build a real-time dashboard in Python with Tinybird and Dash
A practical guide to real-time CDC with Postgres
Designing and implementing a weather data API
Why iterating real-time data pipelines is so hard
Iterate your real-time data pipelines with Git

Build fast data products, faster.

Try Tinybird and bring your data sources together and enable engineers to build with data in minutes. No credit card required, free to get started.
Need more? Contact sales for Enterprise support.