Jan 19, 2021

Querying large CSVs online with SQL

Use Tinybird to query large CSVs using SQL, right from your browser.
Xoel López
Founder at TheirStack

Suppose you have a very large CSV file, and you’d like to extract some insights from it. You could use CLI tools like csvkit, clickhouse-local or q, but maybe you don’t want to install another program to run a simple query, or you want a tool that is more visual or interactive.

Using something like Excel is also not an option, as the experience becomes painful once your CSV grows to tens of thousands of rows or more. Some websites will let you upload a larger CSV, but they can be clunky and slow and likely won’t support the upload at a certain file size.

Tinybird lets you…

  • Upload CSVs in seconds from your computer or a remote URL (e.g. S3 bucket, Dropbox, Google Cloud Storage, etc...).
  • Write SQL queries in your browser and performing compute-intensive operations like joins and aggregations on the data in milliseconds (it uses ClickHouse under the hood).
  • And share the results with other people via snapshots or dynamic API endpoints.

Querying a CSV file online with SQL

Let’s take, for example, this repo containing Crunchbase data about startup investments from 2015 available in a nice and clean CSV format. Particularly, we'll use the investments.csv file and make some queries on it.

As you can see from the above screencast, it takes less than a minute to upload and query a CSV using SQL in Tinybird.

Want to try this yourself? Follow these steps:

  1. Create a free Tinybird account here.
  2. Navigate to your Tinybird dashboard and click "Add Data Source".
  3. Choose "Remote URL" as your Data Source (or if you're using a local file, select "File Upload".
  4. Copy the URL of the CSV file and paste it into the Tinybird UI. Click "Add".
  5. Tinybird will analyze the file and automatically infer the column data types. Modify your schema if you'd like and click “Create Data Source”. Tinybird will ingest the CSV file into a Data Source table.
  6. Click “Create Pipe” too start building your SQL queries. You can use Tinybird Pipes to break your queries into smaller, simpler nodes. Each subsequent node can query over prior nodes. This keeps things clean and simple and avoids nested CTEs and the like.
Tinybird also supports ingestion of NDJSON and Parquet files.

Here’s the query I made in the video in case you want to use it as a starting point.

Joining two CSV files

The previous dataset doesn’t contain specific data about countries. So if you wanted to know, for example, the ratio of startup funding received for a country versus the population of of that country, you’d have to join the data from the file you just uploaded with another one like this.

Here we get the latest data available for each country:

And here we join the investments data with the country data:

You will need to give each node an alias when you join them.

Creating API endpoints from CSV data

Tinybird makes it easy to publish the results of your SQL queries as HTTP endpoints in a click, with options for returning the data as CSV, JSON, NDSJON, or Parquet. Just click “Create API Endpoint”, select the node you want to publish, and you’ll have your endpoint.

If you're new to Tinybird, you can sign up for free. The Tinybird Build Plan is free forever, with no time limit and no credit card required. You can store up to 10 GB of data and query your CSVs as much as you want. Once you publish APIs from your queries, the Build Plan includes up to 1,000 requests per day free of charge.

Do you like this post?

Related posts

Iterating terabyte-sized ClickHouse tables in production
Tinybird has joined the AWS ISV Accelerate Program
Jun 05, 2024
Run analytics on files in Amazon S3 the easy way
Jul 20, 2023
Publish SQL-based endpoints on NGINX log analysis
How to query Google Sheets with SQL in real time
More Data, More Apps: Improving data ingestion in Tinybird
Operational Analytics in Real Time with Tinybird and Retool
Tinybird connects with Confluent for real-time streaming analytics at scale
Jul 18, 2023
Simplifying event sourcing with scheduled data snapshots in Tinybird
Low-latency APIs over your BigQuery datasets

Build fast data products, faster.

Try Tinybird and bring your data sources together and enable engineers to build with data in minutes. No credit card required, free to get started.
Need more? Contact sales for Enterprise support.