As a data engineer, I have delved deeply into the world of fast ETLs, event-driven architectures, and petabyte scale as I helped the companies where I worked get more value out of their data. But amidst the hustle and bustle, I found that the key to getting anything meaningful out of your data outweighs concerns about speed, scale, and cost-efficiency. The true predictor of data value? Data quality.
Data Quality Assurance (Data QA) plays a critical role in so many important tasks across a business, including business intelligence, machine learning, enterprise security, cost management, and so much more. A series of processes and activities that aim to identify and rectify any issues or anomalies in the data, Data QA ensures the accuracy, completeness, and reliability of data so that it is trustworthy and useful.
High quality data begets better collaboration, clearer communication, better stakeholder relationships, and healthier organizations overall. Conversely, companies without Data QA suffer faulty analysis, bad decisions, and frayed relationships with both internal stakeholders and external customers.
In this blog post, I’ll describe 5 core criteria for data quality, and offer some SQL tests to check them in your data QA pipelines.
What makes for high quality data?
There are 5 widely accepted criteria upon which Data QA programs can be measured:
- Accuracy. The extent to which your data depicts real-world entities, occurrences, or trusted references.
- Completeness. The extent to which data that can feasibly be captured is not null.
- Consistency. The degree of alignment or uniformity between your dataset and a reference dataset.
- Uniqueness. The absence of any duplicate objects or events within your datasource.
- Validity. How closely a data value aligns with expected or predetermined values.
Building data quality checks
If you’re going to implement a Data Quality Assurance program, you’ll need routine checks for these 5 criteria regardless of where your data resides. To help you get started, I’ll show you 5 example SQL tests, one for each criterion. These tests were written for the well-known NYC Taxi dataset. You can deploy these snippets wherever you test for data quality and modify them for your specific data sets.
Writing the tests
Below you’ll find SQL to test for each of the 5 data quality criteria based on this sample data. If you’d like to follow along in Tinybird, check out the repo for this project. If you plan to reuse these queries elsewhere, you can just copy the SQL snippets and modify them in your destination.
For each test, you’ll re-use some common statistics, including the total number of taxi rides, the average fare for taxi rides, the standard deviation of fares, and the latest day of data. You can pre-calculate these metrics once up front instead of adding subqueries to each subsequent check.
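As a minimal, runnable sketch, here is what that shared-statistics query can look like, demonstrated with SQLite in Python. The tiny `trips` table and its rows are stand-ins for the NYC Taxi data; in ClickHouse (which Tinybird runs) you would reach for `stddevPop()` and `toDate()` instead of the workarounds below.

```python
import math
import sqlite3

# Tiny in-memory stand-in for the NYC Taxi "trips" table.
con = sqlite3.connect(":memory:")
con.create_function("sqrt", 1, math.sqrt)  # SQLite has no built-in sqrt()
con.execute("CREATE TABLE trips (pickup_datetime TEXT, fare_amount REAL)")
con.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [("2023-01-01 09:00:00", 10.0),
     ("2023-01-01 10:00:00", 14.0),
     ("2023-01-02 11:00:00", 12.0)],
)

# Shared statistics reused by every check: total rides, mean fare,
# population standard deviation of fares, and the latest day of data.
total, avg_fare, stddev_fare, latest_day = con.execute("""
    SELECT
        count(*),
        avg(fare_amount),
        sqrt(avg(fare_amount * fare_amount) - avg(fare_amount) * avg(fare_amount)),
        max(date(pickup_datetime))
    FROM trips
""").fetchone()

print(total, avg_fare, round(stddev_fare, 3), latest_day)  # → 3 12.0 1.633 2023-01-02
```

Here the values are simply captured as Python variables; in your warehouse you would persist them via whichever mechanism the note above suggests.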
Depending on your database, how you store these statistics for re-use may change; you may need to use a materialized view, stored procedure, or macro, whichever is appropriate.
Test 1: Accuracy
To measure accuracy of your data, you can borrow the concept of six sigma quality from manufacturing. Any value that is more than six standard deviations from the mean is considered inaccurate, and the measure of accuracy will be the percentage of records whose value falls within six “sigmas”.
This query counts the number of records where the absolute difference between the record fare price and the average fare price is greater than 6 standard deviations from the fare price mean, and presents this count as a percentage of total records.
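A check of that shape can be sketched with SQLite in Python as follows. The handful of rows (with one deliberate outlier) and the hardcoded `avg_fare` and `stddev_fare` are stand-ins for the real dataset and the pre-computed statistics.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE trips (fare_amount REAL)")
# Ordinary fares plus one deliberate outlier (100.0).
con.executemany("INSERT INTO trips VALUES (?)",
                [(10.0,), (14.0,), (12.0,), (100.0,)])

# Stand-ins for the statistics pre-computed over the full dataset.
avg_fare, stddev_fare = 12.0, 2.0

# In SQLite a comparison yields 0 or 1, so sum() counts matching rows.
(pct_inaccurate,) = con.execute("""
    SELECT 100.0 * sum(abs(fare_amount - ?) > 6 * ?) / count(*)
    FROM trips
""", (avg_fare, stddev_fare)).fetchone()

print(pct_inaccurate)  # → 25.0
```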
Test 2: Completeness
Since completeness represents the amount of data you do have compared with the amount of data you could have, you can measure completeness by calculating the percentage of nulls in the field of interest.
This query calculates the percentage of records with a null value for the fare price, and presents this count as a percentage of total records.
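Sketched with SQLite in Python (again using a hypothetical miniature `trips` table), the completeness check looks like this:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE trips (fare_amount REAL)")
# One record is missing its fare.
con.executemany("INSERT INTO trips VALUES (?)",
                [(10.0,), (None,), (12.0,), (14.0,)])

# count(*) counts every row, including those with a NULL fare.
(pct_null,) = con.execute("""
    SELECT 100.0 * sum(fare_amount IS NULL) / count(*)
    FROM trips
""").fetchone()

print(pct_null)  # → 25.0
```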
Test 3: Consistency
Consistency is measured over time, and unless something (expected or unexpected) perturbs the data, we should expect the same or similar measurements over time. To measure consistency, you can compute the average of one of the fields for one day and check that the value is not very different from the average computed over all the data.
This query counts the number of records from the most recent day where the absolute difference between the record fare price and the average fare price is greater than 6 standard deviations from the fare price mean, and presents this count as a percentage of total records.
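A sketch of that consistency check with SQLite in Python; the rows, the `latest_day` value, and the statistics are all stand-ins for the pre-computed values over the real dataset.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE trips (pickup_datetime TEXT, fare_amount REAL)")
con.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [("2023-01-01 09:00:00", 10.0),
     ("2023-01-01 10:00:00", 14.0),
     ("2023-01-02 11:00:00", 12.0),
     ("2023-01-02 12:00:00", 99.0)],  # inconsistent fare on the latest day
)

# Stand-ins for the pre-computed statistics and latest day of data.
avg_fare, stddev_fare, latest_day = 12.0, 2.0, "2023-01-02"

(pct_inconsistent,) = con.execute("""
    SELECT 100.0
           * sum(date(pickup_datetime) = ?
                 AND abs(fare_amount - ?) > 6 * ?)
           / count(*)
    FROM trips
""", (latest_day, avg_fare, stddev_fare)).fetchone()

print(pct_inconsistent)  # → 25.0
```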
Test 4: Uniqueness
Real-world data should almost never be entirely duplicated, especially when you have timestamped data. You can check for uniqueness by computing the number of duplicated records over relevant fields.
This query calculates the number of records that have duplicated data for dropoff time, passenger count, and payment type, and presents this count as a percentage of total records.
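One way to sketch the uniqueness check with SQLite in Python: count total rows minus distinct combinations of the relevant fields. SQLite's `count(DISTINCT ...)` takes a single expression, so the fields are concatenated here; ClickHouse can count distinct tuples directly.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE trips (
        dropoff_datetime TEXT,
        passenger_count  INTEGER,
        payment_type     TEXT
    )
""")
con.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [("2023-01-01 09:20:00", 1, "card"),
     ("2023-01-01 09:20:00", 1, "card"),  # exact duplicate of the row above
     ("2023-01-01 10:05:00", 2, "cash"),
     ("2023-01-02 11:40:00", 1, "card")],
)

# Duplicates = total rows minus distinct (dropoff, passengers, payment) combos.
(pct_duplicated,) = con.execute("""
    SELECT 100.0
           * (count(*) - count(DISTINCT dropoff_datetime || '|'
                               || passenger_count || '|' || payment_type))
           / count(*)
    FROM trips
""").fetchone()

print(pct_duplicated)  # → 25.0
```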
Test 5: Validity
Data validity is a sanity check. For example, weather temperatures on Earth will (almost) never be less than -40°, and you'll (probably) never get paid to ride in a taxi, so a fare shouldn't be negative. For data validity, you can calculate the percentage of records whose value in one of the fields is outside an expected range.
This query counts the number of records where the total amount is either less than 0 or greater than 500, which would be considered invalid in this context, and presents this count as a percentage of total records.
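The validity check, sketched with SQLite in Python against a few hypothetical rows (the 0–500 bounds are the range assumed above):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE trips (total_amount REAL)")
# Two of the four totals fall outside the expected 0–500 range.
con.executemany("INSERT INTO trips VALUES (?)",
                [(12.5,), (-3.0,), (600.0,), (20.0,)])

(pct_invalid,) = con.execute("""
    SELECT 100.0 * sum(total_amount < 0 OR total_amount > 500) / count(*)
    FROM trips
""").fetchone()

print(pct_invalid)  # → 50.0
```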
Implementing these tests in Tinybird
If you’d like to test this code against the actual dataset, you can easily do so in Tinybird. Everything you need is in this GitHub repository.
You can sign up for a free Tinybird account here, clone the repo, and follow the README to publish an API that will return values for each of the 5 data quality checks based on the data you add to the Data Source.
When you call your API Endpoint, you will get a response containing the computed result of each of the 5 data quality checks.
You can add additional data from the NYC Taxi Dataset to see how quality changes as new records are ingested.
The result from this endpoint can be used to integrate with observability tools, such as Grafana. It can also be easily integrated with other systems and services, such as Apache Airflow, to automatically trigger other actions.
As you no doubt understand, data quality assurance is essential for maximizing the value of data in today's data-driven world. By ensuring data accuracy and reliability, organizations can make informed decisions and gain a competitive edge.
Tinybird can help data engineers to simplify the process of writing data quality tests, making it easy and enjoyable. If you’d like to give Tinybird a try, you can sign up here. The Build Plan is free forever, with no credit card required and no time limit.
If you get stuck along the way, or just want to connect with a like-minded community of data practitioners, you can join the growing Tinybird Slack community.