May 15, 2025

A Developer's Guide to Data Engineering

Developers work with code and some data; data engineers work with a lot of data and some code. We explain the main differences.
Javier Santana, Co-founder

Many developers work with data daily, from app databases to API responses. But when "data" needs to fuel complex analytics or power data-intensive features, we call on a specialized field called Data Engineering. If you're a developer curious about what data engineers really do, or how your dev skills can apply in the data world, this guide is for you.

What data engineers do

Data Engineers are the plumbers and electricians of the data world. They build the infrastructure to make data flow reliably and efficiently. They typically manage less code but handle much more data compared to developers.

These are the main tasks for a data engineer:

Wrangle data from everywhere (ETL/ELT)

They build and maintain pipelines that pull data from a multitude of sources – your app's PostgreSQL database, Salesforce, Google Analytics, Kafka event streams, S3 buckets full of CSVs, third-party APIs, etc. This data then needs to be cleaned (e.g., standardizing date formats), transformed (e.g., joining user IDs with CRM data), and loaded into a central system (like a data warehouse or data lake).

As a developer, you might write a script to fetch JSON from a weather API and store parts of it for a feature. A data engineer does this for dozens of sources, often dealing with terabytes of data, ensuring reliability, and handling schema changes from those sources.
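
To make that concrete, here's a minimal extract-clean-load sketch in Python. The weather API endpoint, the staging table, and the cleaning rules are all hypothetical, and a real pipeline would add retries, incremental loading, and handling for schema changes:

```python
import datetime as dt

import requests
import psycopg2  # assumes a PostgreSQL-compatible warehouse; any DB-API driver works similarly

# Hypothetical source API and target table, for illustration only
WEATHER_API = "https://api.example.com/v1/observations"
TARGET_TABLE = "staging.weather_observations"

def extract(day: dt.date) -> list[dict]:
    """Pull raw JSON records for one day from a third-party API."""
    resp = requests.get(WEATHER_API, params={"date": day.isoformat()}, timeout=30)
    resp.raise_for_status()
    return resp.json()["observations"]

def transform(rows: list[dict]) -> list[tuple]:
    """Clean and standardize: parse timestamps, drop rows with missing readings."""
    cleaned = []
    for row in rows:
        if row.get("temperature_c") is None:
            continue  # data quality rule: skip incomplete readings
        ts = dt.datetime.fromisoformat(row["observed_at"])  # standardize the date format
        cleaned.append((ts, row["station_id"], float(row["temperature_c"])))
    return cleaned

def load(records: list[tuple]) -> None:
    """Append cleaned records into the central analytics store."""
    # Connection string is a placeholder
    with psycopg2.connect("dbname=warehouse") as conn, conn.cursor() as cur:
        cur.executemany(
            f"INSERT INTO {TARGET_TABLE} (observed_at, station_id, temperature_c) "
            "VALUES (%s, %s, %s)",
            records,
        )

if __name__ == "__main__":
    load(transform(extract(dt.date.today() - dt.timedelta(days=1))))
```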

Model data for analytics

They design database schemas specifically optimized for fast analytical queries. This often means denormalizing data (the opposite of what you typically do for application databases) into structures like star or snowflake schemas. The goal is to make it easy for analysts or BI tools to ask complex questions and get answers quickly.

As a developer, your application database is likely highly normalized (e.g., users, orders, products tables) to ensure data integrity and avoid redundancy for transactional operations. A data engineer might create an analytics_sales_summary table that pre-joins and aggregates data from these, so querying "total sales per product category last quarter" doesn't require complex joins on the fly and runs in seconds, not minutes.
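
As a sketch of what that pre-aggregation can look like, here's the idea expressed with DuckDB so it runs locally; the table and column names are assumptions, and in practice this would live in your warehouse or in a dbt/Tinybird transformation:

```python
import duckdb

con = duckdb.connect()  # in-memory database, standing in for a real warehouse

# Hypothetical normalized app tables (empty here, just to show the shape)
con.sql("CREATE TABLE products (product_id INT, category TEXT)")
con.sql("CREATE TABLE orders (order_id INT, ordered_at TIMESTAMP)")
con.sql("CREATE TABLE order_items (order_id INT, product_id INT, amount DOUBLE)")

# Pre-join and pre-aggregate into a denormalized table optimized for analytics
con.sql("""
    CREATE TABLE analytics_sales_summary AS
    SELECT
        date_trunc('quarter', o.ordered_at) AS quarter,
        p.category,
        sum(oi.amount)                      AS total_sales,
        count(DISTINCT o.order_id)          AS order_count
    FROM orders o
    JOIN order_items oi USING (order_id)
    JOIN products p USING (product_id)
    GROUP BY 1, 2
""")

# "Total sales per product category last quarter" becomes a trivial lookup
con.sql("""
    SELECT category, total_sales
    FROM analytics_sales_summary
    WHERE quarter = date_trunc('quarter', now()) - INTERVAL 3 MONTH
""").show()
```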

Orchestrate complex data workflows

They use tools like Apache Airflow, Prefect, or Dagster to schedule, monitor, and manage hundreds of interdependent data jobs. These workflows (often called DAGs - Directed Acyclic Graphs) define the order of operations, handle retries on failure, and manage dependencies between tasks.

As a developer, you might have a cron job to back up your database. A data engineer might manage a workflow that (see the sketch after this list):

  1. Extracts yesterday's user sign-ups from the app DB (runs at 2 AM)
  2. If successful, extracts marketing campaign data from Google Ads
  3. If both successful, joins them to calculate cost per acquisition
  4. Loads the result into the analytics database or data warehouse
  5. Sends a Slack alert if any step fails
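
A minimal Airflow (TaskFlow API) sketch of that workflow could look like the following. The task bodies, staging paths, and the failure callback are placeholders; by default each task only runs if its upstream tasks succeeded, which is what gives you the "if successful" behavior above:

```python
from datetime import datetime

from airflow.decorators import dag, task

def notify_slack(context):
    """Placeholder failure callback; a real one would post to a Slack webhook."""
    print(f"Task {context['task_instance'].task_id} failed")

@dag(
    schedule="0 2 * * *",  # runs at 2 AM (Airflow 2.4+; older versions use schedule_interval)
    start_date=datetime(2025, 1, 1),
    catchup=False,
    default_args={"retries": 2, "on_failure_callback": notify_slack},
)
def cost_per_acquisition():
    @task
    def extract_signups() -> str:
        # Pull yesterday's sign-ups from the app DB; returns a staging path (hypothetical)
        return "s3://staging/signups.parquet"

    @task
    def extract_ads() -> str:
        # Pull campaign spend from Google Ads (hypothetical)
        return "s3://staging/ad_spend.parquet"

    @task
    def compute_cpa(signups_path: str, ads_path: str) -> str:
        # Join both datasets and calculate cost per acquisition (placeholder)
        return "s3://staging/cpa.parquet"

    @task
    def load_to_warehouse(cpa_path: str) -> None:
        # Load the result into the analytics database or data warehouse (placeholder)
        ...

    # Passing outputs as inputs defines the dependency graph
    load_to_warehouse(compute_cpa(extract_signups(), extract_ads()))

cost_per_acquisition()
```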

Ensure data quality and governance at scale

They implement automated checks and processes to ensure data is accurate, complete, consistent, and timely. This includes setting up alerts for anomalies (e.g., a sudden drop in order volume), validating data against predefined rules, and helping ensure compliance with regulations like GDPR or CCPA.

As a developer, you write unit tests to check that your calculate_discount function works correctly. A data engineer might implement a "data test" that checks that the order_total column in a sales data table never contains negative values, or that the number of new rows ingested daily is within an expected range.
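
Two such data tests might look like this; the sales table, the connection, and the volume thresholds are assumptions (frameworks like dbt tests or Great Expectations package the same idea):

```python
import duckdb

# Hypothetical warehouse file; in practice this would be a warehouse connection
con = duckdb.connect("warehouse.duckdb")

def test_no_negative_order_totals():
    bad = con.execute("SELECT count(*) FROM sales WHERE order_total < 0").fetchone()[0]
    assert bad == 0, f"{bad} rows have a negative order_total"

def test_daily_ingest_volume_in_range():
    rows = con.execute(
        "SELECT count(*) FROM sales WHERE ingested_at >= current_date - 1"
    ).fetchone()[0]
    # Expected range based on historical volume (hypothetical numbers)
    assert 50_000 <= rows <= 500_000, f"unexpected daily row count: {rows}"
```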

Start building with Tinybird!
If you've read this far, you might want to use Tinybird as your analytics backend. You can just get started on the free plan.

Dev practices that data engineers should use more often

The data world is increasingly adopting software engineering best practices. As a developer, you'll see familiar territory here.

Version control for data assets (git for data)

As a developer, you don't start any project without a version control system.

In data engineering, that's not always the case. Ideally, data engineers should version control their SQL transformation scripts, dbt models or Tinybird pipes, pipeline configurations (e.g., Airflow DAGs), and infrastructure-as-code for data platforms.

Testing

When working on an app, you usually do unit tests, integration tests, and E2E tests for application logic and user flows. Most of your tests validate the happy path and then a myriad of edge cases.

In data engineering, the principle is the same, but because you usually handle billions of rows, even remote edge cases show up regularly. The tests themselves differ a little bit (see the sketch after this list):

  • Data quality tests: Validate the data itself (e.g., user_id is never null, email column contains valid email formats)
  • Unit tests for transformations: Test individual pieces of data transformation logic (e.g., a Python function that cleans addresses)
  • Integration tests for pipelines: Check that different stages of a data pipeline connect and pass data correctly
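
For the second kind, a unit test for transformation logic looks just like the tests you already write with pytest. clean_address here is a hypothetical helper:

```python
import re

def clean_address(raw: str) -> str:
    """Hypothetical transformation: trim parts, collapse whitespace, uppercase the country code."""
    parts = [p.strip() for p in raw.split(",")]
    parts[-1] = parts[-1].upper()
    return ", ".join(re.sub(r"\s+", " ", p) for p in parts)

def test_clean_address_normalizes_whitespace_and_country():
    assert clean_address("  12  Main St ,  Springfield , us ") == "12 Main St, Springfield, US"
```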

CI/CD

This is industry standard in the software world: automated builds (e.g., npm run build), tests, and deployments to staging/production when code is merged.

In software development, code is the main asset, but in data engineering, it's data. Since you're dealing with constantly changing data (and it's expensive to make test copies), the approach needs to be different.

Data engineers should automate the testing and deployment of changes to data pipelines. This includes new transformation logic, schema changes, or updates to orchestration definitions. Why does this matter? Faster, more reliable delivery of data pipeline changes. It reduces manual errors and allows data engineers to iterate more quickly, just like you do with application code.

Local development & faster feedback loops

You do npm start and see changes in your browser in seconds. You run your app locally for quick iterations. Then you push the code to prod. You may have a local database with some fake data to test the app.

Data engineers need the data, and it should be as close to production as possible. It usually lives in the data warehouse because it doesn't fit on your machine, and you often can't pull a sample of it for security and privacy reasons. Combine that with long-running batch jobs and the feedback loop gets pretty slow.

This is more important than it seems. Faster iteration cycles lead to quicker development and debugging. Tools that allow local execution of transformations against smaller data samples, or local containerized versions of data tools, can significantly boost productivity (hint: check tb local).
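
One way to get that loop is to run transformations against a small local sample. Here's a sketch with DuckDB over a hypothetical sample Parquet file; you iterate on the query locally in seconds and then promote the same SQL to the production pipeline:

```python
import duckdb

# A small, anonymized sample exported from production (hypothetical path)
SAMPLE = "samples/events_1pct.parquet"

# Iterate on the transformation locally until the output looks right
duckdb.sql(f"""
    SELECT
        date_trunc('day', event_time) AS day,
        count(DISTINCT user_id)       AS daily_active_users
    FROM read_parquet('{SAMPLE}')
    GROUP BY 1
    ORDER BY 1
""").show()
```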

Monitoring

Observability, monitoring, alerts, metrics, and so on are part of any feature. No one pushes an app to production without basic alerting.

In the data engineering space, these practices are not that common. We see many teams failing to monitor basic metrics like data ingestion status, which can lead to unnoticed issues. In Tinybird, we send automated emails when ingestion fails (or you can monitor it with Prometheus, Grafana, or other tools).
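
As a minimal sketch of what that monitoring can look like with the prometheus_client library (the metric names and the ingestion step are placeholders), a batch job can expose its own health and let Prometheus/Grafana alert when the last-success timestamp stops advancing:

```python
import time

from prometheus_client import Counter, Gauge, start_http_server

# Hypothetical metrics for a batch ingestion job
rows_ingested = Counter("pipeline_rows_ingested_total", "Rows loaded into the warehouse")
last_success = Gauge("pipeline_last_success_timestamp", "Unix time of the last successful run")

def run_ingestion() -> int:
    """Placeholder for the real load step; returns the number of rows loaded."""
    return 42

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes metrics from :8000/metrics
    while True:
        rows_ingested.inc(run_ingestion())
        last_success.set_to_current_time()
        time.sleep(3600)  # hourly batch; alert if last_success stops advancing
```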

Real-time answers

As a developer, you don't expect users to wait 6 hours to see the result of an action. But in data engineering, that happens all the time. Data engineers are used to working in batches.

There is a reason for that: making queries fast over a lot of data has typically been expensive, but it's becoming cheaper and easier.

Tinybird has most of these good practices built-in. Sign up or read our docs to learn more.
