Oct 15, 2024

Simple patterns for aggregating on DynamoDB

DynamoDB doesn't natively support aggregations, so here are four different approaches to aggregate data in DynamoDB tables.
Cameron Archer
Head of Marketing

DynamoDB is a fast and scalable NoSQL key-value store used by millions of developers worldwide who are building real-time applications. When you use DynamoDB, you come to expect read/write operations that run in milliseconds. Yes, it's that fast for transactional use cases. And it remains that fast even at an absurd scale. That's why it's awesome.

But DynamoDB is purely a transactional database. It maintains consistent performance at massive scale so long as your access patterns involve point lookups and small-range scans using the primary key.

Aggregations are a different access pattern. In DynamoDB, they generally involve a full scan of all items. This is an expensive and slow operation.

So what do you do if you want to run aggregations on data in DynamoDB? Here are a few simple patterns for aggregating DynamoDB data. Each of these patterns has various benefits and tradeoffs, as you'll see, but they all satisfy the core requirement to run aggregations over DynamoDB data in a relatively performant and cost-effective manner.

Pattern 1: Aggregate in batch with Redshift

Since DynamoDB does not natively support aggregation patterns, you have to move the data outside of DynamoDB to perform aggregations.

If you don't need your aggregations calculated in real time, you can offload your tables to another data tool optimized for online analytical processing (OLAP) workloads, such as Amazon Redshift.

Amazon Redshift provides the classic data warehouse experience: scalability, storage/compute separation, and an SQL interface for querying. This makes it well suited for large-scale analytics workloads that power business intelligence tools.

[Diagram: performing DynamoDB aggregations in Amazon Redshift]
You can use the Redshift COPY command to copy data from DynamoDB to Redshift, which you can then integrate with downstream Business Intelligence (BI) tools for ad hoc querying, reporting, and dashboarding.

To get data from DynamoDB into Redshift, you run the COPY command from within Redshift.
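
Here's a minimal sketch, assuming a source DynamoDB table named orders, a target Redshift table named orders_analytics, and an IAM role with read access to the DynamoDB table (all of these names are hypothetical):

```sql
-- Copy a snapshot of the DynamoDB table "orders" into Redshift,
-- consuming at most 50% of the table's provisioned read throughput.
COPY orders_analytics
FROM 'dynamodb://orders'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
READRATIO 50;
```

The READRATIO parameter caps how much of the DynamoDB table's provisioned read capacity the copy can consume, which is central to the speed/cost tradeoff discussed below.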

Pros

  • Redshift is well-suited for large-scale analytical processing that involves aggregations
  • Redshift is highly scalable for these types of workloads
  • Redshift presents an intuitive SQL interface
  • Redshift integrates easily with DynamoDB using the COPY command
  • Redshift can be deployed serverless for simplicity at scale

Cons

  • The COPY command only copies a current snapshot of the DynamoDB table, so it will need to be periodically refreshed depending on the downstream analytics use case
  • The COPY command consumes DynamoDB read capacity units, which adds cost and can impact performance on your DynamoDB table
  • Because the COPY command consumes provisioned throughput on your DynamoDB table, AWS recommends against running it on production tables and suggests setting READRATIO well below your table's unused provisioned read capacity, which creates a speed/cost tradeoff.
  • Copying data from a NoSQL environment to an SQL environment creates some trickiness, as you need to account for differences in data types, treatment of NULL values, etc.
  • Redshift is not suited for real-time analytics, so this approach should not be used when analytical queries must run with low latency or high concurrency (typically required in applications where DynamoDB is used).

TL;DR

You can use Amazon Redshift to perform scheduled COPY operations of DynamoDB tables for batch analytical workloads. This is a decent approach for business intelligence or internal reporting use cases but unsuited for real-time or user-facing analytics.

Pattern 2: Aggregate in batch with Apache Hive on EMR

This approach is similar in form to the one described above, except it uses Apache Hive instead of Redshift as the destination for your DynamoDB tables.

Running Apache Hive on EMR is like running Postgres on EC2: you get the benefits of an open source technology backed by Amazon's scalable infrastructure. Apache Hive presents an SQL-like interface (called HiveQL) that ultimately invokes MapReduce jobs to perform aggregations, giving you quite a bit of flexibility for running complex analytical processes over massively distributed datasets.

[Diagram: DynamoDB aggregations with Apache Hive on an Amazon EMR cluster]
You can use Apache Hive on EMR as an open source alternative to Redshift, using HiveQL to query DynamoDB data in an ad hoc manner.

Getting data from DynamoDB to Apache Hive can be a bit complex, however, requiring you to set up an EMR cluster, SSH into the leader node, and manually define a Hive table (with a pre-defined schema) that maps to your DynamoDB table. This gets tricky with single-table designs in DynamoDB, though you'll face that difficulty moving DynamoDB data into any schema-dependent system.

You can follow this step-by-step tutorial to learn how to copy data from DynamoDB to Hive. The basic steps are:

  1. Create an EMR cluster
  2. Connect to the leader node
  3. Set up the Hive table based on a pre-defined schema
  4. Link the Hive table to the DynamoDB table
  5. Query the DynamoDB table using HiveQL (sketched below)
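
To make steps 3-5 concrete, here's a rough sketch assuming a DynamoDB table named orders with customer_id, order_id, and amount attributes (all names hypothetical):

```sql
-- Steps 3 & 4: create a Hive table backed directly by the DynamoDB table
CREATE EXTERNAL TABLE hive_orders (
    customer_id string,
    order_id    string,
    amount      double
)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
    "dynamodb.table.name" = "orders",
    "dynamodb.column.mapping" = "customer_id:customer_id,order_id:order_id,amount:amount"
);

-- Step 5: aggregate with HiveQL
SELECT customer_id, COUNT(*) AS order_count, SUM(amount) AS total_spend
FROM hive_orders
GROUP BY customer_id;
```

Note that queries against this external table read from DynamoDB directly, consuming read capacity on the source table, unless you first copy the data into HDFS or S3.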

Pros

  • You get the benefits of open source tech (Apache Hive) on scalable AWS clusters (EMR)
  • Apache Hive presents an SQL-like interface that will be familiar to analysts and data engineers
  • Apache Hive can handle query processing on massive, distributed datasets

Cons

  • Setting up an EMR cluster can be complex and costly
  • Apache Hive is not suited for real-time data processing, so this approach only works for batch workloads
  • As with the Redshift approach, querying DynamoDB data through Hive consumes read capacity units on your source table
  • AWS doesn't offer native integrations for visualizing Hive query results or feeding them into other tools, so this approach is really only useful for ad hoc querying

TL;DR

If you want to run complex, ad hoc analytics queries on DynamoDB data and you're familiar with the Hadoop ecosystem, copying your DynamoDB data to Apache Hive will give you a scalable and flexible system to query your data, but it won't work for real-time processing needs or when you want to integrate the results of your queries back into a user-facing application or dashboard.

Pattern 3: Aggregate to a separate DynamoDB table or GSI with AWS Lambda

DynamoDB is typically selected to back user-facing applications with real-time needs: low latency and high concurrency on both reads and writes. If your goal is to surface aggregations in the same application, you can find some efficiency by just putting those aggregations back into DynamoDB. You're already working with the database, so why not keep working with it?

To do this, you need to enable DynamoDB Streams on your DynamoDB table, which produces a change data capture (CDC) stream of item-level changes as items are inserted, modified, or deleted. Then, an AWS Lambda function triggered by the stream calculates aggregates and stores them in a separate DynamoDB table. Your Lambda can also read from the aggregates target table to fetch existing aggregates and update them based on new data.

[Diagram: using AWS Lambda to calculate aggregates on a DynamoDB table]
This is a classic pattern for running aggregations on DynamoDB. You can create a Lambda function that calculates aggregates based on updates to the source DynamoDB table and writes them either to a separate DynamoDB table or back to the original table with a Global Secondary Index (GSI) to sort on the necessary primary key.

Amazon has a good tutorial for this workflow here. The basic steps are:

  1. Enable DynamoDB streams on your DynamoDB source table
  2. Create a DynamoDB target table to store your aggregates
  3. Create a Lambda function with roles to read/write from the DynamoDB tables
  4. Set up triggers for the Lambda based on DynamoDB Stream output
  5. The Lambda then runs whenever stream records match its trigger, computing aggregates and writing them to the target table (see the sketch after the note below)

Note
You can also write your aggregates back to the existing DynamoDB table, and then create a Global Secondary Index (GSI) whose primary key matches your expected filter key. More on that approach here.
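
To make the Lambda's write path concrete, here's a rough sketch of the kind of statement it might run against the aggregates table using DynamoDB's PartiQL support. The order_stats table, its customer_id key, and the attribute names are all hypothetical:

```sql
-- Increment per-customer running aggregates in the target table for each
-- incoming stream record. A PartiQL UPDATE modifies exactly one item, so
-- the WHERE clause must specify the item's full primary key.
UPDATE "order_stats"
SET order_count = order_count + 1
SET total_spend = total_spend + 42.50
WHERE customer_id = 'c_123'
```

In practice, you'd handle the first write for each key separately (with a PartiQL INSERT, or by using the native UpdateItem API with an ADD update expression, which upserts atomically).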

Pros

  • You can continue to use DynamoDB to store aggregates, so you stick with a familiar interface.
  • This approach works for real-time applications. If your Lambda function is efficient and properly scaled, you can achieve sub-second end-to-end latencies to write and read aggregates.
  • Reading aggregates is fast because they're properly indexed in your DynamoDB target table or Global Secondary Index (GSI)
  • Leverages existing DynamoDB functionality (Streams).
  • Lambda's read requests to the DynamoDB Stream (polled 4 times per second per shard) are free.
  • Doesn't consume read capacity units from your source table, so this can effectively be used in production without straining your app database.

Cons

  • This approach offers very little flexibility. Aggregates are pre-computed and indexed by a primary key, so you won't be able to dynamically filter on the read side.
  • While read requests by Lambda to DynamoDB Streams are free, Lambda invocations aren't. Those costs can creep up if you're invoking the Lambda function many times per second.
  • Lambda is doing a lot of work here. If it fails, your aggregates could be inaccurate, which won't work in an environment where you need strong consistency.

TL;DR

Using a Lambda to calculate aggregates and write them to another DynamoDB table is a solid approach when you need near-real-time aggregates that you can surface in your application, but this approach won't offer the same query flexibility as prior approaches.

Pattern 4: Aggregate in real-time using Tinybird as a secondary index

Tinybird is a real-time data platform that runs on AWS. It's built on open source ClickHouse® and is optimized for low-latency, high-concurrency reads and writes for OLAP workloads. It offers a native DynamoDB Connector, gives you the ability to query DynamoDB tables with SQL, and features an "instant API" experience to publish query results as scalable REST API Endpoints that integrate with your app.

In this approach, you effectively use Tinybird as a secondary index of your DynamoDB table, but with a data structure that supports complex aggregates (plus filters and joins) at low read latencies similar to what you can achieve with DynamoDB. You can also add dynamic logic to your SQL, allowing you to pass query parameters to your aggregating functions for tremendous flexibility.

Using Tinybird as a "secondary index" for your DynamoDB table gives you the benefits of a flexible, SQL-based environment for complex analytical queries plus sub-second API response times. You can integrate aggregates into your app via Tinybird REST API Endpoints, and/or sink your aggregates into an S3 Bucket or MSK topic.

To connect DynamoDB to Tinybird, you can use Tinybird's native DynamoDB Connector, which uses Point-in-Time Recovery (PITR) to automatically create an initial snapshot of the DynamoDB table, then uses DynamoDB Streams to consume CDC upserts and deduplicate them. The process ensures a consistent, stateful copy of your DynamoDB table with only a few seconds of latency.

Once your DynamoDB data is in Tinybird, you can write SQL queries to filter, aggregate, and join. Every query can be published as a scalable, low-latency REST API Endpoint secured by custom JWTs, so integrating real-time analytics into your application is simple.
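
For example, assuming your DynamoDB data lands in a Tinybird Data Source named dynamodb_orders with product_id, amount, and order_date columns (all hypothetical names), an aggregating query with a dynamic date filter might look like this:

```sql
%
SELECT
    product_id,
    count() AS orders,
    sum(amount) AS revenue
FROM dynamodb_orders
WHERE order_date >= {{ Date(start_date, '2024-01-01') }}
GROUP BY product_id
ORDER BY revenue DESC
```

Published as an API Endpoint, this query accepts start_date as a dynamic query parameter, so a single endpoint can serve many different aggregation windows.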

Pros

  • Tinybird combines the low-latency/high-concurrency benefits of DynamoDB with the query flexibility and analytics capabilities of tools like Redshift and Hive.
  • Tinybird has a native DynamoDB Connector to maintain a stateful copy of DynamoDB tables without needing to write or maintain any external ETLs.
  • You can publish aggregating queries as hosted, scalable REST API Endpoints with dynamic query parameters, which makes for easy integration with user-facing apps.
  • Tinybird supports handling JSON and semi-structured data.
  • Tinybird is available on AWS and can be purchased through the AWS Marketplace.

Cons

  • Tinybird is a managed service that, for now, can't be self-hosted in your own AWS cloud. Tinybird does, however, maintain a "Bring Your Own Cloud" waitlist for those interested in self-hosting in the future.
  • Like Redshift and Hive, Tinybird uses table schemas for indexing, which can be uncomfortable for developers familiar with the schemaless approach of DynamoDB.
  • Because of its schema pattern, using Tinybird in conjunction with single-table designs in DynamoDB can add some complexity, requiring you to parse raw data into multiple target tables using Materialized Views.

TL;DR

When you need the flexibility of dynamic aggregations with real-time processing performance, Tinybird pairs perfectly with DynamoDB. Its native DynamoDB Connector simplifies integration with DynamoDB, and its hosted, scalable API Endpoints simplify integrating dynamic aggregate queries back into your application.

Wrap up

DynamoDB doesn't natively support aggregation functionality, but you can extend it to achieve aggregations with any of these four approaches. The approaches range in their use cases, from batch to real-time, and in their flexibility, performance, and cost.

If you need flexible, ad hoc querying in batch, you can use Redshift or Hive. If you need real-time performance without dynamic flexibility, you can simply reuse DynamoDB. If you need both real-time performance and flexibility to dynamically calculate aggregates, Tinybird is the way to go.

Here's a summary of the pros and cons of each approach:

Amazon Redshift

Pros:
  • Optimized for large-scale analytical processing
  • Highly scalable, with an SQL interface
  • Easy integration with DynamoDB using the COPY command

Cons:
  • Not suitable for real-time analytics
  • Requires periodic data refreshes, affecting DynamoDB performance
  • Costly in terms of read capacity and storage

Apache Hive + EMR

Pros:
  • Scalable, with an SQL-like interface (HiveQL)
  • Open source, with flexibility for complex queries

Cons:
  • Complex setup with EMR clusters
  • Batch processing only, no real-time support
  • Lacks native integration with visualization tools

AWS Lambda + DynamoDB

Pros:
  • Real-time aggregation with DynamoDB Streams
  • Low latency for user-facing apps
  • Doesn't consume source-table read capacity

Cons:
  • Limited query flexibility
  • Costs can increase with frequent Lambda invocations
  • Potential inconsistency if the Lambda fails

Tinybird

Pros:
  • Real-time aggregation with SQL and dynamic query parameters
  • Native DynamoDB Connector eliminates external ETLs
  • Scalable API Endpoints for fast integration

Cons:
  • Managed service; no self-hosting (yet)
  • Schema-based indexing may be unfamiliar to DynamoDB developers
  • Single-table designs require a bit more work upfront

Tinybird is free to use. The Build Plan has no time limit, no credit card required, and plenty of capacity to prove your DynamoDB aggregation use case. You can sign up for Tinybird here. If you have questions or get stuck, the Tinybird Slack Community is a great place to get answers.

ClickHouse®️ is a registered trademark of ClickHouse, Inc.
