Introduction
I received an interesting question on the DevHeads community Discord:
“When to use one over the other between Amazon DynamoDB and S3 for storage?”
Short answer
Two principal factors determine the choice of data storage:
- The nature of the data itself.
- The way we want to consume it.
Discussion
For this discussion, I will group the data produced by IoT devices into four categories:
- Time-series telemetry readings.
- Complex readings.
- Batch data.
- Custom.
By no means is that a comprehensive list, but it illustrates the primary use cases.
Time-series telemetry readings
Let’s say a temperature sensor constantly sends readings to the AWS Cloud backend. The input data have a simple structure, and we want to analyze how the values change over time. In that case, I suggest using a time-series database, as it is purpose-built for this type of input data. That kind of database is optimized for querying and aggregating over time, which are the primary access patterns for this use case.
In the AWS Cloud offering, Amazon Timestream is an example of a time-series database.
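Here is a minimal sketch of that pattern using boto3. The database and table names (iot_demo, temperature) and the record layout are assumptions made for illustration:

```python
import time

import boto3

# Write a single temperature reading to Amazon Timestream.
# Database and table names below are hypothetical.
timestream = boto3.client("timestream-write")


def write_temperature(device_id: str, celsius: float) -> None:
    timestream.write_records(
        DatabaseName="iot_demo",
        TableName="temperature",
        Records=[
            {
                # Dimensions describe the source of the measurement.
                "Dimensions": [{"Name": "device_id", "Value": device_id}],
                "MeasureName": "temperature_celsius",
                "MeasureValue": str(celsius),
                "MeasureValueType": "DOUBLE",
                "Time": str(int(time.time() * 1000)),
                "TimeUnit": "MILLISECONDS",
            }
        ],
    )


write_temperature("sensor-42", 21.7)
```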
Complex readings
This point requires some clarification. By “Complex readings”, I mean (potentially unstructured) data transmitted from devices to the backend system. What differentiates it from time-series telemetry is its lower frequency and more diverse access patterns. Complex reading messages carry more information than the previously discussed option. Additionally, there is an opportunity to enrich this data with other details stored in the same database and query it from multiple perspectives (not constrained to time). For that purpose, I suggest using a NoSQL database, as it enables horizontal scalability and does not enforce a static schema on the data.
In the AWS Cloud offering, Amazon DynamoDB is an example of a NoSQL database.
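A minimal sketch of this approach with boto3 follows. The table name (DeviceReadings), its key schema (device_id partition key, reading_id sort key), and the payload fields are hypothetical:

```python
from decimal import Decimal

import boto3
from boto3.dynamodb.conditions import Key

# Assumed table: partition key "device_id", sort key "reading_id".
table = boto3.resource("dynamodb").Table("DeviceReadings")


def store_reading(device_id: str, reading_id: str, payload: dict) -> None:
    # DynamoDB does not enforce a static schema, so the payload can differ
    # between devices; note that floats must be stored as Decimal.
    table.put_item(Item={"device_id": device_id, "reading_id": reading_id, **payload})


store_reading(
    "gateway-7",
    "2024-05-01T12:00:00Z",
    {
        "firmware": "1.4.2",
        "errors": ["E42"],
        "vibration": {"x": Decimal("0.12"), "y": Decimal("0.03")},
    },
)

# Query all readings for one device -- just one of many possible access patterns.
response = table.query(KeyConditionExpression=Key("device_id").eq("gateway-7"))
print(response["Items"])
```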
Batch data
That is one of the most interesting use cases, as it helps to save money. Sending real-time data to the cloud backend is a valid use case but not always required. There are numerous scenarios where we can provide business value by buffering data on edge devices and sending it in batches. One example is device performance monitoring, where we constantly gather operational data but send it to the backend system only once a day. Streaming this type of analytical data in real time is unnecessary and not recommended.
How can we save money by batching readings? There are a number of ways:
- We can upload data directly to an S3 Bucket, bypassing AWS IoT Core and its associated costs (see the sketch after this list).
- By batching data on the edge, we reduce the number of objects saved in the S3 Bucket. That is important, as S3 API invocations are a significant cost factor.
- Storing data in an S3 Bucket is considerably cheaper than keeping the same amount of information in a database.
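Here is a minimal sketch of the batching idea from the first bullet point: buffer readings locally on the device and upload them once a day as a single compressed object. The bucket name and key layout are assumptions:

```python
import datetime
import gzip
import json

import boto3

s3 = boto3.client("s3")
buffer = []  # in-memory buffer on the edge device


def add_reading(reading: dict) -> None:
    buffer.append(reading)


def flush_to_s3(bucket: str = "iot-batch-data") -> None:
    """Upload all buffered readings as one compressed object, then clear the buffer."""
    if not buffer:
        return
    # Newline-delimited JSON compresses well and is easy to query later.
    body = gzip.compress("\n".join(json.dumps(r) for r in buffer).encode("utf-8"))
    key = f"readings/{datetime.date.today():%Y/%m/%d}/batch.json.gz"
    s3.put_object(Bucket=bucket, Key=key, Body=body)
    buffer.clear()
```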
What about analytics? We can use Amazon Athena to analyze data stored in the S3 Bucket using SQL queries.
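As a rough sketch, assuming a Glue/Athena table named device_readings has already been defined over that bucket (the database name and output location are also assumptions), a query could be started like this:

```python
import boto3

athena = boto3.client("athena")

# Aggregate the batched readings with plain SQL over the S3 data.
response = athena.start_query_execution(
    QueryString="""
        SELECT device_id, avg(cpu_load) AS avg_cpu
        FROM device_readings
        WHERE day = '2024-05-01'
        GROUP BY device_id
    """,
    QueryExecutionContext={"Database": "iot_analytics"},
    ResultConfiguration={"OutputLocation": "s3://iot-batch-data/athena-results/"},
)
print(response["QueryExecutionId"])  # poll get_query_execution to track progress
```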
Custom
There is no one-size-fits-all solution in the IoT domain. Sometimes, we need to send real-time data to the cloud backend and handle it in a customized way. There are numerous ways to implement this; the one I want to mention today is Amazon API Gateway. What are the pros and cons of that solution?
Pros:
- Significant flexibility in how data is received and stored.
- Potential cost reduction compared to using AWS IoT Core.
Cons:
- Custom development requires more effort than using other AWS services.
- Managing an IoT fleet outside of AWS IoT Core increases the solution’s complexity.
The API Gateway approach allows storing data in any backend solution, so we are not limited to the default integrations of AWS IoT Core. That enables us to adjust the storage option to the nature of the input data and the analytical needs. Once again, that flexibility comes at the price of additional development effort.
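Below is a minimal sketch of the device side, assuming a hypothetical API Gateway endpoint protected by an API key; the backing integration (Lambda, DynamoDB, S3, or anything else) is free to store the data however it needs:

```python
import json
import urllib.request

# Hypothetical API Gateway endpoint; replace with your own stage URL.
API_URL = "https://example.execute-api.eu-west-1.amazonaws.com/prod/readings"


def send_reading(reading: dict, api_key: str) -> int:
    """POST one reading to the API Gateway endpoint and return the HTTP status."""
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(reading).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


status = send_reading({"device_id": "sensor-42", "temperature": 21.7}, api_key="demo-key")
```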
Summary
When designing an Internet of Things solution, I recommend starting by defining the nature of the input data and the way we want to consume (analyze) it. Once we understand those aspects, we can decide on the appropriate way to transfer and store the collected information.
I hope that helps!