AWS Bites Podcast

123. What do you need to know about DynamoDB?

Published 2024-05-17 - Listen on your favourite podcast player

In this episode, we provide a comprehensive overview of DynamoDB, including how it compares to relational databases, when to use it, how to get started, writing and querying data, secondary indexes, and single table design. We share our experiences using DynamoDB and discuss the pros and cons compared to traditional SQL databases.

AWS Bites is brought to you by fourTheorem. If you need someone to work with you to build the best-designed, highly available database on AWS, give us a shout. Check us out on fourtheorem.com!


Let's talk!

Do you agree with our opinions? Do you have interesting AWS questions you'd like us to chat about? Leave a comment on YouTube or connect with us on Twitter: @eoins, @loige.

Help us to make this transcription better! If you find an error, please submit a PR with your corrections.

Luciano: Following on from our last episode on Aurora, we are sticking with databases today. This time we are discussing one of the most requested topics by our listeners: DynamoDB. We are going to give you our opinion on when and how to use DynamoDB, when you should avoid it, and whether the much talked about topic of single table design is actually worth the effort. By the end of today's episode, we hope you will have a comprehensive understanding of the main DynamoDB concepts and how to get the most value from DynamoDB.

I'm Luciano and I'm joined by Eoin, and this is another episode of AWS Bites podcast. AWS Bites is brought to you by fourTheorem. If you need someone to work with you to build the best-designed, highly available databases on AWS, give us a shout. You can check us out at fourtheorem.com or contact us directly using the links you will find in the show notes. So maybe to get started, what we can do is give a little bit of background on what DynamoDB is and how it compares with relational databases.

So DynamoDB is well known as one of the best in class NoSQL databases in the cloud. And because we talked about relational databases in the previous episode, how does a NoSQL database compare with a SQL database? It's not necessarily an easy comparison, because NoSQL is a bit of a marketing term. So it's not like there is a canonical definition, but we'll try our best to describe the differences between those two classes of databases.

So let's start with relational databases first. Relational databases traditionally optimize for storage. After all, we have to remember that they were invented at a time when storage was very expensive. So the goal at that point in history was to limit as much as possible the duplication of data, because storage was effectively a scarce resource. Data is generally separated into normalized tables with defined relations between them.

So you get a well-organized structure and a well-defined schema. Relational databases normally use a language called SQL and a highly optimized query engine that allows you to retrieve data across multiple tables in a very dynamic way. It allows you to combine data in different ways, filter it in different ways, and do updates across multiple records at the same time. And this has become over the years some kind of lingua franca for databases.

And it's incredibly popular: lots of people in the industry know SQL as a language. It's well understood and used in many products, even for reporting, not just for actually interacting with databases. And if you consider that it's been around for 50 years, it actually makes sense that it's something so well-known, understood, and adopted in the industry. So SQL is a way to make arbitrary requests, or ask arbitrary questions, of your database.

And that's a great thing. It's actually a great feature, especially in comparison with NoSQL databases, because you generally don't need to know in advance what you're going to be using the database for. You just put all your data there, you give it some sensible structure, and then over time you can come up with new access patterns, with new questions to ask your database, and SQL is going to be flexible enough to allow you to express those questions. But this is a bit of a double-edged sword: precisely because it's so flexible, it cannot be optimized for performance for every use case or every question you might have. So sometimes, if you ever manage a SQL database, you will find yourself trying to figure out why a particular query was slow and how to optimize it.

Sometimes that means changing the data structure, maybe adding indices, maybe scaling hardware, maybe thinking about how to partition the data across multiple instances of the database. So these are the pros and cons of relational databases. Let's talk now about NoSQL. We already mentioned that NoSQL is a bit of a marketing term, so let's try to figure out the simplest definition we can give that most people would probably agree with.

One of the main points of most NoSQL products is that they are schema-less. That means that when you store data, you store it in a set, for lack of a better word (one place where you put all your data), and different records in that set can have a different structure, so they can have different fields. That's what we mean by schema-less: you don't have to think in advance about a schema that will fit all the records you want to put in that particular set, because every single item can have its own properties.

Another interesting point is that NoSQL products will generally be a little bit more relaxed when it comes to ACID compliance. With ACID, we mean atomic, consistent, isolated, and durable, which are properties that most relational databases will try to guarantee. With NoSQL databases, the vendors are generally more concerned about performance and making sure that the data can be easily distributed across multiple nodes. So there are trade-offs: for instance, NoSQL vendors will often give up on strong consistency in favor of eventual consistency, so that it's easier for them to distribute the data in a durable way across multiple partitions.

And the final point is that with NoSQL databases, you generally worry a lot less about storage cost. This is probably because they are a much more modern generation of databases, and storage has become far less of a problem since the 70s. So there is a lot more freedom to use storage in ways where you might end up duplicating data, but in exchange you might be able to access that data much faster for some access patterns.

If you want to think about the simplest NoSQL database you can imagine, just think about key value storage. Imagine you have a map in any programming language where you can store key value pairs. The key is the thing that allows you to access a record uniquely, and inside the value you can store complex objects of any kind, with multiple attributes, and they can all be different from record to record.

It's also worth mentioning that this is not necessarily the same thing as a document database. You might have heard of MongoDB, which is generally classified as a document database: document-oriented databases are more of an extension of the key value concept. They tend to have a little bit more structure and a more expressive query language. So when we talk about NoSQL, and especially in the context of DynamoDB, we are talking about something that can be a little bit simpler than products like MongoDB. Hopefully that gives you a good introduction to the world of NoSQL, what we mean by the term, and how it compares with relational databases. So let's now talk specifically about DynamoDB. Eoin, where do we start?

Eoin: Yeah, let's start with some of the terminology and concepts around DynamoDB so that we can take the discussion from there. Just like SQL databases, you start with a table. This is the primary unit: you don't really create databases in DynamoDB, you create tables. And that's your starting point. Within tables, you're going to be storing items. An item is the term used to refer to essentially a row containing a key and attribute values. Then we have keys, and a key is a less trivial concept. As we mentioned, DynamoDB is a type of key value store, and every record is identified by a key. There are two types you can use: a simple key or a composite key. That's the primary key uniquely identifying an item in the table. A simple key has a hash key only, which is also known as a partition key. So it's probably a good idea to understand both the term hash key and also partition key.

And then if you're using a composite key, you'll have that same hash key, but you'll also have a range key, which is also known as a sort key. When you're writing or reading DynamoDB data, you'll always use the partition key. If you have a sort key as well, so if you are using composite keys, you'll need to specify it when writing data, but you don't necessarily have to specify it when you're reading. We'll talk a little bit more about all that later on.
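As a rough sketch of what this looks like in practice, here's a table with a composite primary key, defined through the JavaScript SDK v3 (the table and attribute names here are hypothetical, just for illustration):

```typescript
import { DynamoDBClient, CreateTableCommand } from "@aws-sdk/client-dynamodb";

const client = new DynamoDBClient({});

// A hypothetical "Orders" table with a composite primary key:
// "HASH" is the partition (hash) key, "RANGE" is the sort (range) key.
await client.send(new CreateTableCommand({
  TableName: "Orders",
  AttributeDefinitions: [
    { AttributeName: "customerId", AttributeType: "S" }, // S = string
    { AttributeName: "orderDate", AttributeType: "S" },
  ],
  KeySchema: [
    { AttributeName: "customerId", KeyType: "HASH" },  // partition key
    { AttributeName: "orderDate", KeyType: "RANGE" },  // sort key
  ],
  BillingMode: "PAY_PER_REQUEST", // on-demand capacity (more on this later)
}));
```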

So that's your key. Then your value is composed of attributes. You can have multiple attributes in each item, as you mentioned already, and an attribute has a name and a value. It's a little bit different to the document storage option, or a simple key value storage option, where you have a single value or a document with awareness of its structure. In DynamoDB, you can have multiple attributes, and each of those could be like a little document, but in a very unstructured way. There's a number of different types supported, and it's worthwhile understanding what they are, especially when it comes to the multi-value types. The simple value types, the literals you can store, are string and number. You can also store binary data as a type. Then you have lists, maps, and sets. If you just write JSON-type data into an attribute, that's a map; a list is an array; and then you have sets, where you don't get duplicate values. Three different types of sets are supported: string sets, number sets, and binary sets. We'll talk a little bit more about how you use those. There is also a null type that's quite rarely used.

Now, it is important to note that unlike a lot of databases, the maximum item size in DynamoDB is 400 kilobytes. And this is per item, right? For the whole record, not just per attribute. You might think this is small in comparison to something like Cassandra or MongoDB, which let you store gigabytes in a record, but there are a lot of limitations like this in DynamoDB, and they are there for a very good reason: they help to deliver on the massive performance and scalability guarantees that the service provides. So working with DynamoDB sometimes feels like working with a low level database engine, because it's strict about giving you a limited set of features so that it can give you those guarantees in return. And if you want more than 400 KB per item, a common approach is to offload the bulk of the data into an S3 object and keep a reference to it in the item. So maybe before we start diving into more technical details, let's go up a level. Why and when would you use DynamoDB?
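A quick aside before that: this is roughly how those typed attribute values look on the wire when you use the low-level API. The item below is entirely hypothetical, sketched from the types just listed:

```typescript
// Low-level DynamoDB representation: every value is wrapped in a type tag.
const item = {
  pk: { S: "customer#123" },                   // S: string
  orderCount: { N: "42" },                     // N: number (sent as a string)
  avatar: { B: new Uint8Array([1, 2, 3]) },    // B: binary
  tags: { SS: ["prime", "gift"] },             // SS: string set (no duplicates)
  address: { M: { city: { S: "Dublin" } } },   // M: map (a nested document)
  items: { L: [{ S: "book" }, { S: "pen" }] }, // L: list (an array)
  deletedAt: { NULL: true },                   // NULL: the rarely used null type
};
```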

Luciano: I would say that one of the main reasons to consider DynamoDB is that it's so nice and easy to get started with. You can create a table in a matter of seconds: either with a few clicks in the console, or by writing a few lines of YAML in your CloudFormation template, and you have a table that you can use to store data, read it, write it, and build an application on top of. And this is something that I haven't seen in any other database, even among managed services.

Relational databases especially, even managed ones, take longer to get started with. So for quick projects, it's definitely a database to consider. It can also be very cost effective, especially when you don't have a data intensive application, when you don't expect to be reading and writing all the time, or when you don't expect large volumes of data. And the main idea is that it's kind of a serverless database, at least from the pricing perspective: if you don't use it, you don't have to pay anything.

In reality, there are different billing modes, but again, we can think about it as a serverless service. They try to provide you with an interface where the more you use it, the more you pay, and if you don't use it, the cost is going to be very limited. This can be a really big advantage, for instance, if you are a startup, where at the very beginning you are going to have very limited traffic as you try to build your first MVP. Then, if your product is successful, it grows, and of course your bill is going to grow with the success of your platform. It's also very well integrated with other AWS services. That's another interesting point. For instance, we can talk about DynamoDB Streams, which is a nice way to get all the changes that happen in a DynamoDB table and stream them for real-time processing, for instance to Lambda. This allows you to do change data capture, and you can do all sorts of interesting things with it.
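As a minimal sketch of that change data capture idea, here's what a Lambda handler consuming a DynamoDB stream can look like (assuming a stream configured with the NEW_AND_OLD_IMAGES view type; everything else is hypothetical):

```typescript
import { DynamoDBStreamHandler } from "aws-lambda";

// Invoked with batches of change records whenever items in the table change.
export const handler: DynamoDBStreamHandler = async (event) => {
  for (const record of event.Records) {
    // eventName is INSERT, MODIFY, or REMOVE
    if (record.eventName === "MODIFY") {
      // Images arrive in the low-level format ({ S: ... }, { N: ... }, ...)
      console.log("before:", record.dynamodb?.OldImage);
      console.log("after:", record.dynamodb?.NewImage);
    }
  }
};
```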

Also, you can get very fine-grained access control, because it's a native AWS service, so to speak. You can use IAM policies to a very fine level of detail: you can control not just which tables, but which records different roles can actually access. And this can allow you to do very cool things. For instance, if you're building a multi-tenant application, you could limit a Lambda to be able to read only the records that belong to a specific tenant and not the other ones, which can be really beneficial if you're trying to achieve some kind of SOC compliance, or if you just want to be sure you're not going to leak data across tenants by accident. This would be very difficult to achieve at this level with relational databases that are not so well integrated with the rest of the AWS ecosystem. A few more things are worth mentioning. DynamoDB scales massively. After all, we have to remember that DynamoDB was built for Amazon, to solve the problems they were having with their own massive e-commerce platform as it grew in popularity, and it powers the entirety of the Amazon infrastructure today. So you can imagine that if you can build something as complex and as big as Amazon with it, there is a level of scale there that is not trivial to achieve with other technologies.
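A sketch of what that per-tenant restriction can look like in IAM: the dynamodb:LeadingKeys condition key limits access by partition key prefix. The table name, key convention, and principal tag below are hypothetical; the policy is shown as a TypeScript literal just to keep one language throughout:

```typescript
// IAM policy statement: the role can only touch items whose partition key
// starts with its own tenant prefix, resolved from a principal tag.
const statement = {
  Effect: "Allow",
  Action: ["dynamodb:GetItem", "dynamodb:Query", "dynamodb:PutItem"],
  Resource: "arn:aws:dynamodb:eu-west-1:123456789012:table/MultiTenantTable",
  Condition: {
    "ForAllValues:StringLike": {
      "dynamodb:LeadingKeys": ["TENANT#${aws:PrincipalTag/tenantId}"],
    },
  },
};
```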

It can also be very simple to use, of course, for simple use cases. If your access patterns are pretty much key value based, then it is very simple: you store your data by key, you read your data by key, and it's super easy to get started. If you need very low latency, DynamoDB is one of the best databases out there for that. It has very consistent, low latency responses: for example, they promise you single digit milliseconds when you do a GET operation.

So when you access a key exactly, you get a single digit millisecond response, which is amazing. And that's very consistent, regardless, for instance, of the size of your dataset. This makes it a great candidate when you're building web applications and you want to make sure your users get very snappy responses, so there is a feeling that the application is very responsive, or for other use cases where you need to guarantee that access to the data is as fast as possible. But now let's talk very quickly about when you might not want to use DynamoDB, because of course it's not a silver bullet that is going to solve all your problems. The main case that comes to mind is when you need flexible querying capabilities, and this is actually very common for startups.

Earlier we said that DynamoDB is really good for startup environments because the pricing dynamics will scale with the growth of your company. But on the other hand, you have to consider that DynamoDB requires you to understand really, really well the way you need to access the data. And when you're building a startup, sometimes you need to pivot multiple times before you figure out exactly what your product market fit is, what the product is that really solves a problem for your customers. So you're going to go through significantly different iterations of your product, and as such, you're probably going to change the way you use the data in your database multiple times. DynamoDB might not be the best database for that. It will require a lot of hard work to keep adjusting the structure of your database and the way you query the data. Something like a relational database might be much more suitable for that kind of thing, because with the flexibility of SQL, as long as you keep your data normalized, you can easily adjust for different access patterns. So it's definitely worth considering: if you don't think you understand your current and future access patterns well, DynamoDB might create a little bit of friction for the evolution of your product.

Another reason you might not want to use DynamoDB is when you need to integrate with other systems that expect a SQL interface. DynamoDB is not going to give you a SQL interface, or at least not a traditional one. So it will definitely make your life much harder if you need to integrate with something that expects SQL as a language. Another case is when you have data that is highly relational by nature, so you really need features like joins to combine data across multiple tables. That's not something that is supported in DynamoDB natively, so you would need to do your own joins with code inside your application, and that is not going to be very efficient and is going to be very tricky to do well and to scale.

And finally, if you, for whatever reason, need to manage the database yourself, like hosting it and running it in your own data center or even inside your own cloud account, DynamoDB doesn't offer that option. DynamoDB is only available as a managed service. Amazon will give you a single node local version that you can run yourself, but that's meant to be used only for local testing and development, not in production. Now, I think at this point it might be very beneficial to explain a little bit more about how DynamoDB works, because I think that is going to demystify why there are so many constraints, but at the same time also why DynamoDB can be so effective and performant for certain use cases.

Eoin: DynamoDB data is stored in partitions. You might have guessed this already, since we mentioned that data needs a partition key, that hash key we referred to earlier. When you provide the value for your primary key, or at least the partition key part of it, that key is going to be hashed by DynamoDB, and the hashed value is used to route the request to the server nodes where that partition, or shard, of data is stored. And it's this partition model that gives DynamoDB its infinite scalability. Each partition has a primary node that will handle writes, just like many other databases, but it will also have two secondary nodes. And for data to be written, it has to be written by the primary and at least one secondary node.

The third node can then be updated asynchronously, which gives you better performance on writes. But what this means is that because any of these nodes can handle reads, and only one of the secondaries is updated synchronously, you might end up reading from a node that doesn't have the latest data. This is part of DynamoDB's default mode of eventual consistency. If this trade-off is a problem for you, there is a way around it: you can explicitly request strong consistency when reading. That may take a little longer, because it has to wait for the third node to acknowledge, and you will pay an increased price for this mode, essentially double, based on the billing model, which we'll explain a little later. And that billing model, the pricing model of DynamoDB, is very tied into its performance and scalability: when you write, you consume a write capacity unit, and when you read, you consume a read capacity unit. So, WCUs and RCUs.
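To make that concrete before we get into the numbers, here's a minimal sketch of the two read modes, reusing the hypothetical Orders table from earlier (document-style client; the RCU figures in the comments follow the pricing rules described just below):

```typescript
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, GetCommand } from "@aws-sdk/lib-dynamodb";

const docClient = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Default read: eventually consistent. It might miss a write that landed
// milliseconds ago. A 3.5 KB item costs 0.5 RCU (4 KB per RCU, then halved).
const eventual = await docClient.send(new GetCommand({
  TableName: "Orders",
  Key: { customerId: "customer#123", orderDate: "2024-05-17" },
}));

// Strongly consistent read: reflects all acknowledged writes, but the same
// 3.5 KB item now costs a full 1 RCU, i.e. double the price.
const strong = await docClient.send(new GetCommand({
  TableName: "Orders",
  Key: { customerId: "customer#123", orderDate: "2024-05-17" },
  ConsistentRead: true,
}));
```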

One RCU allows you one strongly consistent read, or two eventually consistent reads, per second, for items up to four kilobytes. And one WCU allows you to write one item per second of up to one kilobyte. You have two pricing options. You've got provisioned mode, where you can say, okay, I'm going to need 500 RCUs and 500 WCUs, and then you pay a fixed amount per hour for as long as the table exists. The newer mode, which is more serverless, is on-demand capacity. That will scale the WCUs and RCUs up and down for you. If you don't use them, you don't pay; but if you do use them, the cost is generally higher than provisioned capacity. So you need to measure your own workload and decide which one works. Generally, we'd say start with on-demand capacity, measure how much you're using, look at your bill, and optimize accordingly. And the good news actually is that, I think just last week, AWS released a new feature which allows you to cap the maximum on-demand capacity, so you can manage that maximum cost and don't have to lie awake at night worrying about it.

Now, when we talk about partitions, you might have heard the concept of hot partitions, especially if you're reading older blog posts or content, where your throughput could suffer if you didn't evenly distribute the partition keys across your whole data set. If you do read anything like that, don't worry, because Amazon added an adaptive capacity feature a few years ago that automatically solves that for you: they'll manage capacity according to the load on the partition keys on different nodes. But it is still important to note that each partition does have a maximum throughput: 3,000 RCUs or 1,000 WCUs. So if you are going to have a lot of traffic, you should make sure that you're not just using a small number of partition keys. That will ensure you get consistent performance across all of your data. Partitions are basically the fundamental concept to understand, and we've talked about strong consistency and eventual consistency. So let's talk more practically: how do you get started, and what do you do to start using DynamoDB?

Luciano: If you're used to more traditional relational databases, one thing that might be surprising about DynamoDB is that it doesn't use something like an ODBC or JDBC type of connector. Instead, you just make HTTP requests. So in a way, it's like you have a web API to interact with when you use DynamoDB. In reality, you rarely want to use the web API directly: you will be using the AWS SDK, which of course abstracts all of that communication in a much nicer way.

When it comes to the SDK, there are actually two different types of client, and this is something that can sometimes be a little bit confusing, but the idea is that you have clients at two different levels. You have the main DynamoDB client, where you still need to understand a little bit of the protocol when it comes to specifying the types of the values that you read and write. When you use the Document client instead, things are a little bit more integrated with your programming language of choice: it can infer the types directly from the values you express in your programming language, and it does an implicit conversion behind the scenes for you. So if you are putting, for instance, a string into an attribute, it's going to automatically create the correct shape of object that the underlying client expects, saying that that value is to be persisted in DynamoDB as a string and not, for instance, as another data type. Generally speaking, I would recommend using the Document client, because it will make your life a little bit easier and abstract some details of the underlying protocol that you don't necessarily have to worry about. Let's talk a little bit more about how you write data and what options you have there. All the write actions, as you mentioned before, Eoin, force you to provide the full primary key. So you need to explicitly say: this is the primary key that I'm going to use to store this particular record or to update a particular record.
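Here's a sketch of the same write at the two client levels; the type wrapping is exactly what the Document client automates (table and attribute names are made up):

```typescript
import { DynamoDBClient, PutItemCommand } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, PutCommand } from "@aws-sdk/lib-dynamodb";

const client = new DynamoDBClient({});

// Low-level client: you spell out the type of every attribute yourself.
await client.send(new PutItemCommand({
  TableName: "Users",
  Item: {
    pk: { S: "user#42" },
    name: { S: "Ada" },
    logins: { N: "7" }, // numbers travel as strings on the wire
  },
}));

// Document client: plain language values, type tags added behind the scenes.
const docClient = DynamoDBDocumentClient.from(client);
await docClient.send(new PutCommand({
  TableName: "Users",
  Item: { pk: "user#42", name: "Ada", logins: 7 },
}));
```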

One interesting thing, and this is a pain point I've hit a few times in the past, is that you have no way to do a query such as: update all the records where this particular clause is true. You need to read and write based on a primary key that you know in advance; you cannot just use arbitrary attributes for that. Now, when it comes to writing, you have a few different operations available. The first one is put item, where you are either creating or overwriting a single item. Then you have update item, which again either creates or updates an item. You can specify a subset of attributes in this case, and you can also use this operation, for instance, to delete existing attributes from an existing record. And you can use it in interesting ways: for instance, if you have an item that contains a set or a map, you can insert or remove data from the underlying sets and maps that exist in the attributes of your item.

And finally, and this is actually very common, I've seen it a few times when using DynamoDB: if you have counters, you can use update item to just say, increase by one. If you consider that this is a distributed database, you don't want to read the data first, increase it by one in your code, and then store the data again, because you might have that operation happening simultaneously in a number of different concurrent executions, and your increments might overwrite each other. So it's better to let the database do the increment for you, because that way it can be done consistently. Then, of course, we have the delete item operation. And it's important to know that you can also do batch writing. For instance, if you need to insert lots of data into DynamoDB, maybe you are loading fixture data that you need in your application, you can do that, but there are limits: you can write up to 25 items, if I'm not wrong, in a single batch. So you need to create multiple batches according to how many items you need to write.

You can also use transact write items, which allows you to write data as part of a transaction. Another thing is that when you write something into DynamoDB, so when you do an update operation, you might be interested in receiving a response from DynamoDB, and you can actually specify what kind of response you want to get back. If you don't really care, you can just say: no, just write the data, I don't care about the result of the operation. But when you're doing updates, it can be very interesting to know what was updated. So you have options like all old, updated old, all new, or updated new, which allow you to select a subset of the data that was actually updated and compare it with the previous data. Going back to the case of the counter: if you say, increase this particular attribute by one, you don't necessarily know what the final value is going to be, because maybe you never read the value in the first place, or maybe the value you have in memory in your program is outdated because meanwhile there have been other increments from other concurrent executions. By selecting one of these options, you can get the new value back as the response of your update operation. That can be convenient, for instance, if you are building some kind of counter and you want to know the most recent count in your program.
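Here's roughly what that atomic counter looks like, reusing the docClient from the earlier sketch (table and attribute names are hypothetical). The increment happens server-side, and ReturnValues hands back the fresh value:

```typescript
import { UpdateCommand } from "@aws-sdk/lib-dynamodb";

// Atomically add 1 to the counter on the server: no read-modify-write race.
const result = await docClient.send(new UpdateCommand({
  TableName: "PageViews",
  Key: { pk: "page#home" },
  UpdateExpression: "ADD #views :one",
  ExpressionAttributeNames: { "#views": "views" },
  ExpressionAttributeValues: { ":one": 1 },
  ReturnValues: "UPDATED_NEW", // return only the attributes we just changed
}));

console.log(result.Attributes?.views); // the count right after this increment
```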

Another thing is that you can add condition expressions. When you write, you can say: write this record only if certain conditions hold, and don't write it if those conditions are not satisfied. This can be useful if you want to guarantee data integrity. For instance, you might want to create a new user in a table, and you want to make sure that there is only one user with a given email. Again, considering that you might have concurrent executions of your program in different environments, maybe different Lambdas, it's not unlikely that two very similar requests arrive from two different Lambdas in a very short amount of time, for instance if a user submits a form twice by mistake, and you might end up creating two users with the same email. By using a condition expression, you can say: don't create this user if the email already exists in the primary key of another record. Going into queries, you have different ways to query your data. The simplest one is probably get item.
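As a sketch of that duplicate-user guard (assuming a hypothetical table where the email is the partition key, and reusing the docClient from before):

```typescript
import { PutCommand } from "@aws-sdk/lib-dynamodb";

try {
  await docClient.send(new PutCommand({
    TableName: "UsersByEmail",
    Item: { email: "ada@example.com", name: "Ada" },
    // Only write if no item with this partition key exists yet.
    ConditionExpression: "attribute_not_exists(email)",
  }));
} catch (err: any) {
  if (err.name === "ConditionalCheckFailedException") {
    // A concurrent (or duplicate) request already created this user.
  } else {
    throw err;
  }
}
```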

Another option is a scan, which basically allows you to iterate over the entire table. This is generally a very niche use case. You rarely need to do that, and if you find yourself doing it, you should probably think twice, because it's not always a good idea. Unless you really know what you're doing, try to avoid scans as much as possible. The main reason is that, especially with a large data set, a scan might take a very long time to complete, and it's also going to be very expensive for you. So if you find yourself using a scan, make sure you know what you're doing, and if you have other options, probably go with the other options. Then, of course, we have the concept of a query, where you might want to retrieve multiple records together, but it is still somewhat tied to the partition key: you can query only for a given partition key and then filter down to a subset of records using your sort key, when you have a composite key. You can have expressions such as equality, begins with, and between, but you cannot use the more generic expressions you might find in SQL, for instance the LIKE operator, and you cannot even do ends with.

So, depending on the type of queries you expect to run, you need to be very careful in structuring your keys so that the query operation allows you to express the queries you need. You can also use filter expressions, which are a little bit more flexible: they can be applied to any attribute, not just the partition key and the sort key. But these filters are a little bit funny, and they work in a way that you might not expect the first time you use them. If you're used to SQL, the filtering happens at the database level, where the database only returns the data that matches your conditions and ignores everything else. Here with DynamoDB, when you use filter expressions, you are still reading (and paying for) all the data matched by the key condition; the filter just discards the records that don't match before they are returned to you. Finally, each query has a one megabyte read limit. If you need to read more, you need to use pagination. Thankfully, the SDK these days makes that much easier than it used to be; especially in dynamic programming languages like JavaScript, you can use async iterators, and that makes it a relatively easy experience to go through all the different pages. But of course, you need to be aware that you are making multiple requests to DynamoDB, multiple HTTP requests, so the more data you read, the more time it's going to take to read the entire dataset. Now we should probably talk a little bit about indices, because that's another interesting topic in DynamoDB, and it's something that can allow for other access patterns.
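Before moving on to indexes, a sketch pulling those query pieces together: a key condition on the composite key, a filter expression, and async-iterator pagination via the SDK's paginator helper (the hypothetical Orders table again; the amount attribute is made up):

```typescript
import { paginateQuery } from "@aws-sdk/lib-dynamodb";

// Key condition: exact partition key, prefix match on the sort key.
// The filter runs *after* the read, so you still pay for filtered-out items.
const params = {
  TableName: "Orders",
  KeyConditionExpression: "customerId = :c AND begins_with(orderDate, :month)",
  FilterExpression: "amount > :min",
  ExpressionAttributeValues: {
    ":c": "customer#123",
    ":month": "2024-05",
    ":min": 100,
  },
};

// Each page is a separate HTTP request, capped at 1 MB of data read.
for await (const page of paginateQuery({ client: docClient }, params)) {
  for (const order of page.Items ?? []) {
    console.log(order);
  }
}
```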

Eoin: We've kind of talked about one type of index so far, because we mentioned primary keys, but you can actually add additional keys to support querying by fields that are not in the primary key. As we've said, and as you've just walked through in the query semantics, you need to specify the partition key, and if you want to do more granular filtering, you need to use a key expression. But what about different access patterns?

What if you need to query by something else entirely? Well, that's where indexes come in, and in this case they're called secondary indexes. There are two types of secondary index. There's the local secondary index, which is stored on the same partitions as your table data, and because of that, its partition key is always the same as your primary key's partition key; only the sort key is different. Then you have global secondary indexes, which are stored separately, and they can have a different partition key and sort key. For that reason, they're a lot more common.

Local secondary indexes have a few more limitations, and they also share the table's capacity, whereas global secondary indexes have their own capacity. When you hear people talking about global and local secondary indexes, because those names are a bit of a mouthful, they'll normally say GSI and LSI. Any kind of secondary index allows you to retrieve item attributes from a set of related records by different keys. You can imagine having a DynamoDB table that stores customer orders, where you'd normally retrieve them by customer ID, with maybe the date as the sort key so that you can filter by date. But you might also want to retrieve by product ID and amount.

So you could put product ID and amount in as a separate global secondary index. One of the cool things about indexes is that they can actually be sparse. What does that mean? Well, if the attributes in your index aren't present in an item that you're inserting into the table, your index doesn't need to store that record at all. So the volume of data in an index can be much less than in the table itself. Because of that, indexes can actually be used like a materialized view, or a filter on data, because the index is already pre-filtered based on whether those attributes are present or not. That's quite a common pattern for GSIs. You can also use indexes to store different but related entities together in the one table. We talked about storing customer orders, but what if you wanted to store customers, orders, and products, and query them together? You can actually do that in DynamoDB, and you do it by overloading the partition key and sort key values so that you can query the entities individually, and by using more indexes to support more and more query patterns. This approach is called single table design.

It typically means having a naming convention in your keys, like a partition key with a syntax like CUSTOMER#, followed by a customer ID, and then maybe in your sort key you'll have ORDER# followed by an order ID. And then you might have a separate product ID attribute, which is used in a secondary index to query by product. It's a total shift from the simplicity of the default DynamoDB approach, and it borrows relational modeling from relational databases, but it allows you to get the best of both worlds, with some trade-offs: you can actually implement relational designs this way. And this all came about, well, it's been around for a while I guess, but it was popularized by Rick Houlihan, who used to work at AWS advising all of the amazon.com teams on how to do this. He gave a series of very famous re:Invent talks describing advanced DynamoDB modeling, and this gave a lot of momentum to the idea of single table design.
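A hedged sketch of what such a single-table layout can look like; the key convention and the GSI named here are illustrative, not the canonical design:

```typescript
// One table, several entity types, keys overloaded with a naming convention.
const customer = {
  pk: "CUSTOMER#123",
  sk: "PROFILE",
  name: "Ada Lovelace",
};

const order = {
  pk: "CUSTOMER#123",         // same partition as its customer...
  sk: "ORDER#2024-05-17#987", // ...so one query fetches customer + orders
  productId: "PRODUCT#42",    // feeds a GSI for the product access pattern
  amount: 250,
};

// A hypothetical GSI (say, "byProduct") with productId as its partition key
// would answer "all orders for product 42" without touching the main keys.
// Note the GSI is also sparse: items without a productId never appear in it.
```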

I remember seeing this talk and thinking, wow, this is amazing, but I had to watch it a few times to really understand it because it's kind of mind-meltingly high speed and deep dive, like the most level 400 talk I've seen. And then Alex DeBrie gave a much more accessible guide on it in his great DynamoDB book. Yeah, he has got a lot of great content around DynamoDB, so much so that I'm surprised they haven't renamed it to DynamoDeBrie at this point!

So the fundamental idea with single table design is that if you know your access patterns ahead of time, you can design your DynamoDB table, indexes, and keys to store all of this related data together, so that it can be queried together. It allows you to do all this relational modeling while still gaining from the performance and scalability of DynamoDB. Unfortunately, it's not really very easy to grasp and do well. I'm still afraid of it, to be honest. Even if you do it, it can be difficult for others on your team to understand and troubleshoot when they join. I've even seen single table designs which I implemented and understood myself, and then gone back to a few months later and thought: what is this schema? I can't remember how this is modeled. People have tried to provide tooling around it to make it easier, and that has helped with designing it, but I still don't see a great solution to ultimately making it accessible and understandable for everybody. And as we mentioned, you need your access patterns well documented and understood ahead of time, so if they change, you need to be able to plan and execute a schema change and a migration later. So it's not a silver bullet, and while it looks really cool and is very appealing, I would tend to say: don't get caught up in it and don't worry about it too much. What do you think, Luciano?

Luciano: I mostly agree with what you said there. The only thing I can add is that I've found it might help a little bit if you abstract all of that stuff in your code, meaning that you use something like the repository pattern: you have a code layer where you can just say, give me all the, I don't know, products, or give me what's in the cart for this customer, and behind the scenes you have abstracted all the necessary logic to integrate with DynamoDB. From a team perspective, that may make things a little bit easier, because you are not necessarily required to go and look under the hood to see exactly what's happening with DynamoDB. But of course, as you say, if you eventually find yourself needing to change the data structure to accommodate different access patterns, then somebody will need to be able to touch that layer and make the necessary adjustments. So this is not necessarily a silver bullet either; it's just good coding practice to create abstraction layers that can be more accessible to a larger group of people in the team. So that's something else to consider if you do find yourself using the single table design pattern and you see value in it, and there is definitely value: it can be one of the practices that makes your life as a team a little bit easier.
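A minimal sketch of that repository idea, reusing the docClient and the single-table key convention from the earlier sketches (the interface, table name, and helper are hypothetical; the DynamoDB details stay behind the abstraction):

```typescript
import { QueryCommand } from "@aws-sdk/lib-dynamodb";

// The rest of the codebase talks to this interface, not to DynamoDB.
interface Order {
  orderId: string;
  amount: number;
}

interface OrderRepository {
  getOrdersForCustomer(customerId: string): Promise<Order[]>;
}

// The single-table key conventions live in exactly one place.
class DynamoOrderRepository implements OrderRepository {
  async getOrdersForCustomer(customerId: string): Promise<Order[]> {
    const result = await docClient.send(new QueryCommand({
      TableName: "AppTable",
      KeyConditionExpression: "pk = :pk AND begins_with(sk, :prefix)",
      ExpressionAttributeValues: {
        ":pk": `CUSTOMER#${customerId}`,
        ":prefix": "ORDER#",
      },
    }));
    return (result.Items ?? []) as Order[];
  }
}
```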

And I think at this point we've covered enough ground when it comes to DynamoDB. This was a longer episode than we generally do, and hopefully you enjoyed it anyway. We tried to share as much as we could about the basics of DynamoDB: what it is, how it compares with relational databases, how you use it, even up to the single table design pattern. And if you're deciding whether to use DynamoDB, don't forget that relational databases are still pretty ubiquitous.

In a way, if you're using AWS, it makes sense to adopt DynamoDB, but you always need to look at your requirements and make it a conscious decision. There are definitely many advantages to using DynamoDB, but we can say the same about traditional relational databases and SQL. So don't feel like you are missing out if you prefer to use a relational database rather than DynamoDB; there are still many ways to use relational databases and make them scale in the cloud, even at very high scale. We would love to hear from you if you're using DynamoDB, if you totally ditched relational databases, or if you still feel more attached to relational databases than to DynamoDB. And we'd like to hear the stories you might have, if you have any scars from DynamoDB or any scars from relational databases. It would be nice to put these ideas into context, because context is really the key here: it's not that one technology is better than the other, but different use cases might be more suitable for different technologies. With that, we will leave you some additional resources in the show notes. We will link the DynamoDB book we mentioned by Alex DeBrie, as well as Alex's podcast and YouTube channel, where you can find additional content, and we will share some of the Rick Houlihan talks we mentioned. So thank you very much for being with us, and we look forward to seeing you in the next episode.