Help us to make this transcription better! If you find an error, please
submit a PR with your corrections.
Luciano: Not so long ago, we did a deep dive into Amazon Aurora, and we talked about all the non-trivial things that you need to do to set it up. If you missed that episode, it's episode 122, so check it out. But today, we're going to be talking about something a little bit different, because one of the things we said about Aurora Serverless is that it doesn't necessarily reach what we would call the gold standard of serverlessness, the one we think of when we think about a data store like, for example, DynamoDB. That changed recently when Amazon challenged themselves and released a new database called Aurora DSQL.
This happened at re:Invent last year, and just a few weeks ago, this shiny new database became generally available. We finally tried it out, and we found where it excels and what its limits are, and today we're going to talk through all of this, what we found out, and a lot more. We've got a fully working code example that we will be sharing during this episode, and we'll also talk about how we stress tested it and did some performance and scalability measurements. So welcome to another episode of AWS Bites. My name is Luciano, and as always, I'm joined by Eoin, so let's get started.
AWS Bites is brought to you by fourTheorem. Stay tuned until the end of this episode to find out more about fourTheorem. So let's start with a little bit of an introduction. When you're choosing a database, there are always a few questions that you end up asking yourself. The first one might be, for example, scalability. How will this database scale both up and down to handle changing loads, for example?
Maybe you're building a small project, like a startup idea that you want to experiment with. Maybe you're lucky and it becomes very successful. So what happens to the database? Is it something you need to worry about, or is it going to scale automatically to handle the success of the new application? The other aspect might be cost. How much is it going to cost? And again, this is related to scalability: how does the cost change as you scale? What is the billing model? Availability is another question.
What happens if your database is not available all the time? What kind of mechanism do you need to put in place? Or is it something you can rely on without having to worry too much or do any kind of intervention yourself? Complexity is another element. How much stuff do you need to manage, set up, care for, and maintain over time? SQL versus NoSQL is probably one of the biggest questions that most developers ask themselves, because it really changes the way you build the application, the kind of data model you can store, and the type of mechanisms you have to deal with the data. So that's always a very important trade-off that you need to figure out, depending on the kind of application you are building. DSQL, interestingly enough, is a SQL database. It's a distributed SQL database, and it's designed to make these choices a little bit easier than other alternatives, especially when it comes to scale, availability, and complexity. We'll get to cost a little bit later in this episode. So how does DSQL make scale, availability, and complexity a little bit easier than the alternatives we have seen so far in AWS? The first thing is that DSQL scales automatically, and virtually infinitely, meaning that effectively you don't have to worry. You can start very small, and then as your application grows, as your user base grows, DSQL should automatically manage all the scaling of storage, compute, and querying separately. It also comes with strong availability guarantees, so that's another aspect you shouldn't have to worry about. If you create a multi-region cluster, you get five nines of availability; if you create a single-region cluster, you get four nines. But still, I think that's a pretty good number, given the fact that you don't have to worry about managing pretty much anything. And what we found out in our experiments is that it is by far the simplest SQL database to set up. It's pretty much as simple as creating a DynamoDB table, and you have a cluster up and running. So effectively, there is almost no configuration. There is no patching. There is no maintenance. The only setup you might need is to add multi-region support and backups, for example. So I would say it scores very high on our serverless scale. Another interesting thing that might be relevant for you as a developer is that it is Postgres compatible. I will put an asterisk there; we'll talk a little bit more about why throughout this episode. But the idea is that you can use any Postgres connector or client, and it should work for most operations. And yeah, as I said, there are some limitations and some trade-offs that you should be aware of and consider. But most things work, which means that generally you can test locally with a normal Postgres container, for example, and then when you want to test remotely, you can use DSQL. But yes, there are limitations. So Eoin, what are you going to tell us about that?
Eoin: So right now, DSQL is Postgres version 16 compatible. It uses the Postgres query parser, the planner, the optimizer, the type system, and the SQL dialect. And you get ACID transactions. A difference from most databases is that it uses optimistic concurrency control, where normally you would get a pessimistic locking approach. The optimistic approach lets transactions in a distributed environment like DSQL run without locking delays.
Checks for conflicts then only happen at commit time. From a developer's point of view, this means that you should know when transactions may fail and build in retry support. They won't fail as early as you might be used to. If you watch some of the interesting deep dive talks from re:Invent, you'll know that Amazon uses their Time Sync Service to ensure that transactions are isolated. And we'll link to a video that goes into that detail.
Now, authentication is AWS IAM only, so this is one thing you need to be mindful of. You get an admin role for your cluster, and then you can create additional database roles with SQL, and you can associate those with IAM role ARNs. Then you use normal SQL grants to give specific permissions to these roles. So username/password authentication, Kerberos, anything else you might need, you won't get that.
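To make that concrete, here is a minimal sketch of what IAM-based access can look like from Node.js: generate a short-lived token with the DSQL signer, connect with a regular Postgres client, and map a database role to an IAM role with SQL. The cluster endpoint, role names, and IAM ARN are placeholders, and the exact AWS IAM GRANT syntax is worth double-checking against the current DSQL docs.

```typescript
// Minimal sketch: connect to DSQL with an IAM auth token and set up a database role.
// Endpoint, role name, and IAM role ARN are placeholders.
import { DsqlSigner } from "@aws-sdk/dsql-signer";
import { Client } from "pg";

const endpoint = "<cluster-id>.dsql.eu-west-1.on.aws"; // placeholder cluster endpoint

async function main() {
  // Generate a short-lived IAM auth token for the admin role
  const signer = new DsqlSigner({ hostname: endpoint, region: "eu-west-1" });
  const token = await signer.getDbConnectAdminAuthToken();

  const client = new Client({
    host: endpoint,
    port: 5432,
    database: "postgres",
    user: "admin",
    password: token, // the token is used in place of a password
    ssl: { rejectUnauthorized: true },
  });
  await client.connect();

  // Create a database role and associate it with an IAM role ARN
  // (syntax as per the DSQL docs; verify against the current version)
  await client.query(`CREATE ROLE app_user WITH LOGIN`);
  await client.query(
    `AWS IAM GRANT app_user TO 'arn:aws:iam::123456789012:role/my-lambda-role'`
  );

  // Then use normal SQL grants for table-level permissions
  await client.query(
    `GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO app_user`
  );

  await client.end();
}

main().catch(console.error);
```

From the application's point of view, the token is just a short-lived password, so connection code typically regenerates it whenever it needs to reconnect.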
Now, in terms of limitations, one of the big ones is that foreign key constraints are not supported. Obviously, if you're coming from DynamoDB, you don't get them there either. So you might not miss them. But it is one that will likely cause incompatibility issues with existing tools and ORMs. It's definitely possible to design a great application without foreign keys. But we've already experienced issues with ORMs and schema migration tools that rely on them by default.
So you can expect to do some work. It's also worth mentioning that there are a few downsides to foreign key constraints. So we recommend reading a PlanetScale article on the topic, which we'll link in the description below. PlanetScale is another SaaS database provider. I think we talked about Neon in the past, which was recently acquired by, I think, Databricks. DSQL is a little bit comparable to those options, I think, in terms of the ease of setup. Another limitation is that if you want to create indexes, which you probably will, you can't just do a create index statement like you might be used to. Instead, you have to run create index async, which creates a specific DSQL type of asynchronous index. But when you do that, you get a job ID that you can monitor since that index is being built asynchronously in the background. And again, when it comes to things like schema generation tools, schema migration tools, you might have to do some work to make sure that it's using the right syntax.
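To give a flavour of that flow, here is a sketch of creating an asynchronous index and checking on the background job. The table and index names are made up, and the sys.jobs view used for monitoring reflects our reading of the DSQL docs, so verify the exact names before relying on them.

```typescript
// Sketch: creating an asynchronous index on DSQL and checking the background job.
// Assumes `client` is an already-connected pg Client (see the earlier auth example).
import { Client } from "pg";

async function createItemsIndex(client: Client) {
  // CREATE INDEX ASYNC returns a job id instead of blocking until the index is built
  const res = await client.query(
    `CREATE INDEX ASYNC idx_items_list_id ON items (list_id)`
  );
  const jobId = res.rows[0]?.job_id; // column name assumed; inspect res.rows to confirm
  console.log("index build job:", jobId);

  // The job can then be monitored until it completes
  // (the sys.jobs system view is our assumption from the docs; verify the exact name)
  const status = await client.query(`SELECT * FROM sys.jobs WHERE job_id = $1`, [jobId]);
  console.log(status.rows);
}
```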
Luciano: Yeah, this is one of the most annoying things that I found so far, especially if you are used to using something like an ORM, where maybe indices are created automatically through your migrations when you're testing locally with a regular Postgres. This is one of the places where your code will need to start to diverge between a local environment with a simple Postgres and DSQL in production.
But I digress. There are other limitations we want to talk about. For instance, you cannot create materialized views or temporary views, so you can only create regular, non-materialized views. Similarly, temporary tables are not supported. You can only have one database in a cluster, but within the cluster you can actually have multiple schemas, so it's not that bad. No triggers or stored procedures either, so if you rely on those in an existing application that you're thinking of migrating, you need to think carefully, because those are features you'll need to live without, at least for the foreseeable future.
And you cannot create sequence columns (serial types), which is another common thing that many ORMs will do out of the box, so be careful about that if you use an ORM. Partitions are not supported either, which in fairness wouldn't make a whole lot of sense anyway, because DSQL probably manages all of that internally and automatically. And this is probably another big one: Postgres extensions are not supported.
So if you are a user of things like pgvector, it's not something you get here, so you'll need to figure out how to live without that if you want to use DSQL. The serial sequence types and the foreign key limitations can create problems, again, when you're using an ORM or a migration system, and it's one of those problems where, if a tool you're using gives you these features out of the box, you need to figure out how to do the same thing, or an alternative, in production when you ship your code to AWS. So that's probably the biggest friction point that we found so far while trying to build an application using tools like ORMs and migration tools. Let's get into our open source application, shall we?
Eoin: So we're going to link in the description below to our GitHub repo, where we've created an API-based application that you could deploy, and you can test it out for yourself. And there's quite a lot in it. The API itself isn't that complicated. It's just two, I think, entities in there. One which allows you to create lists, and in those lists you can create items. So there are two resources, and you get your usual post, put, patch, delete, and get endpoints.
And the technology behind this is mainly Node.js and TypeScript based. So it's a Fastify API backend using Lambda and the Drizzle ORM. It's a Lambda-lith, a monolithic Lambda. You might be interested in that because we've talked about the pros and cons of those in the past, but it worked well for us in this case and made it quite simple. It uses Powertools for TypeScript.
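To give a flavour of what a DSQL-friendly Drizzle schema can look like, here is an illustrative sketch, not the exact schema from the repo: UUID primary keys with database-generated defaults instead of serial columns, and a plain list_id column instead of an enforced foreign key.

```typescript
// Illustrative Drizzle schema for the lists/items entities, adapted to DSQL's limitations:
// no serial/sequence columns and no foreign key constraints.
import { pgTable, uuid, text, boolean, timestamp } from "drizzle-orm/pg-core";

export const lists = pgTable("lists", {
  // gen_random_uuid() default instead of a serial column, since sequences aren't supported
  id: uuid("id").primaryKey().defaultRandom(),
  name: text("name").notNull(),
  createdAt: timestamp("created_at").defaultNow().notNull(),
});

export const items = pgTable("items", {
  id: uuid("id").primaryKey().defaultRandom(),
  // plain column; the list/item relationship is enforced in application code,
  // because DSQL does not support foreign key constraints
  listId: uuid("list_id").notNull(),
  title: text("title").notNull(),
  done: boolean("done").default(false).notNull(),
  createdAt: timestamp("created_at").defaultNow().notNull(),
});
```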
We've got an API Gateway REST API in front of it, and we've got endpoints to generate an OpenAPI specification automatically from the types. We have load testing built into this repo as well, and we'll talk a little bit more about that in a second. The whole thing is then deployed with CDK, so you can deploy the API pretty easily. There are two stacks, one of which creates the database cluster with DSQL. I think we mentioned before that it has almost no configuration.
In fact, I think it has zero configuration. There are no options when you create a cluster; the only thing you can do is add regions. Now, for load testing, we did a bit of thinking and explored what we could use for this. A lot of these load testing tools can take a lot of setup and that sort of thing, but we wanted to try and get some metrics so that we could give you some real data on how this works rather than just talking about our subjective perspective.
We ended up landing on k6, which is a tool from Grafana that does load testing. And it's a really simple tool to use. You can just write your test script in JavaScript, and it will run it in its own special optimized runtime for load testing. So we tested 1,000 virtual users over two minutes using k6. What this essentially means is that it'll create 1,000 concurrent sessions and just run them repeatedly for two minutes.
And this is all just from one machine, running it from the laptop. If you want to pay, you can have Grafana distribute this workload around the world for you and create a more realistic global load test, but we didn't go to that expense. We just ran it locally, and it was enough for what we needed. So each session is going to create lists, create items in lists, and do delete, put, and patch operations.
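The whole k6 script is only a few dozen lines; a sketch along these lines, where the base URL and routes are placeholders rather than the exact endpoints from our repo:

```typescript
// Sketch of a k6 load test: 1,000 virtual users hammering the API for two minutes.
// BASE_URL and the routes are placeholders for the deployed API Gateway endpoint.
import http from "k6/http";
import { check } from "k6";

export const options = {
  vus: 1000,      // 1,000 concurrent virtual users
  duration: "2m", // keep running each session repeatedly for two minutes
};

const BASE_URL = __ENV.BASE_URL || "https://example.execute-api.eu-west-1.amazonaws.com/prod";
const headers = { "Content-Type": "application/json" };

export default function () {
  // create a list
  const listRes = http.post(`${BASE_URL}/lists`, JSON.stringify({ name: "groceries" }), { headers });
  check(listRes, { "list created": (r) => r.status === 200 || r.status === 201 });

  const listId = listRes.json("id");

  // create an item in the list, then read the list back
  http.post(`${BASE_URL}/lists/${listId}/items`, JSON.stringify({ title: "milk" }), { headers });
  http.get(`${BASE_URL}/lists/${listId}`);

  // clean up
  http.del(`${BASE_URL}/lists/${listId}`);
}
```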
And then we had a look at the performance. So we have a few ways of measuring performance. We've got CloudWatch metrics, and we've got X-Ray tracing as well. And there were a couple of new features that we haven't tried in X-Ray before that we were able to avail of. So looking at the metrics that you get out of the box with DSQL, it doesn't give you anything like RDS Performance Insights, which would give you more detailed query performance.
But it will tell you how many transactions you have, how much compute it's using in terms of compute milliseconds, how many bytes are being written and read, and things like commit latency. It's all very interesting, but nothing particularly outstanding there. We could just see that we were getting tens of thousands, in fact hundreds of thousands, of transactions happening. But what we did was use Powertools tracing for TypeScript and X-Ray to capture the Lambda statistics, but also to instrument the DSQL queries as well.
So we instrumented the database driver, the Postgres database driver, so that we could see what query performance was like. And because we were using a monolithic Lambda, we took the approach of annotating traces with the route, basically, so we could see what the performance metrics were like at an individual request level.
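In practice, that instrumentation boils down to something like this sketch: annotate the trace with the route, and wrap each query in its own subsegment so query duration shows up separately from the rest of the handler. The route name and SQL are placeholders, and the Powertools Tracer and X-Ray subsegment calls are shown in simplified form.

```typescript
// Sketch: annotating traces with the route and capturing each DSQL query in a subsegment.
import { Tracer } from "@aws-lambda-powertools/tracer";
import { Client } from "pg";

const tracer = new Tracer({ serviceName: "dsql-api" });

export async function runTracedQuery(client: Client, route: string, sql: string, params: unknown[]) {
  // Make the route searchable in X-Ray / Transaction Search queries
  tracer.putAnnotation("route", route);

  // Open a subsegment around the query so its duration is recorded separately
  const parent = tracer.getSegment();
  const subsegment = parent?.addNewSubsegment("dsql-query");
  subsegment?.addAnnotation("route", route);

  try {
    return await client.query(sql, params);
  } catch (err) {
    subsegment?.addError(err as Error);
    throw err;
  } finally {
    subsegment?.close();
  }
}
```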
And one of the new features I mentioned in X-Ray is Transaction Search. This is essentially a way of getting OpenTelemetry-format traces into CloudWatch Logs. You then have a CloudWatch Logs Insights-style query interface in the AWS console, and you can run statistics on all of your traces, spans, and durations. And you can get all of those for 100% of your traces, not just a sample. So what we saw from that is that the Lambda runtime statistics varied quite a lot for different routes.
But we were getting, I think, an average of between 20 milliseconds and 100 milliseconds of latency for the Lambda function execution. Now, looking at the percentiles, for P95 we were seeing that some of them, like retrieving all lists, were taking like half a second. And if we look at the P99 stats, there were some even greater ones; the maximum ones were more like 700 or 900 milliseconds.
So we wanted to drill into this a little bit more. Looking deeper at the query segments themselves, we can see that most of the queries are double-digit milliseconds, definitely in the 10 to 20 millisecond range. But there are some which are still around 100 milliseconds, and if you look at the maximum ones, they go up to around half a second. So there's quite a lot of variance here.
And we ran it multiple times, and we ran it over multiple days, but we still saw this level of variation. And it's interesting because sometimes Amazon, in their DSQL marketing, talk about single-digit millisecond query performance, similar to DynamoDB. Now, you do see those in terms of the minimum values. But if you look at the P90, you're looking at three-digit milliseconds for query performance.
So it's generally good, I would say. But if you have really critical low-latency workloads, that's something you might want to consider. Of course, when you're talking about databases, you also want to look at the connection time. And this is a critical thing, especially with Lambda, where you don't have things like RDS Proxy, which we talked about in that Aurora episode, 122. So we wanted to look at cold starts and try to measure how much of that was being consumed by database connection time.
And our average cold start time was about 537 milliseconds, which I think is not unusual for the Node.js runtime, but probably on the higher end. But we also wanted to see how much of that was initializing Fastify and all our other third-party dependencies. So we actually added instrumentation and specific CloudWatch metrics for the database initialization time. And that's the time to initialize your database driver and establish a connection.
We run a test query, which is really tiny and very short in duration; it's just a SELECT 1. And we saw that the connection time was between roughly 200 and 300 milliseconds for initializing your database. So that's definitely a significant part of that cold start time, the lion's share, maybe 50 to 70% of it. It would be nice if there was a way to have faster connections. But it's still, I would say, altogether not too bad. And for the majority of SaaS applications, given that you're still getting quite a lot of single-digit and double-digit millisecond query times, I think it's pretty good, but could do better.
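The measurement itself is nothing fancy; roughly something like this sketch, where the metric name and namespace are our own choices, and note that in older Powertools versions the unit enum is called MetricUnits rather than MetricUnit.

```typescript
// Sketch: measuring how long it takes to sign a token, connect, and run a trivial query,
// then publishing it as a custom CloudWatch metric. Metric names here are illustrative.
import { Metrics, MetricUnit } from "@aws-lambda-powertools/metrics";
import { DsqlSigner } from "@aws-sdk/dsql-signer";
import { Client } from "pg";

const metrics = new Metrics({ namespace: "DsqlDemo", serviceName: "dsql-api" });

export async function initDatabase(endpoint: string, region: string): Promise<Client> {
  const start = Date.now();

  const signer = new DsqlSigner({ hostname: endpoint, region });
  const token = await signer.getDbConnectAdminAuthToken();

  const client = new Client({
    host: endpoint,
    user: "admin",
    database: "postgres",
    password: token,
    ssl: { rejectUnauthorized: true },
  });
  await client.connect();

  // Tiny warm-up query so we measure a full round trip, not just the TLS handshake
  await client.query("SELECT 1");

  metrics.addMetric("DbInitTime", MetricUnit.Milliseconds, Date.now() - start);
  metrics.publishStoredMetrics();

  return client;
}
```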
Luciano: We should try that in Rust to see if the connection is established faster. But, yeah, it's not something we have experimented with.
Eoin: Actually, yeah, there are a couple of interesting articles by Benjamen Pyle, who did a whole lot of research. We'll link them in the show notes. One of them is about using Rust to query DSQL, and the subsequent one is about trying to improve the query performance further by using Momento as a cache. Those are definitely worth reading.
Luciano: Awesome. Should we talk about high availability and multi-region instead?
Eoin: Let's do it. Let's get serious.
Luciano: Okay, so yeah, that's definitely another topic that might be important to you, especially if you care about how the data is replicated and you want to make sure you're not going to lose any data when you're running critical applications on DSQL in production. So as we said, you have two options: either a multi-region or a single-region cluster. If you decide to use a single-region cluster, it automatically uses three availability zones.
And every transaction is automatically committed to all three of those availability zones. If you use a multi-region DSQL cluster, it gives you more availability: each region effectively gets its own linked cluster, and you get read and write endpoints with strong consistency. Every time you write new data, that data is synchronously replicated across regions, and effectively you get an RPO (recovery point objective) of zero, which basically means that if a region fails at any point, you are not losing any data.
So to set up multi-region, there are a few steps that you need to follow. You basically need two or more participant regions and a witness region, which is a region that does not have a read or write endpoint, but maintains an encrypted copy of the transaction log and ensures the five nines of availability. So effectively it's just used to provide availability and support recovery. Note that right now, only US regions are supported as witness regions. This is something you need to be aware of; hopefully it's going to be improved over time. But effectively, if you are trying to achieve the highest level of availability and at the same time you are worried about data sovereignty, there is a little bit of a conflict there, and you might not be able to use DSQL to its fullest capacity.
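For a rough idea of what that setup looks like from the JavaScript SDK, here is a heavily simplified sketch. Treat the multiRegionProperties shape and the peering step as assumptions based on the GA API, and check the current DSQL documentation and SDK reference before relying on it.

```typescript
// Very rough sketch: creating the regional clusters of a multi-region DSQL setup.
// The multiRegionProperties shape below is an assumption based on the GA API; verify before use.
import { DSQLClient, CreateClusterCommand } from "@aws-sdk/client-dsql";

async function createMultiRegionClusters() {
  const usEast1 = new DSQLClient({ region: "us-east-1" });
  const usEast2 = new DSQLClient({ region: "us-east-2" });

  // Cluster in the first participant region, with us-west-2 as the witness region
  const first = await usEast1.send(
    new CreateClusterCommand({
      deletionProtectionEnabled: true,
      multiRegionProperties: { witnessRegion: "us-west-2" },
    })
  );

  // Cluster in the second participant region, referencing the first one
  const second = await usEast2.send(
    new CreateClusterCommand({
      deletionProtectionEnabled: true,
      multiRegionProperties: {
        witnessRegion: "us-west-2",
        clusters: [first.arn as string],
      },
    })
  );

  // The first cluster then needs an update call to reference the second (peering),
  // which we omit here; see the DSQL documentation for the exact flow.
  console.log(first.arn, second.arn);
}
```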
I think now is the time when we get to talk about pricing, the moment everyone has been waiting for.
Eoin: Yeah, and this is where we try to make it clear how expensive or cheap it is, but ultimately fail because it depends. But let's give it a go anyway. Similar to DynamoDB, it has units that are counted as you read and write, with a separate storage cost as well. The units are called distributed processing units, or DPUs, and the cost varies per region. In Ireland, it's about $9.50 for 1 million units.
But note that this is quite a bit more expensive than the $8 you get in the US regions. And I think if you go to Asia-Pacific regions, it's more like $10 plus. So it does vary quite a lot by region. And then it's 36 cents per gigabyte of storage per month. Now, there's a difference when you're comparing to DynamoDB, because DynamoDB bills for reads and writes, and it's pretty clear what a DynamoDB write capacity unit and read capacity unit mean in terms of the amount of data being read and written.
You can do a calculation on it. With DSQL, there's a lot more to it. But it's also a bit more vague. Like, DSQL's DPUs are based on reads, writes, but also compute. And you can't really predict how much each query is going to use. I suppose it kind of makes sense that they've got this additional compute metric that they count into the billing. Because SQL allows you to run complex queries where DynamoDB doesn't.
And that means they're going to consume more than just IO overhead. And it's not going to be deterministic. So there's no way that we know of to calculate DPUs for your queries in advance. You just have to try it and measure. Thankfully, there are usage metrics in CloudWatch for all of the factors that go into your cost calculation. And Marc Bowes has provided a script that you can use to get a cost estimate based on this usage.
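The gist of such a script is simple enough to sketch: pull the usage metrics from CloudWatch and multiply by the regional price. The namespace, metric name, and dimension below are assumptions on our part, and prices change, so treat this purely as an illustration and prefer Marc's script.

```typescript
// Sketch: estimating DSQL cost from CloudWatch usage metrics.
// Namespace, metric name, dimension, and price are assumptions/placeholders; check the docs and your region.
import { CloudWatchClient, GetMetricStatisticsCommand } from "@aws-sdk/client-cloudwatch";

const DPU_PRICE_PER_MILLION = 9.5; // eu-west-1 ballpark, USD
const cw = new CloudWatchClient({ region: "eu-west-1" });

async function estimateDpuCost(clusterId: string, start: Date, end: Date): Promise<number> {
  const res = await cw.send(
    new GetMetricStatisticsCommand({
      Namespace: "AWS/AuroraDSQL",   // assumed namespace
      MetricName: "TotalDPU",         // assumed metric name
      Dimensions: [{ Name: "ClusterId", Value: clusterId }], // assumed dimension
      StartTime: start,
      EndTime: end,
      Period: 3600,
      Statistics: ["Sum"],
    })
  );

  const totalDpu = (res.Datapoints ?? []).reduce((acc, dp) => acc + (dp.Sum ?? 0), 0);
  return (totalDpu / 1_000_000) * DPU_PRICE_PER_MILLION;
}

estimateDpuCost("my-cluster-id", new Date(Date.now() - 24 * 3600 * 1000), new Date())
  .then((usd) => console.log(`estimated DPU cost (last 24h): $${usd.toFixed(2)}`));
```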
Marc also has a good article on DSQL pricing; the link for that is below too. And we actually found that when reading another useful article by Alessandro Volpicella, which is called the Amazon DSQL pricing guide. So check those out for sure. You do have a free tier with 100,000 DPUs and a gigabyte of storage per month. Inter-region transfer will incur a cost, and data egress to the internet, as always, incurs a cost beyond the usual 100 gigabyte free tier.
You should also include the cost of keeping backups, and you can, by the way, integrate DSQL with AWS Backup. So overall, it's difficult to say how cheap or expensive DSQL can be compared to RDS or even Aurora Serverless. It just depends. It can be very cheap, and a lot of the articles I've seen give examples that seem very cheap. But we can also see how, with significant load, it can get quite expensive for high-load applications.
If we talk about our load test, for example, we said we measured 1,000 virtual users hammering the system constantly for two minutes. And we've already run that maybe half a dozen times this month. Using the script provided by Marc Bowes in that article (it'll be linked), a really nice script that will just check your CloudWatch metrics and spit out a dollar amount, you get an estimate.
But it's already saying that for us, having run that workload for, let's say, maybe all together somewhere in the 10 to 20 minute range, it's like $3. So that's just for minutes of usage for thousands of users. So if you have constant high load, you might want to think about investing in the pain of infrastructure management and using something like Aurora or RDS or something else. But still, I would say definitely give it a try and measure it because the amount of effort it will take away, the amount of development time and cost it will save you will be pretty massive. So when should we use it? Let's give our pros and cons. Who's it good for? Who should steer clear of it?
Luciano: Yes, I'm going to try my best to summarize that, although it's always a difficult choice; it isn't very binary. So let's see what we can bring into the mix to help you figure out if it's good for your project or not. I would say that if you're building a new greenfield application, where you don't have to spend too much time taking code that you have already written, maybe assuming a generic Postgres, and trying to move it over, possibly running into unsupported features, then it's probably a good starting point.
You have fewer things to worry about, so definitely, if you're doing something new, it might be worth considering. Another case is when you have a desire to minimize infrastructure maintenance. Maybe you are doing a migration, but the cost of figuring out exactly what you need to change, and changing it so that DSQL is fully supported, eventually becomes worth it, because then you have lower maintenance costs.
So that could be a trade-off worth exploring as well. Another case is when you have estimated the costs for your expected workload and know the cost trade-off makes sense. Effectively, if you have done a few experiments with DSQL and you have a fairly good understanding of the usage patterns, and they're not going to change too much, then you can guess quite accurately how much it's going to cost you.
And maybe you see it's actually going to be much cheaper than the alternatives; that could be another great case that might justify a migration. And finally, it helps if you're not using libraries or tools that require some of the unsupported features in DSQL, or any of those Postgres extensions, some of which are actually quite common, so it's quite possible that one of those becomes a blocker along the road.
So consider that as well. Effectively, if you fall into all of these buckets, it might be worth exploring it and then deciding whether you want to stick with it or not. Now, if we want to summarize the things that can easily get in your way, where at that point you should probably avoid DSQL, let me try to give you a few pointers. One reason might be that you have an existing application that would require too much work, and at that point it might not be worth investing that time compared to the benefits that you might get by migrating.
Another might be if you need to guarantee foreign key constraints; those are not supported, so there's nothing you can do about that. Very similarly, if you need traditional indexes, that's a problem. There are indexes you can use, but you'll need to change your code to be able to use them, so factor in the cost of implementing those changes as well. If you need stored procedures, extensions, and things like that, then again, those are not supported, so there's nothing you can do about that.
If you have predictable load, maybe it's not going to be the most cost-effective solution for you. And if you have already invested in provisioned capacity, so you have paid upfront, maybe switching is not going to give you the savings that you get with provisioned capacity. Another thing is that if you need to use different authentication mechanisms, like Kerberos, for instance, as we mentioned, only IAM is supported, so that's another blocker for you.
So with this, we get to the end of this episode. As promised, we'll spend a few words on fourTheorem. Thank you for sponsoring another episode of AWS Bites. fourTheorem is the consulting company we work for, and at fourTheorem, we believe that cloud should be simple, scalable, and cost-effective, so we help teams around the world to do just that. Whether you're diving into containers, stepping into event-driven architecture, scaling a global SaaS platform on AWS, or even trying to keep cloud spend under control, our team has your back. So definitely reach out, visit fourTheorem.com, and see how we can help you be successful with AWS. We've linked all of the resources we mentioned in the show notes. That's everything we have for today. I hope you enjoyed this episode and that you are curious to try DSQL. Let us know whether you like it or not, and what kind of solutions you are going to build with it. Thank you, and we'll see you in the next episode.