AWS Bites Podcast

124. S3 Performance

Published 2024-05-31 - Listen on your favourite podcast player

In this episode, we discuss some tips and tricks for optimizing performance when working with Amazon S3 at scale. We start by giving an overview of how S3 works, highlighting the distributed nature of the service and how data is stored redundantly across multiple availability zones for durability. We then dive into specific tips like using multipart uploads and downloads, spreading the load across key namespaces, enabling transfer acceleration, and using S3 byte-range fetches. Overall, we aim to provide developers building S3-intensive applications with practical guidance to squeeze the most performance out of the service.

AWS Bites is brought to you by fourTheorem, an AWS consulting partner with tons of experience with S3. If you need someone to work with to optimize your S3-based workloads, check out fourtheorem.com!

In this episode, we mentioned the following resources:

Let's talk!

Do you agree with our opinions? Do you have interesting AWS questions you'd like us to chat about? Leave a comment on YouTube or connect with us on Twitter: @eoins, @loige.

Help us to make this transcription better! If you find an error, please submit a PR with your corrections.

Eoin: S3 must be the most loved of all AWS services. It's a storage service that allows you to store files with a simple API and takes care of scalability, durability, security, and a whole bunch of other things with very little effort on the developer side. S3 is becoming the ubiquitous cloud storage platform and powers a large variety of use cases. And for some of these use cases, performance really matters.

So if you're building a product that relies heavily on S3, there are a few interesting optimizations that you might want to leverage. In today's episode, we're going to talk about some of the lessons we've learned and some of the tips and tricks that we've discovered along the way working with S3 at scale. My name is Eoin, I'm joined by Luciano and this is another episode of the AWS Bites podcast.

AWS Bites is brought to you by fourTheorem, an AWS consulting partner with tons of experience with S3. If you need someone to work with you to optimize your S3-based workloads, check out fourtheorem.com or contact us directly using the links in the show notes. We already spoke about S3 best practices back in episode 33. Now that was more of a generic episode on a variety of best practices that are relevant to using S3, but we did give a quick intro on what S3 is and the related terminology. So if you haven't checked it out, it might be a good one to go back to. Today though, we're going to assume you already have a little bit of basic knowledge about the service and how it works, and we're going to focus mostly on performance. But let's give a brief intro. Luciano, where would you like to start?

Luciano: I think it's a good idea to still review how S3 works under the hood, because understanding, at least at a high level, the machinery behind it is important to really understand why certain performance optimizations actually work. So if we want to just start with some stats, this is something that we can just observe to understand the scale of the service. And this is coming from a presentation that's maybe a little bit dated at this point, because it's a presentation from re:Invent that was delivered in 2021, called Deep Dive on Amazon S3.

It's a really good one, so we'll leave the link in the show notes. But the data that they share there is that S3 stores exabytes of data. An exabyte is 1 billion gigabytes, I had to look that up, and all of it is spread across millions of drives. So you can imagine that AWS somehow has to manage this huge number of physical drives where all your data is going to be stored in one way or another. This is the level of complexity that AWS is taking care of for you, so you don't have to worry about managing physical devices.

Then there are, as they say, trillions of objects stored in various S3 buckets. So all these drives effectively form a distributed system that stores all these trillions of objects. And the service can handle millions of requests per second. So I hope that all these numbers give you an idea of the volume and the scale of the service. There is another one: they even say that they can reach a peak of 60 terabytes per second of data processed.

So again, how is that magic happening? We don't necessarily know all the implementation details. But the interesting thing to know is that AWS does all of this at scale and still guarantees data durability. And the way they do that is by storing your data in multiple copies in different places. So we are obviously talking about a distributed system here, because it wouldn't be possible to reach this level of scalability with just one big machine, of course.

Now, if we remember the networking basics, you know that there are regions, and inside regions there are availability zones. And you can imagine an availability zone as a separate data center with independent connectivity, power, and so on. So in most cases, and I say in most cases because there are certain configurations that you can tweak, but by default S3 stores your data across multiple availability zones.

That basically means that as soon as you send an object to S3, AWS is automatically copying that object across independent availability zones, and only then do you get an acknowledgement. That means that at that point your file is saved securely across different locations. Now, in all of that process, at some point the data is being stored on physical disks. And you can also imagine that it's stored on many of them, because of course if the data lives in independent locations, there are independent disks keeping the different copies of your data.

So you can imagine that managing all these disks is tricky, and AWS needs to have a really solid process to check for physical device failure. They can actually predict when devices are going to fail, and they can replace them before they actually break. And they can do all of that without losing access to your data. So they can do all this swapping of disks, making sure that your data is always available and durable, without you having any interruption of service.

There is another cool feature that you can enable, which is called cross-region replication. By default a bucket lives in one region, and the data is spread across multiple availability zones. But if you want extra guarantees, or maybe you want lower latency because you need to access that data from different locations around the world, you can enable this cross-region replication. What happens is that every object you create in a bucket gets replicated to a bucket that exists in another region as well. And you can even make the data available to any location through something called AWS Global Accelerator. We'll say more around that a little bit later in this episode. So hopefully that gives you an understanding of the scale and the things that AWS takes care of for us when we use this service. This is probably a good point to jump to the first performance tip.
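For reference, here is a minimal sketch of what enabling cross-region replication could look like with the AWS SDK for JavaScript v3. The bucket names and IAM role ARN are made up, both buckets need versioning already enabled, and the exact rule fields may need adjusting for your own setup.

```typescript
import { S3Client, PutBucketReplicationCommand } from "@aws-sdk/client-s3";

// Hypothetical names: replace with your own buckets and replication role.
const sourceBucket = "my-source-bucket";
const destinationBucketArn = "arn:aws:s3:::my-replica-bucket-eu-west-1";
const replicationRoleArn = "arn:aws:iam::123456789012:role/s3-replication-role";

const s3 = new S3Client({ region: "us-east-1" });

// Both buckets must have versioning enabled for replication to work.
await s3.send(
  new PutBucketReplicationCommand({
    Bucket: sourceBucket,
    ReplicationConfiguration: {
      Role: replicationRoleArn,
      Rules: [
        {
          Status: "Enabled",
          Priority: 1,
          Filter: { Prefix: "" }, // replicate every object in the bucket
          DeleteMarkerReplication: { Status: "Disabled" },
          Destination: { Bucket: destinationBucketArn },
        },
      ],
    },
  })
);
```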

Eoin: ... 10,000 parts and you don't even need to upload them in order. Every part has to be within certain limits: between five megabytes and five gigabytes per part (only the last part can be smaller). So if you've got a three megabyte file, you wouldn't use a multi-part upload for it. It has to be at least five megs. And AWS generally recommends you use something like eight or 16 megabytes for your part size. When you upload a single part, S3 will return an entity tag, also known as an ETag, for the part, and you record that with the part number.

And when you do the third step in the process, which is complete multi-part upload, you essentially provide a manifest of all of the part numbers and ETags with that request. You can even send AWS checksums to make sure everything was transferred correctly: not a checksum of the entire object, but rather of each individual part. There's a link in the show notes to a user guide that will help you understand that process.

You generally don't have to do this yourself, since most of the SDKs include some higher-level abstraction for uploads and downloads. These upload helpers will generally use multi-part uploads automatically when it makes sense. We'll provide links to code samples for the SDKs; one example is the Node.js helper library, @aws-sdk/lib-storage, in AWS SDK v3 (a quick sketch of it follows below). You can also do some cool esoteric things with this as well.
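As a quick illustration, here is a minimal sketch of that lib-storage helper, assuming a made-up bucket name and local file. The Upload class switches to multipart uploads automatically when the body is large enough.

```typescript
import { createReadStream } from "node:fs";
import { S3Client } from "@aws-sdk/client-s3";
import { Upload } from "@aws-sdk/lib-storage";

const s3 = new S3Client({});

// The Upload helper splits the stream into parts and uploads them
// concurrently, falling back to a single PUT for small bodies.
const upload = new Upload({
  client: s3,
  params: {
    Bucket: "my-example-bucket", // hypothetical bucket name
    Key: "backups/large-file.bin",
    Body: createReadStream("./large-file.bin"),
  },
  queueSize: 4, // number of parts uploaded concurrently
  partSize: 16 * 1024 * 1024, // 16 MB parts, in line with the guidance above
});

upload.on("httpUploadProgress", (progress) => {
  console.log(`Uploaded ${progress.loaded} of ${progress.total ?? "?"} bytes`);
});

await upload.done();
```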

I remember having a case before when we needed to essentially merge a lot of CSV files, and those CSV files didn't have headers in them. We were able to do that just using S3 features, because when you specify a part for a multi-part upload, it doesn't have to be something that's on your client machine: it can also be an existing object on S3. So you can use it just to concatenate a bunch of files on S3 without any of that data leaving S3 or being transferred to your machine.
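To illustrate that concatenation trick, here is a rough sketch using UploadPartCopy, where parts are copied from existing objects instead of being uploaded from the client. The bucket and key names are invented, and the usual part-size rule still applies, so every source object except the last needs to be at least five megabytes.

```typescript
import {
  S3Client,
  CreateMultipartUploadCommand,
  UploadPartCopyCommand,
  CompleteMultipartUploadCommand,
} from "@aws-sdk/client-s3";

const s3 = new S3Client({});
const bucket = "my-example-bucket"; // hypothetical
const target = "merged/all-parts.csv";
const sources = ["chunks/part-1.csv", "chunks/part-2.csv", "chunks/part-3.csv"];

// 1. Start the multipart upload for the target object.
const { UploadId } = await s3.send(
  new CreateMultipartUploadCommand({ Bucket: bucket, Key: target })
);

// 2. Copy each existing object as a part: the data never leaves S3.
const parts: { PartNumber: number; ETag?: string }[] = [];
for (const [index, key] of sources.entries()) {
  const { CopyPartResult } = await s3.send(
    new UploadPartCopyCommand({
      Bucket: bucket,
      Key: target,
      UploadId,
      PartNumber: index + 1,
      CopySource: `${bucket}/${key}`,
    })
  );
  parts.push({ PartNumber: index + 1, ETag: CopyPartResult?.ETag });
}

// 3. Complete the upload with the manifest of part numbers and ETags.
await s3.send(
  new CompleteMultipartUploadCommand({
    Bucket: bucket,
    Key: target,
    UploadId,
    MultipartUpload: { Parts: parts },
  })
);
```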

Now, let's get on to multi-part downloads, or as they're better known, byte-range fetches. When you're doing a get object command, you can specify the start and end of a range of bytes. Downloading an entire file this way is generally not built into the SDKs, but there are examples of implementing it yourself, and we'll provide a link to that in the show notes. There is a very interesting podcast episode, and a library associated with it, from our friends at Cloudonaut. They had a very specific need for one of their products to download very large objects from S3 in Node.js and implemented a highly optimized library for it. So you can check that link out in the show notes as well. So that's tip one. Basically, use concurrency: do multi-part uploads and byte-range fetches for downloads. What else should we suggest, Luciano?
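As an aside, here is a rough sketch of what a parallel byte-range download might look like with the SDK. The bucket, key, and range size are made up, and a production version would want retries and bounded concurrency on top of this.

```typescript
import { S3Client, HeadObjectCommand, GetObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({});
const Bucket = "my-example-bucket"; // hypothetical
const Key = "datasets/big-file.bin";
const PART_SIZE = 16 * 1024 * 1024; // download in 16 MB ranges

// Find out how big the object is.
const { ContentLength = 0 } = await s3.send(new HeadObjectCommand({ Bucket, Key }));

// Fetch all ranges concurrently, then stitch them back together in order.
const ranges: Promise<Uint8Array>[] = [];
for (let start = 0; start < ContentLength; start += PART_SIZE) {
  const end = Math.min(start + PART_SIZE, ContentLength) - 1;
  ranges.push(
    s3
      .send(new GetObjectCommand({ Bucket, Key, Range: `bytes=${start}-${end}` }))
      .then((res) => res.Body!.transformToByteArray())
  );
}

const buffer = Buffer.concat(await Promise.all(ranges));
console.log(`Downloaded ${buffer.length} bytes in ${ranges.length} ranges`);
```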

Luciano: Another common thing is to try to spread the load across different key namespaces. To really understand this one, we need to explain a little bit of the detail of how S3 stores objects and what some of the limits are. If you look at the documentation, it says that you can do 3,500 PUT, COPY, POST, or DELETE requests and 5,500 GET and HEAD requests per second per prefix. And this is where things get a little bit confusing, because what does it mean, per prefix? If you look at other parts of the documentation, there is an official definition that says a prefix is a string of characters at the beginning of the object key name. A prefix can be of any length, subject to the maximum length of the object key name, which is 1,024 bytes. You can think of prefixes as a way to organize your data in a similar way to directories. However, prefixes are not directories. So you can kind of make the parallel that a prefix is like saying, I don't know, "/home," "/luciano," "/documents," and then the name of your object. But behind the scenes, AWS is not really maintaining a file system; it's just a way for you to organize your data.

What is interesting, though, is that somehow AWS is using this information to distribute the data across multiple partitions. And this is probably where the limit conversation comes from. You can do a certain number of operations per prefix, but that probably really means per partition. And this is something that is not always entirely clear: what is the logic that AWS uses to define how a prefix maps to actual physical partitions? It's something that AWS tries to determine automatically, depending on your usage patterns. But what we have seen in the wild is that if you really do lots of requests, even if you have different prefixes, you can still get throttled and see 503 errors. So it is really important, if you're running at such scale, to monitor the number of 503s, because if you're using the SDK, there are retries. Eventually you might get your operation performed successfully, but it might take a long time, because there is a loop of retries happening behind the scenes. So if you're trying to get the best performance, you need to be aware of when retries are happening.

Another interesting thing that we bumped into working with one of our customers is that we were still getting lots of 503s, and at some point we decided to talk with support. It was a long conversation, and we got lots of help from AWS, but it seems to be possible to get AWS to tweak whatever the internal mechanism is for your specific use case. So if you're really hitting all these limits and you don't know what else you can do, I think the best course of action right now is to just open a ticket, talk with AWS, and explain your use case. They might be able to discuss with you very custom options that are the best solution for your particular case. I think this is still very rare in the industry; we only had one case, at least that I can remember, in my career. But again, if you happen to do thousands and thousands of requests to AWS per second, it's not unlikely that you're going to bump into this particular limit. So just be aware that there are solutions, even though they are not necessarily well documented: you can talk with AWS and they will help you figure it out.

Overall, the idea is to try to think about namespaces that make sense and then distribute your access, your operations, to different namespaces if you want to get as many requests per second as possible. What's the next one you have, Eoin?
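One common way to spread keys across namespaces, sketched below with made-up key names, is to add a short hash-derived shard to the front of each key so objects naturally spread across many prefixes instead of piling up under a single hot, date-based prefix. Anything reading the data back then needs to know the same scheme, for example by listing each shard prefix.

```typescript
import { createHash } from "node:crypto";

// Derive a short, deterministic shard prefix from the object name so that
// keys spread across up to 256 prefixes instead of one hot date-based prefix.
function spreadKey(originalKey: string): string {
  const shard = createHash("md5").update(originalKey).digest("hex").slice(0, 2);
  return `${shard}/${originalKey}`;
}

// Prints something like "ab/2024/05/31/orders-001.json"; the exact shard
// depends on the hash of the key.
console.log(spreadKey("2024/05/31/orders-001.json"));
```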

Eoin: The next one goes down to the network level. It's a fairly common design pattern in networking and file storage to horizontally scale performance using multiple connections. If you're making requests from one network device to another, you might bump into bandwidth limits of that device or the devices in between. So distributing the requests across multiple devices, multiple end-to-end connections, can definitely help you achieve higher throughput. An example of that is borne out, again going back to the Cloudonaut example: they realized that when connecting to S3 from an EC2 instance, there's a limit of five gigabits per second for a single VPC flow.

And a VPC flow is defined as the combination of source IP, source port, destination IP, and destination port. If you're just doing a fairly simple HTTP request to an S3 endpoint, you're going to do a DNS lookup, it's going to give you back an IP or a set of IP addresses, and your client is going to pick one and make the connection to that IP address. But if you're a little bit smarter about it, you can take all of the IP addresses S3 gives back and use them to open multiple connections from your source to the destination. And that's exactly what that clever library from Cloudonaut did. This load balancing on the client is something that the AWS CRT, the Common Runtime, does as well. So the AWS CRT library, which is used in the Java SDK and Boto3 as well, has the capability to do that and do all this download performance optimization too, so it's worth checking out. Then, still on the topic of network connections, different environments vary: different EC2 instances have different bandwidth characteristics on their network devices, and then you have enhanced networking and Elastic Fabric Adapter to really squeeze more performance out of them. Also bear in mind that when you're running in AWS Lambda, your network bandwidth depends on your memory configuration because it's linearly proportional. So if you're finding that bandwidth is a constraint, you might think about, "Okay, well, can I do multiple downloads in multiple functions, or do I just need to up the memory so that I get maximum IO throughput?" So those are the lower-level performance tips. What else do we have, Luciano?

Luciano: Another interesting one is the usage of the edge, let's call it like that. The idea is that you can enable something called Amazon S3 Transfer Acceleration. This is more for use cases where you might, for instance, be building a web application and have users connecting from all around the globe. And of course, if you store your data in a bucket that exists in only one region, you might have good latency for all the users that are close to the region and very poor latency for all the other users that may be very far away from it.

So one way that you can solve this particular problem, and give more or less similar performance to all users regardless of where they are around the globe, is to enable this feature called Transfer Acceleration. AWS shares some data on their marketing page where they say that this can improve performance by between 50 and 500% for long-distance transfers of large objects. So imagine that you have a bucket somewhere in Europe and a user from Australia is connecting to that bucket.

You can imagine that by default there is significant latency, but enabling this feature will reduce that latency significantly. And this is a feature that you need to enable because, of course, there is significant complexity in making all of this happen for you: your transfers get routed through edge locations all around the globe. It's something that you turn on, and you need to be aware that you pay a premium price for it.

So it's not a free feature, and it makes sense to use it only when you really have that particular type of use case, not just to enable it because it might seem like a convenient thing to do. If you know CloudFront, this feature is effectively leveraging CloudFront under the hood: it routes your transfers through the closest edge location and then uses the AWS backbone network to make sure that the connection between the edge location and your bucket's region is as fast as it can be. This is a feature that you need to enable at the bucket level.

So you just go to the bucket settings and you can enable it there, either from the UI or with the CLI. Then, when you want to transfer a file to or from S3 using this feature, you have to use a special endpoint of the form https://<bucket-name>.s3-accelerate.amazonaws.com. So rather than going directly through the regular bucket endpoint, requests go through this special endpoint that uses the edge network.
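In practice you don't usually build that URL by hand. With the JavaScript SDK v3, for example, you can ask the client to use the accelerate endpoint, as in this minimal sketch (the bucket name is made up, and acceleration must already be enabled on the bucket).

```typescript
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

// useAccelerateEndpoint makes the SDK send requests to
// <bucket>.s3-accelerate.amazonaws.com instead of the regional endpoint.
const s3 = new S3Client({
  region: "eu-west-1",
  useAccelerateEndpoint: true,
});

await s3.send(
  new PutObjectCommand({
    Bucket: "my-accelerated-bucket", // hypothetical, acceleration already enabled
    Key: "uploads/video.mp4",
    Body: "placeholder payload",
  })
);
```

The bucket-level setting itself can also be toggled programmatically, for example with PutBucketAccelerateConfigurationCommand in the same SDK.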

Now, we could give you a lot more details, but it's probably more useful to point you to the documentation page. We have a link in the show notes if you're curious to find out how you really enable this, with all the steps you need to follow if you want to implement this particular feature.

The other option, if you are just building a website, for instance, and you want to make sure that all the static assets of that website are available at edge locations, is to use CloudFront directly. So you just enable a CloudFront distribution; CloudFront is effectively a CDN, so that will make your objects available in different locations around the world. AWS claims that they have about 400 edge locations, so this is probably going to give good coverage all around the globe. And if you're doing all of that, there is another extra advantage, because at that point you are serving an entire website from an S3 bucket. If you just enable the S3 website feature, that by default is only HTTP, which is not really ideal these days; you probably want to have HTTPS. When you use CloudFront, you also get support for HTTPS. So that's one more reason to use CloudFront when you are serving just static assets for a website. That, I think, concludes this tip. So what do we have next?

Eoin: The next one, and the final one, is a bit niche maybe, but if your application relates to tabular data, like analytics or data science, you can leverage some of the great tools that optimize data retrieval from S3 for you, to avoid reading data that you don't need. This goes back to our byte-range fetches really, but some of these tools are already doing this for you under the hood and you don't even really need to think about how it works. The simplest one of all is S3 Select. This is an S3 API; it's available in all the SDKs and in the console, and it's pretty straightforward. It allows you to retrieve specific rows and columns of data from S3 using a simple SQL-like syntax. So you could do a SELECT of some columns FROM the object, with some simple WHERE clause. There are no joins or anything complicated like that in it; it's just for a single object. That avoids you having to retrieve large volumes of data over the network, and you push the heavy lifting onto S3.
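As a rough sketch of what that looks like with the SDK, here is an S3 Select query over a gzipped CSV. The bucket, key, and column names are invented.

```typescript
import { S3Client, SelectObjectContentCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({});

const { Payload } = await s3.send(
  new SelectObjectContentCommand({
    Bucket: "my-example-bucket", // hypothetical
    Key: "reports/orders.csv.gz",
    ExpressionType: "SQL",
    // Only the matching rows and columns are sent back over the network.
    Expression: "SELECT s.order_id, s.total FROM S3Object s WHERE CAST(s.total AS FLOAT) > 100",
    InputSerialization: {
      CSV: { FileHeaderInfo: "USE" },
      CompressionType: "GZIP",
    },
    OutputSerialization: { JSON: {} },
  })
);

// The response is a stream of events; Records events carry the actual rows.
for await (const event of Payload ?? []) {
  if (event.Records?.Payload) {
    process.stdout.write(Buffer.from(event.Records.Payload).toString("utf-8"));
  }
}
```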

Now, if you're doing something a bit more complicated and you're in this space, you might be familiar with Apache Arrow, which is core to a lot of these data science tools, and with the libraries built on top of the Arrow library and format, like Pandas, Polars, and DuckDB. These all ensure you don't have to read all of the data if you don't need it, particularly if you're using optimized columnar formats like Parquet. All of these tools can be quite intelligent about it: with a Parquet file, for example, the metadata describing what data is in it sits at the bottom of the file. So those tools will go and read the footer of the Parquet file, figure out where the columns are stored in the file and where the row groups that the data is split into are located, and then retrieve only the data you need. Polars and DuckDB are particularly fast when it comes to this kind of use case. They'll leverage those byte-range queries automatically for you, are surprisingly fast in how they run, and are already putting a lot of engineering effort into optimizing things like object retrieval from S3, so you don't even have to think about it.

In terms of additional resources, we're going to throw a bunch of links in the show notes, which we hope are valuable, including some performance guidelines for S3 and design patterns. Apart from that, let us know if you have any more S3 performance tips. I'm sure there are more out there. Just let us know in the comments. Thanks very much for joining us this time, and we'll see you in the next episode.