AWS Bites Podcast

56. What can you do with S3 presigned URLs?

Published 2022-10-28 - Listen on your favourite podcast player

Uploading and downloading files are some of the most common operations for web applications. But let’s face it, as common as they are, they are still challenging features to implement in a reliable and scalable way!

This is especially true for serverless environments where you have strict limits in payload size and you cannot have long-running connections.

So what’s the solution? If you are using S3, pre-signed URLs can help quite a bit!

In this episode of AWS Bites podcast, we are going to learn more about them, and… if you stick until the very end of this episode, we are going to disclose an interesting and quite unknown tip about pre-signed URLs!

Some of the resources we mentioned:

Let's talk!

Do you agree with our opinions? Do you have interesting AWS questions you'd like us to chat about? Leave a comment on YouTube or connect with us on Twitter: @eoins, @loige.

Help us to make this transcription better! If you find an error, please submit a PR with your corrections.

Luciano: Uploading and downloading files are some of the most common operations for web applications. But let's face it, as common as they are, they are still challenging features to implement in a reliable and scalable way. This is especially true when we talk about serverless environment, where you have strict limits in payload size and you cannot have long-running connections. So what is the solution? If you're using S3, presigned URLs can help you quite a bit. And in this episode, we're going to be talking about presigned URLs. And if you stick until the very end of this episode, we're going to disclose an interesting and quite unknown tip about presigned URLs. My name is Luciano and today I'm joined by Eoin and this is AWS Bites podcast. So Eoin, maybe we can start with describing some of the use cases, like what kind of operations do we generally do when we talk about upload and download in the context of web applications? Yeah, okay. Let's set some context here by talking about a few of the use cases.

Eoin: So let's say you're signing up in a mobile application for your service and you want people to take a photo so they've got their avatar. That would be an upload. Another one might be you're offering some digital download, like a software. People are paying for the software and then they want to download a large binary application. You might want to have a download facility that's scalable and fast there.

Or a very typical one actually is if you're sending people a newsletter and you want them to be able to download a white paper using a link in the email. Or maybe, you know, they give you their signup details and then you give them a link in the email. You also have other things which are maybe less user-facing, but even between systems you might have two applications or two services talking to each other. They might have an API where they've got an event, but associated with that event is like a large file or an attachment, some like larger payloads that you don't want to put in every message. So instead you'll give a link in the message and that link will allow them to retrieve whatever large data that is. So those are the kind of use cases we're talking about. You said it's one of the challenging things that comes up is trying to upload and download files. So what are the challenges?

Luciano: Yeah, so first of all, when you talk about uploading and downloading, those are generally what we can call streaming operations. So you have a TCP connection and you will have to transfer bytes for a long enough period of time. And generally you don't want to put boundaries there because you might have a particular context where you are uploading or downloading very large files. Imagine, I don't know, videos or I don't know, like you mentioned big binaries because you are downloading an application.

So you can imagine that you need to transfer a lot of data for a sustained period of time. And if you're using a serverless environment, this is a big challenge because we know serverless environment tends to have very strict limits. Like in Lambda you have, I think it's five megabytes, the maximum payload that you can receive in a request. And also the response that you can send from a Lambda is quite limited as well.

So you can immediately see that if you want to implement this stuff straight in a Lambda, it doesn't really give you a lot of data that you can deal with. And another problem when we talk about downloads is that generally you want to keep all the data in a protected place. And then you only want to enable specific downloads after certain actions. Maybe the user is authenticated and you realize, okay, this particular user is actually authorized to view this resource.

So I would like somehow to give only them the permission to download the file. So this is another challenge because of course if you think about S3 you might think, okay, I'm just going to make a bucket entirely public. But then anyone can, as soon as they discover the key and the bucket name, can download that particular file. So this is not really going to be a sustainable solution. And if you think, okay, I'm going to put a Lambda in front of that, then again you have the problem of payload size.

So again, what's the solution? And thankfully if we use S3, there is a feature in S3 that is called Presigned URLs that can help us a lot in solving this particular kind of problems. The idea is that you can easily generate a URL that allows users to upload the content of an object directly into S3. This is the case of upload, but at the same time you can also do the same thing basically to generate URLs for downloading a file from S3.

So again, every time you want to authorize either an upload or a download, you can generate in that particular moment a specific URL for the user. And the user can use that URL to either upload to or download from S3. So the interesting thing is why this is called presigned: it's basically because the URL contains a lot of information. If you ever look into a presigned URL, it's actually quite a big URL with a lot of query string parameters.

And some of these parameters are actually authentication parameters. So literally you have created a URL that has already built in the credentials that are needed for the user to either upload or download that particular resource. And at the end of the day, this is good because you are relying entirely on a managed service like S3, so you don't have to be worried about the infrastructure. Is it going to scale? Is it going to support all the data they need to support? So really you don't need any additional infrastructure or compute, you just need to make sure you generate the URLs at the right time before the user performs that particular action. So I suppose the next question is, how do we actually generate this kind of URLs? Yeah, like if you want to just generate one ad hoc for whatever reason without building it into the application, you can use the AWS CLI to do that.

Eoin: You can also use the S3 console in the AWS Management Console. And you also have like IDE integrations, so the AWS Explorer for Visual Studio, which also allows you to browse your bucket and right-click and get a Presigned URL for it. So those examples, the console and Visual Studio, only work for download, they don't allow you to do uploads. The more, I suppose, powerful and flexible way to do it is with the AWS API or the SDK, where you can generate Presigned URLs for uploads and downloads.

So then if we take those two cases, how do you do a download? Well, you need to specify, okay, what's the bucket and what's the key? And then you can also specify some additional configuration, like some headers that are associated with the download or an expiry. So how long does this Presigned URL remain valid for? And once you do that, you will get this really big URL you mentioned with loads of query string parameters, and that will link to the file.
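For reference, a minimal sketch of generating a download (GET) URL with the AWS SDK for JavaScript v3 might look like this; the bucket, key, region, and response header values are illustrative assumptions, not something from the episode:

```typescript
// Sketch: presigned GET (download) URL with AWS SDK for JavaScript v3.
// Bucket, key, region, and the filename below are placeholder assumptions.
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";
import { getSignedUrl } from "@aws-sdk/s3-request-presigner";

const s3 = new S3Client({ region: "eu-west-1" });

export async function createDownloadUrl(bucket: string, key: string) {
  const command = new GetObjectCommand({
    Bucket: bucket,
    Key: key,
    // Optionally override response headers, e.g. force a download filename
    ResponseContentDisposition: 'attachment; filename="report.pdf"',
  });
  // expiresIn is in seconds; after that the signature is no longer valid
  return getSignedUrl(s3, command, { expiresIn: 300 });
}
```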

So if the user clicks on that or curls it, they will be able to download that file, as long as they haven't changed the HTTP request in any way that would invalidate the signature that's embedded in the URL, and as long as the expiry time has not elapsed. So that's the GET method, and it works in a very similar way for uploads. With uploads, you've actually got a couple of options. You can use a Presigned PUT, which works exactly the same way as the Presigned GET.

Everything is in the URL and you can also put in the content type and the content length header that's required. And then basically you just put the body of your file into the HTTP request body. So that's how PUT works. Presigned POST is actually like a special feature. It's an additional kind, it's a different kind of a Presigned URL. And it uses form data, like HTTP form data instead of using like a normal post with an Octet stream or a binary payload.
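Before getting into the details of the presigned POST, here is a rough sketch of the presigned PUT flow just described; the bucket, key, region, and content type are placeholder assumptions:

```typescript
// Sketch: presigned PUT (upload) URL with AWS SDK for JavaScript v3.
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";
import { getSignedUrl } from "@aws-sdk/s3-request-presigner";

const s3 = new S3Client({ region: "eu-west-1" });

export async function createUploadUrl(bucket: string, key: string) {
  const command = new PutObjectCommand({
    Bucket: bucket,
    Key: key,
    // The Content-Type becomes part of the signature, so the uploader
    // has to send a matching Content-Type header with the PUT request.
    ContentType: "image/png",
  });
  return getSignedUrl(s3, command, { expiresIn: 300 });
}

// The client then PUTs the raw bytes to that URL, for example:
// await fetch(uploadUrl, {
//   method: "PUT",
//   headers: { "Content-Type": "image/png" },
//   body: fileBytes,
// });
```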

The Presigned POST comes with a much shorter URL, but instead of having all of the data embedded in query string parameters, you get a set of fields back that you have to put into your form data. And form data is basically like a multi-part body where you specify each field in the form, including every one of the fields that AWS gives you back in the Presigned POST response.

And then one of them will also be the file content encoded in there too. And there is a really good post actually talking about how this is sometimes the best option to use. It's by Zac Charles and we'll link to that in the show notes. And it's a good guide to using the POST method. The real advantage with Presigned POST is that you can specify limits on the file size that's going to be uploaded. Nice. So I suppose another interesting point of conversation is generally if you're building an application and you receive a file from an upload, you are trying to do something with that file, right?
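A hedged sketch of generating a presigned POST with a file-size limit, assuming the @aws-sdk/s3-presigned-post package; the bucket, key, and size range below are placeholders:

```typescript
// Sketch: presigned POST with a content-length-range condition.
import { S3Client } from "@aws-sdk/client-s3";
import { createPresignedPost } from "@aws-sdk/s3-presigned-post";

const s3 = new S3Client({ region: "eu-west-1" });

export async function createUploadForm(bucket: string, key: string) {
  const { url, fields } = await createPresignedPost(s3, {
    Bucket: bucket,
    Key: key,
    Conditions: [
      // Only accept objects between 1 byte and 5 MB (placeholder limits)
      ["content-length-range", 1, 5 * 1024 * 1024],
    ],
    Expires: 300, // seconds
  });
  // The client submits multipart/form-data to `url`, including every entry
  // in `fields` plus the file itself, with the file as the last field.
  return { url, fields };
}
```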

Luciano: There is some workflow that is intrinsically part of your application. Just to make an example, you upload a picture and maybe you want to process it with, I don't know, Rekognition to try to identify objects or text or something like that in that particular picture. And then maybe you can attach some metadata and store it somewhere and make it available to users. So how do you actually trigger the second part of the workflow?

We know at this point how you can perform the upload, but what actually triggers the rest of the workflow? And there are different ways that you can actually do that. Some are asynchronous and some can be synchronous. The asynchronous one is basically relying on notifications. You can either set up S3 event notifications or EventBridge notifications, and then you can listen to those notifications and trigger, for instance, a Lambda, and then the Lambda can orchestrate the rest of the workflow.

Or maybe you can start a Step Function. There is really no limit in how you actually process it. The only thing you want to know is exactly the point where the file was completely uploaded, and at that point you can receive the notification and decide how to process it. Another use case that I've seen is basically, for instance, I don't know, the case where you are uploading an avatar in a web application and then maybe you want to make sure your profile is actually updated to reflect the new avatar. So you can implement that in a slightly different way, for instance, rather than using events. What you could do is have two different API calls. The first API call is actually using the pre-signed URL to upload the file. And then there is a second API call where you say, update my profile with this key, which is going to be my new avatar. So it's a little bit up to the client to coordinate the two different requests, but it's another valid solution.
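As a rough illustration of the notification-driven approach, a Lambda handler receiving S3 event notifications might look something like this sketch; the actual processing step is left as a placeholder:

```typescript
// Sketch: Lambda handler triggered by S3 event notifications once an
// object has been fully uploaded. Requires the @types/aws-lambda package.
import type { S3Event } from "aws-lambda";

export const handler = async (event: S3Event): Promise<void> => {
  for (const record of event.Records) {
    const bucket = record.s3.bucket.name;
    // Object keys arrive URL-encoded in the event payload
    const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, " "));
    console.log(`Upload completed: s3://${bucket}/${key}`);
    // ...kick off the rest of the workflow here
    // (Rekognition, a Step Functions execution, a profile update, etc.)
  }
};
```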

Eoin: The other point I was going to mention on the automated processing is that you might ask, is the new S3 Object Lambda feature something that will help us here? But S3 Object Lambda is something that allows you to do Lambda post-processing when you do a GET or a HEAD request, but it doesn't support any kind of automation on a POST. So no joy here, at least yet. And if you don't know about this particular feature, we have a blog post by Eoin that describes how to use all of that and what the limits are.

Luciano: So we'll have it in the show notes.

Eoin: Excellent. Excellent. Now, I suppose it's also worth mentioning that pre-signed URLs, you don't always have to use them. So if you've got full control over the client application, you have some other options as well. So what you can do instead is just embed the whole AWS SDK S3 client in your front-end web application or in your mobile application and just use the higher level APIs that the SDK gives you.

Sometimes there are some optimizations in there around large file uploads with multi-part that will be more beneficial if you just use the SDK directly. And all you need in order to be able to do that is some temporary IAM credentials that you can use in the client. So it's another way of doing it. So instead of signing the URL with IAM credentials on the server side, you basically just issue IAM credentials using like STS.

You can also use AWS Cognito with identity pools to do that. So if that's something that you're comfortable with, it's just another approach that you should probably be aware of and maybe think about whether that's best. The Amplify SDK also makes that whole thing a lot easier. I think through its Storage API it allows you to interact with S3 in a reasonably simple way. It's probably worthwhile just talking about some of the limitations.
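A rough sketch of that direct-SDK alternative in a browser app, assuming a Cognito identity pool is used to obtain temporary credentials; the region and identity pool ID are placeholders:

```typescript
// Sketch: using the S3 client directly in the front end with temporary
// credentials from a Cognito identity pool (placeholder IDs).
import { S3Client } from "@aws-sdk/client-s3";
import { Upload } from "@aws-sdk/lib-storage";
import { fromCognitoIdentityPool } from "@aws-sdk/credential-providers";

const s3 = new S3Client({
  region: "eu-west-1",
  credentials: fromCognitoIdentityPool({
    clientConfig: { region: "eu-west-1" },
    identityPoolId: "eu-west-1:00000000-0000-0000-0000-000000000000",
  }),
});

// Upload from @aws-sdk/lib-storage handles multi-part uploads for large
// files transparently, which is one of the optimizations mentioned above.
export async function uploadFromBrowser(bucket: string, key: string, body: Blob) {
  const upload = new Upload({
    client: s3,
    params: { Bucket: bucket, Key: key, Body: body },
  });
  await upload.done();
}
```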

And we've already said that for a PUT upload, you need to know the file size in advance because you have to set that content length header. That's a bit unfortunate because it limits your ability to stream the content from an indeterminate source. So you can't really limit the amount of data you're uploading unless you do some really funky stuff like updating the policy of the bucket itself. So every bucket has a resource policy and you can put lots of restrictions in there.

And even those restrictions can apply to certain object prefixes, key prefixes. But it's not the kind of thing that you want to be updating all the time for very specific user flows. So there's another blog post that we can put in the show notes, from bobbyhadz.com, and that's worth a look. And of course, the maximum file size, it's worth stating, is five terabytes. But that applies to S3 whether you use presigned URLs or not.

Worth restating again that if you use that special POST presigned URL with the form data, you can overcome some of those limitations. You don't need to specify the file size in advance, and you can specify conditions that include what range of content lengths you support, with a minimum and a maximum content length. Presigned URLs do have an expiry time, so they have a limited lifetime, but you cannot limit the number of downloads or uploads.

There's no easy way to do that. And you also can't easily limit downloads based on IP address. You would again have to go and change the bucket policy to put in a source IP condition. But it's probably more work than you really want to do to maintain all of these policies. And you might end up hitting policy size quotas if you keep adding user-specific rules into that policy. So those are the limitations. With that in mind, and given that we've explained what it is and how to use it, do we have any closing recommendations for people? You also gave a hint that you might have a secret tip. Yeah, I'll try to pay back everyone that is still listening.

Luciano: So basically in terms of recommendations, one of the most common recommendations that you will find is to try to keep presigned URLs short-lived. Because a presigned URL doesn't really expire after you use it, it only expires after the expiry time has elapsed. So if a user has a presigned URL, nothing is stopping them from using it twice or even more. So they can re-upload, they can re-download the file.

So the only real way that you can protect against that is to keep the expiry time as short as possible. But of course, don't keep it too short, because some people have observed that if you keep it too short, the clocks might be slightly out of sync between servers. So if you keep it, for instance, in the order of a few seconds, then as soon as the user starts to use the link, it's already expired. So probably a minute or two is fine for most use cases. Another tip is to enable CORS. And I think this is especially important if you want to use it from the front end. That applies if you use presigned URLs, and I think it also applies if you use the SDK. Right, Eoin? You still need to enable CORS in order to be able to do API calls from the front end. I'm not sure about that. I'd have to think about that.

Eoin: Okay. Caught me off guard.

Luciano: Worth verifying. If you know it, please leave it in the comments. But now let's get to the secret tip. And this is something that we actually discovered quite recently and we were quite impressed by it. So it turns out that presigned URLs are not only valid for uploads and downloads, but actually you can use them for any kind of S3 operation. And if you don't believe us, the simplest thing you can do to actually verify the statement is to try to use the SDK to create a presigned URL for a ListBuckets operation.
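For example, something along these lines (a sketch assuming the JavaScript SDK v3 presigner, which signs the HTTP request for whatever S3 command you give it):

```typescript
// Sketch: presigning an operation that is neither an upload nor a download.
import { S3Client, ListBucketsCommand } from "@aws-sdk/client-s3";
import { getSignedUrl } from "@aws-sdk/s3-request-presigner";

const s3 = new S3Client({ region: "eu-west-1" });

export async function createListBucketsUrl() {
  // Anyone holding this URL can list your buckets until it expires
  return getSignedUrl(s3, new ListBucketsCommand({}), { expiresIn: 60 });
}
```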

And then use that URL and see what the response is. And I'll give you another thing that you can try, which is actually a little bit more useful in practice, I believe. Which is, for instance, you can do a multi-part upload using presigned URLs. And the way you do that is basically you do the first operation of a multi-part upload. By the way, if you don't know what a multi-part upload is, it's basically rather than uploading the file in sequence, like byte after byte, you can split that file into multiple parts and then you can upload all the bytes of every part in parallel. So it's basically a way to try to speed up the upload.

And the way it works is that you generally have to do two API calls. One to start the multi-part upload and one to finish. And in between, you can create new parts. And when you create new parts, you can basically use the presigned URLs to do that. And at that point, you have URLs that you can use to trigger that upload without needing to have additional credentials. And actually, this is something we figured out in a blog post by altostra.com.
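A hedged sketch of what that multi-part flow could look like with the JavaScript SDK v3; part sizing, error handling, and the client-side PUTs are omitted, and all names are placeholders:

```typescript
// Sketch: multi-part upload with one presigned URL per part.
import {
  S3Client,
  CreateMultipartUploadCommand,
  UploadPartCommand,
  CompleteMultipartUploadCommand,
} from "@aws-sdk/client-s3";
import { getSignedUrl } from "@aws-sdk/s3-request-presigner";

const s3 = new S3Client({ region: "eu-west-1" });

export async function startMultipartUpload(bucket: string, key: string, numParts: number) {
  // First API call: start the multi-part upload and get an upload ID
  const { UploadId } = await s3.send(
    new CreateMultipartUploadCommand({ Bucket: bucket, Key: key })
  );

  // One presigned URL per part; part numbers start from 1
  const partUrls = await Promise.all(
    Array.from({ length: numParts }, (_, i) =>
      getSignedUrl(
        s3,
        new UploadPartCommand({ Bucket: bucket, Key: key, UploadId, PartNumber: i + 1 }),
        { expiresIn: 3600 }
      )
    )
  );
  return { uploadId: UploadId, partUrls };
}

// After every part has been PUT to its URL (in parallel), complete the upload
// using the ETag returned in each part's response headers.
export async function finishMultipartUpload(
  bucket: string,
  key: string,
  uploadId: string,
  parts: { ETag: string; PartNumber: number }[]
) {
  await s3.send(
    new CompleteMultipartUploadCommand({
      Bucket: bucket,
      Key: key,
      UploadId: uploadId,
      MultipartUpload: { Parts: parts },
    })
  );
}
```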

So we're going to have that blog post as well in the show notes. And there are examples of code you can see there, which I think make all of that a little bit more clear. So with that being said, I think we are at the end of this episode and we are really curious to know if you knew about S3 presigned URLs and, if you have been using them in production, what kind of use cases you have. And I don't know if this is your favorite S3 feature. It kind of is my favorite S3 feature right now. So if this is not your favorite S3 feature, please tell us in the comments which S3 feature you actually like the most. So with that being said, thank you very much for being with us. Remember to like and subscribe and we'll see you in the next episode.