Help us to make this transcription better! If you find an error, please submit a PR
with your corrections.
Luciano: S3, an object storage service, is one of the oldest and most used AWS services. Even if you don't use AWS regularly, chances are that you have been working on a project that relies on S3. In this episode, we will cover some of the best practices to adopt when creating and configuring S3 buckets. With the list that we are going to cover today, you'll be able to start with the right setup straight away and won't have to go back and revisit all your buckets.
I am Luciano and I'm here with Eoin for another episode of AWS Bites podcast. Today's sponsor is fourTheorem, an AWS consulting partner that helps your organization to get excited again about building software. Find more at fourtheorem.com, you'll find the link in the show notes. So, I would like to start today with a very quick recap of what is S3, because maybe it's the first time that you are approaching this topic and I think it's good to get the basics nailed down, or maybe you haven't used S3 in a while, and again, it's a good practice to just review what are the main concepts that we will be relying on for the rest of the episode.
So, S3 is a storage service. The name actually means Simple Storage Service, and the idea is that basically it allows you to store files in a scalable, reliable, and somewhat cheap way. There are two main concepts. The first one is the bucket. The idea of a bucket is pretty much that it is a container where you can store your files. So you can think about it as a folder or a drive in a kind of file-system parallel world.
But of course, it's in the cloud and it scales very differently, to sizes far beyond what you can get from your local drive. And what do you put in a bucket? You put objects. This is the term that S3 uses for what are, pretty much, files. And every file is identified by a key that needs to be unique in that particular bucket. Now, again, this is an AWS service, so you can use it from the web console, you can use it from the CLI, or even programmatically with one of the many SDKs that are available for different programming languages.
Just to give you some common use cases of what you could do with S3: you could store images, videos, or other assets that you need, for instance, for a web application or a mobile application. You can use it as a way to store backups in case of disaster recovery. You can also use it for long-term archival. There is another service that is often used with S3 called Glacier, and you can easily transition your files from S3 to Glacier for long-term archival.
You can also use it to implement big data solutions or data lakes, where you are going to be storing lots of structured files and then you can query those files directly from S3. You can also use it for application hosting, which means that if you're building a static web frontend, you can put it in S3 and expose it as a website. This is a topic that we covered a couple of times already. We will add links in the show notes to the previous episodes talking about this in more detail.
Now why are we talking today about configuring buckets correctly? Because this is a common topic and there are very common mistakes that happen all the time. You will read about them all the time in the news. Some of the common issues are accidentally leaving buckets publicly accessible, where there have been some very big industry failures. If you're curious, you can search for the Equifax and Capital One incidents just to get an idea of the magnitude of the problem.
But the idea is that you just forget to make it private, so whoever figures out the bucket name can read any file in that bucket, including all the sensitive information that might be stored there. Another problem arises if you don't use a good naming convention: because bucket names are unique across every account and every region, you might end up having a conflict with somebody else's bucket.
So you are not in a position where you can automatically provision that bucket, maybe if you're using some script, and you'll need to figure out some workaround. Having a good convention there will help you avoid conflicts. Other issues are accidental deletion: you might end up deleting files without having copies of those files somewhere else, so you basically lose information that might be vital for your business. We will see how you can protect yourself from that. And the last one is missing encryption, or in general a misconfiguration that, just to fix the problem, might eventually force you to migrate all the data to another bucket that is configured correctly, so that all the files copied there will be correctly provisioned. So shall we start by discussing bucket naming? Eoin, what do you think?
Eoin: Bucket naming shouldn't be that big of a deal, but in fact, what you choose for your bucket names, especially if you want to have nice consistent bucket names across the whole organization, is something important. And one of the important factors there is that while buckets are created in a specific region, the names themselves are globally unique. So avoiding collisions is important. You can think of them like domain names, and you might even have people squatting on them like domain names.
So there's a couple of recommendations there. One thing is to avoid personally identifiable information. If you have a multi-tenanted app, don't include the customer's name or a tenant's name in the bucket name. That makes it very difficult to delete that information afterwards. The bucket name could also be exposed in various places, like API calls, signed URLs, or even DNS requests.
So then when it comes to making them globally unique, well, you can include a unique portion, and that could be a hash, or it could be a combination of the region, the account ID and the environment. So one example would be something like ACME, project one, production, the account ID, and then the region name. And that will generally make it pretty difficult to have a collision in bucket names, but you might not like this because it publicly exposes some information, like the account ID, and that might not be to your taste or might not meet your compliance requirements.
So instead of that, you could just have a hash of all of those elements. I still think it's kind of nice to have some of those things visible just for readability and troubleshooting, if you can look at a bucket name and see where it lives, but that mightn't suit your organization. But I think either approach is generally good, and it is a good idea to have a random part in it, just in case there's that small chance that somebody else happens to have taken, or maliciously squats on, a bucket name that contains your account ID and region.
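As a small sketch of the kind of convention Eoin describes (the org, project, and account ID values here are made-up placeholders), a helper could assemble either a readable name or one where the sensitive parts are hidden behind a hash:

```python
import hashlib

def bucket_name(org, project, env, account_id, region, hashed=False):
    """Assemble a bucket name from org, project, environment, account ID
    and region, optionally hiding the last two parts behind a short hash."""
    parts = [org, project, env, account_id, region]
    if hashed:
        # Hide the account ID and region behind a 12-char deterministic
        # hash, keeping a readable prefix for troubleshooting.
        digest = hashlib.sha256("-".join(parts).encode()).hexdigest()[:12]
        name = "-".join([org, project, env, digest])
    else:
        name = "-".join(parts)
    # S3 bucket names must be lowercase and 3-63 characters long.
    name = name.lower()
    if not 3 <= len(name) <= 63:
        raise ValueError(f"invalid bucket name length: {name}")
    return name

print(bucket_name("acme", "project1", "prod", "123456789012", "eu-west-1"))
# acme-project1-prod-123456789012-eu-west-1
```

Because the hashed variant is deterministic, running the same script in another region or account following the same convention still produces a predictable, collision-resistant name.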
The reason for including a region, by the way, is that you might start off with a bucket in one region, but later decide that you want cross-region replication. And then you can just follow the same convention and have a similarly named bucket just with a different region identifier in it. Now if you're using CloudFormation to generate your bucket, if you don't put the bucket name in the template, it will just create one for you and it will do that random identifier part for you as well.
So it'll take the stack name that you're using in CloudFormation and then just append something like a 12-character suffix onto the end. So it isn't very readable, but it does help to automatically avoid those naming collisions. And then you can just publish that bucket name somewhere, like in an SSM parameter or a CloudFormation output, where other people can use it. That's bucket naming. So it's a good idea to know exactly what your bucket naming convention is, have it written down, enforce it, and then you don't really have to worry about it anymore, and you don't end up with differently named buckets all over the place and a bit of a mess. Beyond that, what other best practices do we need to think about, Luciano, when we're setting up buckets for the first time?
Luciano: One of the things I really like is versioning. S3 supports this concept where basically if you make changes to a file, you won't override the previous version of the file; it will just create a new latest version. So at any point in time, you can go back and see all the previous versions. And that is something that works even if you are deleting a file. It doesn't really delete it. It just creates a soft-delete marker, and you can revert it and restore the file if you have versioning enabled.
So it is a really good practice to avoid either accidental deletion or accidental overrides of files. And it gives you that additional peace of mind, especially if you're storing useful information that you're not just going to lose it accidentally by, I don't know, maybe a bug in your code or maybe by doing something very quickly on the CLI and accidentally deleting stuff that you were not supposed to delete.
So this is a good one, but there is a caveat there: of course, if you are storing multiple versions of an object, you are increasing the amount of storage that you are using, so that will affect your cost. This is something to keep in mind, and it might not be worth enabling this in every single environment. Maybe you want to do it just in production. You might not want to do it in other development or testing environments, where the data is not going to be as important as in production. So this is just something good to use and enable almost all the time, but, caveat, it might make more sense in production than in other environments. And always keep an eye on cost. What about observability instead? Is there any way for people to see what's going on in a bucket, and what kind of settings would they have to enable to do that?
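As a sketch of what enabling versioning looks like with the boto3 SDK (the bucket name in the usage comment is hypothetical, and the actual API call needs AWS credentials and an existing bucket), it comes down to a single call with a tiny payload:

```python
def versioning_config(enabled=True):
    # Payload for S3's PutBucketVersioning API. "Suspended" keeps
    # existing object versions but stops creating new ones; note that
    # versioning can never be fully removed once it has been enabled.
    return {"Status": "Enabled" if enabled else "Suspended"}

def set_versioning(bucket, enabled=True):
    # Requires AWS credentials and an existing bucket you own.
    import boto3
    boto3.client("s3").put_bucket_versioning(
        Bucket=bucket, VersioningConfiguration=versioning_config(enabled)
    )

# e.g. set_versioning("acme-project1-prod-example")  # hypothetical bucket
```

Splitting the payload into its own function keeps the configuration testable without touching AWS, which is also handy if you generate the same setting from infrastructure-as-code templates.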
Eoin: Yeah, there's definitely a good few tips here we can share for creating buckets for the first time or even changing the configuration of existing buckets you have if you don't have some level of observability. So one is to enable logging. And there's a couple of ways of doing this. You can with CloudTrail enable data events on S3, and then you'll get an audit log in CloudTrail of things like gets and puts on an S3 bucket.
So that gives you lots of detail, user identity information, all that stuff you get with the control events in CloudTrail. But data events have an additional cost. And if you've got lots of read and writes from your buckets, that can be very costly, actually. It can be more significant than the S3 storage cost itself, depending on your usage. So it isn't something I would recommend turning on by default, but it is powerful.
And you can just enable it for a subset of a bucket, like a prefix, or you can just enable it for write actions if you don't want to log all of the reads. If you want a similar level of audit logging capability but don't want to pay through the nose, the other option is to turn on server access logging. Then you just get a simple HTTP common-log-format access log, which gets placed into another bucket. And it gives you simpler information, like the requester IP, HTTP status codes, the time, and the object size. It doesn't give you as much detail as CloudTrail data events, but it is a lot cheaper. So I think that's a good one to turn on by default. So that's one to add to the list. Then we've got metrics. What can we do when it comes to getting insight through CloudWatch metrics?
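A minimal boto3 sketch of turning on server access logging (assuming a separate log bucket already exists and grants the S3 logging service permission to write; both bucket names here are hypothetical):

```python
def access_logging_config(target_bucket, prefix="access-logs/"):
    # Payload for S3's PutBucketLogging API: deliver access logs to
    # another bucket under the given key prefix.
    return {
        "LoggingEnabled": {
            "TargetBucket": target_bucket,
            "TargetPrefix": prefix,
        }
    }

def enable_access_logging(bucket, target_bucket):
    import boto3  # requires AWS credentials
    boto3.client("s3").put_bucket_logging(
        Bucket=bucket,
        BucketLoggingStatus=access_logging_config(target_bucket),
    )

# e.g. enable_access_logging("acme-app-prod", "acme-logs-prod")  # hypothetical
```

Pointing all buckets at one shared log bucket with per-bucket prefixes keeps the logs in one place and makes them easy to query later.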
Luciano: Yeah, the nice thing, as with many other AWS services, is that you get some metrics out of the box. You don't need to configure anything. They are just there for you to use when you need them. And some of these metrics for S3 are the daily storage metrics, so basically the number of objects and the bucket storage size. There are additional metrics that you can enable, for instance, the request metrics.
And you can create different configurations for different prefixes, so it's not global for a bucket; you can be more fine-grained if you want to. And basically the idea is that you will get one-minute-level granularity, where you can see the number of HEAD, GET, PUT and DELETE requests, and the number of 500- or 400-type errors. You can also see the total latency and the first-byte latency, which can be very useful if you're trying to troubleshoot performance issues rather than just errors. There is also a relatively new addition to S3, called S3 Storage Lens, which will give you an overview across all the buckets in your organization and show you useful metrics in an aggregated view of all the buckets. Let's move to security then. What can we recommend there? Yeah.
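A boto3 sketch of enabling request metrics for just one prefix (the bucket name and prefix are hypothetical; the call itself needs AWS credentials):

```python
def request_metrics_config(config_id, prefix=None):
    # Payload for S3's PutBucketMetricsConfiguration API. With a
    # Prefix filter, request metrics are emitted only for objects
    # under that prefix rather than for the whole bucket.
    cfg = {"Id": config_id}
    if prefix is not None:
        cfg["Filter"] = {"Prefix": prefix}
    return cfg

def enable_request_metrics(bucket, config_id, prefix=None):
    import boto3  # requires AWS credentials
    boto3.client("s3").put_bucket_metrics_configuration(
        Bucket=bucket,
        Id=config_id,
        MetricsConfiguration=request_metrics_config(config_id, prefix),
    )

# e.g. enable_request_metrics("acme-app-prod", "uploads", "uploads/")  # hypothetical
```

Once the configuration is in place, the per-request metrics show up in CloudWatch under the S3 namespace, filterable by the configuration Id.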
Eoin: Okay. Well, this is the really important stuff and there's really two things you need to think about. First one is public access and the second one is encryption. So when we talk about public access, this is where a lot of the horror stories have come out in the media. Now new buckets prevent public access by default. So now we've got good sensible defaults. You can also turn off the facility for people to create public buckets at an account level.
So that is something I would definitely recommend. It makes it a lot easier. You can also put in alerting, of course. When you disable public access, when you block public access, what you're doing is you're essentially preventing users from doing things like creating public ACLs. So with buckets, you have IAM controls like with every other AWS service, but ACLs are kind of an older access control mechanism that came with buckets initially.
And they're less commonly used these days and somewhat deprecated, I would say, but they are still used for specific scenarios, because ACLs allow you to control access on an individual object level. So you can associate an ACL with an individual object for very fine-grained control. But if you don't need that, you can generally avoid ACLs these days.
Then you have bucket policies. So bucket policies, you should also think about, okay, what's my boundary here? How far do I want people to be able to be from my bucket when they access it? So it's what you can use to enable cross account access, but you can also use it to restrict access. So it can restrict access to users within a VPC or accessing the bucket from a VPC endpoint or accessing it from a specific organization.
So with IAM condition keys, you can say, okay, allow everybody within my organization to read from this bucket. And of course you can do all the usual fine grained access control you can with IAM. So that helps you to avoid public access and just keep your request boundary to make sure people can access it from only within your network or whatever else you need. The other important thing is an encryption.
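The two pieces Eoin describes, blocking public access and scoping a bucket policy to your organization, can be sketched with boto3 as follows (the bucket name and organization ID are hypothetical placeholders, and the API calls require AWS credentials):

```python
import json

# Payload for S3's PutPublicAccessBlock API: all four flags on.
BLOCK_ALL_PUBLIC = {
    "BlockPublicAcls": True,
    "IgnorePublicAcls": True,
    "BlockPublicPolicy": True,
    "RestrictPublicBuckets": True,
}

def org_read_policy(bucket, org_id):
    # Bucket policy allowing read access from any principal inside
    # one AWS Organization, via the aws:PrincipalOrgID condition key.
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{bucket}/*",
            "Condition": {"StringEquals": {"aws:PrincipalOrgID": org_id}},
        }],
    }

def lock_down(bucket, org_id):
    import boto3  # requires AWS credentials
    s3 = boto3.client("s3")
    s3.put_public_access_block(
        Bucket=bucket, PublicAccessBlockConfiguration=BLOCK_ALL_PUBLIC)
    s3.put_bucket_policy(
        Bucket=bucket, Policy=json.dumps(org_read_policy(bucket, org_id)))

# e.g. lock_down("acme-app-prod", "o-example123")  # hypothetical IDs
```

Even though the policy uses a `*` principal, the `aws:PrincipalOrgID` condition means only identities from accounts in your organization actually match, and the public access block prevents the policy from ever being widened to true public access.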
Now since January of this year, I think, bucket encryption is on by default and new objects are encrypted with server-side encryption. There are three options when it comes to encryption on AWS S3. The first one is the simplest, and that's SSE-S3, for server-side encryption with S3-managed keys. That's when AWS manages the key and the encryption for you and you don't have to think about it. Then you have the other extreme, which is SSE-C, customer-provided key encryption, which is when you manage and store your key and you give the key to AWS with every request, and it will do the encryption and decryption for you.
But that has a lot of burden associated with it, because you have to manage distribution, storage, and rotation of those keys yourself. So the middle ground is SSE-KMS, where you have control over the keys, but AWS still stores them. You can have a customer managed key or the AWS managed key. Now I think in general, a customer managed key is the preferred option, since you have control and additional security, but you don't have the overhead of storing and distributing that key yourself as you do with SSE-C.
And the thing about using SSE-KMS is that even if your bucket is compromised, so somebody gets credentials that allow access to the bucket, they would also need credentials with access to the KMS key if they were actually going to read that data. So this is the really important point here. So I think KMS with a customer managed key, SSE-KMS, is the best balance here. And there was actually a good article by Yan Cui recently who points out that just because encryption by default is now on in AWS, your job isn't done. We'll provide the link to that article in the show notes as well, because that was a good one and well worth pointing out. So that's another one for our list, SSE-KMS. What else have we got?
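Setting SSE-KMS with a customer managed key as the bucket default can be sketched with boto3 like this (the KMS key ARN in the usage comment is a made-up placeholder; the call requires AWS credentials):

```python
def sse_kms_config(kms_key_arn):
    # Payload for S3's PutBucketEncryption API: default new objects
    # to SSE-KMS with a customer managed key. BucketKeyEnabled reduces
    # KMS request costs by reusing a bucket-level data key.
    return {"Rules": [{
        "ApplyServerSideEncryptionByDefault": {
            "SSEAlgorithm": "aws:kms",
            "KMSMasterKeyID": kms_key_arn,
        },
        "BucketKeyEnabled": True,
    }]}

def enable_sse_kms(bucket, kms_key_arn):
    import boto3  # requires AWS credentials
    boto3.client("s3").put_bucket_encryption(
        Bucket=bucket,
        ServerSideEncryptionConfiguration=sse_kms_config(kms_key_arn),
    )

# e.g. enable_sse_kms("acme-app-prod",
#                     "arn:aws:kms:eu-west-1:111122223333:key/example")  # hypothetical
```

With this default in place, objects uploaded without an explicit encryption header still end up encrypted under your customer managed key, which is the two-credential protection Eoin describes.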
Luciano: I think another interesting point could be integration, because very often you don't use S3 standalone. You will use it as an integration point for other things that you are building. And there are different ways to trigger events or to interact with the lifecycle of objects in S3. So let's try to cover some of them, because some of these will have an impact on the configuration of the bucket. For instance, one way is what is called S3 notifications, probably one of the oldest notification mechanisms in S3, which basically allows you to trigger a notification to Lambda, SNS, or SQS every time that there is a change in a bucket, like a new file being added.
And that requires you to explicitly configure and enable that notification mechanism. So that's something that you might want to do if you have that kind of use case in mind for your particular bucket. There is another alternative, which is CloudTrail data events. Then there is another one, which is probably one of the newest, which is through EventBridge. So you can enable EventBridge notifications. And this is probably the most recommended way these days.
So once you turn this feature on, basically you don't have to make any other change. You can just listen to events in EventBridge, and you don't pay additional charges for the notifications either; if, for instance, you configure a Lambda to be triggered, you are only going to pay for the execution of that Lambda. Also, it's very interesting that you can use EventBridge content filtering if you want to do more in-depth matching for specific events. For instance, maybe you're not interested in all the files, maybe only the files with a specific prefix. You can do that through EventBridge content filtering.
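A boto3 sketch of both halves of this: turning on EventBridge notifications for a bucket, and the kind of EventBridge event pattern with content filtering that matches only objects under one prefix (the bucket name and prefix are hypothetical; the API call requires AWS credentials):

```python
# Payload for S3's PutBucketNotificationConfiguration API: an empty
# EventBridgeConfiguration turns on delivery of all S3 events for this
# bucket to the account's default event bus.
EVENTBRIDGE_ON = {"EventBridgeConfiguration": {}}

def object_created_pattern(bucket, key_prefix):
    # EventBridge event pattern using content filtering: match only
    # "Object Created" events for keys under the given prefix.
    return {
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {
            "bucket": {"name": [bucket]},
            "object": {"key": [{"prefix": key_prefix}]},
        },
    }

def enable_eventbridge(bucket):
    import boto3  # requires AWS credentials
    boto3.client("s3").put_bucket_notification_configuration(
        Bucket=bucket, NotificationConfiguration=EVENTBRIDGE_ON)

# The pattern would then go on an EventBridge rule, e.g.:
# events.put_rule(..., EventPattern=json.dumps(
#     object_created_pattern("acme-app-prod", "uploads/")))  # hypothetical
```

The filtering happens on the EventBridge side, so the bucket configuration stays a one-time, one-flag change, which is what makes this mechanism so convenient compared to classic S3 notifications.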
Eoin: It's nice that they added that one, because in the past we had S3 notifications, which meant you had to go and modify the bucket, which wasn't always possible, especially if you were just deploying a new CloudFormation stack. And then CloudTrail data events could be costly as well. You could still access them through EventBridge, but it could be slow, and then you had the additional cost of data events, which caught me out a few times, I have to say.
So I definitely think adding the new EventBridge method to the checklist is a good no-brainer for creating new buckets. So there are a couple of other settings that we might consider. I wouldn't necessarily put them on the must-have list, but there's some nice to haves that you might think about depending on your context and the workload. One of them is, well, a couple of them, I suppose, in the area of compliance and security are MFA delete.
So you can enforce that multi-factor authentication is enabled for the principal trying to delete an object. It's a little bit cumbersome to enforce in some cases, because now with things like SSO identities, you don't even have the MFA flag in some cases, but there are ways around that. And another one is object locking. So you can enable an object lock to prevent objects being deleted for a period as well.
That's often a compliance situation. We mentioned replication. I think you mentioned it a few times, Luciano, and it is something to think about from the get-go. Will I be replicating this to another region, another bucket? How much data will I be replicating? Do I need to set this up from day one? Just so I have all that data there from the start and to test it out and see how replication works, how long it takes, and to understand all of the different nuances with it.
Lifecycle policies. There's other different storage tiers with different costs associated with them. And you can move data between the tiers. It can get a bit complicated, but you can also save a lot of money. There are some good cases of people saving a significant amount of cost. I think Canva was a recent case study I saw come out where they saved a lot of money by using lifecycle policies. So again, do your calculations.
If you intend on using S3 for a lot of your data, you might save significantly. And you can even turn on intelligent tiering from day one. That might give you a good balance between complexity and cost savings from the start. The last one is access points. And these are being used more and more for different S3 scenarios. We talked about them for S3 object lambdas, which leverage access points. But fundamentally, it's just a way of having another way to access S3 buckets without using the bucket name.
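As a sketch of the lifecycle idea with boto3 (the tier transition days here are illustrative choices, not recommendations from the episode, and the call requires AWS credentials and a real bucket):

```python
def lifecycle_config(ia_days=30, glacier_days=90, noncurrent_days=30):
    # Payload for S3's PutBucketLifecycleConfiguration API: move objects
    # to Infrequent Access, then Glacier, and expire old noncurrent
    # versions left behind when versioning is enabled.
    return {"Rules": [{
        "ID": "tiering-and-cleanup",
        "Status": "Enabled",
        "Filter": {"Prefix": ""},  # empty prefix applies to the whole bucket
        "Transitions": [
            {"Days": ia_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_days, "StorageClass": "GLACIER"},
        ],
        "NoncurrentVersionExpiration": {"NoncurrentDays": noncurrent_days},
    }]}

def apply_lifecycle(bucket):
    import boto3  # requires AWS credentials
    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=lifecycle_config())

# e.g. apply_lifecycle("acme-app-prod")  # hypothetical bucket
```

Alternatively, a transition to the `INTELLIGENT_TIERING` storage class lets S3 pick the tier per object automatically, which is the lower-effort option mentioned above.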
Instead you use a different access point name, which is generated for you. And it allows you to have a dedicated resource policy with specific security controls around that access point. So it allows you to have different access modes for different people. And this is something that's also leveraged in the new multi-region access point support. So if you do have replication across two different regions, with buckets in separate regions, you can have a multi-region access point, which allows users to access from the region with the lowest latency. So those are all things to research and consider as part of setting up S3 buckets. Not necessarily must-haves, but maybe before we wrap up, Luciano, could we summarize our checklist of must-dos for bucket creation? Sure.
Luciano: So the first one that we have is: once you create a new bucket, you should turn on server access logging, because it's easy, it should be relatively cheap, and it will give you that bit of visibility that you might want to have to see what's going on in a bucket. Then you might want to turn on request metrics, just to get a better, more detailed set of metrics that you can use, again, for building dashboards and for troubleshooting latency issues or things like that.
Then we have enable versioning, again, for extra peace of mind in case you might be worried that some file might be deleted accidentally. This one might be more worth it in production than in other environments, just because of the cost implications of basically having multiple copies of the same object as you evolve it. Use globally unique names; well, basically, use names that will help you avoid collisions with other bucket names.
Again, remember that bucket names are global across all accounts and regions. So make sure to figure out a convention that is consistent but at the same time reduces the chances of having collisions, either within the same organization or even with other organizations that you don't have control over. You can turn on EventBridge notifications, because it's probably the simplest way to create integrations on top of files being created, deleted, or modified in S3.
Then in terms of encryption, probably the easiest approach is to use SSE-KMS with a customer managed key. And finally, make sure to disable public access. That should happen by default with new buckets, but if you are revisiting older buckets, just make sure that that setting is there. So that's everything we have for this episode. I don't know if you have any other suggestions or best practices that you have been using. Please let us know, share them with us in a comment down below, or reach out to us on Twitter or on LinkedIn, and we will be more than happy to chat with you and learn from you. If you found value in this episode, also please remember to give us some kind of feedback, to write a review of the podcast, or, if you're watching it on YouTube, to give us a thumbs up and subscribe. Thank you very much, and we will see you in the next episode.