Eoin Shanaghy: S3 has got to be the most commonly used and most loved AWS service. It's simple to get started with, largely cost effective compared to alternatives, and scales massively. But it's not a file system. It follows a key value object store model, and this makes it a bit of a misfit in cases when you want to use it like a standard folder using regular file operations. Now, normal file systems are usually required for things like databases, applications that write append log files, web applications or CMS apps that assume a mounted folder for their data.
And even though S3 is well supported for things like big data and batch processing workloads, it can actually become a performance bottleneck if you've got lots of tiny files. Now AWS has just released S3 files. This is a new managed file system backed by S3. S3 files tries to be the sweet spot, giving you a proper file system with S3 underneath. The promise of it is that you'll get the scalability, durability and cost benefits of S3 with performance and behavior of a file system. And one of the big benefits is that it integrates easily into EC2, ECS, and even Lambda, unlike some of the previous options. We're going to dive very deep into S3 files, talk you through how to use it, where it fits, and how it compares against all your file storage options. You'll hear us share our experience, since we have also been using S3 files in real-world projects, and we also did some benchmarking. I'm Eoin, I'm here with Luciano. Welcome to AWS Bites, episode 154. Luciano, maybe you can start off by telling everyone why is S3 not a file system in brief?
Luciano Mammino: Yeah, the first reason is because it doesn't have the concept of directories. What looks like a folder is basically just a key prefix. You can use the familiar slash to make it look like there are directories, but it's just a naming convention. And because it doesn't have true directories, you cannot do things like atomic moves or renames. For instance, if you just want to rename a file or move it from one place to another, which basically means changing the prefix entirely, then even if you use the aws s3 mv command, you are effectively copying the object from one place to another.
And if you've ever done that with a very large file, a very large object on S3, you might have noticed that it takes some time. While if you do that in a real file system, it's pretty much instantaneous because it's just literally renaming the file itself and not moving data or copying data. So that's another big difference, which is interesting. The other thing is that objects are immutable. You cannot modify a range of bytes inside an object like you could do in a file system.
And in that sense, you cannot really append either into an existing object. There is a little bit of an exception, which is multi-part uploads. But it doesn't really work in the way you might expect from a regular file system. And if you want to deep dive into the details, we have been talking about that in episode 124. So go and check that out if you're curious. Another thing is that listing of either the entire bucket or a prefix can be expensive, or at least more expensive than a regular file system.
And this is because there is no directory metadata in the same way that it would exist in a file system. So when you're running a ListObjects operation, it's effectively querying over the keys in the bucket. Access control is something that exists in S3 buckets, but again, it's different from what you would have in a POSIX file system, for example, because you generally determine access control using IAM policies or ACLs, and not by setting up, I don't know, users and groups as you would do in a POSIX file system.
And performance is also a little bit different because it's defined by how you structure your partitions by using prefixes. So if you put lots of objects in the same prefix, you can damage your performance. So it's typical to distribute your data over different partitions by using a bunch of different sub-prefixes, which makes querying S3 faster.
So there are specific tricks you would use with S3 that don't necessarily map to things you would do with a regular file system. And of course, there are a lot more subtle differences. And the real issue here is that there are applications out there that need file system semantics. So when you try to use S3 to mimic a file system, sometimes you might bump into things that don't necessarily match the abstraction that the application expects. So that's something to be aware of. And we'll see today how S3 Files tries to fill that gap. But I guess before getting into S3 Files, which is pretty new, this is a topic that people have been trying to address for a long time. So what are the other solutions that existed for longer than S3 Files at least?
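To make the points about prefixes and renames concrete, here is a minimal sketch using a toy in-memory key-value store. It is not the S3 API: the class name ToyObjectStore and its methods are made up for illustration. It only shows why a "folder rename" on a flat key space turns into a copy and a delete for every object under the prefix.

```python
class ToyObjectStore:
    """In-memory stand-in for a flat key-value object store (hypothetical, not the S3 API)."""

    def __init__(self):
        self.objects = {}      # key -> bytes; there are no real directories
        self.bytes_copied = 0  # how much data "renames" had to move

    def put(self, key, data):
        self.objects[key] = data

    def list_prefix(self, prefix):
        # No directory metadata exists: listing is a filter over all keys.
        return sorted(k for k in self.objects if k.startswith(prefix))

    def rename_prefix(self, old, new):
        # A "rename" is a copy plus a delete for each object under the prefix.
        for key in self.list_prefix(old):
            data = self.objects[key]
            self.objects[new + key[len(old):]] = data
            self.bytes_copied += len(data)
            del self.objects[key]


store = ToyObjectStore()
store.put("logs/2024/a.txt", b"x" * 10)
store.put("logs/2024/b.txt", b"y" * 20)
store.put("images/cat.png", b"z" * 5)

store.rename_prefix("logs/", "archive/")
print(store.list_prefix(""))
print(store.bytes_copied)
```

The work done by rename_prefix grows with the number and size of objects under the prefix, whereas a rename in a regular file system only touches directory metadata.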
Eoin Shanaghy: Well, you have the FUSE user space file system option, which supports lots of file systems, and there's an s3fs-fuse library, which allows you to do that with S3. A lot of people would do that in development, just to be able to explore buckets in a file system. There's also the very popular Python fsspec s3fs library. It's a different s3fs, which is used in a lot of big data applications. Hadoop also has S3A, which is like an HDFS abstraction on top of S3.
And more recently, in the last couple of years, you have Mountpoint for Amazon S3, which is Amazon's own file system adapter. We did a whole episode on this, actually episode 95. In fact, that episode covers a lot of the options out there for mounting S3 as a file system, at least before S3 files came along. Now on top of that, there are actually a whole lot of AWS services that provide a bridge layer between S3 and file systems, like FSx for Lustre.
Now this is a file system for really high performance computing with S3 as a backing data repo. If you're in the HPC space, that's one you'll come across quite frequently. There's Amazon File Cache. This is one that never really made it mainstream, I feel, but it's also a high performance option. It's a general caching layer for EC2 only that works on top of S3 or can work on top of NFS. And it's built on top of Lustre, which is one of those high performance file systems. The other one I can think of is Storage Gateway, which is this whole suite of services mostly for connecting on-premises storage to AWS. One of those is called S3 File Gateway, and it can present S3 as an NFS or an SMB file share. Now, we've covered that all in a previous episode, episode 95, so let's get straight into S3 files. How does it work?
Luciano Mammino: Yeah, what we mentioned already is that S3 Files makes a normal S3 bucket accessible as a shared file system. So you can still use the S3 bucket as normal, but you can also mount it as a file system. Any change you make in the file system is eventually reflected in the S3 bucket. And you can access S3 files from a bunch of different services in AWS, like EC2, ECS, EKS, and even Lambda using NFS. So data isn't just stored in the bucket itself, but you are effectively seeing it in those compute layers as if it was a file system, not normally available in those compute instances.
The interesting thing is the way the connection with S3 is managed by S3 files: files can be streamed from S3 to the NFS mount, but in some cases, the files can also be cached. And this is by default the case for smaller files. So the idea is that the first time you are accessing that file through the file system mount, the file is going to be streamed, but also cached in an intermediate layer, which is going to give you increased throughput and lower latency.
So effectively, the next time you try to access that same file, the read is going to be much faster and you will have a much higher throughput. And this is something you need to be aware of because it's one of those things you can configure for performance. So it really depends on your use cases, the size of the files you are managing, and what you want to be in the cache versus what you want to always be streamed directly from S3.
But by default, files smaller than 128 kilobytes are cached. And yeah, if you need to change that, you can do it. And you should benchmark your use cases to see if that actually improves performance or even changes your cost trade-off. We'll talk more about costs in a second. Now, EFS is used under the hood to provide the caching layer. So you can, like your mental model could be like it's going from S3 to EFS and then from EFS to the compute level if you are using this caching mechanism. Otherwise, it's just streamed directly from S3. And this is probably why you can see that this storage is available for Lambda as well, because as you know, EFS is also something you can use with Lambda. So in general, everywhere where you can use EFS, it's very easy to see why they made S3 files available. Now, with all of that being said, how do we get started?
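The read path described above can be sketched as a toy model. This is only a mental model under stated assumptions, not how AWS implements it: the class ToyReadPath is hypothetical, the bucket is a plain dictionary, and the only rule it encodes is the one quoted in the episode, that files smaller than 128 kilobytes are cached on first access while larger files are always streamed from S3.

```python
IMPORT_THRESHOLD_BYTES = 128 * 1024  # default threshold as quoted in the episode


class ToyReadPath:
    """Hypothetical model: small files are cached after the first read, large ones are streamed."""

    def __init__(self, bucket, threshold=IMPORT_THRESHOLD_BYTES):
        self.bucket = bucket        # dict of key -> bytes, standing in for S3
        self.threshold = threshold  # assumed to be configurable, as in the import rules
        self.cache = {}             # standing in for the EFS-backed caching layer

    def read(self, key):
        if key in self.cache:
            return self.cache[key], "cache"
        data = self.bucket[key]
        if len(data) < self.threshold:
            self.cache[key] = data  # small file: keep it for later reads
        return data, "s3"           # this read itself came from the bucket


bucket = {"small.json": b"{}" * 100, "large.bin": b"\0" * (1024 * 1024)}
fs = ToyReadPath(bucket)

first = fs.read("small.json")[1]
second = fs.read("small.json")[1]
large_twice = (fs.read("large.bin")[1], fs.read("large.bin")[1])
print(first, second, large_twice)
```

Raising the threshold in a model like this moves more reads to the cache, which is the same lever that affects both performance and cost in the real service.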
Eoin Shanaghy: Yeah, the ingredients list for this is not too long, actually. So you need an S3 bucket, if that wasn't already obvious. And then the next thing you'll create is the S3 files file system. This is a resource that's linked to your bucket and a file system IAM role. You can also link it to a specific prefix in your bucket rather than the whole bucket. And then within this file system resource, you define a couple of rules.
So you've got the expiration rules, which say how long data hangs around in the cache. I think it's 24 hours by default. But you can set it up to, oh, I can't remember. But it's a lot longer than that. So you can have your expiration rules, and then you have the import rules. The import rules allow you to say what the maximum file size is for cached data. So that's 128 kilobytes by default, but you can set it to higher than that if you like.
And you say whether the data is imported automatically when a directory is first accessed or when the file is first accessed. So that could be quite useful in that if you access one file, S3 files can go ahead and import everything else in the directory within the size threshold, if you like. And these import rules can be different for different folders. It's worth mentioning that the file system IAM role we mentioned needs to trust elasticfilesystem.amazonaws.com.
So that'll give you a hint about how tightly integrated it is with EFS, but it also needs EventBridge and S3 access. So S3 files uses EventBridge under the hood for S3 notifications. The next thing you need is a mount target. And this is the network link between your S3 files and the subnets in your VPC. You need one of these for each availability zone you want to support. So you'll provide for each of these mount targets a subnet, security groups, and then say whether you want to support IPv4 or v6.
And last thing, you can also create an access point. If you don't do this, you just mount the file system itself, but access points allow you to provide specific directories as the root and provide a specific POSIX user ID and group ID. So mount targets and access points look exactly like the same things you have in EFS. They have the same names. The file system resource is quite similar too. But instead of configuring the EFS performance and throughput options, which you don't have here, you just link to your bucket and set the import and expiration rules.
And as I mentioned, since this is EFS, it's a service that you have to access over the VPC. There's no public endpoint like you have with normal S3 access. For Lambda users, that means your function should be set up with security groups and subnets. And when it comes to mounting your file system, it's just like EFS. In Lambda, you provide the file system or the access point ARN and the mount path. And for ECS, you'll create a volume, a container volume from the file system or the access point.
And those volumes are then mount points in your ECS task definitions. Because it uses S3, we talked about how it'll stream directly from S3 for large files. You need to have either internet access from your subnet or VPC endpoints to S3. And your security groups, because it uses NFS, will need the NFS port outbound. That's 2049. I think all in all, if you're used to AWS, it's a relatively simple one. It's not the most complex AWS service setup we've seen. And it's familiar, right? It follows EFS very closely. There are some more things you can configure, like file system policies, resource policies, to get tighter access control. So I think that's probably all we can say about what S3 Files is, how to set it up. So let's talk about how it behaves, maybe starting first with performance, because this is pretty important.
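Two of the setup details above can be written down as plain data. This is a hedged sketch, not a complete template: the trust policy uses the standard IAM trust policy shape with the EFS service principal mentioned in the episode, and the egress rule uses the field names of an EC2 security group rule for the NFS port. The S3 and EventBridge permissions the role also needs are deliberately not shown, and the variable names are illustrative.

```python
import json

# Trust policy for the file system IAM role: lets the EFS service assume it.
# The permissions policy (S3 and EventBridge access) is not shown here.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "elasticfilesystem.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# Outbound rule for the compute side (Lambda, ECS task, EC2 instance).
# The destination is left out because it depends on your mount target's security group.
NFS_PORT = 2049
nfs_egress_rule = {"IpProtocol": "tcp", "FromPort": NFS_PORT, "ToPort": NFS_PORT}

print(json.dumps(trust_policy, indent=2))
print(json.dumps(nfs_egress_rule))
```

Treat both as starting points to check against the current AWS documentation for your deployment tool of choice.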
Luciano Mammino: Yeah, I'm going to read straight from the AWS documentation just to be sure I give you the right numbers. And basically, each file system supports up to five gigabytes per second of write throughput performance. And they say multiple terabytes per second of aggregated read throughput, up to 250k of read IOPS performance and 50k of write IOPS performance. The maximum per client read throughput is 3 gigabytes per second.
And when accessing files that aren't cached in your file system, the file system needs to first retrieve the data from the S3 bucket, which has latencies in the tens of milliseconds. Data stored in the file system is read with low sub-millisecond latencies. Writes are staged on the file system with single-digit millisecond latencies. So that's all that AWS has to say about performance. My personal experience is that it might be a little bit tricky to put all these figures together depending on the way you use S3 files, depending on the shape of your data in S3. Like you have big files, large files. How much do you access all of them? How are you setting up the caching layer? So in reality, these numbers, I would treat them just as rough guidance, just to have a high-level understanding. But I would recommend people to do your own benchmarks and see, does it really work for your use cases? So that I think moves us into our experience. Eoin, do you want to share anything about that?
Eoin Shanaghy: Yeah, the nice thing I think there is that compared to just normal EFS, EFS can be a bit complex because it's got these different modes. The latest one is elastic throughput and then you've got IOPS, right? You've got provisioned IOPS versus burst, the old burst method. It can be difficult to get a handle on. And if you do provision IOPS, it can be very expensive. So what we wanted to do was just do a bit of benchmarking.
So we'd have a better idea of when to use this in our projects as well, where the sweet spot was, but also to share with everyone. So we wrote a fairly simple benchmarking application. It's not totally scientific because there are always other influencing factors here. We tried to make it as simple as possible. And the approach was we wanted to do reads and writes of small and large files to S3, so not that large, but just above the threshold where they would be cached, and just measure the performance.
And we also wanted to compare it against all the different EFS configurations you can have, and then run it on Lambda and Fargate. But also, not just Lambda and Fargate, we wanted to do different CPU and memory configurations. So we chose a few different configuration sizes between 256 megs of memory and 10 gigs of memory in both Lambda and Fargate. Now, the repo and results summary will be up on GitHub if you want to see it for yourself, and the link will be in the description.
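If you want to try a similar measurement yourself, a minimal sketch of a small-file timing loop could look like the following. This is not the benchmark from the repo mentioned above; the function name time_small_files and its parameters are made up for illustration, and it runs against a temporary local directory by default. To measure a real mount, you would pass the mount path instead. Note that reads issued right after writes on the same client may be served from the operating system's page cache, so a serious benchmark needs more care than this.

```python
import os
import tempfile
import time


def time_small_files(directory, count=50, size=4096):
    """Write then read `count` files of `size` bytes and return elapsed seconds for each phase."""
    payload = os.urandom(size)
    paths = [os.path.join(directory, f"bench-{i}.bin") for i in range(count)]

    start = time.perf_counter()
    for path in paths:
        with open(path, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # push the write to the file system before timing stops
    write_seconds = time.perf_counter() - start

    start = time.perf_counter()
    bytes_read = 0
    for path in paths:
        with open(path, "rb") as f:
            bytes_read += len(f.read())
    read_seconds = time.perf_counter() - start

    for path in paths:
        os.remove(path)  # clean up the test files

    return {"write_s": write_seconds, "read_s": read_seconds, "bytes_read": bytes_read}


# Runs locally; replace the temporary directory with your mount path to test a real file system.
with tempfile.TemporaryDirectory() as tmp:
    result = time_small_files(tmp)
print(result)
```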
The SAM CloudFormation template we created might also be useful just if you're figuring out how do you write one of these things and get it running yourself with S3 Files. I think we both had a few cases of trial and error trying to get the first one working, so we have something that now works. So what are our overall findings? Well, small file writes are generally slightly faster than S3 put object by about 10 to 20%.
That's what we can see. The reads are generally dramatically faster than S3 directly, which you would expect for cached data, like 5x to 10x faster. The large file reads are generally much, much faster with S3 files compared to EFS options. And that's because we know it's going through to streaming directly from S3. It's still a bit of a surprise to us. We would have thought that those large file accesses through the different EFS options, especially with provisioned IOPS would be faster, but this is what our results are showing.
So it's interesting, maybe there's some smart parallel fetching going on there in the background. In terms of large file writes, there's very little difference. It's consistent across the board. Otherwise, I would say the performance of S3 files is quite similar to EFS with a tiny extra bit of latency for S3 files. So the only case where S3 files is significantly faster is in the read operations.
Now, since we tested with various memory and CPU configurations, it's worth calling out a general observation, which isn't specific to S3 files. If you have Fargate or Lambda with 512 megabytes or less, the network bandwidth really hurts. And IO can be 10 times worse than just with 2 gigabytes of RAM. The best performance always came from having 10 gigabytes of RAM.
For Lambda, we know that memory allocation is proportionally tied to CPU and network. And it seems like there's a similar correlation for Fargate too. So if you're doing anything that's IO sensitive, having a memory allocation of two gigabytes or more, if possible, will really help you. The results just start to fall off a cliff if you try to be really frugal when it comes to allocating those resources.
There are lots of benefits here, and your mileage will vary depending on your workload. But in general, I'd say this is not one of those services that has a very narrow niche. I think it has a broad set of use cases. If you've got a lot of hot data where everything's hitting the cache, maybe it's not going to work very well for you. But in a normal case, where most of the data you don't access most of the time, but you can benefit from some speed up, especially with small files, and you like the file system model, then I think it's a real winner. Right. After that, what's the downside?
Luciano Mammino: Yeah, there are quite a few and it's well worth being aware of them before just jumping straight into using S3 files for everything. So the first one we already mentioned to some extent is that this supports NFS only, so it doesn't support other network file share systems or protocols like SMB. The file system behavior, as we said, it's trying to do the best that it can possibly do to fill the gap between what S3 can do and what a regular file system is supposed to do.
So there are, of course, things that are missing or that cannot easily be replicated. For example, there are no hard links and there are no atomic renames. An interesting thing is that if you try, for instance, to modify a file from the file system, these changes are staged in the file system and then eventually they are synchronized to S3, which is kind of an interesting optimization, but comes with some potentially unexpected side effects that we'll explain in a second.
But the idea is that basically you might be writing to files in the file system over a few different operations. Maybe the common example is you are appending a few times over a few seconds. Especially if this is a large file, you wouldn't want every change to be immediately synchronized back to S3, because you would be doing a lot of unnecessary writes to S3.
So effectively what S3 files is doing for you is waiting for a certain amount of time to see whether you are doing any other writes to this object before it's actually synchronized back to S3 in one atomic operation. So the issue here is that you need to be aware that there is a delay that comes from this optimization. So if you are using a mix of access patterns, for example, if you are reading from the file system in one place, also writing from the file system, but then also reading directly from S3 somewhere else, you might see that there is a little bit of eventual consistency. So you might not see the new data as soon as you might expect. So just be aware of that. I think the default is 60 seconds. I'm not even sure if it's something you can configure. But yeah, be aware that the consistency isn't immediate. So be aware of eventual consistency. Yeah.
Eoin Shanaghy: And this 60 second wait is a period where it will wait to see if there are more write operations. So if you do have another write operation in that 60 second period, the 60 seconds will restart. So it can take multiple minutes before the file is actually written in that case.
Luciano Mammino: Yeah, I can imagine one use case where you are streaming logs from somewhere and piping them to the file system itself. If these logs are streaming for the entire duration of the application, you're never going to see that file, or those changes at least, being reflected in the S3 bucket itself. So just be aware that for some of these use cases, you need to really understand the model to see if everything is going to work for you using the synchronization primitives that S3 Files provides for you.
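The write-back behavior described above works like a debounce timer, and a small model makes the consequences easier to see. This is a sketch of the behavior as described in the episode, assuming a 60 second quiet period that restarts on every write; the function name sync_times is made up, and the real service may behave differently in details.

```python
def sync_times(write_times, quiet_period=60):
    """Return the times (in seconds) at which a debounced write-back would fire.

    Each write restarts the quiet period, so a sync only happens once
    `quiet_period` seconds pass with no further writes to the file.
    """
    syncs = []
    deadline = None
    for t in sorted(write_times):
        if deadline is not None and t >= deadline:
            syncs.append(deadline)   # the previous quiet period elapsed before this write
        deadline = t + quiet_period  # every write restarts the timer
    if deadline is not None:
        syncs.append(deadline)       # final sync after the last write
    return syncs


# Two writes far apart: each one is synchronized on its own.
print(sync_times([0, 100]))

# Writes at 0, 30 and 70 seconds: the timer keeps restarting, so there is one late sync.
print(sync_times([0, 30, 70]))

# A log line every 10 seconds for an hour: nothing reaches S3 until the writes stop.
print(sync_times(range(0, 3600, 10)))
```

In the log-streaming case the only sync lands 60 seconds after the last write, which is why a reader going directly to S3 would not see the file change while the application is still running.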
And another thing that I actually bumped into myself in one of the projects we are working on with one of our customers is that if you use S3 files and you have an organization with multiple accounts, and maybe you have a control plane account where you have a kind of a bucket with data you want to share with a bunch of other accounts, you would expect that you could use S3 Files with a bucket that exists in another account, even if it's in the same organization and region, but that doesn't seem to be the case.
I actually couldn't find an explicit mention that this is a real limitation in the documentation, but everything I tried just didn't work. And so my conclusion is that this feature is not supported yet. I'm hoping that it's somewhere in the roadmap in AWS, because I think it could be very useful. And just to make you understand why this could be useful, I want to share a little bit of the project we are working on and why we thought this could be a good idea.
So we are building a SaaS application where different customers will get their own dedicated accounts, so they can run some kind of modeling workloads in their own isolated accounts. So effectively, the shape of the architecture is that there is a central control plane where we have all the shared resources. For instance, all the models that the SaaS will expose are stored in S3, and they are organized with specific prefixes.
And then every time we onboard a new tenant, there is a new account that is created and added to the organization, and a tenant is going to have access only to a subset of these models. So we also need to put in place a mechanism to basically allow each tenant to read only the models that they have access to from this kind of golden bucket that exists in the shared control plane. So we thought that S3 Files, with the feature that allows you to mount file systems only on specific paths, would be a good match.
Also because these models are kind of immutable, so we upload them once, they're never changed. If they are changed, we upload an entire new version, so we don't even have all the issues we just described in terms of synchronizing the data. We literally just needed to have an EFS mount from S3, and S3 Files seems to be pretty good with all the caching mechanisms, and we don't suffer from any of the eventual consistency issues.
So unfortunately, because the cross account mount doesn't work, what we needed to do in the end is to figure out a synchronization mechanism where we can selectively replicate data from the central bucket in the control plane to a bucket that exists in each tenant account, which is a little bit annoying. It kind of works in the end. It just adds a bit of extra complexity that we didn't want to use. But otherwise, so far, S3 Files has been working really well.
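The selection step of a workaround like this can be sketched in a few lines. This is not the project's actual code: the function name keys_to_replicate, the key layout, and the tenant entitlements below are all hypothetical, and the copy to the tenant bucket itself is left out. It only shows the idea of picking the objects under the prefixes a tenant is allowed to read.

```python
def keys_to_replicate(all_keys, allowed_prefixes):
    """Return the keys from the central bucket that fall under a tenant's allowed prefixes."""
    prefixes = tuple(allowed_prefixes)  # str.startswith accepts a tuple of prefixes
    return sorted(key for key in all_keys if key.startswith(prefixes))


# Hypothetical contents of the central "golden" bucket.
central_keys = [
    "models/alpha/v1/weights.bin",
    "models/alpha/v2/weights.bin",
    "models/beta/v1/weights.bin",
    "models/gamma/v1/weights.bin",
]

# Hypothetical entitlements: this tenant may only read the alpha and gamma models.
tenant_prefixes = ["models/alpha/", "models/gamma/"]

selected = keys_to_replicate(central_keys, tenant_prefixes)
print(selected)
```

Because the models are uploaded once and never changed, a replication job built on a filter like this only has to copy new keys, not track modifications.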
We don't really have a benchmark for this specific project where we're going to have a quite varied mix of big and small files depending on the model. So I think maybe later on, if this is something interesting, we might do another episode with the details once we start to use it with more and more customers and get some more realistic data. But yeah, so far, just now, you cannot do cross-account mounts, but otherwise, S3 Files has been working really well for this project. So there is one more caveat that I think is worth mentioning, which is that if you rename and move files a lot, that can affect performance. And if you have a prefix with millions of objects, and you try to rename that prefix, like just renaming the top folder, for example, in the file system, that could take hours. So that's just something to be aware of. And I think there is also some kind of limitation, right, where AWS will warn you that that might happen and you need to accept the warning explicitly.
Eoin Shanaghy: Yeah. It will actually look at how many objects are in your bucket before you create a file system. And the documentation says if you've got something like 12 million objects, that means there's a potential for a four hour rename operation. And it'll give you an error unless you accept a warning saying, you know, it's okay, I'm willing to accept the risk and avoid it.
Luciano Mammino: So, this is technically something that might fail at deployment if you accidentally...
Eoin Shanaghy: Unless you add an explicit accept warning property to your configuration.
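The figures quoted above give a rough way to reason about rename times. This is a back-of-the-envelope sketch only: it takes the numbers as recalled in the episode (something like 12 million objects for a potential four hour rename) and assumes the time scales linearly with object count, which is an assumption and not something AWS states.

```python
QUOTED_OBJECTS = 12_000_000  # object count as recalled in the episode
QUOTED_HOURS = 4             # potential rename duration as recalled in the episode

# Implied processing rate, roughly 833 objects per second.
IMPLIED_OBJECTS_PER_SECOND = QUOTED_OBJECTS / (QUOTED_HOURS * 3600)


def estimate_rename_hours(n_objects, objects_per_second=IMPLIED_OBJECTS_PER_SECOND):
    """Estimate a worst-case prefix rename duration, assuming linear scaling."""
    return n_objects / objects_per_second / 3600


print(round(IMPLIED_OBJECTS_PER_SECOND))
print(estimate_rename_hours(1_000_000) * 60, "minutes for one million objects")
```

Even at one million objects under a prefix, this estimate puts a top-level folder rename in the tens of minutes, which is very different from the instantaneous rename applications expect from a file system.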
Luciano Mammino: Okay. So, hopefully that gives you an idea of some of the trade-offs and the limitations and missing features. So, you are a little bit more informed when you decide to use S3 files for your projects, but I think we need to talk about cost. So, what's the story there?
Eoin Shanaghy: OK, well, you've got the S3 pricing. You're always going to pay for that under the hood. Then on top of that, data stored in your cache is priced at around 30 cents per gigabyte or more, depending on the region. What's the story with Sao Paulo? It's like 57 cents compared to 30 cents in US East 1. I don't know. Brazilian listeners, please tell us how you feel about this and what you do about it. Reads from the cache are around $0.03 per gigabyte.
Again, that starts at that price. That's reads from a cache, right? $0.03 per gigabyte. And file writes are around $0.06 or more per gigabyte, 6 cents or 7 cents per gigabyte. Now, if you're familiar with EFS elastic throughput pricing, you might notice that those prices are the exact same. The main difference is that files that aren't in your cache are read from S3 with no additional cost. So that's the way to think about it.
There is an additional cost here, but compared to EFS, there's a huge saving for reads that don't hit your cache. So in practice this means you can save a lot of money compared to EFS, but there is a premium compared to just using S3. In exchange for that you get fast read and write performance, especially on cached small files. If all or most of your data is small and frequently accessed, it can end up all in the cache, so the costs can mount up and maybe it doesn't make sense there. On the other hand, if most data is larger or less frequently accessed, you might have the sweet spot for S3 files. One thing we didn't mention yet is that this will also work with all the different storage classes in S3, like Infrequent Access, and Glacier, and everything. So you can still get those cost savings in your S3 layer. And yeah, otherwise, there are those trade-offs you talked about, Luciano, particularly the big one, which is this 60-second write-back delay.
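The pricing described above can be turned into a small cost model. This is a sketch using the approximate us-east-1 figures quoted in the episode, assuming the storage price is per gigabyte per month; it covers only the S3 Files premium and leaves out the underlying S3 storage and request charges you always pay. The function name and the example workload are made up for illustration, so check current pricing for your region before relying on any number.

```python
def s3_files_monthly_premium(
    cached_gb,
    cache_read_gb,
    write_gb,
    storage_per_gb=0.30,  # cached data, assumed per GB-month, as quoted for us-east-1
    read_per_gb=0.03,     # reads served from the cache
    write_per_gb=0.06,    # file writes
):
    """Estimate the monthly S3 Files charges on top of normal S3 costs (USD)."""
    return (
        cached_gb * storage_per_gb
        + cache_read_gb * read_per_gb
        + write_gb * write_per_gb
    )


# Hypothetical workload: 10 GB held in the cache, 500 GB of cached reads, 50 GB written.
mostly_cached = s3_files_monthly_premium(cached_gb=10, cache_read_gb=500, write_gb=50)

# Same writes, but the reads are large files streamed from S3, so they add nothing here.
mostly_streamed = s3_files_monthly_premium(cached_gb=1, cache_read_gb=0, write_gb=50)

print(round(mostly_cached, 2), round(mostly_streamed, 2))
```

The gap between the two cases is the point made above: reads that bypass the cache carry no additional S3 Files charge, so the mix of cached and streamed data drives the bill.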
Luciano Mammino: OK, so let's try to wrap this episode up. I'm going to try to do a quick recap first and then give you our final take. So S3 is amazing. We all use it in all kinds of use cases. But the reality is that it's not a file system. So there are still use cases where applications are expected to have a file system. So you need to bridge the gap somehow. And S3 Files is a new way to do that. And it is pretty promising.
It basically lets you mount S3-backed storage into EC2, ECS, EKS, and even Lambda using EFS concepts that you might have seen already if you used EFS as a service. And this is not strange when you realize that under the hood, AWS is using EFS as a caching layer between reading from S3 directly and making that data available as a file system into EC2, ECS, EKS, or whatever. So that's why you can see a lot of EFS things.
And it might be very familiar if you have used EFS. I think that the interesting story there is that you can keep S3 as the source of truth, but give the application a more traditional system interface when they need it. And you can also leverage this caching layer for improving performance whenever you, for instance, are reading lots of small files. And also if you were using EFS before, there is a potential here to save money because you are not always reading from EFS, you are not always keeping all the data into EFS, which is generally where most of the cost would come from if you use EFS with lots of big files, for example.
So our benchmark suggests that S3 files can be quite interesting for workloads where you're doing lots of small file reads, and effectively, in those cases, if you would read directly from S3, that might impact performance. But again, if you're only reading small files and they all end up in the cache, maybe the cost is going to be quite high and comparable to just having the files directly in EFS. So just be aware of that.
I think it's still having a kind of a good mix of file sizes. It's probably the sweet spot when you need to use S3 files. Now, our take is that, with all of that being said, you still need to be aware that this is not magically turning an S3 bucket into a fully fledged file system. You should still understand what are the trade-offs, what are the limitations, and it is still kind of a middle ground. It is very practical, but just be aware and look carefully what your application is trying to do with the file system and make sure it's not trying to do things that are not supported.
And again, the other thing to be aware of is to be careful about mixed access patterns. So if you are using that bucket for S3 files, but also using that bucket directly, accessing it for reads and writes, there might be synchronization issues. So make sure you truly understand that synchronization model. Make sure you understand that 60-second write-back limitation we discussed before. And with all of that in the picture, make sure your design still makes sense.
So if you are building a complex architecture, just put everything in the picture before just saying that S3 Files doesn't work for you, because it might work if you are under the right circumstances. So now, as always, and especially now because this is such a new service, we have used it only for some experiments and in a project that is still very early. So these are just our early opinions and our early findings on this service.
So we'd be really curious to hear, have you used it? What did you find? And have you used something else that maybe you think is going to work better for you and why? Just let us know what your experiences are, what your opinion is, and if you think something is missing, because we are always learning from our listeners, and we're always eager to hear your stories. And that brings us to the end of this episode. But before saying goodbye, we have to thank our sponsor, fourTheorem, for powering yet another episode of AWS Bites Podcast. And fourTheorem can help you if you're trying to design reliable, cost-effective storage architecture on AWS, especially if you're using S3, EFS, and now S3 files, or if you're building all kinds of serverless workloads with Lambda, containers, and any other thing like that. So just check out fourtheorem.com to find out more about what we do and some of our case studies. Thank you very much, and we'll see you in the next episode.