Help us to make this transcription better! If you find an error, please
submit a PR with your corrections.
Luciano: Today we are diving into a problem that might be more common than we like to think among cloud practitioners: copying data between S3 buckets, or even S3-compatible storage services. This is something that can happen if you are migrating some workloads to AWS: you have been using an S3-compatible object storage, and at some point you decide to go fully on AWS, so it makes sense to move all the data to S3 as well. Or maybe the other way around: maybe you are escaping from AWS for whatever reason, or just escaping the object storage part. There are more and more S3-compatible alternative storage services, and some of them are becoming really competitive on pricing. So if you don't mind the extra complexity of managing workloads distributed across multiple cloud providers, this can be an effective strategy to save some costs on your cloud expenses. Or there might be yet another use case: maybe you are just copying data between two buckets that are both in AWS, but they happen to be in different accounts, and granting permissions across accounts can sometimes be challenging. If you're sticking to AWS, all the recommendations assume that you have one set of credentials you can use to read the data and copy it across accounts, and that's not always an easy situation to be in. So this is another problem you might have to deal with when copying data from one bucket to another across regions and accounts. Today we're going to talk about all these different use cases, and we will share a bit of a story that happened to us personally, and how we ended up building a small CLI tool that simplifies copying data between S3-compatible storage services. My name is Luciano, and as always, I'm joined by Eoin for another episode of the AWS Bites podcast. AWS Bites is sponsored by fourTheorem, but we'll tell you more about them later. So let's get into S3 and the S3-compatible ecosystem.
Eoin: Yeah, the whole ecosystem around cloud storage is growing really rapidly now. S3 has been around for a long time and is still dominant, but there are a lot of interesting alternatives, and some of them are really competitive on pricing, trying to grab some of the market share that S3 has and ride on its coattails. If you look at S3, you pay about $23 per month for a terabyte of storage on the standard tier. There's a decent enough free tier: you get five gigabytes, which might be enough, and 100 gigabytes of egress, which was recently increased by a significant amount, I think to combat some of this competitive pressure.
If we look at some of the alternatives out there, DigitalOcean has one, DigitalOcean Spaces Object Storage: that's $5 a month fixed price, but then you pay $20 per terabyte per month, and they give you a 250 gigabyte free tier. Then there's Cloudflare R2, which is one of the entrants that I think really caused AWS to rethink their pricing strategy. That's about $15 per terabyte per month, and it's interesting for its zero-egress-fees approach.
So it's interesting: it seems like the market leader is keen to make it difficult for people to do egress, while the new entrants are very keen to make that as cheap as possible. Similar to R2, you've got Backblaze B2 at $6 per terabyte per month, which seems like the cheapest option you can get right now. Another one is Wasabi, at $7 per terabyte per month. And then you have Linode, now part of Akamai, with an object storage offering at $20 per terabyte per month, but you get a terabyte of egress for free, which is pretty significant.
Now, there are other options. You don't necessarily have to go with another cloud provider for object storage; you can host it yourself. MinIO is a reasonably popular one for people who want self-hosted S3-compatible object storage. If you need to host it in a data center, you might say that's only for the brave, but obviously you might have your own existing storage that you've already invested in, or compliance requirements that mean you have to keep it in your data center. MinIO also has a managed cloud service, but this seems to be a bit of a premium offering, because you have to spend at least $96,000 per year. With that you get 400 terabytes of storage, which works out to about $20 per terabyte per month. There are a lot more options out there; if you do a web search, you'll be inundated. We had a similar case recently, and you know all about this, Luciano. So what's the backstory?
Luciano: Yeah, the backstory is that a few weeks ago we needed to move the entire content of an S3 bucket to another storage service, an S3-compatible storage managed by another cloud provider. Now, don't ask us why. We are big fans of AWS and S3, as you know, but sometimes business requirements get in the way, you end up in unexpected places, and you just need to solve the problem. I'm sure you can relate. And especially now that you know all about these other competitive offerings, you can see why businesses might decide to do something like this.
So yeah, this was the situation we were in. And we thought this was a simple problem, right? How hard can it be to copy data from S3 to something else that promises S3-compatible APIs? Seems like you can just do an S3 sync and call it a day, right? But of course, it's not that easy, and that's the reason why we are talking about it. I want to explain some of the requirements we had, so you can understand why we ended up with this specific solution.
So basically, we needed to copy all the objects from this bucket to another S3-compatible service. In fairness, it wasn't a huge amount of data; I think it was a couple of terabytes, maybe a bit more. But it was a lot of small objects, in the order of millions of very small objects. So the copy itself needed to be efficient, and we wanted to make it efficient in terms of memory too.
Ideally, we didn't want to buffer everything on an intermediate machine before copying it to the destination; we wanted to do the copy on the fly, so as you read the data from the source, you start writing it to the destination. There was also another, more operational requirement: there were applications actively using this data, and those applications also needed to transition to the new storage.
So the business decided it made sense to prioritize newer files, because these would be the ones with the higher probability of being used by the applications. So another requirement was that the copy process should take that into account and prioritize more recent objects over the older ones. And the other thing is that it should be possible to interrupt the copy process at any point and resume it later.
This includes failures: maybe the machine needs to be rebooted, or the copy process itself has a bug and just fails. We didn't want to restart from scratch, because that would be a huge waste of time and bandwidth. So we needed a way for the whole copy process to be interrupted at any time and resumed later. Again, how difficult could this be?
S3 sync seems to tick most of the boxes here. But when we started to look into it, there were some problems that we'll tell you more about later. So we ended up deciding to create our own little CLI utility that reads files from the original source bucket and copies them to the destination service. But before getting into the details of this solution, which, little spoiler, is called s3-migrate and is fully open source (we'll share the link in the show notes), I think we should talk a little bit more about our analysis of the existing solutions and why we couldn't use anything that was already available.
Eoin: Generally, we don't like to have to invent these tools ourselves. And you might think S3-compatible storage should just work with S3 tools like the AWS CLI and the AWS SDK. That's kind of what we thought too. But when we did a little bit more research, we realized that in this case it actually made sense for this client to create a new tool from scratch. If you just Google how to copy data between S3 buckets, you might end up on an AWS re:Post thread that suggests using either the CLI, that's the aws s3 sync command, which we use a lot, to be fair, or S3 Batch Operations, which are very useful if you've got a whole load of copies to do or a whole load of objects to manage in one batch.
These are all good solutions, but there are a couple of fundamental challenges with them. The first one is that they assume you're all in on AWS. They don't, naturally enough, cover the scenario where you might be using an S3-compatible storage either as the source or the destination. Normally when you do a copy operation on S3, it's managed by S3 itself: the data doesn't have to go through your client, so you can just do a CopyObject API call.
But that only works if the source and destination are on the same provider. Even if you are all in on AWS, if the two buckets live in two different accounts, you need to set up cross-account permissions, and that can add a lot of complexity, because a CopyObject operation is signed with a signature from a single IAM identity, and you can only have one principal there.
You can't have two principals. That one principal must be authorized to access both the source and the destination with the read and write permissions you need. So when you run the sync command, the AWS CLI operates with that one set of credentials, and that isn't going to work if one side is on S3 and the other, the destination or the source, is on Cloudflare, for example. We were looking for something that could operate with two different sets of credentials: one for reading from an arbitrary S3-compatible source, and one for writing to another arbitrary S3-compatible destination. And since we couldn't find anything out of the box, being the nerdy programmers who probably suffer a little bit from not-invented-here syndrome, we thought, well, how difficult can it be to write a little CLI tool that uses the SDK to do what we want? Luciano, you wrote that tool. So how does it work?
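To make the limitation concrete, here is a minimal sketch (bucket names and paths are made up) of a server-side copy: it is a single API call signed by one principal, so that principal needs read access to the source and write access to the destination, and both buckets have to live on the same provider.

```typescript
import { S3Client, CopyObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({}); // one set of credentials signs the whole operation

await s3.send(
  new CopyObjectCommand({
    CopySource: "my-source-bucket/path/to/object.json", // needs read permission here
    Bucket: "my-destination-bucket",                    // needs write permission here
    Key: "path/to/object.json",
  })
);
```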
Luciano: Yeah, let me try to explain how it is built. In a nutshell, it's called s3-migrate, and it tries to do something similar to aws s3 sync, but it allows you to provide two separate sets of credentials. Conceptually, that's probably the main difference: you don't have to have one single set of credentials, you can provide two, one for the source and one for the destination.
The tool itself is written in Node.js, specifically in TypeScript. It uses Commander.js for CLI argument parsing, SQLite for state storage (we'll get into the details of that in a second, because it might sound weird right now), and of course the AWS SDK for JavaScript v3 to interact with S3-compatible endpoints. By the way, fun fact: if you look at most of these other providers, they all tell you to just use the AWS S3 SDK to interact with their APIs.
That's actually a good sign that most providers are trying to be strictly compatible with those APIs, to the point that it's not even worth it for them to create their own clients, because you can just use the existing SDKs. That made it a little bit easier for us, because we didn't need to learn a new set of libraries or figure out whether, to make this tool work with multiple providers, we would need some kind of abstraction layer where you plug in different SDKs.
Thankfully, everything seems to work just fine with the AWS SDK for JavaScript. Now, you might be asking the usual question here: why didn't you use Rust or Go? Of course, this is something we could debate for hours, and we could have a flame war of sorts. But long story short, I would have personally loved to write it in Rust, because I'm a big fan of Rust and I'm always looking for excuses to use it more.
But honestly, given that we have tons of experience with Node.js and TypeScript, and this seems like a use case where there's lots of existing tooling to support you in Node.js and TypeScript, it was just much easier and faster to deliver the solution in TypeScript. The other thing is that, from a performance perspective, it's true that Rust could maybe have made it a little bit faster and a little more frugal in terms of memory.
But the real bottleneck here is network speed: we are doing a progressive copy of the data, so networking is really the limiting factor. Even if we had used Rust with multi-threaded async I/O, multi-threading might have given us a way to parallelize the copy a bit more, but there are other strategies we put in place, and we'll talk about those later.
So yeah, that's why we didn't use Go or Rust. But maybe it's an exercise for somebody, if you want to try to do something similar in one of those languages. As I said, the tool is fully open source and published on npm, so you can use it today. And by using something like npx, you don't even need to install it: you can try it with one command and see if it works for you.
Now, we mentioned that there are two sets of credentials. It works in a similar way to the AWS CLI or the AWS SDK, meaning you can use the usual environment variables like AWS_ACCESS_KEY_ID, the endpoint variable, and so on. Those act as the default layer, but you can also override them per side, for example with SOURCE_AWS_ACCESS_KEY_ID or a source-specific endpoint variable.
And similarly, you can override the destination, for instance with DESTINATION_AWS_ACCESS_KEY_ID or a destination-specific endpoint variable. The tool also reads from .env files, so if you prefer to put all this information in a .env file because it makes your life easier, the tool will load it automatically if one exists in the current working directory. Now, the way it's a little bit different from sync is that there are actually two phases.
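As a rough illustration of that layering (the exact variable names and precedence are the tool's own, so treat these as assumptions and check the s3-migrate README), two independent S3 clients could be resolved like this:

```typescript
import "dotenv/config"; // picks up a .env file from the current working directory
import { S3Client } from "@aws-sdk/client-s3";

// Prefer SOURCE_*/DESTINATION_* overrides, fall back to the plain AWS_* variables
function clientFor(prefix: "SOURCE" | "DESTINATION"): S3Client {
  const env = (name: string) => process.env[`${prefix}_${name}`] ?? process.env[name];

  return new S3Client({
    region: env("AWS_REGION") ?? "us-east-1", // many S3-compatible providers ignore the region
    endpoint: env("AWS_ENDPOINT_URL"),        // e.g. the provider-specific endpoint URL
    credentials: {
      accessKeyId: env("AWS_ACCESS_KEY_ID")!,
      secretAccessKey: env("AWS_SECRET_ACCESS_KEY")!,
    },
  });
}

const sourceClient = clientFor("SOURCE");
const destinationClient = clientFor("DESTINATION");
```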
You don't just run one command and it starts the copy; you actually need to run two different commands. The first command is called catalog, and that's what we call the catalog phase: it does a list operation on the source bucket and stores all the objects in a local SQLite database. This is effectively a mini state file, if you want.
This is what we decided to do to get that resumability feature. As we copy the files, we know exactly how many files there are to copy, so we can keep track of progress and mark which ones have been copied. And because we also store the metadata of all the objects as we discover them through the list operation, that's also what we use to do the sorting.
So if you want to prioritize files that are bigger, smaller, or newer, you can do that: behind the scenes, the tool runs a different SQL query with different sorting based on your parameters. That's the reason for this intermediate step: it makes it a bit more flexible to know how many objects there are, to track the current progress as you copy, and to support prioritization of different objects and resumability.
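A simplified sketch of what that catalog phase boils down to (the actual schema and SQLite driver in s3-migrate may differ; better-sqlite3 is used here just for illustration):

```typescript
import Database from "better-sqlite3";
import { S3Client, paginateListObjectsV2 } from "@aws-sdk/client-s3";

const db = new Database("state.db");
db.exec(`CREATE TABLE IF NOT EXISTS objects (
  key TEXT PRIMARY KEY,
  size INTEGER,
  last_modified TEXT,
  copied INTEGER DEFAULT 0
)`);
const insert = db.prepare(
  "INSERT OR IGNORE INTO objects (key, size, last_modified) VALUES (?, ?, ?)"
);

const source = new S3Client({}); // source credentials/endpoint
const pages = paginateListObjectsV2({ client: source }, { Bucket: "my-source-bucket" });

// Page through the bucket listing and store one row per object
for await (const page of pages) {
  for (const obj of page.Contents ?? []) {
    insert.run(obj.Key, obj.Size, obj.LastModified?.toISOString());
  }
}

// Prioritising newer objects later is then just a matter of sorting the query:
// SELECT key FROM objects WHERE copied = 0 ORDER BY last_modified DESC;
```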
Once you have done the catalog phase, you end up with this state file, which is just a SQLite database. You can open it with any SQLite-compatible UI or CLI to see what's inside. With that, you can start the copy phase: there's another command, s3-migrate copy, where you specify the source bucket, the destination bucket, and the state file, and of course, through the environment, you provide all your credentials.
Effectively, this command looks at the state file, figures out what still needs to be copied, and starts copying it. Being a CLI utility, one of the challenges is that it needs to run as a process somewhere, on some host system or your own laptop, and you need to control that process and make sure it keeps running. So you'll probably have some remote machine somewhere where you install the tool, provide the credentials, create the catalog, run the copy command, and just monitor that it's progressing without any issues.
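The control flow of the copy phase could look something like this sketch, where copyObject stands in for whatever actually moves one object (see the streaming example further down) and the schema matches the hypothetical catalog above:

```typescript
import Database from "better-sqlite3";

async function copyPending(copyObject: (key: string) => Promise<void>): Promise<void> {
  const db = new Database("state.db");
  const pending = db
    .prepare("SELECT key FROM objects WHERE copied = 0 ORDER BY last_modified DESC")
    .all() as Array<{ key: string }>;
  const markCopied = db.prepare("UPDATE objects SET copied = 1 WHERE key = ?");

  for (const { key } of pending) {
    await copyObject(key);
    markCopied.run(key); // persist progress so an interrupted run can resume where it left off
  }
}
```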
Eoin: Okay, it sounds like there's a lot of capability here. I guess the thing about building these tools is that it's achievable enough to get version one up and running, but even after you run it once or twice, you start to think about how you can make it faster, especially when trying to handle different types of data sets. You mentioned that in this case it was a lot of small files on S3; you might also have a lot of large files, and you're trying to optimize for I/O, parallelism, request throttling, and a lot of that kind of stuff. So what kind of performance optimizations have you thought about so far?
Luciano: Yeah, that's a very good question. I'll start with a caveat: this is still a very early project. If you look at the repo, it clearly states that it's experimental, so don't trust it too much, or rather, trust but verify. I'm sure there are still loads of opportunities to improve it, including performance. With that said, here's what we've done so far to give you options for improving both data transfer performance and the amount of memory consumed on the host.
If you want to be very memory efficient, for example, there are options for that as well. One thing worth mentioning is that we use Node.js streams to copy data, which is another thing I'm a big fan of, probably no surprise to people who know me. The idea is that when you run a GetObject command with the AWS S3 SDK, the body you receive in the response is a Node.js stream, so you are not eagerly consuming that data.
You can almost think of it as having a pointer to where the data is, and then you fetch it as you need it. Node.js streams also give you a nice API for combining streams together: you can have one stream for reading and another for writing, pipe them together, and let the data flow from one to the other. This is very useful because when you do a PutObject operation, the body you write can also be a stream.
In Node.js terms, you have a readable stream for the get operation and a writable stream for the put operation, so you can easily combine them and create a pipe that reads from one place and writes to another. Node.js takes care of most of the complexity there; for instance, it even handles backpressure. If you are much faster at reading than at writing, what would generally happen is that you exhaust all your memory reading, because you can't flush the data to the destination fast enough.
Node.js has a mechanism called backpressure handling, where it figures out when you have too much data accumulated, stops reading, gives the backend system time to receive all the writes, and then resumes reading. All of that happens automatically when you use streams. So that's kind of an easy optimization to have, because it's all built into Node.js and we just took advantage of it.
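Here's a rough sketch of what a single streaming copy can look like (not the tool's actual code): the GetObject body is a readable stream, and the Upload helper from @aws-sdk/lib-storage consumes it with backpressure handled for you.

```typescript
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";
import { Upload } from "@aws-sdk/lib-storage";
import type { Readable } from "node:stream";

async function copyObject(
  source: S3Client,
  destination: S3Client,
  sourceBucket: string,
  destinationBucket: string,
  key: string
): Promise<void> {
  // The Body on the response is a readable stream in Node.js: nothing is buffered eagerly
  const { Body, ContentType } = await source.send(
    new GetObjectCommand({ Bucket: sourceBucket, Key: key })
  );

  // Upload pulls from the readable stream as it writes, so backpressure is respected
  const upload = new Upload({
    client: destination,
    params: { Bucket: destinationBucket, Key: key, Body: Body as Readable, ContentType },
  });
  await upload.done();
}
```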
There is some additional complexity, if we want to get into the nitty-gritty details. When you use streams, you are effectively reading and writing in chunks: you get blobs of bytes, and they generally have a fixed size. The S3 API forces you to use a consistent chunk size when you're writing, and there is a minimum number of bytes that each chunk needs to have.
So we had to add something in between, a transform stream, that buffers enough data before writing. But that's about as much complexity as we added; the streams take care of everything else. And this is actually an interesting part, because it's another place where you can optimize: you can decide to increase the chunk size, which means you'll accumulate more data in memory on the host system, because you are creating bigger windows of data that are ready to be flushed.
The bigger they are, the more memory you consume on the host, but at the same time you're making fewer API calls to the storage service you're writing to, which can be convenient as well, because every API call has an overhead. So generally, the suggestion is to find a balance: if you keep the chunk size too small, you're doing too many writes, and there's overhead on the operating system and everything else.
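In s3-migrate this buffering is done with a transform stream; if you go the Upload route from the earlier sketch, the equivalent knobs are partSize and queueSize (names per @aws-sdk/lib-storage), as in this small sketch:

```typescript
import { S3Client } from "@aws-sdk/client-s3";
import { Upload } from "@aws-sdk/lib-storage";
import type { Readable } from "node:stream";

function uploadWithTuning(destination: S3Client, bucket: string, key: string, body: Readable) {
  return new Upload({
    client: destination,
    params: { Bucket: bucket, Key: key, Body: body },
    partSize: 16 * 1024 * 1024, // bigger parts: more memory, fewer write API calls (S3 needs at least 5 MiB per part, except the last)
    queueSize: 4,               // parts uploaded concurrently for a single object
  }).done();
}
```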
But if you find a good chunk size, you can probably optimize the write speed a bit more as well. Now, another interesting optimization is concurrency. With Node.js, you have a language that makes concurrency relatively easy; just be aware that this is still single-threaded concurrency. In this case, I think it works really well, because you're waiting on I/O most of the time.
While you're waiting, you can have multiple copy operations interleaved with each other, and they all make progress together. But of course, this only works up to a certain point. There is a parameter you can specify, so you can try to figure out the maximum concurrency you can use, up to the point where you don't see any speed improvement anymore, because there is so much interleaving that you're wasting more time jumping from one operation to another than actually copying data and making progress.
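A minimal sketch of that kind of bounded, single-process concurrency (not the tool's actual implementation): a fixed number of workers pulling keys off a shared queue, with the event loop interleaving the I/O-bound copies.

```typescript
async function copyAll(
  keys: string[],
  copyObject: (key: string) => Promise<void>,
  concurrency = 8 // the tunable knob: past a certain value, more workers stop helping
): Promise<void> {
  const queue = [...keys];
  const workers = Array.from({ length: concurrency }, async () => {
    // each worker keeps taking the next key until the queue is drained
    for (let key = queue.shift(); key !== undefined; key = queue.shift()) {
      await copyObject(key);
    }
  });
  await Promise.all(workers);
}
```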
So at some point it might be beneficial to use proper parallelism and spin up multiple processes to do the copy. This is something the tool supports, but it can be a little bit tricky. People do something similar with aws s3 sync as well. The way it works here is that you can create a catalog only for a certain prefix in your source S3 bucket, so you end up with as many catalogs as prefixes you use.
Then you can take those catalogs, even on different machines if you want, and run the copy operation only for those subsets of the data. So you can parallelize the copy across multiple machines, which gives you more parallelism and probably lets you use more bandwidth as well, because bandwidth can become a bottleneck at some point too. The only issue is that it's a little bit more complex to set up, and it's always a bit challenging to figure out a set of prefixes that spreads the data evenly across the different parts of your parallelized setup. So it is an option, it is supported, but depending on the shape of your data, it might be easier or harder to adopt.
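As a very rough sketch of that coarse-grained parallelism (everything here is hypothetical: the prefixes, the copy-worker.js script, and its arguments are made up, and the real tool's commands may look quite different), you could fan out one worker process per prefix, each with its own state file:

```typescript
import { spawn } from "node:child_process";

// Pick prefixes that split the data roughly evenly; this is the hard part in practice
const prefixes = ["2023/", "2024/", "2025/"];

for (const prefix of prefixes) {
  const stateFile = `state-${prefix.replaceAll("/", "")}.db`;
  // copy-worker.js is a hypothetical wrapper that catalogs and copies a single prefix
  const child = spawn("node", ["copy-worker.js", "--prefix", prefix, "--state", stateFile], {
    stdio: "inherit",
    env: process.env, // each worker could also be given its own credentials here
  });
  child.on("exit", (code) => console.log(`worker for ${prefix} exited with code ${code}`));
}
```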
Eoin: You mentioned that it's still in the early stages. So what would you like to see in terms of a roadmap? What doesn't it do yet? Do you have any call to action for the audience to get contributing on this?
Luciano: Yeah, I think there are two sets of things worth discussing. One is things that are not supported by design; the other is things that are not supported just because we didn't have the time or the immediate need for them. The things that are not supported by design are things like copying attributes, tags, ACL rules, anything that falls outside the data of the objects themselves.
In S3, you have so many options you can configure: storage classes, lifecycles, settings at the object level and at the bucket level. This tool intentionally doesn't try to replicate any of that. One reason is that it wasn't our immediate need. The other reason is that, if you analyze the problem pragmatically, different S3-compatible storage services have different levels of support for those features.
So if you try to be comprehensive and support all of these things, you easily end up with a compatibility matrix of what is and isn't supported, and it quickly becomes obvious that either you create a system that is hyper-configurable and let users figure out which configuration works for them, or it becomes effectively impossible to maintain that matrix of which storage supports which feature and automatically leverage whatever is supported.
So that's something that, by design, we didn't even try to implement. Similarly, encryption is another one. If you have encrypted objects (I'm not actually sure, I haven't done a lot of testing), we don't provide any options for that at this stage. So that is something that might get in the way if you're working with encrypted data in buckets. I guess it also depends on the encryption mechanism you're using.
The one thing I would actually like to see, but we just didn't have the time to implement ourselves, is some kind of support for multipart uploads. This tool worked really well for us because all the files were relatively small, but if you have some kind of media-intensive application with lots of big images or even videos, where files can span multiple gigabytes, then maybe this won't be the most efficient way to copy them. You'd probably want to do some kind of multipart upload to parallelize even the individual objects as much as possible. So if anyone is interested, maybe you're using this tool and finding it useful: it's open source, so feel free to send a PR. This is one feature we would love to see.
Eoin: Nice. It would be great to get more development on this, because if we look at some of the alternative solutions, there are a couple of open source ones out there, but a lot of them seem to have been written by people who needed to solve a problem once and then not maintained so well. There's one from AWS Labs, which is relatively new and written in Go, but it uses S3 Batch Operations, so it's AWS-only and doesn't really solve our problem.
There's an older one called s3s3mirror on GitHub, a Java-based tool that allows you to mirror buckets from one to the other or from a local file system. And then there's one called NoxCopy, which was written in Ruby quite a while ago but seems to be quite deprecated. Looking beyond those, there's rclone as well, a tool that allows you to copy data between lots of different sources.
So that could be FTP, Dropbox, Google Drive, and it also includes S3. It seems quite powerful, but we haven't tried it yet. Then there's a paid cloud service called Flexify, which is actually what DigitalOcean recommends for migrations. We haven't tried this either, but I thought it was worth mentioning in case you want to just throw money at the problem. I guess it would be interesting to benchmark these.
It depends on your use case, of course, but I wonder about tools like Mountpoint for S3, which we covered in a previous episode: if you just mount two different S3 buckets with different credentials on your file system and then do an rsync between them, what would the performance be like? I'm always interested in, but skeptical about, solutions that try to map object storage onto a file system abstraction. But Mountpoint does work well for some cases, and the same goes for the FUSE-based, what do you call it, user-space file systems for S3 as well. So those are options I'm interested to get other people's take on.
Luciano: Yeah, and this seems like a common enough problem that I'm surprised there isn't a lot more literature or a lot more solutions out there. I think it's going to become an even more common problem with all these alternative services appearing everywhere. So I'm curious to hear whether other people have had this kind of use case and what kind of solutions they came up with.
Eoin: Will Cloudflare and all these other vendors start adding tooling to do one-click import from an S3 bucket, do you think?
Luciano: I wouldn't be too surprised, to be honest, if they do, because it's in their best interest. It's almost like all these newsletter tools that compete with each other and all have an import from Mailchimp, right? Because it makes sense for them to make it easier for new customers.
Eoin: Yes. They always work one way only, though. They never allow you to export. So I think that's all you get for today.
Luciano: Again, we are really curious to hear from you. Have you dealt with this kind of problem? Don't be shy, let us know, because we are always eager to learn from you and share experiences, not just in one direction. So please share some of your experience with us. But before we wrap up, let's hear from our sponsor. We promised to tell you a little bit more about fourTheorem, and thank you, fourTheorem, for supporting yet another episode of AWS Bites.
So, migrating data is hard, but optimizing your cloud setup doesn't have to be. That's where our friends at fourTheorem come in. fourTheorem is an AWS Advanced Consulting Partner specialized in serverless-first solutions that slash costs, scale seamlessly, and modernize your cloud applications. Whether you are streamlining infrastructure, accelerating development, or turning your tech team into a profit powerhouse, fourTheorem is there to help you maximize your AWS investment. So check out fourTheorem at fourtheorem.com, and I'm sure you'll find a trusted partner for your next AWS project. Thank you, everyone, and we'll see you in the next episode.