Help us to make this transcription better! If you find an error, please
submit a PR with your corrections.
Luciano: Migrating monoliths to the cloud can be scary, expensive, and time-consuming, but you don't have to massively re-engineer your application to do that. Today, we are going to present a case study and a potential strategy to move a monolith to AWS with minimal drama. We will discuss how a typical on-premise, three-tier web application can be migrated to AWS and made scalable and resilient. We will discuss some of the steps that you can take to make that happen. And finally, we will present some of the new challenges, but also opportunities that come once you shift an application to the cloud. My name is Luciano, and today I'm joined by Eoin, and this is AWS Bites podcast. Luciano, this is based on an article you wrote that's now available on InfoQ.
Eoin: It's called a recipe to migrate and scale monoliths in the cloud. We'll put a link to that article in the show notes. I think it's really good because it's a very clear process on how you think about this kind of migration. There's also a really good case study that gives a context for all the steps that follow, and there's a really good checklist in it. Maybe we can start with that case study. You talk about a fictitious legal company. What's the story with that? What's the context? Yeah, it's a fictitious company, but that kind of company and that kind of project reflect the reality of many, many projects that I've seen in my career.
Luciano: And even projects that we are seeing every day at fourTheorem. So I think it represents very well a good class of solutions that are still out there and that can benefit from moving to the cloud. In this particular case, just to set the stage, we can imagine that we have this startup that operates in the legal space. They've built a CMS for legal practices. So you can imagine that they offer this product to legal practices, and what they can do with it is that every practice can upload their legal documents.
And there is like a search index mechanism that happens behind the scenes, and then people logging in to the system can use keywords to find documents that have been uploaded before. So it's effectively a way to make legal documents easily searchable within the context of a legal practice. And we can assume that the current solution that exists today, let's call it the MVP for this startup, is something built on premise in a very standard fashion. It's a three-tier web application where you have a front end, a back end, and a database. And we can imagine just for reference that, I don't know, the technology can be Python. So maybe they're using Django as a web framework and the database can be a relational database, let's say Postgres, just to mention one technology as a reference. So that's the system that we are operating in right now. Yep, it sounds very familiar.
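The keyword-search mechanism described here could be sketched as a naive matcher over uploaded documents. This is purely illustrative (all names are made up, and a real system would use a proper search index), but it shows the kind of behavior the monolith provides:

```python
# Hypothetical sketch of the monolith's search feature: rank each uploaded
# document by how many of the query keywords appear in its text.
def search_documents(documents, query):
    """documents: dict of filename -> text. Returns filenames ranked by
    number of matching keywords, best match first."""
    keywords = [word.lower() for word in query.split()]
    scores = {}
    for name, text in documents.items():
        body = text.lower()
        hits = sum(1 for kw in keywords if kw in body)
        if hits:
            scores[name] = hits
    return sorted(scores, key=scores.get, reverse=True)
```

In production this naive scan is exactly what stops scaling on a single machine, which is part of the motivation for the migration discussed next.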
Eoin: And I suppose that brings the question, so why do they have a problem? Why do they need to migrate? And what's the background story there? What are they actually trying to solve by migrating to the cloud?
Luciano: So right now, the whole application is hosted on premise on one machine. So everything is running in this like one monolithic server. And that has been working fine for the MVP. But of course, we know that it's not something that scales long term. And right now, this company is starting to grow a little bit because they released this MVP. They are working with sales to get new customers. And it turns out that they have been very lucky. They got quite a big legal practice that wants to try out this platform.
So what's happening is that suddenly they have a bunch of new users using the system all at the same time. And that's creating a lot of additional stress to the servers. So there is too much load on one machine. The whole application is sometimes slow and unresponsive, sometimes even unavailable. And the other thing is that this is a system that stores files. So right now, everything is monolithic in one server.
So there is literally a bunch of files that are being accumulated in the file system. So it has been happening a few times that the file system was totally full and somebody had to manually allocate more space, more disks. And while that was happening, the whole application was unresponsive and it was effectively a downtime and an incident that needed to be managed by somebody. So the customers were a little bit disappointed with all of that.
And similarly, you can imagine there is stress also in the database because if everything is in the same machine, everything is competing for resources. So as soon as something is stressed, everything else doesn't have the necessary resources to work optimally. So all of that is basically one single big point of failure. If anything fails for any reason, the whole application is failing, going down, being unreachable and customers cannot use the application. They cannot search for their file and they cannot ultimately do their job. So the prompt that we got from this scenario is, okay, but if I move everything into the cloud, everything is going to be better. But at the same time, the feeling is that if you move to the cloud, it's a very big and scary investment that might take a lot of money and time. So it's like, how do we find a trade-off there that makes everyone happy? Yeah, I guess that's an important question. We covered this before in one of the previous episodes on how do you migrate to the cloud.
Eoin: There's a lot of different options. It can be very overwhelming. So I suppose you have to bear in mind what are the skills, how many people, how much cloud awareness do you have, as well as real world problems. What are they going to solve? How much time do they have? And ultimately, really, they've got to get this out there in time for customers to achieve success with it, not impact existing customers and scale with their growth. So what do you suggest? Yeah, so my suggestion would be that trying to reach the best outcome with the minimum amount of investment in terms of time, money, effort.
Luciano: So an idea could be, can we find an architecture that is not dramatically different from the current one, but that at the same time allows us to move everything to the cloud and make it more resilient and scalable, which are the main problems that we are facing right now. The system doesn't scale, and if there is any crash, everything burns, basically. So that's kind of the line that I would like to keep here, so that the challenge is literally, how do we make that happen? And at the same time, we are working with a small team, so how do we minimize also the amount of information overload on that team that will need to actually do the work and learn all the new concepts that come with the cloud? Yeah, I guess it's a difficult thing to resist the temptation to adopt all of the new tools and toys you get with the cloud and to try and simplify.
Eoin: So what do you think is a reasonable approach that solves all the problems, gets some of the advantages of the cloud, achieves the business goals of scaling for the new customers, but doesn't overwhelm the team with a whole lot of new learnings and distractions? So the architecture that I had in mind is something actually quite common when it comes to cloud architectures, especially if we look at the very beginning of the cloud, it's like more traditional three-tier application cloud version, if you want.
Luciano: And basically, the idea is that you have an application load balancer, which is kind of the entry point to the entire architecture, so it's where we receive the requests from the users. Then that load balancer is going to forward all these requests, not to just one machine, but at this point you can use as many machines as you need, so you have this kind of pool of EC2 instances, and they all run the exact same copy of the application code.
So it's literally just taking the monolith and multiplying it n times, where n is going to be a function of the traffic that you get and how many resources you need to run the application. And of course, another big problem that we mentioned is files, and those files cannot live in a file system. Well, I suppose they could, but it's more ideal once you are in the cloud to use something like S3, which has been literally built with that goal of making it easier to host files in a safe and distributed way.
So definitely, we should try to leverage S3, and if you have used S3, you know it's not dramatically complicated. It's a reasonable change to make in the architecture. And we can also discuss some tricks to make it easier at the beginning if you don't have time to kind of adopt the SDK and do a lot of code changes. And then, for instance, another big problem that comes with having multiple instances is that you cannot have a local state anymore. You need to, for instance, if you have users that log in, you need to manage their sessions, and this session cannot live in one machine.
And again, it could if you use sticky sessions, but that's not the best way of doing it. So the best thing to do is to use a session storage. Maybe something like Redis can be used to host all that data. So connecting all the instances to Redis is another part of the architecture. And finally, the database. We mentioned that the current solution is essentially a database process running inline on the same machine.
What we want to do is to ideally remove all of that from inside the machine and have it independent and scalable and resilient on its own. And there is a perfect service for that in AWS, which is RDS, that being a managed service allows you to get a Postgres database running and make it distributed. You can have read replicas. You can have all the features you need, just a click away from you. You don't need to manually write the scripts to provision all that stuff yourself. Okay, I like that. I mean, it's, I guess, a sane approach to this problem, right? It's not overwhelming the team with new things like serverless architecture and containers.
Eoin: And so it keeps a lot of the skills in their comfort zone, right? And it minimizes the amount of new cloud technologies they have to adopt and sets them up pretty well for the future. So I hope a lot of people would kind of copy this model, especially when you're working with a team that's constrained in terms of the amount of time they have to adopt new skills. And no, this is a good first step, I think. So maybe we can talk about the steps to actually make this happen. So now we've got the target architecture in mind. We need a roadmap, right, to get there. So where do we start? What are the first things we need to start preparing?
Luciano: Yeah, I will definitely start by, of course, creating an AWS account. So let's create the target environment. And one thing that I will try to do straight away, and this is probably a little bit of a burden to the team if it's something that they haven't done before, but I consider it almost necessary if you want to be successful in the cloud, is to start to adopt infrastructure as code. So everything you do in the cloud is not something you do manually by going to the web console and clicking around.
Of course, you can do that while you're learning, but when you're building production-ready solutions, you should use infrastructure as code. So this is a step where the team needs to maybe invest a little bit of time and play around with it and learn the basic concepts. And of course, they can select whatever tool feels more natural to them. We have another episode dedicated to that, but CDK, CloudFormation, Terraform, Pulumi, there are many tools out there.
Whatever feels more natural, they are all good enough for the goal that we want to achieve. And then finally, the other thing we need to do to set the stage is to create a network where the whole application will be deployed. So that can also be a little bit of a learning curve if the team doesn't have experience with building virtual networks in the cloud. And in particular, with AWS, there are some concepts that you need to learn. What is a VPC? What are availability zones? What are public and private subnets?
And how to configure all of that. If you use CDK, maybe you can get some defaults, but we spoke in another episode how that can also be dangerous because you might end up not really understanding what's going on in the architecture and maybe provisioning things that you don't really need and end up with an expensive setup, like NAT gateways and all this stuff. So yeah, this is probably another point where the team needs to spend a little bit of time, learn at least the basics, do a few experiments, and once they are comfortable, they can start to use that learning from the infrastructure as code to provision the VPC. And at that point, we have an AWS account, minimal understanding of infrastructure as code, and a virtual private network that we can use to host the entire application.
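To make the infrastructure-as-code idea concrete, a sketch of the target architecture in CDK (Python) might look roughly like this. Everything here is an illustrative assumption, not a production template: the construct names, instance type, capacities, and database version are placeholders, and the exact CDK API surface may differ slightly between versions:

```python
# Rough CDK sketch (assumptions throughout): VPC across two AZs, a fleet of
# identical EC2 instances behind an ALB, and a managed Postgres on RDS.
from aws_cdk import (
    Stack,
    aws_ec2 as ec2,
    aws_autoscaling as autoscaling,
    aws_elasticloadbalancingv2 as elbv2,
    aws_rds as rds,
)
from constructs import Construct


class MonolithStack(Stack):
    def __init__(self, scope: Construct, id: str, **kwargs):
        super().__init__(scope, id, **kwargs)

        # The virtual network: public and private subnets in two AZs.
        vpc = ec2.Vpc(self, "AppVpc", max_azs=2)

        # Pool of identical instances running the monolith's AMI.
        fleet = autoscaling.AutoScalingGroup(
            self, "AppFleet",
            vpc=vpc,
            instance_type=ec2.InstanceType("t3.medium"),
            machine_image=ec2.MachineImage.latest_amazon_linux2(),
            min_capacity=2,
            max_capacity=6,
        )

        # Entry point: the application load balancer.
        alb = elbv2.ApplicationLoadBalancer(
            self, "AppAlb", vpc=vpc, internet_facing=True
        )
        listener = alb.add_listener("Http", port=80)
        listener.add_targets("AppTargets", port=8000, targets=[fleet])

        # Managed Postgres, multi-AZ for resilience.
        rds.DatabaseInstance(
            self, "AppDb",
            engine=rds.DatabaseInstanceEngine.postgres(
                version=rds.PostgresEngineVersion.VER_15
            ),
            vpc=vpc,
            multi_az=True,
        )
```

The point is less the specific constructs than the shape: the whole environment, network included, is declared in code that the team can review, version, and redeploy.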
Eoin: I think those points you just made give a good outline of why you don't want to burden a team with too much when you're migrating to the cloud, because even with this simplified, sane approach, you already have an AWS account and possibly AWS organization fundamentals to understand. You have infrastructure as code to understand. And the basics of AWS networking, like what's a private subnet? What's a public subnet? What's an internet gateway and a NAT gateway?
What are the pricing impacts and security impacts of all of these components? So there's enough there in terms of good, solid AWS foundations to understand. I think it's probably enough for the first dive into AWS. So with those fundamentals in place, I think with migrations in general, data is key and data retention and avoiding data loss is important. So data is probably a good topic for the next phase of this journey.
What do we have to think about? You mentioned file storage. I think moving from an on-premise disk or an on-premise NAS to S3 is one of the lowest overhead parts of this and one of the biggest benefits because you can suddenly stop worrying about disks filling up. And it's one of the biggest wins, I think. So is that where you'd start with the file migration? Probably, yes. I think in general, as you said, if you can show the customer that all the data is already in the new environment and all the data gets replicated automatically or as automatic as possible to the new environment, that gives a lot of confidence boost.
Luciano: Because as you said, the data is king and that's the main concern. Like maybe I'm not too concerned about being offline for a few hours while I migrate, but I'm definitely going to be concerned if I'm going to lose some data. So if we can reassure a customer, a company that that's not going to be the case, that there are ways to actually keep the data in sync as we move through two different systems, I think that that's literally a big win and we should aim to that.
So I agree that this is a good next step to address to build more and more confidence that we are going in the right direction. So yeah, talking about S3, the easiest thing that I could think of is, okay, let's start by creating an S3 bucket and let's make sure that every new file that gets created in the old system is also created in S3. So that might require code changes, but there are tricks there. I mentioned that before. For instance, you can use a virtual file system like s3fs (a FUSE file system for S3) and things like that to keep the code as unchanged as possible, because the code is good at reading and writing files from the file system.
With a virtual file system, you will only have like a different virtual folder that you use to read and write and that virtual file system will take the burden of actually using the AWS APIs to actually read and write into S3. I don't necessarily recommend that because there are problems that come with that solution, but at the same time, if you don't want to change the code too much because you don't have the time, it's something else you need to learn, it's new dependencies that you need to bring into the application.
And maybe at that moment in time, it's not easy to do that. That can be a solution right now to just start to see the data popping into the S3 bucket. Then another thing you can do once you have new data being written also to S3 is to just go into the current machine, the current monolith, and do an S3 sync from the CLI and that will copy all the existing files over into the bucket as well. So at that point, you have all the new data coming in, but you also copied all the historic data. So at that point, you have S3 perfectly in sync.
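The two tricks just mentioned might look something like this on the monolith's shell. The bucket name and paths are placeholders for illustration:

```shell
# Mount the bucket as a virtual folder with s3fs-fuse, so the unchanged
# application keeps reading and writing what looks like local files:
s3fs legal-docs-bucket /mnt/legal-docs -o iam_role=auto

# One-off backfill of the historic files from the monolith's disk:
aws s3 sync /var/app/uploads s3://legal-docs-bucket/uploads

# Re-run the sync again just before cutover; it only copies
# files that are new or changed since the last run.
```

The incremental nature of `aws s3 sync` is what makes the "keep copying until you're ready to switch" approach cheap to repeat.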
The next problem is the database data. And that's also a big one because if you have a relational database, how do you keep it in sync with another copy of the relational database, right? Then it's going to be running in AWS. We mentioned you can use RDS. So the next thing you should do is just go to RDS and create a cluster for your Postgres. And then how do you actually bring the data from the current system to this new RDS cluster?
And there is actually a service dedicated to that. It's called Database Migration Service. And besides just helping you to migrate your schema and copy the data, it can also work against the original system, so the on-premise system, and make sure that every time there is new data in that on-premise database, this data is also replicated to the RDS database. So this way, again, we are creating a system that allows us not just to copy the data once, but also to keep copying new data as it arrives, which gives us confidence that we can take all the time that is needed for this migration without having to put the system offline. So the old system can still work, new data will be replicated, and we can switch over to the new system whenever we feel ready.
Eoin: That sounds like a good pragmatic set of decisions there. I think you also have the option of manually migrating your database data. But maybe that's a little bit more difficult than it was with S3 where you can use the AWS CLI to do an S3 sync. Similarly, you could probably go a step further and migrate your S3 data using Storage Gateway and have more of a pattern like you have with the Database Migration Service. But S3 is probably just a little bit simpler to migrate because you don't have to think about all the transactional updates happening and file systems are a little bit simpler to reason about. So you've got options there, but you don't have to go all in and choose Storage Gateway, which has lots of options and its own set of complexities.
Luciano: Yeah, and then the last thing is to provision Redis, and you can do that in a managed way on AWS using something like ElastiCache, for instance.
Eoin: And the good thing about Redis is that it tends to be quite schema-less, so you don't need to really worry too much right now about, I don't know, how are you going to structure the data in Redis.
Luciano: So just spinning up the cluster is probably enough for you right now to get started. Okay, so Redis, I suppose the important thing is to size it correctly, make sure you have enough memory, and that it's going to work for your performance, but that depends on what you're using it for, and that probably brings us to the application and how the application leverages Redis.
Eoin: And I think we've talked about preparation, we've got our data migrations started. So this is everything in the right order so far, I think. Probably a good time to start thinking about compute and the application itself. So is it just a lift and shift? Do we need to make much change there? I would say almost, but there is like a big mindset shift, I think, when it comes to this kind of architecture.
Luciano: And the reason why is because in the initial state, we have only one machine. So you can imagine that machine to be totally stateful. Everything that happens, connections, sessions, could all be managed in memory in that one machine. The problem is that as soon as you have multiple machines, even just two machines, the load balancer will route traffic to them in kind of a round robin fashion.
So it's not guaranteed that a user sending a request the first time will end up in the same machine when they send a request the second time. They might be bouncing between two or more machines. So if the state is not somehow available to all the machines, that becomes a problem because a user might log in into one machine, then send that request to another machine. And basically the second machine doesn't have any clue about that particular session.
So the problem is how do we keep all the instances as stateless as possible? Which means we need to put the state somewhere else that is shared. And that's why we created the Redis cluster. And for this particular application, I expect that the main kind of state that we need to keep track of is just user sessions. So we can kind of simplify it that way. We already say that files will be copied in S3 so that that kind of decouples as well the statefulness of the application into something a little bit more stateless.
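The shared session store idea can be sketched like this. To keep the example self-contained, a plain dict stands in for the shared store; the intent (labeled as an assumption in the comments) is that in production every instance would point these same calls at the ElastiCache Redis cluster instead:

```python
import secrets


class SessionStore:
    """Server-side sessions keyed by an opaque token in the user's cookie.

    A dict stands in for the shared backend here. In the real architecture,
    all instances would share one Redis cluster, so a session created on one
    machine can be resolved on any other.
    """

    def __init__(self):
        self._backend = {}  # assumption: replace with a shared Redis client

    def create(self, user_id):
        # Opaque, unguessable token: the only thing the browser holds.
        token = secrets.token_urlsafe(32)
        self._backend[token] = {"user_id": user_id}
        return token  # set this as the session cookie

    def resolve(self, token):
        # Any instance behind the load balancer can do this lookup.
        return self._backend.get(token)
```

With Redis as the backend you would also give each session a TTL (e.g. via `SETEX`) so abandoned sessions expire on their own.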
But there is another interesting thing to keep in mind, which is that you shouldn't, even though you probably could, SSH into one of the machines to do operational stuff. And operational stuff could be, I don't know, tailing logs because you're trying to troubleshoot something, or even just installing updates or doing code changes because you're trying to fix or update something. That doesn't make any sense anymore because, first of all, if you're looking for logs, you have no guarantee that the logs you are looking for are being produced on the machine that you just connected to. It might be any other of the machines, or maybe the original machine where you saw a potential bug doesn't even exist anymore. Because you have to think that these machines could be configured to dynamically appear, be created, and be destroyed to be elastically scalable. So that concept of "I'm just going to SSH in to do operations" is a big no-no when you move to this kind of architecture. So what is the solution? The solution is to use machine images, AMIs, to provision your instances. So you make sure every instance is literally the same. Everything is stateless, because we said we move all the state outside the instance. But also you'll need to start adopting observability tools for things like logs and metrics. And that makes all this information stateless too, in a way, meaning that it's moved outside the instance itself. That sounds good. And I guess people can make their own judgment as to whether they need an auto scaling group.
Eoin: You might also just decide to bring up a number of instances, like three instances and multiple AZs. And if you know your traffic is never going to exceed the compute amounts of three instances and you're just doing it for high availability. That's completely OK, too. You can decide to adopt an auto scaling group at a later stage. Absolutely. So we talked about some of the networking fundamentals, public and private subnets. You've mentioned the application and we've got auto scaling. We talked about multiple AZs. What are the other, I suppose, front facing networking considerations that we need to take? So we're starting to wire our application closer to our user. What are the parts that we need to think about there?
Luciano: Yeah, one thing that we didn't mention is HTTPS, which, of course, is going to be a critical thing for a system like this, where users are logging in and there is sensitive information being uploaded. So we definitely need to have HTTPS. The good news is that in AWS, there are ways to make that somewhat simple and managed, because you can use services like ACM to create the certificates and manage the lifecycle of the certificates.
And then a certificate with ACM can just be attached to the load balancer and the load balancer can deal with all the SSL termination. So it becomes kind of from the user to the load balancer is HTTPS and everything else you don't necessarily have to keep doing HTTPS unless you want to, of course. So the things that we need to configure is create a certificate with ACM, attach the certificate to the load balancer.
And of course, when you create the certificate, there are different ways to validate that certificate. You need to prove that you have control over the domain, and you can do that either by email or with DNS records. So depending on how you are set up there, you might pick whichever way is most suitable to you. And finally, if you want to do auto scaling, you need to make sure that your application has a kind of health check endpoint that the load balancer can use to verify, when a new instance is brought up, that it's actually ready to receive requests. And also if the instance crashes for whatever reason, the load balancer can recognize that and remove it from the pool of EC2 instances. And with that, you also need to configure the target groups and auto scaling groups. So there is a little bit of extra configuration. Also, what are the scaling rules? Do you want to scale based on, I don't know, average CPU or number of connections? Things you can decide based on what your expectations are in terms of incoming traffic. OK, so that sounds like it'll set people up for a seamless switchover as long as they understand exactly what they expect in terms of what domains they're using.
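A health check endpoint can be as small as this. The sketch uses plain WSGI to stay framework-agnostic, and the `/healthz` path is an assumption — it just has to match whatever path is configured on the load balancer's target group:

```python
# Minimal WSGI sketch of a health check endpoint for the ALB target group.
def app(environ, start_response):
    if environ.get("PATH_INFO") == "/healthz":
        # A richer check could also verify DB and Redis connectivity here,
        # so an instance with a broken dependency gets pulled from the pool.
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"ok"]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found"]
```

In Django the same thing would be a trivial view wired to a URL route; the key point is that the endpoint answers quickly and only returns 200 when the instance can genuinely serve traffic.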
Eoin: They need to think about are they using the same domain, different domain, but the important thing is to be able to test your old system and your live system, make sure they're both working and then seamlessly switch over with the no deployment steps really just to use DNS. That's always the safest way to do things. So there's at that point, right, we've got our application up and running in the cloud. Users can start using it right away. Existing users should have noticed no difference, maybe just a dramatic increase in performance and stability. And we know that we're scaled for future growth as well. So in terms of thinking about the team, people who actually have to do this work and support it, and we don't want them to lose too many sleepless nights. So what are the things that teams need to learn? What are the fundamentals? We talked about some of them there. Maybe we can summarize.
Luciano: Yeah, we definitely mentioned infrastructure as code as being one of the most important investments, I suppose, because if you do that right at the beginning, it's going to pay off big time as you deploy the application the first time, but then especially when you want to do changes in the future and update the application. So that's definitely one, and it can be a big one, I suppose. If you've never done it before, it can be a little bit overwhelming. So this is probably the one thing I would recommend to really spend your time and make sure you feel comfortable with it.
The other one is AWS networking. You don't have to become an expert, but at least understand the basics, what are the different concepts and need to be comfortable thinking that you are not just running a server in the public internet or on premise and somehow with a public IP. But you literally have your own virtual network where there are different things running inside, they are connected with each other, and then how do you expose that to public facing internet?
So just make sure you understand all the basics there and how the different AWS concepts allow you to implement that kind of architecture. And another thing we didn't mention, but it's probably important, is to understand AWS permissions. So get yourself comfortable with IAM because, of course, we'll need to have instances that are able to read and write to S3. So at the very minimum, you need to be able to define the IAM policies that allow that.
But of course, as soon as you learn IAM, that can be beneficial in general in AWS to make sure that every time you are integrating different services, all the policies are configured correctly. And also that's important for users logging into AWS, what kind of permissions do they get? So something to learn anyway as soon as you need to start managing that AWS account. And finally, how to create AMIs. There are different ways and different tools, but of course, it's something that you need to do because this is how you change, well, how you create the code in the first place that goes into every machine, but also how do you change it every time you want to do a new release.
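That minimal instance-role policy for S3 access could look like the following, built here as a Python dict for readability (the bucket name is a placeholder):

```python
import json

# Hypothetical bucket name used throughout the examples.
BUCKET = "legal-docs-bucket"

# Least-privilege policy for the EC2 instance role: read and write objects
# in the one bucket the application uses, plus listing for sync operations.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": f"arn:aws:s3:::{BUCKET}/*",  # objects in the bucket
        },
        {
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": f"arn:aws:s3:::{BUCKET}",  # the bucket itself
        },
    ],
}

print(json.dumps(policy, indent=2))
```

Note the distinction between the two `Resource` ARNs: object-level actions apply to `bucket/*`, while `ListBucket` applies to the bucket ARN itself, a detail that trips up a lot of first-time IAM users.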
Eoin: So I think that's a good summary of all the skills you need, and there's enough there. And if you could focus on those basics, I think after a success like this and with those skills, you've got a team that's really well set up to grow on AWS really well. So what's next? Once that's in place, what should the team thinking about in terms of, okay, now that we're there in AWS, where do we go from here? What are the improvements we can make? What new opportunities does this open up for us? Yeah, I think there will be in general some new challenges, but also new opportunities once the new system is running in the cloud.
Luciano: We mentioned already that there will be challenges in terms of observability because again, you have a lot of things happening in different systems. How do you make sense of if there is an issue, like where the issue will even be? Like where do you start looking? Where do you find evidence about that issue? Where do you collect more information to be able to troubleshoot and solve the issue? And all of that comes with the topic of observability and learning how to do that in the cloud and all the tooling.
It's another skill that the team will need to start developing. And that probably requires a lot of code changes, making sure that all the information is logged correctly or metrics are being created, alarms are set. And then you also need to develop operational skills. How do you react to incidents? Who is going to be available? What are they going to do to address problems? Things that maybe you were doing to some extent with the monolithic system, but now they get to a different degree of complexity just because you have more moving parts.
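One of the smallest useful code changes for observability is switching to structured logs, so that a log shipper (for example the CloudWatch agent) can aggregate lines from every instance into one searchable place. A stdlib-only sketch:

```python
import json
import logging
import socket
import sys
import time


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line; field names here are assumptions."""

    def format(self, record):
        return json.dumps({
            "ts": time.time(),
            "level": record.levelname,
            "host": socket.gethostname(),  # which instance produced the line
            "logger": record.name,
            "message": record.getMessage(),
        })


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("app")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("search request served")
```

Because every line carries the hostname, "which machine did this happen on?" stops being a question you answer by SSHing around.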
And then similar topics are testing. How do you do testing now? Because it's not just one system. How do you make sure that all the new different parts of the system work end to end? And with that, you can also start to think about building and deployment. Can we automate some of that stuff, even just the building part? But if you can even get to a point where you do full CI-CD, that's kind of even better goal to have. And again, this is a little bit of both of a challenge and an opportunity.
But there are also other opportunities there that are very interesting because the goal that we hopefully achieved at this point is that we have an architecture that can scale and be more resilient to failure. There is not a single point of failure anymore. And if things fail, you can have systems in place that will automatically spin up new instances and the system can auto-heal to some extent.
The interesting thing is that at this point, as soon as your product grows, you have more customers, you need to develop new features, you can start to think about two options there. One, you can start to think about microservices so you can start to break down the existing application into individual services and then give different teams different responsibilities. But also you can approach that way of thinking in a more, I suppose, safe way, which is you don't necessarily have to do full monolith to microservice migration.
You can think, okay, if we need to develop a new feature, how can we build that one feature in a way that is decoupled from the existing monolith? And that's something that you can do in AWS: for instance, you can use API Gateway and then Lambda as a backend, and then tell the load balancer that this particular feature, I don't know, slash search maybe, goes into API Gateway and then is managed by Lambdas rather than being managed by the monolith application. So that gives you ways to experiment and get more comfortable with different tools that are available in AWS before you actually dramatically change the entire application. And similarly, you can experiment with SQS, for instance, and Lambda to offload some of the usual things like, I don't know, sending emails, notifications, processing data in the background. So you can also leverage additional tools as soon as you see an opportunity to do it with very small and tactical changes. This is great. Yeah, I think there's a number of opportunities.
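A first strangler-style extraction of the /search route could be a Lambda handler shaped like this. The handler follows the standard proxy-integration event/response format; the actual search logic is stubbed out, since the point is the shape of the extraction, not the implementation:

```python
import json


def handler(event, context):
    """Hypothetical Lambda behind API Gateway serving the /search route,
    carved out of the monolith as a first strangler-pattern step."""
    params = event.get("queryStringParameters") or {}
    query = params.get("q", "")

    # Stub: a real implementation would query the search index here.
    results = [{"document": "example.pdf", "query": query}]

    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"results": results}),
    }
```

Because the load balancer (or API Gateway routing) decides which requests reach this function, the monolith's code path for search can be retired gradually, with a quick rollback available by flipping the route back.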
Eoin: It really is a good appetizer for people who are thinking about taking this approach and I think the whole order of things and doing things simply in a managed way and then opening up these opportunities for later is good. You're not taking on too much too soon. If you want to learn more about the details of this particular strategy, there's a lot of detail in that really great InfoQ article. The link is in the show notes below. But if you want to know about all the different ways, Episode 18, How Do You Move to the Cloud, we're going to link to that and we'd really love your thoughts and other alternative ideas on migration strategies because there's a lot of them out there. So let us know what you think and we'll see you next time. Bye.