AWS Bites Podcast

38. How do you choose the right compute service on AWS?

Published 2022-05-26 - Listen on your favourite podcast player

When it comes to choosing compute services on AWS, there are a lot of options, including EC2, ECS, Lambda, EKS… New ones keep emerging all the time! Selecting the right one for each application is no longer an easy choice. In this episode we discuss why you need compute services and what kinds of problems should be offloaded to something else entirely. We suggest how you can develop a methodology to make the selection process easier and less biased within your company. We discuss at a high level some of the different compute options available in AWS, and finally we present a few example use cases and describe how we would pick the compute service for each.

In this episode we mentioned the following resources:

Let's talk!

Do you agree with our opinions? Do you have interesting AWS questions you'd like us to chat about? Leave a comment on YouTube or connect with us on Twitter: @eoins, @loige.

Help us to make this transcription better! If you find an error, please submit a PR with your corrections.

Eoin: When it comes to choosing compute services on AWS, there are a lot of options including EC2, ECS, Lambda, EKS, and new ones keep emerging all the time. So selecting the right one for each application is no longer an easy choice. So today we want to talk about why do you need compute services and what kind of problems should you actually be offloading to something else entirely? We're gonna talk about creating a methodology to make the selection process easier and less biased.

We'll talk about the different compute options available in AWS and we'll give some example use cases and how you make the right choice in each case. My name is Eoin, I'm joined by Luciano and this is the AWS Bites podcast. In the previous episode, we talked about migrating a monolith to AWS. And in that case, we talked about considering data, compute and networking. For compute in that case, we chose EC2 because it made sense for the team. This time we want to talk about different options for compute and how to have a system for choosing one over the other in different use cases. But we should probably start with the basics. It might seem obvious Luciano, but what do we mean when we say compute?

Luciano: Yeah, I think it's actually not obvious at all. Historically I would describe compute as: you get a virtual machine and you can do whatever you want with it. And that used to be literally everything, from your application code, like maybe a web server, to a database or even an event bus. Everything, including services and business logic, was living together in this one virtual machine.

So I would describe that as a compute layer of its own. But if we want to be a little bit more specific, maybe we can try to isolate everything else and focus more on the actual business logic. So what do we mean in general by business logic? There are different use cases that we should consider. For instance, if you have to run specific algorithms, or if differentiating parts of your business require custom code to be run to actually make something happen.

This is definitely one category that we can describe as something that can be fulfilled with compute. Other common things are when you need to run, for instance, control flow or some sort of orchestration. So you basically have some data maybe coming in and you need to decide what to do with that data and then produce some output based on rules of some sort. That's definitely another category where you can have a compute layer that is dedicated to that particular kind of use case.

And very similarly, there is a concept of integration. So maybe you are trying to connect to different systems, maybe using different types of APIs. And again, that might be something event-based. So an event comes in, you need to deal with that event maybe by sending the data to another system and this way you can connect multiple parts of your application. And finally, another interesting use case is data access, which is the idea that you might have a data layer and you need to run specific code to allow the access to that data layer.

And you can imagine, I don't know, if you have a web application, maybe you have an ORM layer that allows you to execute queries against the database. That can be transactional or not, or you can have other background processes that you use to do data gathering, data mining, or even data processing. For instance, if you have big data workloads and you need to enrich data or manipulate data.

So in all these cases, you can find, I suppose, some degree of compute. And the idea is again, that your business will have certain needs and to express all these needs, you probably need to write some code and this code needs to run somewhere to satisfy those needs. The interesting thing is that I have a feeling that in modern architectures, especially if we think more and more about the concept of serverless or the concept of low code architectures, it feels like there is a goal of trying to reduce that level of custom compute as much as possible. And the alternative is to use services that are managed and given to you by some third-party provider like AWS or other cloud vendors. I don't know if you agree with this view, but this is like what I feel about where the world is going.

Eoin: That's definitely the direction of travel. Yeah, if you ask what serverless means, it's not really about functions or Lambda, it's about removing compute as much as possible. And you even have people using this term "serviceful" (or service-full) applications now, where you're really just doing integrations, composing these third-party services together and trying to remove that code.

I think it's a good practice actually. So in an AWS context, some examples of that would be using Step Functions for orchestration instead of imperative reams of code where you have to figure out how you handle errors, how you handle delays, how you handle back-off, retries, circuit breakers and all that stuff. So now you don't even need Lambda for a lot of those things in Step Functions.

So you can do direct SDK integrations and talk to SQS or EventBridge or S3 directly from Step Functions states. So the potential to remove traditional imperative code running in a function or in a container is getting better all the time. Then you also have other examples: when you're integrating things together, you can use event services like EventBridge, SQS, SNS, Kinesis, and you can combine them with API Gateway or AppSync. So you can integrate APIs inbound and also third-party webhooks together without having to write a lot of custom logic.
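
To make that concrete, here is a minimal sketch (in Python with boto3) of a Step Functions state machine whose single state sends a message straight to SQS via the direct SDK integration, with no Lambda function in the middle. The role ARN and queue URL are placeholders, not real resources:

```python
import json
import boto3

# Hypothetical ARN/URL -- replace with your own resources.
ROLE_ARN = "arn:aws:iam::123456789012:role/my-stepfunctions-role"
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/my-queue"

# A state machine with a single state that sends a message to SQS
# using the direct SDK integration -- no Lambda function involved.
definition = {
    "StartAt": "SendToQueue",
    "States": {
        "SendToQueue": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sqs:sendMessage",
            "Parameters": {
                "QueueUrl": QUEUE_URL,
                "MessageBody.$": "$.payload",
            },
            "End": True,
        }
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="direct-sqs-integration-example",
    definition=json.dumps(definition),
    roleArn=ROLE_ARN,
)
```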

Luciano: Yeah, it's interesting that for some of the services you mentioned, in the last few years we have seen more and more features that actually look like they're trying to reduce the amount of custom code, like the ability to filter certain types of events or to remap the structure of an event before it's forwarded somewhere else. Today you can do all of this with just configuration, rather than writing your own custom code that takes the data as input, changes the data, and sends it somewhere else.
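
As an illustration of that configuration-over-code idea, the sketch below creates an EventBridge rule that filters events by a field in the detail and reshapes them with an input transformer before they reach the target. The bus name, event source and queue ARN are made up for the example, and a real setup would also need the right permissions on the target:

```python
import json
import boto3

events = boto3.client("events")

# Hypothetical names and ARNs -- adjust to your environment.
BUS_NAME = "my-app-bus"
TARGET_QUEUE_ARN = "arn:aws:sqs:eu-west-1:123456789012:orders-queue"

# Filter: only forward order events whose status is "shipped".
events.put_rule(
    Name="shipped-orders-only",
    EventBusName=BUS_NAME,
    EventPattern=json.dumps({
        "source": ["my.orders.service"],
        "detail": {"status": ["shipped"]},
    }),
)

# Reshape the event before it reaches the target, with no custom code.
events.put_targets(
    Rule="shipped-orders-only",
    EventBusName=BUS_NAME,
    Targets=[{
        "Id": "orders-queue",
        "Arn": TARGET_QUEUE_ARN,
        "InputTransformer": {
            "InputPathsMap": {"orderId": "$.detail.orderId", "when": "$.time"},
            "InputTemplate": '{"orderId": <orderId>, "shippedAt": <when>}',
        },
    }],
)
```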

Eoin: I was just about to mention AppSync and API Gateway and directly integrating with DynamoDB or other backend services or APIs. And exactly, you can do that. You can use input transformers for some services, or with AppSync and API Gateway you can use VTL, the Velocity Template Language, to do those mappings. And it really becomes a choice then: do you want to use VTL, or do you want to use Lambda for that translation?

And in that case, you're just using Lambda as a slightly more powerful data transformation layer, in some cases using the language of your choice. So it's less compute and more data transformation. Yeah, this kind of architecture can be very beneficial. It can remove a lot of effort. It can push a lot more responsibility to AWS or someone else, but it does require specific skills. We've mentioned this before. There is a mindset shift. It's not as suitable for everybody. And you also might have existing software that relies on specific frameworks running in containers or on instances. So we can't pretend that this is the way everybody should be going right now. It's not always a simple choice and it depends on your context. So I think what we'd recommend is having a technology selection methodology to help your team or your organization choose a good solution depending on the context.

Luciano: Yeah, I think I agree, because there are two extreme scenarios. One is that you always stick with what you know, and then there is no evolution, and eventually you will have a huge technical debt and it's very hard to move on from there. The opposite extreme is when you always try the newest shiny thing, just because you want to have fun and you want to try new things.

And maybe you end up in situations where you rely on technology that is going to fade away because maybe there wasn't as much adoption as was initially expected, or maybe you run into cases where there isn't a lot of documentation, community, use cases and examples. So you are left on your own to really reinvent the wheel with this new technology. Those are the two extremes, so we need to find a balance between them, and having a methodology that you can rely on and that is somewhat objective can be very beneficial for a team to find the right trade-off for them. Yeah, I think we can maybe mention a few pillars there. I don't know if we will come up with a very precise step-by-step methodology, but I think we can give people insights on how to build their own methodology, maybe starting from some common ideas.

Eoin: We can all be very emotional when making technology decisions. And when you have a large group of people with different perspectives, it can sometimes even become bitter or suboptimal, or you end up with some regrets or the feeling that somebody else forced their way in the technology selection. But you also just want to, for your own benefit, remove your own biases. So I think the idea of having a methodology is great just for removing that and allowing everybody to move on together with a sense of shared consensus, knowing that you've made a good decision for everybody.

Luciano: Yeah, I actually hate when in some companies a phrase comes up like, we are a "technology X" company: no matter what, we always use technology X, we intentionally committed to using this technology forever. And that can have its own benefits, because maybe you structure training and hiring around that technology, but at the same time in the long run it can become very dangerous, because it kind of blocks any opportunity for evolving the company itself and the technology, maybe leveraging some new opportunities that are available in the market. So that's another case where you can be emotional in a dangerous way and maybe lose opportunities to do something different that could be more beneficial for your business. So where do we start? What kind of suggestions can we give to people for building this methodology?

Eoin: So I've done this a few different times, and each time I kind of adjust the selection criteria, but the one I always include is trying to put in place a measure of simplicity. And this can be a difficult one, but it's really the number one factor. The idea is that you don't end up with lots of unnecessary complexity that's hard to foresee. So you think about shifting more of the responsibility to your vendor and away from your team.

So there are different aspects, like operational complexity and difficult troubleshooting, but it's up to you to define what the criteria for simplicity are. But simplicity is a really big one, right? Because any small piece of complexity will be compounded over time as you go into production and the rubber hits the road. So simplicity is always a good one to try and score. Another one is a lot more practical and quantitative, which is performance and scalability. I find this is something that people either overthink in the beginning, overestimating the performance characteristics they need, or else dismiss entirely until it's too late. There's rarely a happy middle ground. But you can very quickly put some numbers together on what your usage is expected to be, based on historical data or market data or whatever your growth projections are, put in some performance criteria, and just use that to check: does the service I'm choosing fit within that? And then it becomes a very easy decision.

Luciano: Yeah, it needs to be more of a sanity check, I suppose, rather than, I don't know, artificial goals that you want to achieve just to prove that that technology is the best one.

Eoin: Yeah, for sure. And remember as well that this is something you can revisit over time, because you expect your architecture to evolve, and your compute options can change as well over time. And this is another reason to go with the simple option, because simplicity often implies what I think Werner Vogels once called evolvability: if it's a simple service, you don't have a lot of upfront investment in it.

So you can actually back out and reverse the decision very quickly as well, and it doesn't hurt you very much. So a couple of other criteria you could put in your methodology are reliability, resilience and security. Again, going with managed services should help you there, but you should try and have a measure of that. Some of the less quantitative ones are things like developer experience. That's a really important one. We often look at how things will work in production, but we forget that most of the time developers are working in the development environment, not the production environment. So how easy it is to get your development environment up and running, onboarding new developers, all of that is really important. Do you have any other ideas for some other factors we can include in this selection methodology?

Luciano: Yeah, one that I would suggest is skills: how does what your team already knows help you to adopt a specific compute layer? Do you need to learn something entirely new, or does the model offered by that particular compute layer actually fit your current skills very well? This is something you need to keep in mind, because of course, if you need to learn a lot, that means time invested in giving people space to learn and experiment, and also to fail, because you're probably going to be doing something wrong at the beginning before you learn all the patterns and all the right ways of using the technology.

So all of that is something you need to account for if you're going to learn something new. It might not always be desirable to do that, maybe if you have very strict timelines, or if you don't necessarily need the characteristics of a specific compute layer. So that's a good trade-off to keep in mind. The other one is cost, because of course that's another dimension that is always worth considering.

Different services have very different cost factors, even different cost formulas if you want. So depending on how you intend to use a service, it might be much more convenient for you to use one compute service or another. So again, take some figures, use your expected use cases, historical data, whatever you can think of that gives you a reasonable idea of what the trajectory of your usage is going to be, try to put some data in and figure out, okay, how much more expensive is this solution going to be compared to this other one, and evaluate on that metric as well.
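
A back-of-envelope calculation like the one below is usually enough for this kind of sanity check. The prices here are illustrative only, roughly in the ballpark of published Lambda and Fargate rates, so always plug in the current numbers for your region:

```python
# Back-of-envelope cost comparison, using illustrative (not current) prices.
# Always check the AWS pricing pages for your region before deciding.

requests_per_month = 10_000_000
avg_duration_s = 0.2
lambda_memory_gb = 0.5

# Hypothetical unit prices in USD.
LAMBDA_PER_GB_SECOND = 0.0000166667
LAMBDA_PER_REQUEST = 0.0000002
FARGATE_VCPU_HOUR = 0.04048
FARGATE_GB_HOUR = 0.004445

lambda_cost = (
    requests_per_month * avg_duration_s * lambda_memory_gb * LAMBDA_PER_GB_SECOND
    + requests_per_month * LAMBDA_PER_REQUEST
)

# Assume the same load could be served by two always-on Fargate tasks
# (0.5 vCPU / 1 GB each) -- a simplification for illustration only.
hours_per_month = 730
fargate_cost = 2 * hours_per_month * (0.5 * FARGATE_VCPU_HOUR + 1 * FARGATE_GB_HOUR)

print(f"Estimated Lambda cost:  ${lambda_cost:,.2f}/month")
print(f"Estimated Fargate cost: ${fargate_cost:,.2f}/month")
```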

Eoin: These criteria are beginning to sound like the list of pillars from the Well-Architected Framework, with some of the softer ones added in, like developer experience and skills suitability for your team.

Luciano: Yeah, the other one I wanted to mention, and you kind of mentioned it already to some extent, is ease of deployment, which is more about how much effort it's going to take to actually put something in production, and then once it's in production, to keep it running, and every time you need to do updates, to make sure you are in a position where it's easy enough for you to update the system. So that's another metric that is definitely worth considering depending on the kind of application you're trying to build.

Eoin: Yeah, so with all those factors, the idea is that we're taking something that's a very subjective process and we're trying to make it a little bit quantitative. So for each of those criteria, you can give every one of your compute options a score, say from one to five, and you can get everybody on the team to give a score, and if there are widely diverging opinions, you can discuss that and form a consensus, right?

But the idea is to try and be as objective as possible. What I found is that by employing this process, you end up with quite a good shared consensus on the team, and much more of an understanding of different perspectives. And once you go forward with those decisions, you actually end up saying, well, you know, we've got some very close scores here, but let's pick this one, and we also have optionality. You know, we can switch down the line if we architect this in the right way, and we know exactly what we're dealing with because we've put a lot of effort into understanding exactly what the pros and cons are. So it tends to remove all of that emotional baggage that comes with opinion-based technology selection.
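
In practice the scoring exercise can be as simple as a small script or spreadsheet. The sketch below shows one possible shape for it, with made-up weights and scores; the point is the structure of the exercise, not the specific numbers:

```python
# Each team member scores every compute option from 1 to 5 against the
# agreed criteria; criteria get weights, and the totals feed the discussion
# (they don't replace it).

criteria_weights = {
    "simplicity": 3,
    "performance": 2,
    "cost": 2,
    "skills_match": 2,
    "developer_experience": 1,
}

# Example scores (1-5) averaged across the team -- purely illustrative.
scores = {
    "Lambda":  {"simplicity": 4, "performance": 4, "cost": 5, "skills_match": 3, "developer_experience": 4},
    "Fargate": {"simplicity": 3, "performance": 4, "cost": 3, "skills_match": 4, "developer_experience": 3},
    "EC2":     {"simplicity": 2, "performance": 5, "cost": 3, "skills_match": 5, "developer_experience": 2},
}

for option, option_scores in scores.items():
    total = sum(criteria_weights[c] * s for c, s in option_scores.items())
    print(f"{option}: {total}")
```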

Luciano: Yeah, and I think it also makes every team member more involved, or at least they feel more involved in the decision, which makes a big difference compared to having a decision imposed from above, where maybe you end up not agreeing with it, and then because of that, you kind of resist adopting the technology as best as you can. I think when you have this process in place, it's much easier to get everyone on board, even if it's an experiment, but at least to go ahead with the experiment and very honestly try to see how it plays out in the context of the team. But I have one question at this point, because we are defining this process, and it feels like once you do this exercise, eventually you have one result, which is: okay, for our particular need, this is the technology of choice. Does it mean that you end up picking only one, or is there room for different parts of the application picking different layers of compute?

Eoin: Yeah, that's a good point. It's definitely not the idea that you say, okay, now we go with Fargate for everything, or now we go with Lambda for everything. It's a bit like the way the database selection practice has evolved in a lot of companies now. Previously, you would pick one database that you would put at the heart of every system, and it probably ended up growing very large. Nowadays, people are more used to the polyglot database idea where you have different data stores for different purposes. You might have NoSQL for some data and different relational databases for others, maybe a data lake as well. With compute, it's the same thing I would suggest. For different applications, use cases or processes, you can apply this process independently. So you end up with a mix of compute types, and that's completely okay because you're picking options that are optimized for the particular application and its scalability.

Luciano: Yeah, I like that, because I think the analogy fits really well. Today it's very well accepted that you're not going to use one database for everything. I mean, if you do that, you'll probably end up in a suboptimal place. So I think it works for compute as well: if you try to find the parts that best fit the particular problem, then you probably end up in a position where things are glued together in a better way and work better than one generic thing trying to do everything.

Eoin: So should we talk about the compute options that are actually available on AWS now?

Luciano: Yeah, I think the most obvious is EC2, which is probably the one that has been around the longest, right? And it's the most generic. You just get virtual machines and you're free to do whatever you want with them. It's probably also the one that requires the most maintenance and management on your side, so it's worth thinking about it that way. It's the most generic, the most usable for the largest variety of use cases, but also the one that requires a lot of investment in non-differentiating things: which operating system do I use? How do I keep it up to date? Security and all that kind of stuff. All of that is on you, so keep that in mind. And then we can move down the line towards more and more managed: from EC2 we can go to containers. Fargate, ECS and EKS are all compute options you can use where you start to offload a lot more of the operational stuff to AWS, things that AWS will take care of for you rather than you having to do them yourself.

Eoin: Yeah, I think it's interesting to assign a simplicity score, maybe in your head too, for each of those options, because I would have a certain view on that, but it's a good thing to explore with your team: exactly how much complexity you're getting into, and for the different container options you can end up with very different scorings.

Luciano: Absolutely, and we already mentioned this in the previous episode, where in that particular use case we decided to go with EC2 rather than using containers, just because the team didn't have any previous knowledge about containers, not because containers wouldn't have been beneficial for them. It's more about how much load we put on the team to learn new things, and we decided that at that point in time just switching to AWS was already enough cognitive overload. So that could maybe be a next transition: when the team is confident with AWS, they can start to get more confident with containers, then switch to ECS, EKS or Fargate and start to offload more and more to AWS. So I think you also need to keep in mind, as we said, the kind of knowledge bias, and decide how much more knowledge you want to gain in one go rather than sticking with what you know.

Eoin: That makes a lot of sense. Yeah, in that case, I suppose you just put an increased weight on the skills match rather than on the simplicity factor, which is completely okay as long as that's an approach agreed by everybody.

Luciano: Yeah, and if we keep moving towards more and more managed, I think the next ones are App Runner, Lightsail or Elastic Beanstalk, which is probably the more traditional, older version of Lightsail and App Runner, but also things like SageMaker or AWS Batch, which are a little bit more specialized. But again, you can still run your own custom compute, just in a more constrained fashion, and you offload a lot more to AWS.

Eoin: Yeah, AWS Batch can be very useful in some cases. It really runs on top of ECS, but it can make the process a lot simpler, as you say, yeah.

Luciano: Then what do we have? I think we can also consider CodeBuild as part of compute. Maybe it's a little bit of a stretch, but at the end of the day you are in control of what kind of code is used to build something, so it can become a little bit more general purpose than you might think. You can have your own totally custom build processes and write them in any language or any technology that you really want to, and you just run them on that CodeBuild compute layer.

Eoin: Quick and dirty approach, but it works really well in some cases, yeah, for sure.

Luciano: And then we have, I guess, AWS Glue, we've spoken about that one, or EMR, where you are using this kind of bespoke compute layer more for data management and data processing. And finally, the last one is Lambda, which is probably the most generic, but also very lightweight: it's as pure compute as it gets. You just write one function with your business logic and that's it, you generally don't have to worry about almost anything else. Did I miss anything? Is there any other service worth mentioning?

Eoin: There are probably others you could shoehorn in there, but I think that's a pretty good picture. I think somebody said there are 19 different ways to run containers on AWS. But look, I think we've covered enough and people get the general idea. Maybe we should start to conclude by talking about some of those example case studies and real-world scenarios. You already alluded to the legal CMS application from your article and from the last episode, which we'll link in the show notes.

Luciano: Yes, yeah, on that one, I think the most relevant thing was that there were a bunch of different trade-offs made in that case study that are very relevant, not just to the technology itself, but also to the team and the stage of the project. So we recommend you watch that episode or read the article to get the full context, because there we went with EC2, which seems a little bit of a counterintuitive choice based on what we've just said so far. So I think it's interesting to get the full context to understand why that was the suggestion for that particular use case.

Eoin: Another one I'd like to throw in here is a couple of machine learning cases. In a few different projects, I've come across examples where people had a pre-trained model, like an image recognition model or something like that, which has come from a team of data scientists working on GPU instances they had access to in some data center, and then they want to productionize it on AWS, and they want it to be able to scale quickly and work elastically, depending on the number of images that come in, and that can be very variable.

And I've actually gone through the process of trying this on different platforms and was surprised by where I eventually landed, because now my default scoring methodology on this would actually lead me to AWS Lambda, which you wouldn't expect to be the ideal platform for machine learning. But with the amount of memory you can have and the container support now, and the fact that you often don't really need GPUs for inference when you're just running the model, Lambda scales really well, so you can have results in seconds or less, but you can also scale to thousands in an instant, which you can't really do with SageMaker. And SageMaker ended up having a lot more complexity; it was less common, so there wasn't as much documentation and as many tutorials out there, and the interface was a little bit non-traditional, you know, just a bit more bespoke. So I ended up running more applications like that on Lambda and orchestrating them with Step Functions and API Gateway, and I found that's a really nice solution in general, if your model works in Lambda.
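
For anyone curious what that pattern looks like in code, here is a minimal sketch of a Lambda handler for model inference. It assumes the model has been exported to ONNX and baked into a Lambda container image, and that callers send pre-processed features; the file path and payload shape are purely illustrative. The key idea is loading the model once, outside the handler, so warm invocations reuse it:

```python
import json
import numpy as np
import onnxruntime as ort  # assumes the model was exported to ONNX format

# Load the model once per execution environment (at cold start), outside the
# handler, so warm invocations reuse it. The model file is assumed to be
# baked into the Lambda container image at this hypothetical path.
SESSION = ort.InferenceSession("/opt/model/model.onnx")
INPUT_NAME = SESSION.get_inputs()[0].name


def handler(event, context):
    # For illustration, assume the caller sends a pre-processed feature
    # tensor as a nested list in the request body.
    features = np.asarray(json.loads(event["body"])["features"], dtype=np.float32)
    outputs = SESSION.run(None, {INPUT_NAME: features})
    prediction = outputs[0].tolist()
    return {"statusCode": 200, "body": json.dumps({"prediction": prediction})}
```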

Luciano: Yeah, hopefully we will also get GPUs on Lambda soon.

Eoin: Yeah.

Luciano: At some point, but I agree with you, it's probably a niche use case for now, and you can live without it and still reap the benefits of the simplicity of Lambda. Okay, is there any other example that we want to present?

Eoin: So there's one last one that I thought would be worth throwing in, which is about batch processing. We mentioned AWS Batch already, but there's a use case, similar to projects we've worked on in the past, where you consider a financial company, like a pensions provider, that has a lot of clients with different portfolios of pensions. You know, when you're choosing a pension, you have to say how risk averse you are, what your risk level is, and your pensions provider is supposed to purchase assets and invest your money in something that reflects that risk level.

So if you can imagine that that pensions company will have to calculate, okay, for all of the assets we invest in, what is the risk level and how does that change over time? And how does that affect individual clients and their portfolios? So on a frequent basis, that pensions company will want to calculate risk level based on all the data they get in. So it could be stock market data, data about their clients, contracts from their various different stakeholders they deal with, and they need to run some sort of statistical model to calculate that risk.

The compute option there will really depend on the volume of data, the number of deals you have to calculate and the performance requirement, you know, how quickly you need to get access to the results. So traditionally, I think a lot of people would go to some sort of high performance compute infrastructure: big instances with very rapid networking between them, shared state, message passing infrastructure.

But the scoring methodology we've discussed would, in this kind of scenario, again lead me towards either something like Fargate or indeed Lambda. In the context of Lambda, what it actually forces you to do is divide all this work into small pieces that can run in Lambda functions. And then you try to get as much concurrency as possible out of Lambda so that you can get all of these little pieces executed as quickly as possible.

And you can scale to thousands of jobs in parallel. If you look at the performance factor in the scoring chart, that takes care of it, but you also get reliability, security and high availability out of the box. And then your jobs, because they're running in Lambda, are also stateless. So for developer experience, it means that if you've got an individual job failure and you need to troubleshoot it, you can just run that job in isolation, and it doesn't rely on some cluster's shared state in order to be able to troubleshoot it. So there are lots of options there and there's no one right answer in any of these cases. But I think this was just an example I wanted to bring up because it's something that traditionally you would have solved with a high performance compute cluster, but now you can do it with very scalable commodity cloud computing with Lambda or Fargate containers.
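
A minimal sketch of that fan-out pattern might look like the following, where a coordinator splits the portfolios into small batches and asynchronously invokes a hypothetical worker function for each batch. In a real system you would typically use a Step Functions Map state or a results queue to track completion; this just shows the idea:

```python
import json
import boto3

lambda_client = boto3.client("lambda")

# Hypothetical worker function name -- each invocation computes risk for one
# small batch of portfolios, so thousands of batches can run in parallel.
WORKER_FUNCTION = "portfolio-risk-worker"
BATCH_SIZE = 50


def chunk(items, size):
    # Split the full list of portfolio IDs into fixed-size batches.
    for i in range(0, len(items), size):
        yield items[i:i + size]


def fan_out(portfolio_ids):
    # Fire-and-forget asynchronous invocations; Lambda handles the scaling.
    dispatched = 0
    for batch in chunk(portfolio_ids, BATCH_SIZE):
        lambda_client.invoke(
            FunctionName=WORKER_FUNCTION,
            InvocationType="Event",
            Payload=json.dumps({"portfolioIds": batch}),
        )
        dispatched += 1
    return dispatched
```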

Luciano: Yeah, I think what you just said there at the end is interesting, because if you embrace Lambda, you're kind of forced to go down that path where you need to think, okay, how do we break this problem down into more manageable pieces that run concurrently? I think it also forces you, from an architecture perspective, to build something that is probably closer to the idea of microservices, where you have small components that can be executed and tested and developed autonomously, and then you orchestrate all of them together. So there could also be advantages in terms of the simplicity for the team to work on an individual component, not just for troubleshooting at runtime, but even for building it from day zero to production. I think you can divide a lot more of the work and bring everyone on board more easily on the different components, rather than having one big monolithic thing that could be very hard to develop, test and run locally.

Eoin: And I think we're seeing more and more of this, people realizing that you can use this commodity function-as-a-service or containers to do high performance computing or even scientific computing at scale. We're seeing more and more examples of that. So I hope these examples explain why it's useful, first, to have a system for scoring each of the options, and then how you would apply that scoring system to different use cases in order to make technology decisions. We do have an article already on choosing AWS compute services, and it has a methodology just like this, where you can see the scores we've given to various services. There's a link to this article in the show notes. And in upcoming episodes, we're going to dive deeper into specific applications that we've worked on at fourTheorem and show how this methodology played out in reality. Until then, we recommend that you check out the previous episode on migrating a monolithic legal CMS application to AWS without the drama. And we'll see you next time.