AWS Bites Podcast

108. How to Solve Lambda Python Cold Starts

Published 2023-12-15 - Listen on your favourite podcast player

In this episode, we discuss how you can use Python for data science workloads on AWS Lambda. We cover the pros and cons of using Lambda for these workloads compared to other AWS services. We benchmark cold start times and performance for different Lambda deployment options like zip packages, layers, and container images. The results show container images can provide faster cold starts than zip packages once the caches are warmed up. We summarize the optimizations AWS has made to enable performant container image deployments. Overall, Lambda can be a good fit for certain data science workloads, especially those that are bursty and need high concurrency.

AWS Bites is brought to you by fourTheorem, the ultimate AWS partner for modern applications on AWS. We can help you to be successful with AWS! Check us out at fourtheorem.com!


Let's talk!

Do you agree with our opinions? Do you have interesting AWS questions you'd like us to chat about? Leave a comment on YouTube or connect with us on Twitter: @eoins, @loige.

Help us to make this transcription better! If you find an error, please submit a PR with your corrections.

Eoin: Python is one of the two most popular languages for developing AWS Lambda functions. When it comes to data science, statistics, and machine learning workloads, Python is the undisputed leader. But it has often been difficult, and sometimes even impossible, to deploy certain data science workloads in AWS Lambda. The 250 megabyte package size limit in Lambda has been at odds with the heavy nature of Python data science modules like NumPy and Pandas, not to mention machine learning packages like PyTorch and TensorFlow. And this problem might actually occur with other runtimes as well. So today we're going to talk about some benchmarking we did on Lambda functions and present some really interesting findings. We're going to talk about zip packaging, Lambda layers, and also talk about the trade-offs between zip, images, and Lambda layers. And by the end, you'll hear how container image packaging can actually solve this problem and even provide superior performance. This episode also has an accompanying source code repository and detailed blog post. I'm Eoin, I'm here with Luciano, and this is another episode of the AWS Bites podcast.

Luciano: So, Eoin, I'd like to start by asking you the question: why would you even consider Lambda as a target when doing data science with Python? Because these are generally heavy workloads, so Lambda might not seem like the right solution, but maybe I'm missing something important there.

Eoin: I think there's plenty of people using Python for just API-based workloads and normal kinds of data transformation with Lambda on AWS. But when you think about data science, you also have to think about all the other options you have on AWS for running those kinds of workloads, like Python Shell or PySpark jobs on Glue or Elastic MapReduce, and there are more as well, and then you have SageMaker and SageMaker Notebooks.

So you can also think about services like EC2, ECS, right? Lambda, I would say, is best suited to two classes of workloads: those that are really bursty, where you don't really have constant traffic, and those where you need a lot of concurrency very quickly. And that could be data processing, it could be high-performance computing, high-throughput computing, lots of data science, financial modelling at scale, even HPC stuff like fluid dynamics or all sorts of scientific modelling.

And the great benefit of Lambda there is that executions can start faster than any alternative. Like, you can get thousands of concurrent instances in a few seconds, and now since the new announcement at re:Invent, you can increase by a thousand every ten seconds. When you think about it, you've got two classes of data science workloads, I guess. On AWS, you've got ones that are tightly coupled, and then you use something like Spark or Ray or Dask or one of those distributed frameworks to spread it across lots of stateful nodes.

Lambda isn't really suitable for that kind of workload, because each concurrent unit in a Lambda environment is not going to communicate with the others, right? It's stateless. So instead, Lambda functions are more highly isolated, and you would just run lots of them in parallel and orchestrate them with Step Functions or with other schedulers. But there are plenty of good use cases for Lambda with Python, and we've even done lots of machine learning inference in Python very successfully, where you don't need a GPU.

I think we've talked about that in the past. So Python is definitely a great fit for Lambda. You know, we use it for doing real-time data analytics for IoT, for doing event-driven ETL or batch-processing ETL, preparing data, machine learning, and aggregation in general. If you have a workload that is, I suppose, more for data scientists to do ad hoc, hands-on work, then you're probably going to use something like JupyterLab or SageMaker notebooks, and maybe an orchestrator like Airflow or Dagster. There's a load of them in the Python ecosystem. Lambda functions can run up to 15 minutes, which is usually plenty, and use up to 10 gigs of RAM. That can be enough, but sometimes it's not, and then you might reach for something like Fargate or App Runner instead. So it's always a trade-off. If Lambda does suit your workload, you can avoid a huge amount of infrastructure and all that kind of stuff. But if you've got something that's long-running, then you might just go with EC2 or a container instead.

Luciano: You mentioned the difference between long-running and short-running. We are very well aware of that limitation in Lambda, but are there other limitations that are relevant when you want to do data science with Python and you're trying to target Lambda?

Eoin: When we started using serverless architectures and combining them with Python data analytics, this was going back a few years, so container image packaging wasn't a possibility, and you were always working within the 250-megabyte size limit. And that's everything unzipped, all of your layers together, all of your code; there's no way to get around it. And when you look at even just a naive, basic data science setup with Python, a lot of the stuff we would do would have NumPy, Pandas, and PyArrow as well, and PyArrow is also a bit of a beast. So if you've got these things, then by default, you're already exceeding the 250-megabyte limit. And you might want to ship your own version of Boto3 as well, and AWS Lambda Powertools for Python is pretty indispensable too. Those things aren't massive, but even PyArrow is like 125 megs just on its own. So that storage requirement becomes a really big problem.
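As a quick illustration of how fast that limit is reached, here is a minimal sketch that measures the unpacked size of your dependencies against the 250 MB limit. The `package` directory name is just an example of where you might have installed your requirements.

```python
# Hypothetical helper: measure the unpacked size of your dependencies to see
# how close you are to Lambda's 250 MB (unzipped) limit for zip deployments.
# Assumes you installed your requirements into a local "package" folder,
# e.g. with: pip install -r requirements.txt --target package
import os

LAMBDA_UNZIPPED_LIMIT_MB = 250

def dir_size_mb(path: str) -> float:
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / (1024 * 1024)

if __name__ == "__main__":
    size = dir_size_mb("package")
    print(f"Unpacked dependencies: {size:.1f} MB "
          f"({size / LAMBDA_UNZIPPED_LIMIT_MB:.0%} of the 250 MB limit)")
```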

Luciano: Yes, I remember doing lots of tricks to try to strip down the binaries that you sometimes get from these packages, because behind the scenes they use native extensions just to be faster, and you can reduce the size a little bit. But yeah, I guess at some point there is always that limit, and the more libraries you use, the more likely you are to bump into this limit, which you cannot raise in any way. So that, I guess, brings us to the next question: you mentioned container deployments already. What are the pros and cons of using zip packages versus container deployments? And for this particular use case, is one of the two options more suitable?

Eoin: I'd say in general, zip packaging is still the preferred way, and the main reason for that is that it's the only way to make sure that AWS has the maximum amount of responsibility for your whole environment. And I think when container image support was announced in 2020, I believe a lot of people were excited about it, because it allowed you to take your existing images and kind of shove them into Lambda in a weird way.

But I remember you had people like Ben Kehoe as well, who's always quite wise, I would say, in his assessment of these things, and he mentioned that, you know, once you have a container image, you're essentially taking responsibility for the runtime, because it's like a custom runtime, and we talked about this in the recent runtimes episode. So you suddenly need to maintain that runtime yourself just because you're using container image deployment.

So it's giving you the benefit of being able to use container image packaging tools like Docker, Finch (a new one from AWS), Podman, and all these tools, and that's really great. You also get the 10 gigabyte package size limit, which is 40 times greater than the 250 megabytes you get with zip. But now all of a sudden, if you have a Java base layer and there's a bug in the JVM, it's your responsibility to patch that JVM and release a new base image, whereas with a zip package and one of the AWS supported runtimes, they're responsible for that patching, and it just happens while you sleep at night, usually, which is a really great benefit. So I wouldn't understate that benefit. Security is job zero for all of us, really, so it's pretty important. But you have to make these tradeoffs, you know? And again, a lot of people are running container images anyway, even in other environments alongside their Lambda functions. So they might say, well, look, I'm running container images, and I have my whole security tooling in place to do the patching and upgrades anyway, so I'm not really concerned about that additional drawback.

Luciano: But one of the things that might come into play here is performance. And I heard a few times from different people that they are worried about going to containers because they expect worse performance. So you mentioned that we did some benchmarks, and I'm curious to find out whether that's actually true, and especially in this particular case where we use Python and all these different data science libraries.

Eoin: Yeah, I mean, there's a lot of factors at play here. The traditional way we solved this problem before we had container image support was, with the Serverless Framework, a very popular plugin called serverless-python-requirements, and it would do exactly what you mentioned. It would take all the tests out of your Python packages, it would remove your readme and any documentation, and then it would strip any native shared libraries as well.

So there'd be no debug symbols. It was also common to remove the PYC bytecode files as well, the precompiled bytecode, just to save on that extra space. Of course, that might result in a performance hit. So at every step, you need to think about the tradeoff, right? So when you're using Lambda, you're supposed to be just writing the code and let AWS manage everything else. That's the promise. And if you end up having to do all this heavy lifting to strip out your packages, you kind of wonder, are you really realizing that benefit?
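For illustration, a rough sketch of that kind of slimming might look like the following. This is not the plugin's actual code; the directory name and the use of the system `strip` tool are assumptions.

```python
# A rough sketch of the kind of stripping serverless-python-requirements does:
# remove test suites, bytecode caches, and strip debug symbols from native
# shared objects. Paths and the system "strip" command are assumptions.
import pathlib
import shutil
import subprocess

def slim_packages(site_packages: str) -> None:
    root = pathlib.Path(site_packages)
    # Drop test suites and bytecode caches
    for pattern in ("**/tests", "**/__pycache__"):
        for path in root.glob(pattern):
            if path.is_dir():
                shutil.rmtree(path, ignore_errors=True)
    for pyc in root.glob("**/*.pyc"):
        pyc.unlink(missing_ok=True)
    # Strip debug symbols from native extensions (saves a lot for NumPy/PyArrow)
    for so in root.glob("**/*.so*"):
        subprocess.run(["strip", "--strip-unneeded", str(so)], check=False)

if __name__ == "__main__":
    slim_packages("package")
```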

So with container images, when they came out with the 10 gigabyte limit, the assumption, and a lot of the observation, was that cold starts were slower. And that's kind of intuitively something that makes sense, because you say, okay, well, it's going to take longer to cold start a 10 gigabyte function than it is a 250 megabyte function. What we did was we started to do some proper benchmarks on this and measure cold starts.

And we put together a benchmark code base with lots of different permutations of Lambda deployments. So we had zip-packaged ones, and we had zip-packaged ones with layers. And the whole idea of using layers is that there's a provided Lambda layer from AWS called the AWS SDK for Pandas layer. And that has Pandas, PyArrow, and NumPy already stripped down and optimized for you. You can look into their build process: they're compiling PyArrow with all the minimal flags and they're stripping out debug symbols, and that's how they do it. So layers don't really give you a benefit inherently, but the fact that somebody else has gone to the trouble of doing the stripping for you kind of gets around that problem. So that was why we tested the layers option. And then we tested images as well. We tested it with lots of different memory configurations, because we know that memory configuration can affect Lambda performance as well.

Luciano: You mentioned that you did some benchmarks. What are the results of these benchmarks? Is container the favorite option in terms of performance, or does zip still deliver better? Maybe cold starts versus execution time could be another dimension to explore as well.

Eoin: Sure, yeah. So in this benchmark application, we've got a CDK project that deploys all these permutations of the Python functions we mentioned. We try four different supported Python versions and four different memory configurations, from one gigabyte up to 10 gigs. And then we execute them all from a cold start situation. So we deploy it into a region where we haven't deployed these things before, and then we start invoking the functions.
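A minimal CDK (Python) sketch of those permutations could look like the following. The construct names, asset paths, handler, and the AWS SDK for Pandas layer ARN are illustrative assumptions, not the actual benchmark code.

```python
# Sketch of benchmark permutations: zip, zip + AWS SDK for Pandas layer,
# and container image, each at several memory sizes.
from aws_cdk import Duration, Stack
from aws_cdk import aws_lambda as lambda_
from constructs import Construct

class BenchmarkStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        for memory_mb in (1024, 2048, 4096, 10240):
            # Zip-packaged function with dependencies bundled into the asset
            lambda_.Function(
                self, f"ZipFn{memory_mb}",
                runtime=lambda_.Runtime.PYTHON_3_12,
                handler="handler.handle",
                code=lambda_.Code.from_asset("build/zip_package"),
                memory_size=memory_mb,
                timeout=Duration.minutes(2),
            )

            # Zip-packaged function relying on the AWS SDK for Pandas layer
            pandas_layer = lambda_.LayerVersion.from_layer_version_arn(
                self, f"PandasLayer{memory_mb}",
                "arn:aws:lambda:eu-west-1:336392948345:layer:AWSSDKPandas-Python312:8",
            )
            lambda_.Function(
                self, f"LayerFn{memory_mb}",
                runtime=lambda_.Runtime.PYTHON_3_12,
                handler="handler.handle",
                code=lambda_.Code.from_asset("src"),
                layers=[pandas_layer],
                memory_size=memory_mb,
                timeout=Duration.minutes(2),
            )

            # Container-image function built from a local Dockerfile directory
            lambda_.DockerImageFunction(
                self, f"ImageFn{memory_mb}",
                code=lambda_.DockerImageCode.from_image_asset("docker"),
                memory_size=memory_mb,
                timeout=Duration.minutes(2),
            )
```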

So we have a script that basically just invokes all the functions in parallel 2,000 times, or however many times we choose. And then we extracted all of the init durations from the logs of every single invocation, and we started plotting these using Jupyter Notebook and Matplotlib. So since we're talking about Python data science, we're using all the familiar tools here. Now, the initial results we get from the first invocation are pretty bad for images.
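For illustration, a measurement loop like that might look roughly like this sketch: it uses boto3 to invoke a function many times in parallel, reads the Init Duration from the REPORT line in the tail log, and forces a fresh round of cold starts by touching an environment variable (a trick mentioned a little further on). The function name and variable name are hypothetical.

```python
# Sketch of the measurement loop: force cold starts, invoke in parallel,
# and collect "Init Duration" values from the tail logs.
import base64
import re
import time
from concurrent.futures import ThreadPoolExecutor

import boto3

lambda_client = boto3.client("lambda")
INIT_RE = re.compile(r"Init Duration: ([\d.]+) ms")

def force_cold_start(function_name: str) -> None:
    # Note: this replaces the function's existing environment variables
    lambda_client.update_function_configuration(
        FunctionName=function_name,
        Environment={"Variables": {"BENCHMARK_RUN_ID": str(time.time())}},
    )
    lambda_client.get_waiter("function_updated_v2").wait(FunctionName=function_name)

def invoke_once(function_name: str) -> float | None:
    response = lambda_client.invoke(FunctionName=function_name, LogType="Tail")
    log_tail = base64.b64decode(response["LogResult"]).decode()
    match = INIT_RE.search(log_tail)
    return float(match.group(1)) if match else None  # None means a warm start

def benchmark(function_name: str, invocations: int = 2000) -> list[float]:
    force_cold_start(function_name)
    with ThreadPoolExecutor(max_workers=100) as pool:
        results = pool.map(invoke_once, [function_name] * invocations)
    return [r for r in results if r is not None]

if __name__ == "__main__":
    cold_starts = benchmark("python-benchmark-image-py312-1024")
    print(f"{len(cold_starts)} cold starts, "
          f"average init duration {sum(cold_starts) / len(cold_starts):.0f} ms")
```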

And this kind of proves the suspicion that most people have that the first time you end up with significantly worse cold starts for images. And the difference we're talking about is that for zip package functions, we're getting about four seconds of cold start. But for image package functions, it's more like nine seconds to begin with. So it's significantly worse. So this makes sense, I guess, but we run it again.

And the second time we run it, we can force a cold start again by changing the environment variables of the functions. We did that, but we also waited 90 minutes, just to be sure and let things settle. And the second time we invoke it, the results are completely the opposite. The zip-packaged functions' invocation times are still three to four seconds, pretty much the same as the first time.

But the image-packaged functions' cold starts have gone down to one to two seconds, mostly in the one to one-and-a-half-second range. So this is completely different. So we decided, okay, let's leave it, wait overnight, and try it again the next day. And the next day we try it again. Everything is cold starting from the start again, and we get the same results: images way faster, down to one second, one and a half seconds, whereas the zip-packaged functions are still between three and four seconds.

And by the way, the one with layers is a little slower than the one without layers when it comes to zip packaging. So I think this confirmed some of our suspicions in both senses, because we had also been hearing from other members of the community, and AJ Stuyvenberg has been doing a load of great research around cold starts as well. So it wasn't totally a surprise to us. It was kind of showing us that if you want to optimize your Lambda functions and you have heavy dependencies in Python or any other language, image deployments seem to actually be a great fit. You just have to exclude that first batch of invocations after you deploy your function.

Luciano: Yeah, I'm still a bit surprised about this result. And I'm wondering if you tried to figure out exactly what's going on behind the scenes to justify this better cold start performance of containers after, let's say, a second round of cold starts. So I'm wondering, what did you try in order to understand what's really going on? Maybe you tried different memory configurations. Maybe you tried to understand, is it something at the file system level that is happening? And maybe in general, you can walk me through the process of how you did these benchmarks?

Eoin: Yeah, I mean, the benchmark gathering process is pretty simple in that we had just an illustrative Lambda function that's generating some random data, doing some aggregations on it with pandas and then uploading it to S3 after using PyArrow to convert it into Parquet format. And this is pretty standard, I guess, in the Python data science world. So when it came to understanding the benchmarks and why we were getting this performance, bear in mind as well, some of your data analytics workloads might take 30 seconds or 90 seconds to run, in which case your 4 to 10 second cold start may not be that big of an issue.
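For context, the kind of function described here might look roughly like this sketch; the bucket name, key, and column names are placeholders, not the actual benchmark code.

```python
# Roughly the shape of the illustrative benchmark function: generate random
# data, aggregate it with pandas, and write it to S3 as Parquet via PyArrow.
import io
import os

import boto3
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

s3 = boto3.client("s3")
BUCKET = os.environ.get("OUTPUT_BUCKET", "my-benchmark-bucket")

def handler(event, context):
    # Generate random data and aggregate it
    df = pd.DataFrame({
        "category": np.random.choice(list("abcd"), size=100_000),
        "value": np.random.rand(100_000),
    })
    summary = df.groupby("category", as_index=False)["value"].mean()

    # Convert to Parquet in memory and upload to S3
    buffer = io.BytesIO()
    pq.write_table(pa.Table.from_pandas(summary), buffer)
    s3.put_object(Bucket=BUCKET, Key="summary.parquet", Body=buffer.getvalue())
    return {"rows": len(summary)}
```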

A workload like this was taking about 500 to 1,000 milliseconds to run, so the cold start was massive in comparison. We were able to figure out what was going on and why images were giving us better performance, because luckily the Lambda team at AWS wrote an excellent paper on it, called On-demand Container Loading in AWS Lambda. The link will be in the show notes. It's not a very long paper, but it is pretty interesting to read.

And the gist of that paper is that they've put together a whole set of optimizations around container image deployment that don't exist yet for zip package deployments. So I guess when they were building in the container image support, because they had to support larger file systems, they had to figure out how they were going to make this work without creating 30-second cold starts for people. And how they do that is pretty clever.

So when you deploy a container image function, it's not like a normal container runtime environment. They take all of your container image layers, your Docker image layers, and they flatten them all out into a flat file system. And then they chunk all of those files into roughly 500K blocks. And those blocks are encrypted, but they're encrypted in a special way so that if two customers deploy the same 500K block, because they're using similar shared dependencies, it only has to be cached once, which is really cool, right?

Because your private blocks are still going to be private, but they will recognize if people have got the exact same binary block of 500K. And that way, it doesn't have to deploy all of the common stuff that everybody's going to have across base images. You can imagine Linux shared libraries, other utilities, Python modules, node modules, whatever else it is, Java jars, they'll be cached, and everybody can benefit from that cache.

And within this whole caching system they built for container images, they've got a tiered cache. So caches exist in each Lambda worker, on the actual node where Lambda is running the code, but they also have an availability zone cache as well. So if chunks are not in the worker cache, it'll go to the AZ cache, and only if the AZ cache doesn't have the block will it go to S3, right? And the paper reports cache hit rates of 65% in the worker cache and actually 99% in the AZ cache. So this is why we're getting this massive performance benefit. So even if your container image is 10 gigs, it can benefit from this cache all the time, and it also benefits from the fact that in a 10-gigabyte container image, you don't need to load most of the files most of the time. There's only a subset you'll ever need. So they've got a virtual overlay file system that takes advantage of this, reads from the caches, and makes it highly performant.
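To make the idea concrete, here is a toy illustration, not AWS's implementation: content-addressed chunking plus a tiered cache lookup. The chunk size follows the roughly 500K blocks mentioned above, and everything else is heavily simplified.

```python
# Toy illustration of the paper's idea: chunk a flattened file system, address
# each chunk by a hash of its content so identical chunks dedupe across images,
# and look chunks up through a tiered cache: worker cache -> AZ cache -> origin.
import hashlib

CHUNK_SIZE = 512 * 1024  # roughly the 500K blocks described above

def chunk_ids(data: bytes) -> list[str]:
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    # Content-derived IDs mean two customers shipping the same bytes
    # (e.g. the same NumPy shared library) produce the same chunk ID.
    return [hashlib.sha256(chunk).hexdigest() for chunk in chunks]

def fetch_chunk(chunk_id: str, worker_cache: dict, az_cache: dict, origin: dict) -> bytes:
    # Tiered lookup: local worker first, then the AZ-level cache, then origin (S3)
    if chunk_id in worker_cache:
        return worker_cache[chunk_id]
    if chunk_id in az_cache:
        worker_cache[chunk_id] = az_cache[chunk_id]
        return worker_cache[chunk_id]
    data = origin[chunk_id]
    az_cache[chunk_id] = data
    worker_cache[chunk_id] = data
    return data
```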

Luciano: That's an amazing summary, and yeah, it's also very interesting to see that AWS has gone to the length of publishing a paper so that we can learn all these new techniques and see how they are building systems at scale, how they are building this amazing caching strategy. So I guess going through to the end of this episode, do you have a definitive recommendation on whether people should use zip or containers? Maybe not necessarily just in the data science space, but more in general?

Eoin: Yeah, zip packaging isn't as developer-friendly as container image packaging, but because AWS still takes responsibility for your runtime, I still think it's the preferred way initially, and only if you have a need like this should you go for something beefier. When you've got a lot of module dependencies, maybe even machine learning modules in your image, then you can think about using container images. And I would say it's not as bad as we thought it might be, and as long as you've got the security situation covered and you're patching and upgrading your base images regularly enough, I think it'll do fine.

It's probably also worth pointing out that because we did multiple memory configurations, we were able to see if there was much of an impact from changing the memory size on performance, and the answer is that there wasn't really, and there's probably an easy explanation for that. We know that when you run Lambda, you set the memory size, and the amount of CPU, network I/O, and disk I/O you get is linearly proportional to the memory you set.

So I think it's 1,769 megabytes of memory that is equivalent to one virtual CPU, and if you use less than that, you'll get less than one CPU. If you use 10 gigabytes, you'll get around six CPUs, but actually, in the cold start phase of a Lambda function, they always give you two CPUs. Even if you've only allocated 256 megabytes of memory, you'll still get two CPUs to kind of give you that extra boost while your function is cold starting.
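As a back-of-the-envelope check of that relationship, using the 1,769 MB per vCPU figure mentioned above:

```python
# Rough memory-to-CPU relationship: about 1,769 MB per vCPU, scaling linearly.
MB_PER_VCPU = 1769

for memory_mb in (256, 1024, 1769, 4096, 10240):
    print(f"{memory_mb} MB -> ~{memory_mb / MB_PER_VCPU:.2f} vCPUs")
# 10,240 MB comes out at roughly 5.8, i.e. about six vCPUs.
```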

So the memory configuration didn't really affect cold start performance. There were some minor discrepancies, but nothing impactful, at least for this set of benchmarks we ran. We also looked at the individual module load times, so we broke the benchmarks down into the time to load individual modules. Pandas, I think, is one of the worst ones for load time; it can take up to four seconds on Lambda without any optimizations. So all this load time is meaningful. We also looked at things like the time to initialize Boto3 and the time to initialize Powertools. These all had an impact. You can see the exact numbers in the benchmark reports. We've got lots of visualizations and tables of all the results in the blog post, but I think you don't have to worry too much about memory configuration for cold starts; it's more of an issue for the handler execution itself.
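If you want to see where your own import time goes, a simple sketch like this works; locally, `python -X importtime` gives a more detailed breakdown.

```python
# Time the heavy imports explicitly to see which modules dominate startup.
import time

def timed_import(module_name: str) -> float:
    start = time.perf_counter()
    __import__(module_name)
    return time.perf_counter() - start

for module in ("numpy", "pandas", "pyarrow", "boto3"):
    print(f"import {module}: {timed_import(module) * 1000:.0f} ms")
```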

Luciano: You already mentioned a few interesting resources, like the paper published by AWS. Is there anything else worth mentioning other than that paper, our source code, and our blog post?

Eoin: Yeah, actually, just before we published it, I was able to catch Heitor Lessa and Ran Isenberg's talk on pragmatic Python development in Lambda from re:Invent. It gave me some insight into some more tools you can use for Python performance analysis, and I'd definitely recommend you check that out. If I'd seen it beforehand, it might have changed the approach to this benchmarking. I mean, I think the results would still be the same, but they have some nice tools for visualizing import times in particular, and for runtime analysis as well.

I think another thing to consider with all this is that when we look at the import time performance for Python, we just looked at pandas and PyArrow and that sort of thing, but other modules can be even more significant. We didn't even deal with scikit-learn or PyTorch or any of those. When we think about SnapStart, the feature that was launched for Java Lambda functions last year, if that support comes in for Python as well, then you can imagine a world where, when you deploy your function, it can go through this module import time and then checkpoint it and freeze it.

Then when your function is invoked, the cold start won't have to include the module load time anymore. That's going to make a big difference if the Lambda team can introduce that support for Python as well. We've got a good few resources there. Luc van Donkersgoed has written a couple of great articles on cold start performance, which we'll link in the show notes. And we've also got the blog post, which gives all the details that we've gone through here, plus a lot more, with all the visualizations of the data, as well as the source code repository. So we do recommend that you check those out. And with that, please let us know if you've got any tips and tricks for optimizing Lambda functions, especially Python Lambda functions. Thanks very much for joining us, and we'll see you in the next episode.