Eoin: More people are getting into AI and running their own machine learning models. Whether it's generative AI, image processing, recommendation, or text-to-speech, the need for somewhere to run machine learning models is increasing. For many, they'll use hosted services and pre-trained models using something like OpenAI or AWS, but others want more control, improved data privacy, and might want to manage their performance, scalability, and cost at a more fine-grained level.
Today, we wanted to cover a slightly controversial choice for running machine learning models, and that's AWS Lambda. By the end of today's episode, you should have a clear idea of how and when Lambda can be used to run machine learning predictions and when to go for something a little bit more traditional. I'm Eoin, I'm joined by Luciano, and this is the AWS Bites podcast. AWS Bites is sponsored by fourTheorem, an AWS partner with plenty of experience running machine learning workloads in production. If you want to chat, reach out to Luciano or myself on social media. All the links are in the show notes. Now, back in episode 46, which was, "How do you do machine learning in AWS?", we talked about the various ways to run machine learning models in AWS, and we briefly covered there the idea of using Lambda for inference, and this is the specific topic we wanted to dive into in more detail today. As always, we should start with why. What are the use cases? Why do you need to run machine learning models?
Luciano: I think it's important to clarify that generally when we talk about machine learning infrastructure, there are two different categories of workloads. One is when you need somewhere to train and test models, and the other is when you already have a model and you need to run it somewhere. And when we say run models, we mean having some kind of infrastructure that can take inputs and run predictions or inference. So today we're going to focus only on the second category, which is inference. Training is generally something a little bit more complex. You need more specialized infrastructure, like many GPUs most of the time. You can also do training with CPUs, but it's generally much more limited.
It's going to take probably much longer, and depending on the type of model you are trying to build, it might limit you to the size of that model itself. So generally GPU is kind of the way to go when you're thinking about training, especially the more complex the model, the more you'll need to invest in infrastructure with lots of GPUs. So we focus today instead on inference, and it's something that can also benefit a lot from GPUs, but it doesn't always require using GPUs. You can also use, for instance, CPU. But let's talk about some use cases. One common use case is medical imaging.
For instance, if you want to run an automated diagnosis of an X-ray scan on demand, and maybe you have a few images every hour, maybe running a model on a CPU against a particular image may take one minute, and I think one minute delay on having a response is probably acceptable in that particular use case. You don't need an instantaneous answer with the diagnosis for that picture. You can probably wait one minute. Another use case is audio transcription, for instance, for video calls. Maybe you are using a system that records your video calls in your company, and you want to have a way to have automated minutes like transcriptions and summaries of that meeting. And also in that case, it's probably acceptable to have some delay. Maybe a process running on CPU takes like half an hour to produce all of that summary and transcript. It's probably okay to receive an email half an hour after the meeting with that document attached. Again, it's not a use case where you need an immediate answer for that particular task. And finally, I have another example, which is, for instance, you want to use word embedding models to index documents. Maybe you are building a SaaS platform where users can attach different kinds of documents, and you want to make this document searchable. And maybe you want to make it searchable through, for instance, like a chat UI where you're using some kind of Gen AI capability. And of course, you need to index all the documents in such a way that then the Gen AI can access it. So you're going to be using specific models to do all of that. And that might take a time, sometimes like, again, half an hour, one hour. So the documents will only be available for search after a little while. But for most use cases, this is probably acceptable as well. By the way, I mentioned the word embeddings. It's one of the new terms that comes around a lot when we talk about Gen AI. If you don't know what it is, don't worry, we'll cover that during this episode. 
Now, I said that these applications generally need specialized hardware, for instance, GPU. Should we spend a little bit more time clarifying what is the advantage that a GPU brings when compared to CPU?
Eoin: Yeah, the state of the art for most machine learning models uses deep learning. And deep learning is essentially employing deep neural networks. Neural networks have been around as an idea since, I think, the 1950s. And deep neural networks are kind of an evolution of that, that has become more popular in the last decade or so. And the idea essentially is trying to model how we think humans' brains work, or actually any other animal brains work, simulating the idea of neurons and synapses and connections between nodes in our brains. So deep neural network architectures can have thousands or even millions of nodes. And as they're trained, at least at a basic level, the connections between nodes obtain weights, right? And it's those weights that are the most important and the bulk of the parameters for a model. So when you hear people talking about model parameters, weights are generally the biggest number of them. And they need to be stored in memory.
So large memory requirements are typical. And the storage format is generally multi-dimensional vectors. You hear the term tensors a lot, and that's basically vectors of many different dimensions. And these are used to represent those networks and their parameters. And the operations that need to happen in the CPU or the GPU are generally fast floating point matrix mathematics. And that's why GPUs came into play, because GPUs were originally developed for graphics. That's where the G comes from. We've all had GPUs in our desktop PCs and laptops over time. They were particularly popular with gamers, then became popular with Bitcoin miners. Now the biggest demand is coming from machine learning, because GPUs have thousands of cores optimized for this kind of calculation and can run the many calculations in matrix operations in parallel. So they're highly optimized for this particular application. And they also have higher memory bandwidth for loading large data sets, which is quite important for performance in this space as well. We talk about GPUs, but of course, there are other AI chip technologies evolving that are not GPUs. AWS has its NeuronCore and NeuronCore 2 architectures, which you can utilize if you use the Trainium or the Inferentia instance types in EC2. And these are not GPUs, but are basically designed for more cost-effective and more power-efficient machine learning. So I think we might see a lot more of that. Google also has TPUs, the tensor processing units. So we may see more adoption of those options instead of just pure GPUs, as people make trade-offs and optimize for the availability of hardware, the cost and, critically, power consumption as well. So hopefully we've outlined why GPUs and other special cores are really well suited for machine learning. So then why are we doing a whole episode based around AWS Lambda and CPUs only? Why would you bother with CPUs?
Luciano: Yeah, that's one of our current complaints about Lambda: it doesn't support GPUs, at least not yet. But that doesn't stop us from running machine learning inside Lambda, which is what we are going to be covering for the rest of today's episode. But let's talk a little bit more about the trade-off between CPU and GPU. CPUs, compared to GPUs, are generally widely available and much cheaper. And when you run things in the cloud, you definitely have a lot more options with CPUs than with GPUs. You can run things locally as well, and it's generally easier to run on many developer environments because CPUs are a lot more widely available and standardized than GPUs. And CPU-based workloads can scale much faster: if you need to reach high levels of concurrency by spinning up multiple instances, like multiple machines or Lambdas or whatever, it's generally easier to do that if you think just about CPU. Because when you bring GPUs into the mix, generally either the cost becomes more prohibitive, or maybe you have limits that will stop you from spinning up thousands of instances, or the provisioning time might just be much higher than it is with CPU-based instances. So, again, this is kind of the trade-off. While with a CPU you don't have the power of a GPU for parallel matrix math, there are other things that you can use with CPUs to still get decent levels of performance. For instance, advancements in CPUs have brought us SIMD, single instruction multiple data, a class of CPU extensions that allows you to run vectorized operations. And one example of that is AVX2, which is also available in Lambda. This is an Intel CPU extension and has been in Lambda since 2020. So if you write software that can take advantage of these kinds of capabilities, you're still going to have pretty good performance on a CPU, and you don't necessarily need to use a GPU. There are other examples. 
For instance, for ARM processors, you have NEON, which is another extension that allows you to run SIMD. Now, even with CPU-only model execution, it's not a given that a GPU is always going to be faster. I think it really depends on the use case that you are trying to address and the amount of data that you might want to process, both in terms of the actual size of a single unit of data and in terms of how much data you can actually parallelize in one go. And we can make an example. For instance, let's say that we have a neural network that can process between one and 100 images in parallel in milliseconds on a GPU. Let's use this as a baseline. Now, if you take the same thing and put it in a Lambda, maybe you can run one inference with that Lambda in two seconds. So it is a little bit slower, but the advantage of Lambda is that you can much more easily run thousands of instances of that Lambda than you can run, for instance, SageMaker instances with a GPU. Also, if you run a SageMaker instance with a GPU, that instance is going to take minutes to spin up, while spinning up a Lambda generally takes seconds. So that gives you an idea that there might be cases where you can just take the power of parallelization and the fast bootstrap times of Lambda, and you might end up with something that can be even more convenient than having one or a few instances with a GPU. Those are going to be much faster at doing a single inference, but with all the bootstrapping time and the limited scalability they may be slower overall. So it's not always obvious to say that GPU is faster than CPU. I think there are lots of use cases where you can make trade-offs, and if you use Lambda with CPU, you can still come out ahead and find that this is actually a better approach than just spinning up GPUs. So what do you need to get up and running? 
Are we going to think, for instance, about Python Lambda functions with PyTorch or TensorFlow, or something else?
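As a back-of-the-envelope illustration of the comparison Luciano describes, the sketch below estimates how long a batch of images takes on one GPU instance versus many concurrent Lambda functions. Almost every number is an assumption chosen for illustration (20 ms per image on the GPU, a five-minute GPU instance start-up, 1,000 concurrent Lambdas with a five-second cold start); only the two seconds per inference on Lambda comes from the discussion. The function name `batch_wall_clock_seconds` is hypothetical.

```python
import math


def batch_wall_clock_seconds(n_items, per_item_ms, workers, startup_s):
    """Estimate wall-clock time: start-up, then items spread evenly over workers.

    Simplification: all workers start at the same time and each one processes
    its share of items one after another.
    """
    rounds = math.ceil(n_items / workers)
    return startup_s + rounds * per_item_ms / 1000


N_IMAGES = 10_000  # assumed batch size

# Assumed: a single GPU instance, 20 ms per image, 5 minutes to provision.
gpu_seconds = batch_wall_clock_seconds(
    N_IMAGES, per_item_ms=20, workers=1, startup_s=300
)

# From the discussion: 2 s per inference on Lambda.
# Assumed: 1,000 concurrent executions and a 5 s cold start.
lambda_seconds = batch_wall_clock_seconds(
    N_IMAGES, per_item_ms=2000, workers=1000, startup_s=5
)

print(f"GPU instance: {gpu_seconds:.0f} s, Lambda fleet: {lambda_seconds:.0f} s")
```

The outcome depends entirely on the inputs: GPU batching, lower Lambda concurrency limits, or an already-running GPU instance all shift the balance.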
Eoin: Well, Python is supported by pretty much every model and framework out there, so it's probably your go-to when you're getting started. As we mentioned in a very recent episode, a lot of the Python libraries are very heavy and can lead to longer deployment times and initial cold start times, so we'll have that link in the show notes. I would say that the space is fast evolving. It's almost like the machine learning framework space is a little bit like front-end frameworks about five years ago, where it's just moving so fast and new ones are coming out all the time. But maybe before we get into that tooling, we can talk about an extreme example and kind of play with this idea a little bit. Since Gen AI is all the rage, can you actually run large language models on AWS Lambda? I mean, surely not is probably the default response to that, but there are lots of open source models out there and people might want to take advantage of open source models to run things in a private way, just for their own experimentation or to really focus on data privacy and security, and they will be thinking about how to optimize the infrastructure then. And we're talking about open source models like Llama from Meta, or Mistral, or the new Microsoft one, Phi-2, or even Stable Diffusion for images. Yes, generally the requirements to run these models are huge, but not always. And when we hear people talking about these models, they generally talk about the number of parameters in their model.
When Meta released the Llama 2 model, an open source large language model comparable to GPT-3.5 or GPT-4 in some ways, it was released with three different parameter sizes: 7 billion, 13 billion and 70 billion. And there are models out there with hundreds of billions of parameters. So what does that mean in terms of the resources you need? Well, it depends on the numerical precision of the model. You might have a model that's using 32-bit floating point values. That's four bytes per parameter, so your memory requirement is going to be the number of parameters times four. So if you have that 70 billion parameter Llama model with 32-bit precision, that's 280 gigabytes of memory to run it. So you need a pretty high end GPU or you need to start thinking about parallelizing over multiple GPUs. And even for the 7 billion parameter model, you're talking about 28 gigabytes. So it's quite a lot. But since resources are constrained, and not everyone who's enthusiastic about this space has access to GPUs with that kind of memory, the community is putting a lot of effort into getting pretty good results with fewer parameters and lower precision parameters as well. So if you imagine using four-bit integers instead of 32-bit floating point, this is a process called quantization, where you convert the model into a lower precision one. All of a sudden, you can take the 70 billion parameter model, or sorry, a 4 billion parameter model and run it in two gigs of RAM. So by tweaking both of those factors down by a significant amount, you can still get pretty good performance. And when I'm talking about performance, I mean accuracy of the models and the inference results. Just because you're scaling down by a factor of 10 or more, it doesn't mean that you're scaling down accuracy linearly. Often you can get almost as good accuracy, depending on the use case and the model.
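The memory arithmetic described here is simply parameters times bits per parameter, divided by eight to get bytes. The sketch below tabulates that for a few model sizes and precisions. It is a rough estimate under stated assumptions: it counts the weights only, ignores activations and runtime overhead, and uses decimal gigabytes. The function name `model_memory_gb` is hypothetical.

```python
def model_memory_gb(n_params, bits_per_param):
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9


MODEL_SIZES = [
    ("4B", 4_000_000_000),
    ("7B", 7_000_000_000),
    ("13B", 13_000_000_000),
    ("70B", 70_000_000_000),
]

for name, n_params in MODEL_SIZES:
    sizes = ", ".join(
        f"{bits}-bit: {model_memory_gb(n_params, bits):.1f} GB"
        for bits in (32, 16, 4)
    )
    print(f"{name} -> {sizes}")
```

Lambda functions can currently be configured with up to 10 GB of memory, so only the smaller, heavily quantized combinations in this table are realistic candidates.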
Since GPT and ChatGPT came out, a lot of the most exciting developments in the whole LLM space have been the development of these quantized models and the performance you're getting. Now, of course, running it on CPU is rarely going to be as fast as GPU. There are a lot of factors that can affect performance, so it's difficult to say with any certainty, but 100 times slower performance on CPU, like in the example you gave, Luciano, is not unexpected. That's quite typical.
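To make the idea of quantization a bit more concrete, here is a deliberately simplified sketch that maps a small list of floating point weights to 4-bit signed integers plus a single shared scale, and then maps them back. This is an illustration only and not the format any particular library uses; real quantization schemes are more elaborate. The function names are hypothetical.

```python
def quantize_4bit(values):
    """Map floats to integers in [-7, 7] plus one shared float scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7 if max_abs else 1.0
    # Clamp to guard against floating point results just outside the range.
    quantized = [max(-7, min(7, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.33, 0.7, -0.05]  # toy values, not from a real model
quantized, scale = quantize_4bit(weights)
restored = dequantize(quantized, scale)

for original, approx in zip(weights, restored):
    print(f"{original:+.2f} -> {approx:+.2f}")
```

Each restored value is within half a quantization step of the original, which is the precision traded away in exchange for storing roughly an eighth of the bits of a 32-bit float.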
Back to the tooling then. So we talked about Python and we know about TensorFlow and PyTorch. We talked a bit about those in the previous episode, but a lot of work has now been done in creating native frameworks and implementations, so machine learning frameworks that don't need all of the Python interface, or need a much lighter Python interface. And llama.cpp was one of the first of these. It was started by Georgi Gerganov, who then also went on to create ggml, which is a pure C machine learning library designed to make machine learning models accessible on commodity hardware. So if you want to run machine learning models on your Apple Mac ARM processor, like an M1 or an M2, you should really look into this, because as well as good support for plain CPU execution, it also has good support for Apple silicon GPUs. And the ggml framework is now a company, ggml.ai, and it has funding to develop it further, which is good news for us, I think. And it seems like a pretty good fit for Lambda because you can build really small package sizes that are really fast to deploy and pretty good on cold start time as well. And then there are bindings available for different languages. So you're not glued into the Python ecosystem if you don't want all that heaviness, or you're just not a Python fan. And this ggml framework will adapt to different CPU and GPU architectures, depending on where you want to run it. So it's easily portable from your local development environment into runtimes like container runtimes or Lambda. And Georgi Gerganov has also created a lot of quantized versions of the models in the format required by ggml. So instead of having the 32-bit or 16-bit floating point versions, you have four or five or eight-bit integer versions that you can use to reduce your memory consumption. There are other alternatives apart from ggml, like the ONNX runtime, but we haven't used it directly. 
We have been working with ggml and experimenting with it over the past few months. And we found it pretty useful. And while I wouldn't say we're ready to deploy it at scale in production, we've had some pretty interesting results. So Luciano, would you want to take us through some of the, I guess, examples of Lambda for machine learning we've been doing, at least the publicly citable ones, over the past few years?
Luciano: Yes. One that we mentioned before in this podcast is the way we create the transcripts for our podcast, which is basically using SageMaker. And the startup performance of that is a little bit of a pain. It's not a deal breaker because, again, we don't really need an instantaneous response, but we were a little bit curious to check what the difference is. What are the trade-offs if we try to run the same thing on Lambda? Can we do anything better? What are the results? Is it going to be cheaper? So what we did was basically experiment and try to figure out exactly what we could achieve and what kind of results we could get.
And we were very pleased to see that Gerganov also created a version of Whisper, which is the model from OpenAI that we use to do the transcriptions. And this version has also been ported to C++, with bindings for lots of languages, for instance, Node.js, WebAssembly, Rust. There is actually an amazing demo that Georgi created of running this model in the browser using WebAssembly, and it's available online. We will have the link in the show notes if you want to play with it. So that kind of shows that once you bring the model to C++, it opens up a bunch of use cases that are not always so easy to access when you just have models in Python. And this is a use case that we have worked with. And I think we were very pleased with the results. It seems a good trade-off: the inference is a little bit slower, but of course, it's much faster to bootstrap the environment. And I think we might spend a little bit more time in future episodes talking through the details of this experiment. This is still very early on for us, so we're still trying to figure out exactly whether the trade-offs are worth it for this particular use case. There are other interesting use cases that we have been playing with. For instance, you can run LLMs, for instance Llama, in Lambda. You can ask questions and it generally takes about 30 seconds to give you a response. So maybe it's not necessarily the best use case, because when you use an LLM to create a kind of chat-based interface, you want to have a more real-time type of answer. And Lambda tends to do everything in a kind of batch approach, where it processes everything and creates the response object, and only then do you get the response object that you can use in your frontend. So you see those 30 seconds of delay, and it's a little bit painful to use for that particular use case. 
And there are other use cases that we have been working with. Actually, the oldest one was four years ago when Lambda container image support was announced. We were able to create a demo where we were embedding one of these models to do x-ray analysis. And we were able to run 120,000 x-ray images in about three minutes, which is pretty impressive. We have a repository with all the examples and the code to run all of that and we will have a link for that in the show notes. Is there any other use case that comes to mind, Eoin?
Eoin: Something that we've been looking at a lot recently, within the Gen AI space, is retrieval augmented generation, or RAG. It's becoming very common and it's one of the areas where Lambda might play a role. Just a quick overview of what RAG is. We also mentioned the concept of text embeddings and promised that we'd define it. So the reason RAG has become popular is that LLMs like ChatGPT have a limited context window, that is, a limit on the input size you can put into a prompt.
So if you want to query all of your company data and get a factual response processed by an LLM, so it's got good language in the response, you can't just put all your company data with a question into a prompt. It's too much data. It's not going to work. So RAG is one of the solutions to address this. Instead of putting all the data into the prompt, you retrieve relevant sections of your company data from a knowledge base and then put those sections as context into your LLM prompt, allowing you to get effective answers and summaries.
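The "retrieve relevant sections, then put them into the prompt" step can be sketched in a few lines. This is a minimal illustration under stated assumptions: the snippets are already retrieved and sorted by relevance, the context budget is measured in characters (real context windows are measured in tokens), and the prompt wording and the function name `build_rag_prompt` are hypothetical.

```python
def build_rag_prompt(question, snippets, max_context_chars):
    """Assemble a prompt from retrieved snippets without exceeding a size budget.

    Snippets are assumed to be sorted from most to least relevant, so the
    loop stops at the first snippet that no longer fits.
    """
    selected = []
    used = 0
    for snippet in snippets:
        if used + len(snippet) > max_context_chars:
            break
        selected.append(snippet)
        used += len(snippet)
    context = "\n---\n".join(selected)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
    )


# Toy knowledge-base snippets, invented for this example.
prompt = build_rag_prompt(
    "How many days of annual leave do new employees get?",
    [
        "New employees receive 25 days of annual leave.",
        "Annual leave requests need manager approval.",
        "The office is closed on public holidays.",
    ],
    max_context_chars=100,
)
print(prompt)
```

With a 100-character budget only the first two snippets fit, which mirrors the point that the whole knowledge base cannot go into the prompt.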
And because you're using a real knowledge base as your context, there should be a much lower chance of hallucination or fiction in your response. Now, in order for RAG to work, you generally need to index your documents first and put them in a repository. So this could be a traditional lexical search, like with Elasticsearch or similar, but a more common approach now is to use a text embedding model. This is similar to any other LLM, but it's basically just creating a multi-dimensional vector representation of the text in documents. And by having that multi-dimensional vector stored, you can then do a semantic search on all of that textual data, because it's in a numerical format. You can do something like a k-nearest neighbors (KNN) search to find the parts of documents most similar to the question, retrieve those snippets, and then put them into the context as part of the prompt. And that's the whole idea of RAG, or retrieval augmented generation. And now OpenAI, Bedrock and many more have text embedding models for you to do that, which you can then use alongside the chat models. And when you use a model to create a vector embedding, you'll then store it in a vector store. You could use Postgres with the pgvector extension. You can use just S3 with Meta's FAISS library. Then there are other third-party solutions like Pinecone and Momento. And then you can perform those semantic searches when you have a query. So the LLM chat part of that is fairly straightforward, but you need to think about what you do when you've got lots of documents coming into your company's knowledge base and you need to asynchronously process them and add them to your vector store. And so this is kind of a sporadic, bursty activity that doesn't really require real-time performance, and a Lambda with a reasonably sized text embeddings model could work pretty well for that. 
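The semantic search step described here boils down to comparing the query's embedding with the stored document embeddings and keeping the closest ones. The sketch below does this by brute force with cosine similarity over an in-memory dictionary. The three-dimensional vectors and document names are invented for illustration (real embedding models produce hundreds or thousands of dimensions), and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_documents(query_vector, vector_store, k):
    """Return the k (doc_id, score) pairs most similar to the query."""
    scored = [
        (doc_id, cosine_similarity(query_vector, vector))
        for doc_id, vector in vector_store.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy vector store: in practice these vectors come from an embedding model.
vector_store = {
    "invoice-policy": [0.9, 0.1, 0.0],
    "holiday-policy": [0.1, 0.9, 0.1],
    "expenses-guide": [0.8, 0.2, 0.1],
}

query_vector = [1.0, 0.0, 0.0]  # stands in for the embedded user question
for doc_id, score in nearest_documents(query_vector, vector_store, k=2):
    print(f"{doc_id}: {score:.3f}")
```

A brute-force scan like this is fine for small collections; dedicated vector stores exist because they index the vectors so that searches stay fast as the number of documents grows.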
So I think this is one of the areas where you might find Lambda being used. Even though you might end up using a fleet of GPU instances or similar for the knowledge base search or for a chat interface for your company's knowledge base, you might be able to offload all of the embeddings generation to a Lambda. Now of course this is a little bit of an advanced optimization. Bedrock on AWS can do text embeddings and LLM predictions in a serverless way. We talked about that in our Bedrock episode, but you're limited there to the available models and you need to consider pricing and quotas. So if you want to use an open source model it's a little bit more difficult, and that's why you might go more of a custom route. So check out that previous Bedrock episode if you want a simpler solution. We know, I suppose, that these Lambda experiments we're talking about are kind of specialist. They're quite experimental and not necessarily ready for prime time, but it's still really interesting, and I definitely recommend that people who are just interested in the space check out frameworks like ggml, llama.cpp and whisper.cpp, especially if you're running stuff on a Mac or your laptop in general. If you don't want to go put all your data in OpenAI, there are also other great frameworks on top of them, like PrivateGPT and LocalGPT, which can run pretty well on a Mac or similar hardware and give you that ChatGPT-like experience, but all within the safety of your own development environment. I think that's generally the conclusion time for this episode. While these experiments are interesting and a little bit of fun, it kind of remains to be seen whether Lambda can be an important service in the Gen AI space. But for other, more tried and trusted ML applications, doing inference in Lambda can definitely simplify things, save costs, and give you great scalability and performance as well. But let us know what you think as always. 
Are we losing the plot a little bit going left of field with Lambda for ML or have you also had good results? Thank you for listening and we will catch you in the next episode.