Help us to make this transcription better! If you find an error, please
submit a PR with your corrections.
Luciano: If you're curious about building with LLMs, but you want to skip the hype and learn what it actually takes to get something working in the real world, this episode is for you. We have been building a lot of LLM-powered applications lately, both for ourselves and with customers. And I'm talking about workflow automations, smart data pipelines, query generators, AI-powered dashboards, that kind of stuff. And along the way, we picked up a pretty good collection of battle scars, I'd say.
So what we learned is that the hard part is not getting a demo to work. The hard part is actually making it reliable, predictable, and affordable in production. So that's what we are here to share with you today. And this trend isn't slowing down. We feel that almost every new project that comes through our door has some kind of AI component that needs to be baked into it. So it's not really a question of "should we use an LLM here?" anymore.
It's more questions like: which model do we pick? How do we call it? How do we make it trustworthy? How do we keep the bill under control? And if you're building on AWS, you know that a lot of these questions lead to Amazon Bedrock. So today we are sharing what we learned about running LLM inference on Bedrock: what works well, what surprised us, and the gotchas that nobody warns you about until you find yourself debugging, hopefully not at 11pm on a Friday.
So today we'll start with a quick definition of what an LLM is and what we mean by inference, then cover the kinds of AI-powered applications we have been building, why AI is a lot more than just gen AI, and what we mean by the word agents. And finally, we'll talk in a little more detail about Bedrock, what it is for, and the different gotchas that we learned about. My name is Luciano, and I'm joined by Eoin for another episode of AWS Bites. So maybe we can start this episode by giving a quick recap of what an LLM is and what we mean by inference. I think a lot of people might be familiar with some definition of those, but it's probably worth giving our own view on this.
Eoin: Let's start with LLM, Large Language Model. We may know that this is a type of neural network trained on huge amounts of primarily text data. They learn statistical patterns in language and can generate remarkably coherent, context-aware text. And the landscape of available models is growing really, really fast. You might know OpenAI's GPT family, Anthropic's Claude, Google's Gemini, Meta's Llama, Amazon's Nova. Then there's Mistral, DeepSeek, Qwen, GLM, and MiniMax.
The number of these really capable models keeps increasing, and that's great news for builders. So then we talk about inference quite a lot, and that's when you use a trained model to generate some output. Training is really expensive. That's the long process of teaching the model. This is what model providers do, and it's not cheap. We're talking about millions or billions of dollars in compute, data, and research.
There's a reason all these companies need very deep pockets or very persuasive pitch decks. Inference is what we do as users or developers, generally. We send a prompt, which is input text, and the model generates a response. And here is an imperfect analogy to understand the business of LLM providers. Training is like spending years in medical school racking up enormous student debt. Inference is the doctor seeing patients and hopefully making that investment back one consultation at a time.
We're the ones booking the appointments. Now, the analogy breaks down a bit in a few interesting ways. In reality, training an LLM takes weeks or months, not a decade of studies. But you do need a lot of GPUs. And unlike a doctor who specializes in one field, an LLM is more like getting degrees in medicine, law, engineering, creative writing, and a dozen other fields all at once. So that's what makes them so versatile and why they're showing up in a lot of different applications.
And then there's tokens. You might have heard a lot about tokens. We've covered it before, but LLMs don't think in words, they think in tokens. And a token is, you could say it's roughly about four characters, maybe three quarters of a word in English, something like that. That's just kind of a rule of thumb. And a detail that's easy to overlook is that different models tokenize text differently. So the same sentence might be 20 tokens in one model and 25 in another, depending on how each model's tokenizer splits the text. This can matter more than you think because you pay for inference based on input tokens and output tokens. So when you're comparing pricing across models, cheaper per token doesn't necessarily mean cheaper for the same job if one model uses more tokens to represent the same text. But tokens are the fundamental unit of cost, rate limiting, and context windows in the LLM world. So it's good to get comfortable with this concept because it comes up everywhere. So Luciano, given that all this stuff is everywhere now, what are we building?
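Before moving on, here is a rough sketch of the rule of thumb just mentioned: estimating tokens from character counts, and why "cheaper per token" needs checking for your actual workload. The ~4 characters per token ratio is only a heuristic and the per-1K prices below are invented, so treat this purely as a ballparking aid.

```python
# Rough token and cost estimation using the ~4 characters per token
# rule of thumb. Real tokenizers differ per model, so this is only a
# ballpark; the prices are hypothetical, not from any pricing page.

def estimate_tokens(text: str) -> int:
    """Very rough estimate: ~4 characters per token in English."""
    return max(1, round(len(text) / 4))

def estimate_cost(input_text: str, expected_output_tokens: int,
                  price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Estimate the dollar cost of a single inference call."""
    input_tokens = estimate_tokens(input_text)
    return ((input_tokens / 1000) * price_in_per_1k
            + (expected_output_tokens / 1000) * price_out_per_1k)

prompt = "Summarize the following support ticket in two sentences. " + "x " * 500
# Hypothetical prices: model B is cheaper per token than model A.
cost_a = estimate_cost(prompt, 200, price_in_per_1k=0.003, price_out_per_1k=0.015)
cost_b = estimate_cost(prompt, 200, price_in_per_1k=0.001, price_out_per_1k=0.005)
print(f"model A: ${cost_a:.4f}, model B: ${cost_b:.4f}")
```

Remember that two models may also tokenize the same text into different token counts, so for a fair comparison you would estimate tokens per model, not just compare unit prices.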
Luciano: I guess, first of all, it's worth clarifying because I think at this point, almost everyone has used things like Cursor, Copilot, Claude Code, Codex, whatever, all these AI coding agents. And I think it's worth just saying that this is the technology we are talking about. LLMs are effectively what's powering all of these tools. And that's one type of use case: coding agents that you use whenever you're writing some code, to get assistance from the intelligence and knowledge that has been baked into the model.
But we see a lot of use cases where we are embedding LLMs directly into applications. These are applications we build for ourselves while we experiment and learn more about this technology, but increasingly also for customers, and we've already built a few examples. Just to give you an idea, one of the things we have been building is smart data transformation pipelines. So you can imagine that there is an LLM component that helps the user describe what they want in natural language.
For example, merge these two example datasets, normalize them in some kind of way, flag duplicates. And then the system, using the LLM, is able to convert that natural language requirement into reproducible, deterministic code that can be baked into a pipeline and reused over time. So effectively, it's almost like giving somebody who doesn't necessarily know how to code that pipeline themselves an easier door: they can describe what they want, get a preview of the results, slowly converge to something that actually does what they want to achieve, and then save that as reproducible code that can be reused later on.
So imagine almost like a notebook, but rather than writing code, you use LLMs to get to the final version of the code that you want to use. And this is just one example. Another one that comes up a lot is data analytics. Being able to generate queries, for example for Athena, is one we have done recently. But you can extend that idea to other databases like Redshift, Postgres, Elasticsearch. And the idea is, again, if you don't know the specific language that is required to query the data, you can use an LLM to convert a natural-language description into that query.
For example: I want to know the top 10 spenders in this e-commerce store, give me their usernames. An LLM should be able to do a good job of converting that into a specific query for a database, and then you can execute that query and give the results to the user. So again, it's always about trying to lower the barrier of entry for a specific technology, so you can use more natural language and have the LLM do the hard work of converting that language into something more specific, whether that's converting data in some way or making queries. But there are other use cases.
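To make the query-generation idea concrete, here is a hedged sketch of how such a prompt might be assembled. The schema, helper name, and rules are invented for illustration; a real system would inject your actual table definitions, and it should validate the generated SQL before running it.

```python
# Sketch of a natural-language-to-SQL feature: give the model the schema
# and strict output instructions, then run the generated query with your
# own deterministic code. Schema and wording here are hypothetical.

SCHEMA = """
CREATE TABLE orders (username VARCHAR, amount DECIMAL, created_at TIMESTAMP);
"""

def build_sql_prompt(user_request: str) -> str:
    return (
        "You are a SQL generator for Amazon Athena.\n"
        f"Schema:\n{SCHEMA}\n"
        "Rules: return ONLY a single SELECT statement, no explanations "
        "and no data-modifying statements.\n"
        f"Request: {user_request}"
    )

prompt = build_sql_prompt("Give me the usernames of the top 10 spenders")
print(prompt)
```

The important design choice is that the LLM only produces text; your application decides whether and how to execute it, which keeps the fuzzy part (understanding the request) separate from the deterministic part (running the query).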
Other examples: we can automatically generate dashboards based on some of the data pipelines that I mentioned before. Another example we built is a system capable of understanding the type of data, using an LLM, picking out some of the more relevant metrics, and creating dashboards with charts that make the most sense. And again, it's not always perfect.
The LLM can generally do an average job, and then you still want a human to maybe go in and refine. But we are seeing lots of use cases like these ones. Other ones we have seen online are, for example, customer support automation, where you can have a chatbot that helps the user start asking questions and eventually routes them to an actual human agent who can help them perform certain actions.
Or sometimes it's the LLM itself that can take certain actions on behalf of the user. Another one is document processing. So imagine something like an OCR process, but much smarter, because a classic OCR will just give you the plain text, whereas when you combine OCR with an LLM, you can actually ask questions of documents and get structured data out of them. So, yeah, before we get carried away with all these examples, I think it's fair to remark that these LLMs are not magic, and they are not necessarily the right tool for everything.
In general, I think LLMs are good at taking this kind of requirement expressed in human language, where you type something to describe what you want to achieve, and having the LLM convert that requirement into specific actions that make sense for the system you're building. So basically, understanding text is one of the main superpowers of LLMs. But they are not good at anything that needs to be deterministic and precise, because they are probabilistic by nature.
So if you ask the same question twice, you will probably get a slightly different answer the second time. Sometimes these LLMs can hallucinate, which is when they confidently state something that is not necessarily true, and they cannot necessarily do arithmetic very well, so sometimes they will make mistakes there. The classic example is asking how many Rs there are in the word strawberry; most of the time you might get a wrong response.
This used to be one of the common jokes when LLMs came out. I think they are getting better, but the point is that there are lots of things that LLMs are good at and many things that they are bad at, so don't try to use them for everything. Try to understand what is good about them, what is bad, and then pick them only for the right use cases. And I think the key principle is try to use LLMs for everything that is a little bit fuzzy.
Again, as a human, you are trying to describe something, and you want the system to understand that description and act on it. But for all the precise parts, I think you should still use code and more regular automation to achieve those results in a more predictable way. In general, the last point I want to make here is that this is going to become a standard building block, like many others we have been using throughout the years. As with any other building block, the key is to understand the patterns, the good things, the bad things, and the common problems, and hopefully today we are going to cover some of that. Anything you want to add on this part, Eoin?
Eoin: Yeah, I think it's definitely recommended to experiment frequently, and not to assume, if you haven't already experimented a lot, that you can just pick up a feature request, use LLMs in production for the first time, and it'll be smooth sailing. I think we've seen statistics that the vast, vast majority of these projects are not making it to production right now, for a whole host of reasons. Maybe people don't have the right data, they didn't have the right use case in mind, or the results just aren't effective enough to meet the use case that was envisaged.
We find more success where the use case is very well defined and simple. And it's a good idea, I think, in general, to focus on areas where you're spending a lot of time and might benefit from this level of automation, but on simple things, rather than assuming that AI is so intelligent that you can throw the most complex problem you have at it, which usually ends in failure. We can maybe talk a bit about AI and Gen AI and what we mean by agents as well.
AI has become shorthand for Gen AI in popular conversation, but of course, AI is much broader and has a much longer history. Traditional machine learning, like classification, regression, and anomaly detection, is still AI and still incredibly useful. Computer vision, speech recognition, recommendation engines: they're all AI, but not necessarily Gen AI. And AWS has a whole ecosystem of services for the more traditional AI angle, like SageMaker, Rekognition, Textract, et cetera.
We won't cover any of those today, but they're still worth knowing about; it doesn't always have to be LLM-based. So Gen AI is specifically about generating new content like text, images, code, audio, and video. And when we say LLM inference in this episode, we're talking specifically about Gen AI. Now, inference in practice is, at the simplest level, text generation. You send the model a prompt, and then you get an answer, which we call a completion.
But increasingly, we're also talking about agents and agentic workflows. And these are really, I suppose, more sophisticated loops where the LLM can try to reason, or simulate reasoning, plan, and take actions. It's like an orchestration of multiple steps of an LLM, really. So what do we mean when we talk about agents? An agent is basically a smart loop. Rather than hoping that you get a good completion back, the LLM receives a task, decides what to do, uses tools, observes the results, and iterates.
And this is one of the main things that makes agentic LLMs so powerful: the actual tools that they can use. With tools, you can expand the LLM's capabilities far beyond just generating text. We know that generating text alone is subject to hallucinations and errors, but by combining LLMs with access to deterministic tools, they can actually become very powerful. So the LLM itself still just generates text. It describes what tool to call and with what parameters, but then your code executes the actual tool and feeds back the results.
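The loop just described can be sketched end to end. Everything below is illustrative: the model is replaced with a canned stub and the tool name is made up, so the loop runs without any API access, but the shape (model proposes a tool call as text, your code executes it, the result is fed back until the model answers in plain text) is the same one a real agent uses.

```python
# Minimal agent loop: the LLM only produces text (here a JSON tool
# request), our code executes the tool, and the result is fed back
# until the model produces a final plain-text answer.

import json

def fake_llm(messages):
    """Stand-in for a real model call: first asks for a tool, then answers."""
    if not any(m["role"] == "tool" for m in messages):
        return json.dumps({"tool": "get_weather", "args": {"city": "Dublin"}})
    return "It is 12C in Dublin, bring a jacket."

TOOLS = {
    "get_weather": lambda args: {"city": args["city"], "temp_c": 12},
}

def run_agent(task: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = fake_llm(messages)
        try:
            request = json.loads(reply)   # model asked for a tool call
        except json.JSONDecodeError:
            return reply                   # plain text means a final answer
        result = TOOLS[request["tool"]](request["args"])
        messages.append({"role": "tool", "content": json.dumps(result)})
    return "gave up after too many steps"

print(run_agent("What's the weather in Dublin?"))
```

Real frameworks add structured tool schemas, retries, and guardrails around this loop, but the core control flow is this small.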
And you can write tools that do virtually anything: check the weather at a given location, look up a customer record in a database, call a third-party API, read files, run code, or trigger complex workflows on behalf of a user. And this is what turns an LLM from a fancy autocomplete into something that can actually take actions in the real world. With this power comes a lot of responsibility, and safety boundaries are required to prevent the agent from doing things it shouldn't.
So serious guardrails are required to stop it from executing destructive actions, leaking personal information, or going off topic. If you're doing this in the AWS world, defining IAM roles with very minimal permissions helps a lot, as do your network boundaries. And when you have an agent, there are also considerations around memory and context. The agent has to maintain state across steps, building up context as it works through a problem.
So you can think of it as LLM, plus tools, plus the loop, plus context management, plus guardrails. Put all those things together and you have an agent. In practice, there are loads of frameworks that help you build these patterns, like LangChain, Strands from AWS, the Vercel AI SDK, and there are plenty more in every language. Plus, AWS has its own Bedrock Agents feature. We're not going to dive deep on the whole agentic side of things today, or on the AWS services specifically built to host and run agents at scale. That probably deserves its own episode, but it is important to understand what we generally mean by agentic, because it shapes how you think about inference. It's not just one prompt in, one response out. There are loops, tools being called, and context building up, and it all translates to more tokens, more latency, and more things to think about when you're setting up your infrastructure. With all of this context in mind, let's talk about Bedrock, what it is, and why it exists.
Luciano: Yeah, exactly. So I think at a high level, we can think about Bedrock as the AWS managed service for accessing foundation models via API. So you can imagine it as a unified set of APIs for calling hundreds of models from many providers, like we mentioned already: Amazon, Anthropic, Meta, Mistral AI, DeepSeek, OpenAI, and more. And through this unified API, you basically get a few interesting things.
The first one is that you don't need separate accounts or API keys for each provider because Bedrock is kind of your own central place. And it runs within the AWS ecosystem, which means that you can also use IAM for authentication. You can use CloudWatch for monitoring. You can use VPC endpoints if you want to keep everything as private as possible and not have traffic going through the public internet. You can use CloudTrail for auditing.
So all the nice and convenient things you generally use when you build production-ready systems on AWS. Of course, there is an alternative. You are not forced to use Bedrock. You could use the APIs of the different providers directly: OpenAI has its own API, Anthropic has its own API. Pretty much every provider that offers a cloud version of their model gives you access to it through an API.
So you can just go to them, create an account, and call their API directly. And I think this is not too bad; it probably works fine. If you're doing prototyping and small projects, it might actually be a little bit simpler than getting started with Bedrock, which probably comes with a little bit of extra complexity, especially if you're not too familiar with AWS. But then you need to know what you're missing out on, because if you want to go production-ready, what Bedrock gives you is probably worth the initial effort of learning it and all the tools that come with it.
And just to give you a few examples, you get security and compliance, and this is probably one of the main selling points, especially if you're working in industries where it is important to respect the privacy of user data. What Bedrock guarantees is that data stays within your AWS account boundary. You can pick specific regions where the inference runs. So if you have legal requirements where you need to make sure that your data never leaves a specific region, like Europe, for example, you can do that through Bedrock.
Data is encrypted in transit and at rest. Effectively, there is an agreement between AWS and the model provider that they will never use the data you send to the models for additional training in the future. This is probably one of the biggest selling points for Bedrock. So effectively, you can trust Bedrock a little bit more than having to go through the agreements you would get with each individual provider, which probably come with very different terms and conditions.
So if you want to test, for example, both Anthropic and OpenAI models, you probably need to go and read through the two different agreements and understand if they will work out for you. While with Bedrock, you have a more unified experience. Once you understand the guarantees, you have a system that allows you to try different models. Then we already mentioned governance, because you can use IAM and CloudTrail.
You have, again, model flexibility. I already mentioned that: if you want to see which model works best for you, once you are on Bedrock, it's relatively easy to switch between the models that are available there. And then there are a bunch of interesting Bedrock-specific features, which I don't think we're going to spend a lot of time on today, but you can easily build a knowledge base, the feature behind what is often called RAG. We already mentioned agents; there are entire subservices within Bedrock, and frameworks, that make it easy to build and run agents in production.
You have the concept of guardrails, so being able to effectively limit some of the capabilities of an LLM if you want to make sure it doesn't go off on a path that you don't like. Classic examples are limiting the LLM interaction so that it cannot talk about politics, or so that it doesn't go outside the scope of the domain where you are implementing the LLM.
Or maybe you can remove some PII: there are ways to detect PII coming into the conversation with the LLM, so you can obfuscate some of it before it goes back to the user. So you have all these additional features that I think are really important when you're about to go to production and you want to make sure you are ready for it. Now, there is one interesting caveat that I find a little bit disappointing sometimes: although I said that there is support for hundreds of models, not all the mainstream models are there.
A good example is Gemini, which is a very capable model, and it's not currently available in Bedrock. You can imagine this is due to competitive reasons: Gemini being from Google, it's of course sold through Google Cloud. So I don't think it's very easy for AWS and Google to agree on a way to make that work on AWS as well. It's currently much the same for OpenAI. There are the open-weight GPT-OSS models available, but you don't get, for example, GPT-5.3, which would be the bleeding-edge model at the moment. That's not currently available in Bedrock. I suspect that might change, because I'm hearing that there is a big round of investment coming into OpenAI, where Amazon is taking part. So maybe that will change soon enough.
Eoin: They announced as part of that, that the GPT models would become available in Bedrock. So that is the plan. Yeah. It only cost $50 billion. That was the price.
Luciano: Exactly. So yeah, right now, just be aware that if you want to use Gemini, that's not available, and it's probably not going to be available for a long time. If you're interested in the GPT models, they will probably become available very soon, but not at the moment we are recording this. And again, I just want to remark that there is so much to talk about when it comes to Bedrock. Today, we're going to focus just on using the LLM programmatically. Maybe we'll have future episodes, if there is enough interest and if we learn enough, dedicated to the other features. So with all of that introduction, how do we get started using Bedrock?
Eoin: We talked about Bedrock maybe well over a year ago, I think. And since then, there's a new access model. So the first thing you need to do is understand this a little bit. You'll find a lot of outdated articles out there. Bedrock used to have a model access page where you had to manually enable each model. In commercial regions, that old workflow is gone. Today, access is mostly IAM plus one-time agreements for some models.
So you'll want to follow the current documentation rather than anything from 2023. Models are now available by default in commercial regions as long as your IAM identity has the right permissions, like bedrock:InvokeModel. This brings Bedrock in line with how other AWS services work, which is nice. There are a couple of things that can trip you up, right? So some Bedrock serverless models are served from the AWS Marketplace.
The first time your account uses one of those, Bedrock automatically tries to create a Marketplace subscription, so you need IAM permissions for AWS Marketplace; that's something that can trip you up. Note that models from Amazon, DeepSeek, Mistral, Meta, Qwen, and OpenAI are not sold through the Marketplace, so this only applies to certain providers. We'll talk more about this gotcha in a little while.
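As a sketch of the permissions side of this, the kind of identity policy being described might look like the following, expressed as a Python dict as you might embed it in infrastructure code. The resource wildcard and the streaming action are illustrative additions; scope the policy down to specific model ARNs in production, and add AWS Marketplace permissions only if your provider requires them.

```python
# A minimal identity policy sketch allowing Bedrock model invocation.
# The Resource pattern is illustrative; narrow it to the specific
# foundation models or inference profiles you actually use.

import json

bedrock_invoke_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream",
            ],
            "Resource": "arn:aws:bedrock:*::foundation-model/*",
        }
    ],
}

print(json.dumps(bedrock_invoke_policy, indent=2))
```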
Anthropic models specifically still require a one-time use case submission. You just fill out a little form, and then you can invoke them. You can complete this through the Bedrock Playground in the console or using the API. And if you use AWS Organizations and complete it at the management account level via the API, it extends the approval to all organization accounts. Once that bit is done, you can pick a model.
And you've got the Claude models, as we mentioned, which are very popular and excellent for complex reasoning, coding, and long documents. These are among the ones we use the most, I think. And then you've got Amazon's own Nova models, which are kind of the budget option, good for price-performance balance, especially for simpler tasks. And the Nova Lite and Micro ones are very cost-effective.
You have the Meta Llama ones, open-weight models, good general purpose, though maybe starting to show their age; I don't see a lot of use of them. And then Mistral is good for coding and multilingual tasks. Qwen and GLM are really starting to make an impact, I think, and in our opinion there's lots of potential for those to be competitive on price. So that's just some examples. And as you mentioned, Luciano, you don't have all of the competitor models, but we can expect OpenAI's ones to become available at some point in the future, provided that agreement goes well.
It's a good idea to start with a capable model like Claude Sonnet to validate your approach, then see if a cheaper, faster model can handle it. No point in prematurely optimizing. And the Bedrock web console offers a good UI that allows you to send messages to multiple LLMs at the same time, so you can compare responses. It's probably worth also mentioning that, as you can imagine, everybody's experimenting with Bedrock and with LLMs.
And as a result of that, it might be more difficult than you expect to get the quotas you need if you really start to run this in production at scale. So be prepared to make a business case and plead for quotas beyond prototype or POC scale. Once you have your model, you can call the API. So we're talking about the InvokeModel API, or the Converse API for the more standard chat interface.
We generally recommend using the Converse API. It's more of a unified interface across models with a consistent format, a bit like OpenAI's chat completions. And it's not all about text as well. You can use images and documents in the Converse API in the same message format. So multimodal use cases work out of the box. And it supports streaming with Converse Stream for real-time token-by-token output, which, if you're doing chat, is probably a must-have.
And you have the AWS SDKs for doing this in your language of choice: Python with Boto3, JavaScript, TypeScript, Java, et cetera. These SDKs are generally split into two parts, one for the control plane and one for the runtime. So if you look at the Boto3 option, the bedrock-runtime client is probably the one you'll be using more often, and the bedrock one is just for control plane stuff: management of Bedrock models, that kind of thing.
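Here is a minimal sketch of what a Converse call looks like with Boto3. The message payload follows the documented Converse format (role plus a list of content blocks); the model ID and inference settings are just examples, and the actual network call is wrapped in a function so the sketch can be read and run without AWS credentials configured.

```python
# Sketch of calling the Bedrock Converse API with boto3. The payload is
# built at the top level; the network call is wrapped in a function you
# invoke once credentials and a region are configured.

def build_messages(user_text: str):
    # Converse uses a role plus list-of-content-blocks message shape.
    return [{"role": "user", "content": [{"text": user_text}]}]

messages = build_messages("Explain tokens in one sentence.")

def converse(model_id: str, messages, max_tokens: int = 512) -> str:
    import boto3  # imported here so the sketch runs without boto3 installed
    client = boto3.client("bedrock-runtime")
    response = client.converse(
        modelId=model_id,
        messages=messages,
        inferenceConfig={"maxTokens": max_tokens, "temperature": 0.2},
    )
    # The assistant reply comes back as content blocks as well.
    return response["output"]["message"]["content"][0]["text"]

# Example (requires AWS credentials; model ID is illustrative):
# print(converse("anthropic.claude-3-5-haiku-20241022-v1:0", messages))
```

For chat UIs you would swap `converse` for `converse_stream` to get token-by-token output, but the message shape stays the same.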
A new thing as well, since the last time we talked about Bedrock, is cross-region inference. This lets AWS route your request to whichever region has availability and capacity. And it's a pretty big deal, because I think this is the first time it works this way. There used to be an adage that if you wanted to do something multi-region in AWS, you had to specifically configure it in each region and configure the synchronization yourself.
This is the first time where you've got pretty much seamless routing from one region to another. And we can imagine that this is just down to the fact that GPU availability is scarce, so it makes sense to distribute inference to whatever region has capacity. And the way you do that is by using a model ID. You've got a model ID, which might be something like anthropic.claude-sonnet-4-v1, but a cross-region inference profile ID is something you can use instead.
And it will have a routing prefix, like us. or eu. Or it could be global routing, with the global prefix. That'll give you maximum throughput, but no geographic restriction. So it depends on your compliance, data retention, and data residency requirements. New prefixes and profiles might be added over time, and at the SDK level they're pretty much interchangeable. I think it was in our PodWhisperer project, where we use Bedrock and which we talked about in recent episodes, that you can see in the commit history when we started using these inference profiles.
Newer models, like certain Claude and Llama versions, only work through an inference profile. That's why we had to change it in ours; otherwise you get a ValidationException saying that on-demand throughput isn't supported. I think we were a bit confused when we saw that for the first time. If you hit this error, just add the routing prefix to your model ID. And the IAM permissions are different as well, so you'll have to make sure you set that up.
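As a tiny illustration of the naming convention just described, a helper can prepend the routing prefix to a base model ID. This is pure string manipulation; the model ID below is an example, and you should always confirm which inference profiles actually exist for a given model and region.

```python
# Build a cross-region inference profile ID from a base model ID by
# prepending a routing scope ('us', 'eu', or 'global'). If the ID
# already carries a scope prefix, leave it unchanged.

def inference_profile_id(model_id: str, scope: str) -> str:
    """scope is e.g. 'us', 'eu', or 'global'."""
    if model_id.split(".", 1)[0] in ("us", "eu", "global"):
        return model_id  # already a profile ID
    return f"{scope}.{model_id}"

print(inference_profile_id("anthropic.claude-sonnet-4-v1", "eu"))
```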
If you're thinking about observability and monitoring, you'll want to know where your requests got routed. So you might check CloudTrail: rerouted requests include an inference region field, and you can set up a CloudWatch metric filter on this to monitor your routing patterns. So I guess, if you're not too concerned, just default to using inference profile IDs for everything; there isn't really much of a downside. You just might want to think about the region you want to use and where you want your data to go. So given that you're up and running and can do inference, should we talk about the cost and see if we can make it clear how much it might cost?
Luciano: Yes, it's actually not that difficult in terms of the arithmetic of it, because we talked already about tokens. Tokens are generally classified as tokens in and tokens out, or input and output. Input is the prompt you send to the LLM; output is the completion that gets generated by the LLM. And interestingly enough, those get different prices: a price for input and a price for output, sometimes per million tokens, sometimes per thousand.
I think this actually changed recently: the pricing pages now show prices per thousand tokens, where they used to be per million, which confused me when we were writing the notes for this episode. But it doesn't change anything at the end of the day; it's just a way of presenting the cost unit for input and output. And each model is different, so make sure to check the cost for the specific model you want to use.
Some models are more expensive than others. Generally, the bigger, more capable models are more expensive, but those are generally the ones that can be more reliable if you're doing complex tasks. So again, it might be worth starting with the more advanced ones just to make sure you can refine the first implementation of what you want to try to achieve, refine your prompt and everything. When you have something that works, you can try to see if cheaper models can also handle that task.
as reliably as the more expensive model. And that's just a strategy to reduce cost. The interesting thing is that there is no upfront commitment. As with many other AWS services, you just pay for what you use, which is nice, because if you have very occasional use cases, or maybe you don't know exactly how much you're going to be using an LLM-powered feature, that gives you an opportunity to grow as you go. There are a couple of tricks that you can use to reduce costs if you have specific use cases.
One of these is batch inference. I honestly haven't tried it yet, but my understanding is that you can defer the execution of a bunch of LLM requests. You get a bit of extra latency before you receive the responses, but in exchange you get a 50% discount on the cost of input and output tokens. So if you don't have a real-time experience where a user is waiting for a response, maybe some kind of overnight batch processing where you need to analyze lots of documents, you can probably use batch inference to bring the cost down significantly.
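Here's what that 50% discount looks like on a document-analysis workload. The token prices are the same made-up illustrative numbers as before; only the halving of both rates reflects the batch discount described above.

```python
# Assumed on-demand prices (USD per 1M tokens); batch inference is billed
# at roughly half these rates, per the episode's 50% discount figure.
INPUT_PRICE_PER_M, OUTPUT_PRICE_PER_M = 3.00, 15.00
BATCH_DISCOUNT = 0.5

def corpus_cost(docs: int, in_per_doc: int, out_per_doc: int, batch: bool = False) -> float:
    """Total cost of analyzing a corpus of documents with the same prompt shape."""
    on_demand = (docs * in_per_doc / 1_000_000) * INPUT_PRICE_PER_M \
              + (docs * out_per_doc / 1_000_000) * OUTPUT_PRICE_PER_M
    return on_demand * BATCH_DISCOUNT if batch else on_demand

# 10,000 documents, ~3,000 tokens in and ~300 tokens out each:
print(corpus_cost(10_000, 3_000, 300))              # 135.0
print(corpus_cost(10_000, 3_000, 300, batch=True))  # 67.5
```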
And then there are service tiers, which is also not something I have invested a lot of time experimenting with, but effectively there are different tiers of discounts and costs you can use to bring your bill down. Or you can make one- or six-month commitments with Provisioned Throughput, which gives you guaranteed capacity for predictable workloads. So make sure to check the service tiers on the pricing page to understand what that's all about, because it could be important for your use case.
Then there are a few other things that can be relevant here. For example, there is a concept of prompt caching, which basically allows you to avoid paying full price for tokens you keep re-sending to the LLM, so that's another way to save money on the input token cost. The way I understand it, it's almost like saying: if I'm always going to run the same prompt prefix for a specific interaction, then I can create almost like a snapshot of it.
So that's what gets cached, and then you resume from that session with maybe an additional piece of text. You are not paying full price for all that initial input, because it's cached and comes with a discount. That way, you effectively avoid paying full price for that text over and over across different prompts. Yeah, if you go into Bedrock and start using all the other features, of course, they come with their own pricing. But today we are focusing on the inference part. If you're interested in knowledge bases, flows, agents, fine-tuning, all these different features of Bedrock, they have their own pricing and different dimensions you need to consider, so go and check those out if that's something that interests you. Now I think we should quickly touch on some of the issues that might trip you up. What do you think, Eoin?
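To make the prompt-caching idea Luciano describes more concrete, here's a sketch of a Converse request that marks the end of the cacheable prefix. The model ID and the `cachePoint` block shape are assumptions based on the Bedrock documentation; verify that your model supports prompt caching before relying on this.

```python
# Everything before the cache point (e.g. a long shared system prompt) can be
# cached and billed at a discounted rate on subsequent requests.
LONG_SHARED_CONTEXT = "...thousands of tokens of instructions, examples, documents..."

def build_converse_request(user_question: str) -> dict:
    return {
        "modelId": "anthropic.claude-3-5-sonnet-20240620-v1:0",  # assumed ID
        "system": [
            {"text": LONG_SHARED_CONTEXT},
            {"cachePoint": {"type": "default"}},  # everything above gets cached
        ],
        "messages": [
            {"role": "user", "content": [{"text": user_question}]},
        ],
    }

# Only the user message changes between calls; the system prefix stays stable.
request = build_converse_request("Summarize section 3.")
```

In a real application you'd pass these keyword arguments to the Bedrock Runtime `converse()` call.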
Eoin: Yeah, we touched on one already, which is throttling. There are quotas at two levels: requests per minute and tokens per minute. They're per model, per region, per account. And new accounts get shockingly low default quotas, like two or three requests per minute for some models, and that can be a real blocker. Even established accounts can have conservative defaults. They're not giving this stuff away; they're really rationing it out.
So you might get 429 ThrottlingException errors. Plan for it by adding exponential backoff with jitter; the AWS SDK can do that for you using adaptive retry mode in its settings. Use cross-region inference to spread load, monitor with CloudWatch, and apply for quota increases early. Don't wait until you're ready to go to production. And the max tokens parameter in your requests affects throttling. This is an interesting nuance.
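The backoff-with-jitter idea can be hand-rolled in a few lines. This is a generic sketch with a stand-in exception type; with boto3 you'd instead set `Config(retries={"mode": "adaptive"})` on the client and catch the SDK's `ThrottlingException`.

```python
import random
import time

class ThrottlingError(Exception):
    """Stand-in for the SDK's ThrottlingException (HTTP 429)."""

def with_backoff(call, max_attempts: int = 6, base: float = 0.5, cap: float = 20.0):
    """Retry `call` using exponential backoff with full jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except ThrottlingError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the throttling error
            # sleep a random amount up to an exponentially growing ceiling
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

The full jitter (a random sleep between zero and the ceiling) is what stops a fleet of throttled clients from all retrying at the same instant.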
Bedrock reserves tokens based on your max tokens setting up front, even if the model generates far fewer. So setting that too high can burn quota faster than you expect, because the quota math reserves based on what you asked for, not what you got. And for some models like Claude, output tokens count more heavily against your quota with a burndown multiplier, like 5x for Claude. Now, model access isn't as simple as it looks.
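Before moving on, here's Eoin's quota arithmetic made concrete. The 5x burndown is just the example given in the episode; check the documented rate for your model.

```python
# Illustrative only: Bedrock reserves against your *requested* max_tokens,
# not the tokens actually generated, and some models (e.g. Claude) apply
# an output burndown multiplier when counting tokens-per-minute quota.
OUTPUT_BURNDOWN = 5  # model-dependent; 5x is the episode's Claude example

def quota_tokens_burned(input_tokens: int, max_tokens: int) -> int:
    return input_tokens + max_tokens * OUTPUT_BURNDOWN

# The same 1,000-token prompt, two different max_tokens ceilings:
print(quota_tokens_burned(1_000, max_tokens=500))    # 3500
print(quota_tokens_burned(1_000, max_tokens=4_096))  # 21480
```

So a generous max_tokens ceiling can consume six times more quota than a tight one, even if both requests return the same short completion.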
This is another gotcha: even though model access has been simplified, as we tried to say, you still need the right IAM permissions, marketplace subscriptions, all of that stuff. Different models might be available in different regions, and some might only be available in US regions initially. And as we mentioned, IAM permissions are required to get model access because those serverless models go through the AWS Marketplace.
So you might hit that when you switch to a new model from a provider you haven't used before, or to a new model you haven't used before, and your function or service role doesn't have marketplace permissions. If you need to resolve this, you don't have to give your Lambda marketplace permissions permanently. Instead, just get somebody with the right permissions to invoke the model once to trigger the auto-subscription.
You could do that, as we said, using the Bedrock Playground. The marketplace subscriptions are per account, so you'll need to do this in each account as well. Now, for the Anthropic form we mentioned, which is separate from the marketplace subscription, completing it in the management account is enough. And you might have to wait 15 minutes for the model to become available. There is a really weird error you can come across.
AccessDeniedException: model access is denied due to invalid payment instrument. This happens because some Bedrock models are delivered through the AWS Marketplace, and the subscription process requires a valid payment method. AWS documents that payment method issues and geo-restrictions can cause this. You typically hit it when your account has a payment method that Marketplace doesn't accept for subscriptions.
We've seen this with European accounts using the SEPA direct debit system, some India-based AISPL accounts, and certain EMEA credit card configurations. Everything else in your account works fine because those services don't go through the marketplace. But the moment you try to use the marketplace, such as by using a Bedrock model that requires it, it fails. And to fix it, you generally have to add a credit card as a payment method.
Some users report that they need to temporarily set the credit card as the default payment method, then complete the subscription, and then switch back. And then after 15 minutes, fingers crossed, it works for you. Another point to mention is the converse versus invoke model. Invoke model means you have to format the request body for each provider's structure. That's why we recommended using the converse one, because it's standardized for all of them. Okay, then we talk about structured outputs. I think this is where it gets really interesting, Luciano. How can we take this really cool topic and summarize it?
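The Converse versus InvokeModel difference Eoin mentions is easiest to see by comparing the two request shapes side by side. The Anthropic version string and model ID below are assumptions; the point is that InvokeModel bodies are provider-specific while Converse uses one shape.

```python
import json

PROMPT = "Write a haiku about DNS."

# InvokeModel: the body must follow each provider's native schema.
# This mirrors Anthropic's messages format on Bedrock (version string assumed).
invoke_model_body = json.dumps({
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 200,
    "messages": [{"role": "user", "content": PROMPT}],
})

# Converse: one standardized shape, whichever provider you pick.
converse_kwargs = {
    "modelId": "anthropic.claude-3-5-sonnet-20240620-v1:0",  # assumed ID
    "messages": [{"role": "user", "content": [{"text": PROMPT}]}],
    "inferenceConfig": {"maxTokens": 200},
}
```

Switching providers with Converse means changing the model ID; with InvokeModel it usually means rewriting the whole body.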
Luciano: Yeah, we are in the process of publishing an entire article that goes deep into this, so I'm just going to briefly mention what we're talking about and defer to the article if you want to dig deeper. Basically, one of the main problems you face when you try to integrate an LLM into something programmatic is that the LLM generates text, but what you want is generally something more structured, like a JSON object you can parse and then reuse in the rest of your code.
But there are problems. You can tell the LLM to respond with a snippet in JSON, and then the LLM might get a little creative sometimes. Sometimes it's going to use markdown fences, where you have backtick, backtick, backtick, json, then all the JSON inside, and then backtick, backtick, backtick. And then you need to write code that removes the backticks and all the markdown wrapping, takes just the JSON, and does a JSON parse.
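That fence-stripping workaround might look something like this. It's deliberately best-effort, which is exactly why it's fragile compared to the structured-outputs approach.

```python
import json
import re

def extract_json(raw: str) -> dict:
    """Best-effort: pull a JSON object out of LLM output that may be
    wrapped in markdown code fences (with or without a json language tag)."""
    # Greedily capture from the first { to the last } inside a fence.
    # Naive on purpose: multiple snippets or extra prose will still break it.
    match = re.search(r"```(?:json)?\s*(\{.*\})\s*```", raw, re.DOTALL)
    candidate = match.group(1) if match else raw
    return json.loads(candidate)

print(extract_json('```json\n{"status": "ok"}\n```'))  # {'status': 'ok'}
print(extract_json('{"status": "ok"}'))                # {'status': 'ok'}
```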
Sometimes it's even worse. It happens more rarely with the more capable models, but I've still seen it: the JSON you get is not perfectly compliant. You might get a trailing comma, or even worse, fields you didn't define initially, just because the LLM is getting creative. So this is a common problem. Oh yeah, and there is another interesting case where the LLM actually gives you multiple JSON snippets.
It's kind of reasoning and saying: this was my first attempt, then I realized it didn't apply, and now here is another more refined version of the JSON you need. So your parsing code can get more and more complex as you discover all these different variations of text the LLM can generate. Structured outputs are the solution to this: basically a way to constrain the model to follow a specific JSON schema.
So you can literally instrument the model interaction to say: when you respond, you cannot deviate from this schema. Populate this exact JSON schema and give it to me as a JSON object; don't generate any other text. That gives you a much more reliable way to get answers you can then use in your code, and it avoids all the retries and random failures that might trip you up. Again, there are lots of details on how you define the schema and how it actually works in Bedrock, and you need to learn how to define good schemas to get the best results.
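One common way to get schema-constrained output on Bedrock is to define a tool carrying your JSON schema and force the model to "call" it. The field names below follow the Converse API's tool-use shape, and the tool name and schema are hypothetical; treat this as a sketch and verify the details against the Bedrock documentation and our upcoming article.

```python
# Hypothetical schema for an invoice-extraction task.
invoice_schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "total": {"type": "number"},
        "currency": {"type": "string"},
    },
    "required": ["vendor", "total", "currency"],
    "additionalProperties": False,  # reject fields the model invents
}

tool_config = {
    "tools": [{
        "toolSpec": {
            "name": "record_invoice",  # hypothetical tool name
            "description": "Return the extracted invoice fields.",
            "inputSchema": {"json": invoice_schema},
        }
    }],
    # Force the model to answer by "calling" this tool, i.e. with your schema.
    "toolChoice": {"tool": {"name": "record_invoice"}},
}
```

You'd pass this as `toolConfig` to the `converse()` call and read the tool-use input block from the response, instead of parsing free text.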
We'll have a bunch of tips in the upcoming article, so watch out for the episode notes down below, because we'll put the link there once it's available. And with that, I think we've reached the end of this episode. Today we learned quite a lot about LLMs, what inference is, and why you should be considering Bedrock. In general, our take is that if you are building anything new with LLMs, Bedrock is a really solid default choice for production inference, especially because you get guarantees around region availability, fewer legal concerns in terms of data privacy, and the ability to make sure your data is not going to be used for training, which is a common concern, plus all the additional services that come with Bedrock that you might want to start using as you get more and more familiar with LLMs and Bedrock itself. Now, our usual call to action: if you use Bedrock, did you like it? What didn't you like? Maybe you found other random issues that we haven't encountered yet. Please share them with us, because that's how we learn, by sharing and talking with the rest of the community. As always, you'll find our contact details in the links, so feel free to reach out on socials. One last word: thank you to fourTheorem for powering yet another episode of AWS Bites. If you want help building AI-powered applications on AWS that are reliable, cost-effective, and production-ready, make sure to check out fourtheorem.com and reach out to us. Thank you very much, and we'll see you in the next episode.