Help us to make this transcription better! If you find an error, please
submit a PR with your corrections.
Eoin: The hype around generative AI is dying down now, but we are beginning to see a growing ecosystem and lots of real-world use cases for the technology. We've just built some features using Amazon Bedrock that help us to produce this podcast and save some manual effort. So today we're going to talk about how we automate the process with Bedrock, and we'll give you an overview of Bedrock's features. We'll also talk about how to use it and share some tips and code on cost monitoring. My name is Eoin, I'm here with Luciano, and this is the latest episode of the AWS Bites podcast. This episode is sponsored by fourTheorem. If you're looking for somebody to accompany you on your cloud journey and need an AWS Advanced Partner, check us out at fourtheorem.com. Now, before we get started and talk about Bedrock and how we use it, let's talk about the manual, or semi-manual, process we followed before, Luciano. What was the drudge work we wanted to remove?
Luciano: Yes, so we are talking about the work that is involved when you create, in this case, a podcast, but I think it applies to any video that you want to publish on YouTube. So you create this video, and when you upload it, you need to provide a bunch of additional information, effectively metadata, that makes your content more discoverable and provides all the necessary context to your viewers, but also to YouTube, to make the content searchable.
So we are talking, of course, about descriptions and tags, but also other things like the chapters that you can add. You have probably seen these in our videos or on other YouTube channels, where you can add a specific type of text that says this section, starting at this point, is about, I don't know, the introduction. This other section is about why we need serverless. And what happens is that YouTube is gonna use that text in the description to split your video into chapters, and you will have those chapters as an easy way of jumping through different parts of the video.
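For illustration, the chapter markers YouTube looks for in the description are just timestamped lines like these (the timestamps and titles here are made up):

```
00:00 Introduction
02:10 Why do we need serverless?
15:45 Cost monitoring tips
```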
So all of this information is something we take the time to create for every single video we upload, because I think it makes them more discoverable and it gives our viewers a better experience, but of course it's a lot of extra work that we have to do for every single episode. And the process we have been following so far, before the feature we are gonna be talking about today, more or less looks like this.
So we already have some automation, which is called PodWhisperer. We talked about that before and we'll link to the previous episode about it. And PodWhisperer, in short, is able to extract a transcript of everything we said during a specific episode. And that information is something we can use in different ways. All of that is done through a Step Function, and again, you can check out that video.
We'll give you some other details later. But other than that, the other piece of information we have is our notes that we keep in Notion. So we already have some groundwork we can use, so every time we want to create a title, a description, or tags, we can look at all of this information without having to watch the entire video again and decide, okay, what kind of title do we need? What kind of description? And so on.
I admit that sometimes we use ChatGPT just to summarize, for instance, all our notes and give us maybe a draft of the description or some ideas for titles. But then there is still a lot of manual work in adjusting that result until we are happy with the final outcome. And when it comes to creating YouTube chapters, it is a very manual process. What I have been doing so far is watching the final version of the episode at 2x, and then keeping track of all the points where I see, okay, now we are changing topic, and this probably warrants a dedicated chapter.
So I note down the timestamps and create the format that YouTube expects. We do all of this, and then when we upload the video to YouTube, we need to copy-paste all of that information correctly. And some of this work, I think, is nice to do manually, because it's good that you can add a bit of a personal touch. You might have seen that we take a bit of creative license when it comes to descriptions. Sometimes we adopt a style that sounds like Tesla, the scientist; sometimes we think, I don't know, let's give it a medieval touch, because maybe we are using some medieval artwork. So it is nice to have that kind of freedom and make it a little bit more original. But at the same time, we thought that using some generative AI could help us do all of that work faster and with less manual effort.
Eoin: When Bedrock came out recently, this kind of was the inspiration. And especially, I'm gonna link some tutorials in the show notes because one of the notable things about Bedrock is that when Amazon announced it, they created lots of really good tutorials and workshops and example repos, some of which are like really, really impressive content. I don't think I've ever seen that for a new service before.
But let's first talk about Bedrock. So Bedrock is Amazon's new service for building generative AI applications. It's quite a bit different to SageMaker in terms of experience: SageMaker is designed to remove some heavy lifting, but you still have to understand containers, models, and the model libraries like PyTorch, et cetera. Bedrock is a much more managed, high-level service.
And the idea of Bedrock is that it gives you access to third-party foundation models from lots of different companies through a single API. You just use one API to get a response to an instruction or a prompt. The models that are available on Bedrock right now include Anthropic's Claude large language model. That's a good one with a focus on safety and security; non-toxic data and safe data from reliable sources is their focus there.
Then you've got the Cohere Command model for cases like customer support and business scenarios, you have the AI21 Jurassic large language models, and then you have Amazon's own Titan models, which are general-purpose large language models. Those are really aiming to be the lower-cost option for when you're trying to cost-optimize. Not all of the models are fully available yet.
And then if you're doing image generation, you have the Stable Diffusion models available there as well. There are also other models planned: the Facebook/Meta Llama 2 model is supposed to be coming soon, and a lot more are expected to arrive. So if you want to use a model that isn't on Bedrock but is available elsewhere, like on Hugging Face, you would need to host the model somewhere else, like on SageMaker, in the more traditional way.
But going back to the ones that are available on Bedrock: if you're just starting to build practical features like chat, text summarization, or image generation into your application, I would just say that I think it's a lot easier than you would expect, and there's very little work you have to do. It ultimately depends on your use case and whether you need to pull in additional data, but generally, for the kind of use case we're describing here, it's really quite a simple addition.
So it allows you to quickly use these models and add things like chat applications, text summarization, knowledge search based on additional data you might have in documents, and text-to-image or image-to-image creation as well. Now, there's a very small bit of setup: because you're using these third-party models, before you can use them you have to go into the console, open the Bedrock model access settings, explicitly enable each model, and agree to its end user license agreement.
I guess that's because of the nature of these applications: the models are non-deterministic and there's impact there, so you just have to get the legal sign-off on those pieces. Once you've done that, it's basically serverless, so you can start making requests without having to wait for any infrastructure to be set up. Now, if you're comparing Bedrock to other things, like the OpenAI APIs, for example, the idea here is that with Bedrock there's more of a focus on privacy and security.
So the big difference compared to other alternatives is that your data is encrypted and it's not shared with other model providers. You can keep your data and traffic within your VPC as well using PrivateLink. So you've got encryption, and your data isn't gonna be used to train the models further; that's part of the agreement that you get. You can use those models as-is, but there's also a whole other set of APIs for fine-tuning them if you need to, using your own training data on S3, and simplifying that as well.
But we're just talking really about using foundation models for inference, just getting a response back without any special training or any kind of fine-tuning. So for our use case, we decided to use the Anthropic Claude V2 model because it supports the largest input size by far. It supports up to 100,000 tokens, which usually equates to around 75,000 words. We want to be able to give it a full episode transcript, and our episodes can be anything, based on historical evidence, between 2,000 and 30,000 words. So that might be up to around 40,000 tokens. That's what we started with. Luciano, what's then the goal of the design of the system? What do we want it to do? We've got the problem, we've got the model. What was our thinking from that point?
Luciano: We already had a piece of automation. We already mentioned PodWhisperer a couple of times, and PodWhisperer is effectively a Step Function. It's a workflow that orchestrates different things and eventually creates a transcript for us. So the idea was: okay, we already have the Step Function, we already have a step that gives us something we can use as an input. So what we need to do next is basically extend that Step Function, take that input, and do more stuff with it.
So once we have the transcript, what we can do in the new steps that we introduce in this automation is create a prompt that will instruct the large language model to generate a few different things. One is the episode summary, then the YouTube chapters with precise timing and topic, and a set of tags for the episode. I think it's also worth mentioning here that when we generate the transcript, it's not just the text; we also keep time references for all the different bits and pieces.
And this is how we are capable of doing chapters with precise timing, because we are giving the model not just the text, but also the timing of every text segment, so it can determine exactly which piece of text is said at which specific point in time. So once we create this prompt, which needs to summarize all of these instructions in a way that the model can really understand and give us the output we expect, we need to send the prompt.
So we need to actually make an API call to Bedrock, and we also need to give it the full episode transcript, of course, because that's part of the context we need to provide. And after a while, when this request is processed, we receive a response, we need to parse this response, and this becomes the next step in our integration. What we finally want to do is create a pull request to our website repository, which is something we manage with a static site generator called Eleventy.
And it's all managed open source in a public repository on GitHub. So we will also have the link if you're curious to see exactly how we build our own website. So we create this pull request, and this pull request contains all this information nicely laid out in the PR description. So this is what we got from Bedrock, but also contains the transcript that we can incorporate in the website as well.
And this is something we were doing before. So the new bit here is that the pull request description will contain all this additional information, description, chapters, and tags in a way that we can easily copy-paste into YouTube. This way we're saving lots of time. Of course, we still take some manual time to review everything, decide whether we like it or not, and add a little bit of personal touch, but I think it's already saving us a lot of time. I think one of the interesting bits here, which at least when we started to work on this wasn't obvious at all to me, is the prompt engineering part. How do we tell the model what we want, and in which format it should give it to us? So do you want to talk a little bit more about that?
Eoin: Yeah, the prompt syntax for every model is slightly different. For example, for the Claude one we're using, you need to specify "Human:", then your instruction, and then a new line with "Assistant:", and then finish it with another two new lines. That's just the way the model has been trained and expects input. Beyond that, it's about trying to come up with the right phrases, instructions, restrictions, and examples so that it has the best chance of giving you the kind of inference results you're looking for.
And the way you can do that is to start off with the Bedrock playground in the AWS console, where you can type instructions. The API and SDK are really simple: you're just doing an InvokeModel request. That's what we're doing, and there are only a couple of parameters you need to pass in. You can look at the documentation for the parameters you need to specify, and then it's just understanding how to format your prompt.
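As a rough sketch of what that looks like with the AWS SDK for Python (boto3), assuming the Claude v2 model ID and some default-ish inference parameters. The prompt here is simplified and the variable names are made up, so treat it as an illustration rather than our exact code:

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

# In reality this would be the full episode transcript with timing information
transcript_json = json.dumps({"segments": []})

# Claude expects the "\n\nHuman: ... \n\nAssistant:" framing around the instruction
prompt = (
    "\n\nHuman: Provide an episode summary of around 120 words, 10 chapters with "
    "timestamps, and up to 20 tags for the following transcript JSON: "
    + transcript_json
    + "\n\nAssistant:"
)

response = bedrock.invoke_model(
    modelId="anthropic.claude-v2",
    contentType="application/json",
    accept="application/json",
    body=json.dumps({
        "prompt": prompt,
        "max_tokens_to_sample": 4096,  # upper bound on the generated output
        "temperature": 0.5,
    }),
)

# Claude returns its reply as plain text in the "completion" field
completion = json.loads(response["body"].read())["completion"]
```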
So for us, it's just a string with this "Human:" start, and then we're asking it the instruction. What we're saying is: provide us with an episode summary, first person plural, aiming for around 120 words. Then we say: followed by 10 chapter summaries for the following transcript JSON, and we include the JSON in the instruction. We're also asking for the chapter summaries to be based on the timings in the transcript segments, and for those timestamps to be included exactly as they are, in the same format as the segments.
And we're also asking for the tags, up to 20 relevant tags for the YouTube video. But we're also doing a kind of single-shot inference, where we're giving it an example of the output we want to receive. So we're giving it a sample JSON just to show the structure we're looking for. And when we run that, about 20 or 30 seconds later we get back our response.
It starts off with a bit of text and then the JSON, so we just need to strip out the JSON and parse it. Now, you might wonder: this is a non-deterministic model, it can generate all sorts, will it always generate valid JSON? That's something to be mindful of, but in our testing it has always generated perfectly formatted JSON, and we haven't had any issues there, because we're using that example in the prompt.
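A minimal sketch of that extraction step, assuming the reply contains a single JSON object somewhere after some preamble text (the field names below are only illustrative):

```python
import json

def extract_json(completion: str) -> dict:
    """Strip any surrounding prose and parse the first JSON object in the reply."""
    start = completion.index("{")
    end = completion.rindex("}") + 1
    return json.loads(completion[start:end])

# e.g. extract_json('Here you go:\n{"summary": "...", "chapters": [], "tags": []}')
# -> {'summary': '...', 'chapters': [], 'tags': []}
```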
So then, tying this all briefly into the total architecture, which you can see in the diagram on the website: we're using a Step Function to orchestrate. It's a very simple two-step Step Function, triggered by S3 events via EventBridge. It runs the summarization Lambda function, which calls Bedrock with the prompt, passing in the full transcript, getting back the response, and extracting the JSON.
Then we pass that to the next step in the Step Function, which is our pull request Lambda, the same one we had in the other project before. We just refactored it into the new repo, and it creates the pull request based on that JSON and gives us that nice GitHub description. And that's it. I think it's pretty simple all in all, but it's quite powerful, and the results so far look pretty impressive. Recently we did the interview with Jeremy Daly. We got a really great amount of time to talk to Jeremy, but the more time you have, the more effort it takes if you're trying to create chapters. So all of this automation really helps us, because this podcast is only really possible because we've managed to find a format that doesn't take too much of our time. We do some preparation, we record the episodes, and then we try to keep the post-production process as lean as possible.
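As a rough outline of that two-step orchestration in CDK (Python), assuming two existing Lambda functions for summarization and pull request creation. The construct IDs and function names here are hypothetical, and the S3/EventBridge trigger that starts the state machine is left out for brevity; the real definitions live in the project repo:

```python
from aws_cdk import aws_lambda as _lambda
from aws_cdk import aws_stepfunctions as sfn
from aws_cdk import aws_stepfunctions_tasks as tasks
from constructs import Construct


def build_workflow(scope: Construct,
                   summarize_fn: _lambda.IFunction,
                   pull_request_fn: _lambda.IFunction) -> sfn.StateMachine:
    # Step 1: call Bedrock with the prompt and transcript, return the extracted JSON
    summarize = tasks.LambdaInvoke(
        scope, "SummarizeEpisode",
        lambda_function=summarize_fn,
        output_path="$.Payload",
    )
    # Step 2: open the pull request on the website repo with the generated content
    create_pr = tasks.LambdaInvoke(
        scope, "CreatePullRequest",
        lambda_function=pull_request_fn,
    )
    return sfn.StateMachine(
        scope, "EpisodeSummaryWorkflow",
        definition=summarize.next(create_pr),
    )
```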
Luciano: Of course, all of this stuff is not free. We are using a service, we are using AWS, and AWS is running servers and GPUs for us, so of course there is a cost to it. So what is the cost? There are actually two different pricing models that you can pick: one is called provisioned and one is called on-demand. With provisioned, you basically pay per hour, and interestingly enough, it's not supported for all models.
So the idea is that you pay upfront, decide which term you want to commit to, and then it looks a little bit like a Compute Savings Plan, where AWS is probably allocating something dedicated to you and you are paying by the hour for that set of resources. We actually didn't use this one. We used on-demand, just because it looks more flexible and, I think, better for us while we are experimenting, and we don't really expect to use it in large volumes anyway.
And on-demand is what you would expect from a kind of serverless offering, where you pay for the units being processed. It varies a lot depending on the model you pick. For instance, for text models you pay from a small fraction of a cent up to a couple of cents per thousand tokens, and the most expensive is the Stable Diffusion one, probably because it's also the most expensive to run behind the scenes, at 7.2 cents per image.
Based on that, it might not be very easy to predict your costs, especially because we are not generating images, we are generating text, and text can vary a lot. You might have very short text, you might have very long text. And it's not just the text that is generated, but also the input that you provide: if you have longer episodes, you are providing more text. So it's very difficult to make a prediction and say, well, we're gonna be spending X per episode. So how did we reason about cost? What did we do to try to make our costs a little bit more predictable?
Eoin: Because it's so difficult to understand this pricing model and it varies from model to model, and then the dimension is a bit strange as well. So this pricing example you gave, you mentioned 0.03 of one cent at the lower end, up to like one and a half, two cents. That's for a thousand input tokens for these different language models. So what's a token? Well, it's generally roughly one word, but the input, depending on if you're dynamically generating your input or if the output is extra long, your price is gonna change.
So it's important to get more of a real-time handle on costs. What we did to solve that was to create a real-time view of our pricing using a CloudWatch dashboard that is generated from a CDK application. This CDK application is in the same repo, so you can take a look at it and use it as an example to create your own Bedrock pricing dashboard too. What we do is hard-code the pricing in there for the Claude model, because unfortunately, right now, it's not available via the Amazon pricing API.
So we just had to hard-code it, but then we can generate widgets for a dashboard that let us see at a glance, in near real time, with one-minute granularity: what's the cost for input? What's the cost for output? What's the total cost over the last week, the last day, the last hour, whatever? And then, based on the number of invocations, we can see what it costs to summarize the average episode.
And we have that dashboard, but we also have alarms. So we can say: if this goes above $1 per hour for three consecutive hours, then send us a notification. So we don't have to wait for our budgets to fire or for the end-of-day billing reports; we get much more real-time alerts. It's an interesting model that you could apply elsewhere, but I think it's particularly useful for this kind of stuff. By the way, in case you're wondering, it costs around 13 or 14 cents per episode for us to do all this summarization.
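A condensed sketch of how a widget like that can be wired up in CDK (Python). It assumes Bedrock publishes InputTokenCount and OutputTokenCount metrics under the AWS/Bedrock namespace with a ModelId dimension, and the hard-coded prices are again just placeholders:

```python
from aws_cdk import Duration
from aws_cdk import aws_cloudwatch as cw


def hourly_cost_metric(model_id: str) -> cw.MathExpression:
    dims = {"ModelId": model_id}
    input_tokens = cw.Metric(
        namespace="AWS/Bedrock", metric_name="InputTokenCount",
        dimensions_map=dims, statistic="Sum", period=Duration.hours(1),
    )
    output_tokens = cw.Metric(
        namespace="AWS/Bedrock", metric_name="OutputTokenCount",
        dimensions_map=dims, statistic="Sum", period=Duration.hours(1),
    )
    # Hard-coded, illustrative prices per 1,000 tokens
    return cw.MathExpression(
        expression="(inTokens / 1000) * 0.011 + (outTokens / 1000) * 0.033",
        using_metrics={"inTokens": input_tokens, "outTokens": output_tokens},
        label="Estimated hourly Bedrock cost ($)",
    )
```

The same expression can then be placed on a dashboard graph widget and backed by an alarm with a threshold of $1 over three evaluation periods, which is roughly the setup described above.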
I think the cost is pretty good. We've run tens of executions, almost a hundred so far, and haven't spent more than $5, I think, with all our testing and running the latest few episodes through this engine. If anyone wants to get this up and running for their own generative AI work with Bedrock, just check out the repo and feel free to use that code as an example. And I think that's it for this episode. Please check out this episode's project and let us know what you think. We'd love to have others contribute to it, add new features, and give us some ideas as well. So thanks very much for joining us. I hope you're really enjoying those robot-generated YouTube descriptions, and we'll see you in the next episode.