Help us to make this transcription better! If you find an error, please
submit a PR with your corrections.
Eoin: If you've been listening to AWS Bites for a while, you've probably noticed a pattern. We keep coming back to Lambda, and that's not a coincidence. We're big fans. It's one of those services we like because it's very convenient. You can write tiny little functions in the programming language you like. They run on demand when specific events happen. They scale like crazy when you need them to, and scale to zero when nothing happens, even better.
and you only pay for what you use. Of course, Lambda is not always the best solution for everything, as lots of listeners like to remind us, which is completely fair. The moment you try to do anything that looks like a workflow, for example, Lambda can start to feel like it's fighting against you. You've got 15 minutes max execution time. It's stateless by default. And if you need some orchestration, retries, backoff, all that kind of stuff, you end up bolting on something like Step Functions, queues, schedules, and a bunch of extra stuff you didn't really want in the beginning.
And it's not always easy to get that stuff working reliably. Now what if you could keep the Lambda model we all know and love, but add a few extra superpowers that might help us to overcome some of those challenges? Well, last December we got a few new superpowers when AWS Lambda Durable Functions were announced at re:Invent 2024. And to be honest, we're pretty excited about this one. Now, it's still Lambda, it still has the same runtime and the same scaling, but with a framework that can checkpoint progress, suspend execution when you need to wait, and resume later from a safe point, skipping the work you already completed.
And this is what we're going to talk about in detail today. We're going to break down what durable actually means in practice and how this whole resume mechanism works under the hood. We'll talk about when this approach is a huge win compared to the usual patterns. And of course, the gotchas that can surprise you, especially around determinism, idempotency, and debugging resumed executions. Finally, we'll also talk about one of our own open source applications that we rebuilt from scratch to have an excuse to use durable Lambda functions and see what it feels like to use them in a real project. My name is Eoin, I'm joined by Luciano, and this is AWS Bites. Okay, Luciano, would you like to start off by telling us what is a durable function and what are the basic ideas around it?
Luciano: Of course, yeah. So as you said in the intro, it's still Lambda, right? It's the same service, it's the same model, the same scaling. That doesn't change. So that also means that you still write and run Lambda the usual way, with the same runtimes that you know and love, the same type of resources, the same scaling mechanics. Now the difference is that there is a new flag that you can turn on to basically turn a regular Lambda function into a durable Lambda function.
And that basically opts the function into what is called the durable execution engine, which you can now use through a dedicated SDK. So these new capabilities that we described become available by using a special SDK that you need to install. What is the core difference? There is, I guess, a mental shift that we need to start embracing when we switch a regular Lambda function into a durable function, which basically is that you have to stop thinking about one Lambda invocation and start thinking about a workflow made of atomic steps.
So this dedicated SDK basically allows you to write inside the code, inside your handler, your business logic as a sequence of explicitly named steps. And you can think of every step as an atomic unit of work. So you can think about, OK, do this, then do that. And of course, each step has clear boundaries and outcomes. And this step model is basically what makes the idea of checkpointing possible. And this idea basically means that after a step is completed, the framework can create a checkpoint, which is basically a way of saying that all the state that was in the lambda function at that point is persisted in this workflow.
So it's like, okay, we are making progress, we completed this step, the result of this step was an object, for example, and that object is persisted inside the runtime execution state. And other than that, we also have to talk about suspension, since we mentioned it. So what does that mean? When can this Lambda function execution stop? It can stop for a few different reasons. For instance, there could be unexpected stops, if there is an error, or some kind of crash, a timeout, or something like that.
In that case, there will be a retry mechanism that kicks in and will start to re-execute the function. Or maybe there are planned stops as well. And this is also a new concept because basically sometimes in your business logic, you just have to wait for something externally to happen. We'll talk about some examples in a second. And the idea is that when you want to wait, you don't need to keep the Lambda running because that consumes resources and it's going to cost you money because you have CPU and memory that gets occupied doing nothing effectively.
So what happens is that with durable functions, the Lambda can now be suspended, which means that the instance is literally stopped. So nothing is running, you're not going to be paying for that until something happens that wakes up the execution and starts a new fresh execution, but with all the state that we discussed before, thanks to the checkpoint mechanism, being preserved and restored.
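As an illustration of this checkpoint-and-resume idea, here's a minimal JavaScript simulation. None of this is the real SDK: the `makeStepRunner` helper and the `history` object are stand-ins for the execution state the Lambda service persists between invocations.

```javascript
// Toy model of checkpointing, for illustration only: `history` stands in for
// the execution state the Lambda service persists between invocations.
function makeStepRunner(history) {
  return async (name, fn) => {
    if (name in history) return history[name]; // already checkpointed: reuse it
    const result = await fn();
    history[name] = result; // checkpoint this step's result
    return result;
  };
}

let expensiveCalls = 0;
async function workflow(step) {
  const data = await step('expensive-work', async () => {
    expensiveCalls++; // pretend this is slow or costly
    return { value: 42 };
  });
  const doubled = await step('double-it', async () => data.value * 2);
  return doubled;
}

(async () => {
  const history = {}; // persisted by the platform, not by your code
  await workflow(makeStepRunner(history)); // first invocation, then suspend or crash
  const result = await workflow(makeStepRunner(history)); // fresh resumed invocation
  console.log(result); // 84, and the expensive step ran only once
})();
```

The second invocation starts over, but completed steps are served from the persisted history instead of being re-executed, which is the essence of the durable model.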
Now, you might be wondering what are some good examples of wait steps. They might be timer-based. For example, you can just say, OK, I know that the external action is going to take three seconds. So I'm just going to drop in a wait of maybe four seconds if you want to play it safe. You can predict more or less how much time you're going to need, and you can just sleep for that amount of time.
Another thing could be wait for another compute step. For example, you might be invoking another lambda. And we know that this is generally an anti-pattern. But in this case, it might be starting to become acceptable. But if you are calling another lambda, you can wait for that other lambda to finish. And while you are waiting, your execution gets suspended and then resumed only when the other lambda completes and returns some kind of response.
Or maybe you can wait until a generic condition is satisfied, which is basically a little bit of a wrapper around the timer-based wait model we described before. And the idea is that you can say, OK, I'm going to wake up this function every few seconds, and then I'm going to check on a condition, and if that condition is satisfied I'm going to stop sleeping and progress to the next step.
Otherwise I'm going to go to sleep again and wait for the next timer interval to resume and check the condition again. And then another one that you might be familiar with if you use step functions is the waiting for an external callback. So you could create this concept of a callback. So it's almost like a unique ID that another service can then use to programmatically wake up that lambda function execution.
So this is generally useful, for instance, when you have a human in the loop. So you might have some kind of UI that gets triggered with that callback ID. Then the user will see some kind of interface and be able to decide, OK, maybe do some action and then decide whether that execution should progress or maybe be interrupted. And in that case, that UI you implemented is going to trigger the callback mechanism to resume the lambda invocation.
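As a sketch of that condition-based wait, the loop below illustrates the wake-up, check, sleep cycle in plain JavaScript. This is not the real SDK: with durable functions the sleep between checks would be a suspension, with nothing running and nothing billed.

```javascript
// Polling-style "wait until condition" sketch. In a durable function the
// engine suspends the execution between checks; here we just sleep in-process
// to illustrate the control flow.
async function waitForCondition(check, intervalMs, maxAttempts) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    if (await check()) return attempt; // condition met: move to the next step
    // With a durable engine this would be a timer-based wait step, not a sleep.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error('condition was not satisfied in time');
}

// Example: an external order status that flips to confirmed on the third check.
let checks = 0;
const orderConfirmed = async () => ++checks >= 3;

waitForCondition(orderConfirmed, 10, 10)
  .then((attempts) => console.log(`confirmed after ${attempts} checks`)); // confirmed after 3 checks
```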
Now, there are some other interesting implications. For instance, one of the main ones is that a durable execution can last up to one year. And this is, again, similar to Step Functions. And by the way, this shouldn't be confused with the individual Lambda invocation limit, which is still 15 minutes. This basically means that every time you suspend the execution and then resume it, the overall execution period, from the first time that that Lambda invocation started to when it ends, can last one year.
But of course, each individual invocation cannot last longer than 15 minutes. And let's say that this is convenient, for instance, if you are waiting for human approval, because that gives you time. Maybe you are doing something that could take days for a human to be available and do the approval, or even months. And that's still a good programming model for Lambda durable functions. Finally, I think it's worth mentioning that when it stops, either because of a failure in the execution or because you're waiting for something, the workflow can later resume from the last checkpoint. We'll talk more about the details because I think there are some important nuances. Basically, the conceptual idea is that it doesn't start over from scratch: whatever was completed is skipped, and execution continues from the next step. This is a simplification. We'll talk more about how exactly that model works. But this is how you can build a mental model for what happens behind the scenes. So I suppose, yeah, I think that should cover more or less the main ideas. What do you think?
Eoin: Yeah, well, maybe we should just talk about what durable functions can do that a standard Lambda function cannot. One thing is you can run multi-step workflows just in code, without having to roll your own Lambda-to-queue-to-state-to-next-Lambda orchestration. It's probably a lot easier for people to reason about. It can be very difficult sometimes when you've got your context split across multiple different AWS services.
And then you can suspend and resume cleanly. So for things like timers, callbacks, as you mentioned, and human approvals, of course, without burning compute and paying for it while you're waiting. You can also keep reliable progress automatically by using the checkpointing feature, so completed steps aren't redone when the execution does resume. And also you can then apply resilience controls at the step level, so you can add retries, backoff, and jitter without turning your business logic into a whole load of retry state plumbing.
So it's a little bit of the benefit that you get from Step Functions, but it's in the language that you prefer to use. Then when it comes to workflow hygiene, I suppose you could say there are things that can often be painful to build yourself, like deduplication, cancellation, and compensation. Like, we're thinking about saga-style rollbacks, maybe to get that distributed transaction kind of effect. And that's something you can do with durable functions. And overall, the development and operations experience is improved, I would say, with a better testability story and just clearer observability around a single durable execution and its steps. OK, so that's how it compares to regular Lambda as we see it. It might be useful then to share what kind of good use cases we have where you might think, OK, this is a good fit for durable functions. Let's give it a try.
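The retries-with-backoff-and-jitter plumbing being referred to looks something like this when you hand-roll it. This is a generic sketch of the technique, not the SDK's actual API, which exposes retry policies declaratively per step.

```javascript
// Exponential backoff with "full jitter": the delay ceiling grows as
// base * 2^attempt (capped), and the actual delay is a uniform random
// fraction of it, which avoids retry thundering herds.
function backoffDelayMs(attempt, baseMs = 200, capMs = 30000, random = Math.random) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// The retry wrapper you would otherwise have to write around every flaky call.
async function withRetries(fn, maxAttempts = 3) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt + 1 >= maxAttempts) throw err; // out of attempts: surface the error
      await new Promise((r) => setTimeout(r, backoffDelayMs(attempt)));
    }
  }
}

// With the random source pinned to 0.5 the delays are deterministic.
console.log(backoffDelayMs(0, 200, 30000, () => 0.5)); // 100
console.log(backoffDelayMs(3, 200, 30000, () => 0.5)); // 800
```

Step-level retry configuration lets the platform own this loop, so your handler code only contains the business logic inside `fn`.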
Luciano: Yeah, I actually read recently on Yan Cui's newsletter one example that I found very good in terms of explaining the capabilities. So I'm just going to steal that. Sorry, Yan, if you're listening. So the idea is basically you can build an order processing workflow for a food delivery service. So the idea could be, okay, there is some kind of trigger and a new order comes in. You might imagine, I don't know, there is a website or a mobile application where a user can place an order and that's basically the starting point.
There is an event once the order has been placed, and that event triggers a durable Lambda function which implements the following workflow in steps. So the first step is basically save the order details in a database. Then, of course, we are going to broadcast this order-placed event into EventBridge. And this is basically where the human-in-the-loop might come in. So we might want to implement some kind of restaurant confirmation, so that EventBridge somehow triggers a notification to the restaurant, maybe through another web application or mobile application that is available to the restaurant, where they will see, okay, there is this new order coming in, do you want to accept it or reject it?
And you can imagine that, I don't know, maybe the restaurant is about to close, maybe they run out of ingredients, maybe they are overbooked, So there might be several reasons why the restaurant might not be in a position to accept that order. So the human-in-the-loop, in this case, is an important element of this business flow. And of course, you might also want to apply a timeout because maybe it makes sense for a customer not to wait forever if, for whatever reason, the restaurant cannot even receive that notification.
Maybe they receive it and nobody is available to actually respond to it in a timely manner. So imagine this mechanism. So your lambda, a durable function, is suspended. This event is going to get to the restaurant somehow, and the restaurant is going to have some kind of application that can use the callback to resume or reject that execution, which is effectively confirming or rejecting that particular order.
And I guess this is the example that Yan provided, but you could imagine that you could extend this example even further if you want to think about a slightly more complex workflow. You can imagine, OK, once the order is accepted, then you can also start to track the progress of that order. Maybe the food preparation, or I don't know, maybe it's even before that it's waiting in a queue, then it's getting prepared, then picked up by the delivery driver, then is delivered and then maybe you can even have a final step which is waiting for customer feedback.
So each of these steps can be implemented as steps inside the Durable Function Lambda code. And if you want to think about other examples just to, I don't know, provide more use cases, more food for thought, one other good use case that I've seen is tenant onboarding. So imagine you have a multi-tenant system. Generally, the onboarding of a new tenant has lots of steps. You might want to provision infrastructure.
You might want to configure identity providers. You might want to think about billing and setting up payments. You might model all of that with a durable Lambda function, with each task as a step. You might even have, I don't know, a human in the loop, you might have review steps, and if something fails, you know that all the previous steps can be easily reverted, or maybe you can just resume and then try to complete the missing ones.
Payment retries is another good use case because sometimes, for example, if you have a system that expects to have recurring payments, it's very common that if you are charging, for example, a credit card, it might happen that that credit card doesn't have enough credit at the time of the charge. But then if you retry maybe two days later, it's going to work. So you could model, for example, that kind of behavior in a durable Lambda function.
And another one, which is a little bit of a spoiler, because that's actually what we implemented for our use case, is media processing. Media processing generally involves lots of steps like conversion, creating thumbnails, transcriptions, and all kinds of things. So you can imagine that that complicated workflow can be modeled as a durable function. If something fails at any point, you can resume from the last functional bit and you don't have to redo a bunch of steps that might actually be very expensive from a computational perspective. So I think that gives you probably a good few ideas on where you can use this pattern and this new capability of Lambda functions. So now I would like to talk about what the experience of writing AWS durable functions looks like.
Eoin: But good news, I guess, as we said, it's a regular Lambda function with a few extra, powerful capabilities. And the way those capabilities are provided is through a special SDK called the Durable Functions SDK. You'll need to install it for the programming language you want to use. And right now, what's supported is JavaScript or TypeScript and Python. We believe that Java is in the works, and we even saw a discussion in the Rust SDK repository, so that might arrive pretty soon.
Now, we were using the JavaScript TypeScript Durable Functions SDK for our work. So that's the one we're going to talk about. Other languages might use a different syntax, but the capabilities and concepts should be the same. So the first thing you'll notice is that the handler isn't just a normal lambda handler. You wrap it with a helper called withDurableExecution that effectively turns on durable mode and injects a durable context into your function as a parameter.
Then inside the handler, you don't just write one big blob of code like you normally do, you define named atomic steps, right? So instead of doing work directly, the function runs work inside explicit named steps like step one, step two, step three. And those step boundaries are really meaningful. They're the points where the platform can track progress and treat each unit as done when it comes to things like resuming.
Now, the code outside steps is sometimes referred to as the orchestrator path. So what this looks like in terms of the SDK is you'll define a step by calling something like context.step. Then you give it a step name and a callback. And that context you're providing is the special durable context we mentioned a minute ago. In JavaScript terms, that step call returns a promise, so you typically do const result = await context.step(...).
And that await, it reads just like normal async code, but it's also the boundary where the durable engine can track completion and then persist progress when it's done. So step results, the actual results of each of these steps that you define, are treated like durable state. So the function captures results, whatever you return from each step, in a way that can be reused when the workflow resumes rather than re-computing everything.
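Putting that together, a handler might look roughly like this. Treat it as a sketch, not the authoritative API: the withDurableExecution and context.step names follow what we describe here, but the exact signatures are assumptions, and the wrapper below is a tiny runnable stand-in rather than the real SDK.

```javascript
// Stand-in wrapper so this sketch runs on its own. The real withDurableExecution
// and context.step come from the Durable Functions SDK and additionally
// checkpoint each step's result; this mock just runs the callbacks.
function withDurableExecution(handler) {
  return (event) => {
    const context = { step: (name, fn) => Promise.resolve().then(fn) };
    return handler(event, context);
  };
}

// The workflow is written as explicitly named atomic steps.
const handler = withDurableExecution(async (event, context) => {
  const order = await context.step('save-order', async () => {
    // e.g. write the order to a database and return what later steps need
    return { orderId: event.orderId, status: 'saved' };
  });

  const total = await context.step('compute-total', async () => {
    return event.items.reduce((sum, item) => sum + item.price, 0);
  });

  // The return value is the final outcome of the whole durable execution.
  return { orderId: order.orderId, total };
});

handler({ orderId: 'o-1', items: [{ price: 5 }, { price: 7 }] })
  .then((result) => console.log(result)); // { orderId: 'o-1', total: 12 }
```

Each `await context.step(...)` marks a checkpoint boundary; everything between those calls is the orchestrator path.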
And then you have the concept of waits. And a wait is a first-class operation. It's not just a hack. There's an explicit wait-for-n-minutes construct that you can use to create a step that simply suspends your function for a while, if you're waiting for something else to settle. In a regular Lambda, waiting usually means sleeping and burning up your 15 minutes while you're paying for it, or you build an external timer mechanism.
But here, the wait actually suspends the execution, and the workflow resumes later. So the function is no longer running, and you don't pay while you wait. So with the durable executions mode, the thing you'll get used to is the fact that a function execution spans multiple invocations, but it still feels like one flow. Even though it reads like a single sequential thing, under the hood it'll be starting and stopping and resuming across separate invocations, continuing from the next step. And the code is then workflow code. It's not just request-response code with a little bit of business logic. The return value is the final outcome of the durable execution from multiple steps, not merely the result of one atomic invocation. So this sounds pretty good. Should we dive a little bit deeper? How do they actually work? What's the magic behind them?
Luciano: Yeah, I think this is probably one of the most interesting and perhaps also confusing bits that you need to understand about Lambda durable functions if you want to use them correctly. So let's try to deep dive and describe what really happens when, for example, a step is completed and the state is persisted, so a checkpoint, so to speak, is created. And then what happens when there is a resume, how things are actually restored and the execution actually continues from the next logical step.
So as we said, the core idea is that you have this concept of execution history. So whenever an execution starts, you can imagine that the Lambda service is somewhere capable of storing state. And then for each step, which is treated as an atomic unit, as we said, when that step completes, you can imagine that inside the code of that step you can return data. And a return from that callback basically means, well, I was able to calculate something that I want to retain for the next resume, or maybe the state of this step, if you want to think of it like that.
This is it. This is what I am returning. So make sure it's persisted. So what the SDK does is, every time it executes a step, once that step is successfully completed, if there is a return value, that return value is sent to the Lambda service so that it can be persisted. And this is how the checkpoint mechanism works. But we also said that there might be cases when the execution gets suspended or interrupted for errors, timeouts, or other reasons.
And in those cases, the execution can be resumed later. So what happens on a resume? And this is, I think, the key thing that we want you to understand. If there is one thing you should take away from this episode, hopefully, it's this one. The idea that might be confusing is that when the Lambda resumes an execution, it always starts to execute your code from the beginning. So imagine you have your handler code, and there are, I don't know, 100 lines of code.
Even if you executed five steps and you reached line 50, the next time you resume, you're still going to restart executing code from the first line. So how is that checkpointing mechanism possible? The idea is that every time a step is encountered again in the execution from the first line, the Lambda SDK is going to check: okay, did I already complete this step before? And if it did, then it's not going to re-execute the callback of that step, but it's just going to take the value from the persisted state.
So effectively, you can imagine the execution flow to be like, okay, I'm going to start from the beginning and then quickly check. Did I do step one? Yes. Did I do step two? Yes. Did I do step three? And so on, until it gets to a point where, okay, this is a new step, which I haven't executed yet. So this is exactly the point where I am, in a way, resuming the execution. But practically speaking, everything gets executed from the first line every time there is a resume.
And this is really important because, effectively, I think it could be a common misconception to think of suspension like, OK, you are pausing the CPU at a specific line in the code, like, for instance, when you are pausing a thread or something like that, and then you just resume from that line of code. That concept doesn't exist in durable functions. You just restart from scratch, but then there is this mechanism that allows the execution to know, I already completed this step, so I'm just going to read the result.
and continue from the point where something still needs to be computed. In a way you can think about this checkpointing mechanism like a cache, where basically if you already have that result computed for this execution, there is no point in executing it again, you can just read it from the persisted state. And the reason why you need to understand this is because sometimes it might be tempting, or might seem to make sense, depending on what you're trying to implement, to use non-deterministic code outside steps, in what we called the orchestrator path before.
Because you can have a sequence of steps, but of course nothing is stopping you from having business logic outside steps, and that's not getting checkpointed. So if in that code, this orchestrator path code, you use stuff that is non-deterministic, for instance, you might use a Math.random or a UUID, or you might be using time-based logic, like in JavaScript you might have a Date.now(), for example, and then have an if statement that checks, I don't know, are we after 5 p.m.
and then do something, otherwise do something else. You need to understand that this is not going to give you a predictable execution. Effectively, you are making your execution non-deterministic because the next time you resume, you might get different values, and therefore your code is going to take a different path, and you end up with subtle bugs or behaviors that you didn't expect. So this is why it's really important to understand how the model is built and how the checkpointing works, because then you can avoid these kinds of issues. So hopefully that clarifies, I think, one of the main misconceptions of durable Lambda functions. But you might be wondering, because this is a very new feature, what is the current state in the ecosystem? Should I wait before using this new feature, or is it already in a good state where I can start leveraging it for my applications?
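The determinism pitfall just described can be seen in a toy replay engine. This is purely illustrative, not the real SDK: it mimics the run-from-the-top-and-skip-completed-steps behavior, contrasting a non-deterministic value captured inside a step with one computed in the orchestrator path.

```javascript
// Illustrative replay engine (not the real SDK): on every (re)invocation the
// handler runs from the first line, and step() returns the persisted result
// for steps that already completed instead of re-running their callbacks.
function makeContext(history) {
  return {
    step: async (name, fn) => {
      if (name in history) return history[name]; // replay: skip completed step
      const result = await fn();
      history[name] = result; // checkpoint
      return result;
    },
  };
}

async function flow(context) {
  // SAFE: the non-deterministic value is produced inside a step, so it is
  // checkpointed and stays stable across replays.
  const id = await context.step('make-id', async () => Math.random());
  // UNSAFE: this runs in the orchestrator path on every replay, so it yields
  // a different value each time, making the execution non-deterministic.
  const unsafe = Math.random();
  return { id, unsafe };
}

(async () => {
  const history = {};
  const first = await flow(makeContext(history)); // initial invocation
  const resumed = await flow(makeContext(history)); // resume: re-runs from line one
  console.log('id stable across replay:', first.id === resumed.id); // true
  console.log('orchestrator value stable:', first.unsafe === resumed.unsafe); // almost certainly false
})();
```

If the unsafe value fed into an if statement, the resumed execution could take a different branch than the original one, which is exactly the class of subtle bug described above.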
Eoin: Okay, let's talk about the whole ecosystem then and what it's like as a developer, what the developer experience is, et cetera. So the SDK for TypeScript, I think we found is pretty good, right? It even supports testing with mocking and local execution, which is really good for DX. There are some good articles by Eric Johnson. If you want to see some concrete examples, we'll have the links in the description. Middy, which seems to be really keeping pace with all the new developments, already supports durable functions.
So again, if you haven't tried Middy, there'll be a link to that in the description too. The Lambda Powertools team has worked very closely with the Lambda team to make sure everything works as expected if you're using Powertools. Still, durable functions are very new, and there's definitely some room for improvement in the whole area of DX. We found some missing features or inconsistencies in the SDK and some small glitches in the web console as well.
But it's pretty minor stuff, and we're sure it's going to be fixed soon. And I guess we look forward to seeing more languages supported. I'm sure Java, .NET, and Golang fans will all like to see it. The interesting thing is that the runtimes don't actually have to change. It seems to be just an SDK thing that's required, so it should just be a matter of time. And we can guess that the reason why broader support doesn't exist yet is because AWS is trying to build these SDKs in a way that feels idiomatic to each specific language. JavaScript, as we mentioned, relies heavily on promises, while the Python one uses decorators. Now, all of the theory and the deep diving is done. Shall we talk about the fun part? What did we build?
Luciano: Yes, you might remember our podcast transcription service that we described back in episode 63. Or maybe not, because that was three years ago, pretty much exactly three years. I think it was sometime around January or February three years ago. So yeah, if you don't remember, don't worry, you are officially excused. But you can always go and check out that old episode if you're curious. But I'll give you, or at least I'll try to give you, a quick refresher on what this project is.
It's called PodWhisperer, and it's basically our own solution, fully open source, that allows us to create transcriptions for this very podcast. It originally was based on OpenAI Whisper and Amazon Transcribe. And you might be wondering, why are you using two different transcription services and not just one? But actually, yes, you can listen to the entire episode to know the entire story. But in short, we use OpenAI Whisper because it's really, really good in terms of quality of transcriptions.
It does recognize most of the words without mistakes. But one problem is that it doesn't recognize speakers. So what Transcribe does is kind of the opposite. It isn't always very accurate, as we found, or at least it didn't use to be three years ago. I don't know if it has improved now, to be honest. But it did do, three years ago, a very good job at recognizing different speakers, giving you speaker labels, speaker one, speaker two, trying to figure out how many people are actually engaging in the conversation.
So basically what we did is, okay, we tried to get the best of both worlds by doing the transcription twice, once with one service and once with the other. And then we have a slightly convoluted workflow that tries to join the two results and extract the information that we need from both. So the actual words from Whisper and the speaker labels from Transcribe. And basically PodWhisperer was born as a way to orchestrate this entire workflow.
What we recently discovered is that there is actually a new project that is based on Whisper that is called WhisperX. We'll have the link in the show notes. And it's actually pretty cool because it's still using Whisper under the hood, but adds a few extra steps using additional models. And those steps are, one, it's adding word-level timestamp synchronization, which can be really useful for a bunch of different use cases that we'll mention in a second.
And the other step is what's generally called diarization, which is effectively recognizing different speakers. So you can imagine internally when you use WhisperX, there are three different AI models that get executed in a pipeline. The first one just gets the raw words, the transcription, in segments, with where each segment starts and finishes. The next step is word-level timestamps. So the second model is basically taking the output of the previous model, plus the audio file again, and trying to figure out where each single word starts and finishes.
And then the third step in the pipeline is trying to figure out, OK, for each sentence or segment and word, who is the speaker that is talking now? Of course, who is the speaker in the sense of a speaker label? It doesn't try to get the name, it just figures out, OK, this is a different person talking now, or maybe it's the same person as before, and calls them speaker 1, 2, 3, and so on. And the cool thing is that it also runs on GPU.
And we noticed that it is much faster at transcribing when you have a GPU available. We noticed, for example, on a g5.xlarge that it takes about five minutes or less to transcribe 30 minutes of audio. So basically, because we have been meaning to switch to a model like this for a while, we thought, okay, this is a really good option that we should try, and maybe this can replace our complex workflow where we try to run two different transcriptions in parallel and then join the results.
Maybe just using WhisperX is going to be good enough for us. And at the same time, there were a few other features that we wanted to implement for a while, so we took the opportunity to say, OK, now that we are rewriting this transcription workflow, maybe we can also add the extra features. One of these features, for example, is that every time we get the transcription file, in the last few episodes we started to manually feed an LLM with this transcription file, giving it enough context to understand, OK, we are talking about something related to AWS.
And can you make sure that everything makes sense? Most likely there might be, I don't know, things that are misspelled or slightly out of context, or names of services that are not properly named, or casing that is not respected. The names of the people talking are not always correct. For instance, Eoin is always spelled as O-W-E-N, which we know is not the correct one for you, Eoin. So all these kinds of things LLMs are actually really, really good at fixing.
We used to fix them manually, but that's a lot of work, and now you can just drop all of that text with a little bit of context to an LLM and you get a pretty good result. So this is kind of a refinement step. So we started to realize, actually, we could do a few more refinement steps. Another one is we generally get Speaker 1 and Speaker 2 in our transcriptions, and then we have to manually check, OK, who is the first one talking?
OK, this is Speaker 1, and we change the label manually. Who is the second one? We change the label manually. LLMs are also really good at detecting that, because generally we say something like, my name is Luciano and I'm joined by Eoin, and that's a good signal to the LLM that the person speaking now is Luciano and the other one is Eoin. So in this refinement step we also tell the LLM, can you try to detect the names of the speakers and replace the labels?
And then the next step is about word-level timestamps, which are now provided by WhisperX. One of the problems that you might have noticed if you watch these episodes on YouTube is that sometimes Whisper gives you pretty big segments, multiple lines of text. If you use captions, you might sometimes see an overlay in our videos that is like three lines of text, which is pretty unreadable, to be honest.
So this is something that has annoyed me a lot. And once we got word-level timestamps, you can start to split the segments in whatever arbitrary way you want, because, of course, you can decide, OK, I always want no more than one line of text and no more than, I don't know, 40 characters or maybe 10 words. And, of course, we included this logic in our workflow: try to break down the segments into something that is going to be more readable.
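As a quick aside, the splitting logic described here can be sketched in a few lines. This is a minimal illustration, not the actual PodWhisperer code; the shape of each word entry (`word`, `start`, `end` keys) is an assumption based on what word-level timestamp output from a tool like WhisperX typically looks like.

```python
MAX_CHARS = 40  # rough limit for a single readable caption line

def split_segment(words, max_chars=MAX_CHARS):
    """Greedily pack timestamped words into caption chunks no longer than max_chars."""
    chunks, current, length = [], [], 0
    for w in words:
        # one extra character for the joining space, unless the chunk is empty
        extra = len(w["word"]) + (1 if current else 0)
        if current and length + extra > max_chars:
            chunks.append(current)
            current, length = [], 0
            extra = len(w["word"])
        current.append(w)
        length += extra
    if current:
        chunks.append(current)
    # Each chunk keeps the start of its first word and the end of its last
    return [
        {
            "text": " ".join(w["word"] for w in c),
            "start": c[0]["start"],
            "end": c[-1]["end"],
        }
        for c in chunks
    ]
```

The greedy approach is enough here because readability only requires an upper bound per line, not an optimal split.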
So basically, out of all of these ideas and features we wanted to implement, this is what we did for PodWhisperer v2. And it's all open source. You can check out the repo; it will be in the show notes. We are also using a few other things that are pretty cool, in my opinion, so just to recap, let me tell you very quickly what happens in each step, end to end, when you use PodWhisperer.
So the first thing that happens is that we drop a file, an audio file, into S3, and that creates an EventBridge event, which effectively starts the durable function execution. The first thing the durable function execution does is send an event into SQS saying, this file is available for transcription. And what happens behind the scenes is that that SQS queue is being monitored by an ECS Managed Instances cluster.
And if you don't know what that is, we recently spoke about it at length in episode 150; check it out in the show notes. But the main idea is that we want an easy way to bootstrap a machine that has a GPU only when there is work to do, and then shut it down when there is no work left. That mechanism allows us to do exactly that: we drop a message into SQS, and ECS Managed Instances, which is configured to monitor that queue, spins up an instance when there is work to do in the queue.
And at that point, the instance, sorry, the cluster will start a service configured with an image that has WhisperX already set up, with all the necessary models preloaded into it. It's going to do all the transcription, and then it's going to call a callback. This is another detail that maybe I didn't explain very well: after we drop something into the queue, the execution pauses, waiting for a callback.
So effectively, the message that we send into the queue is: this is the file that needs transcribing, and this is the callback ID that you need to call when you're done. And of course, there is also a timeout that should be reasonable depending on the length of your episodes. If you want to use this tool, you can configure the timeout. In our case, I think it's about 60 minutes, which should be more than reasonable.
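To make the pause-for-callback flow concrete, here is a tiny runnable simulation. The real Lambda durable functions SDK exposes its own callback API; the `CallbackRegistry` class and its method names below are purely illustrative stand-ins for the callback state the durable runtime persists for you.

```python
import uuid

class CallbackRegistry:
    """Stand-in for the durable runtime's persisted callback state."""
    def __init__(self):
        self.pending = {}   # callback_id -> description of what we wait for
        self.results = {}   # callback_id -> result delivered by the worker

    def create_callback(self, description):
        cb_id = str(uuid.uuid4())
        self.pending[cb_id] = description
        return cb_id

    def deliver(self, cb_id, result):
        # Called by the GPU worker when the transcription finishes
        self.results[cb_id] = result
        self.pending.pop(cb_id, None)

    def result_for(self, cb_id):
        return self.results.get(cb_id)

registry = CallbackRegistry()

# 1. The workflow enqueues the job with a callback ID and "pauses".
cb_id = registry.create_callback("transcribe s3://bucket/episode.mp3")
queue_message = {"file": "s3://bucket/episode.mp3", "callback_id": cb_id}

# 2. Later, the ECS worker finishes and calls back with the transcript.
registry.deliver(queue_message["callback_id"], {"segments": ["..."]})

# 3. The resumed execution picks up the delivered result.
transcript = registry.result_for(cb_id)
```

The key property being simulated: while the callback is pending, no Lambda compute is running, and the execution only resumes when the worker delivers a result (or the configured timeout fires).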
Then we have all the other steps. I'm just going to go through them very quickly because they're probably less interesting. The second step is basically what we call replacement rules. We have a bunch of strict matches: for example, as I said, very often we see that Eoin gets misspelled, so we have all the common misspellings listed out with replacement rules for them. We can also do that using regexes.
For example, in other cases, AWS Bites is often spelled with a Y rather than an I, so we have regexes that capture that and fix it on the fly. Then we have the LLM refinement step, which effectively uses Bedrock: we create a prompt for Bedrock to say, can you check if there are potential common issues there and fix them? And can you also try to identify speakers? And then give us back structured JSON that we can use to reconcile the proposed changes with our existing transcript.
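A hedged sketch of what such a replacement-rules step might look like; the actual rule list lives in the PodWhisperer repo, and the entries below are just examples of the two rule types mentioned, exact matches and regexes.

```python
import re

# Illustrative rules only -- the real project maintains its own list.
EXACT_RULES = {
    "Owen": "Eoin",   # a common misspelling of Eoin
    "O-W-E-N": "Eoin",
}

REGEX_RULES = [
    # "AWS bytes" / "AWS Bytes" -> "AWS Bites" (the Y-vs-I problem)
    (re.compile(r"\bAWS\s+[Bb]ytes\b"), "AWS Bites"),
]

def apply_rules(text):
    """Apply exact-match replacements first, then regex-based ones."""
    for wrong, right in EXACT_RULES.items():
        text = text.replace(wrong, right)
    for pattern, right in REGEX_RULES:
        text = pattern.sub(right, text)
    return text
```

Running deterministic rules before the LLM refinement step means the cheap, known fixes never have to round-trip through a model.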
Then we have the segment normalization step. Effectively, we break down each segment into smaller chunks so that they are more readable. And finally, we have another step that generates captions in the common formats, for instance SRT or VTT, which is what we use on YouTube. We also have our own custom JSON format, which is what we use to build our website. If you go to the transcript tab on our website, you will see the entire text, and you can even click around and that will move the video to that specific point.
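For reference, the SRT format itself is simple enough to render by hand. A minimal sketch (not the actual PodWhisperer implementation), assuming segments carry `start`/`end` in seconds and a `text` field:

```python
def to_srt(segments):
    """Render caption segments as SRT (timestamps given as seconds)."""
    def fmt(t):
        # SRT timestamps look like HH:MM:SS,mmm
        h, rem = divmod(t, 3600)
        m, s = divmod(rem, 60)
        ms = round((s - int(s)) * 1000)
        return f"{int(h):02d}:{int(m):02d}:{int(s):02d},{ms:03d}"

    lines = []
    for i, seg in enumerate(segments, start=1):
        lines.append(str(i))                               # cue number
        lines.append(f"{fmt(seg['start'])} --> {fmt(seg['end'])}")
        lines.append(seg["text"])
        lines.append("")                                   # blank separator
    return "\n".join(lines)
```

VTT is close enough that the same segment data can feed both renderers; the main visible differences are the `WEBVTT` header and a dot instead of a comma before the milliseconds.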
So this is the JSON we use to build that feature in the UI. And finally, when everything is done, we trigger an event on EventBridge saying PodWhisperer has finished a transcription. This is something you can use as an arbitrary extension mechanism if you want to use PodWhisperer. In our case, we have another tool, also open source, called Episoder (you'll find the link in the show notes), which does another step: it basically updates the website for us. It creates a PR to our website repo with the new episode description, trying to figure out what the chapters are, suggesting a description, suggesting tags for YouTube, all that kind of stuff. So again, everything is open source on GitHub. If you're curious, check it out. And if you have ideas on how to improve it, it's open source, so feel free to submit issues or PRs. Now, before we move on to the final topics, like comparisons with other tools and pricing, does it make sense to quickly recap some of the best practices, or things that can bite you?
Eoin: Yeah, because it's kind of a new programming model, that's a good thing to talk about. There is an AWS document with some best practices we'll link in the show notes. But our summary, I guess, based on what we discussed so far: any kind of non-deterministic code or side effects should be wrapped in steps, things like random UUIDs. Design for idempotency as well; that's always a good practice with things that are invoked at least once.
Adopt the replay-aware logger from the context. We also noticed that LLMs don't understand these rules, so be careful with LLMs in general, but specifically with new things like durable functions. One example: we had a case where we wanted to keep track of the total execution time of a durable function. And of course, the LLM generated code outside a step, initializing a new Date object called startTime.
And for reasons already stated, we know that's not going to work; it's basically breaking one of our rules. A resume would generate an entirely new Date object, and we wouldn't be able to track the total execution time across invocations. The solution there was just to create this Date within a step, so it's properly persisted as part of the durable state and can be reloaded on resume. You have to tell LLMs these rules very explicitly, and of course, always review the generated code. Important topic, Luciano: how much does it cost?
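The startTime gotcha is easiest to see with a toy replay model. The `journal` dict below stands in for the durable runtime's checkpoint store; this is an illustrative simulation, not the real SDK API. On replay, code outside steps runs again, but a completed step returns its persisted result instead of re-executing:

```python
import time

journal = {}  # stand-in for the durable runtime's checkpoint store

def step(name, fn):
    """Run fn once; on replay, return the checkpointed result instead."""
    if name not in journal:
        journal[name] = fn()
    return journal[name]

def handler():
    # WRONG would be: start = time.time() directly in the handler --
    # every replay would produce a brand-new timestamp.
    # RIGHT: wrap the non-deterministic call in a step, so the first
    # result is persisted and reused on resume.
    start = step("start-time", time.time)
    return start

first = handler()
time.sleep(0.01)
replayed = handler()  # simulates a resume: the step is not re-run
```

Because the second call hits the journal, `first` and `replayed` are identical even though real time has moved on, which is exactly what makes total-duration tracking work across invocations.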
Luciano: Yes. So basically, very quickly, it doesn't change too much, meaning that it's the same Lambda pricing as a base. But of course, because the Lambda function is now doing more stuff, you are expected to pay for those extra features. So there is an additional cost for durable operations; these are checkpoint-related: steps, waits, callbacks, et cetera. Basically, you pay $8 per million operations. And then, because there is also data being persisted, you have to pay for that.
It's $0.25 per gigabyte, and then data retention is $0.15 per gigabyte per month. And this is something you can configure. I think sometimes it might make sense, for instance, I don't know, when it comes to payments, to have a longer retention for whatever reason. But if it's something that, once it's completed, you don't really care about, you can have a much shorter retention and not pay too much for that.
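Putting those numbers together, here is a quick back-of-the-envelope calculation. The workload figures are made up for illustration, and this deliberately excludes the base Lambda compute and request charges mentioned above.

```python
# Rates as quoted in the episode
PER_MILLION_OPS = 8.00   # $ per million durable operations
PER_GB_WRITTEN = 0.25    # $ per GB of durable state written
PER_GB_MONTH = 0.15      # $ per GB-month of retained state

def durable_cost(ops, gb_written, gb_retained, months):
    """Estimate the durable-functions surcharge (excludes base Lambda pricing)."""
    return (
        ops / 1_000_000 * PER_MILLION_OPS
        + gb_written * PER_GB_WRITTEN
        + gb_retained * months * PER_GB_MONTH
    )

# Hypothetical month: 2M operations, 4 GB written, 1 GB kept for 1 month
cost = durable_cost(2_000_000, 4, 1, 1)  # 16.00 + 1.00 + 0.15 = 17.15
```

The retention term is the one you control most directly: shortening retention for state you never revisit is essentially free savings.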
Now, one thing that I think is really interesting, and that I spent a little bit of time on, is comparing durable functions with other industry tools that are somewhat similar. I've seen lots of people in the past talking about DBOS. I don't know if it's something new to the listeners here, but I always find it very interesting. And other options are Temporal, or Temporal, I'm not sure what the right pronunciation is,
and trigger.dev. Basically, you can think of the same story that we just described for Lambda durable functions, but as a more generic service that is not necessarily tied to Lambda. For instance, DBOS is effectively durable execution that is totally open source: you can pick whatever machines or containers to run your code on, and that's basically it. It's implemented using Postgres as the mechanism to persist the state, and it gives you an SDK that allows you to write your durable code in TypeScript, Python, Go, and Java.
And of course, they also have their own cloud service if you don't want to self-host, but I think this is a great option if you want to self-host this concept of durable execution. I think I've seen somewhere, I don't have the link right now, but if I find it, I'll put it in the show notes, somebody trying to run DBOS on Lambda, which I think was a pretty cool idea before durable functions were created by the Lambda team.
The problem, of course, is that you wouldn't be able to easily replicate that stop-and-resume model. So you would probably be limited to 15 minutes of execution, or you'd need to do some kind of crazy orchestration to recreate all the checkpointing and resuming yourself. I'll try to find that video, and if I find it, I'll link it. I haven't fully watched it myself, but that seems to be exactly where the benefit of having a specific service built into Lambda comes in, because if you have to do it yourself, it's not easy, or it would be much more limited than what you can achieve with a native service.
Then Temporal and Trigger.dev are built on pretty much the same ideas. I'm not sure whether they both have an open source version, but they are sold more as hosted services. I've looked at Trigger.dev briefly, and it seems like it has pretty cool UIs and is very easy to use. So that's probably another alternative to look at in case you are not really tied to Lambda. But again, what I want to say with this section of the episode is that this idea is not new.
It's just that the Lambda team figured out, OK, this is a capability that many people are actually using, and it is nice to have it in Lambda. But if for whatever reason you cannot use Lambda, and you enjoy durable functions, you can achieve something very similar using one of these tools. So that's why we wanted to give a mention to these other tools. I think there is another common question that I heard a lot, even in the presentation talk at reInvent. This question came up, and it's basically: this seems very similar to Step Functions, so when should you use durable functions compared to Step Functions?
Eoin: Well, you might think that durable functions are just Step Functions without the ASL, Amazon States Language, but I don't think that's necessarily true. It's obviously going to come down to preference. If you're very proficient with Step Functions and the nature of your workflows is not too complex, that might be fine. But I've definitely been in the situation where you end up with lots of Step Functions with lots of Lambda functions interspersed where you're doing business logic, and the switching back and forth can be a little frustrating.
Durable functions are just Lambda functions, right? So one of the advantages is that you can use any event source that works with Lambda to trigger them. With Step Functions, I think you can still trigger them from events, but you don't have the same set of supported integrations there. Now, for testing and running locally, durable functions, as we mentioned already, seem to be pretty well designed. I've had a good few attempts at testing and running Step Functions locally.
And while it's better than it used to be, it's still pretty hard and not a like-for-like experience. Whereas if you're testing in the language that you're very familiar with, it's a much more pleasant experience. When you're comparing the two as well, I'd say beware of massive parallelism. I've been able to do lots of highly scalable distributed map Step Functions, and that allows you to run tens of thousands of jobs in parallel. Now, while durable functions do support map steps, they don't seem to be designed for that kind of scale. We haven't actually tried them yet, but it does seem like they're more geared towards smaller volumes of data.
Luciano: Yeah, I have a few more. For instance, I think we have to shine a light on the workflow builder of Step Functions. If you like that visual way of building different steps, or, for example, if you have cases where you're trying to integrate a bunch of different AWS services, in Lambda, I think that will be much more complicated, because you have to write code and make sure to install the correct SDKs.
With Step Functions, you can just drag and drop and connect different things. And that also gives you a pretty nice visual story of, OK, this is what happens and that's what happens, especially when you start to have lots of branches. It might be much more complex to represent the same logic within a durable function, and, for sure, you don't get a built-in visualization. So that's something you're going to be lacking.
I've found that sometimes, when I'm working on complex Step Functions, just screenshotting the flow visualizer is already pretty good documentation. Whereas with a durable function, that's something you need to do yourself: you need to create some kind of diagram that represents all the different states and then keep it up to date. So if you find yourself preferring this visual model, I guess that might be a reason to pick Step Functions over durable functions.
And another thing that is probably worth mentioning: when you're doing distributed transactions, stateful application logic, or even AI workflows, and I'm hearing about lots of people building AI agent workflows with durable functions, they seem like a better candidate, because it's mostly code that you're writing, so it's probably easier to just take that business logic and split it into steps within your Lambda handler. So that's a case where I would prefer durable functions over Step Functions. And that's probably everything we had to share. I know this was a long episode, so maybe, Eoin, I don't know if you want to try to give a quick recap and then we'll wrap it up.
Eoin: Sure thing. So what do we have to say about durable functions? You know, it's still Lambda, same scaling, but you can now write multi-step workflows in code with checkpoints, waits, and a nice resume model. The big mental shift here is that you're thinking in atomic steps, and the system persists progress so you don't have to hand-roll orchestration glue. And we did talk about the resume model, because that's where the power is.
And it's also where a lot of the surprises live. So what makes us excited about this? Well, I think it feels like a really interesting middle ground between raw Lambda plus lots of glue and full-blown orchestration. I think we're really excited about the long waits and human approvals, the fact that you can now do this without paying for the waiting time, and the fact that a durable execution can hang around for up to a year.
That's pretty impressive. So, has anybody out there tried durable functions yet? We'd love to hear from any listeners or viewers. What did you build? What tripped you up first? What were your successes? And if you haven't tried them yet, where do you think they'd fit better than Step Functions in your world? Let us know in the comments or reach out on socials. We really want to hear real-world experiences, good and bad. And if you've got a weird edge case or gotcha story, even better: send it our way and we might cover it in a follow-up episode. Lastly, thanks again to fourTheorem for backing us and powering this episode. If you want help designing and implementing an AWS architecture that's simple, scalable, and not too hard on cost, head to fourtheorem.com. Thanks so much for joining us again, and we'll catch you on the next episode.