AWS Bites Podcast

63. How to automate transcripts with Amazon Transcribe and OpenAI Whisper

Published 2023-01-13 - Listen on your favourite podcast player

We built a Step Function that allows us to generate high-quality transcripts for AWS Bites podcast!

After evaluating different approaches and technologies, we ended up using Amazon Transcribe and OpenAI Whisper. They both have their pros and cons, but combined they gave us everything we were looking for with quite a good degree of accuracy!

In this episode, we describe our use case, our research, and how we eventually went about productionizing our final solution.

If you run a podcast and you would like to do something similar, we have open-sourced our solution. It's called PodWhisperer and you can find it on GitHub: github.com/fourTheorem/podwhisperer.

AWS Bites is sponsored by fourTheorem, an AWS Consulting Partner offering training, cloud migration, and modern application architecture.

In this episode, we mentioned the following resources:

Let's talk!

Do you agree with our opinions? Do you have interesting AWS questions you'd like us to chat about? Leave a comment on YouTube or connect with us on Twitter: @eoins, @loige.

Help us to make this transcription better! If you find an error, please submit a PR with your corrections.

Eoin: We have updated awsbites.com and added transcripts to every single episode. We don't have a massive number of episodes yet, but we're proud to have published 62 episodes so far, so it's no small amount of audio to transcribe. So how did we go about doing all this work, and how are we going to be able to keep doing this consistently in the future? Well, there's a simple answer: automation and AI. For the longer and more elaborate answer, you'll have to stick around until the end of this episode.

We'll tell you how to build your own transcription automation pipeline and we'll also share all of the code for our solution. My name is Eoin, I'm with Luciano, and this is the AWS Bites podcast. AWS Bites is sponsored by fourTheorem. fourTheorem is an AWS consulting partner offering training, cloud migration and modern application architecture. Find out more at fourtheorem.com. You'll find the link in the show notes. Since this is the first episode of 2023, happy new year to everyone. We're glad to be back publishing new episodes after the break. So back to transcripts for this podcast. Luciano, can you start maybe by giving a quick recap of what the actual need is? What problem are we solving?

Luciano: Of course. So yeah, this is a podcast and basically in every episode we talk a lot. We say a lot of nonsense. Sometimes we also say something interesting, at least I hope. And of course it would be great if we could provide, together with the videos and the audio files, proper transcripts. It would be nice if we could do that consistently for every single episode, so that when the episode comes out it is also available in a text format.

So people who prefer to read this conversation rather than consuming the video or the audio format can simply use that transcript as a way to consume the information we are trying to share. Transcripts are also very useful for search engine optimization: we embedded them in our own website with the hope that that contributes to making our content more discoverable, because on the web we provide a better description of the kind of content we are producing. And in general, transcripts can also help people who are just watching the video or listening to the audio to easily find exactly the place where we were talking about a specific concept. Maybe they are listening to an episode again because there is something they want to refresh: they remember that we talk somewhere about Step Functions, so they can search the transcript to figure out exactly at which point we start to talk about that particular topic. So definitely there is value in creating all these transcripts. But the main question is: how did we do that? How can we generate transcripts in general? What did we do?

Eoin: Yeah, this is something we've looked at a few times in the past and never found any ideal option until very recently. Some of the options are: doing it manually by hiring somebody who is a professional at this, grabbing the closed captions that are automatically generated by YouTube (because all our episodes are already on YouTube), or generating them ourselves in another way. Having someone do it manually, like a professional freelancer or a company who specializes in this, is appealing because it leads to really high quality results by people who do this all the time.

The disadvantage, really the main hurdle, is that it takes time to find somebody who is reliable enough, build a relationship with them, and set up that whole process. It can also be expensive depending on your budget, and then you have communication back and forth that can introduce a delay every time you need to publish an episode. So overall, because it adds to the lead time, it's something we were pretty reluctant to do. Regarding YouTube closed captions, we could have integrated this into our workflow: after we publish a new video on YouTube, we could wait for some time for YouTube to generate those closed captions, then add some code to download them and integrate them into the build process. That seems like a decent enough solution, but there are two major problems with it. Number one, the quality of the transcripts isn't that great: it lacks punctuation, grammar, sentences, and that sort of thing.

Additionally, the YouTube transcripts don't identify different speakers, so if you just converted them into a blog post it would literally be a wall of stream-of-consciousness text, without any punctuation or identification of speakers. So the last solution left is to generate the transcripts ourselves somehow, and since this is an AWS podcast it's to be expected that we would use something like Amazon Transcribe, which is AWS's managed service to perform speech-to-text. You give it an audio file and it gives you back text, and we like the simplicity of that; we're always advocating for using managed services. You can use the Transcribe SDK or the API to generate the transcription in a programmatic way. With a Transcribe client, you call the start transcription job API and provide a reference to an audio file and an output prefix, and it will generate the result as a JSON file. It can also generate subtitle formats like SRT and WebVTT. It runs in batch mode (it can also do real-time transcriptions, but we would be using batch mode for this since it's for on-demand content) and you can get notified with EventBridge when it's finished. Luciano, do you want to talk about what the pros and cons of Transcribe are, and why we ultimately ended up using it, but not entirely?
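To give a rough idea of what calling Transcribe in batch mode looks like, here is a minimal sketch using boto3. The bucket names, job name, and settings are placeholders for illustration, not the actual PodWhisperer configuration.

```python
# Minimal sketch of kicking off a batch Amazon Transcribe job with boto3.
# Bucket names, job name, and settings are placeholders, not the actual
# PodWhisperer configuration.
import boto3

transcribe = boto3.client("transcribe")

transcribe.start_transcription_job(
    TranscriptionJobName="awsbites-episode-63",           # hypothetical job name
    Media={"MediaFileUri": "s3://my-podcast-audio/63.mp3"},
    MediaFormat="mp3",
    LanguageCode="en-US",
    OutputBucketName="my-transcripts-bucket",
    OutputKey="transcripts/63/",                           # output prefix for the JSON result
    Settings={
        "ShowSpeakerLabels": True,                         # we want the speaker labels
        "MaxSpeakerLabels": 2,
    },
    Subtitles={"Formats": ["srt", "vtt"]},                 # optional subtitle outputs
)
```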

Luciano: Yeah, of course. Transcribe is quite good because it gives us that feature we really liked and felt was missing on YouTube, which is that you get different speakers: you get a label that tells you this person is starting to talk, another person is starting to talk from this point, so we can retain in text format the feeling that this is not just a wall of text but an actual conversation between multiple people. There is another good thing, which is that you can customize it: you can add custom models and vocabularies to fine-tune the results, so if you have a very specific domain you can put more work into it and get more accurate results. But in general we were not satisfied with the level of quality. It is not that bad, I think it's still quite a good tool that you can use for most things, but for the kind of scope we had in mind the gap between good and perfect is very noticeable: we were aiming for 99% good, while Transcribe is maybe around 97% good, and that 2% difference is quite noticeable when you are reading some text and expect it to be higher quality. So we were looking for something that could be a little bit better, and that led us to explore other avenues. Pretty much at the same time as we were looking for an alternative, there was a blog announcement by OpenAI that introduced this new tool called Whisper, which is effectively another tool to do speech-to-text, so to recognize speech and convert it to text. This came around last September, I think, and we are going to link the announcement blog post in the show notes. It's the same group of people that created ChatGPT and DALL-E, so you have probably heard of them because right now their products are all the rage. Whisper is probably the least known of these three, but nonetheless it's a very interesting product. We were really excited to try it, so we quickly spun it up and tested it, and we were definitely blown away by the level of accuracy. So we immediately thought, okay, we want to use this because it gives us the level of quality that we want to provide in the end, and if we can automate the whole process this is going to be something we can keep doing very easily without too much overhead in our existing workflow. Still, it wasn't perfect. Unfortunately there were a few small problems: one is that it did not distinguish between speakers, so on one side we're getting more accuracy, but we are losing the ability to distinguish the speakers. The other thing is that it's not built into AWS as a managed offering, so if we were to productionize, so to speak, this solution in AWS, we'd need to figure out exactly how to take the model and run it in AWS. So Eoin, do you want to detail our solution in the end?

Eoin: Yeah, exactly. We wanted to get the best of both worlds. We have OpenAI Whisper, which is this fantastic model that you can run; basically they deliver it as a container that you can run, with a very nice, user-friendly, developer-friendly interface where you just give it an audio file and it gives you the transcript. It might be worth mentioning that it can also do translations, so if you want to transcribe but also generate Italian text, or even transcribe from different languages into English, this is something it's really good at too. We did run it standalone. It comes in different sizes, so depending on the compute resources available to you, you can run the tiny model, the small model, the medium model or the large model, but if you want to get a result within a reasonable period of time, like even less than 30 minutes, you probably need a GPU, so that's something worth bearing in mind with Whisper. So, our solution: what we wanted to do was use the accuracy of the OpenAI Whisper transcript but take the speaker labels from the Amazon Transcribe output, so that we'd have an accurate, speaker-labeled, time-linked transcript; merge the results; and end up with a JSON file that we could use to generate a transcript for the website, with sections that say this is what Eoin said, this is what Luciano said, and make it readable for people, almost like a blog post, right? So we built this using Step Functions.
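As a rough illustration of what running Whisper standalone looks like (outside of SageMaker), here is a sketch using the open-source Python package; the file name and model size are just examples.

```python
# Rough sketch of running OpenAI Whisper standalone with the open-source
# Python package (pip install openai-whisper). File name and model size
# are examples only; the larger models really want a GPU.
import whisper

model = whisper.load_model("medium")              # tiny / base / small / medium / large
result = model.transcribe("episode-63.mp3")

print(result["text"])                             # full transcript
for segment in result["segments"]:                # time-coded segments, no speaker labels
    print(segment["start"], segment["end"], segment["text"])
```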

It's using SageMaker so that we can run the Whisper model with a GPU, and it's also using Lambda for lots of little transformation efforts: for example, if the input audio file isn't an MP3 we convert it to MP3, because that's one of the formats Transcribe supports, and sometimes we're using M4A audio, which Transcribe doesn't natively support. So, how do we do this? How do we even kick off this process? Well, after we finish recording an episode of AWS Bites, we do a bit of editing, we create a video and we create an audio file. That audio file gets pushed up to Anchor, which distributes the podcast to all the podcast channels, but we also take that audio and copy it into an S3 bucket, and that kicks off a whole automated process with this Step Function.
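For illustration, here is a hedged sketch of the kind of pre-processing Lambda that could convert non-MP3 uploads to MP3 with FFmpeg before transcription. The event shape, bucket handling, and names are made up, and it assumes an ffmpeg binary is available to the function (for example via a Lambda layer).

```python
# Sketch of a pre-processing Lambda that converts M4A (or other formats)
# to MP3 with FFmpeg. Event shape and names are hypothetical; assumes an
# ffmpeg binary is available, e.g. through a Lambda layer.
import subprocess
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    bucket = event["bucket"]                 # hypothetical event shape
    key = event["key"]                       # e.g. "audio/episode-63.m4a"

    src = "/tmp/input.m4a"
    dst = "/tmp/output.mp3"
    s3.download_file(bucket, key, src)

    # Convert to MP3; -y overwrites the output file if it exists
    subprocess.run(["ffmpeg", "-y", "-i", src, dst], check=True)

    out_key = key.rsplit(".", 1)[0] + ".mp3"
    s3.upload_file(dst, bucket, out_key)
    return {"bucket": bucket, "key": out_key}
```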

We of course had the previous 61 episodes or so to consider, so we also had to do a backfilling process: we pulled down the RSS feed and kicked off this process for each of the 61 previous episodes by copying that audio up to S3. It's worth mentioning that there is a cost associated with this, and probably a lot of people will be wondering what it costs to run, because we have to run SageMaker with a GPU and we also have to use Transcribe, and both of those can be expensive if you use them at scale; we talked about that in our previous episode on AI services. We did work this out. I can't recall the exact number, but it was definitely less than a dollar per episode for the whole process to run, so it's not too onerous compared to other alternatives at all, I would say. So is it worthwhile talking about some of the orchestration here? How does it all fit together? We mentioned we have Step Functions. We pre-process the input if needed with FFmpeg, then we trigger the two transcription jobs at the same time: the Transcribe job and the SageMaker job. Transcribe means we have to use the AWS SDK integration within Step Functions and then poll until it's complete. With SageMaker we've got a more native integration with Step Functions, where we can just say run this batch transform job; that will kick off a Docker container in the background with the right compute resources and pass our input audio into it. It can run a batch of jobs, but the way we set it up we generally just do one transcription at a time, because we're only doing one a week. When we get both results, Step Functions allows us to take both of those inputs and kick off a Lambda to process them. That's essentially taking these two sets: both systems give you a set of segments with start times and end times, and one of them has speaker labels, so we have to run an algorithm to merge the two. What else is there to mention in this process? What other bells and whistles do we have?
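The actual merging logic in PodWhisperer is more involved, but the idea can be sketched like this: for each accurate Whisper segment, attach the Amazon Transcribe speaker label whose time range overlaps it the most. The data shapes below are trimmed down for the example.

```python
# Simplified illustration of the merge step: take the accurate text segments
# from Whisper and attach speaker labels from Amazon Transcribe by matching
# on time ranges. Data shapes are simplified for the example.

def overlap(a_start, a_end, b_start, b_end):
    """Length of the overlap between two time intervals, in seconds."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def merge(whisper_segments, transcribe_segments):
    """whisper_segments: [{"start", "end", "text"}]
    transcribe_segments: [{"start", "end", "speaker"}] (e.g. spk_0, spk_1)."""
    merged = []
    for seg in whisper_segments:
        # Pick the speaker whose Transcribe segment overlaps this one the most
        best = max(
            transcribe_segments,
            key=lambda t: overlap(seg["start"], seg["end"], t["start"], t["end"]),
        )
        merged.append({
            "speaker": best["speaker"],
            "start": seg["start"],
            "end": seg["end"],
            "text": seg["text"],
        })
    return merged
```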

Luciano: There are some additional things that we do to try to get a little bit higher quality in our final result. For instance, we noticed very common mistakes. For some reason, Whisper doesn't like our names: it was getting my name wrong a few times, and it was mostly getting your name right, Eoin, but spelled in a different way. There are just too many possible spellings, so it was interesting that it recognized the name but you still need to guess which spelling is the right one, so we basically figured that out. There are also other cases where, for instance, names of AWS services would sometimes be consistently wrong, or small things like that. So basically, by reviewing the first results we created a dictionary, and we parse the output and apply a substitution wherever we see these common errors, applying the correction. All of that is somewhat automated, and we can keep improving our dictionary as we find more issues like that. Then, the other thing is that at the end of the day our website is the place where we want to output this result in a way that is visible to people so they can consume it, so we need to somehow hook this entire project into the process that builds our website, and our website is also open source. We'll put the link in the show notes. It is a static website built with Eleventy, so what we do is basically every week, every time there is a new episode, we trigger a new build, and that generates a new version of the entire website, all the HTML pages, assets and so on, and publishes it online on a CDN. So what we wanted to do is integrate this process into our website build, and we thought it would be very nice if the Step Function could just open a PR that pushes the generated file directly into the repository for our website. We did all of that with an additional Lambda at the end of the process. So you might be wondering at this point: did we manage to fully automate everything? I will say unfortunately not entirely, but I think we are close enough. We definitely reduced the manual work to the bare minimum, but what's left to do? There are things we still need to do manually, or at least want to do manually, to retain a decent level of quality. This is also the reason why we do a PR: first of all, it gives us an opportunity to review the transcript before it gets merged, and the other interesting thing is that the PR is effectively just trying to publish a JSON file in our website repository. This JSON file is not ready to go straight away, because the speaker identification just tells us something like speaker 0 and speaker 1; it's not able to tell us which one is which based on the voice, it just distinguishes between two different people. So we need to quickly check who is the first person to talk and assign the name to the right label. This is something we can easily do manually, directly from the GitHub UI, by editing the PR, and in the process we also quickly review, we just eyeball the entire text, and if we spot any other obvious mistake we can fix it manually before merging the PR. So I think that describes more or less the process, what we do in an automated fashion and what we still do manually. What else do we want to share?
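The vocabulary-fix idea described here can be sketched as a simple dictionary of common mis-transcriptions applied to the final text. The entries and names below are invented examples, not the actual PodWhisperer dictionary.

```python
# Sketch of the vocabulary-correction idea: a dictionary of common
# mis-transcriptions applied to the merged transcript. Entries are
# invented examples, not the real PodWhisperer dictionary.
import re

CORRECTIONS = {
    "AWS Bytes": "AWS Bites",
    "Lucianno": "Luciano",
    "Owen": "Eoin",
}

def apply_corrections(text):
    for wrong, right in CORRECTIONS.items():
        # Whole-word replacement of known common mistakes
        text = re.sub(r"\b" + re.escape(wrong) + r"\b", right, text)
    return text

print(apply_corrections("Welcome to AWS Bytes with Owen and Lucianno!"))
# -> "Welcome to AWS Bites with Eoin and Luciano!"
```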

Eoin: Yeah, just to summarize, I think this has really been a step forward in transcription technology and I'm really happy with the level of automation we now have. I think it's just the right balance between manual effort and automation. It's great that you can now use AI and be really confident that you've got a result that people can read without finding it jarring or distracting. One of the things that really surprised me with OpenAI Whisper is how, as you mentioned, it seems to just know what product names and AWS service names are and get them right most of the time. Some of the things where it's less accurate are just things that are hard to predict. Like, AWS Bites isn't exactly a top international brand yet, so sometimes it would spell it with b-y-t-e-s instead of b-i-t-e-s, so there are some things where you'll always have to do those vocabulary substitutions you mentioned. But overall I think this is mostly hands-off and you end up with a really good result for very little cost. So if this is something you want to do for your own podcast, the good news is that everything we just told you about is open source, so you can find the repo on GitHub. It's called PodWhisperer, as a tribute to OpenAI Whisper, because it is primarily aimed at podcasts, but of course you can use it for transcribing meetings or any other kind of audio you could think of. You can follow the instructions in the README and deploy it into your own AWS account, and feel free to contribute back to the project if you think there's something missing, improvements you'd like to make, or something you'd like to change; we'd really love to hear from you. And we'd gratefully appreciate the chance to grow this and spread it around even further. That's all we have for this episode. We hope you liked it and we look forward to hearing your feedback on our transcripts. By the way, if you happen to find a mistake in one of our transcripts, you can easily submit a PR, like Luciano said. The link to the AWS Bites static website repo will be in the show notes. It will help us fix the issue and improve the quality of what we're doing. It's really nice to have everybody contributing to the podcast. We're really enjoying that so far, so thank you, and we'll see you in the next episode.