AWS Bites Podcast

92. Decomposing the Monolith Lambda

Published 2023-08-04 - Listen on your favourite podcast player

In this episode of AWS Bites, we take you on a captivating migration journey. Together, we'll explore how we transformed fullstackbulletin.com's automation process, leaving behind the complexities of a monolithic AWS Lambda and embracing the efficiency of Step Functions.

Join us as we dive into the challenges of automating a weekly newsletter, trying to strike the perfect balance between automation and manual curation. We'll discover the risks of relying on external services and how we navigated these obstacles during our migration.

Together, we'll uncover the step-by-step process of breaking down the monolithic Lambda architecture and orchestrating a more manageable approach with Step Functions. We will also briefly touch on alternative social platforms like Mastodon and other Twitter alternatives during our migration adventure.

Learn with us about different migration strategies and the crucial role of observability for smooth operations. Finally, we will share some valuable lessons that you can apply to your production workloads.

fourTheorem is the company that makes AWS Bites possible. If you are looking for a partner to accompany you on your cloud journey, check them out at fourtheorem.com!

In this episode, we mentioned the following resources:

Let's talk!

Do you agree with our opinions? Do you have interesting AWS questions you'd like us to chat about? Leave a comment on YouTube or connect with us on Twitter: @eoins, @loige.

Help us to make this transcription better! If you find an error, please submit a PR with your corrections.

Luciano: Lambdas are a great abstraction to write event-driven logic and to split this logic into small, composable, and maintainable units. But as developers, we are notoriously bad when it comes to abusing technology, or at least I am. In today's episode, I want to tell you the story of how I ended up creating a terrible monolith Lambda and how five years later, doomed by the shame and guilt of having done that, I am now decomposing that Lambda using Step Functions.

My name is Luciano, and I'm joined by Eoin for another episode of AWS Bites podcast. AWS Bites is made possible by the support of fourTheorem, an AWS consulting partner that works with you to make AWS migration, architecture, and development a success. See fourtheorem.com to find out more. The link is in the show notes. Let me start by telling you a little bit about the context of this episode. I'm talking about a side project that I've been working on for a few years called Full Stack Bulletin, and it's a free weekly newsletter that you can subscribe to at fullstackbulletin.com.

And the idea is that if you are a full stack web developer, then you might want to subscribe because you will receive some interesting information every week, and you can try to keep up with this field that evolves very quickly. And it's always very hard to find interesting information and try to stay up to date with the latest news. I guess that's the context, right? And let's try to figure out a little bit more about the implementation.

This is a serverless project. As I said, something I built a long time ago. And the way I built it is a Lambda project effectively triggered on a schedule every week. And the idea is that I want to automate as much as possible because it's something that I want to keep doing every week, and I might be busy. I cannot spend too much time on this. It's a free activity after all. I think automation helps a lot.

And the idea of this Lambda is that it's going to do as much work as possible to try to pre-compile a draft version of the next newsletter. And then I can edit it manually and improve things. So what this Lambda is doing, the newsletter is basically made up of three parts. There is a book suggestion. There is an inspirational quote. And there are seven links that might come from articles, videos, or GitHub repositories, or other projects that might be relevant.

How do we fetch all this information? From different sources, but the most complicated one is fetching the links. And fetching the links is a bit of an involved process. So the idea is that I'm going to be finding interesting information throughout the week, and I will be tweeting it on Twitter, sharing these links on specific profiles. There is a Full Stack Bulletin profile on Twitter. Now, to differentiate these tweets from other regular tweets, I generally use Hootsuite.

And Hootsuite is a service that allows you to schedule the tweets. And these tweets are also spread out during the week using Hootsuite, because you can just put them on a queue, and they will be automatically spread out. So this automation needs to connect to Twitter, discriminate between the interesting tweets coming from Hootsuite and the regular ones. Then it's going to take all the ones that contain links.

It's going to filter out all the invalid links or the ones that have been used already in previous newsletters. Then it's going to try to rank them. There is a super secret algorithm, which is just counting the number of shares of that particular URL. And then it takes the top seven. And for every link that is selected, we need to get some metadata. So there is a process that scrapes every link, extracts the title and description, and tries to get a relevant image.

Sometimes the image is not available, not every article has an image. So there is a fallback step that uses a placeholder image that is related to technology, something very generic, but that still fits nicely with the overall theme. Then these images are uploaded to Cloudinary, which takes care of CDN delivery, downscaling the picture, and all that kind of stuff. So all this stuff is then taken together and used with Mailchimp. So there is another step that takes all this information and sends it to Mailchimp. There is already a template in Mailchimp. A draft campaign is created. And then I receive a preview by email.
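The link-selection logic described above (drop invalid or already-used links, rank by share count, keep the top seven) could be sketched roughly as follows. This is only an illustration in TypeScript: the `SharedLink` type, the `selectTopLinks` function, and their fields are hypothetical names, not taken from the project's code.

```typescript
// Hypothetical shape for a link collected during the week
// (not the project's real data model).
interface SharedLink {
  url: string;
  shares: number; // how many times this URL was shared
  valid: boolean; // false if the link is broken or unreachable
}

// Pick the links for the next issue: drop invalid or already-used URLs,
// rank the rest by share count (highest first), and keep the top `count`.
function selectTopLinks(
  candidates: SharedLink[],
  alreadyUsed: string[],
  count: number = 7
): SharedLink[] {
  return candidates
    .filter((link) => link.valid && alreadyUsed.indexOf(link.url) === -1)
    .sort((a, b) => b.shares - a.shares)
    .slice(0, count);
}
```

Because `filter` returns a new array, sorting it does not reorder the caller's original list of candidates.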

Eoin: Wow, there's quite a lot of steps involved there. And I'm interested in knowing, I've seen Full Stack Bulletin, and it's really useful and looks really slick. It looks like there's a lot of time put into it. So how much of that is manual? I mean, I can imagine that just taking the top seven tweets based on the number of shares is probably good in most cases. But sometimes you end up with things that are not so relevant, or maybe they've gone viral for the wrong reason. So how do you manage that? Is it just a manual process where you have to curate all of this? Is it fully automated?

Luciano: So I think my goal has always been that I want to keep the curation. So I don't want to get random stuff. I still want every single piece of content that comes out to be something that I somehow selected before. So every week, everything I read that I think is relevant, I will be sharing it. And I think that's part of the curation process. Then ideally, I wouldn't want to spend more than 15 minutes every weekend, because I generally do the process where I receive the email, the draft, every Friday evening.

And then I have a couple of days just to review and refine. And then the next Monday afternoon, the new issue will go out. So in my mind, I would like to spend no more than 15 minutes. In reality, I will be spending a bit more, because generally, the draft is not perfect. I will need to change images that are not relevant. Some articles might not be very relevant, so I might want to swap them out with other ones.

Or sometimes there are a few articles that are very similar. So I don't want just one topic to be a recurring thing in the newsletter for the week. So that's also another reason to try to find different articles. And then for every single article, I try to change the description a little bit, because the style is always very different between one description and the other. And also, I would like to add a little bit of personal touch.

Why did I think that that particular link was interesting? Was it something that I played with during that week? Or is it some technology that I want to experiment more with? Or is it maybe something new that I think is, I don't know, revolutionizing some kind of field, and therefore is going to be something to watch out for? So I try to add that bit of, this is why I think this link is something that should be worth your attention. And that ends up taking a lot of time sometimes. Some weeks it's really 15 minutes, but most of the time, it's probably one hour a week that I just spend refining the newsletter.

Eoin: That sounds like a really nice process, but I suppose one of the things about the process that I'm kind of thinking might be problematic is that you're dependent on third party systems. So we're talking about Hootsuite and Twitter and Cloudinary. And I guess when you have external systems in the mix, you have to think about fault tolerance and also what happens when they change their contract and their API. So what has the implementation been like? And what have the problems been with it? Because I know you've been live streaming stuff on this recently. Maybe you can give people a bit of background.

Luciano: Yeah, I think that this Lambda is something that I wrote, I don't know, five years ago, if not more. And I didn't really have to change it much since then. So even if today I'm not super happy with that implementation, it hasn't really been a problem so far. Lately, it has become a problem because lots of things started to break. And as you said, you depend on external services and things might change with external services.

Funny enough, lots of them changed at the same time, more or less. So I was left with a lot of problems to try to solve in a very short amount of time. One problem was that Twitter not only made it very difficult to use their APIs, but very expensive as well. There are some free APIs that you can use, but I think it was very hard for me, first of all, to understand how you even get API keys. I wasn't able to get API keys for my own personal account.

I was able to get them eventually for the Full Stack Bulletin account. But then with those free API keys, supposedly you can only read user information but not even the tweets. So it wasn't really useful at the end to do anything, right? So I had to change strategy there. And we'll talk about that in a second. Then also, Hootsuite decided to revoke any free plan. I was able to use the free plan for a while because I was tweeting maybe around 20 tweets per week.

So it was not a lot. It was definitely in the realm of the free plan. But now there is no free plan at all. And I think that the cheapest one starts at $50 or $60, which is definitely not worth it for something I'm doing basically for free. And then there was another service called Place Image, which is one of those thumbnail generator services that you would use, I don't know, if you are sketching out a landing page and you want to have some placeholder image. It was really nice because just by composing a URL with specific parameters, you could get a random image from different categories, for instance, technology.

That service was a free service and now it's shut down. So I needed to figure out an alternative for that one. And with all these things starting to break, initially it was a bit tricky to understand why the newsletter was failing this week. Why is this Lambda not sending me the preview? And I would log into AWS, look at the errors in the console, and it would be just a random JavaScript error somewhere.

It was very hard to tell why that was happening, for a few reasons. Admittedly, this Lambda was in a very bad state code-wise. It was written five years ago. Async/await wasn't even available in Node.js. So I was using Babel and Webpack to transpile everything and be able to still use async/await to make the code a little bit easier to write. But that made it so that when you have a stack trace, even though I was trying to use one of those source map libraries that were available five years ago, it doesn't always resolve correctly. So sometimes you get stack traces that don't make sense at all. They will just point you to a random place, which is not really the place where the error is happening. So you just need to rely on the error message and try to simulate things locally and try to figure out, OK, where is this really breaking? Because if you just look at the stack trace, it's not really telling you the truth sometimes. So definitely I had to get my hands dirty and try to fix all these things. And at that point, I realized, well, why did I put everything in one Lambda when there are clearly a bunch of different separated logical steps? So maybe a Step Function there could have been much better.

Eoin: I hope you don't mind. But I'm looking at an old version of the Lambda function handler from 2020. I just picked one at random just to see what you're getting at here. And it looks like it's pretty well structured. I mean, it actually reminds me of the kind of orchestration logic you might have in a web service endpoint where you've got lots of modules. You've broken down each step in the process into different Node modules. And then you're calling them one after another using async/await. And it's very easy to actually understand. And this is the nice thing about it, that you can kind of see it top to bottom. And the process is pretty clear, at least from a kind of code understandability point of view. So what is the motivation then to move to Step Functions? And what is your thinking there?

Luciano: Let me try to describe first what I am thinking in terms of what is going to be the final state of the Step Function. And big disclaimer, this is still in progress. I have done some steps, but it's not fully completed. So what I have in mind for the final design is that basically this will be a Step Function with a bunch of different steps. The first one is going to be like a parallel step where we are fetching information from three different data sources, one for the book, one for the quote, and then we have all the links.

So that could be three different branches. And each and every one would fetch their own information in parallel. Now, the book and the quote will be relatively easy, but fetching the links is a little bit involved because inside that branch, technically, there could be multiple steps, maybe all in sequence, one after the other. Maybe the first step fetches the links from Twitter, which now has been replaced with Mastodon, just because I cannot use Twitter anymore.

So fetch all the base links. And then there might be another step that tries to filter out the ones that are broken or not relevant or used before, and another step where maybe you do the top seven. And then once you have the top seven, you could do a map step to try to scrape all of them in parallel and get the metadata for each and every one of them. At that point, outside the big parallel step, we could have an extra step that just takes all of this information from the state and uses it to create the Mailchimp preview and send it by email.
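To make the target design a little more concrete, here is a hedged sketch of how such a state machine might look in Amazon States Language, written as a TypeScript object. The state names, the Lambda function names, the account and region in the ARNs, and the `$.links` path are all invented for illustration; the real project may be structured differently.

```typescript
// Placeholder ARN prefix: account, region, and function names are made up.
const LAMBDA_ARN_PREFIX = "arn:aws:lambda:eu-west-1:123456789012:function:";

interface TaskState {
  Type: "Task";
  Resource: string;
  Next?: string;
  End?: boolean;
}

// Builds a Lambda Task state; it is terminal when no `next` state is given.
function lambdaTask(functionName: string, next?: string): TaskState {
  const state: TaskState = { Type: "Task", Resource: LAMBDA_ARN_PREFIX + functionName };
  if (next) {
    state.Next = next;
  } else {
    state.End = true;
  }
  return state;
}

const newsletterStateMachine = {
  Comment: "Sketch of the target design described in the episode",
  StartAt: "FetchContent",
  States: {
    // Three branches run in parallel: book, quote, and links.
    FetchContent: {
      Type: "Parallel",
      Branches: [
        { StartAt: "FetchBook", States: { FetchBook: lambdaTask("fetch-book") } },
        { StartAt: "FetchQuote", States: { FetchQuote: lambdaTask("fetch-quote") } },
        {
          StartAt: "FetchLinks",
          States: {
            FetchLinks: lambdaTask("fetch-links", "FilterLinks"),
            FilterLinks: lambdaTask("filter-links", "PickTopSeven"),
            PickTopSeven: lambdaTask("pick-top-seven", "ScrapeLinks"),
            // Map state: scrape metadata for every selected link concurrently.
            // Assumes the previous step outputs an object with a `links` array.
            ScrapeLinks: {
              Type: "Map",
              ItemsPath: "$.links",
              ItemProcessor: {
                StartAt: "ScrapeLink",
                States: { ScrapeLink: lambdaTask("scrape-link") }
              },
              End: true
            }
          }
        }
      ],
      Next: "CreateCampaign"
    },
    // Receives the array of branch outputs and creates the Mailchimp draft.
    CreateCampaign: lambdaTask("create-campaign")
  }
};
```

A Parallel state outputs an array with one entry per branch, so in this sketch the final step would receive the book, the quote, and the scraped links in that order.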

So that would be the structure. And I think with this idea, the point is basically, if something fails, it's going to be very easy for me to figure out exactly which step is failing. So that's already a big advantage, because you visually see the representation of the Step Function. You see the execution. You can see all the green nodes, and you can see where the nodes are red. And then you can click and start to zoom in exactly on that particular Lambda.

So that reduces the scope for the error. The other thing is that if something fails in a transient way, for instance, I don't know, maybe a service has a networking glitch, you are not able to complete a request. But if you retry, that time it will go well. So with Step Functions, it's actually very easy to define this retry logic without having to write custom code for it. So that's another reason. And in general, I think the fact that you can structure all the parallel steps so easily should make everything faster, because you are not creating that sequential logic. Or you are not making your code more complicated to try to do things in parallel at the level of your code. But you let the Step Function do all the parallelization and dealing with concurrent stuff wherever possible.
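As an illustration of the declarative retry logic mentioned here, the sketch below shows a Task state with a `Retry` rule, plus a small helper that computes the resulting wait times. The function name, ARN, and the specific numbers are assumptions chosen for the example. The helper ignores optional settings such as jitter or a maximum delay.

```typescript
// Fields of an Amazon States Language retry rule used in this sketch.
interface RetryRule {
  ErrorEquals: string[];
  IntervalSeconds: number; // wait before the first retry
  MaxAttempts: number; // maximum number of retries
  BackoffRate: number; // multiplier applied to the wait after each retry
}

// Example values only: retry up to 3 times, waiting 2s, then 4s, then 8s.
const retryRule: RetryRule = {
  ErrorEquals: ["States.TaskFailed"],
  IntervalSeconds: 2,
  MaxAttempts: 3,
  BackoffRate: 2
};

// A Task state with the retry rule attached (placeholder ARN).
const scrapeLinkState = {
  Type: "Task",
  Resource: "arn:aws:lambda:eu-west-1:123456789012:function:scrape-link",
  Retry: [retryRule],
  End: true
};

// Computes the wait (in seconds) before each retry attempt for a rule.
function retryDelays(rule: RetryRule): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < rule.MaxAttempts; attempt++) {
    delays.push(rule.IntervalSeconds * Math.pow(rule.BackoffRate, attempt));
  }
  return delays;
}
```

Matching on `States.TaskFailed` retries almost any task failure, including non-transient ones, so a real definition might list narrower error names instead.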

Eoin: That makes a lot of sense to me. Just since you're talking about Mastodon, I mean, I'm on Mastodon too, but I don't think I've reached the same level of engagement as we had previously on Twitter before it went the way it is. Where it just seems, I don't know, every time I open Twitter, it's worse in lots of different ways. But let's maybe share our handles for Mastodon and the Full Stack Bulletin one as well in the show notes for all the listeners who are living in the Fediverse, and we can try and grow the engagement that way. I'm also wondering, are you thinking about Bluesky, Threads, LinkedIn, or other platforms as well for Full Stack Bulletin? Because it seems like things are going to spread out and move beyond just one platform for this kind of interaction.

Luciano: Yeah, absolutely. I did think about using pretty much all of them. I think I ended up with Mastodon just because the APIs are so easy to use. And there is a very good Node.js client that pretty much did everything I needed in one function call. Also, you don't have a complicated authentication process. You create an app, and then you can get an application token. So you don't need to do like an OAuth 2 kind of process just to be able to get your own tokens. So it was really easy to set up for this particular use case, and I was able to do that transition in a couple of hours. So I was impressed by how easy it was to use that API. And I think going with LinkedIn or Bluesky or others would have been more involved. So I just went for the one that was giving me the easiest path to migration. Even though, on the platform itself, I'm not getting a lot of engagement yet. So maybe that's something that will grow, but I don't know. I have some doubts that it's going to get to the levels of Twitter, to be honest.
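For a rough idea of what reading posts with an application token involves, the sketch below builds a request for Mastodon's REST endpoint that lists an account's statuses. It deliberately uses the plain HTTP API rather than the Node.js client mentioned above, whose interface is not described in the episode. The instance, account id, and token are placeholders, and `buildStatusesRequest` is a hypothetical helper.

```typescript
// Description of an HTTP GET request, ready to be passed to any HTTP client.
interface MastodonRequest {
  url: string;
  headers: Record<string, string>;
}

// Builds a request for GET /api/v1/accounts/:id/statuses, skipping replies
// and boosts so that only the account's own posts are returned.
function buildStatusesRequest(
  instance: string,
  accountId: string,
  token: string,
  limit: number = 20
): MastodonRequest {
  const query = `limit=${limit}&exclude_replies=true&exclude_reblogs=true`;
  return {
    url: `https://${instance}/api/v1/accounts/${encodeURIComponent(accountId)}/statuses?${query}`,
    // The application token is sent as a bearer token, no OAuth dance needed.
    headers: { Authorization: `Bearer ${token}` }
  };
}
```

The returned object can then be handed to whatever HTTP client the Lambda already uses.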

Eoin: The fact that you have it now or you're moving towards a Step Function means that it will be easier to modularize it and orchestrate multiple platforms in the future. So what's the plan to migrate? Are you looking at a big migration effort here to get to the end state?

Luciano: Yeah, I think the key here is that this is a side project, and I'm not going to be investing more than a couple of hours a week on it. And the refactoring is probably about one hour a week when I'm doing my live streams. So I need to be strategic about that. I cannot do a big bang type of migration where I'd be spending hundreds of hours, and maybe two years later, I will swap the thing entirely. Also because it was broken at the time.

So I needed to figure out how to do a migration that will fix the problem while progressing in that direction of the migration. So the idea was that every time I do a change, I need to figure out what is the minimum valuable, I guess, change in the direction that I want to go so that I don't break things, and I can ship it to production and make sure that I got some extra value. I went a little step forward in that direction while still keeping something that works.

So the idea there was let's try to extract. The first step was, OK, let's just take the monolith Lambda and wrap it in a Step Function. This Step Function doesn't do anything special. It's just one state. But at least now we are in the realm of Step Function. That was my first change. Then after that change was in place, I was still able to deploy and run it. It was still failing. Meanwhile, I had to fix in that Lambda some of the problems.

Like for instance, I swapped Twitter for Mastodon. I removed Hootsuite. I replaced Place Image with Unsplash, which also has nice APIs and a free tier that you can use. And at that point, I had everything working in a Step Function. Still very monolithic, but it was something that I could use to produce the next newsletter. From there, it's very easy to start to extract states. For instance, the first state that I extracted, I didn't even create a parallel step yet.

I just created two sequential states where I think I started first with either the quote or the book. I'm not sure. But basically, there were two steps. One extracted the quote and added it to the global state of the Step Function. And then everything else was pretty much the same monolithic Lambda, except that rather than fetching the quote itself, it was just reading it from the state. At that point, the next step was, OK, let's start to create a parallel step at the beginning, where I can do multiple things together. So not just fetch the book, but fetch the book and the quote, and then use them and put them in the global state and make the current monolithic Lambda a little bit slimmer, because it can just read things from the state, rather than doing more stuff. And the next steps would be to start to branch out also the link fetching step and processing step into another branch of the parallel step and remove all of that code from the monolithic Lambda. So basically, the monolithic Lambda is still living as the last step of the Step Function, but it's becoming slimmer and slimmer as I extract out steps that will go into their own dedicated Lambda functions.
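The incremental path described here can be pictured as three successive state machine definitions, each one deployable on its own. The sketch below is an assumption-laden illustration: state names, function names, ARNs, and the `$.quote` and `$.prefetched` result paths are invented, and the episode leaves open whether the quote or the book was extracted first.

```typescript
// Placeholder ARN prefix; function names are invented for illustration.
const FUNCTION_ARN_PREFIX = "arn:aws:lambda:eu-west-1:123456789012:function:";

interface LambdaTask {
  Type: "Task";
  Resource: string;
  ResultPath?: string;
  Next?: string;
  End?: boolean;
}

// The original monolithic Lambda, always kept as the last step.
const monolith: LambdaTask = {
  Type: "Task",
  Resource: FUNCTION_ARN_PREFIX + "monolith",
  End: true
};

// Stage 1: the monolith wrapped in a single-state Step Function.
const stage1 = {
  StartAt: "Monolith",
  States: { Monolith: monolith }
};

// Stage 2: the quote gets its own Lambda. Its result is stored in the state
// under $.quote, and the monolith reads it instead of fetching it.
const fetchQuote: LambdaTask = {
  Type: "Task",
  Resource: FUNCTION_ARN_PREFIX + "fetch-quote",
  ResultPath: "$.quote",
  Next: "Monolith"
};
const stage2 = {
  StartAt: "FetchQuote",
  States: { FetchQuote: fetchQuote, Monolith: monolith }
};

// Stage 3: book and quote are fetched in parallel before the monolith runs.
const fetchBookBranch: LambdaTask = { Type: "Task", Resource: FUNCTION_ARN_PREFIX + "fetch-book", End: true };
const fetchQuoteBranch: LambdaTask = { Type: "Task", Resource: FUNCTION_ARN_PREFIX + "fetch-quote", End: true };
const stage3 = {
  StartAt: "FetchBookAndQuote",
  States: {
    FetchBookAndQuote: {
      Type: "Parallel",
      Branches: [
        { StartAt: "FetchBook", States: { FetchBook: fetchBookBranch } },
        { StartAt: "FetchQuote", States: { FetchQuote: fetchQuoteBranch } }
      ],
      ResultPath: "$.prefetched", // array: [book, quote]
      Next: "Monolith"
    },
    Monolith: monolith
  }
};
```

At every stage the monolith remains the terminal state, which is what allows each change to ship to production while the extraction continues.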

Eoin: Given that you're talking about changing a single Lambda function into Step Function with multiple Lambda functions being orchestrated, do you have to think more then about observability and how to monitor when things go wrong?

Luciano: Yeah, so I think that just by virtue of being a Step Function, it is already a little bit more observable, because if something goes wrong, you can just open that execution. It's very easy to see where things are failing and what the error is. If I look at the code, at the time, I used this debug module from Node.js, which allows you to create effectively logs, and then you can select a log level, and you can make it more or less verbose, depending on what you want to see.

So that's something that is already in place. It's not perfect. There might be some refinement work there to make it nicer, maybe using structured logs, just to make it easier to send it to CloudWatch and then use CloudWatch Logs to query the logs. But I think overall, it's a good starting point. We can improve it. Other things could be creating custom alarms. Like right now, there is no alarm. I think I just get worried if it's Friday evening and I didn't get the preview email, and I have to go in and check out, OK, why did it fail? It would be nice to just have an alarm if the Step Function fails that sends me an email anyway, saying, you're not getting it because it failed and it failed for this reason. So definitely, there is some room for improvement there. And one tool that we have been talking about that we built at fourTheorem is SLIC Watch. So it will be very easy, because I'm using SAM, to integrate SLIC Watch and get some of that done automatically for me at the stack level.
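As a small illustration of the structured logging idea, the sketch below formats each log entry as a single JSON line, which makes individual fields easier to filter on once the output lands in CloudWatch Logs. The `formatLog` helper and its field names are hypothetical and not part of the project.

```typescript
type LogLevel = "debug" | "info" | "warn" | "error";

// Formats a log entry as one JSON line. Extra fields (step name, URL, ...)
// are merged in so they can be queried individually later.
function formatLog(
  level: LogLevel,
  message: string,
  fields: Record<string, unknown> = {},
  now: Date = new Date()
): string {
  return JSON.stringify({
    timestamp: now.toISOString(),
    level,
    message,
    ...fields
  });
}
```

Inside a Lambda handler this might be used as `console.log(formatLog("error", "scrape failed", { step: "ScrapeLink" }))`, with the `now` parameter left at its default outside of tests.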

Eoin: That's true. If you've got an SNS topic, you can just get alerts for any failing Step Functions or Lambda functions, APIs, anything else. This sounds like it's really going in a nice direction, and it's quite a professional approach you're taking, given that it's kind of a side project. But since you want to run it once a week, it's a fairly controlled environment, and you're in control of it. So does it really all matter that much to take this migration approach? Is it just an interesting exercise for you? Or is there some higher level lesson that we can extrapolate from this for more serious production workloads, where there might be users relying on the feature 24-7?

Luciano: Indeed, you're right. I mean, I could definitely afford to let this fail, and I have a weekend to fix it. And even if I don't fix it, the worst that happens is that I'm not going to be publishing a newsletter for a week, which is not the end of the world. So I'm probably over-engineering this thing a little bit as an interesting exercise. But I think that there are lessons there that we can extrapolate for more serious production type of workflows.

And one lesson I think is that I've rarely seen in my career a Big Bang migration succeed. I think you end up spending way too much time, budget, and energy on Big Bang migrations. And very often what happens is that when you are very close to completion, somebody is just going to cancel the project, because it has been years in the making, and nobody has seen value so far. So yeah, that's always the way they go.

So trying not to make another Big Bang migration, I think, is in itself a good exercise. And I think I can use that practice a little bit more just to force myself to think, OK, even if it seems longer, because you are doing small incremental steps, and sometimes you have to work around things a little bit just to make it incremental, it's still functional. But you get to see value every single time you do a change.

I think it's something that I need to push myself to think into those terms a little bit more. So in that sense, the exercise is good. And in general, it's a lesson for every migration. Try to think that way whenever possible, even if it seems you're doing more work, but you are doing work incrementally, and every single change gives you value immediately. Then the other thing is that moving to Step Function has a bunch of advantages that we described.

So I think by doing that, I will be having something that is more observable by default, as we said. And it is something that long term I can maintain more easily. And another idea there is that if I manage to split everything out into their own functions, then if something changes, for instance, as you said, I want maybe to swap Mastodon for something else, or maybe I want to start to include multiple sources, it should be easier to do it with a Step Function where you can just create more steps or change very specific steps rather than thinking, OK, there is a massive monolithic code base.

Where do I do the changes? How do I change the tests? How do I test everything again? While if you can do all these things in isolation and then just compose them, it should be a little bit easier. And another idea there which could be relevant in a production environment is, let's just imagine that at some point you realize that you have a bottleneck, and you have a Lambda that takes forever because you're doing maybe something very compute-intensive. You could decide to swap the Lambda for something else, or you could rewrite it in another language. I think that that composability gives you a lot of opportunities for change where optimization opportunities arise. So maybe you want to rewrite something in Rust for fun or for performance, you can rewrite one Lambda at a time, for instance, or maybe just very specific Lambdas.

Eoin: I like the way you're taking the one bite at a time approach to eating this elephant. And I think it's going to be fascinating to watch the rest of the live streams and see where this ends up.

Luciano: Yeah, and on that note, I want to mention that we will have some links in the show notes because all the code is open source, has been open source since day one. So if you're curious, you can see all the evolution by just looking at the history. And we'll have the link of the repository on the show notes. I'm doing the live streams on Twitch, so you can check out my Twitch profile. If you're interested, generally it's every Monday afternoon in Irish time zone.

And there is also a playlist on YouTube with the previous recordings. So if you want to check out the previous episodes, just to see the incremental changes, you can do that there. So on that note, I think we've reached the end of this episode. I hope that this migration story is going to be interesting for you. If you have done something similar, I'd be curious to know what you did. Maybe you took a different path using different services or maybe a different architecture. So definitely share your experience with us in the comments or reach out to us online on Twitter or LinkedIn or Mastodon as well. All right, thank you very much. We'll see you in the next episode.