Help us to make this transcription better! If you find an error, please
submit a PR with your corrections.
Luciano: Have you been following re:Invent 2022? If you're following this podcast, you are probably into AWS and aware of everything that was announced at AWS this year. So don't worry, today we're not going to give you another recap of re:Invent. We're going to spare you all of that and just focus on the three announcements that we are most excited about, and we want to tell you exactly why we care so much about these particular ones. So of course, this is going to be a very personal take, and we do look forward to disagreements and to hearing what you liked the most instead. My name is Luciano and today I'm joined by Eoin, and this is yet another episode of AWS Bites podcast. AWS Bites is sponsored by fourTheorem. fourTheorem is an AWS consulting partner offering training, cloud migration and modern application architecture. Find out more at fourtheorem.com. You'll find the link in the show notes. So, okay Eoin, let's maybe start with the first announcement. What did you like the most?
Eoin: I think the standout one for me was one that really heads in the right direction and I'm talking about a new addition that allows you to reduce the amount of glue code you have to write in your applications. And when you're building serverless applications and event-driven applications, this is really important because you can end up with a proliferation of Lambdas that don't do very much otherwise.
So we're talking about EventBridge Pipes. Now we've talked a lot about EventBridge in previous episodes as part of our event series. EventBridge allows you to create event-driven applications by publishing messages to a broker or a bus and setting up pattern matching rules for consumers. EventBridge is also one of the fastest developing AWS services. So this announcement followed quickly from the release of EventBridge Scheduler, which is another really exciting pre-re:Invent announcement.
What we like about EventBridge is that it's extremely easy to get started with. The amount of setup is reasonably minimal, especially compared to other event services, and you can send and receive events straight away. It also supports native events, so you can listen to other things happening in your AWS account very easily. So if a new object was created in a bucket, you can listen to that. If an ECS container has stopped, you can listen to that too and react to it.
By the way, you listen to an event like one of these by creating an EventBridge rule. A rule is essentially a pattern that allows you to capture all of the events that conform to that pattern, and there's a content filtering syntax that allows you to do that. Then you can specify one or more targets that will receive that event as an invocation, like a Lambda function, an SNS topic, a queue, a Kinesis data stream or an HTTP endpoint.
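To make that a bit more concrete, here is a minimal boto3 sketch (our own made-up example, not something from the episode) of a rule that matches S3 "Object Created" events for a hypothetical bucket and forwards them to a hypothetical Lambda function:

```python
import json
import boto3

# Minimal sketch: a rule that matches S3 "Object Created" events for a
# hypothetical bucket and forwards them to a hypothetical Lambda function.
# (The bucket would need EventBridge notifications enabled for this to fire.)
events = boto3.client("events")

events.put_rule(
    Name="object-created-rule",
    EventBusName="default",
    EventPattern=json.dumps({
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {"bucket": {"name": ["my-example-bucket"]}},
    }),
)

events.put_targets(
    Rule="object-created-rule",
    EventBusName="default",
    Targets=[{
        "Id": "process-object",
        "Arn": "arn:aws:lambda:eu-west-1:123456789012:function:process-object",
    }],
)
```

In a real setup the Lambda function would also need a resource-based policy allowing events.amazonaws.com to invoke it.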
So what are EventBridge Pipes and why are they so cool? Well, the idea with EventBridge rules is that you're dealing with PubSub type interactions. So you have one producer and multiple consumers. With EventBridge Pipes, they're point to point. So it allows you to take an event from one source and pass it to another with optional filtering, transformation and enrichment. So the goal here is to avoid you having to write more custom code.
So for example, previously you would typically have to write a lot of Lambda functions that just take data from one source and put it into a queue. You don't have to do that anymore. So let's dive into it a little bit and talk about the various constructs in a pipe. Werner Vogels in the keynote compared this to Unix pipes and the Unix philosophy, with standard input and standard output and text as the interchange format between them.
So it's very much along the lines of that same principle. So you have event sources, and most of the services supported as EventBridge rule targets are supported here too. I think they mentioned in the blog post, or in some of the Twitter commentary about this, that the event sources were very much inspired by Lambda's event source mappings. So you can take events from DynamoDB streams, from Kinesis data streams, SQS, Kafka (your own Kafka or AWS's managed Kafka) and Amazon MQ, and you can then send them on to Step Functions, Kinesis data streams, Lambda, third-party APIs, API Gateway.
Just like with an EventBridge rule, you can put an input transformer in place, so you can transform the event before you send it on to the target. You can also filter events, which is really, really important. You use the same syntax as you do with EventBridge rule patterns, and you use that to essentially filter down to the subset of the events coming from that source that you want to forward to the target.
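As a rough illustration (again our own example with made-up names and ARNs, not something from the announcement itself), creating a pipe with a source, a filter and a target could look something like this with boto3:

```python
import json
import boto3

# A minimal sketch of a pipe that reads from a hypothetical SQS queue,
# keeps only messages whose JSON body has type ORDER_CREATED, and
# forwards the matches to a hypothetical Step Functions state machine.
pipes = boto3.client("pipes")

pipes.create_pipe(
    Name="orders-pipe",
    RoleArn="arn:aws:iam::123456789012:role/orders-pipe-role",
    Source="arn:aws:sqs:eu-west-1:123456789012:orders-queue",
    SourceParameters={
        "FilterCriteria": {
            "Filters": [
                {"Pattern": json.dumps({"body": {"type": ["ORDER_CREATED"]}})}
            ]
        },
        "SqsQueueParameters": {"BatchSize": 10},
    },
    Target="arn:aws:states:eu-west-1:123456789012:stateMachine:process-order",
    TargetParameters={
        "StepFunctionStateMachineParameters": {"InvocationType": "FIRE_AND_FORGET"}
    },
)
```

An enrichment step, which Eoin gets to next, slots into the same call as an extra Enrichment ARN with its own parameters.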
So that's pretty much it. If you're a Unix fan, it's a bit like when you cat a file, pipe it to grep to filter out some of the lines, and then send that on to the wc command to get a word count. So it's a similar idea, right? You've got sources, filters, and targets. One of the things that you can do with EventBridge Pipes then is also enrichment. So this allows you to call out to other services or an API to get additional data and add it into the event.
So you can call out to Lambda, Step Functions, an HTTP API or API Gateway. And then you can also transform the result of that too. The other thing I'd probably mention is that pipes also support DLQs. So again, this is a fairly reliable way of taking data from one system and passing it on to another. So just to summarize EventBridge Pipes, I think it's going to be very powerful. Hopefully, it'll allow a lot of people to delete Lambda functions they don't need anymore, and focus on using Lambda for meaningful computation rather than just transporting data from A to B. The main difference, just to summarize, is that it's point-to-point with pipes, but it's pub-sub with EventBridge rules. With pipes, you don't have to write code to take an event from the source and put it onto EventBridge. With EventBridge rules, you're writing a rule for an event that's already coming along to the bus, so somebody still has to put it onto the bus. With pipes, it takes care of taking the data from the source for you. And the other difference between pipes and rules is that pipes have enrichment support as well, so you can do that with a lot of different types of enrichment. So what do you think? Is that your number one as well? I was a little bit tempted to go with that one.
Luciano: I was really excited about that one. But since you covered it already, I'm going to talk about the other one that I really liked, which is Step Functions distributed map. Step Functions is also a topic that we have talked about in the past. So what's so interesting about distributed map? Well, in a Step Function you can already do a map step, and that map step is useful when you basically want to take a bunch of different inputs.
For instance, coming from the previous state, you have an array and you want to do something repeated n times for every item in that particular array. And that works really well. There are a lot of practical applications for that, but it's very limited. You cannot process more than 40 items concurrently. So where distributed map is trying to improve things is to try to raise that limit much, much more and give you a much higher throughput if you really have to process a large number of things concurrently.
And it also takes a slightly different approach. I'm going to try to describe how. But the first thing worth mentioning is that where the limit is 40 for regular map, with distributed map, the concurrency limit is 10,000 items. And it's even more interesting than that because you can process up to 100 million items in total. So a full distributed map can have a maximum number of 100 million items and they will be processed 10,000 at a time.
So you can imagine the difference in scale already. Now, how does it work in practice? Because the model is slightly different from what you would use with a regular map. Each map iteration is basically running a child Step Function execution, and that execution has its own execution history. So it's, in a way, an orchestrator of child Step Function executions every time you're running that distributed map step.
The input is taken from S3. This is another big difference. With the regular map step, you generally take either the entire state of the execution or a portion of that state, selected with a JSONPath expression, and basically you are just saying, take this array and repeat some other steps for every item in that array. Instead, with distributed map, you take data from S3, and that needs to be some kind of structured file format.
It can be JSON, it can be CSV, or you can even use the result of an API call like S3 list objects. So that basically is the way that you can load a lot more data into Step Functions, which addresses another limitation that we have with the traditional map step, where you are limited to the state, which cannot hold a lot of data. With distributed map, you can actually process big data files and repeat that operation with very high throughput and concurrency.
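To give a rough idea of what that looks like (this is our own sketch, with a made-up bucket, key and function ARN), a distributed map state in Amazon States Language could be defined like this, here built as a Python dict:

```python
import json

# Rough sketch of an Amazon States Language definition with a DISTRIBUTED
# map state that reads items from a CSV file in S3. The bucket, key and
# Lambda ARN are placeholders.
definition = {
    "StartAt": "ProcessFiles",
    "States": {
        "ProcessFiles": {
            "Type": "Map",
            "End": True,
            "MaxConcurrency": 10000,
            "ItemReader": {
                "Resource": "arn:aws:states:::s3:getObject",
                "ReaderConfig": {"InputType": "CSV", "CSVHeaderLocation": "FIRST_ROW"},
                "Parameters": {"Bucket": "my-data-bucket", "Key": "items.csv"},
            },
            "ItemProcessor": {
                "ProcessorConfig": {"Mode": "DISTRIBUTED", "ExecutionType": "STANDARD"},
                "StartAt": "ProcessItem",
                "States": {
                    "ProcessItem": {
                        "Type": "Task",
                        "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:process-item",
                        "End": True,
                    }
                },
            },
        }
    },
}

print(json.dumps(definition, indent=2))
```

Each CSV row becomes the input of one child execution, and those child executions run up to the MaxConcurrency you configure.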
So what are some of the use cases? Definitely batch processing. So for instance, you might have a lot of files in S3, maybe representing some valuable piece of business information. I don't know, maybe something around analytics for your e-commerce. For every product in your e-commerce, you might have a JSON file that tells you all the IP addresses that looked at that product.
And you might want to do some analytics to try to figure out in which regions every single one of your products is relevant for people, so you can do some marketing. That could be a use case where the Step Function takes all your files in parallel, and then every child Step Function crunches the data and gives you some analytics about that. This is just to give you a random example, but you can come up with other examples like, I don't know, financial modeling.
So you might be running some models over your data set and trying to come up with some results, maybe, I don't know, calculating a risk score for specific deals that your organization is working on. Or another use case, which is apparently our favorite one because we end up mentioning it in almost every episode: transforming images, maybe creating thumbnails of images that you have somewhere in S3, or maybe even just extracting information from those images, connecting with other services, maybe doing some computer vision analysis and then figuring out, okay, what is a description for every single image?
So you can imagine basically all these kind of orchestration workflows where you're starting with a lot of data from S3 and you just want to map basically S3 files to something else. You can create a Step Function and use this distributed map functionality now. Now, there are of course also some limitations. So although I am very excited, I am also a little bit disappointed that they didn't go just a step further, which would have been even more amazing.
So maybe this is my wish list for re:Invent 2023 already. But basically what I was a bit disappointed about is that you can only deal with a flat list of inputs. So basically, the mental model is that there is no correlation between inputs; you just run everything concurrently. Of course, there is a concurrency limit, but you cannot create something like a dependency graph where you could say, I need to run this file first, then I can use the output of this other file to run something else, and create a more complex, orchestrated way of running the workflow, which can be very convenient in some cases.
Again, I'm thinking of risk modeling, where maybe you have data that needs to flow from one deal to another for that computation to make sense. And another thing is that you cannot dynamically add items to the execution run, which basically means that even if you wanted to have your own custom orchestration logic on the outside and push things into this pool of things to process, that's not really something that is supported today.
You just need to define everything in advance and the Step Function is just going to take that input, create this kind of execution, run it, and then eventually you'll get your results. And also cost might be an issue, because you are basically doing a huge number of state transitions in the Step Function. So you need to be careful and try to come up with some numbers to make sure that the cost is going to be reasonable for you, depending on the type of computation you're going to do and the number of files that you're going to process. So that's everything I have about distributed map. I don't know if you want to add anything that I might have missed.
Eoin: It's a bit like the first one in that it's also something that could potentially allow you to remove a huge amount of glue code and orchestration logic. So I think it's really, really a great step in the right direction. And I think those wish list items you mentioned would just really make it fantastic altogether. The price issue is definitely a concern because it's a bit like the pricing model for Step Functions wasn't designed for this level of scale.
But you're still paying something like two and a half cents per thousand state transitions in Step Functions. So you can imagine, if you've got a million state transitions, and it's now quite possible to reach that, that's $25. So if you're running that multiple times a day, it adds up over the month. So you have to work that out. See our previous episode on pricing. Absolutely. Okay, well, maybe we could talk about number three in our top three list then.
And it's pretty hard to choose because there were some pretty good announcements elsewhere. CodeCatalyst is another one that's worth a mention. But we're going to talk about Application Composer. So Application Composer is a completely new tool in the AWS console for visually designing new applications or visualizing existing ones. It's in preview right now, so it's not generally available, but you can try it out and give it a go.
And I've done that and I've found it to be much, much better than previous attempts at this kind of thing, like the CloudFormation designer. Now, it's really focused on serverless applications right now. But the way it works is that you can build an application from scratch using a drag and drop interface. Visually, it looks good. It makes sense. It's reasonably simple to use. And it will generate the CloudFormation template for you and also generate things like IAM policies you will need.
Now, it doesn't support all of the CloudFormation resources. There's a set of about 12 or 15 services, the classic things you'll find in a basic serverless application, like API Gateway, Cognito user pools, DynamoDB tables, EventBridge rules, Kinesis, Lambda, S3, SNS, SQS and Step Functions. And I would love it if it supported the many hundreds of resource types, thousands even, that you can get in CloudFormation, and maybe we'll get there. But it's a pretty good start.
So one of the things it can also do, if you're using the Chrome or Edge browsers, is actually synchronize with your code on the file system. So if you're taking the approach of visualizing an existing application, you can point it to the directory, pick your template, and it will visualize that for you. And if you make changes, it will sync them back to the file system. That's using the File System Access API in the browser.
It doesn't work in Firefox because Firefox doesn't support that API, and in that case you just have to load your template manually. I did try out an example. I was building a serverless application recently. It had two features. I was using AWS SAM, and Application Composer supports AWS SAM, so that was a good fit. My application was using nested stacks, and it was also using Step Functions where the state machine definition was loaded separately from a JSON file.
It turns out it doesn't support nested stacks. I mean, it could load the file, but it just showed me that there were a number of stacks; it couldn't show me the resources within the nested stacks. So I said, okay, that's fine, I'll just load the individual stack template. I did that and I was able to load up all my Lambda functions, and it was able to recognize that there was a state machine, but it didn't parse the state machine definition.
So it wasn't able to draw the lines between my state machine definition and the tasks that were invoked in the Step Function stages. I did try building from scratch and creating a Step Function, and in that case you can put the definition inline, basically in the state machine resource, and that seems to work fine. But I was thinking as I was doing that, wouldn't it be nice if it were seamlessly integrated with the Step Functions Workflow Studio, so that you could go directly from designing your state machine, your CloudFormation resources and your SAM resources into the actual state machine design. If those two tools blended well together, that could be really, really powerful. I think this is a really good thing, because one of the issues with serverless applications is that it can be hard to understand how everything fits together, because you've got lots of resources that are sometimes loosely coupled. This is a good step in the right direction. And I think it's going to be really useful, very good for people starting off with serverless development as well, because when you're just looking at lines and lines of YAML, it can be a bit of a headache, but when you can visualize it nicely and talk through it, it's like having a live physical architecture diagram for your solution.
Luciano: Yeah, totally agree. And I think this is one of the pain points that we hear the most about when talking with people in the industry or our customers, that it's always very hard to keep in sync your architecture diagrams with your actual architecture running on AWS. So this might be a step forward in that direction. It could be a tool that kind of gives you that automatic visualization of your actual stacks, rather than trying to keep two different things in sync.
And you know that's really, really hard to do well. So I'm really happy to see this being announced. Even if it's not perfect, I think it's a great one to mention, and it's a step forward for sure. So just to try to wrap things up here, I want to mention that we will have the links for the individual announcements, EventBridge Pipes, Step Functions distributed map, and Application Composer, in the show notes.
But there are also three additional things that we were kind of discussing and that were on our shortlist as well, so maybe they're just worth a quick mention: SNS payload-based message filtering, AWS Verified Access, and CloudWatch cross-account observability. We are not going to spend more time on those, but you can find the links in the show notes as well, if you want to deep dive on these other announcements.
We will also have another link, which is the official blog post that highlights all the top announcements of AWS re:Invent 2022 directly from AWS. So that's another great source if you have missed something and you just want to see exactly what was announced and deep dive on what's most interesting for you. And with that, I think we are at the end of this episode. We are really curious to hear what your top three favorite announcements are. I realize that, probably due to our background, we have been mostly focused on the area of application development, but we probably have people in the audience who are more focused on networking, ML or data analytics, and there were a lot of announcements in those areas too. So I'm really curious to see what you liked the most and whether you were excited about the new things that were announced. So definitely leave us a comment, chat to us on Twitter, and let's be in touch. Until then, see you in the next episode. Bye.