AWS Bites Podcast

39. How do you build a cross-account event backbone with EventBridge?

Published 2022-06-03 - Listen on your favourite podcast player

When it comes to building and deploying microservice applications on AWS, there are 2 emerging best practices: use a separate AWS account per application (and environment) and decouple communication between separate systems using events (instead of point-to-point communication). Can we use these two best practices together? Yes, but we will need to find a way to pass messages between AWS accounts! In this episode we discuss how to do that using EventBridge as a cross-account event backbone! We discuss why these 2 suggestions are well established best practices, what are the pros and cons that they bring to the table, what an event backbone is and why EventBridge is a great service to implement one. Finally, we will discuss a case study and an example implementation of this pattern in the context of an e-commerce application built with a microservices architecture.

Let's talk!

Do you agree with our opinions? Do you have interesting AWS questions you'd like us to chat about? Leave a comment on YouTube or connect with us on Twitter: @eoins, @loige.

Help us to make this transcription better! If you find an error, please submit a PR with your corrections.

Luciano: There are two emerging best practices when it comes to building and deploying microservices applications on AWS. The first one is to use separate AWS accounts per application and per environment. And the second one is to decouple communication between separate systems using events instead of point-to-point communication. But what if we want to use both of these best practices together? We will need to find a way to pass messages between AWS accounts, basically. So in this episode, we will discuss how to do that, specifically using EventBridge as a cross-account event backbone. We will also learn what the advantages of this approach are, we will discuss a case study and examples, and we will see an implementation of this pattern. My name is Luciano, and today I'm joined by Eoin, and this is the AWS Bites podcast. So let's start with a quick recap. We just said that it is a good practice to have separate AWS accounts per application. Why is that?

Eoin: The goal is to keep accounts fairly small and clean and not have too much clutter in them. That's one of the things. But there are also some technical limitations as well. From a security perspective, you might want to actually define policies at an account level and make sure that an account is restricted in what you can do with it, depending on the team who's using that account. So you can use things like service control policies to restrict at a high level what's going to happen within an account or a set of accounts.

The other thing is around quotas, so the limits in different services that you can get. Normally, those are applied at account level. So if you've got lots of different workloads mixed in one account, you want to avoid the situation where one workload can starve another of available containers or EC2 instances or whatever else you need to rely on. There's also, I suppose, a trade-off with this multi-account approach that's worth stating. So the potential disadvantage is that with more accounts, you'll have more account management overhead. So I think we discussed some of that in previous episodes, but it does require investing in some sort of tooling like org-formation to make your life easier when deploying resources across these different accounts.

Luciano: Yeah, that makes a lot of sense. I think we also mentioned a few times, for instance, the example of the total number of concurrent Lambdas. So if you have workloads that tend to spin up thousands and thousands of Lambdas, they will effectively compete with each other and you'll hit the quota. And yeah, by separating accounts, I suppose you can kind of prevent all of that from happening. But we also said that it's a good idea to decouple communication between systems using events. So basically that kind of asynchronous communication. What do we really mean by that and how is that really something that gives us an advantage or gives us a better architecture, I guess?

Eoin: Yeah, I mean, this is another trade-off, but there's a lot of people talking about event-driven architecture these days and the benefits of it. There are a couple of things that it solves, as well as introducing a few challenges. But some of the things it solves are the fact that if you've got synchronous point-to-point communication between systems, you typically end up with two types of coupling when you do that. So one is location coupling, where one system needs to know the address of the other. So you can't really easily change that address. The other one is temporal coupling.

When you have a synchronous request, the request processor needs to be available at the same time that the request is being sent, and the sender is actually waiting on the result. So the idea with asynchronous communication is to reduce both of these things, to allow teams to be able to maintain and scale these applications separately and minimize the amount of coordination they need to do.

So you define your event communication mechanism and try to define some sort of a schema or a way in which those messages can be structured, but avoiding too much of a tightly bound contract and certainly an address in the middle. And what you end up with then is a system where you can scale each component independently. We talked about that in some of our previous episodes. But it also allows teams just to develop in isolation and not be blocked by each other.
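As a rough illustration of "some sort of a schema" without a tightly bound contract, here is a minimal sketch of a shared event envelope. The class name, the `metadata`/`data` layout and the example values are assumptions made for illustration, not something defined in the episode. Only the four entry keys (`Source`, `DetailType`, `Detail`, `EventBusName`) follow the shape of an EventBridge PutEvents entry.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any


@dataclass
class DomainEvent:
    """Hypothetical envelope that every team could agree on."""

    source: str                 # which system emitted the event
    detail_type: str            # what happened, e.g. "Order.Created"
    data: dict[str, Any]        # the business payload
    schema_version: str = "1"   # lets consumers cope with evolving payloads
    emitted_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_put_events_entry(self, event_bus_name: str) -> dict[str, str]:
        """Build one entry in the shape expected by EventBridge PutEvents."""
        detail = {
            "metadata": {
                "schemaVersion": self.schema_version,
                "emittedAt": self.emitted_at,
            },
            "data": self.data,
        }
        return {
            "Source": self.source,
            "DetailType": self.detail_type,
            "Detail": json.dumps(detail),  # Detail must be a JSON string
            "EventBusName": event_bus_name,
        }


entry = DomainEvent("order.service", "Order.Created", {"orderId": "o-1"}).to_put_events_entry("my-bus")
```

Note that the sender only names a bus and a kind of event. It never addresses a specific consumer, which is the location decoupling described above.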

So it solves those problems, but it also introduces some challenges. And you still have a need from time to time where you will say, OK, well, we've got two separate applications in our organization. Let's define a RESTful API with some JSON schemas and some Swagger/OpenAPI documentation. And that's our interface. And that's still OK, but you just have to be prepared to solve the coupling concerns in a different way then, and just be prepared to version those APIs and evolve them and communicate if you're going to change them, etc.

Luciano: Yeah, I like that you mentioned that it is a trade-off. And I hope that with the examples we are going to be discussing in a moment, it will be more apparent what kind of trade-off we are going to be talking about and where it becomes really an advantage and when it's not really worth doing, all this kind of stuff. But we mentioned that the solution to this problem, or at least one of the possible implementations of this particular idea, is to use an event backbone. What is really an event backbone? How do we define it?

Eoin: The event backbone terminology you often hear about in the context of Apache Kafka. And the idea is really that in the past you had service-oriented architectures, the SOA systems that led to enterprise service buses. And the vision of the enterprise service bus was that you would be able to integrate all of the disparate systems in your organization through one piece of foundational technology that everyone could access. And this could work in some cases, but again, with an enterprise service bus, because the goal was often to include a lot of integration and routing logic and sometimes some business logic in there, it could become a big single point of failure in your system. And you end up just introducing the coupling in one location. So the idea with an event backbone is to simplify that ideal and have a simple piece of very scalable technology that can be used for high-level communication of events throughout the business. And you try to avoid any business logic there and just instead focus on reliability and durability of messages, and ensure that it scales very well so you don't have to worry about single points of failure in your system.

Luciano: Yeah, actually, funny story. I once built a custom service bus in a company, mostly inspired by Apache Camel. And I do remember how painful it was to provide all the building blocks, like if statements, switch conditions, mapping of data, to allow people to actually be able to put all that business logic into the enterprise service bus. I definitely agree that an ESB is something really stateful and powerful, but that comes, of course, with a cost: all the logic lives in the integration, which is maybe not always a good idea. So I guess the next question I have is, is this something we should do all the time for any kind of distributed application? Or maybe there are some examples where it could be worth it and other cases where it's much more clear that it's not really worth the effort of setting up something like that. Do we have any examples worth discussing?

Eoin: Yeah, I guess it's probably worthwhile saying that we're talking about a specific level of event-based communication here. We're not talking about having one event technology that every single event sender and consumer uses across the whole organization. It's really about high-level inter-system communication. So communication between separate domains in the business, between separate applications, maybe even between services or microservices if they're sufficiently complex. But it's up to you really to pick where you would use this, but I would avoid saying this is a one-size-fits-all solution for all eventing needs.

Okay, so you can still have an event backbone and then within each team's application they can use different event technologies for internal communication that don't necessarily have anything to do with the event backbone. So maybe some examples. If we've got, say, a video streaming service, right? We know many, but imagine that in their content ingestion workflow they've got two separate, completely distinct applications. One that's involved in ingesting all the video files and transcoding them and making them available.

And another one is there to sort out the catalog so people can browse, see what to watch next, see details about programs. So you can imagine that it's pretty important as a business rule that a title should not appear in the catalog until all the videos have been processed. So maybe the ingestion service will eventually emit an event onto this event backbone and the catalog application would subscribe to a number of events across the system and once it knows everything is ready and the licensing of the title is in effect and all of the video files have been transcoded and published out to the CDN, then it knows it can set the catalog's entry to published.
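To make the catalog rule concrete, here is a minimal sketch of how a consumer might wait for several events before publishing a title. The event names and the `CatalogEntry` class are hypothetical, chosen only to mirror the three conditions mentioned above (licensing in effect, videos transcoded, files published to the CDN).

```python
from dataclasses import dataclass, field

# Hypothetical event types the catalog application would subscribe to.
REQUIRED_EVENTS = frozenset(
    {"Title.LicensingActive", "Title.VideosTranscoded", "Title.PublishedToCdn"}
)


@dataclass
class CatalogEntry:
    title_id: str
    seen: set[str] = field(default_factory=set)
    status: str = "draft"

    def apply(self, detail_type: str) -> str:
        """Record an incoming event and publish once every prerequisite has arrived."""
        if detail_type in REQUIRED_EVENTS:
            self.seen.add(detail_type)
        if REQUIRED_EVENTS <= self.seen:  # all required events have been seen
            self.status = "published"
        return self.status


entry = CatalogEntry("title-123")
entry.apply("Title.VideosTranscoded")   # still "draft"
entry.apply("Title.LicensingActive")    # still "draft"
entry.apply("Title.PublishedToCdn")     # now "published"
```

Because the entry only tracks which events it has seen, the order in which they arrive does not matter, and receiving the same event twice is harmless.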

So that's one example. Maybe another one is in the kind of business application we talked about more in recent episodes. If you're performing lots of calculations like batch processing and producing a lot of data in a database, maybe at the end of that process you might publish an event to say all this data is ready, and then you have a downstream reporting application which is, again, completely distinct in terms of the application and its deployment, but is related in some way because it's consuming the data that is generated by that first system. So I think a backbone is another good fit for that kind of high-level inter-system business communication.

Luciano: Yeah, I suppose in general if you also expect your application to evolve a lot and get more moving parts around it, probably there is another advantage there, because in the case of the analytics you just mentioned, maybe eventually you want to have another process that knows when these files are published, and at that point you can just create a new application, listen for the same event, and you don't have to change anything else. So the source application emitting events doesn't even know which other applications are listening for those events.

Eoin: Yep, that's a good call, and you might want to actually have just a standard in place that when you finish the business process you emit lifecycle events about that. You know, when you start a process, when it's completed, and when it's failed is also important. That might also be an interesting fact for some other systems to know about.

Luciano: So we mentioned that EventBridge is one of the possible services that we could use in AWS to build something like this, but why specifically EventBridge and not any of the other messaging services? And we have been talking about messaging services a lot, so I'm kind of curious why EventBridge is specifically a good choice for this use case rather than any of the others.

Eoin: Yeah, I think if you go back to those episodes, there's a couple of things that EventBridge really shines at, and the main one is simplicity, because there's very little to set up and I think that makes it a very good candidate. But also in real terms, it has a very large feature set. So one of the important ones in this context, since we're talking about separate applications and separate accounts: EventBridge has really good cross-account support. It could be better, actually, so maybe we'll get onto that in a while, how it could be improved, but it is really good. And if you're building all these applications in AWS, and you're thinking, okay, how do I figure out the networking for all these applications to talk to my centralized event backbone?

All of this complexity is taken away with EventBridge, really. It's also massively scalable. There's a very low investment for teams to get up and running with it. It integrates really well with lots of other AWS services. And that means you can adapt it to more complex use cases. So if you have needs where you need durability, you can add in SQS on the consumer side, and that makes it very, very easy to adapt.
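One way to picture "SQS on the consumer side" is a rule target that points at a queue, with a retry policy and a dead-letter queue for events that cannot be delivered. The sketch below only builds the target definition as a plain dictionary. The function name, ARNs and retry numbers are assumptions for illustration, while the keys follow the shape of an EventBridge rule target.

```python
def durable_sqs_target(queue_arn: str, dlq_arn: str, target_id: str = "consumer-queue") -> dict:
    """Target definition that buffers matched events in an SQS queue."""
    return {
        "Id": target_id,
        "Arn": queue_arn,  # the consumer reads from this queue at its own pace
        "RetryPolicy": {
            "MaximumRetryAttempts": 10,        # illustrative value
            "MaximumEventAgeInSeconds": 3600,  # illustrative value
        },
        # Events that still cannot be delivered end up here instead of being lost.
        "DeadLetterConfig": {"Arn": dlq_arn},
    }


target = durable_sqs_target(
    "arn:aws:sqs:eu-west-1:222222222222:delivery-events",      # made-up ARN
    "arn:aws:sqs:eu-west-1:222222222222:delivery-events-dlq",  # made-up ARN
)
```

In a real setup the queue would also need a queue policy that allows EventBridge to send messages to it.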

You can also decide that for specific cases, you could add in Kinesis, for example, or SNS. So it's the kind of investment that you can adapt really easily. You're not stuck with it. You could swap it out in very specific cases if you need something slightly different. And we covered a lot of those different advantages in the comparison episode when we talked about all the different event services.

So I think it's a really good one to start with. You could also build on Kafka. We mentioned Kafka already, especially if you already have a lot of Kafka skills in-house and you've got a team who can work with that. Or you could build on SNS. SNS is actually another good service that you could use to build a backbone. But I think EventBridge, and we'll go into some of the details on how it works maybe, is definitely a good place to start if you're not sure.

Luciano: Yeah, I agree. EventBridge feels like the simplest and the most flexible option at the same time, which is really interesting. So if you don't really need that kind of extreme low-latency performance that you get with maybe something like SNS, probably you're going to be fine for most of the use cases. So now we talked about why it's good to have separate AWS accounts, why it's good to have something like an event backbone, and why EventBridge can be a good solution to implement all of that. Can we maybe propose an actual, somewhat realistic example and describe in a little bit more detail how we could implement all of this stuff?

Eoin: Yeah, let's do that. And there is an article actually, there's a blog post that details this architecture and the source code as well. So we'll link to that in the show notes if people want to explore it in more detail and deploy it for themselves and see the events flowing in front of their eyes. So let's go back to our typical example.

Luciano: I will caveat that it's not just an article, there is an entire repository attached to it, so you get all the source code, from application code to infrastructure as code. So definitely try to have a look and deploy it yourself and play around with it.

Eoin: So we like e-commerce examples on AWS Bites. So let's talk about our canonical e-commerce example. In this example we tried to simplify it as much as possible, so you've got an order service and a delivery service. So one service deals with orders and is more user-facing. The other one is dealing more with fulfillment. And in our example code, these are very, very simple applications. But remember that we said for an event backbone we're really talking about sufficiently complex ones that would warrant their own teams and accounts to maintain them. And you can imagine that for a major e-commerce vendor, your order service and your delivery service will probably have one or more teams maintaining each of them, and they will be running in separate AWS accounts for each environment.

So one of the workflows you can imagine here is that when an order is created, we want the delivery service to be able to react to that and handle it in some way. And correspondingly, when the deliveries are fulfilled, we might want to update the state of the order to reflect that. So each of those applications doesn't really have to know that the other system exists, but they do understand that there is a concept of an order, a concept of a delivery, and each of them has a lifecycle where they transition between different states.
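The "react to an order being created" idea boils down to a rule with an event pattern. Below is a deliberately simplified matcher, assuming exact-value matching only. Real EventBridge patterns support many more operators, and the source and detail-type names here are invented for the example.

```python
from typing import Any

# Hypothetical pattern the delivery service could use on its rule.
ORDER_CREATED_PATTERN = {
    "source": ["order.service"],
    "detail-type": ["Order.Created"],
}


def matches(pattern: dict[str, Any], event: dict[str, Any]) -> bool:
    """Simplified matching: every pattern field must be present in the event,
    and the event value must be one of the listed values (nested dicts recurse)."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


event = {
    "source": "order.service",
    "detail-type": "Order.Created",
    "detail": {"orderId": "o-42"},
}
is_match = matches(ORDER_CREATED_PATTERN, event)
```

The delivery service only depends on the shape of the event, not on where the order service lives or whether it is currently running.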

So you would deploy those to separate AWS accounts. But in this example application we actually have three AWS accounts, because we've decided to deploy the backbone itself as an EventBridge global bus, that's what we're calling it, the global bus, into its own account. So again, it's its own piece of functionality. And then each of these services, the order service and the delivery service, will publish events to this global bus.

There's probably a couple of questions that might be coming up in people's minds as we talk through this. So you might think, well, why do the services have to communicate with this global bus? Do they have to reach across an account in order to do anything? Well, there's a couple of, I suppose, fundamental principles with EventBridge that you have to be aware of. And one is that if you want to have cross-account rules, you can really only create rules that target another bus.

You can't have a cross-account rule that will target a Lambda function in another account or an SQS queue in another account. So if you want to, I suppose, propagate events from one account to another, you need to go from one bus to another bus, and then you can create the rules within an account to trigger resources in that same account. So what we're basically doing is building a routing mechanism here. So each application will have its own local bus. And there's a global bus that will essentially make sure that the right events are routed through to each local bus. So it might take a little bit of time to kind of see how this all plays out. But if you have a look at the article and the repo, there's a diagram and there's the source code that will show it. It's not a lot of AWS resources. It just means that you kind of have to understand what the limitations are here.

Luciano: So if I'm getting this correct, we basically have three accounts: one for the global bus, where the only thing that lives there is this global bus. And then the other two accounts are dedicated individually to specific services. And in every account, you have all the service-related stuff plus an EventBridge bus. And basically when you want to dispatch a global event, you go directly to the global account. But when you want to listen for a global event, you listen from your local event bus. Is that correct?

Eoin: Exactly. That's how it works. Yeah.
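The "publish to the global bus, listen on your local bus" summary can be sketched as two pieces of configuration. The account IDs, region, bus names and rule names below are all made up, and the dictionaries are just plain data shaped like the boto3 `put_events`, `put_rule` and `put_targets` arguments. This is not the code from the example repository.

```python
import json

# All ARNs below are invented for illustration.
GLOBAL_BUS_ARN = "arn:aws:events:eu-west-1:000000000000:event-bus/global-bus"
DELIVERY_LOCAL_BUS_ARN = "arn:aws:events:eu-west-1:222222222222:event-bus/delivery-local-bus"


def global_event_entry(source: str, detail_type: str, detail: dict) -> dict:
    """Entry a service sends straight to the global bus, addressed by ARN."""
    return {
        "EventBusName": GLOBAL_BUS_ARN,
        "Source": source,
        "DetailType": detail_type,
        "Detail": json.dumps(detail),
    }


def forwarding_rule(rule_name: str, sources: list[str], local_bus_arn: str) -> dict:
    """Rule on the global bus whose only target is a service's local bus."""
    return {
        "put_rule": {
            "Name": rule_name,
            "EventBusName": "global-bus",
            "EventPattern": json.dumps({"source": sources}),
            "State": "ENABLED",
        },
        "put_targets": {
            "Rule": rule_name,
            "EventBusName": "global-bus",
            # Cross-account targets can only be another event bus.
            "Targets": [{"Id": "local-bus", "Arn": local_bus_arn}],
        },
    }


entry = global_event_entry("order.service", "Order.Created", {"orderId": "o-42"})
rule = forwarding_rule("orders-to-delivery", ["order.service"], DELIVERY_LOCAL_BUS_ARN)
```

Rules that trigger a Lambda function or a queue would then live on the local bus, inside the same account as the resources they invoke. Depending on how permissions are granted, the bus-to-bus target may also need an IAM role, which this sketch leaves out.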

Luciano: Perfect. And I suppose it takes a lot of fun to configure all the policies so that the accounts are effectively allowed to read and write from all the different places, right?

Eoin: Yeah. Because it's cross-account access, you need to authorize that interaction on both sides. The code example there hopefully gives you a template that you can use to do this in your own application. It's a CDK application. You could do it in CloudFormation or Terraform just as easily, but you just need to make sure that you have the policy in place every time you add a new application with its own account. So I suppose one of the questions that might come up is also, so we talked about these three accounts: the global bus, I guess, makes sense, and the local bus makes sense for each account.
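For the policy side, an event bus can carry a resource policy that says which accounts may put events on it. Here is a minimal sketch that builds such a policy document, assuming made-up account IDs and bus ARNs. The same helper covers both directions: services publishing to the global bus, and the global account forwarding to a local bus.

```python
def bus_policy(bus_arn: str, allowed_account_ids: list[str]) -> dict:
    """Resource policy letting the given accounts call events:PutEvents on one bus."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": f"AllowPutEventsFrom{account_id}",
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{account_id}:root"},
                "Action": "events:PutEvents",
                "Resource": bus_arn,
            }
            for account_id in allowed_account_ids
        ],
    }


# Invented account IDs: 000000000000 = global bus, 111111111111 = orders, 222222222222 = delivery.
global_policy = bus_policy(
    "arn:aws:events:eu-west-1:000000000000:event-bus/global-bus",
    ["111111111111", "222222222222"],
)
delivery_local_policy = bus_policy(
    "arn:aws:events:eu-west-1:222222222222:event-bus/delivery-local-bus",
    ["000000000000"],
)
```

Adding a new application then means one more statement on the global bus policy plus a policy on the new local bus, which is the step that is easy to forget.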

But you have to publish to the global bus and then you kind of react to rules on the local bus. So that might seem like it's an inconsistency. The reason for that is that you might say, okay, well, let's just publish to our local bus and react to events from our local bus. And we don't have to know about the global bus at all and then let the routing happen behind the scenes. But there is actually a restriction in EventBridge. That means you can't have a rule that goes from bus A to B to C. You can only go from A to B, basically. So you can only have one cross-account EventBridge rule to another bus. You can't just keep hopping to another bus and then another bus. So that's why we basically say you publish to the global bus and then that event can be distributed to all the local buses. And that's how you react to them. So you could check out the architecture diagram and it might provide a bit of clarity on how all this works. But that's just how we set it up from the beginning to make sure that this will work and scale to many, many applications. Yeah.
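The single-hop restriction is easier to see with a toy model. The sketch below is a plain in-memory simulation, not EventBridge itself: the `Bus` class and its `forwarded` flag are invented to mimic the behavior described above, where an event that has already crossed from one bus to another is not forwarded a second time.

```python
from dataclasses import dataclass, field


@dataclass
class Bus:
    name: str
    forwards_to: list["Bus"] = field(default_factory=list)
    received: list[str] = field(default_factory=list)

    def put(self, event: str, forwarded: bool = False) -> None:
        self.received.append(event)
        if forwarded:
            return  # the event already made its one bus-to-bus hop
        for target in self.forwards_to:
            target.put(event, forwarded=True)


# Chaining local -> global -> local does not work: the second hop is dropped.
c = Bus("delivery-local")
b = Bus("global", forwards_to=[c])
a = Bus("order-local", forwards_to=[b])
a.put("Order.Created")
chain_reached_c = "Order.Created" in c.received

# Publishing directly to the global bus reaches every local bus in one hop.
order_local, delivery_local = Bus("order-local"), Bus("delivery-local")
global_bus = Bus("global", forwards_to=[order_local, delivery_local])
global_bus.put("Order.Created")
```

In the first setup the event stops at the global bus, while in the second every local bus receives it, which is why the services publish to the global bus directly.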

Luciano: The other thing I like about this is that the architecture kind of forces you to think about what is a global event, like something you want to dispatch for other services and systems to consume, as opposed to something that is maybe a local event that makes sense only in the context of a specific service. And you don't necessarily want to publish that outside of your service. So that's another thing that I think this architecture forces you to think a little bit more about. So I have another question. Is there any, I don't know, best practice or pitfall or something else to be aware of when we implement and start to do something like this?

Eoin: Yeah, I guess so. And I remember when we were talking about EventBridge specifically in the EventBridge episode, we talked about how do you troubleshoot? How do you get observability into what happens? How do you find out where your missing event has gone? So one of the things this application example has is logging of all events on every bus. So creating an event bus is easy. And as we know from that episode, you don't have to create one at all. You get a default one with every account. But in this case, we decided to create specifically named buses in every account, just so it provides clarity and isolation.

We're also providing a log group, a CloudWatch log group, with every bus. And we're logging all events that come into a bus into the log group. So if your rule isn't working, you can look in the log, see where the event is, see what the pattern is like. You can go into the AWS console and use the tool there for testing your event and checking the match. So I think that's a really good practice and I'd recommend that. But there's plenty more. If you're deploying this in an enterprise context, you'll also need to think about the structure of events and schemas as well, and whether you want to enforce schema validation on events. I think there's a couple of resources that are worth pointing to on this. Maybe we'll cover them at the end. But there are some really good talks and blogs out there that give good advice on how to construct those events and how to do validation and enforce more strictness if you need to.

Luciano: All right. That sounds awesome.
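The "log every event on every bus" practice can be sketched as one catch-all rule per bus with a CloudWatch log group as its target. The function name, rule naming scheme and ARN below are assumptions, and the prefix-on-empty-string pattern is one commonly used way to match everything rather than the pattern from the example repository.

```python
import json


def log_all_events_rule(bus_name: str, log_group_arn: str) -> dict:
    """Catch-all rule that copies every event on a bus into a log group."""
    rule_name = f"{bus_name}-log-all"
    return {
        "put_rule": {
            "Name": rule_name,
            "EventBusName": bus_name,
            # Every event has a source, so an empty prefix matches all of them.
            "EventPattern": json.dumps({"source": [{"prefix": ""}]}),
            "State": "ENABLED",
        },
        "put_targets": {
            "Rule": rule_name,
            "EventBusName": bus_name,
            "Targets": [{"Id": "bus-log-group", "Arn": log_group_arn}],
        },
    }


config = log_all_events_rule(
    "global-bus",
    "arn:aws:logs:eu-west-1:000000000000:log-group:/aws/events/global-bus",  # made-up ARN
)
```

The log group also needs a resource policy that lets EventBridge write to it. With this in place, a missing event can be traced bus by bus: if it shows up in the global bus log but not in the local one, the forwarding rule or the bus policy is the first thing to check.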

Luciano: So I think at this point, we're just going to give people more of these resources to check out if you want to know more about EventBridge in general and other similar services. So the first one is that we have a blog post on the fourTheorem blog. You'll find the link in the description. That basically is kind of a summary of all the main features and all the things you need to know to actually use EventBridge.

And then we have another blog post by Sheen Brisals about how basically you could structure your payloads so that you get the best out of EventBridge. We also have another blog post about SNS. So this will give you a very good comparison between SNS and EventBridge. This is on the fourTheorem blog. And finally, actually, two other resources. One is a very good talk by Luc van Donkersgoed, I think, from PostNL. And it explains how they use EventBridge and event-driven architectures to scale all that system. And if all of that sounds confusing, the next step that you should be doing is to go check out episode 23 of the AWS Bites podcast, where we give you all the details about EventBridge. So we'll leave you with that. And thank you for following and we'll see you at the next one.