Help us to make this transcription better! If you find an error, please
submit a PR with your corrections.
Eoin: Integration testing event-driven systems is a classic hard problem. With modern distributed applications, you have events firing off all over the place. How do you write integration tests that check whether events have been sent and received correctly? As part of pre:Invent, AWS has just released a solution for exactly that: the AWS IATK. We've been taking it for a test drive and are going to share everything we found out.
We'll also talk about some of the alternatives. By the end of this episode, you should have a good idea of how you can use IATK, and we'll share a project where we have been able to use it to test a cross-account application with EventBridge. My name is Eoin. I'm here with Luciano and this is AWS Bites. AWS Bites is brought to you by fourTheorem, the ultimate AWS partner for modern applications on AWS. We can help you to be successful with AWS, so check us out at fourtheorem.com. Before we get into IATK and the alternatives, let's take a step back. Luciano, what is integration testing? How does it compare to other forms of testing? And why is it so difficult for event-driven applications?
Luciano: Integration testing involves taking your code and testing it together with the rest of your application, including external systems. So if your code is communicating with things like a database, for instance, as part of your integration test you need to include that database. In a more complicated use case, maybe you connect to a database first, then you create an email and then you send that email through an email provider.
So you include all these external pieces as part of your integration test. It's a little bit broader than what you might be used to if you have only done unit tests, which typically focus on small units of code in isolation. The idea there is to be very efficient, very specific, and make sure that one feature works really, really well, but unit tests don't cover how you integrate that feature with the rest of the application. This is where integration tests come into play: to make sure that the units you are writing are correct, and that they are still correct when they are put together and combined into your software solution. With AWS it's tricky, because when you build a solution on AWS you often use all these very specific AWS services with their own specific APIs, for instance EC2, RDS, ElastiCache, DynamoDB or EventBridge.
And of course, if you're writing unit tests, you can mock some of that and simulate their behavior to make sure that your own business logic works well. But at some point you need to make sure that your mocks are correct and that your implementation actually works with the real backend, with the real AWS services. That's where writing integration tests can help to increase the confidence that your application is actually going to perform well and be correct when you deploy to production. Very often you will find bugs not in your code itself, but in the way that you are integrating things together. Maybe a configuration option is wrong, or maybe you assumed that a certain API would work in a certain way and in reality it works slightly differently. In your unit tests you didn't capture that behavior, because you mocked that API and made assumptions, but when you run against the real service you realize that there was a mistake there or an edge case that you didn't handle in your own logic. We usually like to mention building an e-commerce solution as an example, because it's something everyone can relate to. In that particular case, we can imagine there is an order service and a delivery service. You might be writing them independently as two totally separate services. They might have their own tests, but at some point they will have to be integrated together. For instance, we might have the case where, when you place an order, that order is pushed to something like EventBridge. Through EventBridge there is a notification that gets picked up by the delivery service, and the delivery service knows that an order was created, starts to process it and does all the fulfillment you have put in place for your e-commerce. So EventBridge is the tricky bit: how do you actually test it, and how do you make sure that on one side you are producing the right type of event, and on the other side you are picking it up and processing it correctly? That's the question that we want to explore today.
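To make that flow a bit more concrete, here is a minimal sketch of how an order service could publish an "order created" event to EventBridge with boto3. The bus name, source, detail-type and payload are all hypothetical, purely for illustration.

```python
import json
import boto3

events = boto3.client("events")

def publish_order_created(order_id: str, items: list) -> None:
    # Publish a single custom event on a hypothetical "ecommerce-bus" event bus.
    # A rule on that bus would route "OrderCreated" events to the delivery service.
    events.put_events(
        Entries=[
            {
                "EventBusName": "ecommerce-bus",
                "Source": "order-service",
                "DetailType": "OrderCreated",
                "Detail": json.dumps({"orderId": order_id, "items": items}),
            }
        ]
    )
```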
Eoin: There are different approaches that people use for testing events in these kinds of applications. One is to build logging into the event system so that all events are logged. Then, in your integration test, you can scan the logs and filter out the ones that you're interested in testing and validating. That's a fairly simple approach, but logs aren't always reliable in terms of the time to deliver them and the guarantees around delivering them, and having to parse them can be a little bit slow and inefficient. Another approach is to focus on more end-to-end testing: testing the final outcome, like a record appearing in a database or an email being delivered, and then you don't have to think about testing the events at all. It might work for some cases, but not all cases will have an observable outcome like this, and you might want to focus on just the smaller unit that you're integrating. And the third approach is temporarily creating an additional consumer for the event, just for the purposes of your test.
So if you are testing SNS or EventBridge, this could involve adding a temporary SQS queue as a subscriber or target and polling that SQS queue for a limited period to check for delivery of the expected message. I think this last approach with the temporary queue is probably the most reliable, but it requires a bit of setup. You also have to think about the additional latency to create these test resources and also think about deleting the queue when the test is finished, including in cases where the test exits before any tear-down phase has a chance to happen. Now when we're talking about the different approaches, we're talking about integration testing and end-to-end testing, so it might be worth clarifying the distinction between those two. Integration testing, as you said Luciano, is basically ensuring that application components work well individually and together, including with external systems. End-to-end testing is broader. It's still integration testing because you're using real services, but it's really everything altogether. So it evaluates the product as a whole from the user's perspective, user flows, and that can include starting with your front end or an API or whatever the external user interface is. So given those three approaches we talk about, what are the tools out there to help you with this?
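For reference, a hand-rolled version of that temporary-queue approach might look roughly like this with boto3, before we get to the tools. All names are made up, error handling is skipped, and the SQS queue policy that allows EventBridge to deliver messages is omitted for brevity.

```python
import json
import uuid
import boto3

sqs = boto3.client("sqs")
events = boto3.client("events")

def create_test_listener(bus_name: str, event_pattern: dict):
    # Create a temporary queue and rule, named so leftovers are easy to spot and delete.
    suffix = uuid.uuid4().hex[:8]
    queue_url = sqs.create_queue(QueueName=f"it-listener-{suffix}")["QueueUrl"]
    queue_arn = sqs.get_queue_attributes(
        QueueUrl=queue_url, AttributeNames=["QueueArn"]
    )["Attributes"]["QueueArn"]
    rule_name = f"it-rule-{suffix}"
    events.put_rule(
        Name=rule_name, EventBusName=bus_name, EventPattern=json.dumps(event_pattern)
    )
    events.put_targets(
        Rule=rule_name,
        EventBusName=bus_name,
        Targets=[{"Id": "test-queue", "Arn": queue_arn}],
    )
    return queue_url, rule_name

def wait_for_event(queue_url: str, timeout_seconds: int = 20):
    # Long-poll the temporary queue until a message arrives or we give up.
    waited = 0
    while waited < timeout_seconds:
        response = sqs.receive_message(QueueUrl=queue_url, WaitTimeSeconds=5)
        waited += 5
        for message in response.get("Messages", []):
            return json.loads(message["Body"])
    return None

def delete_test_listener(queue_url: str, rule_name: str, bus_name: str) -> None:
    # Tear-down: remove the target, the rule and the queue, even if the test failed.
    events.remove_targets(Rule=rule_name, EventBusName=bus_name, Ids=["test-queue"])
    events.delete_rule(Name=rule_name, EventBusName=bus_name)
    sqs.delete_queue(QueueUrl=queue_url)
```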
Luciano: One of the tools that we have been using, and I think it's very relevant here, is called sls-test-tools, which comes from a company called aleios. It's basically a tool that extends Jest, the famous JavaScript test runner, and it provides a bunch of additional matchers, I guess I'm going to call them. I'm not sure that's the right terminology, but it extends the capabilities of Jest, the built-in checks that you can do, to include checks that are very specific to AWS services, for instance DynamoDB, S3, Step Functions, SQS or EventBridge. The idea is that you create a test where you provision some infrastructure, and then as part of your test run you can use these specific matchers or assertions to check that the infrastructure was created correctly and that certain behaviors are actually apparent in the infrastructure after you have executed the specific tests. There is a very good article by Sarah Hamilton that we're going to put in the show notes, which is a bit of a tutorial on how to use it and why it can be very convenient. But I think the hype these days is around this new tool announced by AWS that we mentioned in the introduction, and I think we want to show a little bit more of that and maybe compare it. So how does the AWS Integrated Application Test Kit compare with sls-test-tools?
Eoin: I think that Sarah Hamilton article is actually a very good article on the general approach here, everything we're describing, and I wouldn't be surprised if it actually inspired some of the design of this Integrated Application Test Kit from AWS. It has just been launched and it's still in public preview, so we're going to talk about the pros and cons, but we should be fair and say that this is just released. AWS releases things early, so we can expect some glitches. It's currently available for Python-based tests, although it's implemented mostly in Go: Golang for the core, with an RPC wrapper for Python. AWS says that they will add other languages in time, so I think that's a good thing to see. It has a few capabilities, and we can break them down into three parts. One is creating test events from the EventBridge schema registry that you can use in your integration tests. The second one is probably more aligned to what we've been talking about in terms of challenges, and that's validating that events have been received via EventBridge and have the correct structure. And the third one, which is almost the most innovative piece, is checking the correct flow of events with X-Ray. We'll go through how the process works for all of these. Maybe Luciano, you can talk about IATK and how it works from the perspective of EventBridge event testing.
Luciano: What we tested is basically a very simple example, and in this example what the IATK tool does is create a temporary SQS queue and a temporary rule on the bus which uses the same pattern as the rule you want to test. That allows you to get copies of the events that are happening into SQS, so you can inspect them and make sure that they look correct from an application perspective. It doesn't allow you to specify an arbitrary pattern: you have to specify a rule and a target when you create a test listener, which seems a little bit strange. I don't know why it needs the target, but we could be missing something that maybe is obvious to AWS and that we are not seeing here. The idea, again, is that you capture the event into SQS and then analyze it after you have executed the code that you want to test. It provides a number of helper functions to clean up everything after you have executed the test, but also to inspect the state of the system after the test was executed. You can also clean up first, just in case a previous execution left things in a bit of a dirty state. This is actually recommended by the documentation: you clean up, you run your test, you do all of your assertions, and then in the tear-down phase you clean up again. Before we go through all the features of this tool and how to use it, it's interesting to note that everything is available on GitHub, including examples. What we did is create an integration test for a repository that we call cross-account EventBridge, which is something we built previously and which basically allows you to use EventBridge across accounts and share messages across accounts. This is something we mentioned in a previous episode, episode 39, and you can find the link in the show notes if you want to know a little bit more about that specific use case and why we built it. Now, this repository uses TypeScript for the CDK, but because right now this tool only supports Python, we wrote the integration tests in Python, and we will also have a link to the specific test section in this repository if you want a quick way to go and see how we wrote the tests. So let's talk a little bit about the process of creating this kind of test. Eoin, do you want to cover that?
Eoin: Yep. What we did was use pytest, so you create your Python test, then you use the AWS IATK Python module and instantiate it. Now, immediately we ran into an issue where it didn't pick up the credentials locally. The documentation says that it should pick up your AWS environment variables, but I was just getting an expired key all the time, and I think it was picking them up from somewhere else, like from a credentials file or config file. I don't use credentials files, so I don't know why, but that didn't seem to work as documented, so we had to specify a profile argument in the constructor in order to get this to work. Once we'd done that, you just need to know the event bus, the rule and the target. IATK also provides some utility functions for reading CloudFormation resource physical IDs or reading CloudFormation outputs, so that you can get those values from your stack. You mentioned the cleanup process for ephemeral resources: you can use the IATK remove listeners helper to do that, and it will use a tag filter to identify the resources that it can clean up safely. You call that at the start of your test and then you also call it during your tear-down. That's basically how you do it, so that it cleans up at the end of the tests normally, but it also runs at the start in case there's anything dangling from previously aborted runs. Then you create an IATK listener: you give it the bus name, the rule name, the target ID and some tags, and this will allow you to start checking for events. When you create this listener, under the hood it's creating an SQS queue and an EventBridge rule to route events to that queue, and it basically copies the pattern from the rule you provided. Again, the fact that they ask you to provide the target ID as well doesn't make any sense to me, because once you have a rule and a bus and you have a pattern, that's all you need, I think, to do the test. I didn't look into the code to find out what that was all about. I'm not very good at reading Golang, so I wasn't going to go in there and try to figure out what was going on, but maybe somebody can explain. I'm sure they have a valid reason for it. Then you have two options for actually retrieving the events and doing the validation. One is wait until event matched: this basically waits for one event to come in on the bus, and you provide an assertion function to check if the message is the one you expect. The other one is poll events. This is a different model where you say: poll for events for 20 seconds, and it will give you all the events that arrive in that period, and then you can go through them, filter them and check if they are valid yourself. In your test teardown function, just remember to clean up the resources created by IATK. You can see this in our cross-account EventBridge e-commerce example: we have the EventBridge testing approach, but we also tried out the trace validation with X-Ray, which seemed pretty exciting. We got some mixed results, but we were able to get it to work. Luciano, do you want to describe that process?
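Before moving on to tracing, here is how that listener flow comes together in a condensed pytest sketch. We are writing the module, method and argument names from memory of the public preview (aws_iatk), so treat the exact signatures, the tag filter shape and the bus/rule names as assumptions to check against the current documentation; place_test_order is a hypothetical helper.

```python
import json
import pytest
from aws_iatk import AwsIatk  # preview package name, as best we recall

TEST_TAGS = {"purpose": "integration-test"}

def place_test_order():
    # Hypothetical trigger for the system under test, e.g. calling the order API.
    ...

@pytest.fixture
def iatk():
    # We had to pass a profile explicitly because the environment credentials
    # were not picked up as documented.
    client = AwsIatk(profile="my-test-profile", region="eu-west-1")
    # Tag filter shape may differ; the library may provide a dedicated type for this.
    tag_filters = [{"key": "purpose", "values": ["integration-test"]}]
    client.remove_listeners(tag_filters=tag_filters)  # clear leftovers from aborted runs
    yield client
    client.remove_listeners(tag_filters=tag_filters)  # normal tear-down

def test_order_created_reaches_delivery_rule(iatk):
    # The listener clones the pattern of the existing rule onto a temporary
    # SQS queue + rule pair that IATK manages for us.
    listener = iatk.add_listener(
        event_bus_name="ecommerce-bus",
        rule_name="order-created-rule",
        target_id="delivery-service",  # required by the API, even if it feels redundant
        tags=TEST_TAGS,
    )

    place_test_order()

    # Option 1: wait for a single matching event. As we understand the preview,
    # the callback asserts on the raw event; check the docs for the exact contract.
    def assert_order_created(raw_event: str):
        assert json.loads(raw_event)["detail-type"] == "OrderCreated"

    assert iatk.wait_until_event_matched(
        listener_id=listener.id,  # the listener object exposes an id, as we recall
        assertion_fn=assert_order_created,
        timeout_seconds=30,
    )
```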
Luciano: Yes. I think it's good advice in general to have X-Ray enabled when you're building event-driven systems, because it gives you peace of mind that you can trace exactly how requests flow through the system and which components get used for your requests. And of course this is something you can also leverage here for testing, and it's one of the more innovative things that I think this tool brings to the table. The idea is that you set up everything and execute your test, and meanwhile the system is also collecting traces because you have X-Ray enabled. One of the things that you can do in your own assertions is fetch the traces as a structured object and then do assertions on the traces themselves. That can help you to make sure that the systems you expect to be involved in that particular flow are actually being involved.
If there is some kind of ordering that is important to you, you can also use this to assert that systems are actually propagating messages in the correct order. Basically, the library gives you a helper function called get_trace_tree, and with this function you specify a tracing header as a parameter and it gives you back an object which represents the tree of traces. It's a nested structure where you can follow the different branches to make sure that things are happening correctly.
Now, depending on the complexity of your code and how many systems are involved, you might have to write a lot of code to do these assertions correctly. It's not a plain array where it's easy to assert certain things; you might need to traverse the tree, so it might be a little bit tricky to test exactly what you want, but you do get the entire view of all the systems involved, assuming that you enable and configure X-Ray correctly. Just to give you examples of the kind of matching that you might do on the trace tree: you can check if there are errors in any segment. Maybe a specific system was part of this transaction and produced an error, and maybe it's not something you would notice just by looking at the final result of your execution, but by looking at the tree you might realize that one of the components was failing in some unexpected way. So I think it's good practice to traverse the tree and look for this kind of thing, and if you didn't expect any error and you see one, make the test fail and report that particular error. You can also check performance metrics: if you have requirements in terms of SLAs around response times, you might produce a warning or even fail the test if you see that certain numbers go beyond the thresholds that you consider acceptable. You can also check that specific components were actually part of the trace. Maybe there are systems that don't really expose a behavior you can assert at the end of your test, but you just want to make sure that they received some information and were somehow part of this transaction, so you can assert that they appear somewhere in the trace tree. And finally, you can also do the inverse: you might know that there are only three systems involved, for example, so you assert that those three systems are there, but if by any chance you see a fourth system, that's probably a symptom that something unexpected is going on, so that's something else you might want to write an assertion about, to make sure that only the things you expect to happen are actually happening and nothing else is happening in the flow. So, what are the things that we were happy with and the things that we were disappointed with?
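Before getting to the verdict, here is an illustrative sketch of the kind of trace-tree assertions just described. The traversal works on whatever nested segment structure get_trace_tree returns; the attribute names used here (subsegments, name, error, fault, start_time, end_time) are assumptions based on what X-Ray segments normally contain, not the documented IATK types.

```python
def collect_segments(node, collected=None):
    # Depth-first walk of the nested trace tree, flattening it into a list of segments.
    collected = collected if collected is not None else []
    collected.append(node)
    for child in getattr(node, "subsegments", None) or []:
        collect_segments(child, collected)
    return collected

def assert_trace_is_healthy(root_segment, expected_services, max_duration_seconds=10):
    segments = collect_segments(root_segment)
    names = {getattr(segment, "name", "") for segment in segments}

    # 1. No segment reported an error or a fault anywhere in the flow.
    assert not any(
        getattr(segment, "error", False) or getattr(segment, "fault", False)
        for segment in segments
    )

    # 2. Every service we expect took part in the flow, and nothing unexpected did.
    assert expected_services <= names
    assert names <= expected_services

    # 3. A rough SLA check on the end-to-end duration of the root segment.
    duration = root_segment.end_time - root_segment.start_time
    assert duration <= max_duration_seconds
```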
Eoin: I think we were curious to see this in action, but when we did the testing there were a couple of issues with the X-Ray approach. The get_trace_tree function you mentioned gave us an error which basically said "error building trace tree", and it said that it found a trace with a segment that had no parent, which was a very strange one. How can you get a segment from X-Ray with no parent? So we looked into the raw data in the X-Ray console in CloudWatch, and we saw the segment and we saw that it did have a parent. So we figured maybe it was an eventual consistency kind of problem: at the time the test was reading it, maybe all of the data hadn't fully settled. So what we did was file a bug in the IATK repo and use a workaround: there was that other function you mentioned that allows you to wait until a trace arrives that satisfies a condition.
We still had to add a sleep in order for this to work. So when we switched over to that function, even though the documentation says you should never have to sleep to get it to work, we had to add a sleep, and then it would work. Now, there are examples in the code of using sleeps with the get_trace_tree approach, but not with the approach we eventually used, so it seems like the documentation and the behavior aren't 100% aligned on this. But luckily we were able to do it, and we were able to get the traces. In this example application we have three kinds of services: this kind of global bus, an order service and a delivery service. The trace is kind of interesting: it's EventBridge to Lambda to EventBridge, three times, and we were able to assert that that's true. We were also able to put in some performance checks, like an SLA that all of this should take no longer than 10 seconds, for example, which is pretty good for exposing any unexpected performance degradations in your continuous build process, so I think that was quite a nice one. So we've talked about the EventBridge testing approach and the X-Ray testing approach. There was one other feature in IATK: do you want to say something about mock events?
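For reference, the workaround ended up looking roughly like this. The helper name (retry_get_trace_tree_until) and its arguments are written from memory of the preview docs, the assertion callback is a placeholder for the flow and SLA checks sketched earlier, and the explicit sleep is the part that, according to the documentation, shouldn't be necessary.

```python
import time

def assert_expected_flow(trace_tree):
    # Placeholder for the real assertions: check the EventBridge -> Lambda ->
    # EventBridge chain (three hops in our case) and the overall duration,
    # along the lines of the traversal sketch shown earlier.
    ...

def wait_for_valid_trace(iatk, tracing_header):
    time.sleep(10)  # shouldn't be needed per the docs, but it was for us
    assert iatk.retry_get_trace_tree_until(
        tracing_header=tracing_header,
        assertion_fn=assert_expected_flow,
        timeout_seconds=30,
    )
```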
Luciano: Yes, mock events are something else we tried, and it's an interesting feature. The idea is that with EventBridge you can have a schema registry, and what this library allows you to do is generate a mock event starting from your own schema registry and then use it as a source for your own tests. For instance, you can say: I want to generate a mock event that invokes a Lambda or a Step Function, and that's basically the starting point of your test. This can be convenient when you have a very complex flow and you want to break down the testing into individual parts, maybe testing one integration at a time. It makes it a little bit easier to have a clear starting point where you craft exactly the event that gives you a good test case, without having to go through a bunch of additional steps that you might have in the entire flow. So it's basically a way to make your test case start from an EventBridge event. I don't think this was very applicable to our own testing, so we mostly had a look at the documentation, but it looks like an interesting feature when you have a multi-step flow, maybe going through different rounds of generating events on EventBridge, picking up the event from there and doing something else. In our case this wasn't really applicable, but nonetheless it's an interesting feature that you might find useful, and it's great to have the convenience of being able to create an event not from scratch, but starting from what you already have in your schema registry, which should give you some confidence that the event matches what you will actually have in a production scenario. So, what are our overall thoughts? What did we like and what didn't we like about this new tool?
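As a rough idea of what that looks like: you point the mock event helper at a registry and schema, and it hands back an event you can feed to the Lambda or Step Function under test. The function name and parameters below are our recollection of the preview API, and the registry, schema, context and function names are placeholders.

```python
import json
import boto3
from aws_iatk import AwsIatk  # preview package name, as best we recall

iatk = AwsIatk(profile="my-test-profile", region="eu-west-1")

# Generate a mock EventBridge event from a schema in the registry (placeholder names).
mock = iatk.generate_mock_event(
    registry_name="discovered-schemas",
    schema_name="order-service@OrderCreated",
    schema_version="1",
    event_ref="OrderCreated",
    contexts=["eventbridge.v0"],  # wrap the generated payload in an EventBridge envelope
)

# Use it as the starting point of a test, e.g. by invoking the consumer directly.
boto3.client("lambda").invoke(
    FunctionName="delivery-service-handler",  # hypothetical function under test
    Payload=json.dumps(mock.event),
)
```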
Eoin: On the good side, I think this is a nice addition that fills a gap in tooling for integration testing. There weren't a lot of options out there: we had sls-test-tools, and people rolling their own, so this is a good one, and the fact that it will support other languages is a good thing as well, I think. Once you're familiar with it, it makes testing these pretty complex cases quite simple. The fact that it allows you to clean up resources easily is very nice, because that's one of the problems you'll face if you do this yourself. And I think an advantage of the RPC approach is that we will have support for additional languages in the future, which is promising, because it will see broader adoption and hopefully then more features, without them having to rewrite different versions for different languages. What about the bad? Do you want to be the messenger for all the bad news, Luciano?
Luciano: I can't bear it, but I can be the bearer of bad news. So let's recap what we didn't like. Just as a disclaimer, this is probably due to the fact that this is a very new product, still at a very early stage, so everything we are saying here may no longer be applicable in a few months as the product evolves, gets polished, gets new features, gets bug fixes and so on. We already mentioned that traces didn't work the first time. Maybe it was our fault, maybe we did something wrong, but it wasn't really obvious how to do this just by looking at the documentation and copying their own example, so that's definitely something to be improved, either by fixing bugs, improving the documentation, or providing more examples and making this functionality easier to use correctly. The other problem is that right now there is no SQS, SNS, Kinesis or Kafka support yet. So this is really applicable today only if you're using EventBridge, but we know that EventBridge is not the only option here, so depending on your architecture it might be disappointing to be able to test EventBridge but not the other types of integrations. Documentation is there, there is actually a website with a bunch of pages and examples, but it looks like it was put together very quickly, so it feels like there is a lot more work to be done. For instance, there is a section called tutorial, and in that tutorial you only learn how to install the tool, and then there is a link to some examples. I had the feeling that they want to create a more fully fledged tutorial that walks you through all the different steps and gives you a bunch of different examples, where every example actually discusses the rationale behind the specific implementation, but right now you have to fill the gaps on your own: learn how to install it, look at the code and the examples, and figure out everything else in between. Now, this is also an open source project, so maybe if people are willing to contribute they can speed up the process of building this body of documentation and making the experience better for everyone else. It wouldn't be the first time that AWS receives open source contributions that make specific tooling a little bit better, so this is probably somewhere we actually have a chance to contribute and make the project a little bit better. The last one is about the RPC approach that we mentioned. On one side we like it, because it makes this project more likely to be fully supported across languages in a consistent way, even though today it's only available for Python. The problem we expect to see is that, with this kind of abstraction, when you have an error that error can be very obscure, because it just shows you, for instance, an issue at the Python wrapper level, and that error might be very generic, while the actual error is hidden in the Go implementation, which is abstracted away by the RPC wrapper. This is something that we have seen many, many times, I guess, when we use CDK and we see those jsii errors appearing, and it's always very hard to troubleshoot. Here, because the design seems very similar, we expect to have the same problem. This is maybe something that can be fixed by putting a lot of attention into making sure that the RPC layer propagates good errors and that these errors are displayed well by all the different wrappers, but nonetheless it's an effort that AWS needs to put into building the library and the error reporting when an exception happens, so we expect this to be a bit of a friction point for people using the tool. Again, it's worth remarking that this project is in public preview, so we don't need to be too harsh in our judgment here. I think the starting point is absolutely positive and it's great to have this tool, so I think the future is going to be brighter and this is going to be a valuable tool for people to use to write their own integration tests.
Eoin: In general, IATK looks very promising, I think, and we hope to see plenty of improvements before it becomes generally available. AWS, as we mentioned, tends to release products early, so we won't dwell further on the shortcomings. If the concerns are addressed, this should really be a valuable part of our toolkit. But let us know what you think: if you've tried it out, what alternative approaches we might have missed, and if there are other features that you think should be added to IATK, let us know and let the maintainers know as well. Until then, we'll see you in the next episode.