AWS Bites Podcast

34. How to get the most out of CloudWatch Alarms?

Published 2022-04-28 - Listen on your favourite podcast player

CloudWatch is a great service for metrics. You get tons of metrics out of the box and you can also create your custom ones. One of the most important things you can do with metrics is to create alarms, so how do we get the most out of CloudWatch alarms?

In this episode we share our insights and cover the different types of alarms that exist, how to create an alarm, what to do when an alarm is triggered, a few examples of useful alarms and some of the drawbacks of CloudWatch alarms and how to overcome them.

Let's talk!

Do you agree with our opinions? Do you have interesting AWS questions you'd like us to chat about? Leave a comment on YouTube or connect with us on Twitter: @eoins, @loige.

Help us to make this transcription better! If you find an error, please submit a PR with your corrections.

Eoin: How do you get the most out of CloudWatch alarms? Last time we covered CloudWatch metrics: you get out-of-the-box metrics for AWS services, and you can create your own custom metrics for technical data points as well as business-level metrics. In this episode we're going to discuss what you can do with alarms. What are the different types of alarms? How do you create them? What can you do when your alarm gets triggered? We're going to go through a few useful examples of alarms you can create, and we'll also talk about some of the drawbacks of CloudWatch alarms and tips to overcome them. My name is Eoin, I'm joined by Luciano, and this is the AWS Bites podcast. Last time we discussed CloudWatch metrics, and one of the most important things you can do with those metrics is to create alarms, and that is our topic for today. Luciano, what are the different types of alarms you can create with CloudWatch alarms?

Luciano: Yeah, there are essentially two types of alarms. The first type is what we call a metric alarm, which is basically an alarm based on a single metric. If you want to make it a little bit fancier, you can also use a mathematical expression to take the value you get from the metric and do some calculation with it. The other type is what we call composite alarms, and those are effectively what the name says: you take different alarms, combine them together, and build an additional alarm on top of that. As an example, you can say: if this alarm fires and this other alarm fires, I want to fire another alarm. What do we want to say first? How do alarms work? Maybe we can give an example of how to create one alarm.
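To make the "if this alarm fires and this other alarm fires" idea concrete, here is a minimal sketch in Python of how such a composite rule could be expressed. It only builds the parameters; the alarm names, the topic ARN and the two helper functions are made up for illustration. The resulting dictionary has the shape that boto3's `put_composite_alarm` call accepts, but nothing here talks to AWS.

```python
def alarm_rule(operator: str, *alarm_names: str) -> str:
    """Join ALARM("name") terms with AND / OR into a composite alarm rule."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    if not alarm_names:
        raise ValueError("at least one alarm name is required")
    terms = [f'ALARM("{name}")' for name in alarm_names]
    return f" {operator} ".join(terms)


def composite_alarm_params(name: str, rule: str, topic_arn: str) -> dict:
    """Parameters for a composite alarm that notifies one SNS topic."""
    return {
        "AlarmName": name,
        "AlarmRule": rule,
        "AlarmActions": [topic_arn],
    }


# Hypothetical names: two existing metric alarms and an SNS topic.
params = composite_alarm_params(
    "service-degraded",
    alarm_rule("AND", "high-cpu", "high-latency"),
    "arn:aws:sns:eu-west-1:123456789012:alerts",
)
print(params["AlarmRule"])  # ALARM("high-cpu") AND ALARM("high-latency")
```

With boto3 installed and credentials configured, these parameters could be passed as `boto3.client("cloudwatch").put_composite_alarm(**params)`.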

Eoin: Yeah, that sounds good. There are a couple of things you need to specify when you create one. First of all, you pick your metric. If we're just talking about a simple metric alarm, let's say we want to monitor CPU usage on our EC2 instances: then we'll pick the namespace AWS/EC2 and the metric CPUUtilization. Once you pick the metric, you need to specify your threshold, so are you going to put an alarm on 90% utilization, for example? Then you pick a comparison operator (greater than, greater than or equal to, less than), and then you can specify the period.

We know from our last episode on metrics that you have a period of, say, one minute, five minutes or 15 minutes. So what period do you want your alarm to be based on? And then the statistical function: are you looking at the average over that minute, or the maximum? Then you can specify a number of periods. What you're doing is saying to CloudWatch: I want you to look at the average CPU utilization over three one-minute periods, for example. Then you can specify how many data points of those three will trigger an alarm. You could say three out of three, so if the threshold has been breached for three successive data points then alarm, or two out of three, or one out of three, or you could just have one data point. It's completely up to you and how sensitive you want this alarm to be. What that gives you is an alarm on CPU utilization if it's above or equal to 90 percent on average for at least two minutes of the last three minutes, say. The last thing you need to consider is whether you want to be alerted. If you do, you'll probably need an SNS topic. Alerts will be sent to that topic, and then you have a lot of options in terms of what you do with those messages. Otherwise you can just see the alarms in the AWS Management Console or from the CLI. That's not ideal and not very reactive, you have to be a lot more proactive. So you have to get the balance right, I guess: get the right amount of noise but not get too noisy. We gave one example there on CPU utilization. Are there any other use cases that come to mind?
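The walkthrough above maps fairly directly onto the parameters of an alarm definition. The following Python sketch assumes the 90% threshold, one-minute period and "two out of three" setting from the example; the instance ID, topic ARN and function names are made up. The dictionary uses the parameter names of boto3's `put_metric_alarm`, and `breaches` is a simplified model of the "M out of N data points" rule that ignores missing data, not CloudWatch's actual evaluation logic.

```python
def cpu_alarm_params(instance_id: str, topic_arn: str) -> dict:
    """Alarm on average CPU >= 90% for 2 of the last 3 one-minute periods."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",        # the statistical function
        "Period": 60,                  # seconds per data point
        "EvaluationPeriods": 3,        # look at the last three data points
        "DatapointsToAlarm": 2,        # two of them must breach
        "Threshold": 90.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [topic_arn],   # SNS topic that receives the alert
    }


def breaches(datapoints, threshold, evaluation_periods, datapoints_to_alarm):
    """Simplified M-out-of-N check over the most recent data points."""
    recent = datapoints[-evaluation_periods:]
    breaching = sum(1 for value in recent if value >= threshold)
    return breaching >= datapoints_to_alarm


params = cpu_alarm_params(
    "i-0123456789abcdef0", "arn:aws:sns:eu-west-1:123456789012:alerts"
)
print(breaches([50, 95, 40, 92], 90, 3, 2))      # True: 95 and 92 breach
print(breaches([95, 95, 40, 50, 91], 90, 3, 2))  # False: only 91 breaches
```

Raising `DatapointsToAlarm` to 3 makes the alarm less sensitive, which is the noise trade-off mentioned above.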

Luciano: Yeah, just a few that come to mind. For instance, if you're monitoring a load balancer you can check the latency and see if it is going above a certain threshold for a certain amount of time. You can also create billing alerts. These are probably some of the most useful ones, and some of the ones you should create as soon as possible. With those you can monitor whether your projected expenditure is going above a certain threshold, which is probably going to be your billing budget or something like that. You can use that to stop some services, see whether there is a bug that is causing you to spend a lot of money on AWS, and react before that bill becomes very, very big. Other ones that I've seen to be very useful are for when you're building APIs: you can monitor API Gateway and see the number of 500 errors, and you can have alarms that check whether you have an increased error rate. Those can be useful to spot bugs, for instance. We have more examples in the next sections, but I think those are very good examples to understand the different kinds of alarms you can create, not just for application errors but also for billing or for latency. So there's a range of different things you can do with them.
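Two of these examples can be sketched as alarm parameters in Python. This is an illustration under assumptions, not a recommended configuration: the budget, error-rate threshold, API name and topic ARN are made up. The billing sketch uses the `EstimatedCharges` metric, which reports estimated charges so far rather than a projection, and which is only published in us-east-1 once billing alerts are enabled on the account. The API Gateway sketch relies on the `Average` statistic of `5XXError` representing the error rate.

```python
def billing_alarm_params(budget_usd: float, topic_arn: str) -> dict:
    """Alarm when estimated charges exceed a budget (create in us-east-1)."""
    return {
        "AlarmName": "estimated-charges-over-budget",
        "Namespace": "AWS/Billing",
        "MetricName": "EstimatedCharges",
        "Dimensions": [{"Name": "Currency", "Value": "USD"}],
        "Statistic": "Maximum",
        "Period": 21600,  # six hours; billing data is updated infrequently
        "EvaluationPeriods": 1,
        "Threshold": float(budget_usd),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],
    }


def api_5xx_alarm_params(api_name: str, topic_arn: str,
                         max_error_rate: float = 0.05) -> dict:
    """Alarm when the 5XX error rate of a REST API exceeds max_error_rate."""
    return {
        "AlarmName": f"{api_name}-5xx-error-rate",
        "Namespace": "AWS/ApiGateway",
        "MetricName": "5XXError",
        "Dimensions": [{"Name": "ApiName", "Value": api_name}],
        "Statistic": "Average",  # average of 0/1 samples gives the error rate
        "Period": 60,
        "EvaluationPeriods": 5,
        "DatapointsToAlarm": 3,
        "Threshold": max_error_rate,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],
    }


topic = "arn:aws:sns:us-east-1:123456789012:alerts"  # hypothetical topic
billing = billing_alarm_params(100, topic)
api_errors = api_5xx_alarm_params("orders-api", topic)
print(billing["Threshold"], api_errors["AlarmName"])
```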

Eoin: Yeah, it might be worth mentioning that we think about alarms as alerts, but alarms are always in one of three states. The state you want them to be in is the OK state, the state you fear the most is the ALARM state, and the third state is somewhere in between: that's called INSUFFICIENT_DATA, and that's the state it gets into if you don't have enough data points for that metric to determine which state it's actually in. Often when you create an alarm for the first time, if you don't have enough data it'll end up in insufficient data, so that's what that's about. And any alert, if you're talking about SNS alerts or anything like that, can fire on any state, not just the alarm state. So you can say alert me when it goes into alarm, but also alert me when it goes back into OK.

Luciano: That's a very good clarification, because the naming can be a little bit confusing. Generally when we talk about alarms we think about the alarm state, so when things are bad. But in AWS CloudWatch the term alarm is just the configuration, and then you need to look at the state to understand: are we in a good situation, are we in a bad situation? Or, as you said, you can even ask: are we just recovering from a bad situation? Because you can create notifications that tell you: you were in an alarm state before, and now we are finally back into the OK state. If you use things like chatbots (we'll mention some of them later on), that can be very useful to see. Maybe you have a noisy configuration, and sometimes you realize: OK, this thing triggered an event, I'm seeing an issue, but then immediately afterwards I see that the issue resolved itself. That means you don't really need to do anything. Maybe you can tweak your configuration a little bit to reduce the noise, but you don't have an immediate need to fix anything.
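The distinction between "went into alarm" and "recovered" can be made explicit in whatever consumes the notifications. On the alarm side, `put_metric_alarm` accepts `OKActions` and `InsufficientDataActions` next to `AlarmActions`. The Python sketch below assumes the notification arrives through SNS as a JSON message carrying `AlarmName`, `OldStateValue`, `NewStateValue` and `NewStateReason`, and that it is handled by a Lambda function. The handler, the sample event and the wording of the summaries are made up for illustration.

```python
import json


def describe_transition(sns_message: str) -> str:
    """Turn a CloudWatch alarm notification into a one-line summary."""
    alarm = json.loads(sns_message)
    name = alarm["AlarmName"]
    old, new = alarm["OldStateValue"], alarm["NewStateValue"]
    if new == "ALARM":
        return f"{name}: in ALARM ({alarm['NewStateReason']})"
    if new == "OK" and old == "ALARM":
        return f"{name}: recovered, back to OK"
    if new == "INSUFFICIENT_DATA":
        return f"{name}: not enough data to evaluate"
    return f"{name}: {old} -> {new}"


def handler(event, context):
    """Lambda entry point for an SNS-triggered function."""
    return [describe_transition(r["Sns"]["Message"]) for r in event["Records"]]


# Hypothetical event with a single recovery notification.
sample_event = {
    "Records": [
        {
            "Sns": {
                "Message": json.dumps({
                    "AlarmName": "high-cpu",
                    "OldStateValue": "ALARM",
                    "NewStateValue": "OK",
                    "NewStateReason": "Threshold no longer breached",
                })
            }
        }
    ]
}
print(handler(sample_event, None))  # ['high-cpu: recovered, back to OK']
```

A recovery message arriving right after an alarm message is the "issue resolved itself" pattern described above, and a hint that the thresholds may need tuning.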

Eoin: What do you think is the best way to actually get notified, then, if you want to react in a reasonable time frame and you also want to avoid noise? What's the best tooling that you can apply for notifications?

Luciano: Yeah, so there are different ways you can be notified when the state of an alarm changes. Probably the most common one I used to see in the past, which is not necessarily the one I would recommend, is SNS to an email. You configure your alarm and you say: every time this condition happens, send an SNS event, and then from that SNS event you can dispatch the event somewhere else. The most common destination I've seen in the past was email, but I am not a big fan of that one. I remember a few years ago when I was trying to use this, you would literally get a big wall of text in an email, and it's not the most intuitive way of understanding things when you have to go into an emergency and try to fix something. It takes a while to even process what's written there in the email and make sense of it. There are more sophisticated ways today. One is totally custom: you can just send the SNS message to a Lambda and then do whatever you want with it, and that really depends on how you structure your teams and your operations. For instance, if you use tools like PagerDuty, you could use the Lambda to send the event to PagerDuty and then manage the incident that way, or you can build the Lambda to integrate with other systems that you might want to use. Another interesting way is if you want to use a chat platform, because maybe you do that kind of operation based on Slack or Teams or something else. Of course you can do it with a Lambda and build the integration yourself, but there is a tool from AWS called AWS Chatbot. You can send the alarm from SNS to AWS Chatbot, and AWS Chatbot already has a lot of nice things built in: you get a very good user experience when you integrate it with Slack or Chime, with a very nice preview of the message describing what the incident is about. The only issue is that, as far as I know, right now Microsoft Teams is not supported. And I've been working in many companies using Microsoft Teams as the primary system, so in that case you are a little bit on your own to find other tools or to write your own integration.

Eoin: Yeah, I really like the Chatbot experience, actually. It's pretty easy to set up once you've got your topic. Chatbot is pretty slick to set up, and writing your own integration into Slack has become a little bit more involved: you have to create an app now, you can't just fire data at a webhook, or at least they don't want you to. So AWS Chatbot has kind of solved all that for you and gives you a really nice message.

Luciano: So what do we do after we get notified? What can we do with that in general?

Eoin: That's a good question. I suppose manual intervention is going to be completely dependent on the kind of alarm you're dealing with, but there are also some automated interventions. Apart from SNS, there are some other actions you can take that are actually automated remediation. Like you say, you can use SNS to Lambda, or you could target Systems Manager Automation for remediation. There's so much you could do there. But out of the box, if you're talking about EC2, EC2 metric alarms can also trigger EC2 actions. You could say: if your CPU utilization is getting too high, then reboot the instance or terminate the instance. If you've got some legacy application with a memory leak and that's your only option, I guess that's one path you could take. You can also trigger an auto scaling action, and this is probably where a lot of people may have used alarms in the past, maybe without even realizing it, because alarms and auto scaling go well together. You're scaling essentially in response to the observation of a metric breaching a threshold. That could be an EC2 Auto Scaling group, based on number of requests or some of your load balancer metrics, but it could also be ECS service auto scaling. An example of that, and I think we've raised this one a few times, is if you've got a pool of workers doing some jobs, batch processing say, in an ECS cluster. Then you could scale based on the number of messages in a queue, like an SQS queue that they're pulling their jobs from. That's a good way to handle it, because you could have a standard default of maybe five or so workers working away, but then if a large volume comes into the queue you might want to scale up to a certain limit, and alarms are really good for that. I don't use a lot of the many features in Systems Manager, but I know there are some things you can do there as well if you're using some of the incident management features: you can create an OpsItem if you're using OpsCenter, and you can also create incidents in Systems Manager. So there's quite a lot you can do. But I suppose one of the challenges with alarms, and one of the reasons people probably shy away from them, is that people have experience with very noisy alarms, and it can be very difficult to know what's going on when the default state is that things are always in alarm, and then people just stop trusting the value of them. Composite alarms are something you mentioned at the start. Do you want to talk about maybe some examples of when to use composite alarms, and how they compare to just the simple single-metric alarms?

Luciano: Sure. Before that, I like what you said, because there is a quote that I really like: when everything is an emergency, nothing is an emergency. I think we can say the same about alarms: if the alarm system is always noisy, you stop being concerned. That becomes your norm and you are not really trying to react and fix things anymore. So that's something to be aware of. I would also caveat that at the beginning, when you start to set up your alarms, it's probably a good idea to be a little bit noisy so you can find the thresholds that actually work for you. Initially there is probably a bit of a tuning phase where you try to find what your thresholds should be for you to be effective. Anyway, back to composite alarms. I think composite alarms are relevant to this topic because they can be one tool you can use to make things a little bit less noisy. One example I have is that it's very common, when you build APIs with API Gateway and Lambda, to have individual alarms both on Lambda failures and on 500 errors on API Gateway. If you have both, what happens when a Lambda fails? You get a 500 on API Gateway, so you are basically alarming twice. One way you can remove a bit of noise is this: of course you still need to create both alarms, but then you fire the alarm event on SNS using a composite alarm. That basically means you take the two alarms, you combine them with a boolean expression saying, for instance, if either one or the other fires then fire the composite alarm, and then you only receive a notification for the composite alarm. This way you are effectively flipping two alarms but being notified only on one of them. I don't know if you have other examples where composite alarms can be useful, but I think they are just a nice way to reduce noise. Maybe you can also build more complicated rules. In some cases you can use them to combine different metrics, and only when you see a certain combination of metrics happening does it really make sense for you to alarm. This is maybe another example where composite alarms can be useful.

Eoin: Yeah, it's very flexible in that way. One of the other points is about usability. Noise is one factor that's important for usability. Another one is that we've frequently talked about separating lots of different services, applications and environments into different accounts. So the question then becomes: how do you keep an eye on alarms and metrics across multiple accounts, and even multiple regions? It is possible to do this without having to open dozens of tabs in your web browser, or dozens of containers, or whatever way you manage different accounts. Cross-region metrics you get out of the box, so you can already view alarms and metrics from the console. But if you're talking about cross-account, you just have to do a little bit of setup. For every account that you want to share metrics from, you just need to say: OK, I'm going to share it with this central monitoring account, and give the account number. Then on the monitoring account side you just say: OK, I'm going to pull in metrics from these 20 accounts across my organization, or all accounts across my organization. Then you can start to look at things from one dashboard, from a single pane of glass essentially, without having to log out and log in or switch tabs, and that makes it a lot easier. So I think that's worth mentioning in this multi-account world. Maybe it's worthwhile if we talk about some more useful use cases for alarms, to see if we can give some inspiration for people who are maybe not using alarms extensively, so they can get cracking, make their lives easier and maybe reduce the operational overhead. That's what they're for after all, right? Especially if you can predict errors before they happen. Where will we start? We talked about business metrics in the last episode, and we've given examples about API Gateway and EC2. What kind of alarms could you create on business metrics that might be useful?

Luciano: Yeah, I think the ones we mentioned in a previous episode were, for instance, the number of sales per day. Again, our favorite e-commerce example. I suppose in some e-commerce businesses the amount of sales tends to be quite predictable, so you could create a custom metric and then an alarm to see when that value goes a little bit outside your expected amounts, either lower or way higher. In both cases maybe you need to take some action, so it's worth being alerted as soon as you see that happening. Another one that I used in the past, and that was very beneficial to me, is monitoring the number of people that are logged in to an application. In some applications that can be predictable enough, I would say, so you can define rules, or sometimes you can just say: let me know when this value is zero for a long time. This is actually the one I used in the past, and it helped me figure out an issue in the login system. Actually it was more in the session system than the login system, but anyway, having that alarm was good for me to see that people were not able to log in for a long period of time, so I could figure out there was something wrong, investigate, find the issue, fix it and bring everything back online. These are examples of business metrics you can leverage to build alarms, be more reactive, and prevent incidents from going on for a long time before you realize and can fix them. Again, we mentioned API Gateway and Lambda for more technical ones that you can use to capture specific bugs, so when the code is actually throwing errors. Similarly, if it's timing out, you can consider that an error. The problem with all the ones we just described is that they tend to be reactive. It means the issue happens first, then you're probably giving a disservice to your users because of these issues, and then you rush and say: OK, let's fix this, because our users are currently, I don't know, seeing unexpected results and we don't want that to happen anymore. But the question could be: can we do something in advance? Can we predict when something is about to happen and maybe fix it before it's too late, before it's impacting the users? There are two things you can do there. One is, for instance in the case of Lambda timeouts, you could create an alarm that looks for your Lambdas getting closer and closer to that timeout. For instance, you could say: trigger this alarm if your Lambdas are taking 90% of the time that has been allocated for that Lambda, so if you are close to the timeout but you are not hitting it yet. This is something that can happen, for instance, if you have N+1 queries in a Lambda. As your database grows, the time for executing that query will get longer and longer. Initially you will have plenty of time to stay within the timeout, but over time maybe you will see that time increasing and getting closer and closer to the timeout. Having an alarm there can tell you that this is becoming a problem before your users start to see an error, so that's something that can be very useful. Another thing you can do, and I'm not sure if we mentioned it already, is anomaly detection based alarms, which you can use to see if your current metrics are going outside the norm. You can use that, for instance, again for Lambda durations, if suddenly you see your Lambdas taking much longer: not necessarily getting close to the timeout, but just taking much more time.

Eoin: Unusual behavior, exactly. Yeah, that's good. I think for that one you specify something like a standard deviation, so you specify a band, how far outside the average you want your anomaly detection band to be, and that can trigger it.

Luciano: And one last example I have is for asynchronous computation. For instance, when you have a queue and a pool of workers, you could monitor what I think is called ApproximateAgeOfOldestMessage in SQS, and something similar, like the iterator age, in Kinesis. So if you're using Kinesis you have a similar concept there. That's something you can monitor to see if your workers are doing their job fast enough, so that you don't keep accumulating messages. Of course, if you accumulate more messages than you can process in a reasonable amount of time, eventually you're not going to be able to process these messages fast enough anymore, so your users will wait more and more before they can get results. Those are other good things to explore, and maybe have some alarms on, to make sure things are being done efficiently enough.
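The proactive alarms discussed in this part of the conversation (Lambda duration approaching the timeout, an anomaly detection band on duration, and the age of the oldest message in a queue) can be sketched in Python as parameter builders for boto3's `put_metric_alarm`. Function names, resource names, periods and the five-minute queue age limit are made up for illustration, and the band width of 2 is only an example value.

```python
def _lambda_duration_metric(function_name: str) -> dict:
    return {
        "Namespace": "AWS/Lambda",
        "MetricName": "Duration",  # reported in milliseconds
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
    }


def duration_near_timeout_params(function_name, timeout_seconds, fraction=0.9):
    """Alarm when the slowest invocation uses more than `fraction` of the timeout."""
    return {
        "AlarmName": f"{function_name}-duration-near-timeout",
        **_lambda_duration_metric(function_name),
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": round(timeout_seconds * 1000 * fraction),
        "ComparisonOperator": "GreaterThanThreshold",
    }


def duration_anomaly_params(function_name, band_width=2):
    """Alarm when average duration rises above the anomaly detection band."""
    return {
        "AlarmName": f"{function_name}-duration-anomaly",
        "ComparisonOperator": "GreaterThanUpperThreshold",
        "EvaluationPeriods": 3,
        "ThresholdMetricId": "band",
        "Metrics": [
            {
                "Id": "m1",
                "MetricStat": {
                    "Metric": _lambda_duration_metric(function_name),
                    "Period": 300,
                    "Stat": "Average",
                },
                "ReturnData": True,
            },
            {
                "Id": "band",
                "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width})",
                "ReturnData": True,
            },
        ],
    }


def queue_age_params(queue_name, max_age_seconds=300):
    """Alarm when the oldest SQS message has been waiting too long."""
    return {
        "AlarmName": f"{queue_name}-oldest-message-age",
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateAgeOfOldestMessage",  # reported in seconds
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 5,
        "Threshold": max_age_seconds,
        "ComparisonOperator": "GreaterThanThreshold",
    }


near_timeout = duration_near_timeout_params("resize-image", timeout_seconds=30)
print(near_timeout["Threshold"])  # 27000 (milliseconds)
```

None of these include notification actions; adding an `AlarmActions` list with an SNS topic ARN would work the same way as for any other metric alarm.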

Eoin: Yeah, and if you're into continuous deployment and continuous delivery, one of the other things that springs to mind is that you can tie your alarms into your deployment process as well. This is particularly useful if you're doing canary deployments or blue-green deployments, where you need to monitor the health of the application as you're shifting traffic, or a percentage of traffic, from one deployment to another. CloudWatch alarms are ideal for that because you can get programmatic access to the state of the alarm, and it's also well integrated into CodeDeploy and CloudFormation. Those tools will integrate with your alarms: they start shifting traffic over to a new container or a new instance, and if the alarm starts to fire they'll say, OK, let's back off here and shift back to the old instance. So it gives you deployment safety as well. Maybe that's something we can go into in depth in a future episode. Are there any drawbacks with CloudWatch alarms? What's the first user experience like for people getting started with CloudWatch alarms, and how could it be improved?

Luciano: Yeah, I don't know if I would call it a drawback, but one thing that I wish was a little bit better with CloudWatch is that you don't really get anything out of the box. You get all these amazing features, you can do a lot of things, but you literally start with a blank slate and it's on you to define which things to look for and create all the alarms. If you haven't done it before, it might be a little bit stressful to come up with different use cases and make sure things are finely tuned. For that, AWS gives you the Well-Architected Framework, for instance, and documentation that can give you ideas on where to start and what to look for, but again it's all on you to do the hard work of configuring everything. And sometimes you create, for instance, a serverless application, maybe using the Serverless Framework, and it feels a little bit unnecessary: you have already defined everything with infrastructure as code, so why not apply those best practices by taking the information that is already available as code? Seeing, for instance, that you are using Lambdas and SQS, why not create all the default alarms out of the box based on your infrastructure? I wish that was available. We have been spending a lot of time ourselves on this problem, and because of that we ended up creating a Serverless Framework plugin that does exactly that. It looks into your infrastructure as code, so your YAML file, your serverless.yml, and if it sees, for instance, that you're using a Lambda, it will provision a bunch of alarms that make sense for Lambda. Similarly if you're using Step Functions, API Gateway, DynamoDB, Kinesis or SQS. With that plugin you just initialize it and you get a bunch of defaults, but of course you can configure it a little bit more if you want to change the default thresholds, or if you want to enable or disable specific alarms that are a little bit more advanced.

Eoin: That's a good one, and it creates dashboards as well, so it also gives you a bit of a better user experience in terms of how those graphs of metrics are organized. I don't think it's probably worthwhile to have a dedicated episode on CloudWatch dashboards, since they're really just a representation of your metrics, and they can also show alarms as well, but it's another feature of that plugin: when you deploy a serverless stack, you'll get a dashboard for that stack. So maybe it's a good idea to wrap up and just talk about cost. You get 10 metric alarms for free, which is nice of them, but beyond that it can get expensive if you've got a lot of them. I guess your mileage will always vary depending on your context and usage. You're talking about 10 cents per alarm metric. Now, if you're using high-resolution alarms, which are alarms based on high-resolution metrics, so those one-second metrics we talked about in the last episode, those are 30 cents. And if you have composite alarms, they're 50 cents per composite alarm. I don't know why the composite alarm implementation is so complex that it justifies that cost. It seems like a pretty expensive boolean expression evaluation, but that's the cost you get. So if you've got tens or dozens of alarms, for most revenue-generating organizations that's probably OK, but if you're starting out and trying to keep everything in the free tier, it's definitely something to keep an eye on. At this point I think we've covered metrics fairly in depth over the last two episodes, and what you can do with them in terms of creating alarms. Next, I think we can talk about logs, log aggregation, and in particular how you can now use CloudWatch Logs to get a lot more, and maybe avoid having to use a centralized log aggregation service. Logs is probably my favorite topic of the three, so I'm looking forward to that one. If you haven't checked out the previous episode on CloudWatch metrics, check it out, it's episode number 33. Let us know what you think, like and subscribe, and we'll see you in the next one, all about logs.
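As a back-of-the-envelope check on the numbers quoted in the episode, here is a small Python sketch of the cost arithmetic. It assumes the prices are charged per alarm per month, that the 10 free alarms only offset standard-resolution metric alarms, and that prices are the same in every region. All three are assumptions for illustration, and current AWS pricing may differ. Amounts are kept in whole cents to avoid floating-point rounding.

```python
FREE_STANDARD_ALARMS = 10   # free allowance mentioned in the episode
STANDARD_CENTS = 10         # per standard-resolution alarm metric
HIGH_RESOLUTION_CENTS = 30  # per high-resolution alarm metric
COMPOSITE_CENTS = 50        # per composite alarm


def monthly_alarm_cost_cents(standard: int, high_resolution: int = 0,
                             composite: int = 0) -> int:
    """Estimated monthly alarm cost in cents, under the assumptions above."""
    billable_standard = max(0, standard - FREE_STANDARD_ALARMS)
    return (
        billable_standard * STANDARD_CENTS
        + high_resolution * HIGH_RESOLUTION_CENTS
        + composite * COMPOSITE_CENTS
    )


cents = monthly_alarm_cost_cents(standard=60, high_resolution=5, composite=2)
print(f"${cents / 100:.2f}")  # $7.50
```

In this example 50 billable standard alarms cost 500 cents, the five high-resolution alarms 150 cents, and the two composite alarms 100 cents.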