Help us to make this transcription better! If you find an error, please
submit a PR with your corrections.
Luciano: What can you do with CloudWatch metrics? In today's episode, we are going to discuss what CloudWatch is, and in particular, we are gonna focus on CloudWatch metrics. We are gonna discuss the characteristics of metrics, like namespaces, dimensions, units, and more, what metrics you get out of the box directly from AWS, how you can create your own custom metrics, and how to access and explore all the metrics that you have been collecting for your applications.
And finally, we'll try to compare CloudWatch to other providers so that we can assess whether CloudWatch is enough or if you need to use any other third-party service. My name is Luciano, and today I'm joined by Eoin, and this is AWS Bites Podcast. So, what is CloudWatch, and what are metrics? CloudWatch is a service with multiple subcategories. We know it's not just about metrics.
There are a number of things that you can do with CloudWatch. We have discussed many of them before, for instance, logs and dashboards and alarms. So let's try to make sense of all these things, but in particular, we want to focus on metrics. One interesting thing that I want to mention, because we touched on it in previous episodes, is that there is also a feature related to events in CloudWatch. If you've been using AWS long enough, you will remember that every time you tried to create a Lambda on a schedule, it would create a CloudWatch Event for you. All of that is now under the umbrella of EventBridge, and we explored it when we spoke about EventBridge, so check out that episode if you haven't already. So yeah, today we are just going to focus on the metrics component of CloudWatch. Why don't we start by trying to describe what a metric is in CloudWatch?
Eoin: Yeah, so a metric is essentially a time series of data points from your systems. And in CloudWatch terms, metrics are defined by some of the things you mentioned: a namespace, a unit, a value for each data point, and then dimensions as well. So let's talk about what each of these terms means. A namespace you'll always see at the top level; the namespace is a container for all your metrics.
For example, for the service metrics coming from AWS itself, like for EC2, you'll have a namespace of AWS/EC2. And you can create your own metrics and give them your own namespace as well. Beyond that, you can have dimensions. For every metric, you can have up to 10 dimensions, and a dimension is essentially a different way of categorizing your metric. When you store a metric with multiple dimensions, CloudWatch is actually storing multiple copies of that metric, just with different dimensions.
Some examples of that: if you look at the duration metric you get for Lambda functions, that metric is actually stored by function name. So you can query it by function name, but you can also query it by function name plus version. And a lot of AWS service metrics are stored with multiple different dimensions, so you can query them depending on what you know about what you're trying to query. The important thing to be aware of there is that each dimension combination is stored separately.
So it's also priced as an additional metric. If you've got too many dimensions, or you try to create dimensions dynamically based on something that changes frequently within your application, that can result in escalating costs. That's one of the things that can catch people out, so try to reduce the number of dimensions and keep them constant. Apart from dimensions and namespace, you can also specify the unit.
So when you store a metric in CloudWatch, you can specify that it represents a number of seconds, or a count, or bytes, or a percentage. That can change the way the data is stored internally and what kind of queries you can do on it, but it's also useful metadata that you can use when you're creating graphs and dashboards. So those are the main properties of a CloudWatch metric, and it's important to understand how they're stored and how you can query them. We'll get to that, and we can talk in detail about how you can explore these metrics. But maybe we should talk about the different types of metrics as well, right? We mentioned EC2 and Lambda, so you can get those out-of-the-box metrics, but you can also create your own. So how would you categorize those two things, Luciano?
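As a quick illustration of namespaces and dimensions, here is a minimal sketch using boto3 that lists the dimension combinations the Lambda Duration metric is stored under (the output you'd see depends on your account):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# List all the dimension combinations under which the Lambda
# "Duration" metric is stored in the AWS/Lambda namespace.
paginator = cloudwatch.get_paginator("list_metrics")
for page in paginator.paginate(Namespace="AWS/Lambda", MetricName="Duration"):
    for metric in page["Metrics"]:
        # Each entry is effectively a separate metric, e.g. one for
        # {FunctionName} and another for {FunctionName, Resource}.
        dims = {d["Name"]: d["Value"] for d in metric["Dimensions"]}
        print(dims)
```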
Luciano: Yeah, so as you said, we have out-of-the-box metrics, which are the things that you would expect to see for the kind of AWS services you use. For instance, if you spin up an EC2 instance, at any time you can see what the CPU usage is and a bunch of things like that, and the same for Lambda, where you can see, for instance, the average Lambda duration. All of these things are stored in CloudWatch as metrics out of the box.
You don't need to do anything to enable them, so at any time you can just go in, check these metrics, and build dashboards and alarms based on them. Most importantly, they are given to you for free, and every service will have a long list of metrics. So just go to the documentation and you will find which metrics are supported and what they mean. I have to say the naming sometimes doesn't make it extremely obvious what the metric actually means.
For instance, if you look at SQS, there are some names that might be a little bit confusing. So make sure to check the documentation, because there you can find a good description of what the metric actually means. Again, the name alone is sometimes not enough to make complete sense of the metric. What else can we say? Sometimes there are metrics that I wish were there, but they aren't.
You can fill that gap by providing your own custom metrics. There are, of course, ways and APIs that you can use to basically say, I'm recording a certain value and I would like to see it in the future as a metric, maybe because I want to build a dashboard or because I want to build custom alarms based on that metric. It takes a little bit of work, but you can definitely do that.
Another case where this is useful is not just for technical metrics. If you want to keep track of resources that you're spinning up and certain characteristics of those resources that are not supported out of the box, you can use custom metrics for that. You can also use custom metrics for business reasons: for instance, you might want to know how many users are logged in to your platform at a given moment in time, or how many purchases you are generating in a given unit of time. So you can definitely use CloudWatch for more business-oriented kinds of metrics as well, and use custom metrics for that.
Eoin: I think that's really useful actually, and sometimes more useful than the custom technical metrics, because the metrics at a technical level can sometimes create a lot of noise, since there are so many of them. But if you look at what's actually important to your business, like you say, if you have an e-commerce application and you're tracking how many purchases have been made, you generally know the number of purchases you expect in a given hour or minute, let's say 20,000. And you can create an alarm on that metric that says, okay, if the number of purchases drops below a certain threshold, or maybe even exceeds a certain threshold, then let me know with an alarm. Sometimes that's a lot more useful than looking at detailed technical metrics, which can be very noisy, because it's telling you that something actually important to your business is being affected. If you suddenly have half the number of e-commerce transactions being processed, that's something that you can act on pretty quickly, and you know what that means for your business. So I think that's a really good one.
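As a sketch of that idea (using boto3; the namespace, metric name, threshold, and SNS topic are all hypothetical), such an alarm could be created like this:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm if the number of purchases recorded in a 5-minute window
# drops below an expected floor. "Ecommerce" and "PurchaseCount"
# are made-up names for this example.
cloudwatch.put_metric_alarm(
    AlarmName="purchases-below-expected",
    Namespace="Ecommerce",
    MetricName="PurchaseCount",
    Statistic="Sum",
    Period=300,                    # 5-minute aggregation
    EvaluationPeriods=3,           # 3 consecutive breaching periods
    Threshold=1000,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="breaching",  # no purchases at all is also bad
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:alerts"],
)
```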
Luciano: Another example that I have from the past, this was not built on CloudWatch metrics actually, but I think the same example applies and you could build it with CloudWatch metrics. I was working on an application that had some custom metrics looking at the number of logged-in users. There was a good number of users constantly throughout the day, but suddenly we saw that the number dropped to zero, and that helped us to realize that there was an issue with the login system. We didn't have other alarms that could tell us otherwise, so having the custom metric was very useful, because we could immediately see that something looked wrong there, investigate, find the issue, and fix it as fast as possible.
Eoin: Really good. When you mentioned missing service metrics, one of the things I find is really missing from AWS metrics relates to billing. AWS billing, as we know, is very complex, but everything is billed based on a certain unit of consumption. I would really like it if CloudWatch metrics out of the box gave you everything that was billable as a metric, so that you could create alarms and observe trends in usage.
A very basic example would be: give me a metric on the number of running EC2 instances per instance type, or the number of running containers per ECS cluster. But those things don't exist out of the box, and I've had many situations where, as part of cost control, you end up using EventBridge events to keep track of when things start and stop, so you can increment your own counter, create a running-containers metric, and use that to anticipate billing problems before they occur, because there's always a lag before you get billing utilization data. So I'd really like it if there was a hard and fast rule in AWS that everything billable was also available as a metric; that would help you keep an eye on costs much better.
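As a sketch of that workaround (all names here are hypothetical, and the counter logic is deliberately naive), a Lambda handler subscribed to an EventBridge rule for ECS task state changes might look like this:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical handler for an EventBridge rule matching
# "ECS Task State Change" events: publish a data point so you can
# track running tasks per cluster as a custom metric.
def handler(event, context):
    detail = event["detail"]
    # Naive delta: +1 when a task starts running, -1 otherwise.
    value = 1 if detail["lastStatus"] == "RUNNING" else -1
    cloudwatch.put_metric_data(
        Namespace="CostControl",  # made-up namespace
        MetricData=[{
            "MetricName": "RunningTaskDelta",
            "Dimensions": [{
                "Name": "Cluster",
                "Value": detail["clusterArn"].split("/")[-1],
            }],
            "Unit": "Count",
            "Value": value,
        }],
    )
```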
Luciano: Yeah, that would be really useful, absolutely.
Eoin: One for the AWS wishlist.
Luciano: Yes, we'll send something that way. So how do you access metrics then? Let's say you have custom metrics that you created or you just want to look at the metrics that you get out of the box from AWS, how do we access them?
Eoin: Yeah, as with everything, you can use the API and the SDK to read this metric data. More commonly, this is one of the cases where you'll jump into the AWS console and use the metrics explorer and the metrics dashboard to create graphs and look at them in different chart types: line graphs, bar graphs, or just numeric values. But it's important to understand that when you're accessing these metrics, you don't access individual data points in the time series; you're always accessing statistics.
When you store metrics, AWS is accumulating all these different statistics for you at different levels of granularity, and that's what you can query. You're not querying individual records, so it's not like you're running queries against a time series database. That's a fundamental concept that's important to understand. So when you store, say, a count, internally it's going to record the average and the minimum and the maximum and the sum per minute, for example.
But it also has a lot more in-depth statistical functions. It'll store the sample count, but also percentiles, so you can query any percentile, and there are also some more complex ones, like the trimmed mean; you can look into the AWS documentation for those. The user interface is not as slick as some of the third-party offerings, but it is practically very useful. If you understand what you're doing and get a bit of familiarity with it, it is really quite good.
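To make the statistics idea concrete, here is a minimal sketch using boto3 that queries a percentile statistic, plus a metric math expression of the kind Eoin mentions next (the function name is hypothetical):

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

lambda_metric = lambda name: {  # helper to build a metric reference
    "Namespace": "AWS/Lambda",
    "MetricName": name,
    "Dimensions": [{"Name": "FunctionName", "Value": "my-function"}],
}

response = cloudwatch.get_metric_data(
    StartTime=now - timedelta(hours=3),
    EndTime=now,
    MetricDataQueries=[
        {   # a statistic query: p95 duration per minute
            "Id": "p95duration",
            "MetricStat": {"Metric": lambda_metric("Duration"),
                           "Period": 60, "Stat": "p95"},
        },
        {   # metric math: errors as a percentage of invocations
            "Id": "errorrate",
            "Expression": "100 * errors / invocations",
        },
        {
            "Id": "errors",
            "MetricStat": {"Metric": lambda_metric("Errors"),
                           "Period": 60, "Stat": "Sum"},
            "ReturnData": False,  # only used as input to the expression
        },
        {
            "Id": "invocations",
            "MetricStat": {"Metric": lambda_metric("Invocations"),
                           "Period": 60, "Stat": "Sum"},
            "ReturnData": False,
        },
    ],
)
for result in response["MetricDataResults"]:
    print(result["Id"], list(zip(result["Timestamps"], result["Values"])))
```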
You can also perform some mathematical operations. You can do metric math to combine multiple statistics together and apply a formula to them. So we mentioned that we're storing statistics, right, and not individual values. Then it's important to talk about resolution: metrics are typically stored as one-minute-level aggregations. Now, for EC2 metrics, the default used to be five minutes.
So for EC2, if you want one-minute granularity, you have to enable detailed monitoring, and there's a cost implication to that. That's important to realize. You also have the ability, for some metrics like custom metrics, to specify one-second granularity. Those are called high-resolution metrics, and there's an extra cost to that, because you're storing essentially 60 times the volume that you would with minute-level aggregations.
There is a recent feature called CloudWatch Metrics Insights as well, which is just another way of accessing those metrics, and it allows you to write SQL-like queries on metrics instead of just using the UI to build graphs. I guess just one point to note about one-second granularity and these high-resolution metrics: they're not very common, but when you go to the dashboard and you're building graphs in the AWS console, there's a dropdown where you can specify the period of granularity you're trying to present in your graph. One of the options there is one second, and that option appears whether or not those metrics are available at one-second granularity, but more often than not, it's just going to show you one-minute granularity.
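A Metrics Insights query can also be run programmatically through the same GetMetricData API; here is a minimal sketch, assuming this query syntax (it's close to what the console generates):

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

# A Metrics Insights query passed as a GetMetricData expression:
# average Lambda duration over the last hour, grouped by function.
response = cloudwatch.get_metric_data(
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    MetricDataQueries=[{
        "Id": "q1",
        "Expression": 'SELECT AVG(Duration) FROM "AWS/Lambda" GROUP BY FunctionName',
        "Period": 60,
    }],
)
print(response["MetricDataResults"])
```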
Luciano: Yeah, and that can be confusing, because sometimes, using metric math formulas, you have graphs that are trying to display, for instance, limits in Kinesis, like throughput limits. You get a red bar that shows you how close you are getting to the limit, and that red bar is basically built using a formula that depends on the resolution. If you switch to seconds, it gets pretty confusing, because the bar will be calibrated for the second, while the data that you are seeing is still by the minute. So it doesn't immediately make sense and it can be confusing. Just be aware of this particular thing and make sure you are looking at the aggregation unit that you actually expect at that moment in time.
Eoin: Yeah, and it's important to be aware of the metrics retention then. So different granularities are retained by CloudWatch for different periods of time, isn't that right?
Luciano: Yes, that's something we can discuss in a little bit more detail. For instance, if you have data points for a period of less than 60 seconds, they are available for three hours. And this is also something that can be confusing: say, for instance, you had a daily run of something and then the next day you want to look back at it.
Depending on the kind of period you select, it can look like you are missing data or something was not actually running. But in reality, if you increase the period, you are gonna see the data, just aggregated over a different period. So just be careful with that: if you're looking at very small periods, you can't look too far back in the past, otherwise the data will be missing.
So it makes sense to look at that data just a few hours after it was recorded. Let's mention the other tiers, because I think it helps to understand this concept better. If you are looking at data points with a period of 60 seconds, they are available for 15 days. That's generally, I think, a good default: if you look at 60-second periods, that will work well enough for most use cases. But of course, if you want to look further back in the past, more than 15 days, you can aggregate by five minutes, and in that case you get retention for 63 days. And if you want to look even further back than that, you can look at one-hour aggregations, which will be available for 15 months.
Eoin: That makes sense. And I guess then it's just a question of understanding which statistical function you need to select when you're looking at metrics. So if you're looking at duration for a Lambda function, what do you want to see? Do you want to see the average duration per period, or do you want to see the maximum duration? Maybe for duration, you actually want to look at a percentile, like the 95th percentile. That makes more sense.
But if you're looking at Lambda invocations, you might look at the sum of invocations for a period. If you're looking at concurrent executions, it doesn't make sense to look at a sum, because concurrent executions is already like a sum in its own right, so you might look at the maximum concurrent executions. You have to understand the nature of these metrics and what they mean in order to pick the right statistical function. But the documentation for all of these metrics will help you, and it'll tell you, for each metric, which function you should be using to explore it.
Luciano: Also, this is something we'll probably mention later in the episode, but I think a similar concept applies even if you use third-party alternatives like Datadog, or even if you run your own StatsD and Grafana. You need to know exactly what kind of data you are storing and how it is structured, and then you'll use different functions to fetch this data and make sense of it. So this is not unique to CloudWatch. Okay, so let's say that now we want to create some custom metrics. How do we create one? There are a bunch of different ways. Where do we want to start?
Eoin: Maybe we can start with the most obvious, fundamental operation there, which is the PutMetricData API. So if you want to create a metric, you call the API with your SDK, and you can put a metric and specify the namespace, the dimensions, the unit, the value. I'm not sure if I'm forgetting anything else, but that's fundamentally it.
Luciano: This one is interesting because it has a lot of limits. So if you want to use it as a one-off, it's probably fine; it's a good way to create a one-off metric. But if you want a process that is constantly ingesting metrics this way, it's very easy to bump into limits. We'll mention an article in the show notes that explores some of these limits and how you can work around some of them. You can also use compression if you want to overcome some of the limitations around the payload size. So it's an interesting option, but I wouldn't say it's the most convenient if you really want to store a lot of metrics over time.
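For reference, a minimal PutMetricData sketch with boto3 (the namespace and names are made up); one way to reduce the number of calls is to batch observations into the Values/Counts arrays instead of making one call per data point:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_data(
    Namespace="MyApp",  # hypothetical custom namespace
    MetricData=[{
        "MetricName": "CheckoutLatency",
        "Dimensions": [{"Name": "Service", "Value": "checkout"}],
        "Unit": "Milliseconds",
        # Batch several observations in one call: each distinct value
        # paired with the number of times it was observed.
        "Values": [12.0, 15.0, 43.0],
        "Counts": [98, 40, 2],
    }],
)
```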
Eoin: Yeah, it's a pain when you have to work around all those limits. There's the CloudWatch agent as well, which is one of the older mechanisms for posting metrics, particularly if you're on EC2 and you want to collect custom metrics from your EC2 instance and post them. You can use StatsD or collectd, and the CloudWatch agent can pick those up and post them into CloudWatch. And that also supports something called EMF, the Embedded Metric Format. That is one of the newer features, well, maybe it's been around a few years now, but I think it's one of the most useful additions to the whole space of CloudWatch metrics. Do you want to describe what EMF is and why it makes such a difference compared to just using PutMetricData?
Luciano: Yeah, so I suppose the way we can think about EMF is that rather than calling a specific API with a specific structure to store all the information, you just log a JSON structure that contains all that information. Then something else will pick up those logs and make sure that all that information is translated into CloudWatch metrics and stored for you. And this is something that works out of the box, for instance, with AWS Lambda. In Lambda, if you produce log lines that are JSON objects conforming to the EMF specification, some process around Lambda will pick up those log lines and you will see the metrics appearing in CloudWatch metrics for you. And they are not priced like PutMetricData. I'm not sure if you still pay something or if they are entirely free. Do you remember?
Eoin: You still pay for having the metrics, and we can talk about pricing a little bit later. But you avoid the cost of the PutMetricData API, because the API requests are priced separately. You also avoid the latency, because PutMetricData involves a network request. Imagine you're performing some time-sensitive function: every time you call PutMetricData, you've got an HTTP request in there. But if you're just writing to the console and that's going to be captured in CloudWatch Logs, that's a much more efficient operation. So it gets around all of that: it gets around the cost, it gets around the latency, it's just way better for performance, and you don't have to worry about limits.
Luciano: Yeah, to me the biggest selling point is that it's way friendlier, because you don't have to think as much about all the different types of limits and all the different dimensions that would make you hit those limits. You just log these lines and you are pretty much guaranteed that eventually your metrics will appear in CloudWatch metrics.
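As a sketch of what such a log line looks like, based on the published EMF specification (the namespace and metric name are made up):

```python
import json
import time

def handler(event, context):
    # A single EMF-formatted log line: the "_aws" envelope tells
    # CloudWatch which keys are metrics; the sibling keys hold the
    # actual metric and dimension values.
    print(json.dumps({
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # epoch milliseconds
            "CloudWatchMetrics": [{
                "Namespace": "MyApp",
                "Dimensions": [["Service"]],
                "Metrics": [{"Name": "PurchaseCount", "Unit": "Count"}],
            }],
        },
        "Service": "checkout",
        "PurchaseCount": 1,
    }))
    return {"statusCode": 200}
```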
Eoin: Yeah, it's like magic. And I think it has a nice side effect: because it becomes so easy to create custom metrics, people tend to create more, and you end up with much more insight into your system. The only drawback I'd say is that the structure you need to create for EMF metrics is quite strange, quite unwieldy. If you were to design a nested JSON structure for storing metrics, you wouldn't exactly do it this way, but I'm sure there's logic behind it, and it's a small price to pay. And there are libraries: AWS provides libraries for generating this format in JavaScript, Java, and Python. And if you're using Python, the AWS Lambda Powertools makes this really easy; it has some really nice support for it.
Luciano: Yeah, and I suppose that will also help you to avoid mistakes, because if you are trying to create that object structure yourself, most likely you'll make some mistakes in particular edge cases. Using a library, all of that will be covered for you, and you just have a simpler interface that you can rely on.
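For example, with AWS Lambda Powertools for Python the same metric becomes a couple of lines (a minimal sketch; the namespace and service name are again made up):

```python
from aws_lambda_powertools import Metrics
from aws_lambda_powertools.metrics import MetricUnit

metrics = Metrics(namespace="MyApp", service="checkout")

@metrics.log_metrics  # flushes a valid EMF log line at the end of the invocation
def handler(event, context):
    metrics.add_metric(name="PurchaseCount", unit=MetricUnit.Count, value=1)
    return {"statusCode": 200}
```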
Eoin: Yeah.
Luciano: One thing that I have on this one is that it's not supported everywhere. If you want to use this on Fargate or EC2, you'll need to figure out different ways of making sure that the data is ingested. For instance, we mentioned that on EC2 you can use the agent. If you are using ECS or Fargate, I don't know if you can use the agent straight away, but what you could do is ingest this data some other way, like using a Kinesis stream, and then process it through a Lambda that will actually emit these log lines so they get picked up. So there is a little bit of indirection, but you can find other ways to make sure that the EMF format gets ingested. I wish there were better support across all the compute services.
Eoin: I completely agree. Especially with something like Fargate, right, which is a serverless container service, it would be nice if it just supported EMF metrics out of the box like you get with Lambda. But there is a way to do it. I've seen this done on one of our projects, where you can create a task definition that has two containers: one is your main application container, and the other one is like a sidecar for the CloudWatch agent.
The CloudWatch agent can then pick up the logs and create the EMF metrics for you automatically. I think there's also another log driver. Normally with a container, most people would use the awslogs driver, but there's also a FireLens driver that AWS provides, and while it's not widely used, I believe it also supports EMF metrics.
Luciano: Oh, that's interesting. Okay. So it's great for Lambda. Another reason maybe just to use Lambda for everything.
Eoin: Actually, it seems like we're going on about EMF metrics quite a lot, but maybe there's a good reason for that. One of the great benefits is that, while we said you can only access metrics as statistical values in the console, bear in mind that once you log your metrics as individual records like this, because you're using EMF, you also get the side benefit that you can query them in your logs as individual data points. So you can actually select individual metrics there, and I think you can extend that structure to add other fields, like annotations or labels, that you might use for querying. So if you find that your metrics aren't available because the resolution or the retention means they've expired, as long as you retain your logs, you can pull them in with Athena, or you can use CloudWatch Logs Insights and do really advanced aggregations and queries on them there. So it's very powerful.
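For instance, a minimal sketch of querying raw EMF data points with CloudWatch Logs Insights via boto3 (the log group and field names are hypothetical):

```python
import time
import boto3

logs = boto3.client("logs")

# Sum the raw PurchaseCount data points in 5-minute bins, straight
# from the EMF log lines (no pre-aggregated statistics involved).
query = logs.start_query(
    logGroupName="/aws/lambda/my-function",
    startTime=int(time.time()) - 3600,  # last hour, epoch seconds
    endTime=int(time.time()),
    queryString=(
        "filter ispresent(PurchaseCount) "
        "| stats sum(PurchaseCount) by bin(5m)"
    ),
)

# Logs Insights queries are asynchronous: poll until done.
while True:
    results = logs.get_query_results(queryId=query["queryId"])
    if results["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)
print(results["results"])
```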
Luciano: Yeah, it will probably be a slightly more bespoke data extraction and aggregation, and maybe it's not as easy to build a graph out of that, but nonetheless, you retain the granularity of every single metric data point and you can use it anytime.
Eoin: Yeah, yeah, for sure.
Luciano: Okay, so should we quickly explore pricing? We already said that there are metrics that are available out of the box, and they are pretty much free, if I remember correctly, at a five-minute frequency. Then you can have detailed monitoring metrics. This is, I think, only for EC2, right?
Eoin: EC2, yeah.
Luciano: Does it apply to all the other services?
Eoin: Yeah, the pricing page is quite confusing on that, but the way I understand it is that you only pay for detailed monitoring metrics, at one-minute frequency, for EC2, because you get that out of the box for everything else.
Luciano: And then you can do 1 million API requests for free, for things like GetMetricData or GetMetricWidgetImage, which I believe is one way that you can create an image of a graph for a dashboard and use it somewhere else.
Eoin: Yes, your free tier gives you a million API requests, but I think GetMetricData and GetMetricWidgetImage are actually the ones that don't come under the free tier, so those are just ones to be mindful of. For most API requests, you're gonna pay about one cent for each 1,000 requests, something around that order.
Luciano: And then if you want to create your own custom metrics, that is, I suppose, a little bit hard to estimate exactly. I think we estimated between two cents and 30 cents per metric, depending on volume, and probably depending on the number of dimensions that you create for every metric. So that's something that requires a little bit of an exercise if you really want to be accurate about estimating the cost. And then there are the usual API request charges: if you do 1,000 requests, you're gonna pay one cent per 1,000 requests. So again, another reason not to use PutMetricData massively, because you're probably gonna create thousands of metrics over a short amount of time, and that will affect your cost. There is another feature called Metric Streams. Do you want to mention what that is and how it can be useful?
Eoin: Yeah, Metric Streams is another relatively new addition to CloudWatch metrics. The idea is that if you've got a third-party application or something else that you want to use as a sink for CloudWatch metrics, the traditional means of doing that was to poll and pull metrics in on some sort of pre-configured interval. And if you've got a third party, like Datadog as you mentioned, or New Relic or something else, that often ended up introducing really significant delay.
People reported sometimes waiting 15 minutes before they could see their metrics in their APM, their application performance monitoring tool. So that's not very usable for real-time troubleshooting. The idea with Metric Streams is that AWS will create literally a stream of metrics through Kinesis Data Firehose, and from there you can put them in Redshift or an S3 bucket, or you can stream them to a third party like Datadog. I'm not sure exactly which third parties are integrating with Metric Streams, but the idea is to give you faster time to insight on your metrics. That's the problem I believe that feature was designed to solve.
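Setting one up is essentially a single API call once you have a Firehose delivery stream and an IAM role in place; a minimal sketch (all ARNs here are placeholders):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Stream all metrics in the AWS/Lambda namespace to a Kinesis Data
# Firehose delivery stream (which could then deliver to S3, or to a
# partner endpoint like Datadog).
cloudwatch.put_metric_stream(
    Name="lambda-metrics-stream",
    FirehoseArn="arn:aws:firehose:eu-west-1:123456789012:deliverystream/metrics",
    RoleArn="arn:aws:iam::123456789012:role/metric-stream-role",
    OutputFormat="json",
    IncludeFilters=[{"Namespace": "AWS/Lambda"}],
)
```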
Luciano: Yeah, and I think they also have ways to easily integrate that stream from your AWS account into Datadog or whatever other service you use.
Eoin: That's a good point actually, because we're talking all about CloudWatch metrics, and a lot of people out there have probably used CloudWatch metrics from time to time but actually have a third-party solution as their chosen vendor for performance management. What do you think? Should people be going for CloudWatch instead of using something like Datadog, or is there a limit beyond which you might say, okay, CloudWatch isn't really gonna serve my needs anymore, I need to use something more complete? What's the story there these days?
Luciano: Yeah, if you had asked me this question a couple of years ago, I would probably have told you don't use CloudWatch at all, just use something else. And honestly, it's not because CloudWatch doesn't have the capabilities that you need; it's mostly because the UI used to be way behind the competitors. Now I think the CloudWatch team is investing a lot in trying to catch up, so I am seeing a lot of innovation and a lot of improvement, and I'm confident that they will get there and you will have a very good service from the UI perspective as well. So I'm confident it's a good investment to learn CloudWatch and use it, and for most use cases it can probably be the only service you need, but I still think they are a little bit behind what the competition has to offer.
For instance, I've used Datadog at quite a decent scale, I think, and I have to say that most of the UIs were much more intuitive; it was way easier to understand the data. Also, you get a lot more dashboards out of the box, so you don't have to configure as many things as you need to configure today in CloudWatch. So I suppose your mileage may vary: if you are starting off and you just need to build a few dashboards because it's a small project,
you're probably not gonna notice any big difference. But if you have multiple teams using multiple products and producing a lot of metrics, dashboards, and alarms, going with something like Datadog will probably, on one side, make your life a little bit easier as a user. On the other side, you need to make sure that all the integrations are configured correctly and you are not missing any data; there are features that will require some fine-tuned integration. So you need to make sure that all the setup is done right, basically, but at that point, it's probably gonna be easier for the people in your team to avail of that information. Do you have a similar opinion or do you have a different one?
Eoin: Yeah, I agree. It really comes down to user experience. And also, if you've got multiple systems outside of AWS that you're monitoring, then it might make sense to have it all in unified dashboards and unified performance management systems, and your logs too; think about that. But I would say that once you get familiar with the user interface in CloudWatch, you can be very productive with it.
It just takes a little bit of investment in time, because you have to overcome the less fluid user experience. But like you say, the features are there, and you certainly benefit from the fact that you don't have to worry about sending your data to a third party and what that entails: the latency involved, the separate pricing arrangement. If you want to keep everything under your AWS bill and IAM, you can do quite a lot with CloudWatch metrics, and I think it is worth persisting with if you don't want to have to bother with another third party.
But at the same time, there's a lot of innovation in third parties as well. If we look at the amount of innovation that comes out of tools like Honeycomb, they're really pushing the boundaries of what observability means and leading the way as well. So I think it is worth exploring the space for sure and making your own decisions. One thing we didn't cover is that Grafana and Prometheus are very prominent in this space as well for metrics and logs, Grafana particularly for presentation. And you can actually do both, right? AWS has managed services for both Grafana and Prometheus, and you can use Grafana to visualize your CloudWatch metrics. You can build a lot more in terms of useful dashboards with Grafana than you can with CloudWatch. So maybe there's a middle ground where you can combine these things, especially if you've already got Grafana and Prometheus in the organization.
Luciano: Yeah, that's a good one. Okay, I think with this we covered everything we wanted to cover today. Of course, we mentioned that you can do a lot more with CloudWatch: you can do dashboards, you can do alarms, topics that we will probably be discussing in future episodes. So make sure to subscribe and follow on whatever platform you are using so you can stay in touch with us and be up to date when we publish the next episode. Until then, let us know if you have any questions or comments, and we look forward to seeing you in the next episode.