AWS Bites Podcast

35. How can you become a Logs Ninja with CloudWatch?

Published 2022-05-05 - Listen on your favourite podcast player

In the age of distributed systems we produce tons and tons of logs. This is especially true for AWS when using CloudWatch logs. So how do we make sense of all these logs and how can we find useful information in them?

In this episode we talk all about logs on AWS and we discuss the main concepts in CloudWatch for logs, like Log Groups and Log Streams. We discuss how you can consume logs, how this used to be a big pain point with AWS CloudWatch Logs, and how things are a lot better now thanks to a relatively new feature called Logs Insights.

Finally we discuss some best practices that you should consider when thinking about logs for your distributed cloud applications.

In this episode we mentioned the following resources:

Let's talk!

Do you agree with our opinions? Do you have interesting AWS questions you'd like us to chat about? Leave a comment on YouTube or connect with us on Twitter: @eoins, @loige.

Help us to make this transcription better! If you find an error, please submit a PR with your corrections.

Luciano: In the age of distributed systems, we produce tons and tons of logs. This is especially true for AWS when you use CloudWatch Logs. So how do we make sense of all these logs and how can we find useful information in them? My name is Luciano and today I'm joined by Eoin, and this is the AWS Bites podcast. So if you use AWS, chances are that you have been using CloudWatch, and you have probably seen that CloudWatch is a service that allows you to do a bunch of different things. In previous episodes we have been talking about metrics and alarms, and today we're going to focus on logs, which is the next big topic when you deal with CloudWatch. An interesting thing is that for a long time, at least in my career, I was using third-party tools because, in my opinion (take it with a pinch of salt), CloudWatch Logs was a little bit underwhelming in terms of functionality, in terms of the ability to actually make sense of and use the log data. I don't know, Eoin, if you share the same opinion. Definitely, especially when you want to read them.

Eoin: It was pretty good for storing logs, but the main challenge then is how do you get useful information out of them? And for a long time, there was really nothing you could do there.

Luciano: Yeah, I think now this has changed a lot, especially in the last couple of years. So in this episode, we're going to try to give you a few insights on how you can get the most out of CloudWatch Logs, and maybe you won't need to use a third-party tool for logs anymore. So do we want to start by describing the main concepts when it comes to CloudWatch Logs? Yeah, I can give that a go.

Eoin: With a lot of third-party log aggregators, it used to be pretty common that you'd take your log groups in CloudWatch and send everything into Elasticsearch, for example, with an ELK stack or Elastic stack, where you funnel all your logs into one place. And then you've got a big stream of all your logs from all your microservices, all your distributed applications.

So it's just one thing, but with CloudWatch it's quite fragmented the way logs are stored. There's a two-level structure. You've got these log groups, and that's like the primary folder structure for logs, usually from a given application, a given Lambda function or a given container. So that's a log group; imagine it like a folder. And then within the log group, you've got log streams. The number of streams you have kind of depends on the service, but a log stream is like a file within that folder. When you're looking for logs, because you've got multiple folders with multiple files, multiple log groups with multiple streams, you don't necessarily know where to look if you've just got those resources.

One other thing that might be worth mentioning for log groups is that some services let you log to a log group you specify. For example, you can log Step Functions state machine executions and state transitions to a log group, but you have to make sure that the log group name starts with a certain prefix. And this is something that isn't very clear in the documentation. So for Lambda, your function's log group name should start with /aws/lambda, for Step Functions it should start with /aws/states, and for EventBridge it should be /aws/events. Sometimes it lets you use something with a different naming convention, but it doesn't tell you why your logs aren't being written. So that's one thing I'd call out just in case it helps save people some time. So then you've got log streams and log groups. The question is, how do you view them? Maybe we can talk about what it used to be like a couple of years ago, when you didn't really have a lot of options. What would you use, Luciano? Yeah, mostly aws logs tail from the CLI.
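
As a quick illustration of the naming conventions mentioned above (the log group names are made up, and the exact prefix requirements can vary by service, so treat this as a sketch):

    # a log group a Step Functions state machine could log to
    aws logs create-log-group --log-group-name /aws/states/my-state-machine-logs

    # a log group for an EventBridge rule target
    aws logs create-log-group --log-group-name /aws/events/my-bus-logs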

Luciano: That used to be one way, but to be honest, I was more in the camp of just shipping everything into a big Elasticsearch cluster, because there you can just use Kibana, and that's a much nicer experience to search and find information. So I think that was my preferred way to operate when I was building any sensible application.

Eoin: For something like the Serverless Framework, when you deploy a Lambda function, you can use the framework to tail the logs in the console. That works pretty well. The AWS CLI has the aws logs tail command, which also works, and that saves you from going through the console and clicking through individual streams looking for logs, which is just too painful. But I always found it a little bit of an effort to create that Elasticsearch cluster just so you could store your logs. It seems like too much infrastructure for what you're trying to do. That kind of led me to try different things like Logentries or Splunk or many of the other solutions out there. There are probably a hundred of them at this stage, and they're all pretty good in terms of user interface. There are probably two things that stop people from using them sometimes. One is that people don't always feel comfortable with putting all of their logs in a third party. That depends on what kind of logs you have and what kind of data is in those logs. And the other thing is that sometimes there's just a propagation delay from the time logs go into CloudWatch Logs and then into your third-party vendor before you can query them. And when you're in a production troubleshooting scenario, seconds matter. Absolutely. When it comes to logs. So we were always on the lookout for ways you could improve the experience with CloudWatch Logs. So maybe before we get into the game-changing feature that enabled that, should we talk about some of the other features that might be less commonly used, like metric filters? It's not something I use very often, but it is pretty useful. Do you want to talk about that one? I'll try my best. It's also not a feature that I've been using a lot.
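
For reference, tailing logs the way described here might look something like this (the function and log group names are made up):

    # AWS CLI v2: follow a Lambda function's log group in near real time
    aws logs tail /aws/lambda/my-function --follow --since 10m

    # Serverless Framework: tail the logs of a deployed function
    serverless logs --function myFunction --tail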

Luciano: So let me know if you think I'm missing anything important. But the idea is that you can basically define a filter, which is like an expression that is constantly analyzing incoming logs and checking whether it matches the log lines, and then you can create metrics based on it. For instance, an example could be counting the number of 404 errors just by looking at access logs. It doesn't have to be API Gateway: maybe you want to count them from an NGINX instance that you are running on EC2 or in a container. You could create your own custom metric filter and get metrics this way. Okay. So that's different to the last time we were talking about embedded metrics format, right?
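
A rough sketch of the kind of metric filter Luciano describes (counting 404s in space-delimited access logs) might look like this with the AWS CLI; the log group name, filter pattern and namespace are purely illustrative and the pattern has to match the shape of your own log lines:

    aws logs put-metric-filter \
      --log-group-name /nginx/access-logs \
      --filter-name NotFoundCount \
      --filter-pattern '[ip, user, username, timestamp, request, status_code=404, bytes]' \
      --metric-transformations \
        metricName=NotFound,metricNamespace=MyApp/AccessLogs,metricValue=1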

Eoin: But this is not the same thing. It's not the same thing. Yeah, exactly.

Luciano: Embedded metrics format is basically when you create log lines that are effectively a JSON object. You are printing a JSON object that follows a specific structure that AWS recognizes, and then AWS will use that particular structure to create metrics for you. So you have in that structure all the information that is needed to actually push a metric into CloudWatch Metrics. That is integrated well with Lambda, and we spoke about how you can use it with containers too, or you can use the CloudWatch agent, for instance, on an EC2 instance to ingest these kinds of logs and turn them into CloudWatch metrics. Yeah, there is a little bit of terminology confusion there between metric filters and EMF metrics, but at the end of the day, the idea is that you can also use logs to create custom metrics in CloudWatch. So now that we have logs and understand their structure, how can we use these logs in different types of processing logic?
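
For anyone who hasn't seen the embedded metrics format, a log line following it is just a JSON object with a special _aws section that tells CloudWatch which fields should become metrics. A minimal, purely illustrative example:

    {
      "_aws": {
        "Timestamp": 1651756800000,
        "CloudWatchMetrics": [
          {
            "Namespace": "MyApp",
            "Dimensions": [["ServiceName"]],
            "Metrics": [{ "Name": "ProcessingTimeMs", "Unit": "Milliseconds" }]
          }
        ]
      },
      "ServiceName": "checkout",
      "ProcessingTimeMs": 123.4,
      "requestId": "abc-123"
    }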

Eoin: So you've got the option to create subscriptions on CloudWatch log groups. If you want to programmatically process them or send them somewhere else, you can create subscriptions. It used to be that you could only have one subscription per log group, and then they increased that limit to two, so you can have two subscriptions. I think I've heard of people having that limit raised further still, so that might be worth a try.

But the idea is that you can subscribe and then send all your logs to Lambda, or to Kinesis, or to Kinesis Firehose. With Lambda you can process logs directly, but if you want to batch things before you process them, it's a good idea to put them into a Kinesis data stream first. Then you can have lots of log entries in one message and process them in Lambda from that point. But you can also subscribe directly into Kinesis Firehose, and Firehose will...

If you want to put your logs into an S3 bucket, the Firehose approach is a good way to do it. But Firehose can also go to Elasticsearch, so that's one way of getting logs into Elasticsearch, and you can use Firehose to go to Splunk as well. So there are lots of options there. Depending on what you've got in your logs, you can use them to create metrics. Like you say, you can use EMF metrics, but people used to do this before EMF metrics existed with their own metrics format: they would use the StatsD format or a custom format for metrics, and then you could use Lambda to create the metrics with the CloudWatch API. That's the story when it comes to integrating with other services. We mentioned the game-changing feature for CloudWatch Logs then. So a year or two ago, probably two years ago, the CloudWatch Logs Insights feature was announced. For me, this ended up being a big deal. What do you think?
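
For reference, creating one of the subscription filters discussed above, sending everything from a log group into a Kinesis data stream, could look roughly like this (ARNs and names are placeholders, and the role must allow CloudWatch Logs to put records on the stream):

    aws logs put-subscription-filter \
      --log-group-name /aws/lambda/my-function \
      --filter-name ship-to-kinesis \
      --filter-pattern "" \
      --destination-arn arn:aws:kinesis:eu-west-1:123456789012:stream/my-log-stream \
      --role-arn arn:aws:iam::123456789012:role/CWLtoKinesisRole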

Luciano: Yeah, I consider you a CloudWatch Logs Insights ninja, because every time I have to search for something in CloudWatch, I struggle with the syntax, because of course it's a new syntax and you need to know all the different features, and Eoin has already mastered all of that, so he's my reference. But other than that, I think it's a pretty powerful thing, because it effectively gives you a query language that you can use to go and search for stuff, not just in a log stream, but even across multiple log streams together. So it's very powerful, for instance, when you're trying to troubleshoot a problem in a particular distributed system. Maybe you have all sorts of different components, multiple Lambdas, containers running, and you know more or less the time window where you are looking for particular evidence of an issue or trying to understand the problem. You can construct queries using all sorts of different properties. We'll talk more about how to have structured logs. And with CloudWatch Logs Insights, you literally have all the power to do that from a web interface, and you get the results and can use them to dig deeper until you find the results or the evidence you're looking for.

Eoin: You mentioned the syntax you use for querying with CloudWatch Logs Insights, and I really like it actually. Maybe for a lot of people it would be nice if you could use the Lucene syntax that people are familiar with from Elasticsearch and Kibana, but I never really got to grips with that fully. So I actually prefer this syntax, and it's pretty easy to remember once you have got to grips with some of the things you can do.

So it's essentially like a Unix pipe format: you create a pipeline for filtering and searching your logs. You can filter a given field for a given string, a numeric value, a Boolean operator or a regex. Then you can extract fields and parse strings; even if it's unstructured text, you can parse numbers, strings, or whatever from an arbitrary piece of text, again using regular expressions. And then you can do statistics.

So you can almost treat it like a metrics query engine then. You can extract numbers and aggregate them; it's almost like SQL in that case. So you could find the number of users who have logged into your system in five-second buckets using this syntax. And then you can sort, and you get the results back in the AWS console. You can use the API for this as well, but I find the API a little bit cumbersome for Logs Insights. If you have some programmatic use cases, you can give that a try as well. So I really like it actually. I only wish there were a few extra features that would really make this as usable as all the third-party systems out there. I use this every day, multiple times a day, across multiple different applications, but there are some limits and those limits get to me. Yeah.
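
As a rough illustration of the pipe syntax described here; the field names (eventType, userId) assume structured JSON logs and are purely made up:

    filter eventType = "login"
    | stats count_distinct(userId) as logins by bin(5s)
    | sort logins desc
    | limit 20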

Luciano: One example that comes to mind, just to present a case where this was useful, is a project I worked on recently where, in particular workflows, we write a lot of files to S3. We needed to troubleshoot something related to the creation of these files. One of the things we do is use a common utility library that allows us to write all these files to S3 in a consistent way, and this library also happens to write logs in a somewhat structured way. There is always a line saying something like creating a file called blah in this bucket called blah and the size of the file is this. And the easiest way to find the results we were looking for was literally, okay, let's go to CloudWatch Logs Insights. We know we produce all these logs consistently, and we know the time span in which we are trying to understand this particular problem. So let's set CloudWatch Logs Insights to this time span and create a parse expression to find these particular lines and extract the number of kilobytes, I think it was, we were looking for. So we basically managed to build a query that way and figure out, okay, we are creating this many files in this amount of time, and this is the size of the files that we are creating. This is just one example of the kind of things you could do, and it's very flexible because in that case we didn't even have structured logs, but we were able to extract information from text by using the parse functionality. That's nice. We talked about these metric logs before.
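
A hedged sketch of the kind of query Luciano describes, assuming log lines shaped roughly like "creating file report.csv in bucket my-bucket with size 128 KB" (the message format and field names are invented for illustration):

    filter @message like "creating file"
    | parse @message "creating file * in bucket * with size * KB" as fileName, bucketName, sizeKb
    | stats count(*) as filesCreated, sum(sizeKb) as totalKb by bin(5m)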

Eoin: So we talked about EMF metrics and then other less structured ones, not JSON structured, like the StatsD format, where you have something like MONITORING, pipe, the metric name, pipe, the unit, pipe, the value. So you could use CloudWatch Logs Insights to parse that, extract all the fields, extract the metric, the value and the unit, and then say, give me the maximum metric value for a five-minute window and group by the time segment, but also by the user's country code. And that kind of stuff works really well; you can do it with EMF metrics too. And if we go back to our metrics episode, we were talking about how CloudWatch metrics only gives you one-minute aggregations. Usually you can't get any finer grain than that unless you have high-resolution custom metrics. But once you've got those metrics in your logs, you can query and do really powerful things with this interface, CloudWatch Logs Insights. So some of the limitations... Sorry, go ahead, Luciano.
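
Something along these lines, assuming messages shaped like MONITORING|metricName|unit|value and a countryCode field discovered from the structured part of the logs (all names are illustrative):

    filter @message like "MONITORING|"
    | parse @message "MONITORING|*|*|*" as metricName, unit, metricValue
    | filter metricName = "checkoutLatency"
    | stats max(metricValue) as maxLatency by bin(5m), countryCode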

Luciano: Before we go into the limitations, another useful feature is the fact that you can save queries persistently. One of the things we did with one of our customers is that we have an operational playbook for some of the alarms that we have created. When those alarms fire, we go into a kind of incident mode where we need to troubleshoot and resolve the particular problem, and we have in this playbook links to the CloudWatch Logs Insights page that point to specific saved queries, with placeholders that we can fill in to try to find out what's actually going on for that incident. So that's another very useful feature. It's something that you can also do, for instance, in Kibana, so it's not like an innovation in the market, but again it reinforces the point that the reasons for using a third-party system are becoming less and less relevant, right? Because you can do all these things natively in AWS. Yeah, I agree. That's a useful one.
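
Saved queries can also be created programmatically, which is handy for keeping playbook queries in version control; a rough example (the name, log group and query string are made up):

    aws logs put-query-definition \
      --name "ops/failed-payments" \
      --log-group-names /aws/lambda/payments-service \
      --query-string 'filter level = "error" and msg like "payment failed" | sort @timestamp desc | limit 100'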

Eoin: And another useful feature worth mentioning is that you can export the data. Interestingly enough, for log lines you can export to CSV or Markdown, and I don't know why these are the first two chosen options. A plain text file would have been my immediate first preference, because I end up downloading the CSV and turning it into a text file. So maybe this brings us onto some of the gaps and limitations of CloudWatch Logs Insights. In terms of limitations, one of the ones people often cite as a barrier to adoption is the limit of 20 log groups per query. As you mentioned, it's good that you can query across multiple log groups in a distributed system, but why is the limit 20? I personally find that it doesn't cause me a problem often, because 20 log groups is generally quite a broad area. But if you compare it to something like Kibana, where all of your logs are in one place, then it's a difference, right? So you have to be more selective. It would be good to see that limit raised, and it would also be good if you could save those groups of log groups: you could have, say, a collection of log groups that relates to part of your application, and then run different queries against that collection. So another kind of save feature. The other limitation that you can run into sometimes is that you're limited to 10,000 log entries in the results, 9,999 to be precise. That's okay if you're browsing in the console; it's more than you generally want to read. But if you wanted to programmatically extract a large volume, then once you hit that limit, you end up having to use time ranges to extend it to another page. There's no built-in support for pagination across volumes greater than 10,000, which is a pity. Yeah.
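
For programmatic extraction, the underlying API pair looks roughly like this (shell with GNU date; the log group and query string are placeholders), keeping in mind that each query still returns at most 10,000 results:

    QUERY_ID=$(aws logs start-query \
      --log-group-name /aws/lambda/my-function \
      --start-time $(date -d '1 hour ago' +%s) \
      --end-time $(date +%s) \
      --query-string 'filter level = "error" | fields @timestamp, @message | limit 10000' \
      --query queryId --output text)

    # poll until the status is Complete, then read up to 10,000 results
    aws logs get-query-results --query-id "$QUERY_ID"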

Luciano: And that's, I think, a little bit confusing when combined with the export functionality, because it's not obvious that you are exporting only up to that maximum bound. So sometimes you feel like, okay, this just generated a big file, it gave me all the data. But be careful, you might have missed some data: if you hit that limit, the export is not going to go beyond it.

Eoin: In the user interface, one of the things I miss is that when you find an error, for example, often what you want to do, and what you can do with some of the third parties, is find nearby related logs. Imagine you're searching across 20 log groups and you've found an error and a stack trace, but you're only seeing the errors at that point. Immediately you want to say, okay, I want to see all the other messages related to that request in that Lambda function, five seconds before and five seconds afterwards. The only way to do that right now is to find the request ID, put it into a new query, and keep editing your query to drill down from an error back to related logs and then into another log stream. So it would be nice to have a bit of a user experience improvement where you could click on a request and find nearby related logs very quickly, for example. Yeah, that would be very useful. I think I had a few occasions where I wished I had that feature.

Luciano: One thing that I want to point out, because we mentioned a few times that this can give you capabilities that are very close to metrics, is that the model for storing and retrieving the data is entirely different between CloudWatch Logs Insights and metrics. Metrics are already highly optimized for quick retrieval and quick aggregation, just by the way the data is stored. Logs are effectively not: you can imagine them as being text files somewhere, and every time you run a query, you are literally scanning through all these text files. I know that AWS is probably very, very smart about it, because you get results in pretty reasonable time, so they probably parallelize this scanning in some very efficient way. But nonetheless, you are scanning through large amounts of data, and interestingly enough, in the UI you can see how much data you are actually scanning for every single query you run. The reason I wanted to point this out, which is maybe not so obvious, is that it has a direct impact on cost, because every query is not necessarily cheap, depending on how much data you are scanning.

Eoin: Yeah, it's actually one of the few areas in the AWS console where you can see the billing units update in real time, because the volume of data scanned is being updated multiple times per second as the query is running.

Luciano: Yeah, one thing I found myself doing very often is that when I'm building a new query, of course, I'm not as confident in my Logs Insights skills as you probably are, so it takes me a while, a little bit of trial and error, before I fine-tune my query to do what I want. So I run a fairly generic query, but then I try to stop it as soon as I see some results, because it doesn't make sense to keep it going for a long time if I know this is not going to be my final query. So I progress in small increments, and I found it very nice that you can stop queries before they keep scanning gigabytes and gigabytes of logs. That's a neat trick, and it's good to see that indicator going up to remind you that you maybe don't want to pay for this query if it's not your final query.

Eoin: Yeah, and the queries can run for up to 15 minutes, right? So that gives an idea of the volumes it can process, but also the potential cost. So maybe that's a good segue into the pricing topic. I know that you can scan up to five gigabytes of data for free in the free tier. After that, we're looking at around 57 cents per gigabyte for ingestion of data into CloudWatch Logs, so over 50 cents to ingest, and then about 3 cents per gigabyte per month for storing your logs. You can compare that to your third-party log aggregator and see how the cost compares. Then when it comes to Logs Insights queries, there's a price per gigabyte scanned, which is just over half a cent per gigabyte. So you can imagine, with terabytes that starts to escalate. It kind of scared me a little bit when I saw this first and started running Logs Insights queries for the first time, the fact that all of a sudden you start running queries and could run into big bills. It hasn't materialized in any kind of bill shock yet. I found the cost, especially compared to the value when you're troubleshooting and looking for insights into issues, to be good value for money personally. It's one of the areas where I wouldn't gripe too much about the pricing.

Luciano: Yeah, especially if you have already done your work in terms of structuring metrics and alarms, or you already have other indicators for what you're looking for, you can probably be very selective, for instance, with the time ranges, and that will limit the amount of logs you're going to be scanning every time you run a query. Yeah, yeah, absolutely. And I would say compare it to the third-party options out there.

Eoin: Some of them may offer a much cheaper option depending on your usage, because it comes down to volume ingested, volume stored and retention, and then some third-party options might also be priced on the number of users you have on the system. So there are different dimensions to consider, and your mileage may vary.

Luciano: Yeah, maybe one final topic before we wrap up this episode is to discuss, I don't know, suggestions or tips that we might have for when you produce your logs: is there anything you can do that will make your life easier in the future? Do you have any recommendations in that sense? I would always start with using structured JSON logs.

Eoin: And I think this has been a major improvement when it comes to being able to run queries. It means you don't have to parse logs yourself. Previously there used to be a kind of trade-off between human-readable logs and structured logs. I think now people tend to favor structured logs because they're much easier to query, parse and programmatically interpret, and if you need to present them for human readability, you can post-process them. What do you think? Would you also agree that structured logs are the way to go? I would agree.

Luciano: And for a long time I've seen those Apache access log format standards, which I think is exactly what you described: a good compromise between readability and something that you can easily parse. But that comes with the problem that once you have that kind of standard, the standard is very limited. There are only so many fields that the standard gives you, while in real life, if you go outside the domain of access logs, for instance, you probably want to log a lot more details in terms of fields and attributes and things that you care about for troubleshooting. So going with JSON is kind of an obvious strategy at that point, because you have total freedom to log all sorts of different fields that might make sense for your application. And then, as you said, CloudWatch Logs Insights will already have parsed every single field for you, and those fields are available for you to query. So that's a great way to go. And I've seen even web frameworks starting to go in this direction. For instance, if you use loggers like Pino in the Node.js land, Pino will automatically give you JSON logging, but it also has a bunch of utilities that allow you to pretty-print those JSON logs. So they are coming at the problem from two different angles: let's make it useful and easy to process, but at the same time, if you need to read those logs, you have all the tooling that allows you to do that. Yeah, I love Pino. It's a really good logger.

Eoin: And I've used it a lot for things like Lambda functions. I know that Pino allows you to, say, have nested loggers and to put contextual data into your logs as well. So what kind of additional data helps you to get better results when you start querying later? Yeah, I think definitely, given that in AWS we are building more and more distributed systems.

Luciano: One thing that you would definitely need to make sure you have is a correlation ID. So for every log, if that log is something that you can consider part of a transaction, or of a request from a user, and you have a unique ID for that particular transaction or request, make sure that it is propagated through all the different components that are trying to satisfy that particular request. Because at that point, if you have a correlation ID for something that went wrong, maybe a specific request that failed, you can easily query and get a unified view of all the logs with that particular correlation ID. And that's literally just one filter looking at one field, where correlation ID equals a specific value. So that's something I found extremely useful, especially recently, to troubleshoot particular issues. It takes a little bit of diligence to make sure that that information is propagated correctly everywhere, but as you say, if you use loggers like Pino, they have built-in features that can help you make sure it is. Similarly, you can have trace IDs: if you're using tracing like X-Ray, or maybe OpenTracing, you can also put that trace information in your logs, and that can help you correlate logs with traces. I remember, for instance, one thing I really liked from using Datadog in the past is that they push you to do that, and when you look at traces, for every single trace you see a window with all the logs that have the same trace ID. Sometimes that can be very useful, so hopefully we'll have something similar eventually in CloudWatch. I don't know if it's already possible in some other way.
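
A minimal sketch of this pattern with Pino in a Node.js Lambda handler; the event shape and the header name used for the incoming correlation ID are assumptions:

    // handler.js - propagate a correlation ID on every log line via a child logger
    const pino = require('pino');
    const logger = pino(); // structured JSON logs by default

    exports.handler = async (event, context) => {
      // reuse an incoming correlation ID if present, otherwise fall back to the request ID
      const correlationId =
        (event.headers && event.headers['x-correlation-id']) || context.awsRequestId;
      const log = logger.child({ correlationId });

      log.info({ path: event.rawPath }, 'handling request');
      // ... business logic, passing `log` (or the correlationId) to downstream calls ...
      log.info('request completed');
      return { statusCode: 200, body: 'ok' };
    };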

Eoin: I know that the ServiceLens part of the AWS console is going in this direction, but I haven't really played with it very extensively. I know the idea is to show you all these things at the same time, but I don't know if it's up to the level of some of the third parties out there with a really slick, responsive user interface for that. I still tend to do things manually and dive from one tab to the other to correlate things. Yeah.

Luciano: One last thing to be aware of is that, of course, when you log in a structured way, you might be very tempted to just log entire objects without even thinking about what you are actually printing. Because it's like, okay, if something bad happens, chances are I'll want to see these entire objects. I think that's a reasonable way of thinking about logs, but be careful with that mindset, because you might end up logging sensitive information. What happens if it's, I don't know, a payment Lambda, and you end up logging credit card details or personal information about the user? There are ways, and again it comes down to the different libraries, to redact or anonymize this data before it gets logged. I don't have a silver bullet type of solution for you, but just be aware of the problem, check out different libraries and what kind of support they give you, and see if it can be applied to your use case.
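
Pino, for example, has a built-in redact option; the paths below are made up and would need to match the shape of your own log objects:

    const pino = require('pino');

    const logger = pino({
      redact: {
        paths: ['user.email', 'payment.cardNumber', 'req.headers.authorization'],
        censor: '[REDACTED]',
      },
    });

    // the sensitive fields are replaced with [REDACTED] in the emitted JSON
    logger.info(
      { user: { email: 'jane@example.com' }, payment: { cardNumber: '4111 1111 1111 1111' } },
      'payment received'
    );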

Eoin: Completely agree with investing some time in X-Ray and trying to get the trace IDs in there, because I really like the way X-Ray covers pretty much all the services now and gives you a very, very good view. Not only is it good for troubleshooting, but in a distributed world, when you've got lots of tiny components running together and communicating with each other through events, it's sometimes very difficult to visualize your architecture and keep your architecture diagrams up to date. X-Ray is actually good for that too, because the diagrams almost emerge from your traces, and then you get very good performance data as well as troubleshooting for errors. And if you link it to your logs, you've got a very high level of maturity when it comes to troubleshooting.

Luciano: Yeah, I think the ideal goal is, if you imagine you are giving a user-facing experience, like somebody calling an API or loading a webpage, being able to give that same experience you gave to your user to your observability stack. I think that's the ultimate dream, where you can literally say, okay, this is what was happening, this was the speed, these were the components involved, and these are the logs for each component.

And then you see how the entire architecture responded to that request and everything that happened. I think we are getting close. These days we have all the tooling; maybe it takes a little bit of effort to configure everything correctly, but this kind of observability dream is not a dream anymore. It's something that is achievable with a little bit of effort. So it's definitely something to look forward to, especially for heavy user-facing products where you really want to make sure you are always giving the best possible user experience. Exactly. And it is part of the Well-Architected Framework to get this level of observability.

Eoin: The days of looking at a single access.log with grep and awk are probably well behind us with a distributed serverless architecture or microservices architecture. To address some of the complexity that these modern architectures give you, you have to fill that gap with good observability. So it is worth investing the time, and probably upfront actually. If you're starting a new project, adding X-Ray, structured logs, metrics, the tools and plugins we talked about in the previous episode, they're all there and it's a pretty low barrier to entry to get going. It takes more time if you're retrofitting it to an application you've already been running in production for a year or more. Yeah, I totally agree with that.

Luciano: So yeah, I think with this we have covered probably a lot, so this is a good point to summarize this episode and finish up. I don't know if you're feeling that you are now closer to being a CloudWatch Logs Insights ninja, or a logs ninja in general. Probably not, but nonetheless I hope that we gave you a lot of good suggestions and that you can get value out of them. And by the way, as with any other episode of this podcast, if you think that there are better ways to do the things we are recommending, definitely let us know in the chat or comments on whatever platform you're using, or reach out to us on Twitter, because we'd be more than interested in having a conversation. We are sure we can always learn a thing or two from our audience. So thank you very much for being with us. We definitely recommend you check out the metrics episode if you haven't done so already, and we'll see you at the next episode. Bye.