Help us to make this transcription better! If you find an error, please
submit a PR with your corrections.
Luciano: Keeping track of what's going on in AWS organization accounts can be very tricky. You might have potentially hundreds or thousands of changes happening every day. There is a tool called CloudTrail, which makes it possible to log all of this activity. But how do you best get insights into it? Today, we're going to drill down into how you can use CloudTrail and Athena together to ask questions about what's going on, troubleshoot issues, and answer questions for security and compliance. We will also touch on AWS Glue, and we will deep dive into Athena, infrastructure as code, and AWS organization trails. My name is Luciano, and I'm joined by Eoin for another episode of AWS Bites. fourTheorem is the company that makes AWS Bites possible. If you're looking for a partner to accompany you on your cloud journey, check them out at fourtheorem.com. The link is in the show notes. I'd like to start today by giving some definitions. So what is CloudTrail and what is Athena?
Eoin: I think it's a fundamental service in AWS, actually, and useful for developers and administrators alike, because it lets you keep track of what's going on for audit and compliance purposes, but also, like you say, for troubleshooting. And as a developer, it's a pretty useful tool to have in your toolkit. So it captures mainly management events, like those relating to create, update, and delete actions on AWS resources. Every account gets a trail for free, but then you can also pay if you want multiple trails, or if you want to capture some data events as well. It stores logs in JSON, and you can look at more recent events directly in the CloudTrail console, but you can also send them off to CloudWatch Logs.
Luciano: I remember we spoke in a previous episode at length about CloudWatch Logs and how you can do ninja stuff with the syntax that it gives you for all sorts of queries, aggregations, and filtering. So that seems like a reasonable thing that you would want to do, I guess. But given that we have all these tools in CloudTrail, why should we consider S3 and Athena instead? Actually, let's start by defining what Athena even is.
Eoin: Athena is a distributed query engine used to query data in object stores like S3, and it's based on the open source Presto and Trino projects. Trino is a fork of Presto. And using Athena, you're basically running SQL queries on JSON, CSV, Parquet, or ORC data. And when you execute queries in Athena, it's basically making a query plan, and in a distributed way, scanning parts of that data concurrently in chunks and doing additional steps to filter and aggregate that data.
CloudTrail logs are stored in JSON format. So they show you things like the account, the region, the date, the identity of who's making the request, the user agent, and the IP address. It gives you the event name, and you can see the request parameters and some of the elements that come back in the response as well. You have the ability, like you say, to query CloudWatch Logs with Logs Insights, but the storage for CloudWatch Logs is going to be more expensive. So if you want to retain data for a longer period of time, generally what people do is store the data in S3 and query it with Athena, and maybe just use CloudWatch Logs Insights for more recent activity like the last week or two.
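For reference, a CloudTrail management event record looks roughly like this. This is a trimmed sketch with made-up values, showing the fields mentioned above:

```json
{
  "eventVersion": "1.08",
  "userIdentity": {
    "type": "IAMUser",
    "arn": "arn:aws:iam::111111111111:user/alice",
    "accountId": "111111111111"
  },
  "eventTime": "2023-06-01T12:34:56Z",
  "eventSource": "ec2.amazonaws.com",
  "eventName": "TerminateInstances",
  "awsRegion": "eu-west-1",
  "sourceIPAddress": "203.0.113.10",
  "userAgent": "aws-cli/2.11.6",
  "requestParameters": {"instancesSet": {"items": [{"instanceId": "i-0123456789abcdef0"}]}},
  "recipientAccountId": "111111111111"
}
```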
Luciano: Okay, that makes sense. So I suppose the main difference is that with Athena, you can query directly into S3 as long as you're storing structured files, which CloudTrail gives you anyway since it logs in JSON. That sounds pretty interesting. How do we get started?
Eoin: Well, there is an easy way, not necessarily the best in the long run. From the AWS console, if you go into CloudTrail and you've got a trail set up already logging to an S3 bucket, there's a button that says create Athena table. When you click on that button, it immediately shows you what they call a CTAS statement in SQL, create table as select. So it's basically creating a table by selecting data from this S3 bucket. And in this big SQL DDL statement, it's saying which fields in the CloudTrail JSON map to table columns in the virtual table that it's going to create for you.
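The DDL the console generates is along these lines. This is a trimmed sketch (the bucket name is a placeholder and many columns are omitted), using the CloudTrail serde classes that Athena documents for this format:

```sql
CREATE EXTERNAL TABLE cloudtrail_logs (
  eventversion STRING,
  useridentity STRUCT<type: STRING, principalid: STRING, arn: STRING, accountid: STRING>,
  eventtime STRING,
  eventsource STRING,
  eventname STRING,
  awsregion STRING,
  sourceipaddress STRING,
  useragent STRING,
  requestparameters STRING,
  responseelements STRING,
  recipientaccountid STRING
)
-- Serde and input format purpose-built for the CloudTrail JSON layout
ROW FORMAT SERDE 'com.amazon.emr.hive.serde.CloudTrailSerde'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://my-trail-bucket/AWSLogs/111111111111/CloudTrail/';
```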
Luciano: Right. That's also a bit different from what we are used to with CloudWatch Logs Insights, because with Logs Insights you don't have to worry about creating a table; you just query. But I suppose that's something that will come up when we start to talk about more details on how Athena works. So right now I have another question. This seems simple enough for a single account: you just go to CloudTrail, click a button to create the table definition, and then you're ready to query. But what if you have multiple accounts? Because we often advise companies to create their own landing zone and structure their deployments across multiple accounts, which seems like a more production-oriented setup. What do we do in that case?
Eoin: That's all solved basically in how you set up the trail itself. So you don't really have to do anything additional in Athena. You can either set up accounts to log to a central bucket. So rather than every trail logging to a separate bucket, you can have a shared bucket with the right permissions and they all log into a different prefix. But there's actually an easier way still, which is to set up an organization-wide trail. That way you do it from your management account or from a delegated administrator account. You say, create a trail for my whole organization, and it will capture all the events from all accounts and put them into one bucket for you. That's the way we typically do it at fourTheorem. It's easy then to set up Athena to query from that one single bucket, and then you've got insights and you can do queries across all accounts. Like find out, okay, who did a terminate instance on EC2 in any account today?
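A sketch of that kind of organization-wide question, assuming a table named `cloudtrail_logs` like the one the console generates:

```sql
-- Who terminated an EC2 instance, in any account, since the start of the day?
SELECT useridentity.arn, recipientaccountid, awsregion, eventtime
FROM cloudtrail_logs
WHERE eventsource = 'ec2.amazonaws.com'
  AND eventname = 'TerminateInstances'
  AND eventtime >= '2023-06-01T00:00:00Z'  -- placeholder date
ORDER BY eventtime DESC
LIMIT 100;
```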
Luciano: Yeah, I suppose the account at that point is just another field that you can query on.
Eoin: Yep.
Luciano: At the beginning, we also mentioned Glue. So how does Glue come into the picture with this setup?
Eoin: So when you create a table in Athena using this CTAS statement we talked about, it's actually creating a Glue data catalog table under the hood. If you haven't looked at Glue or the Glue data catalog: Glue has a number of different features, but we're just going to talk about the data catalog in this context. It's basically a metastore for your tables. You can use those metastore tables outside of Athena too, but every time you're using Athena, you have to have one of these.
It comes from the Apache Hive ecosystem, where the big data world had this ability to define virtual tables for data that was stored on a file system or an object store. So when you do this create table as, and you're mapping the columns to the fields in your data, this is basically just creating a data catalog table in Glue. So once you do this create table as select statement, you can go over to Glue and you can see the table appearing in your Glue console as well. This is just a schema representation of your table. There's no data copied into Glue or anything. It's just really a schema definition and it allows you to map SQL concepts to an underlying data store in S3. Now, instead of actually using that create table approach, you can just go and create the table in the data catalog directly. This means we can create it visually in the console or we can use CloudFormation or Terraform, et cetera.
Luciano: Right. That makes a lot of sense. I imagine that it also allows for a lot of optimizations, because you are going to be writing SQL statements and the system underneath needs to understand which files can actually have the data that you're looking for, and to scan the files in a smart way rather than always reading everything, which might be very expensive and time consuming. So I suppose that's why we go through this extra step of setting up the table definition and recording all of that into a catalog. We use that setup a lot for data analytics at fourTheorem. And I know that you can get good performance if you store the data in a very specific way. So what kind of suggestions can we give to people to try to get the best performance with this kind of setup?
Eoin: Performance you get with CloudTrail queries with Athena really is going to vary hugely depending on the amount of data you're scanning. So it could be a few seconds or it could be 15 minutes. It depends on the query and how you optimize the Glue catalog table. Sometimes Athena will have to scan all of your data depending on what the query is. And that's clearly not optimal. So there's a couple of things you can do to make it faster.
One is by setting up partitioning. This is a typical optimization when you're using Glue data catalogs in Athena. It allows you to have different partitions for specific fields. For example, you could partition the logs by day. And if you have a day in your where clause, then Athena only has to scan that limited set of data because it's like an index basically. Creating partitions, there's actually a few different ways of doing this.
You can use the Glue API to create a partition every time you have a new one, like every day. Or there's an alter table add partition command you can run through the Athena SQL interface. There's also another one, which is MSCK repair table. That will tell Athena to go off, scan, and find partitions automatically, which is similar to what a Glue crawler does. So in the Glue data catalog world, there's also the concept of a crawler, which is an automated process that scans your S3 objects and finds partitions.
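The two SQL variants look roughly like this (table name, partition keys, and bucket layout are illustrative). One caveat worth knowing: MSCK REPAIR TABLE only discovers Hive-style `key=value` prefixes, which CloudTrail's default S3 layout does not use, so the explicit alter table form tends to be the one that applies here:

```sql
-- Register one new partition explicitly
ALTER TABLE cloudtrail_logs ADD IF NOT EXISTS
PARTITION (region = 'eu-west-1', year = '2023', month = '06', day = '01')
LOCATION 's3://my-trail-bucket/AWSLogs/111111111111/CloudTrail/eu-west-1/2023/06/01/';

-- Or ask Athena to scan S3 and discover partitions itself
-- (requires Hive-style key=value prefixes in the object paths)
MSCK REPAIR TABLE cloudtrail_logs;
```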
So you can use this to create the table in the first place, actually. It can try and derive the schema for you based on the columns in your data. But it can also identify partitions and find new partitions once you've created that table. For CloudTrail, there's another feature in Athena which you can use called partition projection. And this is quite nice. And this is the one I usually set people up with the first time.
Because you don't have to add partitions as data arrives. Instead, when you create the table, you just specify ranges of values that are possible for certain fields. So you mentioned that you might want to query on account ID, right? And this is just another field. But if you know all of your account IDs, you can tell Athena in advance, this is my range of account IDs. And then it doesn't have to go to Glue data catalog to find out what the partitions are.
It can basically project those values and use them to build its query in an optimal way. And it can do the same thing for date values or for the region field, because we're all generally working with a fixed set of supported regions. Partition projection would be my first recommendation. It makes things easier, and once you set it up, it just works. Another thing you can do for optimization is use the limit clause. That's always a good one to reduce the volume of data returned. One last optimization, if you want it to be really performant: JSON is the slowest data format for Athena by far, significantly slower than all the other options, even CSV. What you can do is build a pipeline to convert that data into Parquet, which would be the most optimal format. You can do that conversion using Lambda, EMR, or a Glue job. So that's something you could set up if you really want fast query performance.
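Partition projection is configured through table properties. A trimmed sketch of what that looks like for an organization trail, with placeholder account IDs, regions, bucket name, and org ID:

```sql
CREATE EXTERNAL TABLE cloudtrail_logs_projected (
  eventtime STRING,
  eventname STRING,
  eventsource STRING,
  useridentity STRUCT<arn: STRING>
)
PARTITIONED BY (account STRING, region STRING, dt STRING)
ROW FORMAT SERDE 'com.amazon.emr.hive.serde.CloudTrailSerde'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://my-trail-bucket/AWSLogs/o-exampleorgid/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  -- Known, fixed set of account IDs in the organization
  'projection.account.type' = 'enum',
  'projection.account.values' = '111111111111,222222222222',
  -- Fixed set of regions in use
  'projection.region.type' = 'enum',
  'projection.region.values' = 'eu-west-1,us-east-1',
  -- Dates projected as a range, no partition registration needed
  'projection.dt.type' = 'date',
  'projection.dt.range' = '2022/01/01,NOW',
  'projection.dt.format' = 'yyyy/MM/dd',
  -- Maps partition values onto CloudTrail's S3 prefix layout
  'storage.location.template' =
    's3://my-trail-bucket/AWSLogs/o-exampleorgid/${account}/CloudTrail/${region}/${dt}'
);
```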
Luciano: That's really cool. Okay. So far, we have been talking about setting things up in the console and creating resources pretty much by clicking around, so ClickOps, and then doing SQL queries. Is there any recommendation for doing this in a more production-ready way, maybe using infrastructure as code, so that we end up with something that is easily reproducible between accounts or customers?
Eoin: You can, and I would prefer to do it that way. I think Athena is nice because it allows analysts to create things on the fly, ephemeral resources and tables, without having to worry about infrastructure as code, and that's a really valid workflow. But if you want something like this, CloudTrail queries for your whole organization, it makes sense to put it into infrastructure as code. The process of creating tables in CloudFormation or Terraform is a little bit complex and it's not very well documented. Most documentation ends up pointing to the create table method, which isn't really proper declarative infrastructure as code. So in CloudFormation, you need to specify the table columns and serialization parameters using YAML or JSON, and the syntax is a little bit strange. We've actually figured this out for CloudTrail, so you don't have to. There is code in a gist, we'll link it in the show notes, and you can give this a try and let us know how you get on.
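To give a flavour of that CloudFormation syntax, here is a trimmed, illustrative sketch of a Glue table resource for CloudTrail (database name, bucket, and column list are placeholders; the full working version is in the gist linked in the show notes):

```yaml
Resources:
  CloudTrailTable:
    Type: AWS::Glue::Table
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseName: default
      TableInput:
        Name: cloudtrail_logs
        TableType: EXTERNAL_TABLE
        StorageDescriptor:
          Location: s3://my-trail-bucket/AWSLogs/
          InputFormat: com.amazon.emr.cloudtrail.CloudTrailInputFormat
          OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
          SerdeInfo:
            SerializationLibrary: com.amazon.emr.hive.serde.CloudTrailSerde
          Columns:
            - Name: eventtime
              Type: string
            - Name: eventname
              Type: string
            - Name: useridentity
              Type: struct<arn:string,accountid:string>
```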
Luciano: It's always fun when you're defining resources and there is some kind of special syntax that you haven't seen before, which is unique for that particular type of resource. But I guess that's the reality with the cloud. There are so many resources that sometimes there are these kinds of exceptions and we just need to figure out how to deal with them. But in general, I think Athena looks really nice and really powerful from the way you are describing it. In some projects we saw that you can use it as a data source for QuickSight, and at that point you can create nice dashboards. So is that something that you would use together with CloudTrail to create some kind of visualization that, I don't know, will give you a very quick overview, like a single pane of glass that you can go to and have a feeling for what's going on in the set of accounts that you're monitoring?
Eoin: Exactly. This is a very powerful type of integration and it stops you from having to jump into the Athena console the whole time. If you've got data on account activity that you just want to keep an eye on all the time, this is a pretty nice integration to set up.
Luciano: That covers a very nice overview of how you can achieve a good level of observability and auditing across the accounts of your organization. If you set up this method, then you can have a centralized way of querying and understanding what's going on, and even build dashboards that you can just look at to see if there are things that you should be worried about. So I think for today, that's all we have to share. But if you have any other tips, maybe you have used different setups, different tools, or different ways of provisioning all this infrastructure, definitely share them with us. We are always looking for alternatives and for learning. Definitely looking forward to hearing more from you. And until then, we'll see you in the next episode.