AWS Bites Podcast

134. Eliminate the IAM User

Published 2024-11-01 - Listen on your favourite podcast player

In this episode, we discuss why IAM users and long-lived credentials are dangerous and should be avoided. We share war stories of compromised credentials and overprivileged access. We then explore solutions like centralizing IAM users, using tools like AWS Vault for temporary credentials, integrating with AWS SSO, and fully eliminating IAM users when possible.

AWS Bites is sponsored by fourTheorem, an Advanced AWS partner that works collaboratively with you and sets you up for long-term success on AWS. Find out more at fourtheorem.com.

Let's talk!

Do you agree with our opinions? Do you have interesting AWS questions you'd like us to chat about? Leave a comment on YouTube or connect with us on Twitter: @eoins, @loige.

Help us to make this transcription better! If you find an error, please submit a PR with your corrections.

Luciano: Hello, friends of AWS Bites podcast, the show where we share stories and hard-learned lessons from the cloud trenches. Today, we are here to sound the alarm on something very serious. You need to eliminate IAM users from your AWS accounts and fast. Seriously, if you are still using them, it's like an incident just waiting to happen. It's like handing your credit card details to a stranger online and hoping they will surprise you with a gift.

But I'll give you a spoiler. You won't like the gift that you will get. So just be very careful. Don't do that. And we have talked already about governance and landing zones in the past. But today, we want to focus specifically on why IAM users and long-lived credentials are one of the biggest foot guns that exist today in AWS. And hopefully, we'll also share some strategies on how you can start to get rid of them if you are still using them.

My name is Luciano, and today I'm joined by Conor, our new co-host, who brings tons of experience in managing AWS accounts. So I'm really excited. Let's get into it. AWS Bites is brought to you by fourTheorem. If you are looking for a partner to architect, develop, and modernize on AWS, give fourTheorem a call. Check us out at fourtheorem.com. So as I said, we already covered similar topics. But today, we have Conor, who brings lots of expertise and a fresh perspective. So maybe it's a good idea to start with Conor introducing himself, so people can know all the amazing things that he has done.

Conor: Sure. Long-time listener, first-time caller, I guess, to AWS Bites. So I joined fourTheorem about 18 months ago. I'm a senior infrastructure engineer. And I guess I've worked with a lot of startups over my career. Very interested in the security space in AWS and infrastructure as code and all things safety and security in the cloud, I guess. So I've had the privilege of joining a lot of excellent teams over the years. But a lot of times, I may have been the first security nerd or AWS practitioner to join the team. So I have often encountered very mature AWS accounts with lots of IAM users. And what we want to chat about today, I guess, is there's a lot of modern techniques and tooling that help us totally eliminate those IAM users, which have become a bit of a security nightmare or certainly a potential rake in the grass for a lot of organizations. So yeah, that's me.

Luciano: I can only add that you are one of the people I know with the most Terraform expertise. So I'm really lucky to have you as a colleague every time I have a question about Terraform. That's just to add a little bit more context. But yeah, going back to IAM users, I guess one main question that people might have is, what is really the problem, right? It seems like something that has been in AWS forever.

So why should it be a problem in the first place, right? And we said it is one of the biggest foot guns that you have today. The main reason for that is that when you use IAM users, generally, you are also using long-lived credentials. So you go to that specific user, maybe through the web UI, and you generate credentials for that user. And then God knows where you're going to copy-paste those credentials, store them somewhere forever, maybe forget about them.

And I don't know, they can end up in development machines. They can end up in servers. They can end up inside applications that somehow need to interact with AWS. And the problem with that is that they are generally clear text. So easy to exploit or exfiltrate or copy-paste. And the other problem is that generally they are very broadly scoped. So you can have permissions to do lots of things. Even if originally you didn't intend to do all these different things, maybe it was convenient to just give this particular user lots of open permissions.

And then whoever has access to the credentials inherits all of these permissions. So imagine, yeah, you might end up with somebody just spinning up the most expensive EC2 instance just because that particular user has permissions to do that. And the other problem is that those credentials rarely get rotated. There are ways that you can rotate the credentials, but that's generally something that people don't bother doing.

And I personally have seen this. It's funny that you can go to the web UI and see how long those credentials have existed. It's not rare to see credentials that have existed for something like 2000 days. And yeah, you can think of how many things could have gone wrong in so much time, how many people could have had opportunities to take these credentials and use them to do something. The other interesting case that we have seen a lot is that if you create IAM users for, I don't know, developers in your company, people will come and go in the company. And sometimes you don't have a very strict process for offboarding people from your AWS accounts. So I know of people who were able to access AWS accounts not just from their last company, but from two companies back, years after they had left. And still their credentials were totally valid and they could see everything in the account and do all kinds of actions that they could do as developers, even though they were not employed in that company anymore. So just be aware that these are just some of the risks. But I don't know, Conor, if you have anything else you would like to add.

Conor: Yeah, I guess it is one of those difficult challenges. You know, a lot of my history was in IT and onboarding and offboarding people from teams. So you tend to bias towards single sign on and role based access control and credentials that you can expire or at least reason about their status. And like you mentioned, IAM users are incredibly challenging to keep on top of, especially in larger organizations.

So just to add to your war stories there, I've seen it all, I guess. You know, the worst is when you find the root account that has an access key and secret access key generated and it's being used in a pipeline. But I've also seen this: front end engineer Bob has a large team and they're great and they're tenacious about getting work done. But eventually, you know, Bob moves on and suddenly all of the other front end engineers' development environments stop working.

It's typically because Bob has been really helpful and, you know, shared his IAM access keys with the team, maybe checked them into the repo for that app. So it's very difficult to manage that as, you know, a central cloud team or a practitioner that's trying to keep on top of the AWS environment. So I guess it's not usually somebody's fault. You know, it's up to the platform team or the security team to try and put the correct guardrails in place so that it's easy to do the right thing and very difficult to do the wrong thing.

And I guess later on in the episode, we're going to get into a lot of the modern techniques for doing things the right way that just make it much more difficult to create these kinds of edge cases in the first place. But yeah, IAM user credentials will end up on developer machines. They'll end up in Dropbox, Google Drive, Slack messages. You know, you'll have users that were intended for human access that end up in pipelines or production systems.

So I guess our goal today is to tell everybody that there is a better way and you can get to the point that you have no IAM users at all. Depending on your environment, that's going to be a simple or a very arduous journey. But I guess we want to help you to get there. So I guess the reason you end up with these in the first place is because somebody wants programmatic access to AWS, right? Whether it's from a pipeline that's running on an EC2 instance or ECS Fargate, or more commonly, the development flow where you want to interact with S3 or CloudFormation or you want to run cdk deploy from your laptop. You need programmatic access to the AWS account. All requests to AWS have to be signed with the SigV4 algorithm. So I think what we want to focus on initially is how does it work? If I run aws s3 ls from the CLI, what goes on under the hood there in the SDK for it to try and find credentials and authenticate itself against AWS APIs? Do you want to chat about your experience with that, Luciano?

Luciano: Yeah, that's a very interesting topic. I actually really like that kind of stuff. You mentioned the SigV4 algorithm. We're not going to go into detail on that one today, but we'll post a link in the show notes if you're curious. What we want to focus a bit more on is that, if you're using the SDK or the CLI, there needs to be a process to locate credentials in that execution environment. And there are specific pages that you can find in the AWS docs, depending on which SDK you're using. For the JavaScript one, for instance, there is a page that details exactly all the steps that are performed to try to find a viable set of credentials when you create a client using the SDK.

And there is a similar page for the CLI, so we'll make sure to share the links as well. But just to summarize, what are the different steps? This should give you an idea of all the different things that can provide credentials to a given environment. So the first thing is that as you create a client, you are effectively instantiating a class. Let's imagine in JavaScript, you say, I don't know, new S3Client, something like that.

You can provide options there to the constructor. And some of the options there are specifically put in place so that you can provide inline credentials. So if you do that, this is the first thing that the client is going to look for. And if you provide the credentials there, those credentials will be used. But those credentials are not mandatory. So what happens if you don't provide them?

The next step is that if credentials are not there in the constructor options, the client is going to look for specific environment variables. And I'm sure you have seen AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and that kind of environment variable. So this is the next thing that the client is going to look for. So if in the current execution environment, you have those variables set, those will be used for your client to interact with AWS.

Then the next step is that if you don't have those environment variables, then it's going to look for shared credentials files. You can imagine similar to when you configure your CLI that you might have credentials files that way. And then you can have in specific environments, for instance, if you're running your code in ECS, ECS has mechanisms to provide credentials through, for instance, roles.

You can have short-lived credentials. So if you have done all of that and configured it correctly, the SDK can actually load these credentials. So that would be another step. And we can keep going. There are other steps. You can use credential processes, which is just a mechanism where you can have custom ways to say, I can call a specific process. And that process is responsible for somehow fetching credentials that I can use.

And this is generally when you want to integrate with, I don't know, third-party tools. Maybe you have some kind of vault or secret manager and you store configuration there. That configuration can contain your AWS credentials and you can create that mechanism to load credentials from a custom place. And then finally, another example: if you're using EC2, it has a concept of instance metadata. So that's basically a mechanism where if you provide a role to an EC2 instance, that role can effectively have permissions attached to it.

And the way that your client inside that EC2 inherits those permissions is by accessing this metadata server, let's call it, it's kind of a local server that if you call it through an HTTP endpoint, it's going to give you temporary credentials that are scoped specifically with the role attached to the instance. And I guess some of these ideas are better than others because they will give you shorter term credentials.

But because the most common, or at least historically the most used, option is to just copy-paste some credentials into the environment, people tend to create IAM users, copy-paste credentials and use them even for programmatic access on a kind of server process or something that should have been managed differently. So that's, I guess, trying to shed some light on the fact that, yeah, there are different ways to provide credentials and it's kind of a step-by-step process. If the first step fails, it's going to look at the second step and so on. So it's important to also understand the priority order that the SDK or the CLI follows. So again, check the links in the description if you're curious to find out exactly from the docs what the different steps are. So the question now could be, okay, I kind of get the point that I shouldn't use IAM users, but if I already have them, what are some of the mitigation strategies that I can start to use today?
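To make the lookup order described above more concrete, here is a minimal Python sketch of the first three steps only: inline options, environment variables, then the shared credentials file. The function and type names (resolve_credentials, Credentials) are hypothetical and not part of any AWS SDK. A real SDK chain has more steps (container credentials, credential_process, instance metadata) and the exact order varies by SDK, so treat this as an illustration rather than a reference.

```python
import configparser
import os
from pathlib import Path
from typing import Mapping, NamedTuple, Optional


class Credentials(NamedTuple):
    access_key_id: str
    secret_access_key: str
    session_token: Optional[str]
    source: str  # which step of the chain produced these credentials


def resolve_credentials(
    explicit: Optional[Mapping[str, str]] = None,
    env: Optional[Mapping[str, str]] = None,
    credentials_file: Optional[Path] = None,
    profile: str = "default",
) -> Optional[Credentials]:
    """Hypothetical, simplified chain: inline options, environment, shared file."""
    # Step 1: credentials passed inline to the client constructor win.
    if explicit is not None:
        return Credentials(
            explicit["access_key_id"],
            explicit["secret_access_key"],
            explicit.get("session_token"),
            "explicit",
        )

    # Step 2: the standard AWS environment variables.
    env = os.environ if env is None else env
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return Credentials(
            env["AWS_ACCESS_KEY_ID"],
            env["AWS_SECRET_ACCESS_KEY"],
            env.get("AWS_SESSION_TOKEN"),
            "environment",
        )

    # Step 3: the shared credentials file (INI format, usually ~/.aws/credentials).
    path = credentials_file or Path.home() / ".aws" / "credentials"
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(path)  # a missing file is silently ignored
    if parser.has_section(profile):
        section = parser[profile]
        if "aws_access_key_id" in section and "aws_secret_access_key" in section:
            return Credentials(
                section["aws_access_key_id"],
                section["aws_secret_access_key"],
                section.get("aws_session_token"),
                "shared-credentials-file",
            )

    # A real SDK would continue with container credentials, credential_process,
    # the EC2 instance metadata service, and so on.
    return None
```

The source field is only there to show which step matched. It makes the "first match wins" behaviour easy to see when more than one source is configured at once.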

Conor: Yeah. So I guess the community kind of arrived at a lot of interesting patterns. I was trying to find the article this morning, but it seems to have been removed, probably because there are better strategies available now. But I think it was Coinbase, back in 2017, that had a great article on their engineering blog about the Bastion IAM account. Bastion might be a term you'd more typically hear for an SSH jump host or something.

People would usually mention Bastion in that context, I guess. But what Coinbase had established, and it was a pattern that I saw used a lot across the industry, was they would elect an account. Let's call it the Bastion account. Maybe it was the management account of the organization. It didn't really matter. And what you would do is create your concrete IAM users in that account. So we'd create a user for Conor and a user for Luciano.

Straight away, that was a big win, right? Because you were at least centralizing the IAM user. And when you had access to maybe dozens of accounts, we still had one Conor and one Luciano instead of dozens of each. And so the pattern that was established there was to try and enable role-based access control. So typically then you'd create maybe an admin or a power user or a view-only role in each of the accounts that we operate in.

And then what they would do is they would set strict conditions in the trust relationships of those roles. And that might say, the only person who can assume me is Luciano or Conor from this Bastion account. And you could attach other STS conditions like a multi-factor token must be present and must be valid. So that was one of the ways that practitioners tried to create this kind of role-based access control, centralize it through a single IAM user.
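As a rough illustration of the trust relationship described above, this Python sketch builds a role trust policy that only lets named IAM users from the Bastion account assume the role, and only when MFA is present. The account ID and user names are made-up placeholders and build_trust_policy is a hypothetical helper. The policy elements used (sts:AssumeRole and the aws:MultiFactorAuthPresent condition key) are standard IAM.

```python
import json
from typing import Dict, List


def build_trust_policy(bastion_account_id: str, user_names: List[str]) -> Dict:
    """Trust policy for a role in a target account (hypothetical helper)."""
    # Only these concrete IAM users in the Bastion account may assume the role.
    principals = [
        f"arn:aws:iam::{bastion_account_id}:user/{name}" for name in user_names
    ]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": principals},
                "Action": "sts:AssumeRole",
                # Require that the caller authenticated with an MFA token.
                "Condition": {"Bool": {"aws:MultiFactorAuthPresent": "true"}},
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder account ID and user names, for illustration only.
    policy = build_trust_policy("111111111111", ["conor", "luciano"])
    print(json.dumps(policy, indent=2))
```

Generating the document from a function rather than hand-editing JSON per account is one way to keep the same trust rules consistent across many target accounts.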

And at least then we had a sane access pattern, easy to onboard users, easy revocation. So that was kind of a battle-tested pattern. And then on the client side, I guess, or on your developer machine, you've still got that plain text long-lived credential. Okay, so how do we protect that? So another excellent tool that people might recognize was by 99designs, and that was a tool called AWS Vault.

And I guess the innovation there was they allowed you to use whatever keychain you had on your system, whether it was on Linux or even the macOS keychain. It would essentially escrow the access key and secret access key in the keychain. And then instead of directly exposing those credentials to the SDK or the CLI, like we just spoke about, it would either make some sort of STS call to get temporary credentials, or it would perform the STS AssumeRole operation to assume that concrete role in the target account, right?

Let's say the admin role in my development account. And so now you had a scenario where your access keys were protected in the keychain. You were only ever retrieving temporary credentials and exposing them to the runtime, whether it was the SDK or the CLI, and a lot of these tools actually injected them as environment variables into the shell. There are multiple reasons for that. One reason is that all of the SDKs understand environment variables, as we just covered.

But also they have one of the highest precedences in the credential chain. So more often than not, that's the behavior you wanted when you were running locally. So that was kind of the mitigation strategy that the community arrived at. It was a very common pattern. And I guess I'm showing my bias here, as I've always worked at smaller startups and organizations. Obviously, at larger enterprises, you'd have your SAML identity provider, which would provide a similar entry point to the system. You would still assume roles in target accounts. So that's one way it worked. And you can still operate that methodology. It has a good few moving parts. You need to really understand trust relationships and multi-account strategy. And you've got to have all your infrastructure as code tooling in place to make that a feasible pattern. But there are modern alternatives today. Finally, spoiler alert.
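The environment injection step described above can be sketched in a few lines of Python. This is not how AWS Vault is implemented. It only illustrates the idea of handing short-lived credentials to a child process through the standard AWS environment variables. The helper names are hypothetical, and the input dictionary is assumed to have the same keys as the Credentials object returned by STS (AccessKeyId, SecretAccessKey, SessionToken). How those credentials were obtained (keychain plus STS call) is deliberately left out.

```python
import os
import subprocess
from typing import Dict, List, Mapping


def env_with_temporary_credentials(
    base_env: Mapping[str, str], sts_credentials: Mapping[str, str]
) -> Dict[str, str]:
    """Return a copy of base_env with temporary AWS credentials added."""
    env = dict(base_env)  # copy, so the parent environment is never modified
    env["AWS_ACCESS_KEY_ID"] = sts_credentials["AccessKeyId"]
    env["AWS_SECRET_ACCESS_KEY"] = sts_credentials["SecretAccessKey"]
    env["AWS_SESSION_TOKEN"] = sts_credentials["SessionToken"]
    return env


def run_with_credentials(command: List[str], sts_credentials: Mapping[str, str]) -> int:
    """Run a command with the credentials visible only to that child process."""
    env = env_with_temporary_credentials(os.environ, sts_credentials)
    return subprocess.run(command, env=env, check=False).returncode
```

Because the variables are only set for the child process, nothing long-lived is written to disk or left in the parent shell, and the credentials stop working once the session expires.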

Luciano: I'm really curious to learn what those alternatives look like. Because yeah, I'm sure that there are more scalable ways today to deal with this, especially if you start to have multiple accounts, lots of potential users, and different kinds of access levels that you need to enforce. And you might have lots of people coming and going in the organization as well. So yes, what are those better ways to deal with user access in general and permissions?

Conor: Yeah, so I think there have been previous episodes on creating a landing zone and some of the infrastructure as code tooling around that. But a core component of all of those modern landing zone setups is AWS IAM Identity Center, and more so its integration with AWS Organizations. So what the Identity Center service lets you do is essentially implement the pattern we spoke about. Identity Center has the concept of permission sets, and permission sets are almost like a wrapper for a role, and then it can have targets.

So the target could be another AWS account or an organizational unit. But essentially, we now have single sign-on. Okay, Identity Center lets us create an individual identity. You can use Identity Center's built-in identity store. It's a great way to start. If you're a small organization, you may not necessarily have an IdP such as Azure AD. And if you just want a sane way to manage users in your AWS organization, you can use the built-in identity store.

So you then have your kind of standard, you know, groups. You could have a group for certain application teams, a security group, a platform team. And you can map the groups and permission sets to accounts. And then Identity Center and its integration with Organizations does the rest of the heavy lifting for you. So if you have 10 accounts or 200 accounts, the admin role will be created automatically by the permission set.

And then the appropriate users will be able to access it. Okay. So Identity Center really does all of the heavy lifting for you here as regards access to a large AWS estate. The other option you have, which is convenient, is that you can integrate most identity providers. Okay. So at fourTheorem, we have Google Workspace integrated. You can have manual user management, or you can even use something like SCIM to get fully automated, just-in-time user provisioning into Identity Center.

So it's quite flexible and, you know, nice to be able to integrate it with an existing identity provider. So that then gives you a lovely, you know, SSO landing page that you might be familiar with, and it'll present all of the AWS accounts and all of the roles that you have access to. So it solves the console access problem quite well. And then I guess, you know, the next step is getting your programmatic access.

So one thing that the Identity Center portal will let you do is grab temporary credentials in the context of any role that you have access to. So it's like the Identity Center console has essentially performed the, you know, STS AssumeRole or AssumeRoleWithWebIdentity operation for you. And it's going to give you back an access key, a secret key, and a session token. And if you want, you can just paste that in a shell or, you know, configure it in an AWS credentials file, and you have your programmatic access.

Now, that is fine for, you know, maybe a tiny investigation or access to an account you don't often use. But the tool that we recommend to a lot of customers, and that we use internally ourselves, is a tool called Granted by Common Fate. And the really nice thing about that tool is that you sign in, you perform the single sign-on flow, and it's able to enumerate all of the accounts and roles you have access to.

It's able to automatically generate the config file. And now you have this tool that you can run in your shell, and it can inject credentials into the shell for any role in any account that you have access to. So we find it, you know, a really great way to access accounts. You're only ever getting back a temporary credential. And it just, in an environment where you have lots of accounts, which we often encourage, it just makes the, you know, the user experience really nice.

So what else did we want to chat about there? If you do that, this is what we think great looks like, I guess. It's a lot easier to create this in a greenfield environment. But we would then recommend, you know, creating an SCP (service control policy) that blocks the creation of IAM users. That's a common security best practice now, I guess, because once you've gotten to this point, you don't want people to be able to go off piste. You know, you don't want them to be able to do the bad thing. Like we said, we want to make the good thing easy. So we can just block the creation of IAM users. Now you've got to use your single point of entry into the AWS organization to do what you wanted to do. Yeah, that sounds easy.
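One possible shape for the SCP mentioned above, sketched in Python so the document can be generated and checked. The helper name and the Sid are hypothetical. The actions iam:CreateUser and iam:CreateAccessKey are real IAM actions. Whether to also deny related actions (for example login profiles), or to exempt a break-glass role, is a decision for your own organization and is not covered here.

```python
import json
from typing import Dict


def build_deny_iam_user_scp() -> Dict:
    """A minimal SCP that denies creating IAM users and access keys."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyIamUserAndAccessKeyCreation",  # hypothetical name
                "Effect": "Deny",
                "Action": ["iam:CreateUser", "iam:CreateAccessKey"],
                "Resource": "*",
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_deny_iam_user_scp(), indent=2))
```

Keep in mind that SCPs do not apply to the organization's management account, so IAM user creation there has to be controlled by other means.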

Luciano: At least when you are starting from scratch and you have total freedom, you can set things up in a way that looks modern and good from the beginning. But I would imagine that if you already have, I'm just going to say, quite a legacy setup, maybe with lots of stuff running there that you have been using for years, how do you start to enforce this kind of thing? I'm sure that there are cases where you cannot just get rid of all IAM users from one day to the next. You need to, I don't know, probably take a few shortcuts or a few half-baked solutions that maybe are still better than nothing. But yeah, you're probably going to get to a great place incrementally, and it's not going to happen from one day to the next. So do you have any recommendations for people that might be more in this camp, where it's not going to be easy to change things, but they can still have opportunities to make things better?

Conor: Yeah, absolutely. And it is a thankless job sometimes, you know, to say, look, we've got a glaring security issue here. We want to pay it down. So I think it's just important to have buy-in from the team and the organization, you know, get it on the Kanban board, get it on the backlog and start to approach it systematically because it is worth doing. You're closing off one of the biggest attack vectors you have in a modern cloud environment, I guess.

So how do you start, right? The two problems we usually have to solve are the human access problem, which tends to be simpler, and then the machine access problem. Okay, like we spoke about in the intro, you're probably going to have a smattering of IAM users that represent humans. You might have ones that represent machines. And if you're really unlucky, you're going to have the ones that were intended for humans that are now being used on machines.

So how do we approach it? The IAM credential report is super helpful. We'll put a link to that in the show notes. You can generate this using the CLI or from the IAM console. That will give you a CSV export, which is a great place to start. You know, you can sort it by access key age, or you can start to examine some of the policies that are attached to the users. And you can decide to organize them by danger, which sometimes makes sense, right?
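A minimal sketch of the "sort by access key age" idea, assuming the CSV layout of the IAM credential report, which has a user column plus access_key_1_active, access_key_1_last_rotated and the matching access_key_2 columns, with N/A where a key does not exist. The function name stale_access_keys and the idea of passing now explicitly (which keeps the function easy to test) are choices made for this example.

```python
import csv
import io
from datetime import datetime, timezone
from typing import List, Tuple


def stale_access_keys(
    report_csv: str, max_age_days: int, now: datetime
) -> List[Tuple[str, int]]:
    """Return (user, key_age_in_days) for active keys older than max_age_days."""
    stale = []
    for row in csv.DictReader(io.StringIO(report_csv)):
        for slot in ("1", "2"):  # each IAM user can have up to two access keys
            if row.get(f"access_key_{slot}_active") != "true":
                continue
            rotated = row.get(f"access_key_{slot}_last_rotated") or "N/A"
            if rotated == "N/A":
                continue
            # Timestamps are assumed to be ISO 8601; normalise a trailing "Z"
            # so that older Python versions can parse it too.
            rotated_at = datetime.fromisoformat(rotated.replace("Z", "+00:00"))
            age_days = (now - rotated_at).days
            if age_days > max_age_days:
                stale.append((row["user"], age_days))
    # Oldest keys first: these are usually the most urgent to review.
    return sorted(stale, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # "credential-report.csv" is a placeholder path for a report you exported.
    with open("credential-report.csv", encoding="utf-8") as handle:
        now = datetime.now(timezone.utc)
        for user, age in stale_access_keys(handle.read(), 90, now):
            print(f"{user}: active access key is {age} days old")
```

The 90-day threshold in the usage example is arbitrary. The point is to get a ranked list you can work through, starting with the oldest keys.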

Now, often that first pass will be a big win. You're going to find people that left the company months ago or years ago. You're going to find access keys that haven't been used in some window. You're going to find machine users that are for services you no longer operate. And very quickly, you might be able to shed half of the IAM users. All right. Now, I would also recommend implementing Identity Center permission sets. Onboard yourself and maybe other people on the platform team or the security team.

Start to kick the tires on that and introduce it for a trusted group at the organization. And then you can start to reimplement your role-based access control in Identity Center, right? So you now have a permission set that represents maybe what the DevOps IAM user does. That's a typical thing. You might find an overprivileged IAM user that was used by a certain group or a group of individuals. So you can start to take little slices off of the problem like that, okay?

You'll often find then, you know, an IAM user called S3 backup. It's quite obvious what it does, but you might find that it has the AmazonS3FullAccess managed policy attached. Now, you know, we could replace that with an IAM role and just move on. But we know deep within our soul, the correct thing to do here is to also evaluate that workload and look at what access it actually needs.

So you might be able to talk to the team responsible. You might be able to use IAM Access Analyzer, and you'll be able to create a really fine-grained permission for that role, right? It probably only accesses one bucket. It might only access a sub-key within the bucket, and you can give it a very strict, you know, PutObject policy. And with every step you take, you're incrementally improving your posture, okay?
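As an example of what such a strict policy could look like, this sketch builds an identity policy that only allows s3:PutObject under one key prefix of one bucket. The bucket name, the prefix and the helper name are placeholders. A real workload may need more than this (for example multipart upload cleanup or KMS permissions), which is exactly what the evaluation step above should uncover.

```python
import json
from typing import Dict


def build_backup_writer_policy(bucket: str, prefix: str) -> Dict:
    """Allow writing objects under a single prefix of a single bucket."""
    prefix = prefix.strip("/")  # tolerate "backups", "/backups" or "backups/"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "WriteBackupsUnderPrefixOnly",  # hypothetical name
                "Effect": "Allow",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder bucket and prefix, for illustration only.
    print(json.dumps(build_backup_writer_policy("example-backups", "nightly"), indent=2))
```

Compared with AmazonS3FullAccess, a leaked credential with this policy can only add objects under that prefix. It cannot read, list or delete anything.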

So the scream test is going to be your friend. You're going to have a lot of rinse and repeat on this process. But I have done it. It is painful. But it can be done. And it's definitely one of the best things you can do for your organization from a security standpoint. The other aspect then is, well, what if I have a different cloud? And what I mean by that usually is you could have CircleCI, GitHub Actions, HashiCorp Cloud.

Often, these IAM users are created because you need access from a system that is not AWS, where we can't rely on assuming a role or something like that. So that was a very common source of IAM users, you know, some sort of pipeline user. Thankfully, in the last couple of years, OIDC identity providers have solved a lot of this. I believe there are previous episodes where they're covered, but something like GitHub Actions will now give you, you know, fine-grained access to a role, scoped to a particular repo or even to a particular operation in a repo, like the repo being tagged, a release being created, or a certain branch name being pushed to.
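To illustrate the kind of scoping described above, here is a sketch of a role trust policy for GitHub Actions OIDC, limited to one repository and one git ref. The account ID, repository name and helper name are placeholders. It assumes an IAM OIDC identity provider for token.actions.githubusercontent.com already exists in the account, and it uses the subject format repo:OWNER/REPO:ref:REF that GitHub documents for branch and tag pushes.

```python
import json
from typing import Dict

GITHUB_OIDC_HOST = "token.actions.githubusercontent.com"


def build_github_oidc_trust_policy(
    account_id: str, repo: str, ref: str = "refs/heads/main"
) -> Dict:
    """Trust policy letting workflows from one repo and ref assume a role."""
    provider_arn = f"arn:aws:iam::{account_id}:oidc-provider/{GITHUB_OIDC_HOST}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Federated": provider_arn},
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        # Token must be issued for AWS STS ...
                        f"{GITHUB_OIDC_HOST}:aud": "sts.amazonaws.com",
                        # ... and come from this exact repository and ref.
                        f"{GITHUB_OIDC_HOST}:sub": f"repo:{repo}:ref:{ref}",
                    }
                },
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder account ID and repository, for illustration only.
    policy = build_github_oidc_trust_policy("111111111111", "example-org/example-repo")
    print(json.dumps(policy, indent=2))
```

With a trust policy like this, the pipeline never holds a long-lived secret. Each workflow run exchanges its GitHub-issued token for short-lived role credentials.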

So you can get extremely fine-grained access from GitHub Actions into AWS now as well. So that's kind of the tried and true method to replace those cloud-to-cloud IAM users. And I guess finally, you know, we can strive for perfection, but we're probably going to have a handful of IAM users when we're finished. Could be some legacy system. There could be a user with a customer or on a server somewhere in the wild that we don't control anymore.

Unfortunately, you might end up with a handful of IAM users. What we recommend there is the SCP to block creation of new ones so we can at least plug the leak. And then we can document those users. We could import them into Terraform or CloudFormation so that we have them in infrastructure as code. That makes it easy to kind of ring-fence the problem. We could introduce rotation, again, if we have access to the system where they're used. And you could even implement, you know, EventBridge notifications or AWS Config rules. You can just really scrutinize those IAM users and make sure we know whenever they do something interesting. So I think that's all I have on the topic. It's a challenge, but I would say it's definitely worth doing. It's a game changer for your security posture, particularly in AWS environments that have been around for a couple of years.

Luciano: Thank you for all these amazing suggestions. I especially like the last one about importing the users, making sure that they're well documented, and creating SCPs to prevent new users. And yeah, rotating credentials is something that, most often at least, is done manually. So you probably want to have some kind of playbook where you document all the steps very well. At least you can do it in a repeatable way and have a process that reminds you to do it often enough. So with that, I think it's a wrap for today. I hope that we have convinced you that IAM users are not a good practice. I hope that we have given you enough suggestions and motivation to get started on getting rid of them. And hopefully you enjoyed this episode. If you did, please give us feedback. As always, thumbs up, share, like, subscribe, and all the usual things. And I want to conclude with a big thank you to Conor for bringing us a fresh perspective and for being our new co-host for this episode. So until next time, thank you very much. And we'll see you in the next episode.