AWS Bites Podcast

131. What do you do about CloudFormation Drift?

Published 2024-09-20 - Listen on your favourite podcast player

In this episode, we discuss the concept of CloudFormation drift, what causes it, how to detect it, and strategies for resolving it. We explain that drift happens when the actual state of resources diverges from what is defined in the CloudFormation templates. Common causes include manual changes, third party tools, mixing IaC solutions, and automation. We then cover built-in drift detection in CloudFormation and integrating it with alarms. Finally, we suggest approaches for reconciling drift like change sets, deletion protection, and bringing up parallel stacks.

This episode of AWS Bites is brought to you by fourTheorem. Need to modernize your infrastructure or build scalable cloud solutions? fourTheorem brings the experience to build high-quality, maintainable, and scalable cloud applications that evolve with your business needs. Visit fourtheorem.com to see how we can help take your cloud journey to the next level.

Let's talk!

Do you agree with our opinions? Do you have interesting AWS questions you'd like us to chat about? Leave a comment on YouTube or connect with us on Twitter: @eoins, @loige.

Help us to make this transcription better! If you find an error, please submit a PR with your corrections.

Eoin: Have you ever deployed infrastructure with CloudFormation, only to notice later that things weren't quite lining up as they should? Well, you might be experiencing CloudFormation Drift. We've all been there. Deployments look fine initially, but gradually drift away from their original configuration over time. This can lead to very unpredictable results when you try and update your stack later on, and it can really end in disaster.

So today, we're diving deep into CloudFormation Drift, what causes it, how to detect it and fix it, and most importantly, how to prevent it in the future. And we're going to cover what it is, how it relates to CloudFormation and other infrastructure as code tools, why it happens, how to resolve it, some best practices on detecting drift as soon as possible, and to try and avoid it entirely. I'm Eoin, I'm here as usual with Luciano, and this is another episode of AWS Bites.

This episode of AWS Bites is brought to you by fourTheorem. Need to modernize your infrastructure or build scalable cloud solutions? fourTheorem brings the experience to build high quality, maintainable and scalable cloud applications that evolve with your business needs. Visit fourtheorem.com to see how we can help take your cloud journey to the next level.

We might just start with a quick CloudFormation summary. We've talked about it before, but let's rapidly go through it again. CloudFormation is an infrastructure as code or IaC service from AWS that allows you to define and manage all of your AWS resources and configuration using templates that are written in either JSON or YAML. So you can use it to provision, model and manage all of the resources for your application in a fairly safe and repeatable way.

So instead of creating things ad hoc manually, you define them in code using a template and CloudFormation is the service that handles provisioning and updating it for you. So it's perfect when you need repeatability because it allows you to define a consistent repeatable infrastructure setup. And also if you've got complex environments. So if you've got projects involving multiple resources like EC2, RDS, VPCs and Lambda, CloudFormation can handle all the dependencies between them fairly seamlessly.

It's also really good if you want to automate infrastructure management. And this is almost table stakes these days, but you can automate the entire lifecycle of your infrastructure, creating, updating and deleting resources without manual intervention. And it allows you to provide support for best practices like versioning and rollback as well. So it has automatic rollbacks if something goes wrong during deployment. By the way, if you want to dive a bit deeper on CloudFormation, we have spoken about it before. So in the show notes in the description below, you'll see links to episode 31, the battle between CloudFormation or Terraform and episode 121, where we talked about five ways to extend CloudFormation. And that's the overview of CloudFormation and infrastructure as code. But we're here to talk about Drift. So what is the concept of Drift?
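As a rough illustration of the "define it in a template" idea from the summary above, here is a minimal sketch that builds a template as a Python dictionary and serializes it to JSON, one of the two formats CloudFormation accepts. The logical name `ExampleBucket` and the description text are hypothetical and chosen only for this example.

```python
import json

# A minimal, illustrative CloudFormation template.
# "ExampleBucket" is a hypothetical logical ID, not something from the episode.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Minimal illustrative stack with one S3 bucket",
    "Resources": {
        "ExampleBucket": {
            "Type": "AWS::S3::Bucket",
            "Properties": {
                # Declares the desired state; CloudFormation works out the steps.
                "VersioningConfiguration": {"Status": "Enabled"},
            },
        }
    },
}

# Serialize to JSON so it can be saved to a file and deployed as a stack.
template_json = json.dumps(template, indent=2)
print(template_json)
```

The template only declares what should exist. It says nothing about the sequence of API calls needed to get there, which is the declarative property discussed later in the episode.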

Luciano: The concept of Drift is not unique to CloudFormation. In fact, it's something that can occur even with other infrastructure as code tools. For instance, you mentioned Terraform: even with Terraform, if you don't do things correctly, or by the book as they say, you can have Drift as well. So it's not necessarily an issue only with CloudFormation, but something that you need to take into account every time you do infrastructure as code.

And today, of course, we are going to focus a little bit more on CloudFormation and the tooling around it and some of the tips that are maybe a little bit more specific to CloudFormation. But I think the general concepts and the advice that we are going to give can be applied also to other IaC tools. So if you prefer to use Terraform, I think you're still going to find value in this episode. So back to Drift, it happens when the current state of your infrastructure diverges from the defined configuration in your templates.

So in other words, you have defined something using infrastructure as code, and at some point what you have defined there doesn't exactly match what's in your AWS environment. If you just look at your templates, you think the reality should look a certain way, but then if you actually look at your deployed stack, it doesn't really look exactly like you defined in your template. So you might be wondering how this is even possible, because we just said that CloudFormation, and infrastructure as code in general, is a way to get reproducible and deterministic deployments.

So how is it possible that all of a sudden it doesn't really match anymore what was defined in a template? And this is something that can happen as we said and generally happens when you start to misuse infrastructure as code. So again, let's maybe try to do a little bit of a step back trying to describe what's the idea of infrastructure as code. In general, when you use tools such as CloudFormation or Terraform, those are declarative tools we can say.

So tools where in a way they provide you with a language that you can use to define or describe what's the desired state of a given stack. So you never tell the infrastructure as code tool what is the sequence of steps that it needs to do to get to a certain state. You just define what is that state and you let the tool figure it out by itself how to get from the current state to the new desired state.

And the problem is that most of these infrastructure as code tools, they don't necessarily query upfront your current environment. For instance, your current AWS account. They will just store the state of the last deployment somewhere. In the case of CloudFormation, this is entirely managed by the CloudFormation service itself. If you use something like Terraform, Terraform is actually quite flexible.

It gives you visibility of the state file. You can decide to store it in a simple file in the file system. You can store it in DynamoDB, you can store it in S3. Actually, Terraform, it's quite open about that. CloudFormation is a little bit more opaque, but the same principle applies. So what happens the next time that you do a deployment is that rather than reassessing what's the current state, the tool is just going to take the latest state recorded during the previous deployment as the starting point.

So it's going to try to understand, okay, what was that starting point? What is the new state that you want to go to? Then it makes a number of assumptions and figures out a good plan to go from A to B. But the problem is that if you did something wrong, and we'll talk a little bit more about that in a second, what the infrastructure tool thinks is A does not exactly match the current reality. So it might actually be something else, and therefore the plan to go from A to B is not necessarily a good plan as far as the reality of your stack is concerned. So maybe we should mention some more practical examples of what can actually introduce this kind of situation that we are effectively calling drift.
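The "plan from A to B" problem Luciano describes can be sketched with a toy model. This is only an illustration under simplifying assumptions, not how CloudFormation is implemented: the `plan` and `apply` functions and the property names are hypothetical. The point is that the plan is computed from the recorded state, so a manual change that was never recorded is neither noticed nor corrected.

```python
def plan(recorded: dict, desired: dict) -> dict:
    """Return the property updates needed to go from the recorded state to the desired one."""
    return {key: value for key, value in desired.items() if recorded.get(key) != value}


def apply(actual: dict, changes: dict) -> dict:
    """Apply the planned updates on top of whatever really exists."""
    return {**actual, **changes}


# State remembered by the tool after the last deployment ("A").
recorded = {"port": 443, "desired_count": 2}
# Reality: someone changed the port by hand, so the real "A" is different.
actual = {"port": 8443, "desired_count": 2}
# New template ("B"): only the task count is meant to change.
desired = {"port": 443, "desired_count": 4}

changes = plan(recorded, desired)  # the port looks unchanged, so it is not in the plan
result = apply(actual, changes)

print("planned changes:", changes)
print("resulting state:", result)
print("matches template:", result == desired)
```

In this toy run the deployment succeeds, yet the resulting state still differs from the template, because the manually changed port was never part of the plan.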

Eoin: I think the most common case for drift is just when you're making manual changes. So someone might manually update a resource without updating that CloudFormation template. That could be like tweaking an EC2 instance setting, changing some security groups while trying to debug a connection issue or changing an IAM policy to try and resolve some permissions issues. You can also have things that are a little bit more subtle like changing the number of desired tasks in a Fargate service.

And that's something that quite commonly happens outside your infrastructure as code tooling. So manual changes, probably very common. Then you've got third-party tools. So sometimes external automation tools may modify resources without CloudFormation being aware of it. These can be tools that try to assess compliance and that might modify resources on your behalf to apply specific best practices.

And it can also be internal, not necessarily third-party tools, but just external automation. You might have set up specific automation that can alter resources. Like you might have a script that helps save cost by scaling down EC2 or turning it on and off just to match your expected load. Or you might just have some internal tooling that applies security best practices like turning encryption on. Using a mixture of IaC tools as well can cause drift. So if you use a mix of different tools and stacks managed by different tools happen to share the same resources, this might cause drift. This is a dangerous area, I would say. The shared resources state might look very different to every tool depending on the order of deployments and when it took its perception of state, if you like. So effectively, each tool won't be able to know what changes are applied by the other tool. So that's how it can happen, but how do we avoid it?

Luciano: Yeah, I think at this point, it should be very clear what can actually introduce drift, what drift is. So in a way, we can start to guess and work backwards and try to think, okay, how can we avoid it, right? And one thing is that we should try to avoid manual changes as much as possible, unless you know what you're doing. Like, I think there are some cases where you might still need to do manual changes because maybe you are trying to do certain things that in that particular moment, it's just easier to do it manually.

We mentioned, for example, trying to resolve an issue with a security group. It is, I think, totally legit to try to figure out exactly what is the correct configuration. And then, of course, you need to remember to apply that correct configuration to your infrastructure as code to make sure that maybe you had a little bit of a drift for a few minutes, but then that drift is immediately reconciled and your CloudFormation stack is effectively in line with the reality of your stack deployed on AWS.

So again, manual changes generally to be avoided. You can still do it, and in some cases, it can be useful to do manual changes, but always remember that manual changes will introduce drift unless you propagate them back into your stack. Then the other advice is always use IaC. So try to avoid hybrid mode deployments, we would call them, which is sometimes you have this situation where different teams have different practices.

So you might have teams that still do things manually, other teams that use infrastructure as code, and then somehow you end up with this kind of mixed stack that has some resources that are managed manually, other resources that are fully managed by IaC. And yeah, things can get messy really quickly and it's going to be very hard then to tell when a drift is going to happen. It's eventually going to happen, but then how it's going to happen and then what kind of results you can have as a consequence of that.

And we also mentioned not to mix different IaC tools. So you can use different IaC tools together. There are ways to do that safely. I think the risk is when different IaC tools are actually sharing the same resources, but you can have a mix of IaC tools when they manage different stacks. And other things are if you're using external automations, ideally you want this automation to either update your templates and not the resource directly, or if it's not easy for you to let the automation update the templates, maybe you can have some kind of reporting that then you can manually take and apply into your infrastructure as code rather than letting the automation change resources and then ending up with drift. So it might be of course more complex to set up automation this way, but I think you always need to prefer the safety rather than maximum automation and then end up with something that can be inconsistent. So now that we know some ways that we can avoid drift, how can we detect it? How do we really know if we have drift in one of our stacks?

Eoin: Yeah, it's very important to have some sort of ability to detect that you've got drift and then obviously take remediation steps. I mean, you might also have a degree of acceptable drift. I think when you're talking about automation tools, you know, updating security configuration or applying tags, sometimes organizations just take the view that the infrastructure as code tools update the resources and the automation continually aligns them to comply.

And that's acceptable drift, I guess to a degree. But if you have other kinds of drift, then you might just realize too late because the deployment fails in a weird way, like tries to modify a resource that doesn't exist anymore. So luckily, CloudFormation for the past few years has a feature to detect drift and you can use it from the management console by just going to a stack. You click on stack actions and select detect drift.

And I think this is supported for most configurations now. When it launched, it didn't support everything, didn't detect everything. It's going to start a background scan and CloudFormation will compare the known state with the actual state of all of the resources in the stack. And then once it's finished, you can visualize the results of the operation by clicking on view drift results under stack actions.

So in that page then after a few minutes, you'll see if your stack has drifted and if it has, it'll also give you a diff of the current stack state and what CloudFormation expected. Now, luckily, you can automate this process to some degree and trigger an alarm if one of your stacks has drifted, and that's a good idea, I think, if you're concerned about drift and going to take it seriously. There is a tutorial by AWS that explains how to do that and we'll have a link in the notes below. So we've got some drift, we've seen it, we've detected it. What do we do?
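The console flow Eoin describes (start detection, wait, view the results) can also be scripted. The sketch below assumes a boto3 CloudFormation client is passed in as `cfn`; the function name `detect_drift`, the polling interval and the stack name in the usage comment are assumptions made for illustration. It ignores pagination of the results for brevity, and it is not the approach from the AWS tutorial mentioned in the episode.

```python
import time


def detect_drift(cfn, stack_name, poll_seconds=5.0):
    """Run drift detection on a stack and return (stack_drift_status, drifted_resources).

    `cfn` is expected to behave like boto3.client("cloudformation").
    """
    # Kick off the background scan; CloudFormation returns an ID to poll.
    detection_id = cfn.detect_stack_drift(StackName=stack_name)["StackDriftDetectionId"]

    # Poll until the scan is no longer in progress.
    while True:
        status = cfn.describe_stack_drift_detection_status(
            StackDriftDetectionId=detection_id
        )
        if status["DetectionStatus"] != "DETECTION_IN_PROGRESS":
            break
        time.sleep(poll_seconds)

    if status["DetectionStatus"] == "DETECTION_FAILED":
        raise RuntimeError(status.get("DetectionStatusReason", "drift detection failed"))

    # Fetch only resources that were modified or deleted (first page only).
    drifts = cfn.describe_stack_resource_drifts(
        StackName=stack_name,
        StackResourceDriftStatusFilters=["MODIFIED", "DELETED"],
    )["StackResourceDrifts"]
    return status["StackDriftStatus"], drifts


# Example usage (requires boto3 and AWS credentials; "my-stack" is a placeholder):
# import boto3
# stack_status, drifted = detect_drift(boto3.client("cloudformation"), "my-stack")
# for item in drifted:
#     print(item["LogicalResourceId"], item["StackResourceDriftStatus"])
```

Passing the client in as a parameter keeps the function easy to exercise with a stub, and a scheduled job could call it and raise an alert whenever the returned status is `DRIFTED`.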

Luciano: Yeah, I don't think there is a universal solution and to be fair, it's always a little bit of a pain. There are situations where maybe you just created that drift like the example we mentioned about the security groups and therefore in that case, it's very easy to reconcile your stack and fix the drift because you know exactly what you changed and you can easily reapply the same changes into your infrastructure as code.

And in the case of a security group, that's probably going to be a few properties. Maybe you're going to open different ports, but it's very limited to a specific area of your stack. But in some cases, it might be much more complex and you probably need to be a little bit creative in trying to figure out exactly first of all, why the drift happened, what is exactly the desired state versus the current state and how to reconcile the two.

And the general problem is that you might think about a very simple solution. The simplest solution is probably, okay, I'm just going to destroy the stack and recreate it entirely with a well-known, well-defined end state, which is something that works. But of course, if you have an application that is running in production, you probably cannot afford to do that because of course, you are going to create downtime.

So it gets really tricky when you want to try to reconcile drift without creating downtime. And that's why sometimes you need to be a little bit creative. And recently, for instance, we had a use case at work where we had some drift related to a load balancer. So we needed to make some changes to this load balancer, but making the changes would recreate the load balancer entirely. And because in that particular stack, the DNS entries were not managed by that stack, but were managed externally, making the changes would basically create a new load balancer with a different IP address or a different DNS record, and therefore we would have downtime.

And to solve that particular use case, we basically needed to keep the existing load balancer as it was, spin up a new load balancer with basically exactly the same targets, then change the DNS in the other stack, and effectively we kind of switched load balancers that way. So we had a period of time where we had two different load balancers, which gave the DNS time to propagate correctly, and then at that point we could delete the old load balancer.

So in that sense, this is just an example that shows you that sometimes you need to think about a little bit more involved and complex solutions just to make sure you slowly converge to the desired state by trying to avoid to destroy resources that might be used in production, and therefore if you just destroy them, you might end up with downtimes. So yeah, I guess in general, what you want to do is try to apply, to figure out, first of all, what happened, figure out how do I get to the specific target state, and then try to create a plan where step by step you understand what's going to happen, what is going to be the new state, and then slowly make sure that that state converges to the desired one. Now at that point, you hopefully are going to end up with a new version of your stack where your actual state described in the stack and the state present in AWS matches. So from that moment on, if you keep doing changes correctly, only using infrastructure as code, that situation shouldn't happen again. So yeah, this is, I guess, one of the approaches. Do you have other ideas, Eoin, that you want to share?

Eoin: I would suggest maybe using change sets in CloudFormation as a way to help with that, because with change sets, it's a bit like a Terraform plan. It's not going to apply changes. It'll just create a change set for you with the diff, and it'll show you what changes are going to happen beforehand. Now you can also enable deletion protection on resources that you want to protect. Of course, if you can afford downtime or some data loss, sometimes the simplest solution when you have drift is just to destroy the stack and recreate it.

So in the development environment or pre-production environment, that might be the way you go. Otherwise, you'll have to come up with a plan with multiple incremental steps that can help you to minimize damage as you converge your infrastructure as code state to the actual state of the stack. And that can seem like a lot of work, and if you don't do it very frequently, it is an awful lot of work.

But if it's something you get used to, it's a good practice to just get into the habit of. Other times, it might just be safer to bring up an entirely new stack in parallel. Do all the necessary data migration, if any, and then shift the traffic to the new stack and then finally remove the old drifted stack. So yeah, I mean, resolving drift might be tedious and costly. That's why you want to avoid it as much as possible in the first place.

Maybe another worthy mention is when drift includes new resources, that is, other resources that have been added outside the stack and should have been part of it. In that case, you can also use CloudFormation import, and that's another way to manage drift. We mentioned this way back in episode 11, on how to move away from the management console and how to get resources that aren't managed by infrastructure as code into it. So that's one to mention as well. Just on drift detection as well, we should probably add that drift detection itself doesn't have any cost or charge associated with it. But depending on the resources involved, correcting it might, of course. So changing resource types or scaling can trigger charges. Of course, factor in the cost of downtime or potential security risks from unmanaged changes as well.
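To make the change set suggestion from this part of the conversation more concrete, here is a minimal sketch that inspects a change set description and flags the changes most likely to cause downtime: removals and modifications that require replacement. The input is assumed to have the shape returned by the CloudFormation `DescribeChangeSet` API; the function name `risky_changes` and the sample data are hypothetical.

```python
def risky_changes(change_set: dict) -> list:
    """Return (action, logical_id, resource_type) for removals and replacements."""
    risky = []
    for change in change_set.get("Changes", []):
        resource_change = change.get("ResourceChange", {})
        action = resource_change.get("Action")
        replacement = resource_change.get("Replacement")
        # "True" and "Conditional" mean the resource will or may be recreated.
        needs_replacement = action == "Modify" and replacement in ("True", "Conditional")
        if action == "Remove" or needs_replacement:
            risky.append(
                (
                    action,
                    resource_change.get("LogicalResourceId"),
                    resource_change.get("ResourceType"),
                )
            )
    return risky


# Hand-written sample in the shape of a DescribeChangeSet response (illustrative only).
sample = {
    "Changes": [
        {
            "Type": "Resource",
            "ResourceChange": {
                "Action": "Modify",
                "LogicalResourceId": "LoadBalancer",
                "ResourceType": "AWS::ElasticLoadBalancingV2::LoadBalancer",
                "Replacement": "True",
            },
        },
        {
            "Type": "Resource",
            "ResourceChange": {
                "Action": "Modify",
                "LogicalResourceId": "AppSecurityGroup",
                "ResourceType": "AWS::EC2::SecurityGroup",
                "Replacement": "False",
            },
        },
    ]
}

for action, logical_id, resource_type in risky_changes(sample):
    print(f"{action}: {logical_id} ({resource_type})")
```

A check like this could run before executing a change set, so that a replacement such as the load balancer case described earlier is spotted before it causes downtime.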

Luciano: Absolutely. I think this covers more or less everything we wanted to share for today. To summarize, CloudFormation drift is an issue that can be very tricky and can be unexpected sometimes. So even if you make an effort to maintain your stacks well using infrastructure as code, there are always so many different factors that can sneak in and cause drift. So I think it's just a good practice to try to stay vigilant and maybe come up with some automation like the one we mentioned from the tutorial by AWS that we have in the link in the show notes to try to give you some kind of alarm as soon as possible when drift is detected so that you can act sooner rather than later. Because of course, the more drift compounds, the more challenging it's going to get to resolve that drift. So that's everything we have. And we actually are curious to know if you had any interesting story of stack drifting and maybe you had to come up with some very creative resolution strategy. If that's the case, please share it with us, either by reaching out to us individually on our social channels or maybe by leaving a comment on YouTube or in your podcast player of choice. So with that, thank you very much for staying with us and we'll see you in the next one.