AWS Bites Podcast

121. 5 Ways to extend CloudFormation

Published 2024-04-19 - Listen on your favourite podcast player

In this episode, we discuss 5 different ways to extend CloudFormation's capabilities beyond what it natively supports. We start with a quick recap of what CloudFormation is and why we might need to extend it. We then cover custom scripts and templating engines, which can be effective but require extra maintenance, and recommend relying instead on tools like the Serverless Framework, SAM, and CDK, which generate CloudFormation templates while providing abstractions and syntax improvements. When you need custom behaviour, CloudFormation macros let you pre-process templates, while custom resources and the CloudFormation registry let you define new resource types. We wrap up with recommendations, based on our experience, for when to use each approach. Overall, we cover multiple options for extending CloudFormation to support more complex infrastructure needs.

AWS Bites is brought to you by fourTheorem, an AWS Partner that specialises in modern application architecture and migration. If you are curious to find out more and to work with us, check us out on fourtheorem.com!


Let's talk!

Do you agree with our opinions? Do you have interesting AWS questions you'd like us to chat about? Leave a comment on YouTube or connect with us on Twitter: @eoins, @loige.

Help us to make this transcription better! If you find an error, please submit a PR with your corrections.

Luciano: CloudFormation allows us to define stacks, which are essentially collections of resources, and that's done using YAML or JSON in the form of templates. This is the AWS way of helping us manage our infrastructure, from creating to updating and deleting resources as our application evolves. But sometimes CloudFormation's built-in capabilities just aren't enough for what we need to do.
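For illustration, a minimal template might look like the sketch below; the resource names are made up, but the structure (a stack as a collection of resources declared in YAML) is the standard one.

```yaml
# Minimal CloudFormation template: one stack, a couple of resources (names are illustrative)
AWSTemplateFormatVersion: "2010-09-09"
Description: Example stack with a bucket and a queue
Resources:
  UploadsBucket:
    Type: AWS::S3::Bucket
  JobsQueue:
    Type: AWS::SQS::Queue
Outputs:
  BucketName:
    Value: !Ref UploadsBucket   # outputs can be referenced by other stacks and tools
```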

Maybe you find yourself wrestling with verbose syntax, or perhaps you need to provision resources that are not yet supported by AWS, or resources that are outside the scope of AWS entirely, maybe from third-party providers. Or you might need, for instance, to perform certain tasks before or after CloudFormation runs. Maybe you need to build some kind of application, maybe a front-end, and package it in such a way that you can then deploy it to AWS as part of your stack.

So today we will explore different ways to extend CloudFormation's capabilities. We will cover things like custom scripts, templating engines, and higher-level tools such as CDK, SAM, and the Serverless Framework, but we will also cover CloudFormation macros and custom resources. We will cover different use cases and the pros and cons of every approach. So my name is Luciano, and together with Eoin, we are here for another episode of the AWS Bites podcast. AWS Bites is brought to you by fourTheorem, an AWS partner that specializes in modern application architecture and migration. If you're curious to find out more and to work with us, check us out on fourtheorem.com. So let's start maybe by giving a quick recap of what CloudFormation is and why we might need to extend it. Eoin, do you want to cover that? Sure. Yeah.

Eoin: Back in episode 31, we did a battle episode talking about CloudFormation versus Terraform. It's worth going back for a listen to that one; it's still one of our most popular episodes. In it, we talked a little bit about CloudFormation and how you might extend it, but we're going to dive a little bit deeper today. So CloudFormation is the native infrastructure as code solution from AWS. Like you said, Luciano, you get templates that allow you to define stacks, which are just collections of resources. You write them in YAML or JSON, and then you can deploy them to AWS. AWS's responsibility is then to manage the state of those resources while they're provisioning, to handle the dependencies between them, and to detect failures, retry, and roll back if anything goes wrong, trying to keep your cloud in a consistent state. That's really the goal, and it does its job pretty well. There are some cases, of course, where it might be a bit limited or inconvenient, so from time to time you do need to build alternative solutions on top of it. We've been in that situation quite a lot, and we'll let you know in a bit what we tend to go for. So CloudFormation does its job pretty well, but there are situations where you might need to customize or extend it. One such case is when the syntax gets very verbose, so you might want to write something a bit more high level, a higher-level component, if you will, on top of CloudFormation that allows you to abstract that away and have modular, reusable components. Then again, you might also want to provision resources that aren't even supported yet by AWS. This happens less these days, but it still does happen. One example I know you've come across recently, Luciano, is Amazon Q Business: it's in preview, but there is no CloudFormation support yet. And then there are resources outside the realm of AWS. You might be using a third-party vendor like Auth0 or Supabase or something else, and you want to make sure that all your resources are part of the same stack for consistent updates and deployments, and to make it easy to reference resources from one stack to another. You might also need to do things before or after CloudFormation runs, so that's another use case for customizing it. You might want to build a front-end application and copy all of the assets into an S3 bucket, or you might want to pre-process assets, like optimizing pictures or building container images, things that normally fall outside the realm of provisioning infrastructure but do become part of your deployment. Or you might just want to fetch configuration that you need as a parameter for your stacks. So how do we get started? What's the first thing you came across as a solution to this, Luciano?

Luciano: Yeah, this is something that I came across a few years ago, I guess, when I started my cloud journey. It was quite common for people to write their own wrapper scripts and sometimes even use templating engines. The idea is that you don't run the CloudFormation CLI directly for deployments, but instead you run your own custom script that does many things and at some point also runs CloudFormation for you. And you might or might not use a pre-made template, because sometimes you might create your own simplified templates, so to speak.

And then, as part of your script, you use a templating engine, something like Jinja, for example, to generate additional pieces, or to generate your entire CloudFormation template, and then call the CloudFormation CLI to deploy that template. This was very common, for instance, because for a long time CloudFormation didn't even have a concept of for loops. So if you had to provision, I don't know, 20 Lambdas that were very, very similar to each other, it would be very annoying to do all of that copy-pasting.

So you would probably find yourself creating some kind of simpler, higher-level configuration and then using something like Jinja to generate the actual underlying CloudFormation code for every single Lambda. And you might also use that approach to run other kinds of commands, for instance, as we said, to build container images before you deploy the template that references them in a specific registry. So you can build and publish them as part of that script.
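As a sketch of that idea, a Jinja-templated CloudFormation source might look something like this; the function list, handler names, artifact bucket, and the shared role are hypothetical and would come from your own simplified configuration.

```yaml
# stack.yaml.j2 - rendered with Jinja2 by the wrapper script before deploying.
# Assumes a "functions" list and an "artifact_bucket" variable supplied at render time,
# and a SharedExecutionRole resource defined elsewhere in the template.
Resources:
{% for fn in functions %}
  {{ fn.name }}Function:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: {{ fn.name }}
      Handler: {{ fn.handler }}
      Runtime: nodejs20.x
      Role: !GetAtt SharedExecutionRole.Arn
      Code:
        S3Bucket: {{ artifact_bucket }}
        S3Key: {{ fn.name }}.zip
{% endfor %}
```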

And maybe you can also do things afterwards, maybe, I don't know, post a message somewhere, maybe a chat message saying that a deployment was completed, or reporting the status of that deployment, all sorts of integrations like that. It was very common for people to do these kinds of things back in the day. Now, is this a good practice? It can be very effective, but I would personally discourage it, because the problem is that you end up with very custom code that you need to maintain over time and that every new engineer on the team needs to get comfortable with. It will also be very different from company to company, because everyone is effectively creating their own solution from scratch. And writing this code in a reliable way can get very complicated very quickly. Just imagine that the more steps you have, the more opportunities there are for things to go wrong, and you have to think: okay, what do I do if something goes wrong at this step? How do I clean up? How do I make sure that my entire deployment ends up in a consistent state? Again, this is not something that is very easy to do well, so you might end up with a lot of trial and error before you have something stable, and in between you might end up with a few broken deployments, which is not a great situation to be in. So today there are lots of better ways to achieve most of the same things we just discussed here without having to struggle with creating your own custom scripts. I think the main one is to use the newer tools that have appeared in the industry in the last few years. So which ones come to mind first for you, Eoin?

Eoin: Yeah, there's a whole suite of tools that are essentially CloudFormation generators, and they all have their pros and cons. One of the first ones people might call to mind is the Serverless Framework. It's a third-party tool, primarily open source. I know that the latest version does have a commercial license if you're a company of a certain size, but the Serverless Framework gives you a simplified YAML syntax that eventually gets converted into a CloudFormation template. Now, the Serverless Framework supports other clouds, not just AWS, but AWS is, I would imagine, the biggest and most used cloud with the Serverless Framework.

The Serverless Framework YAML essentially becomes a superset of CloudFormation: you can define normal CloudFormation in there, but you can also use a concise syntax for things like serverless functions and the triggers for those functions. That's really where it shines. One of the notable things about it is that it has a really wide and rich plugin system for all sorts of use cases, so you can extend it with all sorts of plugins that make the job of generating CloudFormation even easier. More recently there's CDK, which we talked about when we covered the State of Serverless survey results. CDK is getting really popular, and it allows you to define infrastructure as code using the programming language of your choice, such as TypeScript, Java, Python, .NET, or Go. It makes it easy to create higher-level abstractions. There's a whole CDK episode we did where we talked about the different levels of constructs. At the base of that, you essentially have types in the language of your choice that generate the raw CloudFormation resources, and anything that's a higher-level construct is built on top of that. So it generates CloudFormation templates, but makes it easier for programmers who are skilled in these languages to make things reusable, modular, and extensible. Now, if we go back to YAML land, we have the Serverless Application Model, or SAM. It is similar to the Serverless Framework in principle and use cases, and I would imagine it was inspired by it in some way, but it is an official AWS solution and competes with the Serverless Framework. It is slightly different in that it doesn't have that plugin ecosystem; in fact, it's much more strict when it comes to the syntax. The way it works is by having its own template language that looks like a superset of raw CloudFormation, but offers those interesting shortcuts. And its magic is done using CloudFormation macros, which is one of the ways of extending CloudFormation that we're going to discuss in a moment. Now, I mentioned a few previous episodes that are relevant. Don't worry, all the links to those will be in the description below. So we talked about CloudFormation macros, Luciano, do you want to take us through those?
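To give a feel for those shortcuts, here is a minimal SAM template sketch; the function name and paths are made up. The Transform line is what invokes the SAM macro that expands AWS::Serverless::* types into plain CloudFormation resources.

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Transform: AWS::Serverless-2016-10-31    # the CloudFormation macro that powers SAM
Resources:
  HelloFunction:
    Type: AWS::Serverless::Function      # expands into a Lambda function, role, permissions, API, etc.
    Properties:
      Handler: index.handler
      Runtime: nodejs20.x
      CodeUri: ./src
      Events:
        HelloApi:
          Type: Api
          Properties:
            Path: /hello
            Method: get
```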

Luciano: You can imagine CloudFormation macros as a function that is executed with your entire template as input and returns an extended version of that template as output. The idea is that you define a macro somewhere, register it in your own account, and then reference it at the beginning of your template. So you generally have a section called Transform with the name of a specific macro, and that tells CloudFormation that the template needs to be transformed using that macro before the deployment.
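As a rough sketch of what the Lambda behind a macro looks like: the event and response shapes below follow CloudFormation's macro contract, while the actual transformation (adding a DeletionPolicy to S3 buckets) is just an invented example. A template would opt in with something like `Transform: AddRetainPolicy` at the top level.

```typescript
// Hedged sketch of a CloudFormation macro handler (the transformation itself is illustrative only).
type MacroEvent = {
  requestId: string;
  fragment: { Resources?: Record<string, any>; [key: string]: any };
  templateParameterValues?: Record<string, unknown>;
  params?: Record<string, unknown>;
};

export const handler = async (event: MacroEvent) => {
  const fragment = event.fragment;
  // Walk the resources and enforce a policy: S3 buckets should never be deleted by accident.
  for (const resource of Object.values(fragment.Resources ?? {})) {
    if (resource.Type === "AWS::S3::Bucket" && !resource.DeletionPolicy) {
      resource.DeletionPolicy = "Retain";
    }
  }
  return {
    requestId: event.requestId,
    status: "success", // return "failure" to make the deployment fail with an error
    fragment,          // the (possibly modified) template that CloudFormation will deploy
  };
};
```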

And you can use this concept to do all sorts of different things, for instance to automate tasks, enforce policies, or even create new resources as part of expanding the template. So what are they good for? We recently used CloudFormation macros for a project called SLIC Watch; we will have the link in the show notes. The idea of SLIC Watch is that you might want to have best-practice dashboards and alarms generated for you, depending on the resources in your template. So you can basically say: if I'm using an SQS queue, for instance, in my template, this macro will automatically generate dashboards and alarms to monitor the most common things you would worry about when using something like an SQS queue. And of course, you can extend that concept to other resources. It removes a lot of the boilerplate that you generally have to write when it comes to monitoring specific resources. So this is just an example of when a CloudFormation macro can be very convenient: take a simple template as input, see what's inside, do something useful, and produce a slightly enhanced version of that template. We already mentioned how it generally works, but the idea is that you need to write that code somewhere, and generally you write it as a Lambda function. Then you publish this Lambda function, register it as a CloudFormation macro, and give it a name. From that moment on, you can reference that transform in your own templates when deploying in the specific account where you provisioned the macro. So, of course, you need to make sure the macro is provisioned before you use it, for instance if you are deploying a third-party macro from an open source project like SLIC Watch. But once you have done that one-off step, you can use that transform in all sorts of projects that you deploy in that particular account. Now, there can be some small problems with this approach. One, as I said, is that you need to make sure things are provisioned before you can actually use them. The other is that if you have issues while writing your transform, it's not always straightforward to debug exactly where the problem is, because you are effectively taking a template that might be very big and very involved in terms of properties. You generally need to write fairly complex code that parses the original template, really understands the semantics of that template, and then changes things in it, so there is a lot that can go wrong across all these different steps. We found, when building SLIC Watch, that there is a lot of debugging involved. Of course, you can take different versions of templates, run the macro locally, and inspect the output, but sometimes you still need to deploy the template to make sure it is semantically correct and is doing exactly what you want it to do. So there can be many development cycles before you get to a point where you are happy with the output of that macro. There is another good example that we have bumped into multiple times, and we might have mentioned it in previous episodes: a macro called aws-sso-util by Ben Kehoe. It's very convenient because it gives you a higher-level syntax for things that get very, very verbose if you need to define the low-level resources that are generally required for SSO-type operations. Again, we will have a link in the show notes, and it's a good one to check out if you are curious to see what it takes to build a CloudFormation macro. What else comes to mind, Eoin? We should definitely talk about custom resources.

Eoin: I think this is the one we've ended up using the most, probably because they're pretty quick and relatively easy to implement compared to all the other methods. There can be a few footguns with the custom resources method, so we should talk about that. There are two ways of implementing them: you can do it with a Lambda function, or you can do it through SNS, where you receive a notification and can trigger logic anywhere, like on an EC2 instance. When you want to create a custom resource, it's fairly straightforward. In your template, you basically define a type, and that could be anything like Custom:: followed by whatever name you specify, or you can just use the standard version, which is AWS::CloudFormation::CustomResource. If you use the custom version, just be aware that you can't have multiple double-colon separators for more complex namespacing; you can't have any more nesting. If you try to do that, your stack will hang forever until you cancel the deployment and delete the generated change sets. Ask us how we know that one.

You can also pass custom properties into your custom resource so that the logic that handles it can read them. Then you just need to specify what handles the creation, update, and deletion of the custom resource. That's going to be a reference to a Lambda function ARN or an SNS topic, and you pass it in a field called ServiceToken. That's the only mandatory property for a custom resource, so there's not a lot of code involved. Let's say you're using the Lambda method, then.
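In a template, that might look like the sketch below; the resource names and the extra property are hypothetical, and the referenced MigrationFunction is assumed to be defined elsewhere in the same template.

```yaml
Resources:
  DatabaseMigration:
    Type: Custom::DatabaseMigration                 # or the generic AWS::CloudFormation::CustomResource
    Properties:
      ServiceToken: !GetAtt MigrationFunction.Arn   # Lambda ARN (or an SNS topic ARN)
      SchemaVersion: "42"                           # arbitrary extra properties are passed to the handler
```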

When you deploy a stack with one of these custom resources in it, CloudFormation is going to invoke that Lambda function and pass in a few properties. An important one is the request type, which tells you whether the action you should perform is a create, an update, or a delete. You essentially need to look at that and figure out what needs to be done. If you're creating a new resource, return a physical resource ID. If it's an update, take the physical resource ID, compare the old properties with the new properties, do the update, and then return the physical resource ID. Now, when you're doing an update, it doesn't actually have to be the same resource ID you originally received. If you provide a new one, it's considered a resource replacement, which means that CloudFormation will then call the delete action with the old one.
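A minimal sketch of such a handler is shown below, assuming the Node.js runtime. The event and response fields follow CloudFormation's custom resource contract; the actual work (here a placeholder for a schema migration) and the physical ID scheme are made up.

```typescript
// Hedged sketch of a Lambda-backed custom resource handler.
import * as https from "node:https";
import { URL } from "node:url";

type CfnEvent = {
  RequestType: "Create" | "Update" | "Delete";
  ResponseURL: string;           // pre-signed S3 URL that the result must be PUT to
  StackId: string;
  RequestId: string;
  LogicalResourceId: string;
  PhysicalResourceId?: string;   // present on Update and Delete
  ResourceProperties?: Record<string, unknown>;
};

export const handler = async (event: CfnEvent): Promise<void> => {
  let status: "SUCCESS" | "FAILED" = "SUCCESS";
  let reason = "";
  // Reuse the existing physical ID on Update/Delete; invent one on Create.
  const physicalId = event.PhysicalResourceId ?? `migration-${event.RequestId}`;
  try {
    if (event.RequestType === "Create" || event.RequestType === "Update") {
      // ... do the real work here, e.g. run a database schema migration ...
    }
    // Delete is often a no-op for this kind of self-contained resource.
  } catch (err) {
    status = "FAILED";
    reason = (err as Error).message;
  }
  // Always report back, otherwise the stack hangs until CloudFormation times out.
  const body = JSON.stringify({
    Status: status,
    Reason: reason,
    PhysicalResourceId: physicalId,
    StackId: event.StackId,
    RequestId: event.RequestId,
    LogicalResourceId: event.LogicalResourceId,
  });
  const url = new URL(event.ResponseURL);
  await new Promise<void>((resolve, reject) => {
    const req = https.request(
      {
        hostname: url.hostname,
        path: url.pathname + url.search,
        method: "PUT",
        headers: { "content-type": "", "content-length": Buffer.byteLength(body) },
      },
      (res) => {
        res.resume();
        res.on("end", resolve);
      }
    );
    req.on("error", reject);
    req.end(body);
  });
};
```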

If you're really handling the proper lifecycle of create, update, and delete, those are some of the things you'll need to watch out for. I've seen quite a few naive and simple versions of custom resources, which essentially just have an idempotent action that happens when a create or an update occurs. That, I suppose, simplifies the implementation, and you don't have to worry too much about detecting which properties have changed. When your function is invoked, it will also pass in a response URL. You call that to tell CloudFormation that you're finished and that your create, update, or delete has succeeded or failed. It's a little bit more involved than just returning a value from your Lambda function handler; you need to actually make a request to that specific URL. There are some libraries that can help you do this correctly and safely. There are a few things you need to be aware of. First of all, your Lambda function might actually time out before it's able to send the response, in which case CloudFormation will never get the response and it will take an hour to time out, which I've been through multiple times and it's not fun. Another thing is that if an error happens in your import code, like in your module initialization code, and you're not catching that and sending back a CloudFormation response in the handler, that can also cause this one-hour hang. So you have to be really careful. For Node.js, there are a few modules on npm we can link in the show notes that make it a bit easier to safely implement CloudFormation custom resources, catch all of these errors, and make sure that you always send a response before any possible timeout. Now, the good thing about these custom resources is that they're probably the easiest ones to start with. We use them quite often. Recently we were using them to run database schema migrations when you deploy an application, so that they always run with the application stack deployment. That made a lot of sense for us and it was a good fit. They're very easy to use for self-contained resources that you need just for one project. We mentioned the problems with timeouts: you do have to be really careful. It's a good idea to deploy your code, test it outside of the CloudFormation environment, and really validate that it works before you try it inside CloudFormation. It's also missing things like drift detection and resource import, and it's difficult to share across multiple projects. And I suppose as well, since you're using a Lambda function, these resources are running in your own account, so you need to make sure you've got the right networking and IAM permissions. If you're running that Lambda function in private VPC subnets, you need to make sure that you set up the VPC endpoints to send the response back to CloudFormation. Another gotcha there is that the response URL actually comes through as an S3 pre-signed URL, so in that case you also need an S3 gateway endpoint or interface endpoint. If you don't do that, it's just going to hang and take a while to fail. Then it will sometimes try to roll back by invoking the same function again, which is going to time out again, which is going to take you another hour. So CloudFormation custom resources can be great, but when they don't work, they really induce rage. So, is there anything we can do to mitigate the risk of rage here, Luciano? What else have we got?

Luciano: Another alternative is the CloudFormation registry, which seems to be the evolution of custom resources in many ways. It's a more recent development in AWS, and I have to be honest: this is the one we have used the least. So we will only cover it at a high level and try to mention the differences from custom resources, but I don't think we have the level of experience to list all the pain points and footguns; I'm sure there are some of them somewhere, even with the CloudFormation registry. So let's get into it. As I said, it's a new way of doing effectively the same thing you do with custom resources, and it addresses the same use cases in many ways. But the idea is that rather than having just a Lambda that is part of your own stack, it's a little bit more like a CloudFormation macro, meaning that you register the custom resource type at the account level, and then you can reference it in other templates that you deploy in that specific account.

And this is where the idea of the registry comes from. You can even make resource types publicly available. That can be very convenient, for instance, when you are a provider of some sort, like a third party providing a service, and you want to make it easy for people to access those custom resources and install them in their own accounts without having to rewrite something themselves, or having to download code and run scripts to provision it inside their own AWS accounts. So this is one of the main advantages: having this kind of registry allows you to easily make your custom resources available, either internally in your own company across multiple accounts, or, as a third-party provider, to the accounts of your own customers. There is also a little bit more. We said that for custom resources you have the concepts of create, update, and delete.

In the registry, this has been extended with additional operations. There is, for instance, a concept of read, which allows you to see the state of a particular custom resource, but also a concept of list, which allows you to list all the resources of a given type and see exactly what their state is. And that gives additional capabilities to CloudFormation. For instance, this is why this particular approach supports drift detection: CloudFormation can inspect, at any point in time, the state of a given custom resource and compare it with what you provisioned in a specific template. So you might be wondering, how do you create your first resource type for the registry?

There is a CLI that you can use as a helper. It's called the CloudFormation CLI, and we will have the link in the show notes. This CLI has a bunch of commands you can run, and the first one you probably want to run scaffolds a new project. You can pick from different languages; TypeScript, for instance, is one of the supported ones. When you do that, it's going to generate quite a bit of code for you, effectively a skeleton you can use to start building the logic. The first thing you will probably want to write is a JSON schema that defines all the properties you want to accept as part of that custom resource. And because it's a JSON schema, it's not just a list of the properties, but also the validation rules that you might want to enforce for every single property.
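A resource provider schema might look roughly like this sketch; the type name and properties are entirely made up, but the overall shape (typeName, properties, primaryIdentifier, read-only and create-only properties) follows the registry's schema format.

```json
{
  "typeName": "MyOrg::Auth::Tenant",
  "description": "Hypothetical third-party tenant resource (illustrative only)",
  "properties": {
    "TenantName": { "type": "string", "minLength": 3 },
    "Region": { "type": "string", "enum": ["eu", "us"] },
    "TenantId": { "type": "string" }
  },
  "required": ["TenantName"],
  "primaryIdentifier": ["/properties/TenantId"],
  "readOnlyProperties": ["/properties/TenantId"],
  "createOnlyProperties": ["/properties/Region"],
  "additionalProperties": false
}
```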

And this is great because, at this point, when CloudFormation deploys these resources it can validate upfront whether you, as a user, are using the resource correctly. Compare that with the custom resources approach, where you have to validate everything at runtime, for instance in your own Lambda code, and fail the deployment of that particular resource if something doesn't look right in the user input. So this is a way to catch errors a little bit earlier in the process, and it probably gives you a better user experience: if something is wrong, it's going to fail very quickly, you can fix the problem and retry. At this point, after you define the schema, you can run other commands (if I remember the process correctly) that generate a little bit more code, like models that you can use in your code to work with the properties provided by the user.

But then, of course, at some point you need to define exactly what the create, read, update, delete, and list logic is, depending on the kind of resource you are working with. For instance, if you are implementing something like a third-party provider for some service, you probably need to use an API or an SDK provided by that third party to interact with the resources that live outside AWS, and implement all of these steps: create, read, update, delete, and list. The scaffolded project gives you placeholders where you can do all of these things, and then, when you feel you are ready, you can call another command from the CLI, called cfn submit, which effectively packages all this code together, ships it to AWS, and makes your resource type available through the registry. At that point, the resource type will have a name and you can easily reference it inside your own templates. What are some of the advantages?

We already mentioned that there are extra features like drift detection, and you can import resources as well. Another interesting detail is that the code is executed by AWS for you. So you don't have to worry as much as you do with custom resources about things like: where is this Lambda going to run? Did I provision enough memory? Is that timeout correct? Is it going to be sufficient for what it needs to do? Or think about networking restrictions, all that kind of stuff. You have fewer concerns because AWS is going to run the code for you. Of course, you still need to provide permissions in some way, and the way you do that is by providing a role that AWS is going to assume for you, and that role constrains what can happen inside the custom logic. So you don't have to worry that this custom resource is going to create a massive EC2 instance that costs you lots of money every month: you can restrict exactly what the role allows and what can happen during that particular execution. There are other advanced features, for instance a concept of hooks. Another advantage is that AWS Config will automatically list all the custom resources that are created through the registry. So it's very convenient: if, at any point in time, you just want to be reassured about which resources came from the registry, you can easily see a list of them, even across multiple stacks. We will have a link in the show notes with all the documentation you need to follow if you want to implement something like this. Of course, the disadvantage is that this process feels a little bit more involved, so that's something to keep in mind. For simple use cases it is probably still simpler to use custom resources; for more advanced use cases, maybe where you need to make those resources available more widely, going with the registry is probably the better approach.

Eoin: Yeah, I was just thinking: I remember a while back there was an announcement from AWS about this AWS Cloud Control API. The idea was that they would provide a new API with create, read, update, delete, and list for all resources, that CloudFormation would be linked to it, and that it would also allow other providers, like Terraform, to quickly get access to new AWS resources without having to do all this work themselves. I haven't heard that much about it since, but I know that HashiCorp released a new provider based on this Cloud Control API. So I was just wondering, as you were speaking: if you publish your resource provider in the CloudFormation registry, would it then be automatically supported in Terraform if you use that provider? I don't know the answer, but I'm wondering if that's a neat side benefit you might get from using this method. Yeah, I don't know the answer either.

Luciano: So we'll bounce it back to our listeners. If you have done something like this, let us know in the comments what your experience was. But it sounds reasonable to assume that it is either something you can do straight away, or that it's easy enough to auto-generate a Terraform provider to do something like that from a custom resource in the registry.

Eoin: If you haven't seen the public registry, you can go into the CloudFormation console and have a look at all the third-party extensions. AWS has some in there for higher-level components, but you also have MongoDB, Atlassian, Snyk, Okta, and Snowflake resources in there. People assume, probably in most cases correctly, that if you're using CloudFormation, it's just for AWS resources, but with this method it doesn't have to be that way. It also means that if you've got a vendor and they don't support CloudFormation for infrastructure as code, you might actually point them in the direction of this episode and the documentation and tell them: get on it. Yeah, that's absolutely a very good point.

Luciano: So let's try to recap our final recommendations based on all the different methods we discussed. I will repeat my suggestion not to use custom scripts or templating unless you really, really have to, maybe because you have a legacy application and you don't have time to rewrite it. If you are building something new, there are probably better ways to do the things you need to do around CloudFormation, and you can definitely use CDK, SAM, or the Serverless Framework. They are great for that higher-level experience, better tooling in general, better syntax, and easier ways to extend and reuse code.

So definitely rely on these tools rather than writing CloudFormation from scratch. Raw CloudFormation is probably fine for simple use cases, but as soon as you start to build real applications, those tools really shine and give you lots of additional benefits that you don't get otherwise. When you need to create custom resources, or you need to somehow extend the code inside the template, there are a few different things you can do. We spoke about macros: macros are great if you want to effectively pre-process a template that is about to be deployed, so you get the entire template as input and you can produce an entirely new template as output.

And generally speaking, you might be adding a few things. Maybe you automatically tag all of the resources based on some internal rules, or you do more advanced things like the ones we did with SLIC Watch to simplify the effort of making applications easily observable and of setting up alarms. The other two things we mentioned are CloudFormation custom resources and the CloudFormation registry. Those are great whenever you want to actually create the concept of a new resource: either something that doesn't exist yet in AWS, maybe because it's in preview, so you have an SDK but you don't necessarily have the corresponding resources in CloudFormation, or something provided by a third party outside AWS. You might build your own custom resources to backfill that and still be able to do infrastructure as code. There are some limitations and gotchas with custom resources that we mentioned, so be aware of those, but they are generally really good when you have something self-contained. If instead you have something that you plan to reuse across multiple applications, or even make available externally, then you should be looking into the registry, because it seems to be a much more complete solution and something that is easier to share, even outside the boundaries of your own company. So that brings us to the end of this episode. I hope you found it informative. One last thing I want to mention: I want to give credit to the Cloudonaut guys. There is a very good podcast episode that they did, I think a couple of years ago, but it's still very relevant. They cover some of the topics we discussed today, and they also mention some examples and use cases that they had. So if you enjoyed this particular episode and you want to find out more, check out that one; the link will be in the show notes. As always, if you found this useful, please share it with your friends and colleagues, and leave us a comment. Let us know what you liked, whether you have any questions, and what else you would like us to cover next. So that's all. Thank you very much for being with us, and we will see you in the next one.