Help us to make this transcription better! If you find an error, please
submit a PR with your corrections.
Luciano: Hello and welcome to episode 132 of AWS Bites. In today's episode, we're diving into GitHub Actions runners on AWS. GitHub Actions is a fantastic tool for automating CI/CD workflows and its runners handle all the heavy lifting. While GitHub offers hosted runners, using self-hosted runners on AWS gives you more control, lower latency and cost benefits. Today we will explore how AWS services like EC2 and CodeBuild can be used to power your GitHub Actions runners and how to set up self-hosted runners to optimize your workflows. I'm Luciano and today I'm joined by Eoin for another episode of AWS Bites. AWS Bites is sponsored by fourTheorem, an advanced AWS partner that works collaboratively with you and sets you up for long-term success on AWS. Find out more at fourtheorem.com. So Eoin, let's maybe start by explaining what GitHub Actions runners are.
Eoin: Well, a lot of people will know GitHub Actions as the fairly powerful automation tool integrated into GitHub that allows you to automate tasks like testing, building and deploying applications directly from their repositories. It's become, I think, the leader. Everybody's rushing to it in some form or another. It allows for custom workflows triggered by events like code commits, pushes, pull requests and releases, making CI/CD processes more streamlined and manageable. There's a lot you can do with GitHub Actions, but GitHub Actions runners are the underlying compute instances that execute the jobs that are defined in GitHub Actions workflows. Now, GitHub itself provides hosted runners, and you will see them used on open source projects and anything on GitHub out there. But users can also set up self-hosted runners, giving them more control over the environment, the software and the hardware. So there are people out there using GitHub-hosted runners and using self-hosted runners. But why would you want to go to the trouble of running runners on AWS?
Luciano: Yeah, before I answer that, one thing I want to mention is that I really like that you said events, because that's really broad. You can capture all sorts of events inside GitHub. So if you use GitHub extensively, maybe you even create, I don't know, software releases, or track comments on PRs. All of these things are events and you can run workflows in response to them. And I've seen lots of very interesting use cases.
Like recently, we used Release Please, which is a tool that helps you automate releases. And it uses all of these different events, when you create a PR, when you merge to main, when people comment on your PRs, to help you automate most of the release workflow. So this is just to mention one more reason, I think, why people are moving to GitHub Actions if they use GitHub as their repo.
And yeah, it integrates well with all parts of the software lifecycle. So to go back to your question, why should you use runners on AWS? We just mentioned that running self-hosted runners means getting more control. And one specific reason to do that on AWS is that, if you're using AWS anyway, you can leverage the IAM permissions model. So for instance, if you are deploying stuff on AWS from your workflows, you can have all of these credentials set up in your runners.
And therefore, when your workflows are running, they are already authenticated, rather than using other mechanisms to propagate credentials to the GitHub-hosted runners. Another reason is that your runners are co-located with your infrastructure. So if you have everything running on AWS, most likely you're going to have lower latency. And yeah, you can effectively use the same regions that you use for your workloads.
And therefore, any action that you perform there, deploying resources, connecting to specific resources you might have available, should be much faster than using the runners hosted by GitHub, which I believe run on Azure. So it might be a totally different network in a totally different region, and therefore you might see increased latency. And another point, which is probably the main one for people, is cost.
Because of course, when you are hosting your own runners, you can find lots of different ways to optimize for cost. When you use the hosted ones provided by GitHub, you get a certain amount of free capacity, but at some point you need to start paying. So when you host your own runners on AWS, you can pick specific instances that are cheaper, you can switch instances on and off depending on capacity, and therefore you have more opportunities to optimize for cost. And one final reason is that sometimes you might have special requirements in terms of hardware or operating system. Maybe you need a GPU to perform certain tasks. So if you have control over the type of machines where you are running your workflows, you can pick the more specialized hardware that you might need. So the next question is, if all of that sounds appealing, how do you do self-hosted runners on AWS?
Eoin: Well, in general, if you want to add self-hosted runners, you can do it at an organization level or for specific repositories. And if you have a GitHub Enterprise level setup, you can also add them at the enterprise level for all organizations. Now, to add a self-hosted runner in your GitHub setup, you need to run the runner application, which GitHub provides. It can run on Linux, Windows and macOS.
And it can run on x86-64 or ARM architectures. Now, if you have existing machines, you can just follow the guided process through the repository or organization runners option in the GitHub console. And there'll be a link to the documentation in the show notes. Once you have that runner application running, GitHub will tell you how to set it up with credentials so that it can talk to GitHub.
It basically long-polls GitHub for build events, all those different event types you talked about, and just starts builds as required, sending the status and log information back to GitHub, as well as pulling and pushing cache data to GitHub. You don't actually have to keep all of those runners running all the time. So you have options there. You can have a small pool running, or even no runners running at all.
GitHub allows you to then scale up by using webhooks to let you know when a workflow job has been queued. In the GitHub docs, you'll see that referred to as ephemeral runners. Now, ephemeral runners have the advantage of generally running just one job at a time. So you have build isolation and increased security, since a short-lived runner has less risk of intrusion and data leakage. So you can run the runner application on EC2, on premises, in containers, or even on a fleet of Raspberry Pis under your bed. The limit is just your imagination. Now, a fairly common approach is to run it on Kubernetes. There is an Actions Runner Controller that helps with scaling and orchestrating runners as needed. That's quite common. If you are a Kubernetes shop and you have a team to run and support that infrastructure, this is a good option to follow. But another common approach is simply running on EC2. So maybe we could talk about that, Luciano. If you're thinking about running on EC2, you just have to run this application. But as we know, running a fleet of EC2 instances and scaling it up and down is not necessarily trivial. What are our options?
Luciano: Yeah, absolutely. I agree that it might be tempting to set up everything from scratch. You set up your own VPC, you set up EC2 instances with auto-scaling rules, but then over time, this is something you'll need to maintain. And there are a lot of moving parts. So it might become something that requires a significant amount of maintenance. You might get many advantages, but on the other side, you need to consider the total cost of ownership that is going to start to build up.
So be aware of all of that. But thankfully, there are some solutions that try to make all of this easier. One is called HyperEnv for GitHub Actions Runner, and it's produced by Andreas and Michael Wittig from the Cloudonaut podcast. We mentioned them a few times in other episodes, so it's definitely worth checking out. This is a product that you can find in the AWS Marketplace. It manages all the scaling in and out of instances for you, but lets you use your own AWS account and runs EC2 instances on demand.
And you pay per vCPU minute: there is a small fee that you pay on top, per vCPU minute of usage. Another option that we discovered quite recently is called RunsOn (runs-on.com). It's quite similar to HyperEnv, but it basically lets you run everything yourself. If you are using it commercially, you buy it and configure a license for a reasonably small fee, and it can also be a nice option for non-commercial projects, especially if you are looking for something that allows you to drastically save cost.
Either of these options may help you get the best of both worlds: you are running and managing your own EC2 instances, but with a limited amount of management effort on your side. And at the same time, it should help you save cost compared to the option hosted directly by GitHub. The other question is, are you willing to pay another vendor? Of course, this is something you need to decide commercially, whether it's worth doing for you. But on the other side, you are buying more control and flexibility. And there is another option, of course, that we mentioned already in the intro, which is using CodeBuild. So what about CodeBuild? Is it a good option or a bad option? Is it easy or difficult to set up?
Eoin: Well, that's a good question. I mean, it's one that really intrigued me when they announced that you could do this last year. First, CodeBuild, for anyone who isn't really familiar with it, because I think a lot of people have given CodeBuild a bit of a pass, is AWS's continuous integration service. It lets you run build and test workloads on lots of different compute types. It's been around a good while and is fairly mature.
It doesn't have the workflow and orchestration support that you get with things like GitHub Actions, GitLab CI, or say CircleCI. But I think CodeBuild is actually quite underappreciated, because it's quite simple in what it does, but it's really powerful. It doesn't let you orchestrate workflows like those other tools do, as I mentioned, but AWS has a separate service for that, which is CodePipeline, and it isn't as good, I would say.
Now, you can check out episode 44, where we did kind of a CodePipeline versus GitHub Actions comparison. But CodeBuild itself is quite powerful, right? It supports Linux and Windows. You can do Android builds. You can do macOS builds. And you can run really large instances, with up to 72 vCPUs and 145 gigs of RAM. You can do ARM and x86 on it. And recently, it has added the ability to run on AWS Lambda under the hood, which can make scale-up much faster if your build is suitable for Lambda.
And it even has GPU support. So it's useful for training and inference in your builds as well. So it doesn't have to be a choice of CodeBuild or GitHub Actions these days. Because CodeBuild now has support for GitHub Actions, you can actually use CodeBuild for your runners and get all the orchestration, all the integration and all of the action support from the whole GitHub Actions ecosystem.
But surprisingly, it's not very complex. In fact, it's a very small amount of work to set this up. And I think people should really try it if they're looking at one of these options and thinking, how can I improve cost and get better integration with AWS in my GitHub Actions? I'd definitely say, give this a try. And we have some help: lots of resources for you to check out. There are plenty of subtle things that can go wrong in the setup.
So we will share a code example that will help here, as well as some repositories that other people have created that will help you get it right first time, not like us. So the first thing you need to do is to configure authorization for CodeBuild to connect to your GitHub account, be it a repo, an organization, or your whole enterprise. Now, there are three ways of doing that. You can use OAuth in the console, you can use a personal access token, or you can use CodeConnections.
I prefer the CodeConnections approach; keeping personal access tokens up to date, securing them and sharing them is something I'd rather avoid. CodeConnections is a method that uses GitHub Apps, and AWS basically automates the process of installing a GitHub App into your GitHub account with the right permissions for the repos. And you can create this with infrastructure as code. So our code example, which we will link in the show notes, will create your CodeConnection.
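As a reference, a minimal CloudFormation sketch of that connection resource might look like this (the connection name here is our own placeholder):

```yaml
Resources:
  GitHubConnection:
    # Note: the resource type still uses the older "CodeStarConnections" name
    Type: AWS::CodeStarConnections::Connection
    Properties:
      ConnectionName: github-actions-runners  # placeholder name
      ProviderType: GitHub

Outputs:
  ConnectionArn:
    Value: !GetAtt GitHubConnection.ConnectionArn
```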
This used to be known as CodeStar Connections, so you will see both terms used in the documentation. Creating the connection resource like this initially leaves it in a pending state. You then need to go into the console and activate it, linking it to your GitHub account with an OAuth flow. And you just need to do this once. If you've already done it, you can just reuse an existing one. So once you have that CodeConnection set up, you can actually create a CodeBuild project.
And the CodeBuild project creation is pretty simple. There are a few gotchas with the documentation, so again, look at our code examples. Hopefully they'll make it easier for you. You just need to set up your source. Normally with CodeBuild projects, you link them to a repo, but you can also make a generic one that isn't linked to a specific repo, but is instead linked to your organization.
And then you just set your authorizer to the CodeConnection ARN you've just created. You need to enable webhook triggers so that it can create GitHub webhooks and link AWS and GitHub together. And then you'll need your IAM role as normal, and that'll need permissions to access CodeConnections, CloudWatch Logs, and whatever else you might need for your scenario. So this is one of the advantages: rather than having to do an OIDC integration or some vault shenanigans, you can just use IAM for your permissions in your CodeBuild job.
And you can still use all those other methods as well, of course. In your build project, you can also set the default compute type: the operating system, container image, the CPU architecture, and whether you're using standard Linux or AWS Lambda. The nice thing here, however, is that you can actually override those values in your GitHub workflow YAML files as well, and switch the instance size and the operating system as needed.
And you can even do matrix builds with all the different CPU architectures and operating systems too. And then you can set a build concurrency within your project. It is worth mentioning that the default CodeBuild concurrency limits are quite low, sometimes as low as one. But that's a soft limit. You just need to go into Service Quotas in the AWS console and request an increase. They're just trying to save you from running up a big bill.
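To make that concrete, here is a hedged CloudFormation sketch of such a runner project. The project name, repo URL and role are placeholders, and the role definition is not shown; our repository in the show notes has a full working CDK version:

```yaml
Resources:
  RunnerProject:
    Type: AWS::CodeBuild::Project
    Properties:
      Name: my-gha-runner  # referenced from the workflow's runs-on label
      ServiceRole: !GetAtt RunnerRole.Arn  # role (not shown) needs CloudWatch Logs
                                           # and CodeConnections permissions
      Source:
        Type: GITHUB
        Location: https://github.com/my-org/my-repo  # placeholder repo
        Auth:
          Type: CODECONNECTIONS
          Resource: !GetAtt GitHubConnection.ConnectionArn
      Triggers:
        Webhook: true  # lets CodeBuild register the GitHub webhook
        FilterGroups:
          - - Type: EVENT
              Pattern: WORKFLOW_JOB_QUEUED  # start one build per queued job
      Environment:
        Type: ARM_CONTAINER
        ComputeType: BUILD_GENERAL1_SMALL
        Image: aws/codebuild/amazonlinux2-aarch64-standard:3.0
      Artifacts:
        Type: NO_ARTIFACTS
      ConcurrentBuildLimit: 20  # remember the account-level soft limit too
```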
Now, the CodeBuild project name you give is important, as this is what links it to your GitHub workflow YAML. So in your workflow, you just set your runs-on property. If you've used self-hosted runners, you'll be used to setting runs-on to something like self-hosted. Here, you just need to use a special syntax: codebuild, then the project name, then a couple of variables that come from your GitHub run, separated by dashes.
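In practice, that label looks something like this (assuming the project from the sketch above is called my-gha-runner):

```yaml
# GitHub Actions workflow running on a CodeBuild-backed runner.
# Label format: codebuild-<project-name>-${{ github.run_id }}-${{ github.run_attempt }}
name: ci
on: [push]
jobs:
  build:
    runs-on: codebuild-my-gha-runner-${{ github.run_id }}-${{ github.run_attempt }}
    # You can also override the project's default image and size per job by
    # appending labels, e.g. (documented AWS example):
    #   ...-${{ github.run_attempt }}-arm-3.0-small
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm test
```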
And then you just use your GitHub Actions as normal, right? You commit this workflow and it should start running directly on CodeBuild. Your actions should look like any other GitHub Actions runs, and your pull requests and deployments are all running in GitHub Actions, but the actual build is running on CodeBuild. And your GitHub Actions caches, as we mentioned, are still there. It might be worthwhile talking about why you'd use CodeBuild for GitHub Actions.
I'd say we talked about some of the reasons for running on AWS, but for CodeBuild specifically, I was generally pretty impressed by it. It's quite simple to set up and gives you a lot of flexibility. Setting up all the stuff on EC2, or managing yet another third-party vendor, might be a bit of friction for you and your organization, so CodeBuild is definitely worth checking out. If you have a VPC in AWS and you want your build environment to access this network and keep everything within that AWS network boundary, CodeBuild is a good way to do that. If you want to use IAM roles for permissions and don't want to use the OIDC credentials flow or something else, that's a good reason to use CodeBuild. Or maybe you've already got GitHub runners and you're just struggling to cope with capacity, so you want some extra on-demand, pay-per-use capacity, and you can just add some CodeBuild into the mix as well. Or you might actually have reserved CodeBuild capacity, because you can create CodeBuild reserved capacity fleets that are cheaper and always available for you, and you could utilize those in your actions as well. Should we describe our experience in general? We have our code repository. What do you think, Luciano? How would you assess it?
Luciano: Yeah, I think we spent some time experimenting with this and getting to understand the limitations. As you said, we have a link to our repo in the show notes, so it's definitely worth checking that out to see exactly what we tried. This repo contains a CDK stack that sets up everything for you, which means it's going to create the CodeConnection and it's going to create two different CodeBuild projects.
One is using a small-size CodeBuild runner and the other one is using Lambda runners with four gigabytes of RAM. So you can effectively see the difference between the two when you run this particular stack, or you can use it as a reference and just pick one or the other and copy that particular piece of configuration into your own specific project. To be fair, getting the configuration right was a little bit challenging.
The documentation isn't too bad, but it doesn't cover all the configuration options, particularly if you want to set it up for an organization and not just for a single repository. I think there are some blind spots there, and we were left to figure out exactly how to do it. Not the end of the world, we eventually figured it out, but it's something that could be improved. So if someone from AWS is listening, please make that a little bit better for future users.
The examples also don't provide good CloudFormation, CDK or Terraform snippets. So it's, as always, a case of going into the web UI and clicking around, which is a good way of starting, but in general, we all know that it's better to do things with infrastructure as code. So it would be nice to start seeing more examples in that direction coming directly from the official documentation.
Of course, the community is always great. Other people have tried things like this already and have shared their own solutions, and you can find examples using Terraform or even CDKTF. So it was easy to just look at those and figure out exactly what we were doing wrong and what we should change in our setup. We will have links to those other repos as well in the show notes.
In terms of performance, what we saw was actually pretty good, but not as fast as you will get with the runners that GitHub always keeps running, especially in the latency to start a new workflow. In general, we saw about 30 seconds of overhead just to start a new run of a specific workflow. We tested with a very simple workflow that just did 60 seconds of sleep, and the minimum end-to-end time that we observed was 90 seconds.
So this is how we noticed that there was some delay outside of the business logic of the workflow itself. Of course, in that delay there is lots of stuff going on, for instance provisioning the ephemeral runner, which is going to add overhead, but 30 seconds feels maybe a little bit too much. Hopefully it's something that you can somehow fine-tune to reduce that latency, because if you have very short-lived workflows, it might be nice to see them start straight away and get a result straight away.
One thing that I do a lot, especially for open source projects, is to have a workflow that just does linting and some other very small checks for code quality, and those checks can run in seconds. So it would be nice to just see a green tick for every commit in a matter of a few seconds. So hopefully that's something worth looking at and maybe something that can be improved. In our case, we tested 100 concurrent jobs on each runner. Most of them took around 90 seconds, but we had high degrees of concurrency, so about 60 of them were running simultaneously, and some of them took over 3 minutes. That's probably because they were left in a queue for a while, but I don't know if you have more details on that, or if we should just talk about pricing.
Eoin: It's very difficult to know, I think, where all that time is going. You can look at the CodeBuild job execution timings as well, and in general they show something like three seconds to provision a runner. I think there's also some communication delay between CodeBuild and GitHub, and I don't know who's responsible, or if it's just that there's a queue and the queue has to drain. Some of it you don't see, but basically the 90 seconds we're reporting is everything in your GitHub Actions workflow, and CodeBuild will often be finished but it still seems to take some time for GitHub to say, okay, your workflow job is finished.
So your job might actually be done, it's just not reported as done. I don't have all the details. Now, we should talk about pricing. It's interesting to compare. It's also quite difficult because you're not always comparing apples with apples, but if you exclude the AWS free tier, which includes a good chunk of CodeBuild, and the free minutes you get with your GitHub subscription, we can do a fairly simple pricing comparison.
If we take a fairly modest 2-vCPU instance and try to pick something that's relatively similar across the different options, maybe looking at the cheapest, which is just on-demand EC2. Now, you could also do spot EC2, of course. Spot is probably your cheapest option, and I think the runs-on tool we mentioned allows you to do spot, as far as I remember. If we take an EC2 medium and then compare it with a CodeBuild small, the price for CodeBuild is about seven times EC2.
But if you compare that to GitHub Actions on a standard 2-core runner, GitHub Actions is about 11 times the cost of EC2. If you go to reserved CodeBuild instances, it's only about 4x the price, and Lambda is not too bad actually. It's even slightly cheaper than CodeBuild standard: about 6.4 times the price of EC2 for a 4-gigabyte Lambda, which is roughly similar in capacity. And you can choose different Lambda sizes.
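As a rough back-of-the-envelope check, here are illustrative us-east-1 on-demand prices (as of the time of recording; always check the current pricing pages), and they line up with those ratios:

```latex
% Assumed prices: EC2 t3.medium (2 vCPU) at \$0.0416/h, i.e. about \$0.00069/min;
% CodeBuild general1.small (2 vCPU) at \$0.005/min;
% GitHub Actions hosted Linux 2-core at \$0.008/min.
\frac{0.005}{0.00069} \approx 7.2 \quad \text{(CodeBuild small vs. EC2 medium)}
\qquad
\frac{0.008}{0.00069} \approx 11.5 \quad \text{(GitHub Actions vs. EC2 medium)}
```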
Now, of course, you're not really comparing the same thing, because with EC2 you're going to have more of a rectangular utilization graph, which often has a lot of waste in it, right? You might end up running your EC2 instance for five minutes in order to run a two-minute job. So there's always going to be a bit of extra waste when you use EC2 compared to using something that's very on-demand and per-second, like CodeBuild or Lambda or GitHub Actions.
And of course, you have the extra overhead with EC2 just in terms of setup and maintenance. It doesn't really make sense to always compare on a cost basis like this, because you may have very bursty demand for your builds: most of the time you've got nothing running, but then you've got a hive of activity when people need lots of concurrency, and maybe you don't want to wait for lots of EC2 instances to spin up and react.
You want things that are really suitable for the specific workflows you need: lots of different compute types, some on Lambda, some on Linux, some on Windows, whatever. So it might just make more sense to use something like CodeBuild or something else. Of course, I'm talking about pricing there. If you're using ARM, it's half the price on CodeBuild. With EC2, it depends on the instance type, but if you can get away with ARM, definitely go for that.
You get a 50% saving out of the box. Now, we talked about running on Lambda, and it's quite interesting when you actually see your GitHub Actions running on Lambda. It's almost surprising to see it just succeed when it's cloning your repo and doing npm install. Even though these are just Node processes, and of course they should run on Lambda, in the testing we did, we didn't come across any major limitations.
In fact, we just took an existing open source project we had, the SLIC Watch project we've talked about before. It's got quite a few build steps in it: code coverage, unit tests, lots of different npm builds, TypeScript transpilation and everything. It just worked. There are some limitations with Lambda, though. You can use custom images, but you can't do things like run as root.
So you can't use package managers like yum and rpm. You can't do Docker-in-Docker, because you don't have privileged mode and it's not a proper Docker runtime. You don't have any file system access outside /tmp. You don't have the GPU option, you don't have caching, and you don't have a VPC. So if you want to use a VPC, you have to use a standard runtime in CodeBuild. But still, like you just mentioned linting, Luciano, that's a perfect example of a workflow that can run very easily on Lambda. And the chances of you having to wait for CodeBuild to provision a container are much lower if you're just using Lambda. So it should be up and running in a few seconds, and you should have your results a few seconds later. Actually, I was using the Biome JavaScript/TypeScript linter in one of these tests, and the whole thing ran in just a couple of seconds.
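As an example, a lint-only workflow like that could target a Lambda-based runner project (assuming a second CodeBuild project, here called my-gha-lambda-runner, configured with Lambda compute, like the one in our CDK stack):

```yaml
# Hypothetical lint job on a Lambda-backed CodeBuild runner project.
name: lint
on: [pull_request]
jobs:
  lint:
    runs-on: codebuild-my-gha-lambda-runner-${{ github.run_id }}-${{ github.run_attempt }}
    steps:
      - uses: actions/checkout@v4
      # "biome ci" runs the formatter and lint checks in CI mode
      - run: npx @biomejs/biome ci .
```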
Luciano: Yeah, Biome is amazing, especially for large projects. It can do all the linting and formatting in literally seconds or less. Anyway, I think we have covered everything we wanted to cover for today. Hopefully, we gave you a good overview of what GitHub Actions runners are and why you should consider running your own self-hosted runners on AWS. And hopefully, we gave you an idea of the different options and pricing comparisons.
So let us know if you found all of that interesting, or if you have other questions about things maybe we didn't cover, or if you tried it yourself. Maybe you know other tools, or you came across other resources and used them to set things up. Please share them with us, because we are happy to compare other options and make all of these resources available to other people. Speaking of which, all the things we mentioned today will have links in the show notes. We will also have a few additional links to other resources that we used during our experiments. So hopefully, all of that together will give you a very good starting point for whichever option you choose, if you end up playing with the idea of running your own GitHub runners on EC2 or Lambda, or maybe using CodeBuild. So thank you very much and we'll see you in the next episode.