AWS Bites Podcast

126. Bastion Containers

Published 2024-06-28 - Listen on your favourite podcast player

This episode discusses solutions for securely accessing private VPC resources for debugging and troubleshooting. We cover traditional approaches like bastion hosts and VPNs and newer solutions using containers and AWS services like Fargate, ECS, and SSM. We explain how to set up a Fargate task with a container image with the necessary tools, enable ECS integration with SSM, and use SSM to start remote shells and port forwarding tunnels into the container. This provides on-demand access without exposing resources on the public internet. We share a Python script to simplify the process. We suggest ideas for improvements like auto-scaling the container down when idle. Overall, this lightweight containerized approach can provide easy access for debugging compared to managing EC2 instances.

AWS Bites is brought to you by fourTheorem, an AWS consulting partner with tons of experience with AWS. If you need someone to help you with your ambitious AWS projects, check out fourtheorem.com!

In this episode, we mentioned the following resources:

Let's talk!

Do you agree with our opinions? Do you have interesting AWS questions you'd like us to chat about? Leave a comment on YouTube or connect with us on Twitter: @eoins, @loige.

Help us to make this transcription better! If you find an error, please submit a PR with your corrections.

Eoin: Hello and welcome to AWS Bites, episode 126. I'm Eoin and I'm here with Luciano. Today we're going to revisit the topic of secure access to private VPC resources. A while back we talked about SSH, SSM and bastion hosts in episode 78. Recently, though, we've been using a simpler approach that avoids EC2 instances altogether and uses containers instead. By the end of this episode you'll hopefully have all the knowledge you need to use Fargate and SSM as a lightweight, on-demand access mechanism for private resources in a VPC. AWS Bites is brought to you by fourTheorem. If you're looking for a partner to architect, develop and modernize on AWS, give fourTheorem a call, and you can check out fourtheorem.com. Luciano, when we're looking at private VPC access, what problem are you actually trying to address here?

Luciano: Yes, so the problem commonly comes up when you have some resources in a private VPC, so they are not publicly accessible on the internet. At some point maybe you have a bug, or maybe you need to do some manual intervention on a resource, and you effectively need to access that resource, whether that's a database, a server or a container, while it's running inside a private perimeter. Another example that we mention a lot is a Redis cluster: sometimes you just want to run some Redis commands to see whether the state was corrupted, or to work out why a certain bug involving Redis happened. So the problem is how you connect to these kinds of resources to do the investigations or actions you might need to perform on a resource that lives in a private VPC subnet. And generally speaking, there are some traditional solutions.

The most common one, and probably the one that has existed the longest, is setting up a jump box or bastion host, which is basically an EC2 instance that has a public IP. Maybe you protect that instance with a security group that limits the IP ranges that can access it, and then of course you have SSH keys, but it's still an instance with a public IP, so it's publicly reachable on the internet. That instance also has access to the private VPC, so effectively you establish a connection to this machine over the public internet and then use it as a tunnel to reach into the private VPC and connect to the specific resource you want to access. Another approach is using a VPN. And another thing we discussed in a previous episode still relies on the concept of a jump box but makes it a bit more automated and on demand: a CLI tool called Basti. It makes it easy to create that instance, the connection and the tunnel only when you need them, so it creates and destroys them depending on when you need that session. And when you want to use it to access an RDS database, it does a bunch of stuff to make that access even easier for you. So you can see it as an improvement on the classic idea of the jump box or bastion host. All of these approaches are something we discussed in detail in episode 78, so go and check that one out if you missed it. We will have the link in the show notes. These are approaches that work, but hopefully today we're going to be able to present a better solution. Maybe let's start by discussing the challenges with these kinds of approaches.

Eoin: Well, we've definitely seen cases where companies have restrictions around creating EC2 instances. It might just mean there's more governance and procedure around creating instances, operating system security and running agents on them, so there's a lead time in getting EC2 instances up and running and it's not possible to just fire one up. And sometimes it's really hard, or they just don't let you do it, because there's more attack surface area, so there's a bit more of a process around it. As well as that, EC2 instances do require quite a bit of management and maintenance to keep them up and running and up to date. Sure, you can get one up and running reasonably quickly in general, but then you have to think about what happens when you need to upgrade to the newest version of Ubuntu and various other pieces of the stack become deprecated or suffer from some sort of bit rot. For that reason, a lot of people we've spoken to about this kind of challenge would prefer to just run a container. It's generally easier to do and there are fewer guardrails in place. So we've been thinking about this, and a while back we started figuring out ways of using containers as a bastion instead of having to worry about virtual machines at all.

Luciano: Yeah, I'll try to describe how that works. Let's say that you have your private VPC all set up and you have an existing or a new subnet. You set up an ECS cluster, assuming you don't have one already, and then you set up a Fargate service with a task definition, and this is going to be your bastion container. You basically need those two things, the cluster and the service, and then you have that bastion running as a container. But of course you need to make sure that certain conditions are met, so you need to do a little bit of extra configuration. The first thing is that you need to enable ECS Exec for your Fargate service. This basically connects your container to SSM, allowing authorized principals to effectively use SSM as a tunnel to reach into the container. And this is kind of the key here, because the idea is that you don't need to expose anything on the public network: through the AWS control plane and SSM you'll be able to establish a connection directly into your container, so your container is not directly exposed to the open internet.
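To make the setup step concrete, here is a minimal Python sketch that only builds the AWS CLI arguments for creating such a service with ECS Exec enabled, plus the IAM statement the task role needs so the container can open SSM channels. This is not the script mentioned in the episode. The function names and every resource name in the usage note are hypothetical, and the sketch assumes the cluster and task definition already exist.

```python
import json

# IAM actions the task role needs so the ECS Exec agent can talk to SSM.
ECS_EXEC_ACTIONS = [
    "ssmmessages:CreateControlChannel",
    "ssmmessages:CreateDataChannel",
    "ssmmessages:OpenControlChannel",
    "ssmmessages:OpenDataChannel",
]


def task_role_policy() -> dict:
    """Policy document to attach to the bastion task role (hypothetical helper)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": ECS_EXEC_ACTIONS, "Resource": "*"}
        ],
    }


def create_service_args(cluster, service, task_definition,
                        subnets, security_groups, desired_count=0):
    """Build the argv for `aws ecs create-service` with ECS Exec switched on."""
    network = {
        "awsvpcConfiguration": {
            "subnets": list(subnets),
            "securityGroups": list(security_groups),
            # No public IP: the task is only reachable through SSM.
            "assignPublicIp": "DISABLED",
        }
    }
    return [
        "aws", "ecs", "create-service",
        "--cluster", cluster,
        "--service-name", service,
        "--task-definition", task_definition,
        "--launch-type", "FARGATE",
        "--desired-count", str(desired_count),
        "--enable-execute-command",
        "--network-configuration", json.dumps(network),
    ]
```

For example, `create_service_args("debug", "bastion", "bastion-task", ["subnet-aaa"], ["sg-bbb"])` returns a list you could pass to `subprocess.run`. Because the task has no public IP, its subnet needs a route to the ECR and SSM endpoints (a NAT gateway or VPC endpoints) for the image pull and ECS Exec to work.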

The cool thing about SSM is that it allows you to run commands, for instance to create a shell environment within the container, and again this is without exposing anything on the public internet: SSM basically routes those commands through AWS into the container. We have a fuller explanation in episode 78, so again, please check that one out if anything we are saying doesn't make much sense. Hopefully it provides the details that explain what we are about to say today. At this point you'll need to decide which container image to use.

We have the opportunity to run a container but of course what kind of image do we need because depending on the image we pick we will have different tools that we can use. So what do we want to do with this container? Do we want to access a database? Maybe we need a specific database client. We want to access Redis. Maybe we need the Redis CLI installed. We want to just run AWS commands. We need the AWS CLI there. Maybe we need to do some network troubleshooting. We need specific network utilities installed in our Dockerfile. So that's also an important step.

Make sure that you understand exactly what kind of use cases you want to cover and prepare your Dockerfile accordingly, so that you have all the tools you need available as soon as you create a session. Of course, you don't want this service to be always up and running, because you would be paying for a running container that maybe you only use occasionally when you want to debug something. The cool thing with Fargate is that you can create the service with a desired count of zero. That basically means that you have the service already pre-configured but no container is running, so in practice you are not using any compute. And when you need it, you can just bump that count to one. At that point Fargate is going to spin up one instance of that container, and then you can run the commands through SSM to log in or create a shell. And I think this is where, Eoin, you're going to explain how you actually use this container.
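The on-demand part can be sketched as two small helpers that build the CLI calls for switching the bastion on and off and for waiting until the service has settled. The helper names and the restriction to a count of 0 or 1 are assumptions made for illustration, not something taken from the episode's script.

```python
def scale_bastion_args(cluster, service, desired_count):
    """Build the argv for `aws ecs update-service` to switch the bastion on or off."""
    if desired_count not in (0, 1):
        # Assumption for this sketch: a bastion is either off (0) or on (1).
        raise ValueError("desired_count must be 0 or 1")
    return [
        "aws", "ecs", "update-service",
        "--cluster", cluster,
        "--service", service,
        "--desired-count", str(desired_count),
    ]


def wait_stable_args(cluster, service):
    """Build the argv for the waiter that blocks until the service has settled."""
    return [
        "aws", "ecs", "wait", "services-stable",
        "--cluster", cluster,
        "--services", service,
    ]
```

A typical sequence would be to run `scale_bastion_args(cluster, service, 1)`, then `wait_stable_args(cluster, service)`, do the debugging, and finish with `scale_bastion_args(cluster, service, 0)` so nothing is left running.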

Eoin: If you have an EC2 instance and you want to connect to it with SSM, it's reasonably straightforward, because you can just click Connect in the EC2 console and get a shell open in your browser, or you can use the AWS CLI with the Session Manager plugin installed. Then you just run `aws ssm start-session` and you pass in the instance ID for the EC2 instance. With ECS or Fargate it's a little bit different. When you want to create a remote shell in a container, you run `aws ecs execute-command`, pass in your cluster, task and container, and it will start up an interactive session for you on your container.
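As a rough illustration of the ECS side, the sketch below builds the two CLI calls involved: one to look up a running task for the service and one to open the interactive shell. The helper names are hypothetical, and `/bin/sh` is only an assumed default that depends on what your image ships.

```python
def find_task_args(cluster, service):
    """Build the argv that prints the ARN of one running task of the service."""
    return [
        "aws", "ecs", "list-tasks",
        "--cluster", cluster,
        "--service-name", service,
        "--desired-status", "RUNNING",
        "--query", "taskArns[0]",
        "--output", "text",
    ]


def exec_shell_args(cluster, task, container, shell="/bin/sh"):
    """Build the argv for an interactive shell through ECS Exec."""
    return [
        "aws", "ecs", "execute-command",
        "--cluster", cluster,
        "--task", task,
        "--container", container,
        "--interactive",
        "--command", shell,
    ]
```

Both calls need the Session Manager plugin installed next to the AWS CLI on the machine that runs them.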

So that's very handy if you've just got containers you want to debug. Even if you're not using it as a bastion, it's handy if you're trying to debug something running in the container or figure out some problem. Now, once you have this remote shell, you have access to a container that has access to your private VPC resources, without having to expose it publicly on the network. If you need to connect to your database, you can just run your database client in this shell. But what if you don't want to run a shell and instead want to connect something like a graphical database client from your own computer? In that case you're going to need a tunnel that presents a local socket and securely forwards all the traffic to and from the database server on the private network. And with EC2 it's again a similar method, but with ECS AWS actually provides something called an SSM document.

Now, SSM documents are useful for lots of different types of automation on remote servers, but there's a specific one that AWS provides, called AWS-StartPortForwardingSessionToRemoteHost, that anyone can use. If you run this document with SSM, it's going to set up the tunnel for you, and all of a sudden you've got the magic happening that allows you to securely tunnel through from your local machine. The command syntax itself is a bit of a mouthful but, in case anyone is interested, it goes a bit like this. You run `start-session` and you pass in a target, which is composed of your ECS cluster, task ID and container runtime ID. You also pass in a document name, which is `AWS-StartPortForwardingSessionToRemoteHost`, and then you give that document some parameters as stringified JSON: your remote host, the remote port number and the local port number. And that's going to set up one or more port mappings for you between the remote network and your local network. Then you can simply use your database client to connect to Postgres remotely and securely over this local port. And you can set up loads of different port mappings at the same time, so you might have one tunnel for your database, one for a private API Gateway endpoint and one for your ElastiCache. Once you've set this up once, it's quite easy to repeat and make part of your troubleshooting environment. Now, there are probably a few steps there. Is there any way we can codify this all together and make it easier for people?
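Since the command is a mouthful, here is a small sketch of how the pieces fit together. It builds the `ecs:<cluster>_<task-id>_<runtime-id>` target and the stringified JSON parameters (`host`, `portNumber`, `localPortNumber`) that the port forwarding document expects. The helper names are hypothetical, the sketch opens one tunnel per call, and the runtime ID is assumed to come from the `runtimeId` field that `aws ecs describe-tasks` reports for each container.

```python
import json

DOCUMENT = "AWS-StartPortForwardingSessionToRemoteHost"


def task_id_from_arn(task_arn):
    """The task ID is the last path segment of the task ARN."""
    return task_arn.rsplit("/", 1)[-1]


def ssm_target(cluster, task_id, runtime_id):
    """Session Manager target string for a container managed by ECS Exec."""
    return f"ecs:{cluster}_{task_id}_{runtime_id}"


def port_forward_args(target, host, remote_port, local_port):
    """Build the argv for one tunnel from localhost:local_port to host:remote_port."""
    params = {
        "host": [host],
        "portNumber": [str(remote_port)],
        "localPortNumber": [str(local_port)],
    }
    return [
        "aws", "ssm", "start-session",
        "--target", target,
        "--document-name", DOCUMENT,
        "--parameters", json.dumps(params),
    ]
```

To run several tunnels side by side (database, cache, private API), call `port_forward_args` once per remote host with a different local port and start each command in its own process.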

Luciano: Yeah, there is probably room here for a new open source tool, something similar to what Basti does, that removes all the complexity of running multiple commands and passing the right parameters. We can probably apply that same idea to this new approach using containers. But for now we haven't done all of that. We have a simpler version, which is effectively a Python script that we will make available in a gist (the link will be in the show notes), and this should simplify the process, making it easier for you to run the right commands with minimal configuration. It's not a lot of code, so you can probably read it in 10 minutes and really understand what's going on. You should find more or less everything we explained today: just follow the order and see the different commands and how the parameters are wired together. Generally, when we set up a CDK project, we make the script part of our deployment so that it's available when we need to use this particular pattern. And you can automate the entire process by providing this script in a development environment and for troubleshooting. For example, you start a container in Fargate, you create the tunnel, you generate IAM credentials for your database user, assuming you want to access the database, and then you launch something like psql if your database is Postgres. If you have another database server, you need to use the correct client for it and the specific command that client requires to connect. So this is all we can do with this particular script, but is there any room for improvement?
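The last two steps of that flow, generating IAM credentials and launching psql through the tunnel, could look roughly like the sketch below. This is not the gist's code. The helper names are hypothetical, and it assumes an RDS Postgres database with IAM database authentication enabled and a tunnel already listening on a local port.

```python
def auth_token_args(hostname, port, username, region):
    """Build the argv that prints a short-lived IAM auth token for the database.

    The token is generated for the real database endpoint and port,
    not for the local end of the tunnel.
    """
    return [
        "aws", "rds", "generate-db-auth-token",
        "--hostname", hostname,
        "--port", str(port),
        "--username", username,
        "--region", region,
    ]


def psql_invocation(token, username, dbname, local_port):
    """Return (argv, extra_env) for psql pointed at the local end of the tunnel."""
    argv = [
        "psql",
        "--host", "localhost",
        "--port", str(local_port),
        "--username", username,
        "--dbname", dbname,
    ]
    # The token is used as the password; IAM authentication requires TLS.
    extra_env = {"PGPASSWORD": token, "PGSSLMODE": "require"}
    return argv, extra_env
```

A wrapper script would run the first command, capture its output as the token, merge `extra_env` into `os.environ` and then launch psql with `subprocess.run(argv, env=...)`.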

Eoin: Yeah, one thing I thought would be nice to add to this is some automation to shut down the container automatically when the tunnel hasn't been used for a period of time. We've discussed different ways of doing this. One way might be to capture CloudTrail events for SSM StartSession against your Fargate service by sending your CloudTrail trail to CloudWatch Logs. Then you could create a CloudWatch Logs metric filter that counts the number of StartSession events over time, and use those metrics to auto scale the container down when no session has been started for a certain period. Now, you don't necessarily know that it's not being used just because it was started a long time ago, so maybe it's also possible to use SSM session logs, because you can configure SSM to log all of the commands executed over a remote session for compliance. So it might be useful to do that as well, to gain more detailed activity information and determine whether sessions are actually idle. We'd love to get ideas from people, but generally I think this approach is a simpler option. It allows you to get those connections up for private resources, and because it's containerized it might just be easier to manage, keep up to date and switch on and off on demand. I'd love to know whether any of our listeners are using something like this, and if you have any suggestions for improvements or other tools we can use, let us know in the comments. But until next time, thanks a lot for listening and we'll catch you in the next episode.
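One possible shape for the idle shutdown idea is sketched below: the arguments for a CloudWatch Logs metric filter that counts StartSession events in a CloudTrail log group, and a tiny decision function for when to scale down. This is only an idea under stated assumptions. The filter name, metric name, namespace and the 30 minute threshold are all made up for illustration, and the pattern counts every StartSession event in the trail rather than only those aimed at the bastion.

```python
from datetime import datetime, timedelta, timezone

# CloudWatch Logs JSON filter pattern matching CloudTrail StartSession events.
START_SESSION_PATTERN = (
    '{ ($.eventSource = "ssm.amazonaws.com") && ($.eventName = "StartSession") }'
)


def metric_filter_kwargs(log_group):
    """Keyword arguments for the CloudWatch Logs PutMetricFilter call."""
    return {
        "logGroupName": log_group,
        "filterName": "bastion-start-session",       # hypothetical name
        "filterPattern": START_SESSION_PATTERN,
        "metricTransformations": [
            {
                "metricName": "BastionStartSession",  # hypothetical name
                "metricNamespace": "Bastion",         # hypothetical namespace
                "metricValue": "1",
            }
        ],
    }


def should_scale_down(last_session_start, now, idle_after=timedelta(minutes=30)):
    """True when no session has been started within the idle window."""
    if last_session_start is None:
        return True
    return now - last_session_start >= idle_after
```

As noted in the discussion, the start time of the last session says nothing about whether that session is still in use, so a real implementation would probably combine this with SSM session activity logs before setting the desired count back to zero.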