AWS Bites Podcast

88. What is VPC Lattice?

Published 2023-07-07 - Listen on your favourite podcast player

In this episode of AWS Bites, we discuss VPC Lattice, a new service in the Salad Bowl of AWS Networking. We cover all the concepts, applications, and exciting possibilities for VPC Lattice and share tips on how to use it effectively.

We talk about reducing friction between network admin and dev teams and how VPC Lattice can be a game changer for traditional and serverless workloads.

Get ready for some greens and don't miss this informative episode of AWS Bites!

AWS Bites is sponsored by fourTheorem, an AWS Consulting Partner offering training, cloud migration, and modern application architecture.

In this episode, we mentioned the following resources:

Let's talk!

Do you agree with our opinions? Do you have interesting AWS questions you'd like us to chat about? Leave a comment on YouTube or connect with us on Twitter: @eoins, @loige.

Help us to make this transcription better! If you find an error, please submit a PR with your corrections.

Eoin: VPC Lettuce is a new service in the salad bowl of AWS networking that makes it easy for developer and admin teams to set up networking between workloads. We have been taste testing Lettuce and are ready to leaf you with all the knowledge we learned. So, Romaine calm and hear about how it can rocket your networking setup to new levels. We'll share plenty of tips so you don't hit any icebergs on your journey.

Luciano: Yeah, okay. Oh, and sorry, I have to stop you there. These puns are excellent, but it's called VPC Lattice, not Lettuce. I don't know if you are aware.

Eoin: Lattice? Really? Okay. But I spent ages working on that intro. I'm not going to redo it.

Luciano: I like it. Let's keep it.

Eoin: Okay. Lettuce, Lattice, whatever. Right. Well, whatever. We're going to talk about all the concepts, applications for VPC Lattice, apparently, and how this is a game changer for traditional and serverless workloads. I'm Eoin. I'm here with Luciano and this is the AWS Bites podcast. AWS Bites is made possible by the support of fourTheorem, an AWS partner that works with you to make migration, architecture, and development a success. See fourtheorem.com to find out more.

Luciano: So, let's start with this question. What is VPC Lattice?

Eoin: Well, now that we know what it's apparently really called, Lattice is a service that's really designed to make inbound and outbound east-west connectivity between services and applications possible with a zero trust approach to authorization. So, when we talk about east-west connectivity, we're talking about horizontal connectivity between services and applications within a workload or a set of workloads and not necessarily like public-facing APIs or say API gateways down to services in the backend, which would be north-south communication.

And it's designed to work in single or multiple AWS accounts, and it's really focused around minimizing the amount of network configuration you do. So, we talked recently about whether VPCs were necessary for serverless developers. If you're somebody who's allergic to networking as a developer and would rather get away from VPC routing and security groups, or if you're a network admin who has to liaise with development teams and always have this back and forth to get the network setup right, this is another area where Lattice would really help. And there's some really nice things with it. Like, you'll never have any issues with overlapping IP ranges and CIDR blocks. So, that's something. And you get high-level or fine-grained access control support as well. If you're in the field of microservice communication with service meshes and sidecars and all of that stuff, VPC Lattice also aims to take all of that away as well so you can focus on the actual workload and simplify that communication. It also supports things like traffic control, load balancing, and path-based routing as well.

Luciano: That's pretty cool. I can definitely resonate with some of the problems you described there. And it's exciting to see that this service is effectively trying to solve most of these issues and giving us a new tool that we can just use to be more efficient, more effective. But I'm wondering, this service needs to integrate with other services, of course. So, what does that look like? Is it going to be available only for one service or it's already quite widely available and we can use it for all sorts of integrations?

Eoin: It's widely available and what you do with it is really up to you. The use cases are kind of the communication between microservices, like we said. But also, if you just want to have a mechanism to support private APIs within an organization, that's possible too. The nice thing there actually is that you could do custom domains in a much easier way than API Gateway. If you've got things like migration in your workloads and that's part of your plans and you want to modernize over time and maybe switch, have an API that's backed by an EC2 or a container, then switch it to containers or Lambda, it supports that as well without having to go through lots of network reconfiguration. So, it integrates with, if you've got existing VPCs, it will integrate with those. It integrates with EC2 instances. It'll integrate with anything with an IP address, actually. And it also works with ECS and Kubernetes as well, including EKS. The special feature here, and I think one we're particularly interested in, is that it also works well with Lambda. So, it can trigger Lambda functions, even Lambda functions that aren't running in a VPC.

Luciano: The way you're describing this makes me think about other services that have been in AWS for a while, like, I don't know, VPC peering connection, Transit Gateway, PrivateLink, and even just doing your own routing tables or other stuff like that. So, how does it compare? Like, why should this be better than these other solutions?

Eoin: All of those things you mentioned are pretty much the traditional way of doing this East-West communication. First, we had VPC peering, and then came Transit Gateway, which enabled the same sort of routing across different VPCs and different accounts in a much more scalable way. In the container world, you've got lots of approaches around service discovery and service meshes. The whole idea there is that you end up with quite a lot of configuration, and you've got this kind of split responsibility between the admin teams and the development teams. So, I suppose if there's one message you take away from this session about Lattice, it's really that it's trying to address that friction between admin and dev teams, and allow the admin teams to focus on centralized access control and monitoring and the devs just to launch the service, to create private VPCs that they control, and be able to provide and consume services then.

Luciano: Nice. Okay. So, what are some of the main concepts when you start using it? Like, what does it look like? What do you need to... Which terms do you need to start structuring the usage of this product?

Eoin: I think the concepts are pretty simple, really, and there's two main ones. The fundamental building block is a service, a VPC Lattice service. So, this is something that's going to be backed by IP addresses, EC2, Lambda, containers. And this is the thing that's usually owned by the dev team. So, I mean, the ownership can be changed from organization to organization, but I think the basic idea here is that the dev team owns the service, and then they govern its domain, all of the APIs within it, what services are backing it, all of that is controlled by that team. So, it makes it very agile from the team's perspective. And the service supports custom domains. Then the service is kind of grouped within the next concept, which is the Service Network. And the Service Network is usually the thing that's owned and controlled by the network admins. And this is essentially a logical control plane that groups VPCs and the services from different teams, and you can put IAM policies on it as well. And this is getting into the zero trust approach, which we can dig into. So, the dev team can put an IAM policy on their service if they want to, and the admin team can put a network policy on the Service Network. So, it allows you to have control at two levels. And it's also possible with Resource Access Manager, or RAM, to share both the service and the Service Network with other accounts. So, you can have the approach where the admin team creates a Service Network, shares it with dev teams, and then they create their services within it, or vice versa. Now, within a service, then it starts to look a little bit like load balancer concepts. So, within a service, then you have a listener, which supports HTTP, HTTPS, and gRPC. So, there's no kind of TCP or UDP networking possible. It's really just like an application load balancer.
And again, like an application load balancer, it has target groups, and then in the target groups, you can have your IP addresses, instances, Lambda functions, containers, load balancers as targets. And then you have your rules, just like load balancer rules, where you've got prioritized path-based routing and that sort of thing. All of those things we mentioned, like IP, EC2, ECS, they need a VPC anyway. But Lambda, you don't always configure with a VPC. And actually, to provide a service that's backed by Lambda, you don't need to associate it with a VPC at all, but it can still be triggered by Lattice. So, you only need a VPC when you're actually consuming a service through VPC Lattice.

Luciano: That's pretty cool. So, I imagine that behind the scene, AWS is taking care of routing all this traffic correctly for you, pretty much. So, tell me a bit more about how it works. You described some potential models to start to use this in a company, but maybe we can clarify more what are some of the potential patterns.

Eoin: So, if you're starting from scratch, you can imagine the admin team might first create the Service Network and specify an authorization policy on that. You can actually specify security groups as well if you want to do network-based control as well as the zero trust IAM approach. Once you have that Service Network, they would share it with RAM, and they could share it with individual accounts or with the whole organization. And then dev teams will see it in their AWS console and can reference it in their SDK or infrastructure-as-code templates. So, the dev team could then create a service and then make the association between the service and the Service Network. And that's what gets it to join this networking conversation. They can also specify their policy if they want. And then if you're consuming a service, basically you have a VPC because you have to have a VPC to consume a Lattice service, and you just associate that VPC with the Service Network as well. So, once you've got the RAM share through the Resource Access Manager with your account, you can make that VPC association. And then any of the consumers running in the VPC can invoke all of the services that are associated with the Service Network, of course, provided that the authorization policy allows. And that can be really fine-grained or coarse-grained, whatever you need. The service consumer will then use DNS to discover and invoke the service. So, they're using HTTPS with a domain name, and you've got two options there. You can use the Lattice-generated domain names that it creates for you, and they're always there, and they're global. Or you can actually specify a custom DNS name, and those names will resolve too within your private DNS. And you can use public DNS or private DNS for that, and then invoke the service. So, I guess the two points there to remember are that to consume a service, you need to be in a VPC associated with the Service Network. 
And to provide a service, you don't necessarily need to be in a VPC, but your service does need to be associated with the Service Network.
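The setup order described here can be sketched with the AWS SDK for Python. This is a minimal sketch under stated assumptions, not the code from the demo repository: the functions expect boto3 clients for the `vpc-lattice` and `ram` services to be passed in, and every name (`shared-network`, `orders`, the principals and VPC ID) is hypothetical. Target groups, listeners, and auth policies are left out to keep the three roles visible.

```python
def admin_setup(lattice, ram, share_with):
    """Networking account: create the Service Network and share it via RAM.

    `lattice` and `ram` are assumed to be boto3 clients, e.g.
    boto3.client("vpc-lattice") and boto3.client("ram").
    """
    network = lattice.create_service_network(
        name="shared-network",  # hypothetical name
        authType="AWS_IAM",     # turn on IAM authorization
    )
    ram.create_resource_share(
        name="shared-network-share",  # hypothetical name
        resourceArns=[network["arn"]],
        principals=share_with,        # account IDs or an organization ARN
    )
    # Other accounts refer to the shared network by ARN rather than by ID.
    return network["arn"]


def provider_setup(lattice, service_network_arn):
    """Dev account providing a service: create it and join the network."""
    service = lattice.create_service(name="orders", authType="AWS_IAM")
    lattice.create_service_network_service_association(
        serviceNetworkIdentifier=service_network_arn,
        serviceIdentifier=service["id"],
    )
    return service["id"]


def consumer_setup(lattice, service_network_arn, vpc_id):
    """Dev account consuming services: associate its VPC with the network."""
    lattice.create_service_network_vpc_association(
        serviceNetworkIdentifier=service_network_arn,
        vpcIdentifier=vpc_id,
    )
```

Only the consumer step needs a VPC, which matches the point above: providing a service requires a service association, consuming one requires a VPC association.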

Luciano: That makes sense. So, it seems that you have a lot of freedom in terms of defining the access control rules. And I don't know, is there anything there that we want to deep dive on and just provide a bit more background?

Eoin: Yeah, we can give a couple of examples. So, if we're saying that Lattice is really for kind of private internal APIs within one or a set of AWS accounts, you're imagining that the boundary is usually within an AWS organization. Now, consumers will have to be in a VPC that is associated with the network. So, it doesn't have to be within your organization, but you would have to share your Service Network with another AWS account. It could be a third party one, and then they can communicate. You could also use the policies to restrict who can access it. You can choose no authorization, and that's a valid approach. And then you could just say, okay, well, let's not share this Service Network with anybody outside the organization. Okay. But you might want to say, okay, let's be a little bit more careful about that and turn on IAM authorization on the Service Network and the services and perform some stricter checks there. So, those policies are optional, but they can be applied at both the Service Network and the service level. So, a couple of examples then, a Service Network policy might say, only allow principals from the AWS organization. So, you could say allow star resources, but the condition is that the principal org ID is my AWS organization.

And that would mean that if somebody accidentally shared a Service Network or a service with a third-party account, that they wouldn't be able to invoke your service because they don't have the organization in their principal. And then in the service policy itself, you could say, only allow principals with a principal tag on their identity or restrict them to HTTP GET requests or even restrict them by IP address. So, you can get very fine-grained and specific.
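The two policy levels described here might look roughly like the following. This is a sketch, assuming the `vpc-lattice-svcs:Invoke` action, the global `aws:PrincipalOrgID` and `aws:PrincipalTag/...` condition keys, and the `vpc-lattice-svcs:RequestMethod` condition key; the organization ID and the `team` tag are hypothetical placeholders, and the exact keys are worth checking against the VPC Lattice auth policy documentation.

```python
import json

ORG_ID = "o-exampleorgid"  # hypothetical organization ID

# Service Network level: allow any caller, but only from our organization.
service_network_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": "*",
            "Action": "vpc-lattice-svcs:Invoke",
            "Resource": "*",
            "Condition": {"StringEquals": {"aws:PrincipalOrgID": ORG_ID}},
        }
    ],
}

# Service level: only tagged principals, and only GET requests.
service_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": "*",
            "Action": "vpc-lattice-svcs:Invoke",
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "aws:PrincipalTag/team": "payments",  # hypothetical tag
                    "vpc-lattice-svcs:RequestMethod": "GET",
                }
            },
        }
    ],
}

# Auth policies are attached as JSON documents.
network_policy_json = json.dumps(service_network_policy, indent=2)
service_policy_json = json.dumps(service_policy, indent=2)
```

A request has to pass both policies, so the coarse organization boundary can stay with the admin team while each dev team tightens access on its own service.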

Now, I think at this stage, it should sound like it's fairly simple to set up because we don't have any routing tables. We don't have any VPC peering, no Transit Gateway. So, it's all fairly straightforward. The main thing, I guess, from an IAM authorization point of view, and it might be obvious, but when you try it, you'll need to enable AWS Signature Version 4 (SigV4) signatures on the requests because otherwise you won't be able to pass IAM authorization. So, the first thing it does when you invoke a URL against one of these services, if you've got auth turned on, is an IAM check just like it would with any other AWS service. So, you need to have a signature with the service name, which is vpc-lattice-svcs (VPC Lattice services), in the scope of your credentials. It doesn't support payload signing. So, you have to explicitly set the header that says there's no payload signing (x-amz-content-sha256 set to UNSIGNED-PAYLOAD). If this sounds a little bit complicated, don't worry because we do have a full code example with a complete Lattice setup and also some client code for invoking services as well. So, it should be easy at that point to see how it works.
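To make the signing requirement concrete, here is a hand-rolled sketch of SigV4 header signing using only the Python standard library. It is not the client code from the demo repository, and in practice an SDK signer is the safer choice. It assumes the signing name `vpc-lattice-svcs` and the `x-amz-content-sha256: UNSIGNED-PAYLOAD` header; the function name `sign_headers` and its parameters are made up for illustration.

```python
import datetime
import hashlib
import hmac
from urllib.parse import parse_qsl, quote, urlparse

SERVICE = "vpc-lattice-svcs"   # signing name used in the credential scope
UNSIGNED = "UNSIGNED-PAYLOAD"  # stands in for the payload hash


def _hmac(key, msg):
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def sign_headers(method, url, region, access_key, secret_key,
                 session_token=None, now=None):
    """Return the headers to send with a SigV4-signed request."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    date = now.strftime("%Y%m%d")
    parsed = urlparse(url)

    headers = {
        "host": parsed.netloc,
        "x-amz-content-sha256": UNSIGNED,
        "x-amz-date": amz_date,
    }
    if session_token:
        headers["x-amz-security-token"] = session_token
    names = sorted(headers)

    # Canonical query string: percent-encoded pairs, sorted by name.
    query = "&".join(
        f"{k}={v}" for k, v in sorted(
            (quote(k, safe="-_.~"), quote(v, safe="-_.~"))
            for k, v in parse_qsl(parsed.query, keep_blank_values=True)
        )
    )
    canonical_request = "\n".join([
        method.upper(),
        quote(parsed.path or "/", safe="/-_.~"),
        query,
        "".join(f"{n}:{headers[n]}\n" for n in names),
        ";".join(names),
        UNSIGNED,
    ])

    scope = f"{date}/{region}/{SERVICE}/aws4_request"
    string_to_sign = "\n".join([
        "AWS4-HMAC-SHA256",
        amz_date,
        scope,
        hashlib.sha256(canonical_request.encode("utf-8")).hexdigest(),
    ])

    # Derive the signing key: date, then region, service, and terminator.
    key = _hmac(("AWS4" + secret_key).encode("utf-8"), date)
    for part in (region, SERVICE, "aws4_request"):
        key = _hmac(key, part)
    signature = hmac.new(
        key, string_to_sign.encode("utf-8"), hashlib.sha256
    ).hexdigest()

    headers["authorization"] = (
        f"AWS4-HMAC-SHA256 Credential={access_key}/{scope}, "
        f"SignedHeaders={';'.join(names)}, Signature={signature}"
    )
    return headers
```

The returned dictionary can be passed as request headers to any HTTP client. Because the payload marker replaces the body hash, the same function works for requests with and without a body.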

Luciano: That's interesting. So, I imagine that one of the trade-offs is that from a development perspective, every time you are doing the call, you need to add that extra bit of code, making sure that your requests are properly using SigV4 and adding the signature correctly, which I don't know. I've done that in different languages and it's always a bit of a hit and miss. In some languages, it is easier than others because maybe you have some libraries that can make most of the stuff easier. In other cases, you end up implementing some of it and it's very easy to get something wrong and then you spend hours and hours troubleshooting it. So, maybe that's an interesting trade-off to keep in mind. You mentioned that this is something you can use basically in a very free way. You can organize your teams in different accounts. So, what would that look like? Is that something we recommend to do? Is it more complicated or it's just seamless with Lattice?

Eoin: This is really, I guess, where it really shines actually in cross-account because of the lack of routing and everything. Once you have your Service Network set up, the process of sharing it with RAM is quite easy. You can do it in a single line of SDK or a single resource with CloudFormation or with the console even and the Service Network then just automatically appears in all of the accounts that you've shared it with and then they can quickly create the association with their service. So, the order is the admin would create and share that Service Network with RAM with specific accounts or users or roles or the whole organization, and then the dev team just sees it, associates it with their service, and associates the VPCs they consume from with the Service Network, and that's it. All of the things can talk to each other at that point. So, it's really, really seamless and I can imagine as well, once you've got this set up for the first time, you can start scaling to like hundreds of services really quickly and each of them has a DNS name and the process of communication is really easy. You can have conventions around your policies and what needs to be in there and it just happens a lot quicker than the typical setup when you've got like all these teams trying to coordinate and make sure you don't have overlapping IP ranges and all that kind of stuff.

Luciano: How does the routing work other than, I mean, you described the DNS mechanism, but do you need to explicitly configure anything or it just happens out of the box?

Eoin: This is the other piece where Lattice is completely different to everything else because it's got a very special mechanism for routing. So, like we say, it doesn't have any routing tables as such. So, when you associate your VPC with a Service Network, those VPCs will automatically then resolve DNS names for Lattice services to a link-local IP address. So, you might have come across link-local IP addresses in other places. If you've ever used the EC2 metadata service, it starts with a 169.254 IP, and these are special IP ranges in the IP spec that are not routable. So, they're only valid on a local host.

But Lattice is essentially using these as kind of a door into the hyperplane infrastructure where they do all of the network virtualization. You've got this VPC control plane that we've already described where you create these logical constructs like services and Service Networks, but in your VPC, there's now a special VPC Lattice data plane and these link local addresses are the door into that data plane. So, when you do a DNS lookup on a Lattice DNS name, you'll end up with one of these link local IP addresses. These DNS names are global, so anyone in the world, if they know the DNS name, they can look up the address, but it's completely meaningless outside of your VPC. The other beautiful thing about this is that Lattice doesn't consume any of your IP ranges. You can imagine if you had a sidecar or a proxy, it needs to have an IP address and needs to do some proxying and routing. That doesn't happen here. It just automatically goes into this Lattice data plane. Lattice works within and across accounts, but it's always within a single region. So, there's no multi-region possibility or cross region propagation. Cross region routing is something that's really kind of hardcore networking anyway. So, it's not really in the domain of inter-application East-West communication.
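A quick way to see what "link-local" means in practice is to test an address with Python's standard `ipaddress` module. The sketch below checks the well-known EC2 metadata address against a typical private VPC address; `resolves_to_link_local` is a hypothetical helper that would only give a meaningful answer for a Lattice DNS name when run inside a VPC associated with the Service Network.

```python
import ipaddress
import socket


def is_link_local(address: str) -> bool:
    """True if the address falls in a link-local range such as 169.254.0.0/16."""
    return ipaddress.ip_address(address).is_link_local


def resolves_to_link_local(hostname: str) -> bool:
    """Resolve a hostname over IPv4 and check every address it returns."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    return all(is_link_local(info[4][0]) for info in infos)


metadata_is_link_local = is_link_local("169.254.169.254")  # EC2 metadata service
private_is_link_local = is_link_local("10.0.1.25")         # ordinary VPC address
```

Here `metadata_is_link_local` is true and `private_is_link_local` is false, which is why a Lattice address needs no space in your own CIDR blocks.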

Luciano: So, how does that work instead with Lambda? Because you mentioned before that it is possible to effectively send the traffic to Lambda, but we also know that Lambda is totally event-based, doesn't have a concept of listening in a port, for example. So, how did they make that integration happen?

Eoin: VPC Lattice actually is a new trigger type for Lambda. So, I think it's the fourth HTTP-based synchronous triggering mechanism for Lambda. So, you've already got Application Load Balancers, API Gateway, and you've got Function URLs. Actually, fifth, because I forgot about AppSync as well. So, now you've got this fifth one. So, it's a new event trigger for Lambda, and if you invoke a service, the VPC Lattice data plane is going to do that synchronous triggering of the function for you. Now, the payload is similar to, but actually different from, API Gateway or Application Load Balancer. So, it looks similar, but you'll have to parse it slightly differently. Just because it's VPC Lattice doesn't mean that the Lambda functions have to be running in a VPC with subnets configured. You only need Lambdas to have a VPC if they're going to consume other services from Lattice.
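To illustrate the "similar but different" payload, here is a minimal Lambda handler sketch. The snake_case field names (`raw_path`, `method`, `body`, `is_base64_encoded`) are an assumption based on the original Lattice event format and should be checked against the current Lambda documentation, since a newer event version uses different names. The response shape follows the familiar load balancer style.

```python
import base64
import json


def handler(event, context):
    # Assumed V1-style Lattice event fields; note the snake_case keys,
    # unlike the camelCase keys in API Gateway or ALB events.
    method = event.get("method", "GET")
    path = event.get("raw_path", "/")
    body = event.get("body") or ""
    if event.get("is_base64_encoded"):
        body = base64.b64decode(body).decode("utf-8")

    return {
        "statusCode": 200,
        "headers": {"content-type": "application/json"},
        "body": json.dumps({"method": method, "path": path, "received": body}),
        "isBase64Encoded": False,
    }
```

The function itself needs no VPC configuration to be a Lattice target; it only needs one if it also calls other Lattice services.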

Luciano: Okay, what about in terms of observability? Because if you use regular VPCs, you're probably going to have to use logs, but what are the options here?

Eoin: One of the things that network admins are going to be really happy about with VPC Lattice is the fact that you can create a log group in CloudWatch on the Service Network level, and you get logs for all of the traffic through Lattice. You get everything. You get your HTTP request, and then you can see who is calling, what IP address are they coming from, what's their principal identity ARN, what Service Network are they coming from, what service are they calling, what's the target group. All of that information is in the logs. So it's really nice that you've got one central log with all this east-west traffic in it. This is like a one line configuration. I'm sure a lot of network engineers have spent months configuring great observability for this kind of communication in the past. So I think this is one area which will really sell it to a lot of people. You also then get CloudWatch metrics, but those aren't at the Service Network level; you get them per service and per target group.

Luciano: So I am almost sold, but before opening my wallet, I want to know what's the cost. So let's talk about pricing.

Eoin: Before I do talk about pricing, actually, one of the things is that somebody asked us earlier on, does Lattice support VPC Flow Logs then? Because if you've got this CloudWatch log, is it possible to do Flow Logs? And after they asked, we went and tried it, and I was surprised to see that they actually do, because I thought you'd only see a flow log between two VPC IPs and not these link-local IPs, but you can turn on Flow Logs for any VPC that's connected to a Service Network, and you'll still get all of the flows with the link-local addresses in there. Nice. So I've held off on talking about pricing for long enough. Let's go for it. So there are three pricing dimensions. You've got per service, and it varies a bit per region, but let's look at us-east-1.

At the moment, it's two and a half cents per hour per service. So that sounds fine if you've got a few services. If you've got hundreds of services, which you can do with Lattice, you can see that adding up pretty quickly. Then you pay 2.5 cents per gigabyte as well. And there's another dimension, which is per request, and you pay 10 cents for a million requests. I think most people would be focused on the first two, really. And if you compare it to Transit Gateway, with Transit Gateway you pay 10 cents per attachment per hour, and then two cents per gigabyte. So it's more expensive in one dimension, less expensive in the other. But comparing it to VPC Peering, I mean, VPC Peering is completely free. You could also compare it to just your existing service mesh setup in your applications, which you might've spent a lot of engineering effort on. And for that reason, now I see AWS talking about using Lattice for microservice communication, but if you're one of the companies that's got hundreds, thousands of microservices, I can imagine that Lattice could be a bit expensive. Despite that pitch, I think it might be a bit more palatable for service to service or application to application communication than just microservices. I also think looking at how much it simplifies networking, and if you compare that to the engineering cost and the opportunity cost of all the engineering effort and the interaction between teams, it might actually be a very valuable trade-off to look at Lattice if you can really go all in on it, especially, and get rid of a lot of that engineering effort.
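The three pricing dimensions quoted here can be turned into a rough monthly estimate. This is back-of-the-envelope arithmetic using the figures mentioned in the episode, assuming a 730-hour month and ignoring any free allowances or regional differences; the traffic volumes are invented for illustration.

```python
HOURS_PER_MONTH = 730  # assumption: average month length in hours


def lattice_monthly_cost(services, gb_processed, requests,
                         per_service_hour=0.025,
                         per_gb=0.025,
                         per_million_requests=0.10):
    """Estimate a monthly VPC Lattice bill in USD from the three dimensions."""
    hourly = services * per_service_hour * HOURS_PER_MONTH
    data = gb_processed * per_gb
    reqs = requests / 1_000_000 * per_million_requests
    return round(hourly + data + reqs, 2)


# Same (made-up) traffic, growing number of services.
estimates = {
    n: lattice_monthly_cost(n, gb_processed=1000, requests=50_000_000)
    for n in (5, 100, 1000)
}
```

With these assumptions, 100 services come to roughly 1,855 USD a month, of which 1,825 USD is the hourly per-service charge. That is the dimension that dominates once you have hundreds of services.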

Luciano: Okay. That's really cool so far. You mentioned we have a demo application that is available in our repository that we will share in the show notes. Do you want to describe a little bit what's the idea for that particular demo?

Eoin: This is what we use to explore and learn about Lattice and it's a multi-account setup. So it has everything we talked about. There's a networking setup for a networking account where you create the Service Network, share it with RAM. You've got your centralized logs, and then we've got a kind of assumption that you've got an existing registered domain, a public one, and a public hosted zone in that networking account.

And then we've got two other accounts, account A and account B, we call them, and that's where the two different services run. And we've got Lattice services there. One of them is quite simple. It's just got a Lambda function at the back. And the other one uses weighted traffic routing so that half of it goes to another Lambda function and half of it goes to a Fargate service. And interestingly with Fargate, the way you integrate Lattice with it at the moment is by using a load balancer. So you don't route to individual tasks or containers.

You use a load balancer in front of it, which is, you know, it simplifies it in some ways, but the negatives with that are that you still pay for the load balancer traffic as well as Lattice. It just seems like an extra resource you don't need given the fact that Lattice supports all this kind of stuff anyway. So maybe in the future, we'll see an improvement there, some additional ability, because with Kubernetes, there is a gateway controller that AWS have provided that automatically creates Service Networks and services and creates IP addresses in the target groups in Lattice for you. So it's a very different approach. Anyway, back to the demo application.

This other service is going to route traffic between Lambda and Fargate. And the Lambda function is actually going to go through Lattice and invoke the other service in the other account as well. So it's kind of got the whole thing set up. And then we've got in that CDK application for the demo, we've also got like an EC2 instance and a VPC in the networking account that you can manually hook up to the Service Network. And then there's a client with the signature set up all in there that you can use to invoke one of the services. And you can see the traffic that comes back. You can see if you keep invoking it, that it'll hit sometimes the ECS container, then it'll hit the Lambda target. And the Lambda target will then talk to the other service and you can see the traffic coming back and you can explore the logs and you can look at the flow logs and all this kind of stuff. So check it out. We'll give the link to that in the show notes. Anybody who's got any interesting questions on that repo or improvements as well, we'd love to hear from you.
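The 50/50 split between the Lambda target and the Fargate service behaves like weighted random selection. The simulation below only illustrates that idea; it is not how Lattice implements routing internally, and the target group names are hypothetical.

```python
import random
from collections import Counter


def pick_target(weighted_targets, rng):
    """Pick one target group name with probability proportional to its weight."""
    names = list(weighted_targets)
    weights = [weighted_targets[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


# Hypothetical target groups with equal weights, as in the demo's 50/50 split.
targets = {"lambda-target-group": 50, "fargate-alb-target-group": 50}

rng = random.Random(42)  # seeded so repeated runs give the same counts
counts = Counter(pick_target(targets, rng) for _ in range(10_000))
```

Over many requests the counts settle near an even split, which is why repeated calls to the demo service alternate between the ECS container and the Lambda target without a strict pattern.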

Luciano: That's awesome. So I guess that's a great introduction to Lattice. I'm going to try to quickly summarize the use cases that you described. So communication between microservices, private APIs, east-west communication, app modernization, because you can easily switch over from the old endpoint to the new endpoint in a transparent way. And in general, it seems that the main selling point is that it kind of reduces all that communication friction and ownership friction between admin teams and development teams. So it kind of defines very clear integration points between the two teams and then there is a lot more freedom to operate. So that can be a very compelling reason to use this service. I think we also have a few resources that we collected while doing all of this research. There is a Serverless Office Hours session, which was really interesting and we're going to have the link in the show notes. And then there are a few other articles that we are going to link, including the official AWS documentation. So check out the show notes for more links. And with that, I think that's everything for this episode. We are really excited, as usual, to know your perspective. If you've used it, what do you think? Do you actually have a real use case in production? Tell us about that and that way we can compare opinions and learn something more together. So thank you very much and we'll see you in the next episode.