AWS Bites Podcast

22. What do you need to know about SQS?

Published 2022-02-03 - Listen on your favourite podcast player

Luciano and Eoin take a deep dive into SQS as part of a series on AWS event services and event-driven architecture. We talk about the kind of problems SQS can solve, all of the SQS features and how to configure and use SQS to achieve reliability and scalability without all the complexity. We also take some time to detail how SQS works with Lambda in terms of scaling, batching and filtering.

In this episode we mentioned the following resources:

Let's talk!

Do you agree with our opinions? Do you have interesting AWS questions you'd like us to chat about? Leave a comment on YouTube or connect with us on Twitter: @eoins, @loige.

Help us to make this transcription better! If you find an error, please submit a PR with your corrections.

Eoin: Hello, today we are going to answer the question: how do you use SQS? By the end of this episode, you will know when to use an SQS queue and we'll give some example use cases. We'll talk about the main features of SQS, how to send and receive messages, how to customize the configuration of a queue, and we'll also talk a lot about the integration between SQS and Lambda. My name is Eoin, I'm joined by Luciano, and this is the AWS Bites podcast. In the last episode, we talked about all the different AWS event services you have, and today we're going to do a deep dive on SQS. I think we had a classification of event systems last time, Luciano: we had point-to-point systems, we had PubSub and we had streaming. So SQS is a point-to-point system. What is it good for? Yeah, so in the last episode we gave a few high-level details about SQS, so let's deep dive.

Luciano: We mentioned that decoupling producers and consumers is generally a good use case for SQS. It's also a good service for adding reliability, because when you add SQS you have an easy way to store messages persistently, so you can consume them later. This is good, for instance, in cases where your consumer might not be immediately available or it might be overloaded: you can have a queue in between, and that allows you to manage these kinds of situations in a more reliable way. Another characteristic of SQS is that each message is expected to be processed by one consumer, and that has implications that we'll discuss later with some of the examples. Also, multiple consumers allow you to scale: for instance, if your application is getting more and more successful and the number of messages grows exponentially, you can keep allocating more and more consumers, the messages will be distributed among them, and you can process them in a more parallelized way. So, let's say, SQS is generally a good way to scale workloads when the number of items to process increases over time. The other thing is that it can be used for cross-region or cross-account communication, because from one region you can publish messages to a queue in another region, or from one account you can publish messages into another account. So it can be used for communicating that way across regions and accounts. Is there any example use case that you want to mention to try to clarify these points we just mentioned?

Eoin: There are loads of use cases where you can really make great use of SQS, but I suppose some of the simple ones we talked about last time were: you want to send an email, so you could have a service that consumes email sending requests, and SQS is ideal for that; or you have some sort of batch processing, like a service to process picture resizing requests. That's a typical example, and you can imagine the same thing being applied to a lot of enterprise batch processing workloads as well, like if you're doing some sort of calculation, modeling or aggregation task. These are all jobs that you can put on an SQS queue and then have one or many workers that pull from that queue. It could be an AI modeling workload as well, so you can imagine having a pool of workers, and that pool can auto-scale according to the queue depth, the number of messages in your queue. That's a pretty typical pattern. Thinking about enterprise architecture, or event-driven microservices architecture, and decoupling systems, SQS is really useful in all of those situations: decoupling within systems at a finer grain, or at a macro level across a big enterprise full of applications. You could also use it as an SNS subscription. We mentioned that it's a point-to-point channel, but you can also use it with PubSub, and together you get PubSub with reliability. You also have DLQs. There are lots of enterprise integration patterns, and the DLQ is one of the most well known: it's essentially a dead letter queue, a queue where, if you have messages that have repeatedly failed to be processed, you can put them, then manually inspect them and schedule them for redelivery later. SQS queues can be used as DLQs in their own right with any system, but SQS also has a feature which allows you to send failed messages to another queue, which is a DLQ. So that's one of the cool features of SQS. Should we run through the highlight features of SQS quickly? Where would you start? Yeah I think so.

Luciano: So yeah, the first thing that we also mentioned in the previous episode is that we have two different types of queues in SQS. One type is called standard queues and the other type is FIFO queues. With standard queues you get looser guarantees: the queue will do a best-effort delivery in terms of ordering, so you are not guaranteed to have messages strictly ordered when you consume them, and the other thing is that you get at-least-once delivery, which basically means that when you receive messages it might happen that you get the same message more than once. When you need stricter guarantees you can use FIFO queues. FIFO queues will give you ordering guarantees, so if you produce messages in a certain order, those messages are consumed in the same order, and you also get exactly-once processing: FIFO queues have a mechanism to remove potential duplicates coming into the queue, and we'll give you a few more details about that later on during this episode.
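As a rough sketch of those two queue types with the Python SDK (boto3), under the assumption of hypothetical queue names: FIFO queue names must end in ".fifo", and content-based deduplication hashes the message body to detect duplicates within a five-minute window.

```python
import boto3

sqs = boto3.client("sqs")

# A standard queue: best-effort ordering, at-least-once delivery.
standard = sqs.create_queue(QueueName="jobs")

# A FIFO queue: ordering guarantees and exactly-once processing.
fifo = sqs.create_queue(
    QueueName="jobs.fifo",  # hypothetical name; must end in ".fifo"
    Attributes={
        "FifoQueue": "true",
        "ContentBasedDeduplication": "true",
    },
)

print(standard["QueueUrl"], fifo["QueueUrl"])
```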

Another interesting feature is the DLQ. We already mentioned it. It's something that needs to be enabled, and we'll discuss a little more how to enable it, but it's very easy to have it built in. This is very convenient, because it's very common that you define a number of message types in your application, then your code evolves, and eventually you might introduce a bug in the code that is processing a job. If your worker is always crashing, you're never going to be able to process that message, so rather than retrying it indefinitely, it will eventually be moved to a DLQ, somebody can investigate, and when you realize what the problem is and fix the bug, you can easily re-ingest the message from the DLQ and actually process it. This is a critical feature that I think most applications using a queue should avail of. Another interesting detail is that the protocol of SQS is not one of the common protocols generally seen in other queuing systems like RabbitMQ or ActiveMQ, which use protocols like AMQP or MQTT. In the case of SQS, the protocol is HTTP. I don't think it makes a huge difference at the end of the day, because the way you interact with it is through the SDK, so you don't really get to work at the protocol level, but there might be different features and different performance characteristics because of the underlying protocol, so that might be interesting to know for people coming from other queuing systems. Then we also have server-side encryption, so messages can be stored encrypted, and they are encrypted in transit as well. We have message delays, which allow us to configure in different ways how and when a message should appear in the queue; we'll mention a few examples later on. The other very interesting thing is that SQS is durable, available and scalable pretty much by default. Look at Kinesis in contrast: with Kinesis you need to do a little bit of capacity planning, you need to understand how many shards to provision, and that's generally based on the throughput that you want to achieve. With SQS you don't have to worry about any of that. Queues will automatically scale for you and you don't need to pre-provision anything. In general, I would say that the biggest feature of SQS is that it is, true to its name, a very simple queuing system, and therefore it's very easy to integrate into most applications, and you get very good performance basically straight away without having to go crazy with configuration. Is there anything else worth mentioning? Maybe, I don't know, some interesting integrations with other services? What do you think? Yeah, for sure.
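Enabling a DLQ is done through a redrive policy on the main queue. A minimal sketch with boto3, assuming hypothetical queue names: after a message has been received (and not deleted) maxReceiveCount times, SQS moves it to the dead letter queue.

```python
import json
import boto3

sqs = boto3.client("sqs")

# Create the dead letter queue first and look up its ARN.
dlq_url = sqs.create_queue(QueueName="jobs-dlq")["QueueUrl"]
dlq_arn = sqs.get_queue_attributes(
    QueueUrl=dlq_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# The main queue's redrive policy moves a message to the DLQ after
# it has been received (and not deleted) 5 times.
sqs.create_queue(
    QueueName="jobs",
    Attributes={
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": "5"}
        )
    },
)
```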

Eoin: On the integration with other services side, we can think about the producer side and the consumer side. When you're producing messages, a lot of services can target SQS already. If you've got an API Gateway, you can back that with a queue, so you can send messages from API Gateway directly to a queue. We already mentioned SNS subscriptions, so SNS and SQS play very nicely together. EventBridge too, in the same way, and you can also integrate it into Step Functions. So it's pretty well integrated on the production side. We'll talk about the consumption side, but for both consumption and production you can also use it programmatically. So let's talk about how you send a message and how you receive a message. On the sending side there's a send message API and there's also a send message batch API. So you can send a single message or you can send up to 10 in a batch.

So that's the limit in a batch. Then on the consumption side it's essentially a pull model. You have an SDK or an API that you use to call receive message, and it allows you to receive up to 10 messages at a time. So you can choose to receive one or up to 10. And that operates in two modes: you've got short polling mode and long polling mode. Short polling is basically saying, give me a message if one is available, but if no message is available, just return. So that's essentially a zero-seconds wait time.

But you can also do long polling, where you can say, wait up to 20 seconds for messages to appear and then return. Which one you choose depends on the volume of messages you're expecting and the nature of your system. I suppose it's important to bear in mind that if you're polling more frequently, that's an extra request, and SQS is priced essentially on the number of requests. There's also data transfer, but it's fundamentally about the number of requests, so you can bear that in mind. Once you've called receive message, you can do your message processing and then delete. So it's essentially like you're starting a job and then you're committing to the fact that you've processed the job by calling delete message at the end. There are three steps essentially when you're a consumer. The interesting thing there is what happens when you forget to delete, and that's really important. So can you describe that, Luciano? What would you expect to happen if you forget to delete a message from a queue after you've processed it? Yeah, so I'm gonna try to give a brief description of that.
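A rough sketch of those three consumer steps with boto3; the queue URL and the process function here are hypothetical placeholders.

```python
import boto3

sqs = boto3.client("sqs")
# Hypothetical queue URL; use your own queue's URL here.
queue_url = "https://sqs.eu-west-1.amazonaws.com/123456789012/jobs"


def process(body):
    # Placeholder for your real processing logic (hypothetical).
    print("processing", body)


# Step 1: send up to 10 messages in a single batch request.
sqs.send_message_batch(
    QueueUrl=queue_url,
    Entries=[{"Id": str(i), "MessageBody": f"job-{i}"} for i in range(10)],
)

# Step 2: receive with long polling, waiting up to 20 seconds
# for up to 10 messages (WaitTimeSeconds=0 would be short polling).
response = sqs.receive_message(
    QueueUrl=queue_url,
    MaxNumberOfMessages=10,
    WaitTimeSeconds=20,
)

for message in response.get("Messages", []):
    process(message["Body"])
    # Step 3: commit by deleting; skip this and the message reappears
    # once the visibility timeout expires.
    sqs.delete_message(
        QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"]
    )
```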

Luciano: I think we will understand more how that really works when we deep dive into the actual configuration. But in general, SQS never really deletes messages that have been delivered to a consumer, because it's waiting for the consumer to acknowledge that the job was completed, and that's done through an explicit call to the delete message API. If that doesn't happen, because either you forgot to do it in your code, or maybe there is a bug and the worker crashes before it is actually able to delete the message, the only thing SQS can do is assume that something went wrong.

Maybe the message was not delivered, maybe the job was not processed correctly, so it's going to make the job eventually reappear in the queue so that it can be processed again. It's taking a sane default to make sure that you have a chance to process that message again in case something went wrong. So yeah, make sure to delete the message when you have completed processing it, otherwise you'll end up reprocessing the same job over and over again. And of course your queue will grow indefinitely, because you keep accumulating more and more messages that will always reappear in the queue until they finally expire because of the retention period of a message in the queue. Given that we are starting to talk more and more about the different configuration options, should we deep dive into that? Let's do that.

Eoin: You mentioned, around the deletion, that it's all down to visibility: messages are not deleted, they just become invisible. So my understanding of message visibility, and maybe you can chime in here, is that if you receive a message, it remains in the queue but is hidden, which prevents other consumers from seeing it. It gives you, the first consumer, a chance to process it. There's a visibility timeout, so the clock is ticking and the consumer has to process the message within that timeout. After the timeout has elapsed, if the message hasn't been explicitly deleted, it's going to reappear. This is a configuration setting that you can set at queue level, but you can also set it at an individual message level when you call receive message. It can be zero seconds, it can be up to 12 hours, but the default is 30 seconds.

Luciano: Yeah, I guess one case where this can be important is if you have a job that you know is going to take a long time to process: make sure to fine-tune this visibility timeout, because if it's too low, while you are still processing that message it will already reappear in the queue, so you end up with duplicated processing. That's another issue that can happen when you have long-running processing jobs. Yeah, that's a good one.
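One hedged sketch of how a consumer might handle this, using the real ChangeMessageVisibility API to extend the timeout while a long job is still running; the queue URL is hypothetical.

```python
import boto3

sqs = boto3.client("sqs")
# Hypothetical queue URL.
queue_url = "https://sqs.eu-west-1.amazonaws.com/123456789012/long-jobs"

response = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1)

for message in response.get("Messages", []):
    # If the job is still running as the timeout approaches, extend it
    # so the message doesn't reappear in the queue mid-processing.
    sqs.change_message_visibility(
        QueueUrl=queue_url,
        ReceiptHandle=message["ReceiptHandle"],
        VisibilityTimeout=300,  # restart the clock: 5 more minutes from now
    )
```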

Eoin: And we're going to talk about Lambda later, but Lambda has its own timeout, and you want to make sure that the SQS visibility timeout and the Lambda timeout align, because it doesn't make sense for your Lambda to take longer than your visibility timeout: you'll end up in exactly that situation. Another configuration option I think worth mentioning before we move on is message groups, because we talked about FIFO queues and the ordering guarantees that you get. Those ordering guarantees aren't for the whole queue: you can actually partition it into ordered streams using message groups. It's a bit like the way ordering works with Kinesis shards, different in the details, but the concept is the same, right?

It means that you can define multiple groups and still get parallel processing with ordering, so it's a nice way to balance that. But then the ordering guarantees and the delivery guarantees are per message group ID. That's interesting to know. There's a whole set of other configuration options: you can set up message delay, you can set the queue retention up to 14 days, you can put a specific resource policy in there for security, or a redrive policy for DLQs, which we mentioned already. And on FIFO queues you can also do deduplication, so you can ensure that you get the exactly-once processing semantics by making sure SQS can recognize when you've got a duplicate. FIFO queues also support high throughput mode, which is interesting because with FIFO queues, since you've got ordering, by their very nature you're limiting throughput: you have to process messages in order. So there are a number of settings, like the FIFO throughput limit, that you can use to make sure you get the maximum throughput for FIFO queues. Throughput for standard queues is essentially unlimited. So those are the configuration options. Are there any kinds of constraints or limitations? With a lot of AWS services, you know, you have to understand all the quotas and limitations: what's the soft limit, what's the hard limit. There aren't a lot with SQS, are there? Yeah, there aren't.
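As a rough sketch of message groups on a FIFO queue (queue URL and order IDs hypothetical): messages sharing a MessageGroupId keep their order, while different groups can be consumed in parallel.

```python
import boto3

sqs = boto3.client("sqs")
# Hypothetical FIFO queue URL.
queue_url = "https://sqs.eu-west-1.amazonaws.com/123456789012/orders.fifo"

events = [
    ("order-1", "created"),
    ("order-2", "created"),  # different group: can be processed in parallel
    ("order-1", "paid"),     # same group: delivered after order-1 "created"
]

for order_id, event in events:
    sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=event,
        # Ordering and delivery guarantees apply per message group ID.
        MessageGroupId=order_id,
        # Explicit deduplication ID; alternatively enable
        # content-based deduplication on the queue.
        MessageDeduplicationId=f"{order_id}-{event}",
    )
```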

Luciano: I think the main one is that there is a limit on the message size, which is 256 kilobytes. But a common pattern that I've seen is that you can use, for instance, S3, and just put a reference to the S3 object in the message when you need more data; there's a sketch of this pattern a little further below. Then it's interesting to see that there are no limits in terms of the number of messages that can be stored in a queue. So you can keep pushing more and more messages and you don't have a limit there. And there aren't API limits in terms of requests, except on FIFO queues, where you have, I think, 300 API calls per second. Is that right? Yep. And 3,000 if you use batching, because you can do 10 at a time, basically. And there is another interesting limit, which is the number of messages that can be in-flight. I don't think we explained what in-flight means, but basically it's those messages that are currently being processed by workers on the other side.

Eoin: Received but not deleted. Exactly.

Luciano: Yeah. And the visibility timeout hasn't expired yet, so they haven't reappeared in the queue. Yep. And there is a limit of 120,000 in-flight messages for standard queues and 20,000 for FIFO queues. Now, this is something I wanted to mention because one of the tools we have mentioned in previous episodes, called SLIC Watch, allows you to easily get observability in serverless projects if you're using the Serverless Framework. In that serverless plugin that we built, we already give you a pre-configured alarm and dashboard that you can have in your application to monitor whether you are actually reaching this threshold of too many in-flight messages. So if you want to check that out, we'll put the link in the description. Yeah, that's good.
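Here's the S3 pointer pattern mentioned earlier for payloads over 256 KB, as a minimal sketch; the bucket name and queue URL are hypothetical. (For Java, the Amazon SQS Extended Client Library implements this same pattern.)

```python
import json
import uuid

import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

# Hypothetical bucket and queue URL.
bucket = "my-large-payloads"
queue_url = "https://sqs.eu-west-1.amazonaws.com/123456789012/jobs"


def send_large_message(payload: bytes):
    # Store the real payload in S3 and send only a small reference
    # over SQS, staying well under the 256 KB message size limit.
    key = f"payloads/{uuid.uuid4()}"
    s3.put_object(Bucket=bucket, Key=key, Body=payload)
    sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=json.dumps({"bucket": bucket, "key": key}),
    )
```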

Eoin: It's a bit of a shameless self-promotion, but it's definitely worthwhile because those are the kinds of things you never think about. And then in the rare situation where you've got a problem and your number of in-flight messages goes through the roof, you'll get an alarm with it. I think it's worthwhile talking about AWS Lambda because AWS Lambda works really well with SQS, but it has a whole set of different considerations. So I think it's worthwhile talking about it. When you've got EC2 or ECS consumers, containers somewhere, the operation model for consumers is relatively straightforward because you're in control and you have to build all the infrastructure and scale it out yourself.

With Lambda, there's something called an event source mapping. It's part of the Lambda service that's also used with Kinesis and Kafka sources and DynamoDB streams. The event source mapping is essentially a consumer for SQS that is managed for you within the Lambda service, and it polls those messages for you. We mentioned how SQS integrates with other services on the producer side. On the consumer side, it basically doesn't integrate with anything, because you have to pull messages out of it, but the exception is Lambda, because it does that for you. One of the interesting things I came across with Lambda and SQS is how it scales differently than other Lambda sources. We talked about the batch processing workload: if you've got many Lambda instances running and they're taking a long time to process SQS messages, say some type of machine learning workload running in Lambda that takes 90 seconds to process an event, the event source mapping is only going to scale up by 60 concurrent instances per minute. This is dramatically different from other event sources, where if you invoke Lambda directly, you can get a thousand instances running almost instantly and another thousand every minute. That scales really fast, but with SQS you can't scale that way. Even if you use provisioned concurrency, which I tried, it's still 60 per minute to consume your SQS messages. So that can be a limit, but it depends on how long it takes to process your messages. Obviously, if your messages are processed in seconds or hundreds of milliseconds, you're still going to be able to process thousands of messages, or thousands of batches of messages, very quickly. It's just important to be aware of that. If you've got FIFO queues as well, you get one batch at a time per message group ID. That's probably intuitive, but if you've got five different message group IDs, you're going to have a maximum of five concurrent consumers. There are also some interesting configuration options like batching: you can configure whether it should invoke your Lambda after a predefined number of seconds, like every six seconds, for example, or once it has received a certain threshold number of messages, or based on the payload size, the number of megabytes it has accumulated. So that's a whole set of configuration that you get with the Lambda service and SQS. A really new one is event filtering, which came out just late last year. This is interesting because you can filter at the event source mapping level and say, I only want messages matching this pattern, and you can do a JSON filter or a string filter. What that actually means is that if a message doesn't match the filter, you can still end up losing it: it's been processed by the event source mapping, but it just hasn't been sent on to your Lambda, because you filtered it out. So you have to really think about the semantics there, and if you want another consumer to be able to pick up that message, you might need to re-architect the message delivery setup. And the last thing I'd say about Lambda is that cross-account Lambdas with an SQS queue in a different account are also possible, which is really helpful.
I wish that was available for all the services, including Kinesis, but it's really helpful for integration across multiple applications when you've got a multi-account setup, which is best practice these days, you know, separate account per application, per environment. So if you want to communicate across applications, it's a really good way to do it.
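A sketch of wiring an event source mapping with the batching and filtering options just described, using boto3; the queue ARN, function name and filter pattern are all hypothetical.

```python
import json

import boto3

lambda_client = boto3.client("lambda")

# Hypothetical queue ARN and function name.
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:sqs:eu-west-1:123456789012:jobs",
    FunctionName="process-jobs",
    BatchSize=10,  # invoke with up to 10 messages...
    MaximumBatchingWindowInSeconds=6,  # ...or after 6 seconds, whichever comes first
    # Event filtering: only messages whose JSON body matches this pattern
    # reach the function; filtered-out messages never invoke it.
    FilterCriteria={
        "Filters": [{"Pattern": json.dumps({"body": {"type": ["resize"]}})}]
    },
)
```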

Luciano: Yeah, the way you described the integration with Lambda, it feels like there is a lot of magic that AWS does for you, so you can basically build something quicker. But I think it's important to understand what's really going on under the hood, so you don't have surprises there. So I think that was a good one to cover. Yeah. Yeah.

Eoin: It's an interesting one because Lambda is simple, SQS is simple, but they've got, they're building more and more configuration options to make it more powerful. So, you know, you sacrifice some of that simplicity with the power then you get. I think that Lambda kind of concludes a lot of the topics around SQS, but I did want to call out a couple of talks that came up at reInvent last year, some really good new talks all around the idea of enterprise integration patterns and message driven architectures, and it covers SQS, but also all the other services that we're going to talk about in this series.

And one of them was by Gregor Hohpe, one of the authors of the Enterprise Integration Patterns book, a book I read a long time ago, which is very good for understanding all the different types of message-driven workflows you can have in applications. So there's one by him, and there are a couple of others that we're going to put in the show notes. If people are interested in event-driven architectures and how you can build really powerful architectures with very simple services, without having to build a whole lot of infrastructure, I think these are really, really worthwhile. So strong recommendations on those. And with that, I think we'll leave it for this episode, but please follow us, especially if you want to hear more about the event-driven architecture series: we're going to cover SNS in the next episode. So thanks very much for being with us and we'll talk to you then.