Help us to make this transcription better! If you find an error, please
submit a PR with your corrections.
Luciano: Hello and welcome to another episode of AWS Bites. My name is Luciano and I'm joined by Eoin. We have wanted to talk about DuckDB for a while, and today we finally get to do just that. A few years ago, we started to hear lots of excitement from our data scientist and analyst friends and colleagues. Everyone was talking about DuckDB and they were super excited, and we couldn't figure out exactly why.
But finally, we have used it in a few different projects and now we really understand why everyone is excited about it. So today we want to share with you how DuckDB is becoming a must-have tool for anyone who has some data in an S3 bucket somewhere and needs to do some kind of analytical job with that data. We'll share some use cases and how to get started. And by the way, there is a little bit of a treat at the end. If you are missing the recently retired S3 Select feature on AWS, stay tuned until the end, we might have a surprise for you. Hopefully a better option that you can use today. So let's get into it. AWS Bites is brought to you by fourTheorem, our sponsor. If you're looking for a partner to architect, develop, and modernize on AWS, give fourTheorem a call. You can check out all the details at fourtheorem.com. So Eoin, maybe we can start by giving a little bit of an explanation of what DuckDB is. Well, it's a database, but specifically it's an in-memory analytical database.
Eoin: So when you think in-memory, you might think of something like SQLite, but this is for analytical or OLAP workloads. So you can run it locally because it's in-memory, but you can also run it in the cloud, basically anywhere. And it's SQL first, so anyone with knowledge of SQL can pick it up and use it very quickly. It works with a lot of different storage formats.
It has its own format, but you can also use it with Parquet, Iceberg, CSV data, JSON data, Arrow. And one of the really big draws is that it's very simple to use, and it has managed to stay very simple to use. It doesn't have any dependencies, works on pretty much any modern OS or CPU, and it has language bindings for almost everything you might need. It's also pretty modular, so new features are enabled by installing extensions, and you can even write your own custom extensions as well. And it's open source. I heard an interview with one of the founders of DuckDB recently, and they were talking about the governance model. They basically set up a separate foundation holding all of its IP, keeping it clear of all the commercial entities, to ensure that the community doesn't get any nasty license-switching surprises in the future. So that's the intro. Luciano, what is DuckDB good for?
Luciano: Yeah, you already mentioned that you can run it anywhere. That's probably one of the main things worth mentioning. You can run it on your laptop, on an EC2 instance, in a container running, again, on your machine or somewhere in the cloud. You can even run it in Lambda, which is probably the coolest use case, if you ask me. So one of the biggest benefits is how fast it is on modest hardware. You don't need to run it on high-end machines or buy expensive VMs.
You can run it on pretty much any reasonable hardware. And it has a very efficient multicore columnar execution engine, which is able to parallelize fetching and execution of the queries that you provide to analyze your data. You can also work with more data than fits in memory, which is an interesting detail for something that is defined as an in-memory database. Effectively, we are used to massive clusters or data warehouses being required for querying big data.
DuckDB kind of changes that a little bit because it solves a lot of the problems that we typically would solve with these massive clusters, but with a much simpler and much more cost-effective approach. And I think that's probably one of the main reasons why everyone is so excited about DuckDB. They say that one of the main reasons why DuckDB can be so efficient is that the engine is particularly well engineered: it's a columnar, vectorized engine, and it can process large numbers of rows for a column at once. This is why it's very well suited for analytical workloads. And because DuckDB can work with more data than fits in memory, you often hear stories of people processing terabytes of data, even on their laptops. I'm seeing lots of small examples and use cases from people talking online. For instance, one common use case is reimplementing your own version of Google Analytics. People are doing that with DuckDB, and it seems to be extremely easy to build and extremely efficient and cost-effective. So that's just to give you an idea of the kind of things you could be building with something like DuckDB. So where do we get started?
Eoin: Yeah, the best way to take it for a test drive is just to install the CLI. You can do that on Linux, macOS or Windows. Once you do that, you can run the duckdb command and you have an interactive SQL REPL. So you can start writing SQL, and you can run in memory, which it does by default, or use a DuckDB file. In about 95% of the use cases I've seen, you're just using the in-memory case, but there are a lot of advantages if you want to persist the data as well.
So if you just want to query data on S3, in-memory is fine. You could start with something like SELECT * FROM and then just give the URI for your S3 object. That could be a single file or it could be a group of files. And then you can get into more complex stuff like partitioning. If you're in the big data space, you'll be familiar with hive partitioning on stores like S3, where certain parts of the path represent partitions of the data, like date=2025-02-18.
And when you have those dates in your WHERE expression, it means it only has to read that folder. You can do that with DuckDB as well. You can also create in-memory tables for intermediate results, or indeed persisted tables if you're not using in-memory mode. And then you can do joins. It has very rich SQL support, but it also has custom functions for doing things like reading Parquet data with specific parameters and scanning lots of different partitions on a store like S3.
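For illustration, here is a minimal sketch of that kind of hive-partitioned S3 query using DuckDB's Python bindings. The bucket, prefix, and column names are invented, and the credential setup assumes a recent DuckDB version with the httpfs extension available:

```python
# Minimal sketch: querying hive-partitioned Parquet data on S3 with DuckDB's
# Python API. Bucket, prefix and columns are made-up examples.
import duckdb

con = duckdb.connect()  # in-memory database by default

# httpfs adds s3:// support (recent DuckDB versions can autoload it).
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
# Pick up credentials from the standard AWS credential chain
# (may rely on the aws extension, which DuckDB can autoload).
con.execute("CREATE SECRET (TYPE s3, PROVIDER credential_chain)")

# With hive_partitioning enabled, the date filter means only the
# date=2025-02-18 partition needs to be read.
rows = con.execute("""
    SELECT order_id, amount
    FROM read_parquet('s3://my-example-bucket/orders/*/*.parquet',
                      hive_partitioning = true)
    WHERE date = '2025-02-18'
""").fetchall()
print(rows)
```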
So on AWS, if you want to start using it in your application, there are lots of options for integrating it. Lambda, you mentioned already, Luciano; it lets you do simple and cheap data lake querying, or even ETL pipelines if you're doing transformations. That's one of the places where I've found it very useful, because you can do ETL with a couple of lines of SQL just running in a Lambda function. And that's your job done.
You don't have to stand up any kind of complicated or heavier ETL infrastructure. And you can obviously run it in containers, for example on ECS, Fargate, EKS, something like that. You can use it in AWS Glue, if you like, in your Python shell jobs. And you can also integrate it into step functions. That's one of the things that I've found it really good for, actually: putting it into step functions as one step that does massive transformations on a lot of data.
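As a rough illustration of that couple-of-lines-of-SQL ETL in Lambda, here is a sketch of a Python handler. It assumes the duckdb package is available to the function (for example via a layer or container image) and that the execution role can read and write the buckets; all names are invented:

```python
# Rough sketch of a tiny DuckDB ETL step inside a Python Lambda handler.
# Assumes the duckdb package is bundled with the function and the role has
# S3 read/write permissions. Bucket names and columns are made up.
import duckdb

def handler(event, context):
    con = duckdb.connect()
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")
    con.execute("CREATE SECRET (TYPE s3, PROVIDER credential_chain)")

    # Read raw CSV files, filter them, and write curated Parquet in one statement.
    con.execute("""
        COPY (
            SELECT *
            FROM read_csv_auto('s3://my-raw-bucket/events/*.csv')
            WHERE status = 'completed'
        )
        TO 's3://my-curated-bucket/events/completed.parquet' (FORMAT parquet)
    """)
    return {"status": "done"}
```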
And you gave a preview at the start. You said that we might have an alternative for S3 Select. Unfortunately, S3 Select has been retired since late last year for new AWS accounts. If you have existing accounts, you might still be able to use it. But it was a very handy feature. It was a console tool and an API that you could use from the SDKs to query data on S3 directly. It was quite a limited subset of SQL that was supported in S3 Select.
You could basically just do a select with a where clause on a single file. But people really miss it now that it's gone. And the good news with DuckDB is that you can replace it. Essentially, you can use the CLI as a drop-in replacement for what you would have done in the AWS console. But you can also use DuckDB as a programmatic replacement for the AWS SDK usage of S3 Select. And the good news is that you'll have a lot more features than you would have had with S3 Select. We'll talk a little bit more about that, because one of the recent use cases I came across was wanting to use S3 Select in a step function just to do some filtering on data. It was in a new account, so that wasn't available to me, and I was able to replace it with DuckDB and get really good results. I'll come back to that in a bit. But before that, maybe we should talk about how DuckDB compares to other things like Athena, SQLite, Pandas, Polars.
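As a hedged sketch of that programmatic replacement, this is roughly what an S3 Select-style query on a single object could look like with the duckdb Python package. The file path and column names are invented:

```python
# Sketch: filtering a single CSV object on S3 with DuckDB, roughly what the
# S3 Select SelectObjectContent API used to do. Path and columns are made up.
import duckdb

duckdb.execute("INSTALL httpfs")
duckdb.execute("LOAD httpfs")
duckdb.execute("CREATE SECRET (TYPE s3, PROVIDER credential_chain)")

# Unlike S3 Select, nothing stops you from grouping, joining or reading
# multiple files here; this is just the minimal like-for-like query.
rows = duckdb.sql("""
    SELECT *
    FROM read_csv_auto('s3://my-example-bucket/reports/2025-02.csv')
    WHERE amount > 100
""").fetchall()
```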
Luciano: Yeah, let me try to give a quick overview on that. Comparing it to Athena, DuckDB is much faster and simpler in terms of setting it up and the code that you need to write. I think it's fair to say that Athena will probably support much more data if you are at massive scale, but you can get very far with DuckDB. And the reason why Athena, or even Redshift if we want to bring that into the conversation, will support more data is that they are large-scale multi-node systems, while DuckDB is a single-node system.
So unless you want to write your own crazy multiplexing approach where you shard the data across multiple nodes and manage all the querying in between, I think DuckDB is more suitable for things that will fit on a single node. But with that, you can still solve lots of problems, even at relatively big scale. So that covers the comparison between Athena, and even Redshift I guess, versus DuckDB.
Another one that it's very often compared to is SQLite. And I think it's fair to say that these days there is also an alternative called libSQL. If you are not aware, libSQL is a fork of SQLite, so in a way it's very similar and tries to solve the same problems, just with a slightly different governance model than SQLite. I'm seeing it more and more, so possibly it's eventually going to become a big contender in this space.
So if you compare both SQLite and libSQL with DuckDB, the overall approach looks very similar, so it makes sense to compare them. But at the same time, they are trying to solve very different problems. Where SQLite or libSQL effectively give you an easy way to have a local, embeddable, transactional database, DuckDB is much more focused on large-scale analytics. So the way that the data is stored and retrieved is very, very different.
And therefore, I think you need to pick the right tool for the job. If you're doing analytics, stick with DuckDB; otherwise, SQLite will probably give you better results. Then if we bring into the picture Pandas and the more recent Polars: Pandas is the original one, written in Python, and Polars is a more modern rewrite written in Rust, so more focused on the additional performance that the Rust programming language can bring.
These are two very popular data processing libraries that are used for analytics. I think the use case is very similar, but the approach is quite different. Pandas and Polars are programming-language libraries, so they offer you a programmatic interface: you import a library in a piece of code you're writing, you have this concept of an object that represents a data frame, and you can use this data frame to do all sorts of things from languages like Python.
You write your own instructions, effectively your own script, and that way you can create even fairly complex analytical pipelines. I don't think those will be completely replaced by DuckDB. On one side, Pandas and Polars can give you an extra bit of flexibility, and maybe people just prefer the more programmatic approach, so there is definitely a difference in the things you can do and the way you approach the problem. With that being said, I think you might be able to replace some of the work that you do with Pandas and Polars with just a single SQL query in DuckDB. So it's worth figuring out where the sweet spot is, where maybe DuckDB will give you a much simpler solution and you don't have to write and maintain code; you probably just focus on writing a SQL statement.
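To illustrate the kind of DataFrame work that can collapse into one query, here is a hedged sketch comparing a pandas pipeline with an equivalent DuckDB statement. The file name and columns are invented, and the pandas version is only there for comparison:

```python
# Sketch: the same aggregation done with pandas and with a single DuckDB query.
# 'sales.parquet' and its columns are made-up examples.
import duckdb
import pandas as pd

# pandas / Polars style: load the data, then chain transformations in code.
df = pd.read_parquet("sales.parquet")
top_products = (
    df[df["region"] == "EU"]
    .groupby("product")["amount"]
    .sum()
    .nlargest(10)
)

# DuckDB style: one declarative query over the same file, back into a DataFrame.
top_products_sql = duckdb.sql("""
    SELECT product, SUM(amount) AS total
    FROM 'sales.parquet'
    WHERE region = 'EU'
    GROUP BY product
    ORDER BY total DESC
    LIMIT 10
""").df()
```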
Eoin: Yeah, I think my rule of thumb is that when you run into multiple lines of SQL and it eventually gets more complicated, you feel like you're back in stored procedure land in the early 2000s, and at that point you might be better off using something a bit more imperative like Python.
Luciano: So you mentioned step functions before. I'm really curious to see how you use DuckDB with step functions.
Eoin: Yeah, I mean there's no way you can create a custom user-defined task or state in step functions apart from using something like Lambda or a container task. So I mentioned that we had this issue where we couldn't use S3 Select and we trialed using DuckDB instead. It worked really well, and we actually decided to make this a general-purpose pattern so that anyone can really easily integrate things like DuckDB into step functions and other workflows.
The way we did that was by creating a Lambda runtime for DuckDB. And this might sound a little bit strange, but hear us out. The layer allows you to basically deploy a Lambda function that doesn't have any code. It just has a DuckDB engine, and we created this runtime layer on top of it that accepts a query and allows you to then do something with the results. So it's really lightweight. It's written in C, because DuckDB's core exposes a C API, so it was fairly straightforward to create a thin wrapper around it in C as well.
So it's very tiny to deploy, quite lightweight in terms of startup, and then you really just have to incur the analytical time for whatever query you're running. It means you don't have to have a Python runtime or a Node.js runtime, and you don't have to worry about dependencies and upgrades. It's just a single binary, and you can then use it in response to an EventBridge event, or integrate it into step functions. All of a sudden you have the ability to use DuckDB in your step function definition: you can just write your SQL in the input parameters for the Lambda task and interpolate any variables you might need from your step function state.
And we've been able to use this for doing simple things like taking a CSV file and converting it into JSON and then doing something else with the response. But you can also do filtering, grouping, and aggregation before you return your JSON. So it's really powerful. I think it's worthy of a runtime, because you can do so much with DuckDB. And I think a lot of us who use Lambda heavily are always looking for ways to reduce the number of functions you need to maintain, because otherwise it just seems like you've got a lot of cattle to herd.
And this is a good way to do it. So we'll have a link to the repo in the show notes. It's essentially a SAR app, so a serverless application repository app that you can use to install the runtime in your own account. And then you just create a simple function that you can deploy. You might need to give it some permissions. So normally, rather than returning analytical data in the response of your function invocation, you might write it to S3 just because of the size of the data or the format of the data.
So you might want to give that function you deploy some access to S3, like read access to some buckets and maybe write access to a staging area where your intermediate results are stored. And once you have that, you can do some pretty mad things with DuckDB and step functions, like you can even call APIs using DuckDB because it has an HTTP extension and you can even do like authenticated requests to APIs, transform the response in SQL.
You can convert tabular data on S3 to JSON, like I mentioned, which you can then use in your step function state. You can do fairly significant ETL or ELT type workloads, and then anything else in terms of analytics and aggregation. You know, if you've got a step function that's doing a map, or even a distributed map, to process lots and lots of units of data, you could then use DuckDB to collate all that data, aggregate it together and get results. So, yeah, I'm curious to hear how people get on with it, if they can try it out and find some really good use cases, because it really is kind of a Swiss army knife type of tool and it's just limited by your imagination.
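For a flavour of that CSV-to-JSON step, here is a sketch of the kind of statement such a task might run. The bucket names, prefixes, and columns are invented, and the actual Lambda runtime described above takes the SQL directly as its input, so the Python wrapper here is just for illustration:

```python
# Sketch: the kind of DuckDB statement a step function task might run to turn
# a CSV object on S3 into JSON that a later state can consume.
# Bucket names, prefixes and columns are made up.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
con.execute("CREATE SECRET (TYPE s3, PROVIDER credential_chain)")

# Aggregate the CSV input and write the result as a JSON array to a staging area.
con.execute("""
    COPY (
        SELECT customer_id, SUM(amount) AS total
        FROM read_csv_auto('s3://my-input-bucket/batch/2025-02-18.csv')
        GROUP BY customer_id
    )
    TO 's3://my-staging-bucket/results/2025-02-18.json' (FORMAT json, ARRAY true)
""")
```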
Luciano: That's awesome. I think this is everything we wanted to share today. As a final conclusion, we believe that DuckDB is really good, really promising, and we'll probably be seeing more and more of it, especially in the cloud, in Lambda, on AWS. So, hopefully you got some value from this episode. If you did, please remember to share it with your friends, like, subscribe and everything else, and also please give a star to our new repository if you like this project. Thank you very much and we'll see you in the next episode. Bye.