DuckDb is a powerful in-memory database that has a parallel processing feature, which makes it a good choice to read/transform cloud storage data, in this case, AWS S3. I’ve had a lot of success using it and I will walk you through the steps in implementing it.
I will also include some learnings and best practices for you. Using the DuckDb, httpfs extension and pyarrow, we can efficiently process Parquet files stored in S3 buckets. Let’s dive in: