Dataframe write performance to Postgres

Previously, I talked about dataframe performance, but that didn't cover writing data to a destination. At a large scale, big data means you need to use Spark for data processing (unless you prefer SQL, in which case this post is irrelevant). But not many orgs actually have big data, so small-data frameworks should work, since they are easier to set up and use than Spark. Initially I wanted to include pandas as well, but sadly it performs significantly worse than Polars, so only Spark and Polars remain in the benchmark. ...

March 17, 2024 · 2 min · Karn Wong

DuckDB vs Polars vs Spark!

I think everyone who has worked with data, in any role or function, has used pandas 🐼 at some point. I first used pandas in 2017, so it's been 6 years already. Things have come a long way, and so has the size of the data I'm working with! Pandas has its own issues, namely no native support for nested schemas. In addition, it's very heavy-handed with data type inference. That can be a blessing, but it's a bane for data engineering work, where you have to make sure your data conforms to an agreed-upon schema (hello, data contracts!). But the worst issue? Pandas can't open data that doesn't fit into memory. So on a machine with 16 GB of RAM, you can't read a 12 GB dataset with pandas 😭. ...

April 7, 2023 · 3 min · Karn Wong