Faster Spark workloads with Comet

For big data processing, Spark is still king. Over the years, many improvements have been made to speed it up: Databricks created Photon, an engine that accelerates Spark queries, but it is proprietary to Databricks. Other alternatives do exist (see here for more details), but they are not trivial to set up. Apache Arrow DataFusion Comet, surprisingly, takes very little time to set up at all. Comet is built on Arrow, a data format growing in popularity. ...
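To give a feel for how little setup is involved, here is a minimal sketch of enabling Comet from PySpark. The jar path and input file are hypothetical, and the config keys follow the Comet docs around the time of writing, so they may differ in later releases:

```python
from pyspark.sql import SparkSession

# Enable Comet by loading its plugin into an otherwise ordinary Spark
# session. The jar path is a placeholder; build or download the Comet
# jar for your Spark/Scala version first.
spark = (
    SparkSession.builder
    .appName("comet-demo")
    .config("spark.jars", "/path/to/comet-spark.jar")
    .config("spark.plugins", "org.apache.spark.CometPlugin")
    .config("spark.comet.enabled", "true")
    .config("spark.comet.exec.enabled", "true")
    .getOrCreate()
)

# Operators Comet supports are executed by its Arrow-based native
# engine; everything else falls back to vanilla Spark.
df = spark.read.parquet("data.parquet")  # hypothetical input file
df.groupBy("key").count().show()
```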

April 7, 2024 · 2 min · Karn Wong

Dataframe write performance to Postgres

Previously, I talked about dataframe performance, but that didn't cover writing data to a destination. At a large scale, big data means you need Spark for data processing (unless you prefer SQL, in which case this post is irrelevant). But not many orgs actually have big data, so small-data frameworks should work, since they are easier to set up and use than Spark. Initially I wanted to include pandas as well, but sadly it performs significantly worse than Polars, so only Spark and Polars remain in the benchmark. ...
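As a sketch of what the Polars side of such a benchmark looks like, writing a dataframe to Postgres is a single call to write_database (which assumes sqlalchemy and a Postgres driver are installed); the connection string and table name below are hypothetical:

```python
import polars as pl

# Hypothetical sample data standing in for the benchmark dataset.
df = pl.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# Connection string and table name are placeholders for your own.
df.write_database(
    table_name="benchmark_table",
    connection="postgresql://user:password@localhost:5432/mydb",
    if_table_exists="replace",  # overwrite between benchmark runs
)
```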

March 17, 2024 · 2 min · Karn Wong

DuckDB vs Polars vs Spark!

I think everyone who has worked with data, in any role or function, has used pandas 🐼 at some point. I first used pandas in 2017, so it's been 6 years already. Things have come a long way, and so has the size of the data I work with! Pandas has its own issues, namely no native support for nested schemas. In addition, it's very heavy-handed with data type inference. That can be a blessing, but it's a bane for data engineering work, where you have to make sure your data conforms to an agreed-upon schema (hello data contracts!). But the worst issue? Pandas can't open data that doesn't fit into memory. So if you have a 16 GB RAM machine, you can't read 12 GB of data with pandas 😭. ...
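To make the contrast concrete, here is a minimal sketch of how Polars sidesteps that memory limit with lazy scanning and streaming execution (the streaming flag reflects the Polars API around this time; the file path and column names are hypothetical):

```python
import polars as pl

# scan_parquet builds a lazy query plan instead of loading the file,
# and streaming execution processes it in chunks, so a 12 GB file can
# be aggregated on a 16 GB machine without loading it all at once.
result = (
    pl.scan_parquet("big_12gb_file.parquet")  # hypothetical file
    .group_by("category")
    .agg(pl.col("amount").sum())
    .collect(streaming=True)
)
print(result)
```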

April 7, 2023 · 3 min · Karn Wong