Hello World API Performance Benchmark (Go, Node, Python, Rust)

Correction 2024-09-21: after switching to a multi-stage build, the Node image size dropped from 1.1 GB to 130 MB. The first programming language I achieved proficiency in was Python, so for the longest time I’ve been using it for most things. Last year I picked up Go, and I had a blast with it. This month I picked up Rust for data/ML work, and so far I’ve been very impressed. It got me thinking: it’s been said many times that Python is slower than compiled languages, and that Node is very easy to use but uses a lot of memory. Since I can code in multiple languages, why not run a simple API benchmark? So here are the results. ...
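For context, a hello-world benchmark like this measures a bare endpoint in each language. A minimal sketch of what the Python contender might look like (my illustration, not the post's actual code; FastAPI and uvicorn are assumptions):

```python
# hello.py - a minimal hello-world endpoint of the kind such benchmarks measure.
# FastAPI/uvicorn are assumptions here; the post may use a different framework.
from fastapi import FastAPI

app = FastAPI()

@app.get("/")
async def hello() -> dict:
    # return the smallest possible JSON payload, so the framework,
    # not the handler, dominates the measurement
    return {"message": "Hello, World!"}
```

Run it with `uvicorn hello:app --port 8000` and point a load generator such as `wrk` or `hey` at it.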

September 20, 2024 · 3 min · Karn Wong

Streamlit load test performance

Streamlit is well-loved by many people, especially data folks, because it does not require prior web programming knowledge to get started. Popular use cases for Streamlit range from a quick machine learning proof-of-concept to internal dashboards. But what if you want to create a production deployment? Would Streamlit still be a viable option? Experiment setup For a production deployment, let’s assume there are 500 concurrent users and a single-page Streamlit app with a chat box that takes 3 seconds to return a single-paragraph message: ...
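The app under test might look something like this minimal sketch (my reconstruction of the described setup, not the post's code; the 3-second sleep stands in for the chat backend):

```python
# app.py - single-page Streamlit app with a chat box, per the experiment setup
import time

import streamlit as st

st.title("Chat demo")

prompt = st.chat_input("Say something")
if prompt:
    st.chat_message("user").write(prompt)
    with st.chat_message("assistant"):
        time.sleep(3)  # stand-in for a backend that takes 3 s to respond
        st.write("A single-paragraph reply goes here.")
```

Each of the 500 simulated users would drive this page concurrently over Streamlit's websocket session, which is what makes the load profile interesting.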

September 7, 2024 · 2 min · Karn Wong

Faster Spark workloads with Comet

For big data processing, Spark is still king. Over the years, many improvements have been made to Spark’s performance. Databricks created Photon, a Spark engine that accelerates Spark queries, but it is proprietary to Databricks. Other alternatives do exist (see here for more details), but they are not trivial to set up. Apache Arrow DataFusion Comet, however, surprisingly takes very little time to set up. Comet is built on Arrow, a data format growing in popularity. ...
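Enabling Comet is mostly Spark configuration. A sketch of what that looks like from PySpark, assuming a locally built Comet jar (the jar path and config keys follow the Comet docs at the time of writing and may differ across versions):

```python
# comet_session.py - sketch of enabling Comet on a SparkSession.
# The COMET_JAR path and the spark.comet.* keys are assumptions from the Comet docs.
from pyspark.sql import SparkSession

COMET_JAR = "/opt/comet/comet-spark.jar"  # hypothetical path to the Comet jar

spark = (
    SparkSession.builder
    .appName("comet-demo")
    .config("spark.jars", COMET_JAR)
    .config("spark.plugins", "org.apache.spark.CometPlugin")  # load the plugin
    .config("spark.comet.enabled", "true")
    .config("spark.comet.exec.enabled", "true")  # accelerate physical operators
    .getOrCreate()
)

# hypothetical input file and column; eligible scans/aggregations get offloaded to Comet
spark.read.parquet("events.parquet").groupBy("event_type").count().show()
```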

April 7, 2024 · 2 min · Karn Wong

Dataframe write performance to Postgres

Previously, I talked about dataframe performance, but that didn’t cover writing data to a destination. At a large scale, big data means you need Spark for data processing (unless you prefer SQL, in which case this post is irrelevant). But not many orgs need big data, so small-data frameworks should work, since they are easier to set up and use than Spark. Initially I wanted to include pandas as well, but sadly it performs significantly worse than Polars, so only Spark and Polars remain in the benchmark. ...
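For the Polars side, writing to Postgres is close to a one-liner. A minimal sketch (the table name and connection string are placeholders, and the post's actual benchmark code may differ):

```python
# sketch of a Polars -> Postgres write; not the post's benchmark code
import polars as pl

df = pl.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

df.write_database(
    table_name="benchmark_table",  # hypothetical table name
    connection="postgresql://user:pass@localhost:5432/db",  # placeholder DSN
    if_table_exists="replace",
    engine="adbc",  # needs adbc-driver-postgresql; "sqlalchemy" is the default engine
)
```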

March 17, 2024 · 2 min · Karn Wong

DuckDB vs Polars vs Spark!

I think everyone who has worked with data, in any role or function, has used pandas 🐼 at some point. I first used pandas in 2017, so it’s been 6 years already. Things have come a long way, and so has the size of the data I’m working with! Pandas has its own issues, namely no native support for nested schemas. In addition, it’s very heavy-handed with data type inference. That can be a blessing, but it’s a bane for data engineering work, where you have to make sure your data conforms to an agreed-upon schema (hello data contracts!). But the worst issue? Pandas can’t open data that doesn’t fit into memory. So if you have a 16 GB RAM machine, you can’t read 12 GB of data with pandas 😭. ...
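This is the gap the post's contenders fill: both Polars and DuckDB can stream over data rather than materializing it all at once. A sketch of the Polars version (file name and columns are hypothetical):

```python
# sketch of processing a larger-than-RAM file with Polars' lazy engine
import polars as pl

result = (
    pl.scan_parquet("large_dataset.parquet")  # hypothetical 12 GB file; scan, don't load
    .group_by("category")                     # hypothetical column
    .agg(pl.col("amount").sum())
    .collect(streaming=True)  # streaming execution; newer Polars spells this collect(engine="streaming")
)
print(result)
```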

April 7, 2023 · 3 min · Karn Wong