Faster Spark workloads with Comet

For big data processing, Spark is still king. Over the years, a lot of work has gone into improving Spark performance. Databricks themselves created Photon, a Spark engine that can accelerate Spark queries, but it is proprietary to Databricks. Other alternatives do exist (see here for more details), but they are not trivial to set up. Apache Arrow DataFusion Comet, surprisingly, takes very little time to set up. Comet is built on Arrow, a data format that is growing in popularity. ...

April 7, 2024 · 2 min · Karn Wong

Dataframe write performance to Postgres

Previously, I talked about dataframe performance, but that did not cover writing data to a destination. At a large scale, big data means you need to use Spark for data processing (unless you prefer SQL, in which case this post is irrelevant). But not many orgs need big data, so small data frameworks should work, since they are easier to set up and use than Spark. Initially I wanted to include pandas as well, but sadly it performs significantly worse than polars, so only Spark and polars remain in the benchmark. ...

March 17, 2024 · 2 min · Karn Wong

Using Apache Iceberg to reduce data lake operations overhead

Every business generates data: some very little, some a ginormous amount. If you are familiar with basic web application architecture, there are data, application and web tiers. But it doesn’t end there, because the data generated has to be analyzed for reports. A lot of organizations have analysts working directly on the production database. This works fine until the data they are working with grows so large that a single query can take half a day to process! ...

November 15, 2023 · 4 min · Karn Wong

Spark on Kubernetes

Background: For data processing tasks, there are several ways to go about it: using SQL to leverage a database engine for data transformation; dataframe-based frameworks such as pandas, ray, dask and polars; or big data processing frameworks such as Spark. Check out this article for more info on the polars vs Spark benchmark. The problem: At larger data scales, solutions other than Spark can work, but only with a lot of vertical scaling, which can get very expensive. For comparison, our team had to scale a database to 4/16 GB and it still took the whole night, whereas Spark on a single node can process the data in 2 minutes flat. ...

September 12, 2023 · 4 min · Karn Wong

Data Engineering Resources

Note: if you’ve seen the list elsewhere, it was probably me. I first posted this list on Data Engineering Discord and Data Engineer Cafe.

Books

Data fundamentals (good entrypoint)
- Fundamentals of Data Engineering - Joe Reis & Matt Housley
- Seven Databases in Seven Weeks - Luc Perkins & Eric Redmond & Jim Wilson
- Designing Data-Intensive Applications - Martin Kleppmann
- The Data Warehouse Toolkit - Ralph Kimball & Margy Ross
- Data Science for Business - Foster Provost & Tom Fawcett
- Practical Statistics for Data Scientists - Peter Gedeck & Peter Bruce & Andrew Bruce

Software engineering
- Python Crash Course - Eric Matthes
- The Pragmatic Programmer - Andrew Hunt & David Thomas

Platform
- Terraform: Up & Running - Yevgeniy Brikman

Management
- Team Topologies - Matthew Skelton & Manuel Pais
- Radical Candor - Kim Scott
- Data Teams - Jesse Anderson
- Practical DataOps - Harvinder Atwal

Resources
- https://brendanthompson.com/posts/2021/11/my-terraform-development-workflow
- https://www.terraform-best-practices.com/
- https://github.com/open-guides/og-aws
- https://awesomedataengineering.com/
- https://github.com/opendatadiscovery/awesome-data-catalogs
- https://github.com/datastacktv/data-engineer-roadmap
- https://www.moderndatastack.xyz/stacks
- https://www.secoda.co/glossary
- https://www.gentlydownthe.stream/
- https://b-greve.gitbook.io/beginners-guide-to-clean-data/

September 9, 2023 · 1 min · Karn Wong