Faster Spark workloads with Comet

For big data processing, Spark is still king. Over the years, many improvements have been made to boost Spark performance. Databricks created Photon, an engine that accelerates Spark queries, but it is proprietary to Databricks. Other alternatives do exist (see here for more details), but they are not trivial to set up. Apache Arrow DataFusion Comet, however, takes surprisingly little time to set up. Comet is built on Arrow, a data format growing in popularity. ...
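To give a feel for how little configuration "set up" means here, a minimal sketch of enabling Comet on a local PySpark session might look like the following. The jar path is a placeholder, and the plugin class and `spark.comet.*` keys follow the Comet docs as I recall them; treat them as assumptions that may differ between releases.

```python
from pyspark.sql import SparkSession

# A minimal sketch of enabling DataFusion Comet on a local Spark session.
# The jar path is hypothetical; config keys follow the Comet docs but may
# vary between releases -- treat them as assumptions.
spark = (
    SparkSession.builder
    .appName("comet-demo")
    .config("spark.jars", "/path/to/comet-spark.jar")  # placeholder path
    .config("spark.plugins", "org.apache.spark.CometPlugin")
    .config("spark.comet.enabled", "true")
    .config("spark.comet.exec.enabled", "true")
    .getOrCreate()
)

# Regular DataFrame code is unchanged; Comet accelerates supported operators.
spark.read.parquet("/path/to/data.parquet").groupBy("key").count().show()
```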

April 7, 2024 · 2 min · Karn Wong

Using Apache Iceberg to reduce data lake operations overhead

Every business generates data; some generate very little, while others generate ginormous amounts. If you are familiar with basic web application architecture, there are data, application, and web tiers. But it doesn't end there, because the generated data has to be analyzed for reports. A lot of organizations have analysts working on the production database directly. This works well enough, until the data they are working with becomes so large that a single query can take half a day to process! ...

November 15, 2023 · 4 min · Karn Wong

Spark on Kubernetes

Background

For data processing tasks, there are different ways you can go about it: using SQL to leverage a database engine for data transformation; dataframe-based frameworks such as pandas, ray, dask, or polars; or big data processing frameworks such as spark (see the sketch after this excerpt). Check out this article for more info on a polars vs spark benchmark.

The problem

At larger data scales, the other solutions (everything except spark) can work, but only with a lot of vertical scaling, and this can get very expensive. For comparison, our team had to scale a database to 4/16 GB and it still took the whole night, whereas spark on a single node can process the data in 2 minutes flat. ...
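To make the single-node comparison concrete, a sketch along these lines is what "spark on a single node" means in practice; the file paths, column names, and memory setting are made up for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# A single-node ("local mode") Spark session -- no cluster required.
# local[*] uses every core on the machine.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("single-node-batch")
    .config("spark.driver.memory", "8g")  # illustrative; tune to your machine
    .getOrCreate()
)

# Hypothetical transformation: input path and columns are placeholders.
df = spark.read.parquet("/data/events.parquet")
result = df.groupBy("user_id").agg(F.count("*").alias("event_count"))
result.write.mode("overwrite").parquet("/data/event_counts.parquet")
```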

September 12, 2023 · 4 min · Karn Wong

DuckDB vs Polars vs Spark!

I think everyone who has worked with data, in any role or function, has used pandas 🐼 at some point. I first used pandas in 2017, so it's been 6 years already. Things have come a long way, and so has the size of the data I'm working with! Pandas has its own issues, namely no native support for nested schemas. In addition, it's very heavy-handed with data type inference. That can be a blessing, but it's a bane for data engineering work, where you have to make sure your data conforms to an agreed-upon schema (hello data contracts!). But the worst issue? Pandas can't open data that doesn't fit into memory. So if you have a 16 GB RAM machine, you can't read 12 GB of data with pandas 😭. ...
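For contrast, polars (like duckdb) can scan a file lazily instead of loading it all into RAM. A minimal sketch with polars' lazy API follows; the file path and column names are hypothetical, and the streaming flag's exact spelling has shifted across polars versions.

```python
import polars as pl

# Lazily scan a parquet file that may be larger than RAM; nothing is read
# until .collect(), and streaming mode processes the data in chunks.
lazy = (
    pl.scan_parquet("/data/12gb_events.parquet")  # hypothetical path
    .filter(pl.col("country") == "TH")            # hypothetical column
    .group_by("user_id")
    .agg(pl.len().alias("events"))
)

# streaming=True is the older spelling; newer releases prefer an engine arg.
result = lazy.collect(streaming=True)
print(result.head())
```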

April 7, 2023 · 3 min · Karn Wong

Use pyspark locally with docker

For data that doesn't fit into memory, spark is often the recommended solution, since it can utilize map-reduce to work with data in a distributed manner. However, setting up local spark development from scratch involves multiple steps, and it is definitely not for the faint of heart. Thankfully, using docker means you can skip a lot of steps 😃

Instructions

Install Docker Desktop.

Create docker-compose.yml in a directory somewhere:

```yaml
version: "3.3"
services:
  pyspark:
    container_name: pyspark
    image: jupyter/pyspark-notebook:latest
    ports:
      - "8888:8888"
    volumes:
      - ./:/home/jovyan/work
```

Run docker-compose up from the same folder where the above file is located. You should see something like the log below. It's the same as running jupyter notebook locally. Click the link at the end to access jupyter notebook.

```
Creating pyspark ... done
Attaching to pyspark
pyspark | WARNING: Jupyter Notebook deprecation notice https://github.com/jupyter/docker-stacks#jupyter-notebook-deprecation-notice.
pyspark | Entered start.sh with args: jupyter notebook
pyspark | /usr/local/bin/start.sh: running hooks in /usr/local/bin/before-notebook.d as uid / gid: 1000 / 100
pyspark | /usr/local/bin/start.sh: running script /usr/local/bin/before-notebook.d/spark-config.sh
pyspark | /usr/local/bin/start.sh: done running hooks in /usr/local/bin/before-notebook.d
pyspark | Executing the command: jupyter notebook
pyspark | [I 12:36:04.395 NotebookApp] Writing notebook server cookie secret to /home/jovyan/.local/share/jupyter/runtime/notebook_cookie_secret
pyspark | [W 2021-12-21 12:36:05.487 LabApp] 'ip' has moved from NotebookApp to ServerApp. This config will be passed to ServerApp. Be sure to update your config before our next release.
pyspark | [W 2021-12-21 12:36:05.488 LabApp] 'port' has moved from NotebookApp to ServerApp. This config will be passed to ServerApp. Be sure to update your config before our next release.
pyspark | [I 2021-12-21 12:36:05.497 LabApp] JupyterLab extension loaded from /opt/conda/lib/python3.9/site-packages/jupyterlab
pyspark | [I 2021-12-21 12:36:05.498 LabApp] JupyterLab application directory is /opt/conda/share/jupyter/lab
pyspark | [I 12:36:05.504 NotebookApp] Serving notebooks from local directory: /home/jovyan
pyspark | [I 12:36:05.504 NotebookApp] Jupyter Notebook 6.4.6 is running at:
pyspark | [I 12:36:05.504 NotebookApp] http://bd20652c22d3:8888/?token=408f2020435dfb489c8d2710736a83f7a3c7cd969b3a1629
pyspark | [I 12:36:05.504 NotebookApp] or http://127.0.0.1:8888/?token=408f2020435dfb489c8d2710736a83f7a3c7cd969b3a1629
pyspark | [I 12:36:05.504 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
pyspark | [C 12:36:05.509 NotebookApp]
pyspark |
pyspark |     To access the notebook, open this file in a browser:
pyspark |         file:///home/jovyan/.local/share/jupyter/runtime/nbserver-7-open.html
pyspark |     Or copy and paste one of these URLs:
pyspark |         http://bd20652c22d3:8888/?token=408f2020435dfb489c8d2710736a83f7a3c7cd969b3a1629
pyspark |      or http://127.0.0.1:8888/?token=408f2020435dfb489c8d2710736a83f7a3c7cd969b3a1629
```

This snippet ...
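Once the notebook is open, getting a spark session inside the container is the standard builder call; a minimal sketch (the app name and CSV file are arbitrary examples):

```python
from pyspark.sql import SparkSession

# Inside the jupyter/pyspark-notebook container, pyspark is preinstalled,
# so a local session is one builder call away.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("docker-demo")
    .getOrCreate()
)

# Files from the folder you mounted appear under /home/jovyan/work.
df = spark.read.csv("work/example.csv", header=True, inferSchema=True)  # hypothetical file
df.show()
```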

December 21, 2021 · 3 min · Karn Wong