Spark on Kubernetes

Background: For data processing tasks, there are several approaches: using SQL to leverage a database engine for data transformation; dataframe-based frameworks such as pandas, Ray, Dask, and Polars; and big data processing frameworks such as Spark. Check out this article for more on the Polars vs. Spark benchmark. The problem: At larger data scales, solutions other than Spark can work, but only with a lot of vertical scaling, which can get very expensive. For comparison, our team had to scale a database to 4/16 GB and it still took the whole night, whereas Spark on a single node processed the data in 2 minutes flat. ...

September 12, 2023 · 4 min · Karn Wong

Data Engineering Resources

Note: if you’ve seen the list elsewhere, it was probably me. I first posted this list on Data Engineering Discord and Data Engineer Cafe.

Books

Data fundamentals (good entrypoint):
  Fundamentals of Data Engineering - Joe Reis & Matt Housley
  Seven Databases in Seven Weeks - Luc Perkins & Eric Redmond & Jim Wilson
  Designing Data-Intensive Applications - Martin Kleppmann
  The Data Warehouse Toolkit - Ralph Kimball & Margy Ross
  Data Science for Business - Foster Provost & Tom Fawcett
  Practical Statistics for Data Scientists - Peter Gedeck & Peter Bruce & Andrew Bruce

Software engineering:
  Python Crash Course - Eric Matthes
  The Pragmatic Programmer - Andrew Hunt & David Thomas

Platform:
  Terraform: Up & Running - Yevgeniy Brikman

Management:
  Team Topologies - Matthew Skelton & Manuel Pais
  Radical Candor - Kim Scott
  Data Teams - Jesse Anderson
  Practical DataOps - Harvinder Atwal

Resources

  https://brendanthompson.com/posts/2021/11/my-terraform-development-workflow
  https://www.terraform-best-practices.com/
  https://github.com/open-guides/og-aws
  https://awesomedataengineering.com/
  https://github.com/opendatadiscovery/awesome-data-catalogs
  https://github.com/datastacktv/data-engineer-roadmap
  https://www.moderndatastack.xyz/stacks
  https://www.secoda.co/glossary
  https://www.gentlydownthe.stream/
  https://b-greve.gitbook.io/beginners-guide-to-clean-data/

September 9, 2023 · 1 min · Karn Wong

A Networking God Tale: All I Want is to Run a Speedtest Behind a Firewall

Imagine going to your client’s site to deploy software. During the deployment, you notice that the connection is atrociously slow. You suspect that your client’s network bandwidth is the issue. To test this theory, you go to a speedtest website and try to run a test. Turns out you can’t, because it’s blocked at the firewall level. Then you try another speedtest website; oops, still blocked. Then you try a few more, still no dice. ...

August 27, 2023 · 2 min · Karn Wong

Spatial data to QGIS server playbook (yes, this is for prod)

Some of you might be familiar with GeoServer for serving spatial data as consumable WMS/WFS layers. The issue is that, as far as I know, you have to upload assets and specify styles manually. The tool is also a bit dated. One modern alternative is QGIS Server: you can find pre-made Docker images online, and it syncs with the desktop version. The good thing about QGIS Server is that you can create a QGIS project in the desktop application, then upload it wholesale to a Postgres instance that serves as the QGIS Server backend. ...

August 10, 2023 · 2 min · Karn Wong

Create Kubernetes service accounts with Terraform

Sometimes you’ll have to grant other people (or entities) access to your Kubernetes cluster. The easiest way is to give them your admin credentials, but this is like giving your house key to a friend who only needs access to your living room. Instead, you can give them different keys, depending on the access level required. Those could be a read-only account that can view service status, or a deploy service account that can create/update services. ...

August 1, 2023 · 3 min · Karn Wong