Slim down python docker image size with poetry and pip

Python package management is not straightforward: the default package manager (pip) does not behave like Node’s npm, in the sense that it doesn’t track dependency versions. This is why you should use Poetry to manage Python packages: it creates a lock file, so you can be sure that every re-install resolves to the same versions. However, this poses a challenge when you want to build a Docker image with Poetry, because you need an extra pip install poetry step (unless you bake it into your base Python image). Additionally, it turns out that using Poetry to install packages results in a larger Docker image. ...

April 7, 2024 · 2 min · Karn Wong

Setting up Postgres locally, what could go wrong?

There are multiple reasons why someone might want to set up Postgres locally: to learn SQL, or to use it as an application’s backend. Over the years I have seen people struggle with using Postgres locally, so here are common use cases and possible issues, with solutions for each. For learning SQL: analysts commonly use SQL to access data in a database once the data outgrows Excel. However, SQL is a query language, not a database engine. This means that if you want to get familiar with SQL, there are simpler alternatives, such as SQLite or DuckDB (which can load data from local files directly, without an explicit data import). Plus, you don’t need authentication to use either of them! ...
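
As a minimal sketch of the SQLite route mentioned above, the snippet below practises a GROUP BY query using Python's built-in sqlite3 module. The sales table and its rows are made up for illustration; nothing here comes from the original post.

```python
import sqlite3

# An in-memory SQLite database: no server to install, no user or password.
conn = sqlite3.connect(":memory:")

# Hypothetical table and rows, purely for practising queries.
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("north", 120), ("south", 80), ("north", 50)],
)

# Plain SQL: total amount per region, sorted by region name.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(totals)

conn.close()
```

Swapping ":memory:" for a file path keeps the data between sessions, which is usually enough for learning purposes.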

December 23, 2023 · 3 min · Karn Wong

Things to watch out for GCP SSL with Cloudflare DNS

For our production workloads, we deploy on Kubernetes, where an ingress resource is created for each deployment. The resources behind each ingress are a GCP Load Balancer and an SSL Certificate. As for DNS, we use Cloudflare, since it enables a CDN without extra configuration on our part. A few months after the deployment first went live, we were informed that the website couldn’t be accessed. It turned out GCP couldn’t renew the SSL Certificate (error FAILED_NOT_VISIBLE). According to the GCP docs, if the domain doesn’t resolve to the Load Balancer IP, GCP can’t provision or renew a certificate. ...
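
The condition described above (the domain must resolve to the Load Balancer IP) can be checked by hand. The sketch below is one way to do that with Python's standard socket module; it is not GCP's own validation logic, and the function names are hypothetical helpers introduced here for illustration.

```python
import socket


def resolved_ips(hostname, resolver=socket.getaddrinfo):
    """Return the set of IP addresses that hostname currently resolves to."""
    # Each getaddrinfo entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] holds the IP address for both IPv4 and IPv6.
    infos = resolver(hostname, 443, proto=socket.IPPROTO_TCP)
    return {info[4][0] for info in infos}


def points_at_load_balancer(hostname, load_balancer_ip, resolver=socket.getaddrinfo):
    """True if the Load Balancer IP is among the addresses DNS returns."""
    return load_balancer_ip in resolved_ips(hostname, resolver)
```

Calling points_at_load_balancer with your own domain and the IP shown for the Load Balancer tells you whether public DNS currently returns that address. The resolver parameter exists only so the check can be exercised without network access.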

December 18, 2023 · 1 min · Karn Wong

Reduce operational costs with terraform

Background: Think of the websites you visit each day. Most likely they are hosted on a cloud provider such as AWS, GCP, or Azure. The good news is that it’s very easy to create a simple deployment with a virtual machine, but for scalable and high-availability workloads, the usual recommendation is to use a container-based runtime such as AWS ECS/EKS or GCP Cloud Run/GKE. These services also require more configuration than a simple VM deployment. ...

November 4, 2023 · 3 min · Karn Wong

Spark on Kubernetes

Background: For data processing tasks, there are different ways to go about it: using SQL to leverage a database engine to perform data transformation; dataframe-based frameworks such as pandas, ray, dask, and polars; or big data processing frameworks such as Spark. Check out this article for more info on the polars vs Spark benchmark. The problem: At larger data scale, the other solutions (besides Spark) can work, but only with a lot of vertical scaling, which can get very expensive. For comparison, our team had to scale a database to 4/16 GB and it still took the whole night, whereas Spark on a single node can process the data in 2 minutes flat. ...

September 12, 2023 · 4 min · Karn Wong