Thoughts on summarization service system design

A summarization task takes an input and reduces it to a handful of paragraphs. This input is in text format, but you don't necessarily start from text, since the source content can be audio or video files. In that case the source has to be converted into text first, which is a transcription task: taking audio and converting it to text. Luckily, these days there are APIs you can use to achieve this. Support varies by provider, but it's safe to assume most would accept WAVE or FLAC encoding. ...
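
A minimal sketch of the transcription step, assuming OpenAI's hosted Whisper API as the provider (the file name and model are placeholders; any provider that accepts WAVE/FLAC would slot in the same way):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# WAV (and FLAC) inputs are accepted by most providers, including this one
with open("meeting.wav", "rb") as audio_file:  # hypothetical input file
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# the resulting text is what gets fed into the downstream summarization step
print(transcript.text)
```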

June 9, 2024 · 3 min · Karn Wong

Faster spark workloads with comet

For big data processing, Spark is still king. Over the years, many improvements have been made to Spark's performance. Databricks themselves created Photon, an engine that accelerates Spark queries, but it is proprietary to Databricks. Other alternatives do exist (see here for more details), but they are not trivial to set up. Apache Arrow DataFusion Comet, surprisingly, takes very little time to set up at all. Comet is built on Arrow, a data format growing in popularity. ...
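
A minimal sketch of what "not much time to set up" looks like, assuming you've downloaded a Comet jar locally; the jar path is a placeholder and the config keys follow the Comet docs at the time of writing, so they may change between releases:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("comet-demo")
    .config("spark.jars", "/opt/jars/comet-spark-spark3.4_2.12-0.1.0.jar")  # hypothetical path
    .config("spark.plugins", "org.apache.spark.CometPlugin")  # registers Comet with Spark
    .config("spark.comet.enabled", "true")
    .config("spark.comet.exec.enabled", "true")
    .getOrCreate()
)

# accelerated scans/operators show up as Comet* nodes in the physical plan
spark.read.parquet("data.parquet").groupBy("some_column").count().explain()
```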

April 7, 2024 · 2 min · Karn Wong

Slim down python docker image size with poetry and pip

Python package management is not straightforward: the default package manager (pip) does not behave like Node's npm, in the sense that it doesn't pin dependency versions. This is why you should use Poetry to manage Python packages; it creates a lock file, so you can be sure that every re-install resolves the same versions. However, this poses a challenge when you want to build a Docker image with Poetry, because you need an extra pip install poetry step (unless you bake this into your base Python image). Additionally, it turns out that using Poetry to install packages results in a larger Docker image. ...
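
A rough sketch of the workaround, assuming a multi-stage build where Poetry only produces the lock-derived requirements file and pip does the actual install (base image tags and file names are placeholders; `poetry export` is built into Poetry 1.x but a plugin in newer versions):

```dockerfile
# stage 1: Poetry resolves the lock file into a plain requirements file
FROM python:3.11-slim AS builder

RUN pip install poetry  # the extra install step mentioned above
COPY pyproject.toml poetry.lock ./
RUN poetry export -f requirements.txt --output requirements.txt --without-hashes

# stage 2: pip installs the exact locked versions; Poetry never enters the final image
FROM python:3.11-slim

COPY --from=builder requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "main.py"]
```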

April 7, 2024 · 2 min · Karn Wong

Dataframe write performance to Postgres

Previously, I talked about dataframe performance, but that didn't cover writing data to a destination. At a large enough scale, big data means you need Spark for data processing (unless you prefer SQL, in which case this post is irrelevant). But not many orgs need big data, so small-data frameworks should work, since they are easier to set up and use than Spark. Initially I wanted to include pandas as well, but sadly it performs significantly worse than Polars, so only Spark and Polars remain in the benchmark. ...
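
For reference, a minimal sketch of the Polars side of such a benchmark, assuming a local Postgres and the ADBC driver installed (the connection string and table name are placeholders):

```python
import polars as pl

# requires: pip install polars adbc-driver-postgresql
uri = "postgresql://user:password@localhost:5432/mydb"  # hypothetical connection string

df = pl.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# ADBC is generally the faster engine choice for bulk writes vs. SQLAlchemy
df.write_database(
    table_name="benchmark_table",
    connection=uri,
    engine="adbc",
    if_table_exists="replace",
)
```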

March 17, 2024 · 2 min · Karn Wong

How to connect to Cloud SQL from Cloud Run (no, you don't need a VPC)

A minimal application architecture consists of a database and an application backend. Serverless databases are still in their infancy, but thankfully container-based runtimes are very much alive and doing well. On GCP, a serverless container-based runtime does exist, known as Cloud Run. Standard database access pattern: per standard security practices, you should not expose your database to the public, which means you should use a proxy/tunnel or a private network to reach it. ...
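
A minimal sketch of the no-VPC approach, assuming the Cloud Run service was deployed with `--add-cloudsql-instances=PROJECT:REGION:INSTANCE`, which mounts a unix socket inside the container (the instance name, credentials, and pg8000 driver choice are placeholders):

```python
import sqlalchemy

INSTANCE_CONNECTION_NAME = "my-project:asia-southeast1:my-instance"  # hypothetical

# requires: pip install sqlalchemy pg8000
engine = sqlalchemy.create_engine(
    sqlalchemy.engine.url.URL.create(
        drivername="postgresql+pg8000",
        username="app_user",
        password="change-me",
        database="app_db",
        # Cloud Run exposes the instance at /cloudsql/<instance connection name>
        query={"unix_sock": f"/cloudsql/{INSTANCE_CONNECTION_NAME}/.s.PGSQL.5432"},
    )
)

with engine.connect() as conn:
    print(conn.execute(sqlalchemy.text("SELECT 1")).scalar())
```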

February 10, 2024 · 3 min · Karn Wong