GCP service account credentials can be a security risk. Here's how to mitigate it.

If you look online, many sources will tell you to use a service account to authenticate with GCP services. While this is true, it's not true in every case: for local development, you should use Application Default Credentials. Imagine working in a team where you have to work with Cloud Run, so you request a service account from your infra team. This looks fine, but then your teammates also have to work with this service. They happen to be in a hurry, so you share your service account with them. This is a problem, because multiple users now have access to the same service account. It would be very tricky to trawl through the audit logs and identify which developer interacted with Cloud Run, because the system only sees a single identity. ...
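As a minimal sketch of the Application Default Credentials approach: after running `gcloud auth application-default login` locally, client code can pick up your personal identity without any service account key file. This assumes the `google-auth` package; the function name is illustrative.

```python
# Sketch: resolving Application Default Credentials in Python.
# Assumes google-auth is installed and ADC has been set up locally via
# `gcloud auth application-default login`.
def get_adc_credentials():
    # Import inside the function so the sketch can be read (and the module
    # loaded) without google-auth installed.
    import google.auth

    # google.auth.default() resolves credentials in order: the
    # GOOGLE_APPLICATION_CREDENTIALS env var, the gcloud ADC file,
    # then the metadata server when running on GCP compute.
    credentials, project_id = google.auth.default()
    return credentials, project_id
```

Because ADC resolves to each developer's own identity locally, audit logs show who actually made the call.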

July 14, 2024 · 2 min · Karn Wong

Thoughts on summarization service system design

A summarization task takes an input, in text format, and reduces it to a handful of paragraphs. The source content doesn't necessarily start as text, though: it can be audio or video files. This means that at some point the source input has to be converted into text, which involves a transcription task. Transcription means taking audio and converting it to text. Luckily, these days there are APIs you can use for this. Support varies by provider, but it's safe to assume most would accept WAVE or FLAC encoding. ...
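The audio-to-summary flow above can be sketched as a two-step pipeline. Here `transcribe` and `summarize` are hypothetical callables standing in for whichever transcription and summarization APIs you choose:

```python
# Sketch of the audio -> text -> summary pipeline described above.
# Both steps are injected as callables so any API provider can back them.
def summarize_media(audio_path, transcribe, summarize):
    # Step 1: transcription turns the audio file (e.g. WAVE or FLAC)
    # into plain text.
    transcript = transcribe(audio_path)
    # Step 2: the text is reduced to a handful of paragraphs.
    return summarize(transcript)
```

Keeping the two stages decoupled also makes it easy to swap providers or cache transcripts between runs.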

June 9, 2024 · 3 min · Karn Wong

Faster spark workloads with comet

For big data processing, Spark is still king. Over the years, many improvements have been made to Spark's performance. Databricks themselves created Photon, an engine that accelerates Spark queries, but it is proprietary to Databricks. Other alternatives do exist (see here for more details), but they are not trivial to set up. Apache Arrow DataFusion Comet, surprisingly, does not take much time to set up at all. Comet builds on Arrow, a data format growing in popularity. ...

April 7, 2024 · 2 min · Karn Wong

Slim down python docker image size with poetry and pip

Python package management is not straightforward: the default package manager (pip) does not behave like node's npm, in the sense that it doesn't track dependency versions. This is why you should use poetry to manage python packages, since it creates a lock file, so you can be sure that every re-install yields the same versions. However, this poses a challenge when you want to build a docker image with poetry, because you need an extra pip install poetry step (unless you bake this into your base python image). Additionally, it turns out that using poetry to install packages results in a larger docker image. ...
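One way to keep both the lock file and a small image is a multi-stage build: export poetry's lock file to a plain requirements file in a throwaway build stage, then pip-install it in the final stage. This is a hedged sketch, not the post's exact Dockerfile; image tags and paths are illustrative, and newer poetry versions need the `poetry-plugin-export` plugin for `poetry export`:

```dockerfile
# Build stage: poetry exists only here.
FROM python:3.12-slim AS builder
RUN pip install poetry poetry-plugin-export
COPY pyproject.toml poetry.lock ./
# Export the locked dependency versions to a pip-compatible file.
RUN poetry export -f requirements.txt --output requirements.txt

# Final stage: no poetry, just pip installing the exported lock file.
FROM python:3.12-slim
COPY --from=builder requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . /app
WORKDIR /app
```

The final image still gets exactly the versions pinned in `poetry.lock`, without carrying poetry itself.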

April 7, 2024 · 2 min · Karn Wong

Dataframe write performance to Postgres

Previously, I talked about dataframe performance, but that didn't cover writing data to a destination. At a large scale, big data means you need Spark for data processing (unless you prefer SQL, in which case this post is irrelevant). But not many orgs need big data, so small-data frameworks should work, since they are easier to set up and use than Spark. Initially I wanted to include pandas as well, but sadly it performs significantly worse than polars, so only Spark and polars remain in the benchmark. ...
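For reference, the polars side of such a benchmark is a one-liner via `write_database`. This is a hypothetical sketch, not the post's benchmark code: the table name and connection URI are placeholders, and since it assumes a running Postgres, the function is defined but not invoked here:

```python
# Sketch: writing a Polars DataFrame to Postgres with write_database.
# URI and table name are placeholders; requires a reachable Postgres.
def write_results(df, uri="postgresql://user:password@localhost:5432/benchmarks"):
    df.write_database(
        table_name="write_benchmark",   # placeholder table name
        connection=uri,
        if_table_exists="replace",      # overwrite previous benchmark runs
    )
```

Timing this call against Spark's JDBC writer (with comparable batch sizes) is what makes the comparison fair.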

March 17, 2024 · 2 min · Karn Wong