Karn Wong

Use SQL against CSV (or other hard files) without CLI

CSV as a file format is very versatile: almost any program can parse it. The only issue is that you can’t use SQL against CSV files directly. This is a major pain point, since using SQL is so much faster than firing up a Jupyter notebook and wrangling the data in Python, or opening Excel and applying transformations until you get the desired results. But the question is: how do we use SQL against CSV files in the first place? Many people want this to become a reality, so a few tools exist on GitHub: ...
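To make the idea of querying a CSV with SQL concrete, here is a minimal sketch using only Python's standard library. It is not one of the tools the post goes on to list; it assumes a small CSV with a header row, and the helper name `query_csv`, the table name `data`, and the sample rows are all made up for illustration.

```python
import csv
import io
import sqlite3


def query_csv(csv_text, sql, table="data"):
    """Load CSV text into an in-memory SQLite table and run one SQL query.

    Hypothetical helper for illustration: assumes the first row is a header
    and that the whole file fits comfortably in memory.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, records = rows[0], rows[1:]
    conn = sqlite3.connect(":memory:")
    try:
        # Columns are created without types, so every value is stored as text.
        columns = ", ".join(f'"{name}"' for name in header)
        placeholders = ", ".join("?" for _ in header)
        conn.execute(f'CREATE TABLE "{table}" ({columns})')
        conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', records)
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


# Made-up sample data standing in for a real CSV file.
SAMPLE = "item,price\napple,3\nbanana,2\napple,5\n"

# Values are text, so cast explicitly before aggregating.
result = query_csv(
    SAMPLE,
    "SELECT item, SUM(CAST(price AS INTEGER)) FROM data GROUP BY item ORDER BY item",
)
print(result)
```

For a file on disk, the same helper could be fed `open(path, newline="").read()`. Dedicated tools usually avoid the load step and infer column types, which this sketch does not attempt.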

DevX starts at your local machine

Platform engineering is all the rage these days. You’ll often hear this term alongside the keyword DevX. How are they related? Imagine you are working on a microservice backend. You are just starting out, so you don’t have many features to work on yet. As a PoC, you only need to [fetch data] and [return aggregated price]. You can do microservices on Kubernetes, but you are not familiar with DevOps, so you turn to a cloud provider: AWS. ...

The mythical ChatOps in action

Imagine having multiple services running, each with its own logs. Most people don’t read them, and they shouldn’t, because services emit a lot of logs! But we need them, because they are the only way to diagnose and troubleshoot system errors. You might say, “My service is not a system! It’s only doing tiny stuff!” Gotta break it to you: your small part is one piece in a large network of systems stitched together! So your seemingly tiny service is also important! ...

DuckDB vs Polars vs Spark!

I think everyone who has worked with data, in any role or function, has used pandas 🐼 at some point. I first used pandas in 2017, so it’s been 6 years already. Things have come a long way, and so has the size of the data I’m working with! Pandas has its own issues, namely no native support for nested schemas. In addition, it’s very heavy-handed regarding data type inference. That can be a blessing, but it’s a bane for data engineering work, where you have to make sure that your data conforms to an agreed-upon schema (hello data contracts!). But the worst issue? Pandas can’t open data that doesn’t fit into memory. So if you have a machine with 16 GB of RAM, you can’t read 12 GB of data with pandas 😭. ...

Kubernetes with Grafana Cloud

Kubernetes is awesome. I think this is obvious if you have more than a handful of services to manage. If you use a cloud provider, with either a VM or a container-based runtime, it gives you a dashboard to see the metrics. But what about Kubernetes? Since you would have multiple services inside a single cluster, which is itself backed by VMs, at best you would only see your VMs’ metrics, not separate metrics for each service. ...