What SQL can't do for data engineering

I often hear people ask: “if you can do data engineering with SQL, what’s the point of learning Spark or Python?” Data ingestion Let’s step back a bit. I think we can all agree that there’s a point in time when there’s no data in the data warehouse (which SQL-only data engineers use as their base of operations). The source data could be anything from raw CSV/Excel files to API endpoints. With no data in the data warehouse, a data engineer can’t use SQL to do anything with it. ...
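To make the ingestion gap concrete, here is a minimal sketch (not from the post) of the kind of non-SQL glue code that has to run before SQL can take over, using Python's standard library with an in-memory sqlite3 database standing in for the warehouse. The CSV contents and table name are made up for illustration:

```python
import csv
import io
import sqlite3

# Hypothetical raw source data, standing in for a CSV file or API response
raw = io.StringIO("id,name\n1,alice\n2,bob\n")

# Stand-in for the data warehouse
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")

# The ingestion step: parse the source and load it into the warehouse
rows = [(int(r["id"]), r["name"]) for r in csv.DictReader(raw)]
conn.executemany("INSERT INTO users VALUES (?, ?)", rows)

# Only now does SQL have something to work with
count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```

Everything up to the `INSERT` is work SQL alone can't do; the final query is where SQL takes over.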

May 15, 2022 · 3 min · Karn Wong

Use SSH key during docker build without embedding the key via ssh-agent

Imagine working at a company that has a super cool internal module! The module works great, except that it is private, which means you need to install it from source by cloning the repo. That isn’t an issue when you work on your local machine. But for production, this usually means you somehow need to bundle this awesome module into your docker image. So you create a Dockerfile, and there’s one little problem: it can’t clone the module repo, because the build has no SSH key with access to the repo. ...
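A minimal sketch of the ssh-agent approach the title refers to, using Docker BuildKit's SSH mount so the key is forwarded for a single `RUN` step and never written into an image layer. The base image, org, and module names are placeholders:

```dockerfile
# syntax=docker/dockerfile:1
FROM python:3.11-slim

RUN apt-get update && apt-get install -y --no-install-recommends git openssh-client \
    && rm -rf /var/lib/apt/lists/*

# Trust github.com so the clone doesn't fail on host key verification
RUN mkdir -p -m 0700 ~/.ssh && ssh-keyscan github.com >> ~/.ssh/known_hosts

# BuildKit forwards the host's ssh-agent socket for this step only
RUN --mount=type=ssh pip install "git+ssh://git@github.com/your-org/awesome-module.git"
```

Build it with the agent socket exposed: `DOCKER_BUILDKIT=1 docker build --ssh default .`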

February 6, 2022 · 2 min · Karn Wong

Use pyspark locally with docker

For data that doesn’t fit into memory, spark is often the recommended solution, since it can use map-reduce to work with data in a distributed manner. However, setting up local spark development from scratch involves multiple steps, and is definitely not for the faint of heart. Thankfully, using docker means you can skip a lot of steps 😃

Instructions

1. Install Docker Desktop.
2. Create docker-compose.yml in a directory somewhere:

```yaml
version: "3.3"
services:
  pyspark:
    container_name: pyspark
    image: jupyter/pyspark-notebook:latest
    ports:
      - "8888:8888"
    volumes:
      - ./:/home/jovyan/work
```

3. Run docker-compose up from the same folder where the above file is located. You should see something like the output below (it's the same as running jupyter notebook locally). Click the link at the end to access jupyter notebook.

```
Creating pyspark ... done
Attaching to pyspark
pyspark | WARNING: Jupyter Notebook deprecation notice https://github.com/jupyter/docker-stacks#jupyter-notebook-deprecation-notice.
pyspark | Entered start.sh with args: jupyter notebook
pyspark | /usr/local/bin/start.sh: running hooks in /usr/local/bin/before-notebook.d as uid / gid: 1000 / 100
pyspark | /usr/local/bin/start.sh: running script /usr/local/bin/before-notebook.d/spark-config.sh
pyspark | /usr/local/bin/start.sh: done running hooks in /usr/local/bin/before-notebook.d
pyspark | Executing the command: jupyter notebook
pyspark | [I 12:36:04.395 NotebookApp] Writing notebook server cookie secret to /home/jovyan/.local/share/jupyter/runtime/notebook_cookie_secret
pyspark | [W 2021-12-21 12:36:05.487 LabApp] 'ip' has moved from NotebookApp to ServerApp. This config will be passed to ServerApp. Be sure to update your config before our next release.
pyspark | [W 2021-12-21 12:36:05.488 LabApp] 'port' has moved from NotebookApp to ServerApp. This config will be passed to ServerApp. Be sure to update your config before our next release.
pyspark | [I 2021-12-21 12:36:05.497 LabApp] JupyterLab extension loaded from /opt/conda/lib/python3.9/site-packages/jupyterlab
pyspark | [I 2021-12-21 12:36:05.498 LabApp] JupyterLab application directory is /opt/conda/share/jupyter/lab
pyspark | [I 12:36:05.504 NotebookApp] Serving notebooks from local directory: /home/jovyan
pyspark | [I 12:36:05.504 NotebookApp] Jupyter Notebook 6.4.6 is running at:
pyspark | [I 12:36:05.504 NotebookApp] http://bd20652c22d3:8888/?token=408f2020435dfb489c8d2710736a83f7a3c7cd969b3a1629
pyspark | [I 12:36:05.504 NotebookApp] or http://127.0.0.1:8888/?token=408f2020435dfb489c8d2710736a83f7a3c7cd969b3a1629
pyspark | [I 12:36:05.504 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
pyspark | [C 12:36:05.509 NotebookApp]
pyspark |
pyspark |     To access the notebook, open this file in a browser:
pyspark |         file:///home/jovyan/.local/share/jupyter/runtime/nbserver-7-open.html
pyspark |     Or copy and paste one of these URLs:
pyspark |         http://bd20652c22d3:8888/?token=408f2020435dfb489c8d2710736a83f7a3c7cd969b3a1629
pyspark |      or http://127.0.0.1:8888/?token=408f2020435dfb489c8d2710736a83f7a3c7cd969b3a1629
```

This snippet ...

December 21, 2021 · 3 min · Karn Wong

Reduce docker image size with alpine

Creating scripts is easy. But creating a small docker image is not 😅. Not all Linux flavors are created equal; some are bigger than others. This difference is crucial when it comes to reducing docker image size.

A simple bash script docker image

Given a Dockerfile (change apk to apt for ubuntu):

```dockerfile
FROM alpine:3

WORKDIR /app

RUN apk update && apk add jq curl

COPY water-cut-notify.sh ./

ENTRYPOINT ["sh", "/app/water-cut-notify.sh"]
```

| Base image | Docker image size |
|------------|-------------------|
| alpine     | 11.1MB            |
| ubuntu     | 122MB             |

The ubuntu image is roughly 11x the size of the alpine one! ...
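Quick arithmetic (not from the post) to check the size comparison between the two base images:

```python
alpine_mb = 11.1
ubuntu_mb = 122.0

# How many times bigger the ubuntu-based image is
ratio = ubuntu_mb / alpine_mb
print(f"ubuntu image is {ratio:.1f}x the size of the alpine one")
```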

December 19, 2021 · 1 min · Karn Wong

Secrets management with SOPS, AWS Secrets Manager and Terraform

Correction 2023-07-06: I only recently realized SSM and Secrets Manager are not the same. At my organization we use sops to check encrypted secrets into git repos. This solves the problem of plaintext credentials in version control. However, say you have 5 repos using the same database credentials: rotating the secrets means you have to go into each repo and update the SOPS credentials manually. Also worth noting that, for GitHub Actions, authenticating with AWS means you have to add repo secrets. So for every repo with CI enabled, you have to populate the repo secrets with AWS credentials. When the time comes to rotate the creds, you’ll run into the same situation as above. ...
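A hedged sketch of the direction this implies: the shared credential lives once in AWS Secrets Manager, and each repo's Terraform reads it at plan time instead of carrying its own SOPS copy, so rotation happens in one place. The secret name and JSON layout are hypothetical:

```hcl
# Hypothetical secret name; created and rotated in one place
data "aws_secretsmanager_secret_version" "db_credentials" {
  secret_id = "prod/db-credentials"
}

locals {
  # Assumes the secret value is a JSON object like {"username": ..., "password": ...}
  db_creds = jsondecode(data.aws_secretsmanager_secret_version.db_credentials.secret_string)
}

# Consumers then reference local.db_creds.username / local.db_creds.password
```

Rotating the credential in Secrets Manager is then picked up by every repo on its next apply, with no per-repo edits.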

November 30, 2021 · 4 min · Karn Wong