What SQL can't do for data engineering

I often hear people ask, “if you can do data engineering with SQL, then what’s the point of learning Spark or Python?”

Data ingestion

Let’s circle back a bit. I think we can all agree that there’s a point in time where there’s no data in the data warehouse (which SQL-only DEs use as their base of operations). The source data could be anything from CSV/Excel files to API endpoints. With no data in the data warehouse, a DE can’t use SQL to do anything with it. ...
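For illustration, here is a minimal sketch of the kind of ingestion step SQL alone can’t cover: landing a CSV file and an API response in a warehouse with Python. The file path, endpoint URL, table names, and connection string are all hypothetical, not from the post.

```python
# Hypothetical ingestion sketch: land raw CSV / API data in the warehouse
# so that SQL-based transformations have something to work with.
import pandas as pd
import requests
from sqlalchemy import create_engine

# hypothetical warehouse connection string
engine = create_engine("postgresql://user:password@localhost:5432/warehouse")

# 1. flat file (CSV/Excel) -> warehouse
df_csv = pd.read_csv("exports/sales.csv")  # hypothetical path
df_csv.to_sql("raw_sales", engine, if_exists="replace", index=False)

# 2. API endpoint -> warehouse (assumes the endpoint returns a JSON list of records)
resp = requests.get("https://api.example.com/v1/orders")  # hypothetical endpoint
df_api = pd.DataFrame(resp.json())
df_api.to_sql("raw_orders", engine, if_exists="replace", index=False)
```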

May 15, 2022 · 3 min · Karn Wong

Use pyspark locally with docker

For data that doesn’t fit into memory, spark is often a recommended solution, since it can utilize map-reduce to work with data in a distributed manner. However, setting up local spark development from scratch involves multiple steps, and is definitely not for the faint of heart. Thankfully, using docker means you can skip a lot of steps 😃

Instructions

1. Install Docker Desktop.

2. Create docker-compose.yml in a directory somewhere:

    version: "3.3"
    services:
      pyspark:
        container_name: pyspark
        image: jupyter/pyspark-notebook:latest
        ports:
          - "8888:8888"
        volumes:
          - ./:/home/jovyan/work

3. Run docker-compose up from the same folder where the above file is located. You should see something like the output below. It’s the same as running jupyter notebook locally. Click the link at the end to access jupyter notebook.

    Creating pyspark ... done
    Attaching to pyspark
    pyspark | WARNING: Jupyter Notebook deprecation notice https://github.com/jupyter/docker-stacks#jupyter-notebook-deprecation-notice.
    pyspark | Entered start.sh with args: jupyter notebook
    pyspark | /usr/local/bin/start.sh: running hooks in /usr/local/bin/before-notebook.d as uid / gid: 1000 / 100
    pyspark | /usr/local/bin/start.sh: running script /usr/local/bin/before-notebook.d/spark-config.sh
    pyspark | /usr/local/bin/start.sh: done running hooks in /usr/local/bin/before-notebook.d
    pyspark | Executing the command: jupyter notebook
    pyspark | [I 12:36:04.395 NotebookApp] Writing notebook server cookie secret to /home/jovyan/.local/share/jupyter/runtime/notebook_cookie_secret
    pyspark | [W 2021-12-21 12:36:05.487 LabApp] 'ip' has moved from NotebookApp to ServerApp. This config will be passed to ServerApp. Be sure to update your config before our next release.
    pyspark | [W 2021-12-21 12:36:05.488 LabApp] 'port' has moved from NotebookApp to ServerApp. This config will be passed to ServerApp. Be sure to update your config before our next release.
    pyspark | [I 2021-12-21 12:36:05.497 LabApp] JupyterLab extension loaded from /opt/conda/lib/python3.9/site-packages/jupyterlab
    pyspark | [I 2021-12-21 12:36:05.498 LabApp] JupyterLab application directory is /opt/conda/share/jupyter/lab
    pyspark | [I 12:36:05.504 NotebookApp] Serving notebooks from local directory: /home/jovyan
    pyspark | [I 12:36:05.504 NotebookApp] Jupyter Notebook 6.4.6 is running at:
    pyspark | [I 12:36:05.504 NotebookApp] http://bd20652c22d3:8888/?token=408f2020435dfb489c8d2710736a83f7a3c7cd969b3a1629
    pyspark | [I 12:36:05.504 NotebookApp] or http://127.0.0.1:8888/?token=408f2020435dfb489c8d2710736a83f7a3c7cd969b3a1629
    pyspark | [I 12:36:05.504 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
    pyspark | [C 12:36:05.509 NotebookApp]
    pyspark |
    pyspark |     To access the notebook, open this file in a browser:
    pyspark |         file:///home/jovyan/.local/share/jupyter/runtime/nbserver-7-open.html
    pyspark |     Or copy and paste one of these URLs:
    pyspark |         http://bd20652c22d3:8888/?token=408f2020435dfb489c8d2710736a83f7a3c7cd969b3a1629
    pyspark |      or http://127.0.0.1:8888/?token=408f2020435dfb489c8d2710736a83f7a3c7cd969b3a1629

This snippet ...
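The excerpt cuts off before the post’s own snippet, but as a hedged sketch of what you can run once the notebook is up: the host directory containing docker-compose.yml is mounted at /home/jovyan/work inside the container, so any file you drop there is readable by spark. The CSV file name below is hypothetical.

```python
# Minimal sketch (assumed, not the post's original snippet): run inside the
# jupyter/pyspark-notebook container, where pyspark is already installed.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("local-dev").getOrCreate()

# ./ on the host is mounted at /home/jovyan/work, so "work/sample.csv"
# refers to a hypothetical file sitting next to docker-compose.yml.
df = spark.read.csv("work/sample.csv", header=True, inferSchema=True)
df.printSchema()
df.show(5)
```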

December 21, 2021 · 3 min · Karn Wong

Don't write large table to postgres with pandas

We have a few tables where the data size is > 3 GB (in parquet, so around 10 GB uncompressed). Loading one into postgres takes an hour. (Most of our tables are pretty small, hence why we don’t use a columnar database.) I wanted to explore whether there’s a faster way. The conclusion is that writing to postgres with spark seems to be the fastest, given that we can’t use COPY: our data contains free text, which makes CSV parsing impossible. ...
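As a hedged sketch of the spark-to-postgres write the post settles on, assuming the postgres JDBC driver is pulled in via spark.jars.packages; the source path, connection details, table name, and driver version are placeholders, not the post’s.

```python
# Sketch (hypothetical paths, credentials, and table names; driver version
# is illustrative). Spark writes over JDBC in batched inserts across
# partitions, which avoids pandas' slow row-by-row loading and sidesteps
# COPY's CSV-parsing issues with free text.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("parquet-to-postgres")
    .config("spark.jars.packages", "org.postgresql:postgresql:42.2.18")
    .getOrCreate()
)

df = spark.read.parquet("s3a://bucket/big_table/")  # hypothetical source

(
    df.write.format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/warehouse")
    .option("dbtable", "public.big_table")
    .option("user", "postgres")
    .option("password", "postgres")
    .option("driver", "org.postgresql.Driver")
    .option("batchsize", 10000)
    .mode("overwrite")
    .save()
)
```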

June 27, 2021 · 1 min · Karn Wong

Data engineering toolset (that I use) glossary

Big data
Spark: Map-reduce framework for dealing with big data, especially data that doesn’t fit into memory. Utilizes parallelization.

Cloud
AWS: Cloud platform for many tools used in software engineering.
AWS Fargate: A launch mode for ECS tasks, where the machine automatically shuts down once a container exits. With the EC2 launch mode, you have to turn off the machine yourself.
AWS Lambda: Serverless functions; can be used with a docker image too. Can also be hooked up to API Gateway to act as an API endpoint.
AWS RDS: Managed databases from AWS.
ECS Task: Launches a task in an ECS cluster. For long-running services, launch via EC2. For small periodic tasks, trigger via CloudWatch, which works like a cron schedule for a task: at the specified time, it runs a predefined docker image (you should configure your entrypoint.sh accordingly).

Data
Parquet: Columnar data blob format, very efficient due to column-based compression, with the schema definition baked in.

Data engineering
Dagster: Task orchestration framework with built-in pipeline validation.
ETL: Stands for extract-transform-load. Essentially it means “moving data from A to B, with optional data wrangling in the middle.”

Data science
NLP: Using machines (computers) to work on human languages, for instance analyzing whether a message is positive or negative.

Data wrangling
Pandas: Dataframe wrangler; think of it as programmable Excel.

Database
Postgres: RDBMS with good performance.

DataOps
Great Expectations: A framework for data validation.

DevOps
Docker: Virtualization via containers.
Git: Version control.
Kubernetes: Container orchestration system.
Terraform: Infrastructure-as-code tool; essentially you use it to store a blueprint of your infra setup. If you were to move to another account, you could re-conjure the existing infra with one command. This makes editing infra config easier too, since it cleans up and updates config automatically.

GIS
PostGIS: GIS extension for Postgres.

MLOps
MLflow: A framework to track model parameters and output. Can store model artifacts as well.

Notebook
Jupyter: Python notebook, used for exploring solutions before converting them to .py.
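To ground a few of the entries above (ETL, Parquet, Pandas, Postgres), here is a toy sketch of an extract-transform-load step; all paths, column names, and connection details are made up.

```python
# Toy sketch (hypothetical paths/tables) of the ETL definition above:
# extract from a source, optionally wrangle, load somewhere else.
import pandas as pd
from sqlalchemy import create_engine

# extract: read a parquet blob (column-compressed, schema baked in);
# requires pyarrow or fastparquet to be installed
df = pd.read_parquet("landing/events.parquet")  # hypothetical path

# transform: optional wrangling in the middle
df["event_date"] = pd.to_datetime(df["event_ts"]).dt.date

# load: push the result into postgres
engine = create_engine("postgresql://user:password@localhost:5432/warehouse")
df.to_sql("events_clean", engine, if_exists="replace", index=False)
```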

June 4, 2021 · 2 min · Karn Wong

Shapefile to data lake

Background: we use spark to read/write to the data lake. For dealing with spatial data & analysis, we use sedona. Shapefiles are converted to TSV, then read by spark for further processing & archival.

Recently I had to archive shapefiles in our data lake. It wasn’t rosy, for the following reasons:

Invalid geometries

Sedona (and geopandas too) whines if it encounters an invalid geometry during geometry casting. The invalid geometries could arise for many reasons, one of them being unclean polygon clipping. ...
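Not the post’s exact pipeline, but a hedged geopandas sketch of the shapefile-to-TSV step, plus one common way to detect and repair invalid geometries before spark/sedona sees them; the file names are placeholders.

```python
# Sketch (hypothetical file names): convert a shapefile to TSV with WKT
# geometry, repairing invalid polygons first so sedona/geopandas won't
# complain during geometry casting.
import geopandas as gpd
import pandas as pd

gdf = gpd.read_file("input/admin_boundaries.shp")  # hypothetical shapefile

# flag invalid geometries (e.g. self-intersections from unclean clipping)
invalid = ~gdf.geometry.is_valid
print(f"invalid geometries: {invalid.sum()}")

# buffer(0) is a common (not universal) trick for repairing invalid polygons
gdf.loc[invalid, "geometry"] = gdf.loc[invalid, "geometry"].buffer(0)

# serialize geometry as WKT so the TSV is plain text that spark can read
out = pd.DataFrame(gdf.drop(columns="geometry"))
out["geometry"] = gdf.geometry.apply(lambda g: g.wkt)
out.to_csv("output/admin_boundaries.tsv", sep="\t", index=False)
```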

April 23, 2021 · 2 min · Karn Wong