Data Engineering

What SQL can't do for data engineering

I often hear people ask “if you can do data engineering with SQL, then what’s the point of learning spark or python?” Data ingestion Let’s circle back at bit. I think we all can agree that: there’s a point in time where there’s no data in the data warehouse (which DE-who-use-SQL’s use as base of operation). The source data could be anything from hard CSV/Excel or API endpoints. No data in data warehouse, DE can’t use SQL to do stuff with the data....

Use pyspark locally with docker

For data that doesn’t fit into memory, spark is often a recommended solution, since it can utilize map-reduce to work with data in a distributed manner. However, setting up local spark development from scratch involves multiple steps, and definitely not for a faint of heart. Thankfully using docker means you can skip a lot of steps 😃 Instructions Install Docker Desktop Create docker-compose.yml in a directory somewhere version: "3.3" services: pyspark: container_name: pyspark image: jupyter/pyspark-notebook:latest ports: - "8888:8888" volumes: - ....

Don't write large table to postgres with pandas

We have a few tables where the data size is > 3GB (in parquet, so around 10 GB uncompressed). Loading it into postgres takes an hour. (Most of our tables are pretty small, hence the reason why we don’t use columnar database). I want to explore whether there’s a faster way or not. The conclusion is writing to postgres with spark seems to be fastest, given we can’t use COPY since our data contain free text, which means it would make CSV parsing impossible....

Data engineering toolset (that I use) glossary

Big data Spark: Map-reduce framework for dealing with big data, especially for data that doesn’t fit into memory. Utilizes parallelization. Cloud AWS: Cloud platform for many tools used in software engineering. AWS Fargate: A task launch mode for ECS task, where it automatically shuts down once a container exits. With EC2 launch mode, you’ll have to turn off the machine yourself. AWS Lambda: Serverless function, can be used with docker image too....

Shapefile to data lake

Background: we use spark to read/write to data lake. For dealing with spatial data & analysis, we use sedona. Shapefile is converted to TSV then read by spark for further processing & archival. Recently I had to archive shapefiles in our data lake. It wasn’t rosy for the following reasons: Invalid geometries Sedona (and geopandas too) whines if it encounters invalid geometry during geometry casting. The invalid geometries could be from many reasons, one of them being unclean polygon clipping....