Spark join OOM fix

I have a big pipeline where one step performs a cross join on 130K × 7K records. It fails quite often, and I have to pray to the Rice God for it to pass. Today I found the solution: repartition before the cross join. The root cause is that the dataframe with 130K records has only 6 partitions, so when I perform the cross join (one-to-many) it works against those 6 partitions. The total output in parquet is around 350MB, which means my computer (8 cores, 10GB RAM provisioned for Spark) needs to be able to hold all the uncompressed data in memory....
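A minimal sketch of the fix in PySpark, assuming hypothetical input paths and a partition count of 200 (tune to your cluster; the post only establishes that 6 partitions is too few):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("crossjoin-repartition").getOrCreate()

# Hypothetical inputs standing in for the 130K-row and 7K-row dataframes.
big = spark.read.parquet("path/to/big")      # ~130K rows, only 6 partitions
small = spark.read.parquet("path/to/small")  # ~7K rows

# Spreading the big side over more partitions before the cross join means
# each task materializes a much smaller slice of the output at a time.
result = big.repartition(200).crossJoin(small)

result.write.mode("overwrite").parquet("path/to/output")
```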

April 11, 2021 · 1 min · Karn Wong

Workarounds for archiving large shapefiles in a data lake

If you work with spatial data, chances are you are familiar with shapefiles, a file format for viewing / editing spatial data. Essentially, a shapefile is just tabular data like a CSV, but it stores the geometry data type natively, so any GIS tool like QGIS or ArcGIS can understand it right away. If you have a CSV file with a geometry column in WKT format (something like POLYGON ((30 10, 40 40, 20 40, 10 20, 30 10))), you'll have to specify which column is to be used for geometry....
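As a rough sketch of that explicit step, assuming geopandas and shapely, a hypothetical polygons.csv whose WKT column is named geometry, and an assumed EPSG:4326 CRS:

```python
import geopandas as gpd
import pandas as pd
from shapely import wkt

# Hypothetical CSV whose "geometry" column holds WKT strings such as
# "POLYGON ((30 10, 40 40, 20 40, 10 20, 30 10))".
df = pd.read_csv("polygons.csv")

# Unlike a shapefile, the geometry column has to be parsed explicitly.
df["geometry"] = df["geometry"].apply(wkt.loads)
gdf = gpd.GeoDataFrame(df, geometry="geometry", crs="EPSG:4326")

# From here any GIS tool can consume it, e.g. via GeoPackage.
gdf.to_file("polygons.gpkg", driver="GPKG")
```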

January 31, 2021 · 2 min · Karn Wong

MongoDB export woes

There’s a task where I need to export 4M+ records out of MongoDB; total uncompressed size is 17GB+ / 26GB.

Export methods

mongoexport

The recommended way to export is the mongoexport utility, but you have to specify the output attributes, which doesn’t work for me because the schema of the older set of records has fewer attributes than the newer set.

DIY Python script, the vanilla way

But you can interact with MongoDB from Python, and if you read from it, it returns a dict, which is perfect for this because you don’t have to specify the required attributes beforehand....
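A minimal sketch of that vanilla approach, assuming pymongo and hypothetical connection, database, and collection names:

```python
from pymongo import MongoClient
from bson import json_util

# Hypothetical connection string and database/collection names.
client = MongoClient("mongodb://localhost:27017")
collection = client["mydb"]["mycollection"]

# Each document comes back as a dict, so records with differing schemas
# need no upfront attribute list: dump whatever keys each one has.
with open("export.jsonl", "w") as f:
    for doc in collection.find():
        f.write(json_util.dumps(doc) + "\n")
```

Using bson.json_util instead of the stdlib json module matters here, since it knows how to serialize BSON types like ObjectId and datetimes that plain json would reject.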

January 27, 2021 · 2 min · Karn Wong