Spark join OOM fix
I have a big pipelines where one step performs crossjoin on 130K x 7K. It fails quite often, and I have to pray to the Rice God for it to pass. Today I found the solution: repartition before crossjoin. The root cause is that the dataframe with 130K records has 6 partitions, so when I perform crossjoin (one-to-many) it’s working against those 6 partitions. Total output in parquet is around 350MB, which means my computer (8 cores, 10GB RAM provisioned for spark) needs to be able to hold all uncompressed data in memory....