Mongodb export woes
There's a task where I need to export 4M+ records out of mongodb, total uncompressed size is 17GB+ 26GB
export methods
mongoexport
The recommended way to export is using mongoexport
utility, but you have to specify the output attributes, which doesn't work for me because the schema from older set of records are less than the newer set
DIY python script
the vanilla way
But you can interact with mongodb from python, and if you read from it it'll return a dict, which is perfect for this because you don't have to specify the required attributes beforehand. So what I do is:
=
=
The cons for this solution is it needs a lot of hdd space since it's uncompressed. But it works best if you need to export a collection with mismatched schema.
the incremental export way
You can also incrementally export your collection from mongodb using .skip($START_INDEX).limit($INCREMENT_SIZE)
, but it performs worse than the vanilla way, since what mongodb does is just iterating through everything all over again to get to your specified start:end
index.
Performance comparison
On my local machine (<10 MB/s transfer speed) I could export a collection with around 4.5M records within 1 hour, but on a VPS with incremental export it takes 9 hours and counting.
Takeaway
Please do not store a large dataset in mongodb where you need to dump everything out, especially if you use it as a raw data source. It's fine if you store prepped output for API to be queried via _id
(primary key).