I’ve been a fan of GNU Parallel for a while but until recently have only used it occasionally. That’s a shame, because it’s often the simplest solution for quickly solving embarrassingly parallel problems.
My recent usage of it has centered around database export/import operations where I have a file that contains a list of primary keys and need to fetch the matching rows from some number of tables and do something with the data. The database servers are sufficiently powerful that I can run N copies of my script to get the job done far faster (where N is a value like 10 or 20).
A typical usage might look like this:
cat ids.txt | parallel -j24 --max-lines=1000 --pipe "bin/munge-data.pl --db live >> {#}.out"
However, I recently found myself scratching my head because parallel was only running 3 jobs rather than the 24 I had specified. …