@mjgardner It is not really an issue for this problem if one is willing to use coarse-grained parallelism: we estimated that forking 40 subprocesses, with all I/O done on an NVMe drive, will take less than 10 min, with about 30%-40% of that time spent on I/O. It would be nice to illustrate that one can achieve near-compiled performance, though, for some rather obvious reasons!
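
For anyone wondering what the coarse-grained approach looks like in practice, here is a minimal sketch in Python, not the code we actually timed. The names `split_into_chunks`, `process_chunk` and `run_parallel` are hypothetical, and the sum of squares is only a placeholder for the real per-chunk work (read a slice of the input, compute, write the result).

```python
import multiprocessing as mp


def split_into_chunks(items, n_chunks):
    """Split items into n_chunks contiguous slices of near-equal size."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one additional item.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def process_chunk(chunk):
    # Placeholder workload (assumption): stands in for the real
    # per-chunk computation and its I/O.
    return sum(x * x for x in chunk)


def run_parallel(items, n_workers=40):
    # "fork" start method: POSIX only, workers inherit the parent's memory.
    ctx = mp.get_context("fork")
    with ctx.Pool(processes=n_workers) as pool:
        partials = pool.map(process_chunk, split_into_chunks(items, n_workers))
    # Combine the per-chunk results in the parent process.
    return sum(partials)
```

Each worker handles one large chunk and only reports back a small result, so the inter-process overhead stays negligible next to the compute and I/O time. Something like `run_parallel(list(range(1_000_000)))` would exercise it with the default 40 workers.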