20.5.16

Transpose large data matrix using BASH. II. GNU Parallel.

In a prior post, I presented a low-memory BASH solution for transposing large data matrices.  Here is a way to speed up that basic procedure using parallel processing on an HPC.

1. Generate a large data table for testing (~2GB, ~1E9 elements):
ncol=2472;
nrow=404627;
# Header row: label the columns consecutively, 1..ncol (these labels matter in step 3).
seq -s' ' 1 $ncol > m.txt;
# Build one row of ncol random values between 1 and 4.
foo=$(for ((i=1; i<=$ncol; i++));
do
   echo $(( 1 + RANDOM % 4 ));
done);
# Join the values into a single space-separated row.
foo=$(echo $foo | tr "\n" " ");
export nrow;
export foo;
# Repeat that row nrow times to fill out the matrix.
perl -e 'for($i=0;$i<$ENV{nrow};$i++){print "$ENV{foo}\n"}' >> m.txt;

Notes: The seq command creates a header row in which the columns are labeled consecutively; these labels become important in step 3.  Watch this step: some Linux versions append a linebreak after the header, others do not.  You want the linebreak.
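
A quick way to verify, before the data rows are appended: wc -l counts newline characters, so it reports 0 for a one-line file with no trailing newline.  A minimal check, assuming GNU coreutils:

wc -l m.txt;   # prints 1 if the header ends in a newline, 0 otherwise
[ $(wc -l < m.txt) -eq 0 ] && echo >> m.txt;   # append the missing newline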

2. Run on HPC using GNU Parallel:
InputFile="m.txt";
seq 1 $ncol | parallel --sshloginfile ~/machines --jobs 24 "cut -d' ' -f{} $InputFile | tr '\n' ' ' | sed 's/ $/\n/' > ~/{}.txt; echo Col {};";

Notes: The method above works as follows.  First, seq delivers a set of numbers (from 1 to the total number of columns in the input matrix) to GNU Parallel.  GNU Parallel then distributes the $ncol jobs among the nodes listed in the file ~/machines.  The option --jobs 24 tells GNU Parallel to run up to 24 jobs at a time on each node (one per core, on 24-core nodes).  Each job cuts a single column from the input file, transposes it into a row, and writes it to disk.  Note that this assumes the input file and the home directory are visible to all nodes, e.g. via a shared filesystem, as is typical on an HPC cluster.  I had no luck with the GNU Parallel option --keep-order, which would presumably allow one to avoid the intermediate write step.
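
For reference, here is a sketch of what that intermediate-free variant would look like with --keep-order; as noted, it did not work reliably for me across nodes, so treat it as untested:

seq 1 $ncol | parallel --keep-order --sshloginfile ~/machines --jobs 24 "cut -d' ' -f{} $InputFile | tr '\n' ' ' | sed 's/ $/\n/'" > mrot.txt;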

3. Fuse the output files together:
# Create (or truncate) the final output file.
> mrot.txt;
for ((i=1; i<=$ncol; i++));
do
   # Append each per-column file (written to ~ in step 2) in column order, then clean up.
   cat ~/"$i.txt" >> mrot.txt;
   rm ~/"$i.txt";
done;
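
As a sanity check on the result: m.txt has nrow+1 = 404628 lines (header plus data rows), so its transpose should have ncol = 2472 lines of 404628 fields each:

wc -l mrot.txt;                        # expect 2472
head -n1 mrot.txt | awk '{print NF}';  # expect 404628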