In a prior post, I presented a low-memory Bash solution for transposing large data matrices. Here is a way to speed up that basic procedure using parallel processing on an HPC cluster.
1. Generate a large data table for testing (~2GB, ~1E9 elements):
ncol=2472;
nrow=404627;
seq -s' ' 1 $ncol > m.txt;
foo=$(for ((i=1; i<=$ncol; i++));
do
echo $(( 1 + RANDOM % 4 ));
done;);
foo=$(echo $foo | tr "\n" " ");
export nrow;
export foo;
perl -e 'for($i=0;$i<$ENV{nrow};$i++){print "$ENV{foo}\n"}' >> m.txt;
Notes: The 3rd line creates a header that labels the columns consecutively. These labels become important later. Watch this step: some Linux versions add a linebreak after the header, others do not. You want the linebreak.
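One way to confirm that the header kept its linebreak is to check the shape of the finished file. The helper below is only a sketch: the name check_matrix and its arguments are my own and hypothetical, not part of the original procedure. If the linebreak is missing, the header fuses with the first data row, so both the field count of the first line and the total line count come out wrong.

```shell
# Hypothetical helper, not part of the original procedure.
# Usage: check_matrix FILE NCOL NROW
check_matrix() {
  local file=$1 ncol=$2 nrow=$3
  local fields lines
  # Number of space-separated labels on the header line.
  fields=$(head -n 1 "$file" | wc -w | tr -d ' ')
  # Total lines: one header plus nrow data rows.
  lines=$(wc -l < "$file" | tr -d ' ')
  if [ "$fields" -eq "$ncol" ] && [ "$lines" -eq $((nrow + 1)) ]; then
    echo "ok"
  else
    echo "unexpected shape: $fields header fields, $lines lines"
    return 1
  fi
}
```

With the variables from step 1 it would be called as: check_matrix m.txt $ncol $nrow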
2. Run on HPC using GNU Parallel:
InputFile="m.txt";
seq 1 $ncol | parallel --sshloginfile ~/machines --jobs 24 "cut -d' ' -f{} $InputFile | tr '\n' ' ' | sed 's/ $/\n/g' > ~/{}.txt; echo Col {};";
Notes: The method above works as follows. First, seq delivers a set of numbers (from 1 to the total number of columns in the input matrix) to GNU Parallel. GNU Parallel then distributes $ncol jobs among the nodes listed in the file ~/machines. The option --jobs 24 runs up to 24 jobs at a time on each node, which matches the 24 cores per node used here. Each job cuts a single column from the input file, transposes it into a row, then writes it to disk. This assumes that the input file and the home directory are visible at the same path on every node. I had no luck with the GNU Parallel option --keep-order, which would presumably allow one to avoid this intermediate write step.
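To see what each job does without a cluster, here is a serial sketch on a toy matrix. The file names toy.txt, toy_N.txt and toy_rot.txt are made up for illustration, a plain loop stands in for GNU Parallel, and I have swapped the tr/sed pair for paste -s, which joins the lines of a column into one row and ends it with a newline.

```shell
# Toy matrix: a header row of column labels plus two data rows.
printf '1 2 3 4\na b c d\ne f g h\n' > toy.txt
ncol=$(head -n 1 toy.txt | wc -w | tr -d ' ')

# Serial stand-in for the GNU Parallel call: one cut per column,
# each column flattened to a single row and written to its own file.
for i in $(seq 1 "$ncol"); do
  cut -d' ' -f"$i" toy.txt | paste -sd' ' - > "toy_$i.txt"
done

# Reassemble the per-column files in column order.
for i in $(seq 1 "$ncol"); do
  cat "toy_$i.txt"
  rm "toy_$i.txt"
done > toy_rot.txt

cat toy_rot.txt
# prints:
# 1 a e
# 2 b f
# 3 c g
# 4 d h
```

Each header label ends up as the first field of its row, which makes the order of the transposed rows easy to inspect.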
3. Fuse the output files together:
> mrot.txt;
for ((i=1; i<=$ncol; i++));
do
cat "$HOME/$i.txt" >> mrot.txt;
rm "$HOME/$i.txt";
done;
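Because the header labels the columns 1 to $ncol, the first field of each row in mrot.txt should count up from 1. A possible sanity check along those lines is sketched below; the helper name check_order is hypothetical and not part of the original procedure.

```shell
# Hypothetical helper, not part of the original procedure.
# Usage: check_order FILE NCOL
check_order() {
  local file=$1 ncol=$2
  # The header labels 1..ncol should now be the first field of each row.
  if [ "$(cut -d' ' -f1 "$file")" = "$(seq 1 "$ncol")" ]; then
    echo "rows in order"
  else
    echo "rows missing or out of order"
    return 1
  fi
}
```

After step 3 it would be called as: check_order mrot.txt $ncol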