16.9.14

Transpose large data matrix using BASH

Problem: A data matrix of genome-wide SNP data is in the wrong orientation: 1005 individuals x 214051 SNPs = 2.1E8 string elements.  Transposing a very large data matrix may overwhelm system memory if the entire matrix is loaded at once.  The system then starts paging to disk, further slowing an already time-consuming task.

One solution: A brief BASH script cuts consecutive columns from the data matrix.  tr converts the end-of-line characters to commas, turning each column of text into a row.  The rows are then appended, one after another, to the output file.  Each call to cut rereads the whole input file, so the input is read once per column (1005 times here).  Memory usage was negligible; the run took 5.5 hrs.

Steps:
1. Use the following BASH script (assumes a comma-delimited CSV file with 1005 columns):

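  #!/bin/bash

  InputFile="head.txt"
  OutputFile="outfile.txt"
  NumColumns=1005

  # Truncate (or create) the output file.
  > "$OutputFile"

  for (( i=1; i<=NumColumns; i++ ))
   do
    # Progress indicator.
    echo "$i/$NumColumns"
    # Extract column i and turn its newlines into commas.
    # Note: this leaves a trailing comma at the end of each output row.
    cut -d',' -f"$i" "$InputFile" | tr '\n' ','  >> "$OutputFile"
    # Terminate the row.
    echo >> "$OutputFile"
   done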

2. Modify the InputFile, OutputFile, and NumColumns variables as needed.
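
The loop above can also be sketched with the column count read from the file itself and with paste -s in place of tr; paste -s joins a column's lines with commas and writes no trailing comma. This is only a minimal sketch, not the script used for the 5.5 hr run. transpose_csv is a hypothetical helper name, and the sketch assumes every line has the same number of fields as the first line.

```shell
#!/bin/bash
# Sketch: column-by-column transpose of a comma-delimited file.
# transpose_csv is a hypothetical helper name, not part of the original script.
transpose_csv() {
  local input="$1"
  local output="$2"
  local num_columns
  local i=1

  # Count the fields in the first line instead of hard-coding NumColumns.
  # Assumption: every line has the same number of fields as the first.
  num_columns=$(head -n 1 "$input" | awk -F',' '{ print NF }')

  # Truncate (or create) the output file.
  : > "$output"

  while [ "$i" -le "$num_columns" ]; do
    # cut extracts column i; paste -s joins its lines with commas and
    # ends the row with a single newline, so no trailing comma is written.
    cut -d',' -f"$i" "$input" | paste -s -d',' - >> "$output"
    i=$((i + 1))
  done
}

# Usage (file names are placeholders):
# transpose_csv "head.txt" "outfile.txt"
```

Like the original loop, this rereads the input once per column, so it trades run time for low memory use.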

A faster but more memory-intensive solution, from the boards:
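  # -F',' and the "," join assume comma-delimited input, as in the script above;
  # the version from the boards split and joined on whitespace.
  awk -F',' '
  {
      # Store every field; the whole matrix is held in memory.
      for (i = 1; i <= NF; i++) {
          a[NR,i] = $i
      }
  }
  NF > p { p = NF }   # track the widest row
  END {
      # Print column j of the input as row j of the output.
      for (j = 1; j <= p; j++) {
          str = a[1,j]
          for (i = 2; i <= NR; i++) {
              str = str "," a[i,j]
          }
          print str
      }
  }' "$InputFile" > "$OutputFile"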
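
Whichever method is used, a quick shape check can catch a truncated or misparsed run: the output should have as many rows as the input has columns, and vice versa. The following is a minimal sketch under stated assumptions. csv_dims and check_transpose are hypothetical helper names, both files are assumed to be comma-delimited, and one trailing comma per line is ignored (the tr-based loop writes one), which assumes the last field is never genuinely empty.

```shell
#!/bin/bash
# Sketch: print "rows columns" for a comma-delimited file.
# csv_dims is a hypothetical helper name.
csv_dims() {
  # Drop one trailing comma per line before counting fields, then
  # report the line count and the widest field count seen.
  awk -F',' '{ sub(/,$/, "") } NF > max { max = NF } END { print NR, max + 0 }' "$1"
}

# Sketch: succeed only if the second file has the transposed shape of the first.
# check_transpose is a hypothetical helper name.
check_transpose() {
  local in_dims
  local out_dims
  in_dims=$(csv_dims "$1")
  out_dims=$(csv_dims "$2")

  # Split "rows columns" into its two parts.
  local in_rows="${in_dims% *}"
  local in_cols="${in_dims#* }"
  local out_rows="${out_dims% *}"
  local out_cols="${out_dims#* }"

  if [ "$in_rows" -eq "$out_cols" ] && [ "$in_cols" -eq "$out_rows" ]; then
    echo "OK: ${in_rows}x${in_cols} -> ${out_rows}x${out_cols}"
  else
    echo "MISMATCH: ${in_rows}x${in_cols} -> ${out_rows}x${out_cols}" >&2
    return 1
  fi
}

# Usage (file names are placeholders):
# check_transpose "head.txt" "outfile.txt"
```

This only compares dimensions; it does not verify that individual values landed in the right cells.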
