Problem: A data matrix of genome-wide SNP data is in the wrong orientation: 1005 individuals x 214051 SNPs = 2.1E8 string elements. Transposing a very large data matrix can overwhelm system memory if the entire matrix is loaded at once. The system then starts swapping to disk, further slowing an already time-consuming task.
One solution: A brief BASH script uses cut to extract each column of the data matrix in turn. tr converts the end-of-line characters to commas, turning the column of text into a row (so each output row ends with a trailing comma). Each row is then appended to the output file. Memory usage is negligible, but the run took 5.5 hrs, because the input file is reread once for every column.
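To see what a single pass of the loop does, here is a minimal sketch on a toy 2 x 3 matrix. The temporary file and the variable names ToyFile and row are made up for illustration and are not part of the original script.

```shell
# Hypothetical toy input: 2 rows, 3 columns, written to a temporary file.
ToyFile=$(mktemp)
printf 'a,b,c\nd,e,f\n' > "$ToyFile"

# Extract column 2 and replace each newline with a comma.
row=$(cut -d',' -f2 "$ToyFile" | tr '\n' ',')

# The column b / e is now a single row, with a trailing comma.
echo "$row"   # prints: b,e,

rm -f "$ToyFile"
```

The full script simply repeats this step for every column and appends a newline after each row.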
Steps:
1. Use the following BASH script (assumes a comma-delimited CSV file with 1005 columns):
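#!/bin/bash
# Transpose a comma-delimited file one column at a time.
InputFile="head.txt"
OutputFile="outfile.txt"
NumColumns=1005

# Create or empty the output file.
> "$OutputFile"

for (( i=1; i<=NumColumns; i++ ))
do
    # Progress indicator.
    echo "$i/$NumColumns"
    # Extract column i and turn its newlines into commas
    # (this leaves a trailing comma at the end of the row).
    cut -d',' -f"$i" "$InputFile" | tr '\n' ',' >> "$OutputFile"
    # Terminate the row with a newline.
    echo >> "$OutputFile"
done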
2. Modify the InputFile, OutputFile, and NumColumns variables as needed.
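Rather than hard-coding NumColumns, the column count can be read from the first line of the input. The following is a minimal sketch, not the original script: the function name transpose_csv is hypothetical, and it swaps tr for paste -s, which joins the lines with commas without leaving a trailing comma. It assumes every line has the same number of fields and that no field contains a quoted comma.

```shell
#!/bin/bash
# transpose_csv INPUT OUTPUT  (hypothetical helper, for illustration only)
transpose_csv() {
    local InputFile="$1" OutputFile="$2"
    local NumColumns i

    # Count the comma-separated fields on the first line.
    NumColumns=$(head -n 1 "$InputFile" | awk -F',' '{ print NF }')

    # Create or empty the output file.
    : > "$OutputFile"

    i=1
    while [ "$i" -le "$NumColumns" ]; do
        # paste -s joins all lines of the column into a single row,
        # separated by commas and ended by one newline.
        cut -d',' -f"$i" "$InputFile" | paste -s -d',' - >> "$OutputFile"
        i=$((i + 1))
    done
}
```

It would be called as transpose_csv head.txt outfile.txt. Like the script above, it rereads the input once per column, so it should be no faster, only a little more convenient.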
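A faster but more memory-intensive solution, from the boards. This awk script holds the entire matrix in an array before printing anything. As written, it splits fields on whitespace and joins them with spaces, so it expects space-delimited rather than comma-delimited input: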
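awk '
# Store every field, keyed by (row, column).
{
    for (i=1; i<=NF; i++) {
        a[NR,i] = $i
    }
}
# Track the largest number of fields seen on any line.
NF>p { p = NF }
# After the last line, print each input column as one output row.
END {
    for (j=1; j<=p; j++) {
        str = a[1,j]
        for (i=2; i<=NR; i++) {
            str = str" "a[i,j]
        }
        print str
    }
}' "$InputFile" > "$OutputFile"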
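The awk script above splits on whitespace by default. For comma-delimited input like the file used by the BASH script, a variant might look like the following sketch. The wrapper name transpose_csv_awk is hypothetical. The only real changes are the -F',' option and the comma used when joining fields, and the whole matrix is still held in memory.

```shell
# transpose_csv_awk INPUT OUTPUT  (hypothetical helper, for illustration only)
transpose_csv_awk() {
    awk -F',' '
    {
        # Store every field, keyed by (row, column).
        for (i = 1; i <= NF; i++) a[NR, i] = $i
        # Track the largest number of fields seen on any line.
        if (NF > p) p = NF
    }
    END {
        # Print each input column as one comma-separated output row.
        for (j = 1; j <= p; j++) {
            str = a[1, j]
            for (i = 2; i <= NR; i++) str = str "," a[i, j]
            print str
        }
    }' "$1" > "$2"
}
```

It would be called as transpose_csv_awk head.txt outfile.txt.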