quuxfu: chop-chop: June 2016

Problem: The converter built in to PLINK converts nucleotide encoded data to a 0/1 binary format automatically when using --recode-fastphase to create a fastPHASE input file. This causes irretrievable loss of information.

Solution: I can't find any way around this other than to make a new converter.
1. Using PLINK, recode .ped file such that each chromosome has its own .ped/.map set of files. Something like:
./plink --file mydata --chr 1 --recode --out malc1;
./plink --file mydata --chr 2 --recode --out malc2;

2. Use the following script to create fastPHASE input files. Enter your values for 'nchr', 'iroot', and 'oroot' after the function dbl().

dbl() {

i=$1;

f="$infile"".ped";

nrow=$(wc -l "$f" | awk '{print $1}');

echo "splitting row $i/$nrow";

# split lines containing genotypes, by allele, for each individual

#odd alleles

#even alleles

echo "$odd" > "$i.tmp";

echo "$even" >> "$i.tmp";

}

export -f dbl;

nchr=17; #number of chromosomes

iroot="malc"; #root string of input files for each chromosome

oroot="chr"; #root string for output files, in fastPHASE format

for ((i=1;i<=$nchr;i++));

do echo "chr"$i;

export infile="$iroot""$i";

h1=$(wc -l "$infile".ped | awk '{print $1}'); #n samples for fastPHASE header line 1

h2a=$(head -1 "$infile".ped | awk '{print NF}');

h2=$((($h2a-6)/2)); #n loci for fastPHASE header line 2

h3a=$(awk '{print $4}' "$infile".map);

h3=$(echo "P $h3a" | tr "\n" " "); #locus positions for fastPHASE header line 3

s=$(awk '{print $2}' "$infile".ped);

#split single line per sample plink format into two line per sample fastPHASE format

seq 1 $h1 | parallel --env dbl --env infile dbl;

#Reassemble into a single file:

outfile="$oroot""$i".inp;

echo "$h1" > "$outfile";

echo "$h2" >> "$outfile";

echo "$h3" >> "$outfile";

for ((j=1;j<=$h1;j++));

do sample=$(echo "$s" | sed -n $j"p")" ";

echo "writing chr $i sample $sample $j/$h1";

echo '>'"$sample" >> "$outfile";

head -1 "$j.tmp" | sed 's/0/?/g' >> "$outfile";

tail -n +2 "$j.tmp" | sed 's/0/?/g' >> "$outfile";

rm "$j.tmp";

done;

3. Phase away.

quuxfu: chop-chop

16.6.16

Convert PLINK to fastPHASE with BASH and GNU Parallel