Color Space SNP Calling

MAQ is able to process SOLiD data. It does alignment in the color space, convert color alignment to nucleotide alignment and then do SNP calling in the nucleotide space. Suppose you have a nucleotide reference sequence ref.fasta, color read sequence file reads_F3.csfasta and its quality file reads_F3_QV.qual. You should run through the following commands to get the SNP calls.

    1> reads_ shortname
    2> maq fastq2bfq shortname.fastq shortname.bfq
    3> maq fasta2csfa ref.fasta > ref.csfa
    4> maq fasta2bfa ref.csfa ref.csbfa
    5> maq fasta2bfa ref.fasta ref.bfa
    6> maq map -c ref.csbfa shortname.bfq 2> aln.log
    7> maq csmap2nt ref.bfa
    8> maq assemble cns.cns ref.bfa 2> cns.log
In this procedure, step 1 converts reads in the SOLiD format to the FASTQ format. The primer nucleotide base and the first color will be trimmed off. MAQ cannot use the first nucleotide in the alignment and this information will be missing in all the subsequent steps. Fortunately, simulation shows that even for 25bp reads, missing the first nucleotide will not greatly hurt the accuracy. Note that script also works with paired end reads. In this case, three files will be generated, one of singletons and two for each end respectively. Step 6 does alignment in the color space. It correctly handles the orientation or color reads. Step 7 converts the color alignment to nucleotide alignment. The resulting reads will be one bp shorter as there is not much information for the first and the last color. If input reads are 25 characters in length, the reads will be 23bp after csmap2nt. Since this step, we will work in the nucleotide space. The color reference sequence and color alignment will not be used any more.

Converting Color Alignment

AB decodes color sequence with a set of rules, but to me, using a dynamic programming is simpler and more straightforward. The dynamic programming takes a nucleotide reference sequence and an aligned color read as input and works by finding a decoding which minimizes color changes and nucleotide changes at the same time.

The decoding problem can be solved directly or reduced to find the shortest path on a DAG. Given a reads with m colors, we can construct a DAG (directed acyclic graph) with 4(m+1) nodes and 16m arcs. A node at (b,i) represents a nucleotide b at i-th position; an arc between (b1,i) and (b2,i+1) represents a color associated with (b1,b2). Knowing the read color sequence, we can assign weights to arcs; knowing the nucleotide reference we can assign weights to nodes. The task is to find a path such that the sum of weights on the arcs and nodes is optimal. This can be easily solved by a dynamica programming. The iterative formula is briefly given here.

Known Issues and Future Development

  • At present, MAQ underestimates the mapping quality of reads containing mutations, which will lead to high false negative rates due to the stategy used in calling SNPs. I am aware of this issue and will try to improve this point in the next release.
  • Command cns2cssnp was used to call SNPs for SOLiD data. It is obsolete now. Vaughn has also pointed a bug in cns2cssnp. I will try to fix this bug. It is anyway good to has an alternative SNP caller.