Calling SNPs in Pooled Samples

Maq supports SNP calling for pooled samples. To enable it, set the number of haplotypes/strains in the pool with -N in the assemble command or in easyrun. You should also disable the filter on the neighbouring quality: use `SNPfilter -n 0' if you run SNPfilter separately, or `easyrun -E 0' if you only use easyrun.

For example, suppose we have pooled six bacterial strains. To call SNPs, we should run:

    easyrun -N 6 -E 0 ref.fasta reads.fastq

The final SNP calls are given by `easyrun/'.

Preliminary Evaluation

I have tried SNP calling for two pools, one consisting of 5 strains and the other of 6. Resequencing the first pool yielded 40X high-quality single-end reads, and maq found almost all substitutions that are not in repetitive regions. The second pool gave 17X reads of mid-level quality, and maq found only ~58% of the substitutions. Another ~20% of the SNPs were found initially but were filtered out due to low quality. The difference in accuracy between the two pools is caused by data quality and may also be related to the similarity between the pooled samples. In addition, only false negatives can be evaluated with the current data. Further genotyping is needed to estimate the false positive rate.

Consensus Quality and SNP Quality

Consensus quality is the Phred-scaled probability that the genotype call is incorrect; SNP quality is the Phred-scaled probability that an inferred SNP is in fact identical to the reference. Suppose that at a site the reference base is b0 and the most likely genotype is b, with b different from b0. Consensus quality is Qc=-4.343log[1-P(b|D)], while SNP quality is Qs=-4.343log P(b0|D), where log is the natural logarithm (4.343 is approximately 10/ln 10). As b is different from b0, P(b0|D)<=1-P(b|D), so Qs>=Qc always holds. Maq's assemble command calculates Qc, but in SNP discovery, Qs is the right quality to use. Since version 0.6.7, SNPfilter takes this difference into account.
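
The two definitions above can be illustrated with a short Python sketch. It is not Maq code: the function names, the posterior values and the quality cap are assumptions made for illustration. It only evaluates the two formulas on a given posterior distribution over genotypes.

```python
import math

def phred(p, cap=255.0):
    """Phred-scale a probability: -10*log10(p), i.e. about -4.343*ln(p)."""
    if p <= 0.0:
        return cap  # hypothetical cap to avoid log(0); not taken from Maq
    return min(cap, -10.0 * math.log10(p))

def consensus_and_snp_quality(posterior, ref_genotype):
    """Return (best genotype, Qc, Qs) for a posterior P(genotype | D)."""
    best = max(posterior, key=posterior.get)
    qc = phred(1.0 - posterior[best])             # Qc: the call b is wrong
    qs = phred(posterior.get(ref_genotype, 0.0))  # Qs: site equals reference b0
    return best, qc, qs

# Made-up posterior for a site whose reference base is A.
posterior = {"A": 0.001, "A/G": 0.399, "G": 0.600}
best, qc, qs = consensus_and_snp_quality(posterior, ref_genotype="A")
print(best, round(qc, 1), round(qs, 1))  # prints: G 4.0 30.0
```

In this made-up case the call G is uncertain (Qc is about 4), yet the site is very unlikely to match the reference (Qs is 30), which is why Qs is the quantity to filter on in SNP discovery.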

Understanding the difference between Qc and Qs is very important for SNP calling in pooled samples. Pooling greatly reduces overall consensus quality, as a low-frequency polymorphic site might be indistinguishable from a monomorphic site. However, SNP quality can still be high if the minor allele is the reference base. For example, suppose that at a site the reference base is A and we observe eight Gs and two As in the read alignment. On one hand, in a large pool it is difficult to decide whether the true `genotype' is G or A/G, and therefore the consensus quality is low. On the other hand, we are quite sure that the G allele is there, so we should assign a high SNP quality. It is therefore still possible to find a good fraction of high-quality SNPs in a pool.
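
The eight-G, two-A example can be made concrete with a toy calculation. The sketch below is not Maq's statistical model. It assumes only two alleles, a uniform prior over the number of pooled haplotypes carrying G, independent reads, and a fixed per-base error rate `err`. All names and parameter values are hypothetical.

```python
import math

def phred(p, cap=255.0):
    """Phred-scale a probability, with a hypothetical cap to avoid log(0)."""
    if p <= 0.0:
        return cap
    return min(cap, -10.0 * math.log10(p))

def pooled_site_qualities(n_alt, n_ref, n_haplotypes, err=0.01):
    """Toy posterior over k, the number of haplotypes carrying the alt allele."""
    likelihoods = []
    for k in range(n_haplotypes + 1):
        f = k / n_haplotypes                       # true alt-allele frequency
        p_alt = f * (1.0 - err) + (1.0 - f) * err  # chance a read shows alt
        # The binomial coefficient is the same for every k, so it is omitted.
        likelihoods.append(p_alt ** n_alt * (1.0 - p_alt) ** n_ref)
    total = sum(likelihoods)
    post = [x / total for x in likelihoods]        # uniform prior over k
    best_k = max(range(len(post)), key=post.__getitem__)
    qc = phred(1.0 - post[best_k])  # consensus quality of the best "genotype"
    qs = phred(post[0])             # SNP quality: k = 0 means reference only
    return best_k, qc, qs

# Eight G reads and two A reads (reference A) in a pool of six haplotypes.
best_k, qc, qs = pooled_site_qualities(n_alt=8, n_ref=2, n_haplotypes=6)
print(best_k, round(qc, 1), round(qs, 1))
```

Under these assumptions the posterior is spread over several values of k, so Qc is low, while the probability that no haplotype carries G is tiny, so Qs is high.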