Understanding Qualities
Base Qualities
What is Phred Quality?
Traditionally, Phred quality is defined on base calls. Each base call is an estimate of the true nucleotide. It is a random variable and can be wrong. The probability that a base call is wrong is called error probability. If the error probability of a base is $e, the Perl code to calculate Phred quality $Q is:$Q = -10 * log($e) / log(10);For example, if the quality of a base call is 30, the probability that it is wrong is 0.001. In other words, given 1000 base calls with Q=30, one of them is wrong in average.
Phred Quality vs. Solexa Quality
Solexa quality is NOT Phred quality. The Solexa quality and Phred quality are asymptotically identical. If the error probability of a base is $e, the Solexa quality $sQ is:$sQ = -10 * log($e / (1 - $e)) / log(10);Solexa quality $sQ can be converted to Phred quality $Q with this formula:
$Q = 10 * log(1 + 10 ** ($sQ / 10.0)) / log(10);Phred quality can never be negative, but Solexa quality can be negative. This is the most effective way to tell what type of quality is used.
Read Format
The FASTQ format is widely used in Sanger before Solexa sequencing came into reality. It is supported by BioPerl. SolexaPipeline uses two other read formats. One is Solexa FASTQ-like format (those s_*_sequence.txt in the GERALD directory) and the other is SCARF. Solexa quality, instead of Phred quality, is used in both formats which are briefly documented in the SolexaPipeline documentation. Script fq_all2std.pl can convert these two formats to the standard FASTQ format.Raw Quality vs. Calibrated Quality
The Bustard module of the SolexaPipeline estimates qualities, or error probabilities, from the signal/noise ratio of each base. I usually terms it as raw quality. Gerald is able to calibrate qualities with the Phred algorithm when the alignment is available. This is calibrated quality.
We have already known that the trend of raw qualities is about right, which means bases with higher quality contains fewer sequencing errors. However, the absolute value of raw quality is not right. You may see one error out of 1000 bases with Q=40. When properly used, calibrated qualities can be much more accurate. I usually recommend to use calibrated qualities if possible.
Mapping Qualities
What is Mapping Quality?
In a probabilistic view, each read alignment is an estimate of the true alignment and is therefore also a random variable. It can be wrong. The error probability scaled in the Phred is the mapping quality. If the mapping quality of a read alignment is $mQ, the probability $me that the alignment is wrong can be calculated with:
$me = 10 ** (-$mQ / 10.0);Given 1000, for example, read alignments with mapping quality being 30, one of them will be wrong in average.
Understanding Mapping Qualities
The calculation of mapping qualities is simple, but this simple calculation considers all the factors below:
- The repeat structure of the reference. Reads falling in repetitive regions usually get very low mapping quality.
- The base quality of the read. Low quality means the observed read sequence is possibly wrong, and wrong sequence may lead to a wrong alignment.
- The sensitivity of the alignment algorithm. The true hit is more likely to be missed by an algorithm with low sensitivity, which also causes mapping errors.
- Paired end or not. Reads mapped in pairs are more likely to be correct.
When you see a read alignment can get a mapping quality 30, it usually implies:
- The overall base quality of the read is good.
- The best alignment has few mismatches.
- The read has few or just one `good' hit on the reference, which means the current alignment is still the best even if one or two bases are actually mutations or sequencing errors.
Using Mapping Qualities
This issue largely depends on the project you are doing. If you want to find the SNPs, you do not really need to care about this. Maq will consider the mapping quality in genotype calling. If you want to pinpoint the structual variations with paired end reads, you should only pick up abnormal pairs with high mapping qualities (30, for example). If you are analysing ChIP-Seq data, setting a threshold on mapping quality may also be necessary.
I am not sure what threshold is the most appropriate in practical cases and I would like to hear what users say about that. The probabilistic meaning of mapping quality is explicit and simulated data (in Maq paper) show that it can be estimated reasonably well even if there are mutations on the reference. Users can rely on that.