Frequently Asked Questions
- What does Maq do?
Maq maps short reads to the reference and calls the genotypes from the alignment. It is specifically designed for Illumina-Solexa/AB-SOLiD reads, not for 454 or capillary ones.
- What is the most important thing I should understand first?
The most important things for you to know are: 1) Maq maps a repeat read randomly, and 2) it gives a probability score (mapping quality) to each alignment. The following FAQs and this page give much more information.
- Where to get read sequences and qualities?
I only know a little about SolexaPipeline. As far as I know, read sequences can be obtained from Bustard, after the Gerald filter, or after Gerald quality calibration. I usually recommend taking the data after quality calibration because both sequences and qualities are most accurate there. Nonetheless, if you find it difficult to get calibrated qualities, you may also use the data filtered in Gerald. My advice is not to use unfiltered data. Do not be lured by the amount of data you get before filtering. Unfiltered data mostly bring trouble rather than a better result.
- Longer reads or better qualities, what do you prefer?
I prefer better qualities. Longer reads may seem attractive, but if you cannot get very high-quality data at the 3'-end of reads, you should definitely stop sequencing or trim off the low-quality ends of all reads. Although maq is still sensitive enough to align reads with very poor tails, the errors in these reads may be highly dependent and somewhat odd, which may mislead maq into calling many wrong SNPs later on. Quality is more important than quantity. Do not try to get longer reads until you believe you can achieve an overall error rate below 1%.
- What is the read file format Maq recognizes?
Reads should be prepared in the standard FASTQ format.
- Will SolexaPipeline generate the FASTQ file Maq needs? If not, what should I do?
The Gerald module of the SolexaPipeline generates a FASTQ-like format, but it is not actually FASTQ. You should convert the format with "maq sol2sanger". This page gives more explanation. The latest SolexaPipeline-0.3.0 generates a new file in the "Export" format. Read sequences, qualities and most of the alignment information are stored in this file.
Maq also provides a Perl script to convert various formats to Sanger's FASTQ format.
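To make the difference between the two formats concrete, here is a minimal Python sketch of the quality conversion that a Solexa-to-Sanger step has to perform. It assumes the usual conventions: the Gerald FASTQ-like file stores Solexa-scaled qualities as ASCII characters with offset 64, and Sanger FASTQ stores Phred-scaled qualities with offset 33. The function names and the rounding rule are illustrative assumptions, not the code of "maq sol2sanger".

```python
import math


def solexa_to_phred(q_solexa: int) -> int:
    """Convert a Solexa-scaled quality to the nearest Phred-scaled quality.

    Uses Q_phred = 10 * log10(10^(Q_solexa / 10) + 1).
    Rounding to the nearest integer is an assumption of this sketch.
    """
    return int(round(10.0 * math.log10(10.0 ** (q_solexa / 10.0) + 1.0)))


def solexa_qual_to_sanger(qual: str) -> str:
    """Re-encode a Solexa quality string (offset 64) as a Sanger one (offset 33)."""
    return "".join(chr(solexa_to_phred(ord(c) - 64) + 33) for c in qual)


if __name__ == "__main__":
    # 'h' encodes Solexa quality 40, '@' encodes Solexa quality 0.
    print(solexa_qual_to_sanger("h@"))
```

The two scales agree closely for high qualities and differ mainly below about 10, which is why a file that merely looks like FASTQ can still mislead a quality-aware aligner.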
- My collaborator gave me a read file far from FASTQ. What should I do?
First, you should understand what your file means, and then convert it to the standard FASTQ format. This page may help, too. If you still have difficulty identifying the format, feel free to send an email to the maq-help mailing list, or contact me directly.
- How does Maq alignment work? Just briefly.
Maq indexes all the input reads and scans through the reference several times. It only stores the best two hits in memory. Maq maps a read to the position where the sum of quality values of mismatched nucleotides is minimal. If a read has several equally good best positions, Maq picks one of them at random. Maq gives each read alignment a mapping quality (see below).
- Can Maq align reads longer than 32bp?
Yes. The maximum allowed read length is 63bp. I can loosen the limit at the cost of speed and memory usage.
- Can I use Maq to align 454 reads or capillary reads?
No. For the moment, Maq can only map reads at most 63bp in length. Capillary reads and 454 reads are much longer than that. In addition, Maq usually performs ungapped alignment whereas capillary and 454 reads usually contain many short indels.
- Can I use Maq to find short indels?
Yes, but only if you have paired end reads. Maq has a module to find break points with single end reads, but it does not work well in all cases. This module is not accurate and not ready for end users.
- Is Maq able to deal with reads with variable lengths?
No and yes. In alignment, all the reads must be of the same length. However, you can `mapmerge' several alignments of reads with different lengths, and after that all the subsequent maq commands accept variable lengths.
- Can I map reads to one chromosome at a time and then merge them together with `mapmerge'?
No, you cannot. All the chromosomes of the reference genome should be put into a single FASTA file. Maq is designed in this way to avoid technical complexity.
- The qualities of the 3'-end of my reads are not great. Should I trim them off?
If you are using maq-0.6.0 or later, I usually do not recommend doing this because Maq is designed to cope with this case. However, if your reads show obvious compositional biases, trimming the reads (with `maq map -1 30' for example) may be preferable. If you find that recursive trimming always works better, please let me know. Thank you.
- Eland tells me the best hit has one mismatch, but the best hit found by Maq has two mismatches. What happens?
Maq is quality-aware. Before alignment, Maq divides base qualities by 10 and discards the fractional part. In alignment, hit A is said to be better than hit B if the sum of the 10-divided quality values of the mismatched bases of A is smaller. Maq does not always favour the position yielding fewer mismatches. Note that dividing qualities by 10 reduces the resolution of qualities and may lead to mapping errors for reads with low qualities. These wrongly mapped reads usually get very low mapping qualities. They are still wrong, which is the price paid to make Maq faster. Furthermore, counting the number of mismatches is not always the right thing, either.
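The following Python sketch illustrates the scoring rule described in this answer with a toy example. It is only an illustration: the function names, the read, and the two candidate reference segments are made up, and the real aligner is of course not implemented this way.

```python
def hit_score(read: str, ref: str, quals: list) -> int:
    """Sum of 10-divided base qualities over the mismatched positions."""
    return sum(q // 10 for r, g, q in zip(read, ref, quals) if r != g)


def mismatch_count(read: str, ref: str) -> int:
    """Plain number of mismatches, ignoring qualities."""
    return sum(1 for r, g in zip(read, ref) if r != g)


if __name__ == "__main__":
    read = "ACGTACGT"
    quals = [30, 30, 30, 30, 30, 30, 5, 8]  # poor qualities at the 3'-end
    hit_a = "ACCTACGT"  # one mismatch, at a base with quality 30
    hit_b = "ACGTACAA"  # two mismatches, at bases with qualities 5 and 8
    for name, ref in (("A", hit_a), ("B", hit_b)):
        print(name, mismatch_count(read, ref), hit_score(read, ref, quals))
```

Here hit B has more mismatches than hit A but the lower quality-aware score, so under this rule B is the better hit even though a mismatch-counting aligner would prefer A.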
- I am using the default option to run `maq map'. I see hits containing a lot of mismatches. Is this a bug?
No, it is not. By default, maq guarantees that all hits with up to 2 mismatches in the first 24bp can be found, and it can also find some of the hits with 3 or 4 mismatches in the first 24bp. There is no explicit cut-off on the number of mismatches across the whole read. Nonetheless, hits with a lot of high-quality mismatches will be discarded.
- How may the "map -n" option affect my results?
Option "-n" first of all controls the sensitivity of the alignment. Increasing it helps to find more hits with many mismatches. However, keeping alignments with many high-quality mismatches is not the right thing, either, as these alignments may be contamination. In "assembly", maq discards alignments whose sum of qualities of mismatches is larger than 60.
Another, more subtle effect of increasing "-n" is to improve the overall mapping qualities. The more hits maq sees, the more accurately the mapping qualities can be estimated. However, on human whole-genome alignment, using "-n 3" may be too slow. That is why I set "-n 2" as the default.
- What does "mapping quality" mean?
Mapping quality is the Phred-scaled probability that the read alignment is wrong. Read this for a much longer explanation.
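As a quick reminder of what "Phred-scaled" means, the sketch below converts between an error probability and a quality value using Q = -10 * log10(P). The function names are illustrative; how Maq actually estimates the probability is not shown here.

```python
import math


def phred_from_error_prob(p_wrong: float) -> float:
    """Phred-scaled quality for a given probability of being wrong."""
    return -10.0 * math.log10(p_wrong)


def error_prob_from_phred(q: float) -> float:
    """Probability of being wrong implied by a Phred-scaled quality."""
    return 10.0 ** (-q / 10.0)


if __name__ == "__main__":
    for q in (0, 10, 20, 30, 60):
        print(q, error_prob_from_phred(q))
```

On this scale a mapping quality of 30 corresponds to roughly a 1 in 1000 chance that the alignment is wrong, and a mapping quality of 0 carries no confidence at all.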
- What happens to the reads that can be mapped to many positions?
If a read can be mapped to several equally best positions, Maq will randomly choose one position and give the alignment a zero mapping quality.
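The behaviour for repeat reads can be pictured with the small Python sketch below. It assumes each candidate hit is a hypothetical (position, score) pair where a lower score is better; the function name and data layout are made up for illustration and do not reflect Maq's internals.

```python
import random


def place_read(hits, rng):
    """Pick a position for a read from (position, score) candidates.

    Returns (position, is_repeat). When several hits tie for the best
    score, one is chosen at random and is_repeat is True; such an
    alignment would be given a mapping quality of zero.
    """
    best = min(score for _, score in hits)
    ties = [pos for pos, score in hits if score == best]
    if len(ties) > 1:
        return rng.choice(ties), True
    return ties[0], False


if __name__ == "__main__":
    rng = random.Random(11)  # fixed seed so the sketch is reproducible
    print(place_read([(100, 0), (500, 3)], rng))
    print(place_read([(100, 2), (500, 2), (900, 5)], rng))
```

Filtering on mapping quality afterwards is therefore the way to exclude these randomly placed reads from downstream analysis.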
- Can Maq output all the hits of a read?
No, Maq cannot do that for the moment.
- What does PE mean?
PE = Paired End.
- What is the gain if PE reads are in use?
More reads will get higher mapping qualities. If one read is a repeat and the other can be mapped with confidence, the repeat read can be aligned correctly and get a high mapping quality. In addition, you can use PE reads to find structural variations and short indels.
- Why can't I set the minimum insert size in the PE alignment mode?
Two reasons. The first is that the current protocol tends to produce very short insert sizes or even two overlapping ends. The second is due to the algorithm used by Maq. Not setting a minimum insert size helps to get a slightly better, though a bit more conservative, alignment.
- Merging .map files fails. What happens?
For the moment, 32-bit maq can only process files smaller than 2GB. Possibly your alignment file is too large for maq to handle. This is solvable, and we are working on it. You can also use the 64-bit version of maq, for which 2GB is not a limit. In addition, 64-bit maq is much faster than 32-bit maq in alignment.
- Can I see the read alignment in a better viewer?
- Can I quickly retrieve a small region from the maq alignment?
Yes, you can. Maqview provides a command-line tool, maqindex, to do this.
- Is there similar alignment software specifically designed for short reads?
Eland, the Solexa read alignment program, can do this. It is actually faster than Maq. Maq deliberately trades speed for several things that are not implemented in Eland. I know people also use SSAHA2, GMAP, Mosaik and SXOligoSearch to align reads. A friend of mine, Ruiqiang Li, wrote a program, SOAP, which is capable of finding short indels. On smaller data sets, cross_match may also be a worthy candidate. You can use cross_match to map unmapped reads to find indels.
- What do those "S", "M" and so on mean in the cns2snp output?
They are IUB codes for heterozygotes. Briefly:
M=A/C, K=G/T, Y=C/T, R=A/G, W=A/T, S=G/C, D=A/G/T, B=C/G/T, H=A/C/T, V=A/C/G, N=A/C/G/T
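If you post-process the output with a script, the codes above can be kept in a small lookup table. The Python sketch below simply encodes the list given here; the names IUB_CODES and expand_iub are illustrative.

```python
# IUB ambiguity codes mapped to the bases they stand for.
IUB_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "K": "GT", "Y": "CT", "R": "AG", "W": "AT", "S": "CG",
    "D": "AGT", "B": "CGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def expand_iub(code: str) -> str:
    """Return the bases represented by a single IUB code (case-insensitive)."""
    return IUB_CODES[code.upper()]


if __name__ == "__main__":
    for code in "SMK":
        print(code, "/".join(expand_iub(code)))
```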
After "cns2snp", you should run "maq.pl SNPfilter" to further filter out false SNPs. "SNPfilter" is always recommended.
Here is the description in maq manuscript: "After Maq's SNP calling, we further filtered the substitutions based on five rules: i) discard SNPs within 3-bp flanking region around a potential indel; ii) discard SNPs covered by three or fewer reads; iii) discard SNPs covered by no read with a mapping quality higher than 60; iv) in any 10bp window, if there are 3 or more SNPs, discard them all; and v) discard SNPs with consensus quality smaller than 10."
For single-end reads, the threshold in Rule iii) should be changed to 40, which is the default. Users may like to change Rule v) based on the read depth and the data quality to achieve a good balance point between FP and FN. There is no clear cut-off for the moment.
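To make the five rules easier to follow, here is a rough Python sketch that applies them to a list of candidate SNPs on one chromosome. It is not the "maq.pl SNPfilter" code. The Snp record, its field names, the representation of an indel as a single position, the literal reading of the boundaries ("higher than", "three or fewer"), and the choice to count all candidate SNPs in the 10bp window rule are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Snp:
    pos: int        # position on the chromosome
    depth: int      # number of reads covering the site
    max_mapq: int   # highest mapping quality among the covering reads
    cons_qual: int  # consensus quality of the call


def filter_snps(snps, indel_positions, min_max_mapq=60):
    """Apply the five quoted rules; use min_max_mapq=40 for single-end reads."""
    snps = sorted(snps, key=lambda s: s.pos)

    # Rule iv: three or more SNPs inside any 10bp window are all discarded.
    dense = set()
    for i in range(len(snps) - 2):
        if snps[i + 2].pos - snps[i].pos < 10:
            dense.update(s.pos for s in snps[i:i + 3])

    kept = []
    for s in snps:
        if any(abs(s.pos - p) <= 3 for p in indel_positions):
            continue  # rule i: within 3bp of a potential indel
        if s.depth <= 3:
            continue  # rule ii: covered by three or fewer reads
        if s.max_mapq <= min_max_mapq:
            continue  # rule iii: no read with mapping quality above the threshold
        if s.pos in dense:
            continue  # rule iv
        if s.cons_qual < 10:
            continue  # rule v: consensus quality smaller than 10
        kept.append(s)
    return kept


if __name__ == "__main__":
    candidates = [
        Snp(100, 10, 99, 50), Snp(200, 3, 99, 50), Snp(300, 10, 60, 50),
        Snp(400, 10, 99, 9), Snp(500, 10, 99, 50), Snp(600, 10, 99, 50),
        Snp(604, 10, 99, 50), Snp(609, 10, 99, 50), Snp(700, 10, 99, 50),
    ]
    print([s.pos for s in filter_snps(candidates, indel_positions=[502])])
```

Exposing the rule iii threshold as a parameter mirrors the advice above: 60 as in the quoted rules, 40 for single-end reads.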
Ideally, consensus quality is the Phred-scaled probability that the genotype call is wrong. However, due to various approximations in the calculation and inaccurate base qualities, maq consensus quality is not accurate. Furthermore, when "SNPfilter" is applied, a lot of false positives are filtered out, which makes maq SNP quality very conservative. However, the trend of consensus quality is about right: high-quality genotypes are usually reliable while low-quality ones tend to be wrong. The quality is still helpful in balancing FP and FN.