Manual Reference Pages  - bwa (1)

NAME

bwa - Burrows-Wheeler Alignment Tool

CONTENTS

Synopsis
Description
Commands And Options
Output Format
Notes
          Alignment Accuracy
          Memory Requirement
          Speed
          Comparison to Other Software
Author
History

SYNOPSIS

bwa index -a bwtsw database.fasta

bwa aln database.fasta short_read.fasta > output.paf

DESCRIPTION

BWA is a fast, lightweight tool that aligns single-end short sequences to a sequence database, such as the human reference genome. By default, BWA finds an alignment within edit distance 2 of the query sequence, except that gaps close to either end of the query are disallowed. It can also be tuned to find a fraction of longer gaps at the cost of speed and of more false alignments.

BWA excels in speed. Mapping 2 million high-quality 35bp short reads against the human genome can be done in 20 minutes. In other tools, speed is usually gained at the cost of a large memory footprint, of disallowing gaps, or of hard limits on the maximum read length and the maximum number of mismatches. BWA makes none of these trade-offs: it remains relatively lightweight (1.7GB memory for human alignment), performs gapped alignment, and does not set a hard limit on read length or on the maximum number of mismatches.

COMMANDS AND OPTIONS

index bwa index [-p prefix] [-a algoType] <in.db.fasta>

Index database sequences in the FASTA format.

OPTIONS:
-p STR Prefix of the output database [same as db filename]
-a STR Algorithm for constructing BWT index. Available options are:
is IS linear-time algorithm for constructing the suffix array. It requires 5.37N memory, where N is the size of the database. IS is moderately fast, but does not work with databases larger than 2GB. IS is the default algorithm due to its simplicity. The current code for the IS algorithm was reimplemented by Yuta Mori.
div Divsufsort library. This library is believed to be the fastest open source library for constructing the suffix array and BWT. It requires 5N working memory. Divsufsort is not compiled by default.
bwtsw Algorithm implemented in BWT-SW. This is the only method that works with the whole human genome. However, this module does not work with databases smaller than 10MB and it is much slower than the other two. The bwtsw algorithm gives up speed to save memory.
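
The size limits above can be checked before indexing. The following is a minimal sketch, not part of BWA: the function names are hypothetical, and reading "2GB" and "10MB" as decimal byte counts is an assumption. The div method is left out of the feasibility check because its upper size limit is not stated here.

```python
# Limits quoted from the -a option description. Interpreting "2GB" and "10MB"
# as decimal byte counts is an assumption of this sketch.
IS_MAX_BYTES = 2 * 10**9
BWTSW_MIN_BYTES = 10 * 10**6


def usable_index_algorithms(db_bytes):
    """Return the -a values ("is", "bwtsw") whose documented limits admit db_bytes.

    "div" is deliberately omitted: its size limit is not documented.
    """
    usable = []
    if db_bytes <= IS_MAX_BYTES:
        usable.append("is")
    if db_bytes >= BWTSW_MIN_BYTES:
        usable.append("bwtsw")
    return usable


def construction_memory_bytes(algorithm, db_bytes):
    """Working memory from the documented multipliers (5.37N for is, 5N for div)."""
    factors = {"is": 5.37, "div": 5.0}
    if algorithm not in factors:
        raise ValueError("no memory multiplier is documented for " + algorithm)
    return factors[algorithm] * db_bytes
```

For a database in the range where both methods work, the choice is between the faster IS construction and the lower memory use of bwtsw.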

aln bwa aln [-n maxDiff] [-o maxGapO] [-e maxGapE] [-d nDelTail] [-i nIndelEnd] [-k maxSeedDiff] [-l seedLen] [-c] <in.db.fasta> <in.query.fasta>

Perform gapped alignment against indexed database sequences. At most maxSeedDiff differences are allowed in the first seedLen bases of the query, and at most maxDiff differences are allowed in the whole sequence.

OPTIONS:
-n INT Maximum number of differences [2]
-o INT Maximum number of gap opens [1]
-e INT Maximum number of gap extensions, -1 for k-difference mode [-1]
-d INT Disallow a long deletion within INT bp towards the 3’-end [16]
-i INT Disallow an indel within INT bp towards the ends [5]
-l INT Take the first INT bases as the seed. If INT is larger than the query length, seeding is disabled. For long reads, this option typically ranges from 25 to 35 with ‘-k 2’. [inf]
-k INT Maximum number of differences in the seed [2]
-c Reverse the query but do not complement it. This is required for alignment in color space.
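
As an illustration of how the options combine, the sketch below assembles an argument list for the aln command from a script. The helper name and its keyword arguments are hypothetical; only flags listed above are used, and nothing is executed.

```python
def build_aln_command(db_fasta, query_fasta, max_diff=None, max_gap_open=None,
                      max_gap_ext=None, seed_len=None, max_seed_diff=None,
                      color_space=False):
    """Assemble an argument list for `bwa aln` using only the documented flags.

    Options left as None are omitted so that BWA applies its own defaults.
    """
    argv = ["bwa", "aln"]
    for flag, value in (("-n", max_diff), ("-o", max_gap_open),
                        ("-e", max_gap_ext), ("-l", seed_len),
                        ("-k", max_seed_diff)):
        if value is not None:
            argv += [flag, str(value)]
    if color_space:
        argv.append("-c")
    # The indexed database comes first, then the query file, as in the usage line.
    return argv + [db_fasta, query_fasta]
```

The resulting list can be passed to a process launcher with standard output redirected to a file, since the alignments are written to standard output as in the synopsis.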

OUTPUT FORMAT

BWA generates alignments in PAF (Pairwise Alignment Format). PAF was first used in a comparison of different short read alignment software. It is TAB-delimited. Each line consists of the read name, left clipping (always 1 for BWA), strand, database sequence name, position of the leftmost base in the alignment, mapping quality, CIGAR string (always the read length followed by ‘M’), read sequence (always a dot), read quality (always a dot), name of the program (always ‘bwa’) and additional fields specific to the alignment program. For BWA, the additional fields are an eland-like flag, the number of hits having the best score and the number of hits having the best score minus 1. The flag can be ‘U?’, which stands for unique, ‘R?’ for repeat and ‘N?’ for no match. BWA places a repetitive read at a randomly chosen hit in the database.
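
The field order described above can be read with a few lines of Python. This is a minimal sketch under stated assumptions: the record and function names are hypothetical, the numeric columns are assumed to hold plain integers, and the sample values in the usage note (strand written as ‘+’, flag written as ‘U0’) are illustrative guesses rather than documented encodings.

```python
from collections import namedtuple

# One name per TAB-delimited column, in the order given in the text.
PafRecord = namedtuple("PafRecord", [
    "read_name", "left_clip", "strand", "db_name", "position", "mapq",
    "cigar", "read_seq", "read_qual", "program",
    "flag", "n_best", "n_best_minus_1",
])

# Columns assumed (not documented) to always contain integers.
_INT_FIELDS = ("left_clip", "position", "mapq", "n_best", "n_best_minus_1")


def parse_paf_line(line):
    """Split one BWA PAF line into a PafRecord, converting numeric columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(PafRecord._fields):
        raise ValueError("expected %d fields, got %d"
                         % (len(PafRecord._fields), len(fields)))
    rec = PafRecord(*fields)
    return rec._replace(**{name: int(getattr(rec, name)) for name in _INT_FIELDS})
```

For example, parse_paf_line("read1\t1\t+\tchrX\t12345\t37\t35M\t.\t.\tbwa\tU0\t1\t0") yields a record whose position is the integer 12345 and whose flag is the string "U0".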

NOTES

    Alignment Accuracy

When seeding is disabled, BWA is guaranteed to find an alignment with at most maxDiff differences, including at most maxGapO gap opens, provided the gaps do not occur within nIndelEnd bp of either end of the query. Longer gaps may be found if maxGapE is positive, but BWA is then not guaranteed to find all hits. When seeding is enabled, BWA further requires that the first seedLen bases of the query contain no more than maxSeedDiff differences.
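
The guarantee can be written as a predicate on a candidate alignment. The sketch below is a simplified model, not BWA code: the function name and the representation of differences are hypothetical, gap extensions and the -d option are ignored, and "within nIndelEnd bp of either end" is interpreted as the first or last nIndelEnd query positions. Default values mirror the defaults listed for the aln options.

```python
def within_guaranteed_space(query_len, differences, max_diff=2, max_gap_open=1,
                            n_indel_end=5, seed_len=None, max_seed_diff=2):
    """Return True if an alignment falls inside the space BWA is guaranteed to search.

    differences: list of (position, kind) pairs, where position is 0-based in
    the query and kind is "mismatch" or "gap_open" (a hypothetical encoding).
    seed_len=None models the default [inf], i.e. seeding disabled.
    """
    if len(differences) > max_diff:
        return False
    gap_positions = [pos for pos, kind in differences if kind == "gap_open"]
    if len(gap_positions) > max_gap_open:
        return False
    for pos in gap_positions:
        # Gaps in the first or last n_indel_end positions are disallowed.
        if pos < n_indel_end or pos >= query_len - n_indel_end:
            return False
    # Seeding applies only when the seed is no longer than the query.
    if seed_len is not None and seed_len <= query_len:
        in_seed = sum(1 for pos, _ in differences if pos < seed_len)
        if in_seed > max_seed_diff:
            return False
    return True
```

With the defaults, a 35bp read with one mismatch and one gap open in the middle of the read is inside the guaranteed space, while the same read with a gap open at position 2 is not.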

When gapped alignment is disabled, BWA is expected to generate the same alignment as Eland, the Illumina alignment program. However, as BWA changes ‘N’ in the database sequence to random nucleotides, hits to these random sequences will also be counted. As a consequence, BWA may mark a unique hit as a repeat if the random sequences happen to be identical to sequences which should be unique in the database. This random behaviour will be avoided in future releases.

    Memory Requirement

With the bwtsw algorithm, 2.5GB of memory is required for indexing the complete human genome sequences. In short read alignment, the peak memory is about 100MB plus three quarters of the size of the database sequence.
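
The alignment memory rule of thumb is simple arithmetic. A minimal sketch follows; the function name is hypothetical, and treating the database size as a figure in MB is an assumption about units.

```python
def peak_alignment_memory_mb(db_size_mb):
    """Approximate peak memory in MB: about 100MB plus three quarters of the database size."""
    return 100.0 + 0.75 * db_size_mb
```

For a database of 100MB this gives 100 + 75 = 175MB.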

    Speed

Indexing the human genome sequences takes 3 hours with the bwtsw algorithm. Indexing smaller genomes with the IS or divsufsort algorithms is several times faster, but requires more memory.

The speed of alignment is largely determined by the error rate of the query sequences (r). Firstly, BWA runs much faster for near-perfect hits than for hits with many differences, and it stops searching for hits with l+2 differences once an l-difference hit is found. This means BWA will be very slow if r is high, because in this case BWA has to visit hits with many differences and looking for these hits is expensive. Secondly, the underlying alignment algorithm makes the speed VERY sensitive to [k log(N)/m], where k is the maximum number of allowed differences, N the size of the database and m the length of a query. In practice, k is chosen according to r, so r is the dominant factor. I would not recommend using BWA on data with r>0.02.
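
Both effects can be illustrated numerically. The sketch below is not taken from the BWA source: the function names are hypothetical, the base of the logarithm is not stated in the text and the natural logarithm is assumed, and the stopping rule is modelled as capping the search at l+1 differences once an l-difference hit is known.

```python
import math


def search_cost_factor(k, db_size, query_len):
    """The quantity k*log(N)/m that the text says speed is very sensitive to.

    The natural logarithm is an assumption; the text does not give the base.
    """
    return k * math.log(db_size) / query_len


def remaining_diff_bound(max_diff, best_found=None):
    """Largest number of differences still worth searching for.

    Once a hit with best_found differences is known, hits with
    best_found + 2 or more differences are no longer searched.
    """
    if best_found is None:
        return max_diff
    return min(max_diff, best_found + 1)
```

With max_diff set to 4, finding a perfect hit lowers the bound to 1, which is why high-quality reads align quickly. Doubling the query length halves the cost factor, while growing the database only raises it logarithmically.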

In a practical experiment, BWA was able to map 2 million 32bp reads to a bacterial genome in several minutes, the same number of reads to the human X chromosome in 8-15 minutes and to the human genome in 15-25 minutes. This result implies that the speed of BWA is largely insensitive to the size of the database, and therefore that BWA is more efficient when the database is sufficiently large. On smaller genomes, hash-based algorithms are usually much faster.

    Comparison to Other Software

BWA was initially designed to be a concise open source tool that achieves results and speed similar to those of Eland. It aims at theoretical simplicity, not at being a comprehensive software suite with rich features like MAQ and SOAP. If speed is not critical, MAQ is preferable.

It is not straightforward to compare the speed of BWA with that of Eland, the fastest short read aligner to date. Generally, Eland’s speed is not affected by the error rate r. It may outperform BWA when r is high, but on high-quality reads, Eland takes twice as long as BWA to align reads to the human genome. It is worth noting that Eland counts 2-mismatch hits even if the read has a perfect hit, whereas BWA ignores all 2-mismatch hits if a perfect hit exists. This point should also be taken into account in a fair comparison. In fact, BWA can be made twice as fast by generating best unique hits only. However, such results are not very informative.

AUTHOR

Heng Li at the Sanger Institute wrote the key source code and integrated the following code for BWT construction: bwtsw, implemented by Chi-Kwong Wong at the University of Hong Kong, IS by Nong Ge at the Sun Yat-Sen University, and libdivsufsort by Yuta Mori.

HISTORY

BWA is largely influenced by BWT-SW. It uses code from BWT-SW and mimics the binary file formats of BWT-SW. At the same time, BWA is also different from BWT-SW. BWA uses quite a different algorithm to search for alignments and is much faster. While BWT-SW aims to be a general-purpose tool, BWA is more tuned towards short read alignment.

I started to write the first piece of code on 24 May 2008 and got the initial stable version on 02 June 2008. During this period, I learned that Professor Tak-Wah Lam, the first author of the BWT-SW paper, is collaborating with Beijing Genomics Institute on SOAP2, the successor to SOAP (Short Oligonucleotide Alignment Program). SOAP2 uses more advanced techniques than BWA. It is expected to be of similar speed to BWA, but with more functionality.


bwa-0.2.0 bwa (1) 15 August 2008