Introduction
MAQ dumps all the short read alignments in a single .map file, which is a compressed binary file specifically designed for short read alignment. MAQ provides the utilities to merge such files, extract various information, swiftly get read alignments in any region of the genome, call SNPs and short indels, generate pileup format, and view the alignment in the MAQ's graphical alignment viewer. If you have a short read aligner and want to realize all the downstream analyses implemented in MAQ, you may consider generating .map files from your aligner. Currently, MAQ supports converting Eland standard output and Illumina's export alignment format. NovoCraft developers have written a converter for the output of their program.
Before you start to write a new converter, please be aware that .map format only supports reads no longer than 127bp (or 63bp before maq-0.7.0) and just one indel location. We plan to design a new format which is similar to .map but with more flexibility. When we come to this point, we will provide a converter from the .map format to the new format.
MAQ alignment format
A .map file contains a header and the alignments. The header simply keeps the names of all the reference sequences in the alignment. In the alignment part, each read alignment is stored in an 184-byte struct (or 120 bytes before maq-0.7.0). The alignments are required to be ordered: a read mapped to i-th reference in the header should appear before the read mapped to j-th reference (i<j); for reads mapped to the same reference, a read mapped with smaller coordinate should appear earlier.
You can download mapview.c and mapview.pl from this website, which print a .map file in plain text. The following explains the codes. Basically, the .map format is defined by two C structures:
#ifdef MAQ_LONGREADS
# define MAX_READLEN 128
#else
# define MAX_READLEN 64
#endif
#define MAX_NAMELEN 36
#define MAQMAP_FORMAT_OLD 0
#define MAQMAP_FORMAT_NEW -1
#define PAIRFLAG_FF 0x01
#define PAIRFLAG_FR 0x02
#define PAIRFLAG_RF 0x04
#define PAIRFLAG_RR 0x08
#define PAIRFLAG_PAIRED 0x10
#define PAIRFLAG_DIFFCHR 0x20
#define PAIRFLAG_NOMATCH 0x40
#define PAIRFLAG_SW 0x80
/*
"130" shows the alternative definition when flag==(PAIRFLAG_SW|PAIRFLAG_PAIRED).
name: read name
size: the length of the read
seq: read sequence + qualities (see also below)
seq[MAX_READLEN-1]: SE mapQ (equals map_qual if unpaired); 130: indel length
map_qual: the final mapping quality; 130: position of the indel (0 if no indel)
alt_qual: the lower mapQ of the two ends (equals map_qual if unpaired); 130: mapQ of its mate
flag: status of the pair, a bitwise flag using PAIRFLAG_ macros
dist: offset of the mate (zero if unpaired, or two ends mapped to different chr)
info1: mismatches in the 28bp (higher 4 bits) and mismatches (lower 4 bits)
info2: sum of errors of the best hit
c[2]: counts of all 0- and 1-mismatch hits on the reference
*/
typedef struct {
uint8_t seq[MAX_READLEN]; /* the last base is the single-end mapping quality. */
uint8_t size, map_qual, info1, info2, c[2], flag, alt_qual;
uint32_t seqid, pos;
int dist;
char name[MAX_NAMELEN];
} maqmap1_t;
typedef struct {
int format, n_ref;
char **ref_name;
} maqmap_t;
Once you can read/write these two structures, you can read the header with:
maqmap_t *maqmap_read_header(gzFile fp)
{
maqmap_t *mm;
int k, len;
uint64_t tmp = 0;
mm = (maqmap_t*)calloc(1, sizeof(maqmap_t));
gzread(fp, &mm->format, sizeof(int));
if (mm->format != MAQMAP_FORMAT_NEW) {
fprintf(stderr, "Wrong alignment format!\n");
exit(3);
}
gzread(fp, &mm->n_ref, sizeof(int));
mm->ref_name = (char**)calloc(mm->n_ref, sizeof(char*));
for (k = 0; k != mm->n_ref; ++k) {
gzread(fp, &len, sizeof(int));
mm->ref_name[k] = (char*)malloc(len * sizeof(char));
gzread(fp, mm->ref_name[k], len);
}
gzread(fp, &tmp, 8); /* for backward compatibility */
return mm;
}
write the header with:
void maqmap_write_header(gzFile fp, const maqmap_t *mm)
{
int i, len;
uint64_t tmp;
gzwrite(fp, &mm->format, sizeof(int));
gzwrite(fp, &mm->n_ref, sizeof(int));
for (i = 0; i != mm->n_ref; ++i) {
len = strlen(mm->ref_name[i]) + 1;
gzwrite(fp, &len, sizeof(int));
gzwrite(fp, mm->ref_name[i], len);
}
gzwrite(fp, &tmp, 8); /* useless, for backward compatibility */
}
and delete the header struct with:
void maq_delete_maqmap(maqmap_t *mm)
{
int i;
if (mm == 0) return;
for (i = 0; i < mm->n_ref; ++i) free(mm->ref_name[i]);
free(mm->ref_name);
free(mm);
}
Reading/writing an alignment is much simpler. You can read/write an alignment with a single call to zlib functions:
gzread(fpmap, m1, sizeof(maqmap1_t));
gzwrite(fpmap, m1, sizeof(maqmap1_t));
where fpmap is the gzFile handler of the .map file, m1 is a pointer to maqmap1_t struct. Here is the complete source codes for mapview (see also MAQ manual page for the format):
void mapview_core(gzFile fpin)
{
uint32_t j;
maqmap_t *m = maqmap_read_header(fpin);
maqmap1_t *m1, mm1;
m1 = &mm1;
while (gzread(fpin, m1, sizeof(maqmap1_t)) == sizeof(maqmap1_t)) {
printf("%s\t%s\t%d\t%c\t%d\t%u\t%d\t%d\t%d\t%d\t%d\t%d\t%d\t%d",
m1->name, m->ref_name[m1->seqid], (m1->pos>>1) + 1,
(m1->pos&1)? '-' : '+', m1->dist, m1->flag, m1->map_qual,
(signed char)m1->seq[MAX_READLEN-1], m1->alt_qual,
m1->info1&0xf, m1->info2, m1->c[0], m1->c[1], m1->size);
putchar('\t');
for (j = 0; j != m1->size; ++j) {
if (m1->seq[j] == 0) putchar('n');
else putchar("ACGT"[m1->seq[j]>>6&3]);
}
putchar('\t');
for (j = 0; j != m1->size; ++j)
putchar((m1->seq[j]&0x3f) + 33);
putchar('\n');
}
maq_delete_maqmap(m);
}
Alternatively, you can use the following Perl codes to get the same output:
my ($fh, $buf, @ref);
open($fh, "gzip -dc $ARGV[0] |") || die("[maq] Cannot decompress $ARGV[0]");
binmode($fh);
# read header
$buf = '';
read($fh, $buf, 8);
my ($format, $n_ref) = unpack("i2", $buf);
for (1..$n_ref) {
read($fh, $buf, 4);
my $len = unpack("i", $buf);
read($fh, $buf, $len);
($_) = unpack("Z$len", $buf);
push(@ref, $_);
}
read($fh, $buf, 8); # what is read is the number of reads, but frequently it is zero.
# read alignment
my @ACGT = ('A', 'C', 'G', 'T');
my $max = 127;
while (read($fh, $buf, 57+$max) == 57+$max) {
my ($sq, $qse, $size, $q, $i1, $i2, $c0, $c1, $f, $aq, $sid, $x, $d, $rn)
= unpack("a${max}cC8I2iZ36", $buf);
print "$rn\t$ref[$sid]\t", ($x>>1)+1, "\t", ($x&1)?'-':'+', "\t$d\t$f\t$q\t$qse\t$aq\t",
$i1&0xf, "\t$i2\t$c0\t$c1\t$size";
my @t = unpack("C$size", $sq);
my $seq = '';
foreach my $p (@t) {
if ($p == 0) { $seq .= 'N'; }
else { $seq .= $ACGT[$p>>6]; }
$p = chr(33 + ($p&0x3f));
}
print "\t$seq\t", @t, "\n";
}
close($fh);