Richard Durbin 29/9/08
glfProgs are a set of software tools to manipulate and call SNPs from .glz files. It is available from http://sourceforge.net/project/showfiles.php?group_id=191815&package_id=293182 as part of the maq.sourceforge.net project site.
.glz files are gzipped .glf files, where .glf is a binary format that contains genotype likelihood information extracted from a multiple alignment of short sequencing reads. To be specific, .glf contains the following
int chrNameLen ; /* includes terminating 0 */
char* chrName ; /* chrNamelen chars
including 0 */
int chrLen ;
typedef
struct
{ unsigned
char ref:4, dummy:4 ; /* ref A=1,C=2,G=4,T=8,N=15 etc. */
unsigned char max_mapQ ; /*
maximum mapping quality */
unsigned char lk[10] ;
/* log likelihood ratio, max 255 */
unsigned min_lk:8, /* minimum lk capped at 255
depth:24 ;
/* and the number of mapped reads */
} GLF ;
lk[g] is the likelihood of
genotype g expressed as the log of the
likelihood ratio of the most likely genotype to genotye g in phred scaled units. For example lk[AA] = 0, lk[AG]
= 10, lk[GG] =
20 would mean that the most likely genotype is AA, which is ten times more
likely than AG, and a hundred times more likely than GG.
Of course we must remember these are likelihoods: if the reference base is G, and we have no other information, we might take the prior on AG to be 0.001, and the prior on AA to be 0.0005, which would give GG the highest posterior probability of about 95%, AA with about 4.5% and AG with 0.5%. The functions below support these prior calculations, and others where there is data on multiple individuals
maq
glfgen id.glz reference.bfa id.map
glfSoloPrior
< id.glz > id-solo.glz
glfSnpCall
< id-solo.glz > id-solo.snp // call SNPs de novo
glfSoloPrior
–theta 0.1 < id.glz > id-snp.glz
glfSubCall
–sites dbSNP.pos < id-snp.glz > id.dbSNP.geno
The latter two lines genotype dbSNP positions, giving a
prior heterozygosity of 0.1 to variants at positions in dbSNP compared to a
default prior heterozygosity of 0.001.
Usage: glfSoloPrior [-theta theta] < in.glz > out.glz
Applies priors of theta/2 to homozygous calls different from the reference, theta to heterozygotes with one allele the reference, and theta*theta to heterozygotes with both bases different from the
reference. The output is again in
.glz format, but now in posterior odds.
Usage: glfSnpCall
< in.glz > out.snp
This assumes that the input is posterior odds, i.e. that
appropriate priors have already been applied to the likelihoods, e.g. by glfSoloPrior
or glfTrioPrior. It calls
SNPs with output in the same format as MAQÕs .snp output, which allows easy comparison of results and
use of downstream tools. Each line
has 12 tab-delimited fields:
chromosome,
position,
reference base,
consensus base,
Phred-like consensus quality,
read depth,
0.0 (maq gives the average number
of hits of reads mapping to this position, but we donÕt have that information),
highest mapping quality of the
reads covering the position,
minimum quality in the 3bp flanking
regions at each side of the site (6bp in total),
second best call,
difference between the phred
score of there being a SNP and the quality,
third best call.
E.g.
20
48699 C Y 112
13 0.00 99 61
T 8 C
In particular note that the confidence (phred score) for
there being a SNP is given by the sum of the 4th and 11th
columns.
Usage: glfSubCall
–sites siteFile < in.glz > out.snp
Calls genotypes at all the positions in the siteFile, which is an ASCII file with five fields per row,
the first two of which are the chromosome and position to be typed. (The other 3 are typically used to
contain external genotype information, which is not used by this program.) e.g.
20 11244 N M M
It is essential that the chromosomes come in the same order
as in in.glz and that the positions
within a chromosome are sorted.
The output is in the same format as for glfSnpCall.
Usage: glfDump [-theta theta] [-theta
theta] mother.glz father.glz
child.glz
Usage: glfDump [–name chr] [-start
start] [-end end] < in.glz
Produces ASCII version of glf looking like:
20 10000 C 19 99 0
255 87 255 255 0 87 87 255 255 255
where the columns are: chromosome, position, reference base,
depth, maximum mapping quality, maximum likelihood (phred units), 10 x
likelihoods (phred transformed).
Usage: glfExtract
–name chr [-start start] [-end
end] < in.glz > out.glz
Extracts [a subregion of] a chromosome from a .glz file.
Usage: glfFastq [-d minDepth] [-D
maxDepth] [-Q minMapQ] < in.glz > out.fastq
Call every position a sequence, outputting the quality of
the call in fastq format.
Positions at which the depth is below minDepth or above maxDepth or the best mapping quality is below minMapQ will be output in lower case (enabling them to be
easily identified for filtering).
Usage: glfMultiPrior
[-theta theta] [-f
file of names] || *.glz > out.snp
Call SNPs from a set of individualsÕ glz files. The underlying prior model is that the
individuals come from an unstructured diploid population with constant
effective population size, so that the frequency of variants seen in k of the 2N
genomes is theta/k.
The output is in standard maq .snp format, with some
differences in the interpretation of the columns. In particular, column 4 (the variant) is always a het unless
the data are monomorphic different from the reference, column 5 (quality) is
the confidence of there being a SNP at this position, column 6 (depth) is the
total depth and column 8 (max_mapQ) is the minimum of the max_mapQ from the
samples. We completely misuse
column 7 (average number of hits for reads at this position) to store our estimate
of the allele frequency, since we donÕt have access to the average number of
hits for reads at this position.
The neighbourhood quality is correct, but the last three columns (2nd best call, SNP quality, 3rd
best call) are meaningless so absent.
In addition to creating the out.snp file, posterior odds
.glz files are created for each input .glz file, with Ò-multiÓ inserted into
their names before the Ò.glzÓ, as for glfTrioPrior.