Documentation for glfProgs package version 0.1

Richard Durbin 29/9/08

glfProgs are a set of software tools to manipulate and call SNPs from .glz files. It is available from http://sourceforge.net/project/showfiles.php?group_id=191815&package_id=293182 as part of the maq.sourceforge.net project site.

.glz files are gzipped .glf files, where .glf is a binary format that contains genotype likelihood information extracted from a multiple alignment of short sequencing reads. To be specific, .glf contains the following

Per chromosome

int chrNameLen ; /* includes terminating 0 */

char* chrName ; /* chrNamelen chars including 0 */

int chrLen ;

Per base

typedef struct

{ unsigned char ref:4, dummy:4 ; /* ref A=1,C=2,G=4,T=8,N=15 etc. */

unsigned char max_mapQ ; /* maximum mapping quality */

unsigned char lk[10] ; /* log likelihood ratio, max 255 */

unsigned min_lk:8, /* minimum lk capped at 255

depth:24 ; /* and the number of mapped reads */

} GLF ;

lk[g] is the likelihood of genotype g expressed as the log of the likelihood ratio of the most likely genotype to genotye g in phred scaled units. For example lk[AA] = 0, lk[AG] = 10, lk[GG] = 20 would mean that the most likely genotype is AA, which is ten times more likely than AG, and a hundred times more likely than GG.

Of course we must remember these are likelihoods: if the reference base is G, and we have no other information, we might take the prior on AG to be 0.001, and the prior on AA to be 0.0005, which would give GG the highest posterior probability of about 95%, AA with about 4.5% and AG with 0.5%. The functions below support these prior calculations, and others where there is data on multiple individuals

Synopsis

maq glfgen id.glz reference.bfa id.map

glfSoloPrior < id.glz > id-solo.glz

glfSnpCall < id-solo.glz > id-solo.snp // call SNPs de novo

glfSoloPrior –theta 0.1 < id.glz > id-snp.glz

glfSubCall –sites dbSNP.pos < id-snp.glz > id.dbSNP.geno

The latter two lines genotype dbSNP positions, giving a prior heterozygosity of 0.1 to variants at positions in dbSNP compared to a default prior heterozygosity of 0.001.

Programs

glfSoloPrior

Usage: glfSoloPrior [-theta theta] < in.glz > out.glz

Applies priors of theta/2 to homozygous calls different from the reference, theta to heterozygotes with one allele the reference, and theta*theta to heterozygotes with both bases different from the reference. The output is again in .glz format, but now in posterior odds.

glfSnpCall

Usage: glfSnpCall < in.glz > out.snp

This assumes that the input is posterior odds, i.e. that appropriate priors have already been applied to the likelihoods, e.g. by glfSoloPrior or glfTrioPrior. It calls SNPs with output in the same format as MAQ’s .snp output, which allows easy comparison of results and use of downstream tools. Each line has 12 tab-delimited fields:

chromosome,

position,

reference base,

consensus base,

Phred-like consensus quality,

read depth,

0.0 (maq gives the average number of hits of reads mapping to this position, but we don’t have that information),

highest mapping quality of the reads covering the position,

minimum quality in the 3bp flanking regions at each side of the site (6bp in total),

second best call,

difference between the phred score of there being a SNP and the quality,

third best call.

E.g.

20 48699 C Y 112 13 0.00 99 61 T 8 C

In particular note that the confidence (phred score) for there being a SNP is given by the sum of the 4^th and 11^th columns.

glfSubCall

Usage: glfSubCall –sites siteFile < in.glz > out.snp

Calls genotypes at all the positions in the siteFile, which is an ASCII file with five fields per row, the first two of which are the chromosome and position to be typed. (The other 3 are typically used to contain external genotype information, which is not used by this program.) e.g.

20 11244 N M M

It is essential that the chromosomes come in the same order as in in.glz and that the positions within a chromosome are sorted. The output is in the same format as for glfSnpCall.

glfTrioPrior

Usage: glfDump [-theta theta] [-theta theta] mother.glz father.glz child.glz

glfDump

Usage: glfDump [–name chr] [-start start] [-end end] < in.glz

Produces ASCII version of glf looking like:

20 10000 C 19 99 0 255 87 255 255 0 87 87 255 255 255

where the columns are: chromosome, position, reference base, depth, maximum mapping quality, maximum likelihood (phred units), 10 x likelihoods (phred transformed).

glfExtract

Usage: glfExtract –name chr [-start start] [-end end] < in.glz > out.glz

Extracts [a subregion of] a chromosome from a .glz file.

Additional programs it would be good to write

glfFastq

Usage: glfFastq [-d minDepth] [-D maxDepth] [-Q minMapQ] < in.glz > out.fastq

Call every position a sequence, outputting the quality of the call in fastq format. Positions at which the depth is below minDepth or above maxDepth or the best mapping quality is below minMapQ will be output in lower case (enabling them to be easily identified for filtering).

glfMultiPrior

Usage: glfMultiPrior [-theta theta] [-f file of names] || *.glz > out.snp

Call SNPs from a set of individuals’ glz files. The underlying prior model is that the individuals come from an unstructured diploid population with constant effective population size, so that the frequency of variants seen in k of the 2N genomes is theta/k.

The output is in standard maq .snp format, with some differences in the interpretation of the columns. In particular, column 4 (the variant) is always a het unless the data are monomorphic different from the reference, column 5 (quality) is the confidence of there being a SNP at this position, column 6 (depth) is the total depth and column 8 (max_mapQ) is the minimum of the max_mapQ from the samples. We completely misuse column 7 (average number of hits for reads at this position) to store our estimate of the allele frequency, since we don’t have access to the average number of hits for reads at this position. The neighbourhood quality is correct, but the last three columns (2^nd best call, SNP quality, 3^rd best call) are meaningless so absent.

In addition to creating the out.snp file, posterior odds .glz files are created for each input .glz file, with “-multi” inserted into their names before the “.glz”, as for glfTrioPrior.