FASTQ Format Specification
Introduction
FASTQ format stores sequences and Phred qualities in a single file. It is concise and compact. FASTQ is first widely used in the Sanger Institute and therefore we usually take the Sanger specification and the standard FASTQ format, or simply FASTQ format. Although Solexa/Illumina read file looks pretty much like FASTQ, they are different in that the qualities are scaled differently. In the quality string, if you can see a character with its ASCII code higher than 90, probably your file is in the Solexa/Illumina format.
Example
@EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC + ;;3;;;;;;;;;;;;7;;;;;;;88 @EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA + ;;;;;;;;;;;7;;;;;-;;;3;83 @EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG +EAS54_6_R1_2_1_443_348 ;;;;;;;;;;;9;7;;.7;393333
FASTQ Format Specification
Notations
- <fastq>, <blocks> and so on represents non-terminal symbols.
- Characters in red are regex-like operators.
- '\n' stands for the Return key.
Syntax
<fastq> | := | <block>+ |
<block> | := | @<seqname>\n<seq>\n+[<seqname>]\n<qual>\n |
<seqname> | := | [A-Za-z0-9_.:-]+ |
<seq> | := | [A-Za-z\n\.~]+ |
<qual> | := | [!-~\n]+ |
Requirements
- The <seqname> following '+' is optional, but if it appears right after '+', it should be identical to the <seqname> following '@'.
- The length of <seq> is identical the length of <qual>. Each character in <qual> represents the phred quality of the corresponding nucleotide in <seq>.
- If the Phred quality
is $Q, which is a non-negative integer, the
corresponding quality character can be calculated with the following
Perl code:
$q = chr(($Q<=93? $Q : 93) + 33);
$Q = ord($q) - 33;where ord() gives the ASCII code of a character.
Solexa/Illumina Read Format
The syntax of Solexa/Illumina read format is almost identical to the FASTQ format, but the qualities are scaled differently. Given a character $sq, the following Perl code gives the Phred quality $Q:
$Q = 10 * log(1 + 10 ** (ord($sq) - 64) / 10.0)) / log(10);