Sequence file formats

Choose your preferred file format for downstream analysis of sequencing data

File formats for Illumina sequencing

Numerous options are available for converting data to compatible sequence file formats, such as FASTQ (FASTA with quality scores), and for downstream analysis of next-generation sequencing (NGS) data. Illumina sequencing systems are designed so that data can be streamed directly into cloud-based Illumina informatics platforms for data management, analysis, and collaboration.

Raw data files are provided in sequence file formats that are compatible with, or easily converted to, standardized data formats for streamlined aggregation and mining of large cohorts.

FASTQ sequence file formats

FASTQ files are text files containing sequence data with a quality score for each base, encoded as a single American Standard Code for Information Interchange (ASCII) character.

FASTQ file format

FASTQ is a text-based sequencing data file format that stores both raw sequence data and quality scores. FASTQ files have become the standard format for storing NGS data from Illumina sequencing systems, and can be used as input for a wide variety of secondary data analysis solutions.

FASTQ files can contain millions of entries and range from several megabytes to several gigabytes in size, which often makes them too large to open in a typical text editor. It is generally not necessary to view FASTQ files directly, since they are intermediate outputs used as input for downstream data analysis tools.
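Because FASTQ files are often too large to open at once, tools typically read them one record at a time. The following is a minimal sketch of that idea, assuming the common layout of four lines per record (an "@" header, the bases, a "+" separator, and the quality string). The names read_fastq, open_fastq, and FastqRecord are hypothetical helpers written for this illustration, not part of any Illumina software.

```python
import gzip
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str       # read identifier, without the leading "@"
    sequence: str   # base calls
    quality: str    # one ASCII character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records lazily, assuming four lines per record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate blank lines between or after records
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline()
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:].rstrip("\n"), sequence, quality)


def open_fastq(path: str) -> TextIO:
    """Open a plain or gzip-compressed FASTQ file as text, chosen by extension."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


# Small in-memory example with made-up reads
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!#I\n")
records = list(read_fastq(example))
print(len(records), records[0].name, records[0].sequence)
```

Only one record is held in memory at a time when the generator is consumed in a loop, so the same approach works for files of any size.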

FASTQ ORA file format

FASTQ Original Read Archive (ORA) files are lossless data compression files that make it easier to store, manage, and share large NGS data files. This file format reduces file size, time to transfer, and data storage costs. FASTQ ORA files are up to 5× smaller than FASTQ files in traditional fastq.gz format without compromising data integrity. FASTQ ORA files can be generated with Illumina DRAGEN secondary analysis software.

All fastq.ora files can be read using the free DRAGEN ORA Decompression Software provided by Illumina. Once installed, a simple command pipes the output of decompression into popular mapping tools such as BWA [1], STAR [2], and Bowtie [3].

Lossless data compression cuts costs and time

Find out more about the benefits of lossless genomic data compression and how DRAGEN secondary analysis ORA files significantly reduce data analysis time and data storage costs.

BCL sequence file format

Binary base call (BCL) files contain raw data generated by Illumina sequencing systems. The BCL sequence file format requires conversion to FASTQ format for use with user-developed or third-party data analysis tools.

DRAGEN secondary analysis offers rapid BCL conversion to FASTQ files as part of its suite of pipelines. Illumina also offers BCL Convert software to convert BCL files to FASTQ files. BCL Convert is a standalone software solution that demultiplexes data and converts BCL files to standard FASTQ file formats for downstream analysis.

Other sequence file formats

FASTQ files are the typical starting format for sequencing data analysis. However, BaseSpace Sequence Hub can create other file formats that are common to secondary and tertiary analysis programs.

During secondary or tertiary analysis of NGS data, Illumina software platforms and apps often convert FASTQ files to other file formats (eg, *.bam, *.vcf) as part of the analysis workflow.

Illumina informatics for oncology

Join Dylan Barfield, Illumina Staff Software Technical Product Manager, as he discusses how advancements in Illumina AI and bioinformatics are driving research to make personalized cancer care more accessible.

Sequence file formats FAQ

Illumina sequencing data is commonly stored in FASTQ, FASTQ ORA, and BCL files. Several other file formats are also commonly used:

  1. SAM: Sequence alignment map files are text files that contain the alignment information of sequences mapped to a reference sequence.
  2. BAM: Binary alignment map files store the output of sequence alignment in binary format. They are smaller and more efficient for software to process than SAM files.
  3. CRAM: Compressed reference-based alignment format is a highly compressed alternative to BAM that stores only the base calls that differ from the reference.
  4. VCF: Variant call format is a standardized text file format used for storing variant information (eg, small variants such as single nucleotide polymorphisms [SNPs] and indels, and fusion genes).
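SAM and VCF are both tab-delimited text formats, so a single data line can be split into named fields with the standard library alone. The sketch below assumes the 11 mandatory SAM alignment columns and the 8 fixed VCF columns defined in their public specifications. The helper parse_tab_line and the two example lines are made up for illustration. For real data, a dedicated library is usually a better choice.

```python
from typing import Dict, List

# Mandatory alignment columns from the SAM specification
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

# Fixed columns from the VCF specification
VCF_FIELDS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_tab_line(line: str, field_names: List[str]) -> Dict[str, object]:
    """Split one tab-delimited data line into named fields (hypothetical helper)."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(field_names):
        raise ValueError(
            f"expected at least {len(field_names)} columns, got {len(columns)}"
        )
    record: Dict[str, object] = dict(zip(field_names, columns))
    # Optional SAM tags or VCF genotype columns are kept unparsed
    record["EXTRA"] = columns[len(field_names):]
    return record


# Made-up example lines, not real data
sam_line = "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0"
vcf_line = "chr1\t101\t.\tC\tT\t50\tPASS\tDP=30"

sam = parse_tab_line(sam_line, SAM_FIELDS)
vcf = parse_tab_line(vcf_line, VCF_FIELDS)

# Bit 0x4 of the SAM FLAG field marks an unmapped read
is_unmapped = bool(int(str(sam["FLAG"])) & 0x4)

print(sam["RNAME"], sam["POS"], sam["CIGAR"], is_unmapped)
print(vcf["CHROM"], vcf["POS"], vcf["REF"], vcf["ALT"])
```

BAM and CRAM are binary formats and cannot be parsed this way. They are normally read through tools that implement those formats.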

Quality scores measure the probability that a base is called incorrectly and are essential for improving analysis accuracy by filtering out low-quality data. The quality score is on the Phred scale, where higher values mean a lower probability of error. In practical application, these scores are often used to assess read quality to inform trimming and filtering decisions before downstream steps [4].
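On the Phred scale, a quality score Q corresponds to an error probability of 10^(-Q/10), so Q20 means a 1 in 100 chance of an incorrect base call and Q30 means 1 in 1000. The sketch below decodes a FASTQ quality string and applies a simple read filter. It assumes the Phred+33 ASCII offset used by current Illumina FASTQ files. The function names and the 0.01 threshold are illustrative choices, not recommended settings.

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality]


def error_probability(score: int) -> float:
    """Probability that a base with this Phred score was called incorrectly."""
    return 10 ** (-score / 10)


def mean_error_probability(quality: str, offset: int = 33) -> float:
    """Average per-base error probability across one read."""
    scores = phred_scores(quality, offset)
    if not scores:
        raise ValueError("empty quality string")
    return sum(error_probability(score) for score in scores) / len(scores)


def passes_filter(quality: str, max_mean_error: float = 0.01) -> bool:
    """Keep a read if its mean error probability is at or below the threshold."""
    return mean_error_probability(quality) <= max_mean_error


print(phred_scores("II5!"))   # → [40, 40, 20, 0]
print(passes_filter("IIII"))  # → True
print(passes_filter("!!!!"))  # → False
```

Averaging error probabilities, rather than averaging the Phred scores themselves, avoids a few very high scores masking low-quality bases, because the scale is logarithmic.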

Learn more about sequencing quality scores.

There are several best practices for managing large FASTQ data sets, including performing data quality checks with tools such as FastQC, compressing large files, archiving older data sets, and using cloud storage. Following these best practices helps improve data storage efficiency, ensure reproducibility, and maintain data integrity, especially when combined with checksum verification [5].
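Compression and checksum verification can be combined so that an archived file is confirmed to decompress to exactly the original bytes. The following is a minimal sketch using gzip and SHA-256 from the Python standard library. The helper names and the temporary example file are assumptions made for this illustration, and the same pattern applies to other compressors and hash functions.

```python
import gzip
import hashlib
import os
import shutil
import tempfile


def sha256_of_stream(stream, chunk_size: int = 1 << 20) -> str:
    """Hash a binary stream in chunks so large files never sit fully in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def sha256_of_file(path: str) -> str:
    with open(path, "rb") as handle:
        return sha256_of_stream(handle)


def gzip_file(source: str, destination: str) -> None:
    """Write a gzip-compressed copy of source to destination."""
    with open(source, "rb") as fin, gzip.open(destination, "wb") as fout:
        shutil.copyfileobj(fin, fout)


def sha256_of_gzip_payload(path: str) -> str:
    """Hash the decompressed contents of a gzip file."""
    with gzip.open(path, "rb") as handle:
        return sha256_of_stream(handle)


# Round trip on a small made-up FASTQ file in a temporary directory
with tempfile.TemporaryDirectory() as workdir:
    fastq_path = os.path.join(workdir, "example.fastq")
    with open(fastq_path, "w") as handle:
        handle.write("@read1\nACGT\n+\nIIII\n" * 1000)

    original_digest = sha256_of_file(fastq_path)
    gz_path = fastq_path + ".gz"
    gzip_file(fastq_path, gz_path)
    roundtrip_digest = sha256_of_gzip_payload(gz_path)
    compressed_is_smaller = os.path.getsize(gz_path) < os.path.getsize(fastq_path)

print(original_digest == roundtrip_digest, compressed_is_smaller)
```

Recording the digest of the uncompressed data alongside the archive makes it possible to verify integrity later, regardless of which compression format is used for storage.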

Learn more about cloud-based solutions for genomic data storage.

Explore our genomic data compression page to learn about DRAGEN original read archive (ORA) lossless genomic data compression technology.

Additional resources

Developer portal

Access user guides, release notes, and additional technical information.

NGS training

Get hands-on NGS training from expert instructors. We also offer live or self-paced online courses and other educational resources.

DRAGEN secondary analysis pipelines

Discover DRAGEN secondary analysis pipelines that support various NGS experiment types, including genome, exome, transcriptome, and methylome studies.

Speak to a specialist

Talk to an expert to learn more about sequence file formats.

References

  1. Li H, Durbin R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics. 2010;26(5):589-595. doi:10.1093/bioinformatics/btp698
  2. Dobin A, Davis CA, Schlesinger F, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29(1):15-21. doi:10.1093/bioinformatics/bts635
  3. Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10(3):R25. doi:10.1186/gb-2009-10-3-r25
  4. Hemstrom W, Grummer JA, Luikart G, Christie MR. Next-generation data filtering in the genomics era. Nat Rev Genet. 2024;25(11):750-767. doi:10.1038/s41576-024-00738-6
  5. Kumar S, Singh MP, Nayak SR, et al. A new efficient referential genome compression technique for FastQ files. Funct Integr Genomics. 2023;23(4):333. Published 2023 Nov 11. doi:10.1007/s10142-023-01259-x