Enancio

Enancio technology joins Illumina family

Reduces genomic data storage and transfer costs associated with big data

Benefits of Genomic Data Compression

Illumina is committed to delivering innovative sequencing technologies, and to helping customers manage growing volumes of data output that result from the proliferation of sequencing-based research. Enancio’s genomic data compression technology offers optimal levels of speed and efficiency, and nicely complements other Illumina informatics solutions.

Genomic data compression allows for:

  • Lower storage costs
  • High-speed file transfer
  • Reduced internal network traffic

Lossless Genomic Data Compression Technology

Enancio’s lossless genomic data compression technology reduces the data storage footprint by as much as five times by compressing the output from Illumina sequencers. Enancio technology uses a reference-based compression method. The idea is to use an ultra-fast mapping scheme to map reads onto a reference genome, and then store only the data needed to regenerate those reads: a position and a list of differences.
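The reference-based idea described above can be illustrated with a minimal sketch. It is not Enancio's implementation: it assumes the mapping position is already known, handles substitutions only (no insertions or deletions), and the names `encode_read` and `decode_read` are hypothetical.

```python
from typing import List, Tuple

# A difference is (offset within the read, base observed in the read).
Diff = Tuple[int, str]
Record = Tuple[int, int, List[Diff]]


def encode_read(read: str, reference: str, position: int) -> Record:
    """Encode a read as (position, length, substitutions) against the reference.

    Assumption for illustration: the read maps at `position` with
    substitutions only, and the position was found by some earlier mapping step.
    """
    window = reference[position:position + len(read)]
    if len(window) != len(read):
        raise ValueError("read runs past the end of the reference")
    diffs = [
        (offset, base)
        for offset, (base, ref_base) in enumerate(zip(read, window))
        if base != ref_base
    ]
    return position, len(read), diffs


def decode_read(record: Record, reference: str) -> str:
    """Regenerate the original read from its record and the same reference."""
    position, length, diffs = record
    bases = list(reference[position:position + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTACGTTAGCCGATAGGCTA"
read = "CGTTAGACGA"  # matches reference[5:15] except for one base

record = encode_read(read, reference, position=5)
print(record)
print(decode_read(record, reference) == read)
```

Only the position, the read length, and the mismatching bases are stored, which is why reads that closely match the reference take very little space.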

Other data compression technologies usually suffer from low speed. Enancio technology is optimized for high compression ratios, as well as fast compression and decompression rates, while preserving data integrity. Quality scores are encoded in a lossless way using a range encoder and context models adapted to the different types of quality schemes.
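To give a feel for why context models help with quality scores, the sketch below estimates how many bits an ideal range coder would need when driven by a simple adaptive model. It is an illustration only, not the Enancio model: the function name `adaptive_cost_bits`, the add-one smoothing, the four-symbol alphabet, and the example quality string are all assumptions.

```python
import math
from collections import defaultdict


def adaptive_cost_bits(quals: str, order: int, alphabet_size: int) -> float:
    """Ideal code length, in bits, under an adaptive context model.

    The context is the previous `order` quality characters. An ideal
    range coder spends -log2(p) bits on a symbol predicted with
    probability p; probabilities use add-one smoothing (an assumption).
    """
    counts = defaultdict(int)  # (context, symbol) -> occurrences so far
    totals = defaultdict(int)  # context -> symbols seen so far in that context
    bits = 0.0
    for i, symbol in enumerate(quals):
        context = quals[max(0, i - order):i]
        probability = (counts[(context, symbol)] + 1) / (totals[context] + alphabet_size)
        bits -= math.log2(probability)
        # Update the model after coding, as a decoder would.
        counts[(context, symbol)] += 1
        totals[context] += 1
    return bits


# Hypothetical quality string with runs of identical scores.
quals = "F" * 30 + "," * 5 + "F" * 20 + ":" * 10 + "#" * 5
for order in (0, 1):
    cost = adaptive_cost_bits(quals, order, alphabet_size=4)
    print(f"order-{order} model: {cost:.1f} bits")
```

A model that conditions on the previous score can exploit the tendency of neighbouring quality values to repeat, which a context-free model cannot.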

Access Enancio Decompression Software

All files compressed with the Illumina compression technology can easily be decompressed using the decompression software available here. The decompression software is free to download and to use.

Once installed, a simple command can be used to pipe the decompressed output on the fly into a wide range of popular mapping tools such as BWA, STAR, and Bowtie. The compression and decompression technology will also be seamlessly integrated within the DRAGEN secondary analysis workflow.

Enancio, a company recently acquired by Illumina, developed proprietary lossless data compression technology designed specifically for genomic data.

Enancio’s (now Illumina’s) lossless compression is specifically designed for genomic data. The DNA sequence is compressed using a reference-based method: reads are mapped to a reference genome using an ultra-fast mapping scheme devised for compression. Each read is encoded in a compact binary format as a position and a list of differences, and the result is then passed through an entropy coder. Quality scores are encoded in a lossless way using a range encoder and context models adapted to the different types of quality schemes.

Illumina’s compression technology reduces the data footprint of fastq files by a factor of 5 compared to gzip [1]. This translates into direct storage cost savings and more rapid file transfer speeds.

The compression technology will first be integrated into DRAGEN BCL conversion, giving users the option to produce compressed fastq files that are 5x smaller than fastq.gz [1].

The compression technology is available on the NextSeq 1000/2000, enabling the generation of compressed fastq files right off the instrument. Stay tuned for future DRAGEN releases that will include lossless genomic compression of fastq files as part of BCL conversion.

During the NGS workflow, you can optionally enable compression to generate compressed fastq.ora files during BCL conversion. Fastq.ora files can be decompressed on the fly for mapping and downstream analysis and will soon be directly ingested by DRAGEN. The integration of compression within DRAGEN BCL conversion streamlines the workflow, as shown in the figure below:

[Figure: Illumina compression technology used within DRAGEN. Before Enancio's acquisition, compression was standalone software and an extra step in the workflow.]

The output of the compression technology is a new compressed fastq binary file format: fastq.ora. This file format can be stored and shared to enable significant storage cost savings and reduced file transfer times. All compressed files can be decompressed with the freely available decompression software.

A 235 GB raw fastq file can be compressed to 55 GB via gzip. The data footprint is further reduced to 11 GB with the Illumina compression technology [2].
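The ratios implied by these figures can be worked out directly. The snippet below uses only the three sizes quoted above; the helper names are illustrative.

```python
# File sizes quoted in the text, in GB.
raw_gb, gzip_gb, ora_gb = 235, 55, 11


def compression_ratio(before_gb: float, after_gb: float) -> float:
    """How many times smaller the file becomes."""
    return before_gb / after_gb


def space_saved(before_gb: float, after_gb: float) -> float:
    """Fraction of the original footprint that is no longer needed."""
    return 1 - after_gb / before_gb


print(f"gzip vs raw:      {compression_ratio(raw_gb, gzip_gb):.2f}x")
print(f"fastq.ora vs gzip: {compression_ratio(gzip_gb, ora_gb):.2f}x")
print(f"fastq.ora vs raw:  {compression_ratio(raw_gb, ora_gb):.2f}x")
print(f"saved vs gzip:     {space_saved(gzip_gb, ora_gb):.0%}")
```

The 55 GB to 11 GB step is the factor of 5 relative to gzip, which corresponds to an 80% reduction in the space occupied by the gzip-compressed file.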

Fastq files and BAM or CRAM files are typically stored for different purposes. However, fastq.ora files enable you to store a compressed copy of your raw data with a preserved MD5 sum and smaller footprint than the corresponding CRAM file.
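A preserved MD5 sum means the decompressed file is byte-for-byte identical to the original. The check can be sketched as follows; `zlib` stands in here for any lossless codec (it is not the fastq.ora format), and the fastq records are made up for illustration.

```python
import hashlib
import zlib


def md5_hex(data: bytes) -> str:
    """MD5 checksum of a byte string, as a hex digest."""
    return hashlib.md5(data).hexdigest()


# Made-up fastq content, used only to exercise the round trip.
fastq = b"@read1\nACGTACGTAC\n+\nFFFFFFFFFF\n" * 1000

# zlib is a stand-in lossless codec, not the Illumina compressor.
compressed = zlib.compress(fastq, 9)
restored = zlib.decompress(compressed)

print(len(fastq), "bytes ->", len(compressed), "bytes")
print("MD5 preserved:", md5_hex(restored) == md5_hex(fastq))
```

Comparing the checksum recorded before compression with the checksum of the decompressed output is a simple way to confirm that a compressed archive still holds the raw data intact.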

DRAGEN can already compress BAM files to CRAM. Once the Enancio compression is integrated into DRAGEN, you will be able to compress fastq files to fastq.ora and BAM files to CRAM.

Using the compression is completely optional. DRAGEN users remain free to adopt the storage strategy they want: enable conversion to the Illumina compressed file format and store fastq.ora files, disable the conversion and store fastq.gz files, or store BAM or CRAM files.

Yes – the compression technology will be seamlessly integrated within the DRAGEN workflow.

Additionally, once the free decompression software is installed, a simple command can be used to pipe the decompressed output on the fly into a wide range of popular mapping tools such as BWA [3], STAR [4], and Bowtie [5].

Illumina compressed fastq files can be shared, and the decompression software is freely available.

Have questions about the compression technology?

Contact us to learn more.

DRAGEN Bio-IT Platform

Enancio’s genomic data compression technology will be directly integrated into DRAGEN, which provides accurate, ultra-rapid secondary genomic analysis of sequencing data.

Related Solutions

Infrastructure & Pipeline Setup

We offer a variety of resources and information to help simplify the process of setting up your informatics infrastructure.

Sequencing Data Analysis

Our sequencing data analysis software helps you spend more time doing research, and less time configuring and running analysis workflows.

Illumina Informatics Product Portfolio

Explore a broad range of informatics products designed to simplify genomic data analysis and management.

References
  1. On files generated by NextSeq 1000/2000 and NovaSeq 6000 Systems.
  2. This result was obtained from the DNA sample NA12878, sequenced on the NovaSeq 6000 instrument at 30x coverage. Data is accessible on the BaseSpace project: basespace.illumina.com/s/3ExEZMlH8Lkq.
  3. Li H. and Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 2009 Jul 15; 25(14): 1754–1760.
  4. Dobin A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013 Jan; 29(1): 15–21.
  5. Langmead B. et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology. 2009; 10: R25.