Posted on February 20, 2012 by Elliott Margulies, Ph.D.
Speed and accuracy are paramount when it comes to taking sequencing to the clinic. Here's a quick summary of my talk at AGBT, where I discussed research efforts to develop a fast, integrated sequencing and analysis workflow—from sample to answer.
Starting with sample prep, we've improved tolerance for low input (100 ng or less), which is critical for clinical samples such as tumor biopsies. A PCR-free workflow cuts out a huge amount of time, and we've shown that it delivers equivalent coverage and accuracy, as well as higher library diversity, compared with standard amplification-based methods. We've also successfully sequenced genomes from DNA extracted from formalin-fixed, paraffin-embedded (FFPE) tissue, opening access to millions of archival samples.
Sequencing genomes in a day on the HiSeq 2500 is routine for us, and importantly, this increase in speed does not sacrifice sensitivity or accuracy. At every step in the process, we are focusing on improved speed and simplicity while ensuring comprehensiveness and accuracy.
On the analysis front, we're adopting a standard format for disseminating whole-genome data. We call it a genome VCF (gVCF): it complies with the VCF standard and includes quality information about every position in the genome, not just variant positions. This is critical in a clinical setting, where ruling out the presence of specific variants is just as important as identifying causative variants.
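To make the "ruling out" point concrete, here is a minimal sketch of how a per-position query against gVCF-style records might look. It is an illustration under stated assumptions, not the actual gVCF layout or any tool's implementation: it assumes non-variant stretches are written as block records with an `END=` key in INFO and `.` in ALT, that confidence is carried in a `GQ` sample field, and that a threshold of 30 is meaningful. The example records, the `LowGQ` filter name, and the `parse_line` and `classify` helpers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Record:
    chrom: str
    start: int    # 1-based, inclusive
    end: int      # 1-based, inclusive
    alt: str      # "." means no alternate allele (assumed convention)
    passes: bool  # True if the filter and quality checks pass


def parse_line(line: str, min_gq: float = 30.0) -> Record:
    """Parse one tab-separated, single-sample VCF-style data line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt, _qual, flt, info, fmt, sample = fields[:10]
    start = int(pos)
    # INFO is a ';'-separated list; only key=value entries matter here.
    info_map = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
    # Assumption: block records give their last position as END=<int>.
    end = int(info_map.get("END", start + len(ref) - 1))
    sample_map = dict(zip(fmt.split(":"), sample.split(":")))
    gq = sample_map.get("GQ", ".")
    passes = flt in ("PASS", ".") and gq != "." and float(gq) >= min_gq
    return Record(chrom, start, end, alt, passes)


def classify(records: Iterable[Record], chrom: str, pos: int) -> str:
    """Report what the records say about a single genomic position."""
    for rec in records:
        if rec.chrom == chrom and rec.start <= pos <= rec.end:
            if not rec.passes:
                return "low_confidence"
            return "reference" if rec.alt == "." else "variant"
    return "no_call"  # position not covered by any record


# Hypothetical records, invented purely for illustration.
demo_lines = [
    "chr1\t100\t.\tA\t.\t.\tPASS\tEND=199\tGT:GQ\t0/0:50",
    "chr1\t200\t.\tG\tT\t80\tPASS\t.\tGT:GQ\t0/1:60",
    "chr1\t201\t.\tC\t.\t.\tLowGQ\tEND=250\tGT:GQ\t0/0:5",
]
demo_records = [parse_line(line) for line in demo_lines]
```

With records like these, a position inside a confident reference block can be reported as "reference" rather than simply being absent from the file, which is the distinction a variants-only VCF cannot make. A linear scan is used for brevity; a real genome-scale file would need an indexed lookup.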
We are also developing integrated alignment and variant detection methods that match or exceed the accuracy of current tools, but run 4 to 6 times faster. With these new tools, it is now possible to go from BCL files to variant calls in about 7 hours on a single server using a single command—much faster and simpler than other available methods. This is an important step towards realizing the full potential of sequencing genomes in a day.
Rich annotations are essential for supporting efficient genome interpretation. Detected variants are now fully annotated using the Variant Effect Predictor (VEP). They also carry dbSNP v135 annotations, 1000 Genomes population allele frequencies, evolutionary constraint (overlap with phastCons elements), clinically relevant information (HGMD), and GWAS associations compiled by NHGRI.
Finally, I discussed the idea of the "portable" genome, and how to make this information easily accessible for interpretation, both through the BaseSpace cloud analysis environment and through physical delivery on a USB stick. As proof of concept, we put four whole human genomes on a single USB stick and handed them out at the conference. All genomes contain aligned reads and fully annotated variants that are directly viewable using an included, pre-configured IGV browser. These "genomes on a stick" are the first of their kind, bringing together our advances in speed, accuracy, and additional interpretable information. With all the advances we've made, you could start sample prep on a Monday morning and go home Wednesday night with a USB stick containing a fully aligned and annotated genome.
Of the four genomes on the USB stick that we gave out at AGBT, one was sequenced in a day on the HiSeq 2500 system, and the other three were a trio sequenced from PCR-free libraries. Visit http://www.platinumgenomes.org for more ways to view these genomes in the UCSC Genome Browser, as well as links for downloading the raw sequence data from the European Nucleotide Archive.