What are STR and why do they matter?
Short tandem repeats (STR) are regions of the genome where simple sequences of DNA are copied back to back (Table 1). There are many STR regions in the human genome, most of which have no known function.
|Person A: (CAT)×3: ...CATCATCAT...|
|Person B: (CAT)×1: ...CAT...|
|Person C: (CAT)×9: ...CATCATCATCATCATCATCATCATCAT...|
Table 1. A hypothetical example of an STR
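The idea in Table 1 can be sketched in a few lines of code. This is a minimal illustration, assuming exact (uninterrupted) repeats; the function name and the flanking sequences `GGA` and `TTG` are hypothetical and chosen only so the example runs.

```python
import re


def longest_tandem_run(sequence: str, motif: str) -> int:
    """Return the largest number of back-to-back copies of `motif` in `sequence`."""
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    runs = [len(m.group(0)) // len(motif) for m in pattern.finditer(sequence)]
    return max(runs, default=0)


# Hypothetical sequences mirroring Table 1, with made-up flanking bases.
people = {
    "Person A": "GGA" + "CAT" * 3 + "TTG",
    "Person B": "GGA" + "CAT" * 1 + "TTG",
    "Person C": "GGA" + "CAT" * 9 + "TTG",
}

for name, seq in people.items():
    print(name, longest_tandem_run(seq, "CAT"))
```

Real STR alleles can contain interruptions and sequencing errors, so production tools do not rely on exact string matching like this.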
One of the earliest discoveries of an association between STR variability and a medical condition came from Huntington disease. To learn more about Marcy MacDonald, PhD, and the history of her team's groundbreaking discovery, check out this article from Nature Education. Individuals with Huntington disease have more than 40 consecutive copies of the three-nucleotide sequence C-A-G (a "trinucleotide repeat") within a gene named after the condition (HTT). The repeat lies within the coding sequence of the gene and is translated into a run of the amino acid glutamine. The increased number of consecutive glutamines causes the resulting protein to aggregate in neurons, which eventually leads to the clinical signs and symptoms of Huntington disease, including ataxia and neurological decline.
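To make the link between the DNA repeat and the protein concrete, here is a small sketch that translates a toy coding sequence and measures the resulting glutamine run. The codon table is deliberately partial, the toy gene is invented, and the threshold constant simply restates the figure of 40 repeats quoted above; none of this is a clinical classifier.

```python
# Partial codon table: only the codons used in this toy example.
CODON_TABLE = {"ATG": "M", "CAG": "Q", "CAA": "Q", "CCG": "P"}

# Repeat count quoted in the text for Huntington disease (illustrative only).
HTT_EXPANDED_THRESHOLD = 40


def translate(dna: str) -> str:
    """Translate DNA codon by codon; unknown codons become 'X'."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


def longest_glutamine_run(protein: str) -> int:
    """Length of the longest stretch of consecutive glutamines (Q)."""
    longest = current = 0
    for amino_acid in protein:
        current = current + 1 if amino_acid == "Q" else 0
        longest = max(longest, current)
    return longest


# Hypothetical toy gene: start codon, 42 CAG repeats, one proline codon.
toy_gene = "ATG" + "CAG" * 42 + "CCG"
protein = translate(toy_gene)
run = longest_glutamine_run(protein)
print(run, run > HTT_EXPANDED_THRESHOLD)
```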
In families with Huntington disease, which has an autosomal dominant inheritance pattern, it was noted that children of affected individuals often had earlier onset of symptoms and a more rapid course of neurodegeneration. STR analysis showed that the expanded repeats were expanding even further in the severely affected offspring, providing a mechanism for this tragic phenomenon, now called "anticipation."
One of the most insidious aspects of Huntington disease is that symptoms typically do not appear until after a person has already had children, and those children are at a 50% risk of inheriting an expanded STR. With the tools now available, it is technically possible to screen everyone for this disease much earlier. Unfortunately, there are no effective treatments yet, so it is not appropriate to recommend this kind of testing to the general population. Efforts are underway to pursue targeted gene therapy for Huntington disease. If a gene-based cure for Huntington disease can be developed, it could become justifiable to test everyone.
The discovery of the HTT repeat expansion inspired other groups to search for this type of variant in other conditions. There are now at least 56 different genes where STR have been associated with human disease (Figure 2), including fragile X syndrome (FMR1 gene), which is one of the most common forms of inherited intellectual disability and a top-tier condition recommended for carrier screening by the American College of Medical Genetics and Genomics.
How are STR typically analyzed?
The first effective methodology for evaluating STR was the Southern blot. While this method is highly sensitive, it is cumbersome to perform in the lab and it is difficult to assess the exact number of repeats (Figure 3).
Can STR be analyzed using panel or exome data?
Unfortunately, the library preparation and target amplification or hybridization processes necessary for panel and exome sequencing by NGS remove the repetitive DNA from the assay. No amount of bioinformatics can rescue a signal that isn’t in the tube.
Where does DRAGEN STR come in?
In contrast to panel or exome sequencing, PCR-free whole-genome sequencing (pfWGS) retains the repetitive genomic DNA for sequencing. The challenge for researchers then becomes genotyping the repeat lengths when the most relevant expanded alleles often exceed the read length of short-read Illumina sequencing data. To address this issue, Illumina alumni Egor Dolzhenko, PhD, Michael Eberle, PhD, and colleagues developed ExpansionHunter. First, they created a custom set of references for important STR regions against which subject data can be compared. The algorithm then identifies informative sequence reads in the pfWGS data, such as reads in flanking regions and reads containing the repeat sequence, along with their paired-end mates. Combining the specially prepared references with these reads, the algorithm can readily identify non-expanded alleles and flag cases with potentially expanded ones.
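The kinds of informative reads described above can be illustrated with a toy classifier. This is a simplified sketch that uses exact string matching as a stand-in; ExpansionHunter itself aligns reads to a sequence graph, and the function names, the flank sequences, and the `anchor` parameter below are all hypothetical.

```python
import re


def is_in_repeat(read: str, motif: str) -> bool:
    """True if the read consists only of motif copies, in any phase."""
    repeated = motif * (len(read) // len(motif) + 2)
    return len(read) >= len(motif) and read in repeated


def classify_read(read, left_flank, motif, right_flank, anchor=5):
    """Label a read by how it overlaps the repeat locus (toy logic)."""
    has_left = left_flank[-anchor:] + motif in read
    has_right = motif + right_flank[:anchor] in read
    if has_left and has_right:
        return "spanning"       # covers the whole repeat: exact size is readable
    if has_left or has_right:
        return "flanking"       # covers one edge: gives a lower bound on size
    if is_in_repeat(read, motif):
        return "in-repeat"      # repeat only: hints at an allele longer than the read
    return "uninformative"


def spanning_repeat_count(read, left_flank, motif, right_flank, anchor=5):
    """Count motif copies between the two flank anchors, or None if not spanning."""
    pattern = (re.escape(left_flank[-anchor:])
               + "((?:" + re.escape(motif) + ")+)"
               + re.escape(right_flank[:anchor]))
    match = re.search(pattern, read)
    return len(match.group(1)) // len(motif) if match else None


# Hypothetical locus and reads.
LEFT, MOTIF, RIGHT = "ACGTTGCA", "CAG", "TTGACCGT"
reads = [
    "TTGCA" + "CAG" * 4 + "TTGAC",
    "GTTGCA" + "CAG" * 6,
    "AGCAGCAGCAGCAGC",
    "ACGTACGTACGT",
]
for r in reads:
    print(classify_read(r, LEFT, MOTIF, RIGHT), spanning_repeat_count(r, LEFT, MOTIF, RIGHT))
```

When an allele is longer than the read length, no spanning read exists, which is why flanking and in-repeat reads (together with their mates) carry the signal for expanded alleles.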
Illumina’s DRAGEN Bio-IT platform includes STR genotyping using ExpansionHunter as an option for any samples with pfWGS data. This can be run on local hardware or in the cloud. If you are curious about implementing DRAGEN workflows with your data, please reach out!
For those interested in the math and bioinformatics, the primary literature is a great resource. A previous publication goes into even more detail.
How does DRAGEN STR perform?
When challenged with a series of positive and negative data sets for multiple STR expansion conditions, ExpansionHunter exhibited excellent performance (Figure 6). All the positive controls tested positive except one, demonstrating extremely high sensitivity. The specificity (percentage of true negatives testing negative) was also extremely high, though some normal controls tested positive. Combined, these findings strongly support a “screen and confirm” approach, where pfWGS is performed on all participants and follow-up repeat-primed PCR (rpPCR) is used to confirm potential expansions in genes where participants are above a laboratory-validated cutoff value.
The lone false negative in the study from Figure 6 is worth further discussion. This sample has an FMR1 pre-mutation, but was considered normal using the predefined cutoff. This suggests that a clinical laboratory may want to consider using a slightly lower cutoff and accepting a modestly higher rpPCR confirmation rate to ensure full sensitivity for this expansion type. This is a classic example of the give-and-take nature of balancing sensitivity against specificity.
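The trade-off between sensitivity and specificity can be shown with a small worked example. The repeat sizes and cutoffs below are entirely hypothetical and are not taken from the study in Figure 6; the sketch only shows how lowering a cutoff can recover a missed sample at the cost of more confirmations.

```python
def evaluate(samples, cutoff):
    """Compute sensitivity and specificity for a repeat-size cutoff.

    `samples` is a list of (estimated_repeat_size, truly_expanded) pairs.
    A sample is called positive when its estimated size is >= cutoff.
    """
    tp = sum(1 for size, expanded in samples if expanded and size >= cutoff)
    fn = sum(1 for size, expanded in samples if expanded and size < cutoff)
    tn = sum(1 for size, expanded in samples if not expanded and size < cutoff)
    fp = sum(1 for size, expanded in samples if not expanded and size >= cutoff)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "sent_for_confirmation": tp + fp,
    }


# Hypothetical data: five truly expanded samples and eight normal controls.
expanded_sizes = [58, 70, 95, 120, 200]
normal_sizes = [20, 23, 29, 30, 31, 40, 45, 56]
samples = [(s, True) for s in expanded_sizes] + [(s, False) for s in normal_sizes]

for cutoff in (60, 55):
    print(cutoff, evaluate(samples, cutoff))
```

In this toy data set, the lower cutoff catches the borderline expanded sample but also sends one normal control for confirmation testing.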
Can DRAGEN STR identify new loci?
To greatly improve the range of diseases covered and to support researchers seeking to solve undiagnosed diseases, the ExpansionHunter team generalized the algorithm to be able to identify repeat expansions across the genome. Using this new tool, called ExpansionHunter Denovo, classic STR conditions like Friedreich ataxia and fragile X were “rediscovered” (Figure 7). Overall, 41 out of 44 known expansions were confirmed as positive using this approach.
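One ingredient of a genome-wide search is recognizing reads made up of a repeated unit without knowing that unit in advance. The sketch below shows one simple way to do this, assuming perfect repeats with no sequencing errors; the function name and the `max_unit_len` parameter are hypothetical, and this is not the ExpansionHunter Denovo implementation.

```python
from typing import Optional


def smallest_repeat_unit(read: str, max_unit_len: int = 6) -> Optional[str]:
    """Return the shortest unit whose tandem copies reproduce the read, else None.

    A read has period k when every base equals the base k positions earlier.
    At least two full copies of the unit are required.
    """
    for k in range(1, max_unit_len + 1):
        if len(read) >= 2 * k and all(read[i] == read[i - k] for i in range(k, len(read))):
            return read[:k]
    return None


# Hypothetical reads: a GAA-rich read, a CGG-rich read, and a non-repetitive read.
for read in ("GAAGAAGAAGAAGA", "CGGCGGCGGCGG", "ACGTTGCATGCA"):
    print(read, smallest_repeat_unit(read))
```

A genome-wide tool would also need to tolerate mismatches and use read mates to place the repeat in the genome, which this sketch does not attempt.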
How can STR detection be implemented?
There are two main ways to run ExpansionHunter: as part of a DRAGEN workflow or independently as stand-alone software. Version 3.7.5 or later of the DRAGEN DNAseq pipeline (including the current version 3.10) includes the option to perform ExpansionHunter analysis (see online help for details). DRAGEN can be run using physical hardware (“on-prem”) or as part of cloud-based workflows on multiple platforms. The software is also available as a stand-alone package on GitHub (Table 2).
|DRAGEN Bio-IT Platform||On-prem server*||Custom designed computer hardware optimized for accuracy and speed of secondary genomic analysis (alignment and variant calling).||https://www.illumina.com/products/by-type/informatics-products/dragen-bio-it-platform.html|
|Emedgene||Cloud||A cloud-based platform for end-to-end clinical genomic analysis, including panels, exomes, and whole genomes. This includes DRAGEN secondary analysis, annotation, filtering and tertiary analysis workflows, a knowledge database, robust reporting tools, and artificial-intelligence-based variant prioritization.|
|BaseSpace Sequencing Hub||Cloud||A cloud-based bioinformatics platform designed for managing Illumina sequencing runs and analyses.|
|Illumina Connected Analytics||Cloud||A cloud-based bioinformatics platform designed for data management and analysis across projects and types.|
|TruSight Software Suite||Cloud||A cloud-based platform for end-to-end genomic analysis of exomes and whole genomes.||https://www.illumina.com/products/by-type/informatics-products/trusight-software-suite.html|
|Linux||Software||The original ExpansionHunter software package is available for research use and can be executed on your own server.|
*“On-prem” refers to physical “on premises” computer hardware that is installed in a server room/cabinet.
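For the stand-alone option, a run might be scripted roughly as follows. This is a sketch under stated assumptions: the command-line flags and the JSON layout (`LocusResults`, `Variants`, `Genotype`) reflect our understanding of recent stand-alone ExpansionHunter releases and should be checked against `ExpansionHunter --help` and the output of your installed version. The file paths, helper names, and sample JSON values are hypothetical.

```python
import json


def expansionhunter_command(reads, reference, catalog, output_prefix):
    """Build (but do not run) a stand-alone ExpansionHunter command line.

    Flag names are an assumption based on recent stand-alone releases.
    """
    return [
        "ExpansionHunter",
        "--reads", reads,
        "--reference", reference,
        "--variant-catalog", catalog,
        "--output-prefix", output_prefix,
    ]


def genotypes_from_json(text):
    """Extract repeat counts per variant from ExpansionHunter-style JSON text.

    Assumes each variant reports a genotype such as "19/45".
    """
    calls = {}
    for locus in json.loads(text).get("LocusResults", {}).values():
        for variant_id, variant in locus.get("Variants", {}).items():
            genotype = variant.get("Genotype")
            if genotype:
                calls[variant_id] = [int(n) for n in genotype.split("/")]
    return calls


# Hypothetical paths and a hypothetical, heavily trimmed output file.
command = expansionhunter_command("sample.bam", "genome.fa", "catalog.json", "sample_str")
example_output = """
{"LocusResults": {"HTT": {"Variants": {"HTT": {"Genotype": "19/45", "RepeatUnit": "CAG"}}}}}
"""
calls = genotypes_from_json(example_output)
print(" ".join(command))
print(calls)
```

The command list can be passed to `subprocess.run` once the paths point at real files.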
1. Karczewski, K.J., Francioli, L.C., Tiao, G. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020). https://doi.org/10.1038/s41586-020-2308-7
2. Martorell, L., Nascimento, M., Colome, R. et al. Four sisters compound heterozygotes for the pre- and full mutation in fragile X syndrome and a complete inactivation of X-functional chromosome: implications for genetic counseling. J Hum Genet 56, 87–90 (2011). https://doi.org/10.1038/jhg.2010.140
3. Loureiro, J.R., Oliveira, C.L., Sequeiros, J. et al. A repeat-primed PCR assay for pentanucleotide repeat alleles in spinocerebellar ataxia type 37. J Hum Genet 63, 981–987 (2018). https://doi.org/10.1038/s10038-018-0474-3
4. Dolzhenko, E., Deshpande, V., Schlesinger, F. et al. ExpansionHunter: a sequence-graph-based tool to analyze variation in short tandem repeat regions. Bioinformatics 35, 4754–4756 (2019). https://doi.org/10.1093/bioinformatics/btz431
5. Dolzhenko, E., Bennett, M.F., Richmond, P.A. et al. ExpansionHunter Denovo: a computational method for locating known and novel repeat expansions in short-read sequencing data. Genome Biol 21, 102 (2020). https://doi.org/10.1186/s13059-020-02017-z