Fully Featured Genome: Expanding the Hunt for Genomic Variation with DRAGEN STR

Samuel Strom, Carri-Lyn Mead, Dan Letchworth, Vitor Onuchic, Mitchell Bekritsky

What are STR and why do they matter?

Short tandem repeats (STR) are regions of the genome where simple sequences of DNA are copied back to back (Table 1). There are many STR regions in the human genome, most of which have no known function.

Person A: (CAT)×3: ...CATCATCAT...
Person B: (CAT)×1: ...CAT...

Table 1. A hypothetical example of an STR
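To make the idea in Table 1 concrete, here is a minimal Python sketch that counts how many times a motif is copied back to back in a sequence. The function name and the flanking bases around the repeat are illustrative assumptions, not part of any Illumina tool.

```python
def longest_tandem_run(sequence: str, motif: str) -> int:
    """Return the largest number of back-to-back copies of `motif` in `sequence`."""
    if not motif:
        raise ValueError("motif must not be empty")
    best = 0
    for start in range(len(sequence)):
        count = 0
        pos = start
        # Walk forward one motif at a time while the copies keep coming.
        while sequence.startswith(motif, pos):
            count += 1
            pos += len(motif)
        best = max(best, count)
    return best


# Hypothetical sequences: the repeat from Table 1 with made-up flanking bases.
person_a = "GGACATCATCATTGC"
person_b = "GGACATTGC"

print("Person A:", longest_tandem_run(person_a, "CAT"))
print("Person B:", longest_tandem_run(person_b, "CAT"))
```

Exact string matching like this only works when the whole repeat fits inside one sequence, which is the limitation that dedicated STR callers are designed to work around.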

Occasionally, STR will mutate in sperm or egg cells, leading to a child being born with an increased number of repeats ("expansion") or fewer repeats ("contraction") compared to their parents. This usually happens because polymerase can slip at these sites during DNA replication. Over time, these expansions and contractions have led to STR lengths being highly variable across human populations (Figure 1).
Allele Size Distribution
Figure 1. Example of a variable STR in human populations (DMPK gene). Healthy human subjects vary in CTG repeat number from 4 through 31. Affected individuals with myotonic dystrophy type 1 have >50 repeats.

Figure excerpted from the gnomAD v3 database, PMID 32461654 (ref. 1).

One of the first medical conditions found to be associated with STR variability was Huntington disease. To learn more about Marcy MacDonald, PhD, and the history of her team's groundbreaking discovery, check out this article from Nature Education. Individuals with Huntington disease have greater than 40 consecutive sets of three nucleotides ("trinucleotide repeat") C-A-G within a gene named after the condition (HTT). This repeat lies within the coding sequence of the gene and is translated into repeats of the amino acid glutamine. The increased number of consecutive glutamines in the resulting protein causes aggregation in neurons, which eventually leads to the clinical signs and symptoms of Huntington disease, including ataxia and neurological decline.
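The link between the CAG repeat and the glutamine tract can be sketched in a few lines of Python. This is a toy illustration under simplifying assumptions: the reading frame is taken to start at the first base, and the example sequence is invented rather than taken from HTT.

```python
# Both CAG and CAA encode glutamine (single-letter code Q).
GLUTAMINE_CODONS = {"CAG", "CAA"}


def glutamine_run_length(coding_dna: str) -> int:
    """Longest run of consecutive glutamine codons, reading in frame from base 0."""
    best = 0
    current = 0
    for i in range(0, len(coding_dna) - 2, 3):
        codon = coding_dna[i:i + 3]
        if codon in GLUTAMINE_CODONS:
            current += 1
            best = max(best, current)
        else:
            current = 0
    return best


# Hypothetical coding fragment: start codon, 42 CAG repeats, then a proline codon.
fragment = "ATG" + "CAG" * 42 + "CCG"
print(glutamine_run_length(fragment))
```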

In families with Huntington disease, which has an autosomal dominant inheritance pattern, it was noted that children of affected individuals often had earlier onset of symptoms and a more rapid course of neurodegeneration. STR analysis showed that the expanded repeats were expanding even further in the severely affected offspring, providing a mechanism for this tragic phenomenon, now called "anticipation."

One of the most insidious aspects of Huntington disease is that symptoms typically do not appear until after a person has already had children, and those children are at a 50% risk of inheriting an expanded STR. With the tools now available, it is technically possible to screen everyone for this disease much earlier. Unfortunately, there are no effective treatments yet, so it is not appropriate to recommend this kind of testing to the general population. Efforts are underway to pursue targeted gene therapy for Huntington disease. If a gene-based cure for Huntington disease can be developed, it could become justifiable to test everyone.

The discovery of the HTT repeat expansion inspired other groups to search for this type of variant in other conditions. There are now at least 56 different genes where STR have been associated with human disease (Figure 2), including fragile X syndrome (FMR1 gene), which is one of the most common forms of inherited intellectual disability and a top-tier condition recommended for carrier screening by the American College of Medical Genetics and Genomics.

Pathogenic short tandem repeats
Figure 2. First 12 rows from the "Pathogenic STR Table" from gnomAD. Some genes have multiple STR loci.

Figure excerpted from the gnomAD v3 database, PMID 32461654 (ref. 1).

How are STR typically analyzed?

The first effective methodology for evaluating STR was the Southern blot. While this method is highly sensitive, it is cumbersome to perform in the lab and it is difficult to assess the exact number of repeats (Figure 3).

fragile X syndrome
Figure 3. Southern blot for fragile X syndrome. The individuals in lanes III1 and III3 are females having one normal allele (2.8 kb) and one expanded allele (5.2 kb).

Truncated figure excerpted from PMID 21107340 (ref. 2). Original caption: "Southern blot analysis of FMR1 (fragile X mental retardation 1) gene. Sizes of normal unmethylated (2.8 kb), normal methylated (5.2 kb) and a control band (2.4 kb) are indicated."

To enable more accurate sizing and to scale up to analyzing dozens of samples at a time, PCR-based methods were developed. The second generation of PCR-based assays uses the repeat sequence as one primer, which ensures that very large repeat expansions do not fail to amplify. This method is called repeat-primed PCR (rpPCR, Figure 4). rpPCR is valuable as a gold standard to confirm findings in individual genes, and remains the most common tool used for STR analysis. Unfortunately, it is difficult to scale up to high-volume testing for more than one or two expansions per case. For conditions such as spinocerebellar ataxia, where at least a dozen different STR loci can cause the same clinical condition, this method becomes time- and resource-prohibitive.
Examples of repeat-primed PCR of an STR in DAB1
Figure 4. An example of repeat-primed PCR of an STR in DAB1, where each repeat unit is amplified, creating a stutter pattern. The peak farthest to the right side is the longest allele.

Truncated figure excerpted from PMID 29891931 (ref. 3). Original caption: "ATTTT RP-PCR to detect large pentanucleotide alleles in DAB1. a Schematic representation of the ATTTT RP-PCR primers that anneal with the repetitive ATTTT region, resulting in DNA amplification in normal and mutant alleles. b Electropherograms showing the fluorescent ATTTT RP-PCR analysis in control individuals from Table 1: C-75, C-88, C-91, C-95, and C-44; and in SCA37 affected individuals A-1 and A-9"

Can STR be analyzed using panel or exome data?

Unfortunately, the library preparation and target amplification or hybridization processes necessary for panel and exome sequencing by NGS remove the repetitive DNA from the assay. No amount of bioinformatics can rescue a signal that isn’t in the tube.

Where does DRAGEN STR come in?

In contrast to panel or exome sequencing, PCR-free whole-genome sequencing (pfWGS) retains the repetitive genomic DNA for sequencing. The challenge for researchers then becomes genotyping the repeat lengths when the most relevant expanded alleles often exceed the read length of short-read Illumina sequencing data. To address this issue, Illumina alumni Egor Dolzhenko, PhD, Michael Eberle, PhD, and colleagues developed ExpansionHunter. First, they created a custom set of references for important STR regions against which subject data can be compared. The algorithm then identifies informative sequence reads from the pfWGS data, such as reads in flanking regions and reads containing the repeat sequence, along with their paired-end mates. Combining the specially prepared references with these reads, the algorithm can readily identify non-expanded alleles and flag cases with potentially expanded ones.
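The notion of "informative reads" can be illustrated with a toy classifier. The sketch below is not ExpansionHunter's algorithm, which aligns reads to a sequence graph; it uses exact string matching, and the flank sequences, read sequences, and category names are assumptions chosen for illustration.

```python
def is_in_repeat(read: str, motif: str) -> bool:
    """True if the read consists only of the motif repeated, starting in any phase."""
    if len(read) < len(motif):
        return False
    for shift in range(len(motif)):
        rotated = motif[shift:] + motif[:shift]
        repeated = rotated * (len(read) // len(motif) + 1)
        if repeated.startswith(read):
            return True
    return False


def classify_read(read: str, motif: str, left_flank: str, right_flank: str) -> str:
    """Assign a read to one of four toy categories relative to a repeat locus."""
    has_left = left_flank in read
    has_right = right_flank in read
    if has_left and has_right:
        return "spanning"       # covers the whole repeat: allele size is directly countable
    if (has_left or has_right) and motif in read:
        return "flanking"       # anchored on one side: gives a lower bound on allele size
    if is_in_repeat(read, motif):
        return "in-repeat"      # entirely repeat: hints at an allele longer than the read
    return "uninformative"


# Hypothetical locus: made-up flanks around a CAG repeat.
LEFT, RIGHT, MOTIF = "ACGTAC", "TTGGCA", "CAG"
reads = [
    LEFT + MOTIF * 3 + RIGHT,
    LEFT + MOTIF * 5,
    "AGCAGCAGCAGC",
    "ACGTTTTAAACC",
]
for read in reads:
    print(classify_read(read, MOTIF, LEFT, RIGHT), read)
```

Reads that consist entirely of repeat sequence cannot be placed by their own sequence, which is why the mates of such reads matter for assigning them to a locus.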

Illumina’s DRAGEN Bio-IT platform includes STR genotyping using ExpansionHunter as an option for any samples with pfWGS data. This can be run on local hardware or in the cloud. If you are curious about implementing DRAGEN workflows with your data, please reach out!

For those interested in the math and bioinformatics, the primary literature is a great resource. A previous publication goes into even more detail.

Overview of ExpansionHunter
Figure 5. Overview of ExpansionHunter.

Figure excerpted from PMID 31134279 (ref. 4). Original caption: "Overview of ExpansionHunter. (a) A locus definition is read from the variant catalog file. (b) Sequence graph is constructed according to its specification in the variant catalog. (c) Relevant reads are extracted from the input binary alignment/map file. (d) Reads are aligned to the graph. (e) Alignments are pieced together to genotype each variant"
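Steps (a) and (b) of the caption can be sketched as reading a locus definition and breaking it into the pieces a sequence graph would be built from. The JSON field names and the HTT structure string below reflect our understanding of the ExpansionHunter variant catalog format and should be treated as assumptions; check the catalog shipped with your version. The parser is a simplification that handles only the `(MOTIF)*` form shown here.

```python
import json
import re

# Illustrative catalog entry; real entries carry additional fields such as coordinates.
catalog_entry = json.loads("""
{
  "LocusId": "HTT",
  "LocusStructure": "(CAG)*CAACAG(CCG)*"
}
""")

# A token is either a starred repeat unit like (CAG)* or a fixed stretch of bases.
TOKEN = re.compile(r"\(([ACGT]+)\)\*|([ACGT]+)")


def parse_locus_structure(structure: str) -> list:
    """Split a locus structure string into ("repeat", motif) and ("fixed", bases) nodes."""
    nodes = []
    pos = 0
    for match in TOKEN.finditer(structure):
        if match.start() != pos:
            raise ValueError(f"unsupported syntax at position {pos}")
        if match.group(1):
            nodes.append(("repeat", match.group(1)))
        else:
            nodes.append(("fixed", match.group(2)))
        pos = match.end()
    if pos != len(structure):
        raise ValueError(f"unsupported syntax at position {pos}")
    return nodes


nodes = parse_locus_structure(catalog_entry["LocusStructure"])
for kind, bases in nodes:
    print(kind, bases)
```

Each "repeat" node corresponds to a loop in the graph that a read can traverse any number of times, while "fixed" nodes must be traversed exactly once.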

How does DRAGEN STR perform?

When challenged with a series of positive and negative data sets for multiple STR expansion conditions, ExpansionHunter exhibited excellent performance (Figure 6). All but one of the positive controls tested positive, demonstrating extremely high sensitivity. The specificity (percentage of true negatives testing negative) was also very high, though some normal controls tested positive. Taken together, these findings strongly support a “screen and confirm” approach, where pfWGS is performed on all participants and follow-up rpPCR is used to confirm potential expansions in genes where participants are above a laboratory-validated cutoff value.
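For readers who want the definitions behind these performance terms, the sketch below computes the standard screening metrics from a 2×2 table. The counts are invented for illustration and are not the results of the study shown in Figure 6.

```python
def screening_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Standard screening metrics from true/false positive and negative counts."""
    return {
        "sensitivity": tp / (tp + fn),  # affected samples that test positive
        "specificity": tn / (tn + fp),  # unaffected samples that test negative
        "ppv": tp / (tp + fp),          # positive calls that are truly affected
        "npv": tn / (tn + fn),          # negative calls that are truly unaffected
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
metrics = screening_metrics(tp=49, fn=1, tn=95, fp=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In a screen-and-confirm workflow, false positives cost an extra rpPCR run, while false negatives are missed cases, so sensitivity is usually the metric to protect.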

The lone false negative in the study from Figure 6 is worth further discussion. This sample has an FMR1 pre-mutation, but was considered normal using the predefined cutoff. This suggests that a clinical laboratory may want to consider using a slightly lower cutoff and accepting a modestly higher rpPCR confirmation rate to ensure full sensitivity for this expansion type. This is a classic example of the give-and-take nature of balancing sensitivity against specificity.
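The cutoff trade-off can be shown with a small sketch. The sample names and estimated repeat sizes below are hypothetical, and the two cutoffs are examples rather than recommended values; any real cutoff would need laboratory validation.

```python
def flag_for_confirmation(estimated_repeats: dict, cutoff: int) -> list:
    """Return the samples whose estimated repeat size meets or exceeds the cutoff."""
    return sorted(
        sample for sample, size in estimated_repeats.items() if size >= cutoff
    )


# Hypothetical FMR1 CGG repeat estimates from short-read data.
estimates = {"sample_1": 30, "sample_2": 52, "sample_3": 58, "sample_4": 210}

strict = flag_for_confirmation(estimates, cutoff=55)
lenient = flag_for_confirmation(estimates, cutoff=50)

print("cutoff 55 ->", strict)
print("cutoff 50 ->", lenient)
```

Lowering the cutoff sends one more sample to rpPCR confirmation, which is the modest extra cost of catching a borderline premutation whose size was slightly underestimated.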

ExpansionHunter performance
Figure 6. ExpansionHunter performance at known medically relevant loci.

Truncated figure excerpted from PMID 31134279 (ref. 4). Original caption (as Supplemental Figure S3): "Analysis of Coriell samples harboring known repeat expansions. The blue, orange, and red rectangles define the expected size ranges for normal, premutation, and full expansion respectively for the corresponding repeat. Each dot corresponds to the size of the longest allele and its color is set according to the experimentally-determined status. GangSTR was run only on STRs for which predefined off-target loci were provided. GangSTR values were calculated using their 'genome-wide' mode for all of the genes except FMR1 which was analyzed using 'targeted' mode which performed much better for this repeat. The repeat sizes were capped at 600bp."

Can DRAGEN STR identify new loci?

To greatly improve the range of diseases covered and to support researchers seeking to solve undiagnosed diseases, the ExpansionHunter team generalized the algorithm to be able to identify repeat expansions across the genome. Using this new tool, called ExpansionHunter Denovo, classic STR conditions like Friedreich ataxia and fragile X were “rediscovered” (Figure 7). Overall, 41 out of 44 known expansions were confirmed as positive using this approach. 

proof of concept ExpansionHunter Denovo
Figure 7. As a proof of concept, ExpansionHunter Denovo was used to retrospectively re-identify classic STR disorders.

Figure excerpted from PMID 32345345 (ref. 5). Original caption: "Genome-wide analysis of anchored IRRs comparing cases with known pathogenic expansions in DMPK, FXN, FMR1, and HTT genes (top to bottom) to 150 controls"
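The case-versus-controls comparison in Figure 7 can be approximated with a simple outlier score. This is a simplified stand-in, not the statistical procedure used by ExpansionHunter Denovo, and the anchored in-repeat read (IRR) counts are invented for illustration.

```python
from statistics import mean, stdev


def outlier_z_score(case_count: float, control_counts: list) -> float:
    """How many control standard deviations the case sits above the control mean."""
    mu = mean(control_counts)
    sd = stdev(control_counts)
    if sd == 0:
        # All controls identical: any excess in the case is treated as extreme.
        return float("inf") if case_count > mu else 0.0
    return (case_count - mu) / sd


# Hypothetical normalized anchored IRR counts at one locus.
controls = [0, 1, 0, 2, 1, 0, 1, 1]
expanded_case = 25
typical_sample = 1

print(f"expanded case z = {outlier_z_score(expanded_case, controls):.1f}")
print(f"typical sample z = {outlier_z_score(typical_sample, controls):.1f}")
```

Scanning every locus in the genome for scores like this is what allows expansions to surface without a predefined list of target repeats.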

How can STR detection be implemented?

The two major ways you can run ExpansionHunter are as part of a DRAGEN workflow or independently as stand-alone software. Versions of the DRAGEN DNAseq pipeline 3.7.5 or later (including the current version 3.10) include the option to perform ExpansionHunter analysis (see online help for details). DRAGEN can be run using physical hardware (“on-prem”) or as part of cloud-based workflows on multiple platforms. The software is also available as a stand-alone package on GitHub (Table 2).

Platform     Type     Description     Link
DRAGEN Bio-IT Platform     On-prem server*     Custom-designed computer hardware optimized for accuracy and speed of secondary genomic analysis (alignment and variant calling).     https://www.illumina.com/products/by-type/informatics-products/dragen-bio-it-platform.html
Emedgene     Cloud     A cloud-based platform for end-to-end clinical genomic analysis, including panels, exomes, and whole genomes. This includes DRAGEN secondary analysis, annotation, filtering and tertiary analysis workflows, a knowledge database, robust reporting tools, and artificial-intelligence-based variant prioritization.
BaseSpace Sequencing Hub     Cloud     A cloud-based bioinformatics platform designed for managing Illumina sequencing runs and analyses.
Illumina Connected Analytics     Cloud     A cloud-based bioinformatics platform designed for data management and analysis across projects and types.
TruSight Software Suite     Cloud     A cloud-based platform for end-to-end genomic analysis of exomes and whole genomes.     https://www.illumina.com/products/by-type/informatics-products/trusight-software-suite.html
Linux     Software     The original ExpansionHunter software package is available for research use and can be executed on your own server.

*“On-prem” refers to physical “on premises” computer hardware that is installed in a server room/cabinet.

Table 2. Options for running ExpansionHunter
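For the stand-alone option, a run boils down to pointing the program at aligned reads, a reference genome, and a variant catalog. The sketch below only assembles the command line and does not execute it. The flag names follow our understanding of the ExpansionHunter command-line interface and should be checked against the documentation for your installed version; all file paths are hypothetical.

```python
import shlex


def expansionhunter_command(reads: str, reference: str,
                            variant_catalog: str, output_prefix: str) -> list:
    """Assemble (but do not run) an ExpansionHunter command as an argument list."""
    return [
        "ExpansionHunter",
        "--reads", reads,                      # aligned pfWGS reads (BAM or CRAM)
        "--reference", reference,              # reference genome FASTA
        "--variant-catalog", variant_catalog,  # JSON file describing the STR loci
        "--output-prefix", output_prefix,      # prefix for the output files
    ]


# Hypothetical file names.
command = expansionhunter_command(
    reads="sample.bam",
    reference="reference.fa",
    variant_catalog="variant_catalog.json",
    output_prefix="sample_str",
)
print(shlex.join(command))
```

Passing the list to `subprocess.run(command, check=True)` would launch the analysis on a machine where the program is installed.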


1. Karczewski, K.J., Francioli, L.C., Tiao, G. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020). https://doi.org/10.1038/s41586-020-2308-7

2. Martorell, L., Nascimento, M., Colome, R. et al. Four sisters compound heterozygotes for the pre- and full mutation in fragile X syndrome and a complete inactivation of X-functional chromosome: implications for genetic counseling. J Hum Genet 56, 87–90 (2011). https://doi.org/10.1038/jhg.2010.140

3. Loureiro, J.R., Oliveira, C.L., Sequeiros, J. et al. A repeat-primed PCR assay for pentanucleotide repeat alleles in spinocerebellar ataxia type 37. J Hum Genet 63, 981–987 (2018). https://doi.org/10.1038/s10038-018-0474-3

4. Dolzhenko, E., Deshpande, V., Schlesinger, F. et al. ExpansionHunter: a sequence-graph-based tool to analyze variation in short tandem repeat regions. Bioinformatics 35, 4754–4756 (2019). https://doi.org/10.1093/bioinformatics/btz431

5. Dolzhenko, E., Bennett, M.F., Richmond, P.A. et al. ExpansionHunter Denovo: a computational method for locating known and novel repeat expansions in short-read sequencing data. Genome Biol 21, 102 (2020). https://doi.org/10.1186/s13059-020-02017-z