June 1, 2023
Each person’s genetic code harbors millions of variants that differ from one individual to the next and account for differences in health and disease risk. The more human genomes are sequenced, the more data researchers have for comparing variants and predicting which are most likely to cause disease. Despite the collective efforts of scientists and clinicians worldwide, the function of the vast majority of these variants remains unknown.
And genetic risk prediction has suffered from an ethnicity bias. Seventy-eight percent of the data in the Genome-Wide Association Study Catalog comes from people of European ancestry; when polygenic risk scores are trained primarily on European genomic data, they perform unevenly when applied to other ethnic groups.
Sequencing a greater diversity of people is part of the solution, but even that can only tell us so much. “The main issue is that humans are pretty bottlenecked,” explains Kyle Farh, vice president of Artificial Intelligence at Illumina. “Even though there are 8 billion of us, our genetic diversity still looks like the original population of 10,000 common ancestors we’re all descended from. There just isn’t enough information to glean from the human species. It became clear several years ago that, to really understand the human genome, the data contained in human genome sequencing was not enough.”
Homo sapiens DNA records the evolutionary history of a few hundred thousand years. But to avoid bias and learn even more about ourselves, scientists are broadening the search by tens of millions of years to study our more distant family, the other primates.
DNA as living history
Evolution is the world’s longest-running experiment. Generation by generation, nature has been testing genes through random mutation—variants that harm an animal’s fitness are quickly removed from the gene pool, and ones that are neutral or beneficial survive to be passed on. “The results of these experiments are documented in every species’ genome,” Farh says. “They’re right there. It’s a living document.”
The taxonomic order “Primates” comprises over 500 species, encompassing apes, monkeys, prosimians like lemurs and lorises, and us. We’re all descended from the same ancestors, and despite our wildly varied forms, living primates share more than 90% of their DNA with one another. Mutations that occur in chimpanzees or bonobos can also occur in humans, and work from Illumina scientists shows that if a variant is tolerated by natural selection in another primate, it’s 99% likely not to cause disease in us. This isn’t true for more distantly related mammals: a variation that’s harmless in mice or dogs, for instance, may be pathogenic in gorillas or humans.
For the millions of years that primate species have been evolving in parallel, mutations that cause disease have been eliminated by natural selection. So by sequencing modern primates, we can improve our knowledge of which variants don’t cause disease.
Scientists at Illumina, in collaboration with those from 24 countries, just published the results of a vast study of primate genomes in four papers in the journal Science. The study sequenced over 800 individuals from 233 species of nonhuman primates, representing all 16 families and over 86% of living genera. But the sequencing was just the first step: Once they had all this data, they needed a way to interpret it. So they developed PrimateAI-3D.
An algorithm trained by evolution
The large language model ChatGPT has garnered much attention for its ability to generate humanlike responses to any prompt. Its artificial intelligence is trained on a massive data set of existing writing, so it can accurately predict the next sentences that would sound most natural based on the conversation up to that point.
PrimateAI-3D is an algorithm built on deep-learning language architectures analogous to those used in ChatGPT, but designed to model genomic rather than linguistic sequences. By presenting it with variants that are ruled out for disease in our macaque and orangutan cousins, its developers have effectively leveraged natural selection to train its parameters. The neural network learns where benign variants are represented in a gene and, by process of elimination, which regions are likely to cause disease if mutated. In this way, it learned how to accurately predict pathogenic variants in humans better than any human could.
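The process-of-elimination idea can be illustrated with a toy sketch. This is not PrimateAI-3D, which is a deep neural network; it is a simple windowed count, swapped in here only to show the reasoning. The function name, the window size, and the variant positions are all hypothetical. The sketch assumes we have a list of protein positions where other primates carry a tolerated variant, and it flags stretches of the protein where no such variant has been seen.

```python
from collections import Counter


def benign_depleted_windows(protein_length, benign_positions, window=10, max_benign=0):
    """Flag windows of a protein that contain few tolerated (benign) variants.

    Hypothetical helper for illustration only. Positions are 1-based.
    Returns a list of (start, end) tuples, inclusive, for every window whose
    count of benign variants is at most `max_benign`.
    """
    counts = Counter(benign_positions)
    flagged = []
    for start in range(1, protein_length + 1, window):
        end = min(start + window - 1, protein_length)
        # Count tolerated primate variants that fall inside this window.
        n_benign = sum(counts[pos] for pos in range(start, end + 1))
        if n_benign <= max_benign:
            flagged.append((start, end))
    return flagged


# Made-up positions of variants tolerated in other primates (assumption).
toy_benign = [2, 5, 7, 13, 18, 34, 36, 39]

# Windows with no tolerated variants are candidates for being intolerant
# to mutation, and therefore more likely to matter for disease.
print(benign_depleted_windows(40, toy_benign))  # [(21, 30)]
```

In this made-up example, positions 21 through 30 carry no tolerated variants, so that stretch is the one flagged. A real model has to weigh much more than raw counts, but the direction of the inference is the same: where evolution has allowed change, a variant is probably harmless, and where it has not, a variant deserves a closer look.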
The study published in Science compared PrimateAI-3D against 15 other machine-learning methods on four patient cohorts: one for neurodevelopmental disorders, one for autism spectrum disorders, one for congenital heart disease, and the UK Biobank. The first three cohorts are some of the largest studies to date that sequenced both an affected child and their unaffected parents; by contrast, the half a million genomes in the UK Biobank are mostly from healthy members of the general population. The study also evaluated the algorithm on the National Institutes of Health’s ClinVar database and other datasets.
Across six different clinical benchmarks, PrimateAI-3D outperformed all other existing methods by a wide margin. These findings will help researchers prioritize a small handful of variants that are most likely to affect a person’s health.
Furthermore, PrimateAI-3D demonstrated impressive improvements in predicting people at increased risk for common diseases in the UK Biobank cohort, especially across non-European ethnic groups. “What we find is that 97% of otherwise healthy people in the general population carried highly actionable variants for clinically relevant conditions,” says Farh, one of the study’s principal authors. “Up to now we’ve learned that you need genome sequencing if you have a rare disease or cancer—but actually it looks like every healthy person in the population has highly impactful variants in our genomes that are clinically relevant and are important to be informed about.”
Giving back to the gibbon and baboon
On top of the benefit to human health, these efforts could also be instrumental for primate conservation. “We’re in a hurry to collect this data because the majority of these species are on a fast track to extinction,” Farh says. The genetic diversity recorded in an animal’s DNA not only tells us how many individuals remain in that species’ population, but also tells the story of that population’s size over time, back through the generations. “That tells us how quickly the species is declining, and how much time they have left. That’s in their genomes.”
PrimateAI-3D’s developers found that its performance scales directly with the size of the dataset used to train it, so the more primate species they can sequence, the better the tool can become. The monkeys and apes can help us, and we can help them. “I think we’re only at the beginning,” Farh says. “There’s a tremendous amount that can be learned here. And the idea that you can learn more about our own species from other species is, I think, deeply romantic.”
PrimateAI-3D will be made broadly available to the genomics community in an upcoming release of Illumina Connected Software products.