Single-Cell Analysis is Advancing Insights in Developmental Biology
Cole Trapnell, PhD is the principal developer of TopHat1, Cufflinks2, Monocle3, and other bioinformatics tools that have entered common use among computational biologists. He got his start in bioinformatics as a graduate student at the University of Maryland, where he earned his PhD in computer science. He originally wasn’t planning a career in biology, but his interest was piqued by colleagues who were using Solexa sequencers and just starting to see next-generation sequencing (NGS) data. The problem of mapping short reads back to the genome was, he realized, a high-throughput computational problem. As a post-doc in John Rinn’s lab in the Harvard University Stem Cell and Regenerative Biology department, he pioneered methods for analyzing cell differentiation using single-cell transcriptome sequencing.
Now in the Department of Genome Sciences at the University of Washington, Dr. Trapnell specializes in transcriptome analysis and software for single-cell experiments, working with Illumina NGS data. His lab focuses on single-cell genomics technology. His goal is to determine how the program of development is encoded in the genome by identifying the genetic circuits that transform a cell from one type into another. Identifying these circuits is critical to understanding human health and disease.
To do this, Dr. Trapnell relies on the NextSeq 550, the NovaSeq 6000, and an interdisciplinary team of scientists. “Most people in the lab are interdisciplinary,” he says, “with either computer scientists getting into benchwork, or hematologists and oncologists who are learning about computational techniques.”
Recently, Dr. Trapnell shared with us his views on the importance of understanding cell lineage, his lab’s experience with single-cell RNA sequencing (scRNA-Seq), and his application of combinatorial indexing, a way of analyzing the genome of individual cells without isolating them. He also spoke about his belief in the power of collaboration and how this has guided his investigative philosophy.
Cole Trapnell, PhD is an assistant professor in the Department of Genome Sciences at the University of Washington.
Q: How do you approach the study of development?
Cole Trapnell (CT): We want to understand the architecture of the genetic circuits cells use to transform from one type into another. That most prominently happens in development, but it also happens in disease. We’re interested in the program of development and how it’s encoded in the genome. That’s a very big question, and not all that concrete. It’s way too big for even a bunch of labs working together to answer.
My lab’s strategy for making progress toward answering this question, learning how the program of development is encoded in DNA, is to build technologies and software and put them in the hands of many other scientists around the world.
We develop advanced technologies around single-cell genomics. It could be experimental, like a new assay or a new protocol, or it could be computational, like an algorithm that extracts new insights from the kind of experiments we’re already doing. Then we write a paper where we showcase the technology and pair it with an application that is hard to do without the technological advance. We devote about 25% of our efforts to collaborating with people who have relevant questions about development or disease and work with them to apply our technology to their biology problem.
Q: All the software you've developed is open source. Why?
CT: People will happily work on a piece of software to solve a science problem just because it’s fun. If you try to charge money for a software tool that was meant for scientists, then somebody else will just do it for free. I want to help people. Even charging a nominal fee will dramatically reduce the size of the user base. The reason TopHat became widely used was because it was the first thing that solved the problem of mapping shotgun cDNA sequence reads to the genome. It took a long time for something better to come along.
Q: What is the importance of understanding cell lineage?
CT: Understanding development is a fundamental goal in biology and part of the value is in satisfying our immense curiosity about it. One of the examples that I find captivating is C. elegans. Every adult worm has the same number of cells of the same cell type. It’s a program that runs like clockwork. Every animal you get is the same. We don’t work like that. I have a different number of cells in my body than you, and they’re different types, and yet you and I probably look about the same. Understanding how the developmental program is reproducible, even in the face of mammalian variability in the number of cells that are produced, is fundamentally interesting.
In terms of practical applications, many pediatric diseases, for instance, have a developmental component. Particularly for rare genetic diseases, there’s not much we can do. And yet, we’re starting to see success in areas where there’s a genetic component or a driver mutation that’s causing the disease. If you know exactly how the genetic circuit that controls tissue development works in healthy people, you can make a prediction about how it’s broken in people with disease and intervene.
Another application is in organ transplants. There are a lot of diseases we could cure with an infinite supply of transplantable organs. If we want to make organs, we need to understand how they get made in development because we will want to make them consistently, reproducibly, and robustly.
Q: What is the unique value of scRNA-Seq?
CT: Single-cell RNA sequencing allows you to use a DNA sequencer as a microscope to determine which genes are active transcriptionally in individual cells. It’s a way of profiling the molecular contents of individual cells and, in practice, people are interested in doing that for many cells in one experiment.
The most basic use for scRNA-Seq is to figure out what type of cell you’re looking at and how many cells you have. You can also discover new cell types if you have some that do not fit the expected classification. Another application is to see how the cells respond to a perturbation, such as drug exposure, environmental stimulus, introduction of disease, or gene editing. Usually what happens is that some genes change in response. Measuring which genes change helps you figure out how the perturbation is working mechanistically, so you can make some guesses about the molecular mechanism in the cell. If you’re trying to understand, for example, how a compound works to kill cancer cells, looking at gene expression can be very helpful.
Q: What is the role of scRNA-Seq in the study of development?
CT: The problem that the genome has to solve is that it’s given one cell and it needs to program the timing of cell divisions to make a whole animal. Cells have to proliferate at the right time and place to develop into limb, brain, heart, liver, and so on. They all use different genes. They all make different proteins. They all perform different tasks. And they all work together to function in life. When a cell divides into two cells, one or both of those cells will change what it’s doing and become a new type of cell. The timing of those fate decisions is encoded in the genome. If you do an scRNA-Seq experiment on a developing animal, you’ll capture individual cells that are at different points in the process of making fate decisions.
“Pseudotime” is the concept we use to organize the data into a picture that represents the sequence of fate decisions that are being made throughout development. With enough timepoints, you could assemble a comprehensive picture of how the developmental program is working from one cell all the way to the adult. By virtue of scRNA-Seq, you have a measurement of the transcription of every gene. You can make some inferences about which genes are active at which points in development, in which types of cells, and guess which genes are involved in the decision-making process at various stages. You can identify the genes that cause a developing cell in the pancreas to become an insulin-secreting cell as opposed to a glucagon-secreting cell. That’s a therapeutically critical fate decision to understand.
Q: What are some of the challenges involved in scRNA-Seq?
CT: Single-cell data sets are enormous. You have tens of thousands of cells. My lab just published a paper where we studied more than half a million cells for a chemical biology perturbation experiment. You might run out of RAM to do your analysis. Some of that can be solved with software, but that means that bioinformatics people have to rewrite all the code to deal with the giant data sets.
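To make the memory problem concrete, here is a back-of-the-envelope sketch in Python. Only the half-million cell count comes from the interview; the gene count (20,000), the 8-byte value size, the 4-byte index size, and the 5% nonzero fraction are illustrative assumptions.

```python
def dense_bytes(n_cells, n_genes, bytes_per_value=8):
    """Memory for a dense cells-by-genes matrix."""
    return n_cells * n_genes * bytes_per_value


def sparse_csr_bytes(n_cells, n_genes, nonzero_fraction,
                     bytes_per_value=8, bytes_per_index=4):
    """Memory for the same matrix in compressed sparse row (CSR) layout.

    CSR stores one value and one column index per nonzero entry,
    plus one row pointer per row (and one extra at the end).
    """
    nonzeros = round(n_cells * n_genes * nonzero_fraction)
    per_entry = bytes_per_value + bytes_per_index
    return nonzeros * per_entry + (n_cells + 1) * bytes_per_index


# Assumed experiment size: 500,000 cells (from the text), 20,000 genes (assumption).
N_CELLS = 500_000
N_GENES = 20_000
NONZERO_FRACTION = 0.05  # assumption: 5% of entries are nonzero

dense = dense_bytes(N_CELLS, N_GENES)
sparse = sparse_csr_bytes(N_CELLS, N_GENES, NONZERO_FRACTION)
print(f"dense:  {dense / 1e9:.1f} GB")
print(f"sparse: {sparse / 1e9:.1f} GB")
```

Under these assumptions the dense matrix needs about 80 GB, while the sparse layout needs about 6 GB, which is one reason single-cell tools tend to store counts in sparse form.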
Another challenge is what we call sparsity. In this context, it means that you have a cell expressing five copies of a single gene and you want to detect that. You want to know that there are five copies, but the scRNA-Seq protocols don’t capture every mRNA in the cell. They capture a fraction and you hope you capture a large enough fraction that you can tell how your gene compares in expression with some other gene. If you don’t capture a large enough fraction, and there are just five copies, you might happen to catch no copies of the gene in that cell. That would mean you think the gene is turned off when it’s not really off. It’s just that you didn’t detect it. Absence of evidence is not evidence of absence. There’s been a lot of discussion and a lot of work on what is the best strategy to deal with sparsity.
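The dropout effect described above can be sketched with a simple binomial model. This assumes each mRNA copy is captured independently with the same probability; the 10% capture rate is a hypothetical value chosen for illustration, not a measured property of any protocol.

```python
import random


def prob_no_capture(copies, capture_rate):
    """Probability that none of `copies` mRNA molecules is captured."""
    return (1.0 - capture_rate) ** copies


def simulate_dropout(copies, capture_rate, n_cells, seed=0):
    """Fraction of simulated cells in which the gene is not detected at all."""
    rng = random.Random(seed)
    undetected = 0
    for _ in range(n_cells):
        # Each copy is an independent capture attempt.
        captured = sum(rng.random() < capture_rate for _ in range(copies))
        if captured == 0:
            undetected += 1
    return undetected / n_cells


COPIES = 5           # from the example in the text
CAPTURE_RATE = 0.10  # hypothetical capture efficiency

print("analytic :", prob_no_capture(COPIES, CAPTURE_RATE))
print("simulated:", simulate_dropout(COPIES, CAPTURE_RATE, n_cells=20_000))
```

With five copies and a 10% capture rate, this model says the gene goes undetected in about 59% of the cells that actually express it, which is why a zero count cannot be read as "off".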
A third challenge is classification. It has a biology component and a bioinformatics component. Usually, the very first thing you want to do is figure out what types of cells and how many of each you have in your data set. You can tell by looking at that whether your experiment worked. The biology component happens when you prepare tissue. You have to create a suspension of cells. There are a lot of different ways of doing that. Some of them might chew up certain cell types, leave others intact, and leave others not fully dissociated. Then you do your sequencing experiment and you find out you’re missing your favorite neuron or you’re missing fibroblasts. That’s bad if you’re studying fibrosis.
We addressed the bioinformatics component with Garnett4 software. In the fibroblast example, there’s not one perfect gene that’s expressed in all types of fibroblasts and nowhere else. There’s a grey area. You find cells that are expressing four out of five genes that you would expect to see in fibroblasts, so they might be fibroblasts, but they might be something else. People would make diagrams where they cluster the cells, where each cluster was a cell type. This was problematic for three reasons. One, it was very slow and laborious. Two, since it isn’t systematic, if you change the clustering criteria, you have to go back and redo it. Three, if you cluster a data set and then apply the clustering algorithm to one of the clusters, that one cluster would split into three or four or five clusters. Do you have one cell type or five cell types? The assumption you’re making about how cell type is defined by the transcriptome and clustering is not really correct.
We want to allow a cell biologist with deep knowledge of the system to write down the genes that they expect to be expressed in each cell type ahead of time and apply it systematically to the data set, and then score each cell according to those expectations. There’s a whole lot of extra machine learning that goes into making it work well, but the result is Garnett. Garnett is a classifier that we hope will automate the process of counting cells according to type.
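The idea of writing down expected marker genes and scoring each cell against them can be sketched as follows. This is a deliberately naive toy, not Garnett's algorithm or API: the marker lists are illustrative rather than a curated reference, the `classify_cell` helper and the 0.6 threshold are hypothetical, and the machine learning that Garnett adds is left out entirely.

```python
# Illustrative marker lists written down ahead of time (assumption, not curated).
MARKERS = {
    "fibroblast": ["COL1A1", "COL1A2", "DCN", "LUM", "PDGFRA"],
    "cardiomyocyte": ["TNNT2", "MYH6", "MYH7", "ACTC1", "TTN"],
}


def marker_fraction(counts, genes):
    """Fraction of a cell type's marker genes detected (count > 0) in one cell."""
    detected = sum(1 for gene in genes if counts.get(gene, 0) > 0)
    return detected / len(genes)


def classify_cell(counts, markers, min_fraction=0.6):
    """Assign the best-scoring cell type, or "Unknown" if the call is weak or tied."""
    scores = sorted(
        ((marker_fraction(counts, genes), cell_type)
         for cell_type, genes in markers.items()),
        reverse=True,
    )
    best_score, best_type = scores[0]
    tied = len(scores) > 1 and scores[1][0] == best_score
    if best_score < min_fraction or tied:
        return "Unknown"
    return best_type


# A cell expressing four of the five fibroblast markers, as in the grey-area example.
cell = {"COL1A1": 12, "COL1A2": 7, "DCN": 3, "LUM": 1}
print(classify_cell(cell, MARKERS))

# A cell with weak, conflicting evidence is left unlabeled.
ambiguous = {"COL1A1": 2, "TNNT2": 1}
print(classify_cell(ambiguous, MARKERS))
```

Because the expectations are written down once and applied the same way to every cell, the labels do not shift when clustering parameters change, which is the systematic behavior the answer above is asking for.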
Q: What is trajectory analysis?
CT: When cells transition from one type to another, there’s a continuum in terms of which genes are expressed. The cells are not going to split into two discrete groups. Some genes turn on or off before others. Trajectory analysis tries to organize the cells in order of how far they are through the process of transitioning. It’s important to know that because the genes that turn on at the beginning are important early in the decision-making process and the genes that come on later might not be as important in making the decision. In the case of cardiomyocytes, that might be important for doing things that cardiomyocytes do, like beating, but maybe they’re not important for making the decision to become a cardiomyocyte.
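One simple way to picture ordering cells by progress is to connect each cell to its most similar neighbors and measure how far each cell is from a chosen starting cell along that graph. The sketch below does this with a k-nearest-neighbor graph and shortest paths. It is an illustration of the general idea only, not Monocle's algorithm; the two-gene expression values, the choice of root cell, and the function names are all made up for the example.

```python
import heapq
import math


def knn_graph(cells, k):
    """Connect each cell to its k nearest neighbors (edges are symmetric)."""
    graph = {i: {} for i in range(len(cells))}
    for i, ci in enumerate(cells):
        neighbors = sorted(
            (math.dist(ci, cj), j) for j, cj in enumerate(cells) if j != i
        )
        for distance, j in neighbors[:k]:
            graph[i][j] = distance
            graph[j][i] = distance
    return graph


def pseudotime(cells, root, k=2):
    """Shortest-path distance from the root cell over the neighbor graph (Dijkstra)."""
    graph = knn_graph(cells, k)
    best = {root: 0.0}
    heap = [(0.0, root)]
    while heap:
        d, i = heapq.heappop(heap)
        if d > best.get(i, math.inf):
            continue  # stale heap entry
        for j, weight in graph[i].items():
            candidate = d + weight
            if candidate < best.get(j, math.inf):
                best[j] = candidate
                heapq.heappush(heap, (candidate, j))
    return [best.get(i, math.inf) for i in range(len(cells))]


# Toy cells as (early_gene, late_gene) expression, listed in scrambled order.
cells = [(5.0, 0.0), (1.0, 4.0), (4.0, 1.0), (0.0, 5.0), (2.5, 2.5)]

times = pseudotime(cells, root=0)  # cell 0 is assumed to be the starting state
order = sorted(range(len(cells)), key=lambda i: times[i])
print(order)
```

Sorting by the resulting distances recovers the order in which the early gene switches off and the late gene switches on, even though the cells were not listed in that order.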
Q: What is Monocle?
CT: Monocle is a software tool and also an active research project. It introduced the concept of trajectory analysis with scRNA-Seq. There have been three major versions of Monocle. The early version was able to process simple experiments with only a few hundred cells. Over the past five years, my lab has released better versions of Monocle with machine learning to organize cells according to their genes. It’s an open source program written in R and anyone can download it for free.
The second version focused on larger data sets and trying to identify the fate decision points in trajectories, where some cells go one way and some go another. The third version does the same things, but at the scale and complexity we needed to do mouse embryo experiments. In that study, there were hundreds of cell types differentiating at once, and there were some special problems that needed to be solved.
Q: Why was SCI-Seq considered a breakthrough?
CT: Single-cell combinatorial indexing and sequencing5, or SCI-Seq, is a scheme for doing single-cell genomics. You can measure RNA-Seq, ATAC-Seq4, which is an epigenetic assay that measures how accessible chromosomal DNA is to DNA-binding proteins, and other things with it. Darren Cusanovich and Risa Daza, a post-doc and staff scientist in Jay Shendure’s lab, respectively, were the first people to devise a combinatorial indexing–based single-cell protocol. They figured out that you could do single-cell genomics without actually physically isolating individual cells.
Conventionally, what people had been doing was putting one cell from a suspension into one well of a 96-well plate, putting another cell in the next well, and so on, and then making a library in each well. That’s fine, but it’s really laborious and it doesn’t scale very well.
Combinatorial indexing is very different. You populate each well with many cells, poke holes in them, and do the first step of library construction inside the cells. In RNA-Seq, that first step is reverse transcription. Then you label the product with a sequence that corresponds to the well in which the reaction is being performed. The cells are still intact, and you pool them together and add them to a new 96-well plate. In the case of SCI-Seq, you label them again at the PCR stage. That means that every RNA-Seq fragment that you put on the sequencer is now labeled twice, once from the first well and once from the second well, so you’ve got 96 times 96 possible pairs. If you only pushed 1000 cells through the workflow, when you see two reads that have the same pair of identifying barcodes, you can infer that they came from the same cell. You can do additional rounds of indexing. Instead of doing two labeling plates, you do three, and perform your experiment with hundreds of thousands of cells.
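The barcode arithmetic behind this scheme can be sketched in a few lines of Python. The model below assumes every cell lands in a uniformly random well at each round, which is a simplification of any real protocol; the function names and the example reads are hypothetical.

```python
from collections import defaultdict


def barcode_combinations(wells_per_round, rounds):
    """Number of distinct barcode combinations after several labeling rounds."""
    return wells_per_round ** rounds


def collision_chance(n_cells, n_combinations):
    """Chance that a given cell shares its combination with at least one other cell,
    assuming cells are assigned to combinations uniformly at random."""
    return 1.0 - (1.0 - 1.0 / n_combinations) ** (n_cells - 1)


def group_reads_by_cell(reads):
    """Group reads by (first-round well, second-round well); each pair stands for one cell."""
    cells = defaultdict(list)
    for first_well, second_well, gene in reads:
        cells[(first_well, second_well)].append(gene)
    return dict(cells)


two_rounds = barcode_combinations(96, 2)    # 96 x 96 pairs
three_rounds = barcode_combinations(96, 3)  # one extra labeling plate
print(two_rounds, three_rounds)
print(collision_chance(1000, two_rounds))
print(collision_chance(1000, three_rounds))

# Hypothetical reads as (first well, second well, gene): two share a barcode pair.
reads = [(12, 40, "GENE_A"), (12, 40, "GENE_B"), (7, 40, "GENE_A")]
print(group_reads_by_cell(reads))
```

Under this uniform model, two rounds give 9,216 pairs and three rounds give 884,736 combinations. For 1,000 cells, each cell has roughly a 10% chance of sharing its pair with another cell after two rounds and about 0.1% after three, which shows why adding a round lets the experiment scale to many more cells.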
There are many different ways you can deploy this concept and measure different things. You can measure more than one thing in the same cell. Jay Shendure and I had a paper where we were doing both ATAC-Seq and RNA-Seq in the same cells. All this SCI-Seq work has been in collaboration with Illumina.
Q: What advice do you have for people first getting into single-cell genomics?
CT: When I’ve gone on visits, I’ve been really impressed by how quickly the new technologies have been mastered, particularly by the grad students and post-docs. There’s a real ambition to adopt it. For labs that are thinking about doing their first experiment, what I would say is to prepare for the reality that it takes a few weeks to generate the data and a few months to analyze it. The data sets are very complex. The biology is invariably complicated. Particularly with RNA-Seq, making inferences about the kinetics of some signaling pathway based on transcription can be very challenging.
The materials are expensive. It’s very possible that if you don’t set up your experiment in just the right way, then you might not be able to draw conclusions and you’ve spent quite a bit of money. The temptation will be to do a small experiment first, but you might want to consider a larger experiment with more controls and better design. It might be cheaper in the long run.
Regarding bioinformatics, I would definitely come armed with a very clear idea of the genes that you expect to be expressed specifically in each of the cell types. You’re going to need to classify cells on that basis and be an expert in your system. The knowledge of the wider cell biology community has not been captured in a way that lets programs label cell types from transcription data.
Be ready to do some programming. Be comfortable with R or Python, at least at a basic level. You’re going to have to write a little bit of code. Use the forums. There are forums for each of the major tools and software developers can’t keep up with all the email.
Q: What's next for single-cell genomics?
CT: A paper just came out on an extension to SCI-Seq called sci-Plex. It’s a way of looking at millions of cells from many different conditions, and it allows us to do drug screens. Rather than constructing atlases of all the cell types in an organism, we’re trying to do large perturbation experiments and build quantitative models of gene regulation that reveal mechanistically how the perturbations work. You can imagine using that to understand the mechanism of action of a compound that you know is a hit, but you don’t know how it works.
Q: What is your long-term vision for single-cell genomics?
CT: I would like to see all the things we can imagine measuring in single cells democratized and deployed throughout the worlds of biology and medicine. I think that you can extract insights that are really hard to come by with other techniques. I’m really mystified by the fact that our DNA encodes what is basically the most complex and beautiful program that we’ve ever encountered as a species, and I want to know how it can generate so many different cell types that do so many different things from a single, static program. Even if we can understand a small piece of it, like how the genome encodes the precise pattern of spatial organization of cells in an organ, that is a triumph.
Learn more about the products and systems mentioned in this article:
NovaSeq 6000 System, www.illumina.com/systems/sequencing-platforms/novaseq.html
NextSeq 500 System, www.illumina.com/systems/sequencing-platforms/nextseq.html
- Trapnell C, Pachter L, Salzberg S. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009;25(9):1105-1111.
- Trapnell C, Roberts A, Goff L et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc. 2012;7(3):562-578.
- Trapnell C, Cacciarelli D, Grimsby J et al. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat Biotechnol. 2014;32(4):381-386.
- Pliner H, Shendure J, Trapnell C. Supervised classification enables rapid annotation of cell atlases. Nat Methods. 2019;16(10):983-986.
- Cao J, Packer JS, Ramani V et al. Comprehensive single cell transcriptional profiling of a multicellular organism. Science. 2017;357(6352):661-667.