One of the main challenges in single-cell data interpretation is cell type annotation. In this blog, we explain our process of collaborating with clients to analyze single-cell data and assign cell type identity labels.
To give a name to something is a central act in the scientific endeavor, whether it concerns Linnaeus’ binomial nomenclature of species, the HUGO gene naming system, or the act of labeling cell types and subtypes in a tissue sample.
When it comes to single-cell RNA sequencing (scRNA-seq), we identify and name cell types based on their transcriptomes. Building on differences in gene expression profiles, bioinformaticians group cells by clustering analysis and characterize the resulting clusters by differential gene expression analysis.
What follows is a process of attaching labels to clusters. It starts with identifying marker genes or gene sets that are highly expressed in those clusters and that are indicative of cell function. In this blog, we describe the steps from clustering and gene expression data to assigning cell labels.
If you want to read about our data processing steps prior to cell type labeling, read How We Identify a Cell.
But first: How should we define ‘cell type identity’?
Traditionally, biologists defined cell types in terms of morphology (think: eosinophil granulocytes) and physiology (think: stem cells). With the advent of antibody labeling, this was extended to defining cell types by their cell surface markers. Later, RNA sequencing made it possible to define cell types by their gene expression profiles. Now, in the era of single-cell biology, as we’re getting to know individual cells inside and out, the concept of cell type identity is still evolving and remains actively debated.
Fundamentally, there is no generally accepted method for defining cell identity. This means that, with every publication, researchers propose a cell type identity and must deliver arguments for their labeling. To do so, they extract evidence from scRNA-seq data, consult the scientific literature, and perform validation experiments. Programs like The Human Cell Atlas aim to harmonize and standardize such cell type identification efforts in the future.
The result of this complex situation is that our process of cell type annotation is highly collaborative. Assigning cell type labels is not a default part of our preliminary data analysis, and in our custom data projects, we always collaborate with our clients to combine our RNA-sequencing data expertise with their biological knowledge.
The practice of assigning cell type labels
Practical steps of cell type identification vary depending on the research question and the data. We can differentiate between four general settings: identifying established cell types, novel cell types, cells in different states or disease stages, and developmental stages:
Established cell types
The identification of established cell types is the most straightforward. It involves comparing the gene expression profiles of the clusters to established cell types or using (supervised) machine-learning methods to predict the cell type.
Some cell types have distinct marker genes. When differential gene expression analysis detects a cluster with canonical markers, the cells are likely of the accompanying cell type. For example, differentially expressed PFN1 may mark an osteocyte. Similarly, differentially expressed CD1A and CD207 may mark a skin dendritic cell (Langerhans cell). There are also negative markers, where the absence of a gene’s expression helps identify a cell type.
Other cell types have been firmly associated with a more complex yet distinct gene expression profile. In these cases, a statistical comparison between the discovered profile and a reference database can assign a cell type. We can perform this so-called gene set enrichment analysis to identify cell types by their gene profiles. Since scRNA-seq collects data on the whole transcriptome, the technology suits such thorough profiling very well.
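As a rough illustration of marker-based labeling, the sketch below matches the upregulated genes of a cluster against a small marker table with positive and negative markers. The table contents other than CD1A and CD207, the function name label_cluster, and the min_fraction threshold are assumptions made for this example, not part of our actual pipeline.

```python
# Hypothetical marker table for illustration only. Positive markers are
# expected among a cluster's upregulated genes; negative markers are not.
MARKERS = {
    "Langerhans cell": {"positive": {"CD1A", "CD207"}, "negative": {"CD3E"}},
    "T cell": {"positive": {"CD3E", "CD3D"}, "negative": {"CD207"}},
}


def label_cluster(upregulated, markers=MARKERS, min_fraction=0.5):
    """Return (label, score) for a cluster's upregulated genes.

    The score is the fraction of a label's positive markers found in the
    cluster. A label is skipped if any of its negative markers is present.
    """
    genes = {gene.upper() for gene in upregulated}  # tolerate "CD1a" vs "CD1A"
    best_label, best_score = "unassigned", 0.0
    for label, spec in markers.items():
        if genes & spec["negative"]:
            continue  # a negative marker rules this label out
        score = len(genes & spec["positive"]) / len(spec["positive"])
        if score >= min_fraction and score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


if __name__ == "__main__":
    print(label_cluster(["CD1a", "CD207", "HLA-DRA"]))
```

In practice, the marker table is the part that needs biological expertise; the matching itself is simple bookkeeping.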
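To make the idea of a statistical comparison concrete, here is a minimal sketch of an over-representation test based on the hypergeometric distribution. Note that this is a simpler stand-in for rank-based gene set enrichment methods, not the exact procedure we run. The function name and the toy gene names are hypothetical, and the test assumes that all cluster genes and reference genes are drawn from the same background of background_size genes.

```python
from math import comb


def overrepresentation_pvalue(cluster_genes, reference_set, background_size):
    """One-sided hypergeometric p-value for the overlap of two gene sets.

    Returns the probability of seeing at least the observed overlap if the
    cluster genes were drawn at random from the background.
    """
    cluster_genes, reference_set = set(cluster_genes), set(reference_set)
    k = len(cluster_genes & reference_set)  # observed overlap
    n = len(cluster_genes)                  # genes drawn (cluster markers)
    K = len(reference_set)                  # "successes" in the background
    N = background_size                     # all genes tested
    if K > N or n > N:
        raise ValueError("gene sets cannot be larger than the background")
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / total


if __name__ == "__main__":
    # Toy example: 3 cluster genes, all 3 in the reference set, 10 genes total.
    print(overrepresentation_pvalue({"A", "B", "C"}, {"A", "B", "C"}, 10))
```

A small p-value suggests that the cluster's marker genes overlap with the reference gene set more than expected by chance. When many gene sets are tested, the p-values should be corrected for multiple testing.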
Automatic and advanced cell type annotation
There are many methods that automatically assign cell type labels based on differential gene expression analysis of single-cell data. Two relatively recent papers that benchmark such methods are Abdelaal (2019) and Pullin (2022), although we must note that the field is evolving rapidly. It’s also possible to perform automated and interactive cell type calling in the user-friendly BioTuring BBrowserX software for single-cell data.
More interactively, you can compare your single-cell results with previously published papers and databases when they concern the same tissue or cell types. As mentioned above, this is also when the discussion about which cell types to propose can start. This function, by the way, is also accessible in BBrowserX.
Well-annotated and accessible reference datasets improve the process of cell type annotation significantly. If such a dataset is available for your sample, the process is quick and easy. For example, if the sample data is compatible with the PBMC dataset (which includes scRNA-seq and CITE-seq data), we can use it for cell label transfer.
In our custom data projects, clients provide the marker genes used as input for cell type identification. Ultimately, they also define each cell cluster from the reported data analysis.
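The core idea behind label transfer can be sketched as follows: each query cluster receives the label of the reference cell type whose average expression profile it correlates with best. This is a deliberately simplified, hypothetical example using plain Pearson correlation. Established label transfer tools are considerably more sophisticated, and the gene order, expression values, and min_correlation threshold below are made up for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long lists (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def transfer_labels(query_profiles, reference_profiles, min_correlation=0.6):
    """Label each query cluster with its best-correlated reference cell type."""
    labels = {}
    for cluster, profile in query_profiles.items():
        best_label, best_r = max(
            ((label, pearson(profile, ref)) for label, ref in reference_profiles.items()),
            key=lambda pair: pair[1],
        )
        # Keep the cluster unlabeled when even the best match is weak.
        labels[cluster] = best_label if best_r >= min_correlation else "unassigned"
    return labels


if __name__ == "__main__":
    # Made-up average expression over the genes CD3E, CD19, LYZ, NKG7.
    reference = {"T cell": [5, 0, 0, 1], "B cell": [0, 5, 0, 0], "Monocyte": [0, 0, 6, 0]}
    query = {"cluster_0": [4, 0.2, 0.1, 0.8], "cluster_1": [0.1, 0.3, 5, 0.2]}
    print(transfer_labels(query, reference))
```

The threshold matters: a cluster that resembles none of the reference cell types should stay unassigned rather than inherit the least bad label.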
Novel cell types
There’s always the possibility of discovering a new cell type, although it’s extremely rare to do so in humans. To call a cluster a new cell type is to propose that its distinction is biologically meaningful. It makes sense to identify a cell cluster as a new cell type if it performs different functions or originates from a separate developmental pathway.
Differential gene expression analysis can reveal interesting aspects of a novel cell type’s physiology and phenotype. For example, in research on a childhood cancer of the nervous system, neuroblastoma, differential gene expression analysis revealed a sympathoblast-like gene expression profile. A sympathoblast is an embryonic nerve precursor with stem cell-like qualities. Hence, its transcriptomic similarity with malignant pediatric cancer cells helps generate hypotheses about the cancer cell of origin and the cellular mechanisms of disease progression.
After such analyses, researchers can explore the cells’ physiology with additional functional or genetic testing. To name just one example, a hypothesized novel cardiomyocyte subpopulation may undergo further study by characterizing its reaction to electrical impulses (single-cell electrophysiological characterization).
Cell states and disease stages
A cell can exhibit different phenotypes in response to stimuli. Immune cells, for example, can be activated by antigen presentation. Or dormant progenitor cells can be (re-)activated by growth factors. Moreover, cell states can be related to diseases. A cardiomyocyte subpopulation, say, may show signs of cardiac stress.
Complex diseases such as cancer and immune or neurodegenerative disorders exhibit notoriously heterogeneous and dynamic cell populations. Often, researchers have linked gene expression patterns to specific disease stages or tissue states, such as inflammation or metastasis. We can investigate the scRNA-seq data for these gene expression profiles to aid cell type annotation and help unravel disease characteristics.
These states are often associated with specific biomarkers that may be present in the scRNA-seq data. In these cases, the techniques of cell type identification explained above can be expanded by gene set enrichment, gene ontology, gene regulatory network, or co-expression analysis. The goal of such analyses is to get a better understanding of the cell’s physiology by its transcriptome. Here, scRNA-seq data analysis can also inspire hypotheses for further functional testing.
Developmental stages
Developmental biology and regenerative medicine studies often focus on the developmental or differentiation stages of cells. For example, a researcher might aim to identify different stages of a developing brain cell, from progenitor cells to immature neurons to several differentiated neuron subsets. Cell stages are also significant when working with development models such as organoids or induced pluripotent stem cells.
In these cases, scRNA-seq data can be analyzed further with trajectory inference or pseudotime analysis. For example, sampling an embryonic tissue at different time points enables you to reconstruct how developing cells change from stage to stage. In silico trajectory inference methods can then be applied to the scRNA-seq dataset to order the cells along this developmental path.
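As a toy illustration of what a pseudotime value is, the sketch below projects each cell onto the straight line between an assumed progenitor profile and an assumed differentiated profile. Real trajectory inference methods handle curved and branching paths and do not reduce to this linear projection. The function name and the two-gene expression values are hypothetical.

```python
def pseudotime(cells, start_profile, end_profile):
    """Assign each cell a value in [0, 1] along the start-to-end axis.

    0 means the cell looks like the start (progenitor) profile and 1 means it
    looks like the end (differentiated) profile.
    """
    axis = [e - s for s, e in zip(start_profile, end_profile)]
    norm_sq = sum(a * a for a in axis)
    if norm_sq == 0:
        raise ValueError("start and end profiles must differ")
    times = {}
    for name, expression in cells.items():
        # Scalar projection of (cell - start) onto the axis, scaled by its length.
        projection = sum(
            (x - s) * a for x, s, a in zip(expression, start_profile, axis)
        ) / norm_sq
        times[name] = min(1.0, max(0.0, projection))  # clip to [0, 1]
    return times


if __name__ == "__main__":
    # Made-up expression of (progenitor marker, neuron marker) per cell.
    cells = {"cell_a": [2.5, 2.5], "cell_b": [5, 0], "cell_c": [0, 5]}
    times = pseudotime(cells, start_profile=[5, 0], end_profile=[0, 5])
    print(sorted(times, key=times.get))  # cells ordered from early to late
```

Sorting cells by this value gives an ordering from early to late, which is the kind of output that trajectory analysis provides for downstream annotation.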
Understanding cells’ development gives insight into their identity in the same way that you can get to know a country’s culture by learning its history. Hence, single-cell trajectory analysis can also support general cell type annotation.
Concluding remarks
Ultimately, robust cell type identification depends mainly on the validity of the chosen marker genes or gene sets. In practice, we leverage the biological expertise of our clients to achieve this.
It’s important to stress that the best practice is to follow up scRNA-seq experiments with validation experiments of another nature to further characterize the cells in your sample. Likewise, it can be vital to find out whether an identified cluster represents a stable cell type or a transient molecular state.
Need some extra support? Read more on our data analysis approach here or contact our data experts by emailing data_consulting@scdiscoveries.com.