How we tackle cell type annotation

Cell type identification or cell type annotation is one of the main goals of single-cell sequencing and single-cell transcriptomics.

Last updated August 26, 2025 – this blog reflects the latest strategies for single-cell RNA sequencing (scRNA-seq) data annotation and cell type identification.

Assigning cell type identities is one of the central challenges in interpreting single-cell data. In this blog post, we walk through our approach to transforming clusters of gene expression data into clear, meaningful biological insights.

To give a name to something is a central act in the scientific endeavor, whether it concerns Linnaeus' binomial nomenclature of species, the HUGO gene naming system, or labeling cell types and subtypes in a tissue sample. In single-cell RNA sequencing (scRNA-seq), this tradition continues as we identify and name cell types based on their transcriptomic signatures.

The process begins by clustering cells according to their gene expression profiles. We then apply a combinatorial approach that integrates reference datasets, differential expression analysis, and manual validation of canonical marker genes. This allows us to map clusters to known cell types or biological functions, essentially assigning an identity to each single cell.

In this blog, we describe the steps from clustering and gene expression data to assigning cell labels.

If you want to read about our data processing steps prior to cell type labeling, see How We Identify a Cell.

But first: How should we define ‘cell type identity’?

Traditionally, biologists defined cell types in terms of morphology (think: eosinophil granulocytes) and physiology (think: stem cells). With the advent of antibody labeling, the definition was extended to include cell surface markers. Later, RNA sequencing unlocked the possibility of defining cell types by their gene expression profiles. Now, in the era of single-cell biology, as we’re getting to know individual cells inside and out, the concept of cell type identity is continuously evolving and remains actively debated.

Fundamentally, there is no general method for defining cell identity. This means that, with every publication, researchers propose a cell type identity and must provide arguments for their labeling. To do so, they extract evidence from scRNA-seq data, consult the scientific literature, and perform validation experiments. Programs like The Human Cell Atlas aim to harmonize and standardize such cell type identification efforts in the future.

Because of this complexity, cell type annotation in our work is highly collaborative. It’s not a default part of our preliminary analysis—instead, we partner closely with clients to combine our expertise in interpreting transcriptomic data with their domain-specific knowledge.

Depending on the nature of the dataset and your research, cell identities may fall into one or more of the following categories:

  • Established cell types
  • Novel cell types
  • Cell states and disease stages
  • Developmental stages

Established cell types

These are the most straightforward to identify, typically through reference datasets. Some cells have distinct markers (e.g., PFN1 for osteocytes, PECAM1 for endothelial cells), while others require broader gene set enrichment to match known profiles.
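As a rough illustration of marker-based matching, the sketch below scores one cluster against a small marker panel. It is a minimal example under stated assumptions, not our production pipeline: PECAM1 comes from the text above, while the other genes, the `MARKERS` panel, the function names, and the `min_score` threshold are hypothetical placeholders.

```python
from statistics import mean

# Hypothetical marker panel. PECAM1 is taken from the text; the remaining
# genes are illustrative and would be replaced per tissue and project.
MARKERS = {
    "endothelial": ["PECAM1", "VWF"],
    "T cell": ["CD3D", "CD3E"],
}


def score_cluster(cluster_means, markers=MARKERS):
    """Mean expression of each cell type's markers within one cluster.

    cluster_means maps gene -> average normalized expression in the cluster.
    Genes absent from the cluster are treated as not expressed (0.0).
    """
    return {
        cell_type: mean(cluster_means.get(gene, 0.0) for gene in genes)
        for cell_type, genes in markers.items()
    }


def best_label(cluster_means, markers=MARKERS, min_score=0.5):
    """Return the top-scoring label, or 'unassigned' below min_score."""
    scores = score_cluster(cluster_means, markers)
    label, score = max(scores.items(), key=lambda item: item[1])
    return label if score >= min_score else "unassigned"


# Toy cluster averages (made-up numbers for illustration only).
cluster_7 = {"PECAM1": 2.1, "VWF": 1.7, "CD3D": 0.0}
print(best_label(cluster_7))
```

Clusters that come back as "unassigned" are the ones where a single marker is not enough and broader gene set enrichment becomes useful.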

Novel cell types

New cell types are rarely identified, but when clusters are biologically distinct—based on function or developmental origin—they may represent something novel. In such cases, differential expression guides discovery, which can be followed by functional validation.

Cell states and disease stages

Cells can change state in response to perturbation. scRNA-seq can detect these transitions, with tools like enrichment or co-expression analysis helping to identify patterns tied to activation, stress, or pathology.
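One common form of enrichment analysis is an over-representation test: are the genes upregulated in a cluster found in a given gene set (for example, a stress signature) more often than chance would predict? The sketch below uses a hypergeometric test written with the standard library only. The function names and the toy gene identifiers are assumptions for illustration; dedicated enrichment tools also apply multiple-testing correction, which is omitted here.

```python
from math import comb


def hypergeom_pvalue(n_background, n_in_set, n_selected, n_overlap):
    """P(X >= n_overlap) when n_selected genes are drawn without replacement
    from n_background genes, of which n_in_set belong to the gene set."""
    upper = min(n_in_set, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_in_set, k) * comb(n_background - n_in_set, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


def enrichment(de_genes, gene_set, background):
    """Over-representation p-value of gene_set among de_genes.

    Both lists are first restricted to the background (all tested genes).
    """
    background = set(background)
    de = set(de_genes) & background
    gs = set(gene_set) & background
    return hypergeom_pvalue(len(background), len(gs), len(de), len(de & gs))


# Toy example with placeholder gene names g0..g9.
background = [f"g{i}" for i in range(10)]
stress_set = ["g0", "g1", "g2", "g3"]
upregulated = ["g0", "g1", "g9"]
print(enrichment(upregulated, stress_set, background))
```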

Developmental stages

In developmental contexts, scRNA-seq can reveal progression from progenitor to mature cell types. Trajectory and pseudotime analyses help reconstruct these paths, supporting both annotation and biological insight.

Study an example Exploratory Data Report

Explore our Exploratory Data Report for a real dataset, demonstrating how we connect count tables to t-SNE and UMAP visualizations, and take the first steps toward cell type annotation.


Assigning Cell Type Identities

The practical steps of cell type identification vary depending on the research question and the data. In general, our approach can be summarized as follows:

  1. In-depth preprocessing – including quality control, batch effect correction, and clustering analysis.

  2. Reference-based annotation – mapping clusters to known cell types using established datasets.

  3. Manual refinement – fine-tuning labels through expert curation of marker genes and biological context.

In-depth Preprocessing

High-quality data is the foundation of reliable cell annotation. We start with rigorous quality control to filter out low-quality cells or genes, followed by doublet detection to exclude multiplets from further analysis. Next, batch correction is applied to mitigate technical variation from differences in sample preparation or sequencing runs.
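To make the quality control step concrete, here is a minimal sketch of cell-level filtering on the number of detected genes and the mitochondrial read fraction. It covers only this one step (doublet detection and batch correction need dedicated tools), and the thresholds, the "MT-" prefix convention for human mitochondrial genes, and the function name `qc_filter` are assumptions that would be tuned per dataset.

```python
def qc_filter(cells, min_genes=200, max_mito_fraction=0.2, mito_prefix="MT-"):
    """Return the ids of cells that pass basic quality control.

    cells maps cell_id -> {gene: raw count}. The default thresholds are
    illustrative, not recommended values.
    """
    kept = []
    for cell_id, counts in cells.items():
        total = sum(counts.values())
        if total == 0:
            continue  # empty barcode, nothing to evaluate
        n_genes = sum(1 for count in counts.values() if count > 0)
        mito = sum(
            count for gene, count in counts.items()
            if gene.startswith(mito_prefix)
        )
        if n_genes >= min_genes and mito / total <= max_mito_fraction:
            kept.append(cell_id)
    return kept


# Toy data with a lowered gene threshold so the example stays small.
toy_cells = {
    "cell_a": {"ACTB": 5, "GAPDH": 3, "MT-CO1": 1},  # passes both checks
    "cell_b": {"ACTB": 1, "MT-CO1": 9},              # high mitochondrial share
    "cell_c": {"ACTB": 4},                           # too few genes detected
}
print(qc_filter(toy_cells, min_genes=2))
```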

Finally, we perform a preliminary clustering analysis aimed at grouping cells with similar transcriptomic profiles, offering the first structural view of the dataset and laying the groundwork for accurate annotation.

Reference-based Annotation

In this step, we conduct an in-depth review of the literature and available cell atlases to identify the most suitable reference datasets, which serve as a ground truth for a first preliminary annotation. Clients are also welcome to specify any particular study or atlas they prefer for referencing their data.

We then align the gene expression profiles of each single cell with references from similar tissues, with tools such as SingleR or Azimuth. Notably, the Azimuth project provides cell type annotations at different levels—from broad categories to very detailed subtypes—so you can choose the level of detail that best fits your needs.
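The idea behind correlation-based label transfer can be sketched in a few lines: compare each cell's expression profile with reference profiles and assign the best-correlated label. The example below is a simplified illustration using Spearman rank correlation. It is not the SingleR or Azimuth implementation, and the reference values, gene panel, and function names are made up for the example.

```python
def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def annotate_cell(cell, reference):
    """Return (best label, all scores) using genes shared with every profile."""
    genes = sorted(g for g in cell if all(g in p for p in reference.values()))
    x = [cell[g] for g in genes]
    scores = {
        label: spearman(x, [profile[g] for g in genes])
        for label, profile in reference.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical reference profiles and query cell (made-up values).
reference = {
    "endothelial": {"PECAM1": 9.0, "VWF": 8.0, "CD3E": 0.1, "ACTB": 5.0},
    "T cell": {"PECAM1": 0.5, "VWF": 0.1, "CD3E": 7.0, "ACTB": 5.0},
}
cell = {"PECAM1": 3.0, "VWF": 2.0, "CD3E": 0.0, "ACTB": 4.0}
label, scores = annotate_cell(cell, reference)
print(label, scores)
```

Rank-based correlation is used here because it is less sensitive to differences in scale between the query data and the reference than a correlation on raw values.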

Following reference-based annotation, we check how the predicted cell types align with our clusters. If the reference indicates that two clusters represent the same cell type, we merge them. If it points to finer differences, we adjust the resolution to capture those subtypes. When possible, multiple reference datasets are used at this stage, allowing for the generation of a robust consensus annotation.
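The comparison between per-cell predictions and clusters can be sketched as a majority vote. In the example below, clusters that share a label are reported as merge candidates, and clusters without a clear majority are flagged as "mixed", which may suggest a higher clustering resolution. The 0.6 cutoff, the "mixed" flag, and the function names are assumptions for illustration.

```python
from collections import Counter, defaultdict


def cluster_consensus(cluster_ids, predicted_labels, min_fraction=0.6):
    """Majority label per cluster; 'mixed' when no label reaches min_fraction."""
    per_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, predicted_labels):
        per_cluster[cluster][label] += 1
    consensus = {}
    for cluster, counts in per_cluster.items():
        label, n = counts.most_common(1)[0]
        fraction = n / sum(counts.values())
        consensus[cluster] = label if fraction >= min_fraction else "mixed"
    return consensus


def merge_candidates(consensus):
    """Group clusters that received the same label (candidates for merging)."""
    groups = defaultdict(list)
    for cluster, label in consensus.items():
        if label != "mixed":
            groups[label].append(cluster)
    return {label: sorted(ids) for label, ids in groups.items() if len(ids) > 1}


# Toy example: one cluster id and one predicted label per cell.
clusters = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
labels = ["B", "B", "B", "B", "B", "T", "T", "NK", "T", "NK"]
consensus = cluster_consensus(clusters, labels)
print(consensus)
print(merge_candidates(consensus))
```

With several references, the same vote can be repeated per reference and the results compared before settling on a label.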

While this iterative refinement is time-consuming, reference-based annotation tools are integrated into popular single-cell analysis platforms like Seurat, which helps keep the workflow efficient.

Manual Refinement

This step adds a crucial layer of biological insight. Automated methods may miss subtle distinctions or misclassify cells in edge cases.

To address this, we carefully review the preliminary annotations against multiple sources of evidence. This includes verifying expression patterns of canonical marker genes, performing differential gene expression analyses to detect unique or unexpected signatures, and consulting relevant literature to contextualize findings. Most importantly, we integrate the client’s biological expertise, which is often essential for interpreting ambiguous clusters or edge cases.
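As a simplified picture of the differential expression check, the sketch below ranks genes by log2 fold change between one cluster and all other cells. It is an assumption-laden toy: real analyses pair the fold change with a statistical test and multiple-testing correction, and the function names, pseudocount, and expression values here are illustrative only.

```python
from math import log2
from statistics import mean


def log2_fold_changes(expr, clusters, target, pseudocount=1.0):
    """log2 fold change of each gene in the target cluster versus the rest.

    expr maps gene -> list of normalized values (one per cell);
    clusters is a list with one cluster id per cell, in the same order.
    """
    in_idx = [i for i, c in enumerate(clusters) if c == target]
    out_idx = [i for i, c in enumerate(clusters) if c != target]
    result = {}
    for gene, values in expr.items():
        mean_in = mean(values[i] for i in in_idx)
        mean_out = mean(values[i] for i in out_idx)
        # The pseudocount avoids division by zero for unexpressed genes.
        result[gene] = log2((mean_in + pseudocount) / (mean_out + pseudocount))
    return result


def top_markers(expr, clusters, target, n=2):
    """Genes with the highest fold change in the target cluster."""
    lfc = log2_fold_changes(expr, clusters, target)
    return sorted(lfc, key=lfc.get, reverse=True)[:n]


# Toy expression values for four cells in two clusters.
expr = {
    "PECAM1": [7, 7, 0, 1],
    "CD3E": [0, 0, 3, 3],
    "ACTB": [5, 5, 5, 5],
}
clusters = ["a", "a", "b", "b"]
print(top_markers(expr, clusters, "a", n=1))
```

If the top-ranked genes of a cluster do not include the canonical markers of its preliminary label, that cluster is a natural candidate for closer manual review.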

Manual refinement not only corrects potential misclassifications but also allows for more precise labeling—such as distinguishing between closely related cell subtypes, identifying transitional cell states, or flagging novel populations for further investigation. This ensures that the final cell type assignments accurately reflect the true biological context of the study, ultimately leading to more robust interpretations and enabling discoveries that might otherwise remain hidden.

Concluding Remarks

Robust cell type identification depends on several key factors: the quality of your data, the availability of suitable reference studies, and the validity of the chosen marker genes or gene sets. In practice, we combine our computational expertise with the biological knowledge of our clients to ensure that annotations are both technically sound and biologically meaningful.

It’s important to stress that best practice is to follow up scRNA-seq experiments with validation experiments based on a different method to further characterize the cells in your sample. Likewise, it can be vital to find out whether a newly identified population represents a stable cell type or a transient molecular state. In our data consulting projects, clients play a central role by contributing important cell markers, potential references, and, in general, their biological expertise, helping define the final identity of each cluster based on our analysis output.

For more details on our approach, explore our full data analysis methodology here or contact our data experts by emailing data_consulting@scdiscoveries.com.