Our Data Analysis Support

This single-cell data analysis support article covers the following subjects: how to download your data, which data files you can expect, and what to expect from our data mapping and exploratory data analysis.

How to download your data

When your data has arrived and the exploratory analysis has been performed, you will receive an e-mail from our data team with instructions for downloading your data.

Your data will be shared through our secure Amazon Web Services (AWS) server. There are multiple ways to download it:

  • Your institute uses AWS as a data portal.
    • In this case, you can fetch your data directly from our AWS server space to your own AWS server space.
  • Your institute is not using AWS.
    • Download your data by pasting the download link directly into your internet browser.
    • Download your data via a command-line session of your server cluster.
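For the command-line route, a minimal Python sketch using only the standard library could look like the one below. It assumes the download link is a plain HTTPS URL; the function names `filename_from_url` and `download` and the URL in the comment are hypothetical placeholders, so use the link from the e-mail you receive.

```python
import shutil
import urllib.parse
import urllib.request
from pathlib import Path


def filename_from_url(url: str) -> str:
    """Derive a local file name from the URL path, ignoring any query string."""
    path = urllib.parse.urlparse(url).path
    return urllib.parse.unquote(Path(path).name) or "download.bin"


def download(url: str, dest_dir: str = ".") -> Path:
    """Stream the file behind `url` into `dest_dir` and return the local path."""
    dest = Path(dest_dir) / filename_from_url(url)
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        # copyfileobj streams in chunks, so large files are not held in memory.
        shutil.copyfileobj(response, out)
    return dest


# Example call (placeholder URL, not a real link):
# download("https://example.com/path/to/your_data.tar.gz?signature=...", "/data/project")
```

Streaming the response in chunks keeps memory use low, which matters for large sequencing archives.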

 

 

Data files

We send different data files depending on the service that you use.

Data files | 10x Genomics

We send parts of the standard Cell Ranger output, consisting of the files listed below, together with a package of files and reports from our own preliminary clustering and differential gene expression analysis. This analysis is performed on all of your samples that need to be compared.

The following files are always included:

  • HTML reports – contains an .html file for each sample with basic Quality Control (QC) metrics and clustering information, as provided by the automated Cell Ranger software.
  • Raw_counts – contains the Cell Ranger output (mapped count tables). These are the raw count matrices with all barcodes.
  • Filtered_counts – same as above, but filtered for empty barcodes (this is used for downstream analysis).
  • Metrics – contains the QC metrics from the HTML reports in .csv format.
  • Clustering – contains a preliminary clustering and differential gene expression analysis we have done with Seurat. You can find the description of the folders and files in our Seurat manual.
  • cloupe_file – contains the .cloupe files needed to load the data into the 10x Loupe Cell Browser.
  • H5 files – contain the file structure information as generated by Cell Ranger.
  • Fastq – contains the raw FASTQ files.

Optional: BAM files – if you want to receive the .bam files produced by Cell Ranger, we can include these in the data transfer as well.
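As an illustration of working with the Metrics files, the sketch below reads a one-row metrics CSV into a dictionary of numbers. It assumes the layout Cell Ranger typically uses for its metrics summary (one header line, one value line, thousands separators and percent signs in the values). The example content is made up, and the helper names are hypothetical.

```python
import csv
import io


def parse_metric(value: str):
    """Convert strings such as '5,247' or '92.4%' to numbers; leave other text as-is."""
    cleaned = value.replace(",", "").strip()
    try:
        if cleaned.endswith("%"):
            return float(cleaned[:-1]) / 100.0
        return float(cleaned)
    except ValueError:
        return value


def read_metrics(handle) -> dict:
    """Read a one-row metrics CSV (header line plus value line) into a dict."""
    row = next(csv.DictReader(handle))
    return {name: parse_metric(value) for name, value in row.items()}


# Made-up example content; real files contain more columns.
example = io.StringIO(
    "Estimated Number of Cells,Mean Reads per Cell,Fraction Reads in Cells\n"
    '"5,247","28,150",92.4%\n'
)
metrics = read_metrics(example)
print(metrics["Estimated Number of Cells"])
```

For a real file, pass an open file handle (for example `open(path, newline="")`) instead of the `io.StringIO` object.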

 

Data files | SORT-seq

The diagnostics folder contains the Quality Control and diagnostic plots for the plates’ data (see the figure below for examples). Here you can check how well each cell worked by looking at the endogenous reads, UMIs, and genes per cell. You can also check how well the technical handling of the plate worked by looking at the ERCC spike-in reads. If you compare spike-in reads with endogenous reads (as in the third plate-layout plot), you can see in which wells the reaction worked. Wells with few or no endogenous reads are shown in red.

The graph “total unique reads w/o spike-in reads” shows the number of transcripts detected per cell on the x-axis (log10 scale) and frequency on the y-axis. You want a lot of cells with an adequate number of transcripts (preferably >1000 UMIs per cell). A cumulative plot also represents the number of genes detected per cell. In the example, we also see that >1000 genes are detected in most wells. 

The graph “top expressed genes” shows the genes with the highest expression detected in all cells across the plate. The graph “top noisy genes” reports the genes which vary the most in all the cells across the plate.

Example SORT-seq data analysis quality control and diagnostic plots. First row: total endogenous reads, total spike-in reads, ratio spike-in over endogenous reads. Second row: total unique reads without spike-in reads, cumulative dist genes, oversequencing molecules. Third row: top expressed genes, top noisy genes.

The Counts folder contains count tables. These come in three flavors:

  • Read counts (raw mapped reads);
  • Barcode counts (UMI-corrected version of the read counts);
  • Transcript counts (Poisson counting statistics corrected version of the barcode counts) – this is the file we use as input for downstream analysis since it comes closest to the real situation in the cell.

It is up to you to decide which one you are most interested in. For clustering and downstream analysis, we usually use the transcript counts.
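To give an idea of what a Poisson counting statistics correction does, the sketch below applies a commonly used formula: if k distinct UMIs are observed for a gene in a cell and K different UMIs are possible, the estimated number of transcripts is t = -K · ln(1 - k/K). The correction accounts for different molecules receiving the same UMI by chance. The exact formula and UMI length used in our pipeline are not stated here, so treat both the formula and the value K = 4096 (a 6-nucleotide UMI) as assumptions for illustration.

```python
import math


def poisson_corrected_count(umi_count: int, n_possible_umis: int) -> float:
    """Estimate transcript number from the number of distinct UMIs observed.

    Assumes UMIs are drawn uniformly at random from `n_possible_umis` options.
    """
    if not 0 <= umi_count < n_possible_umis:
        raise ValueError("umi_count must be >= 0 and smaller than n_possible_umis")
    return -n_possible_umis * math.log(1.0 - umi_count / n_possible_umis)


# Hypothetical example: 6-nucleotide UMIs give 4**6 = 4096 possible barcodes.
K = 4 ** 6
for k in (1, 100, 1000):
    print(k, round(poisson_corrected_count(k, K), 2))
```

The correction is negligible for low counts and grows as the observed UMI count approaches the number of possible UMIs.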

Fastq contains the raw sequencing FASTQ files. These typically are a set of eight files: two reads (Read 1 and Read 2) from each of the four lanes of the NextSeq 500. R1 and R2 indicate the read type. L00X indicates the lane. To map, we concatenate all Read 1 files and all Read 2 files into two files and use these as input for data mapping with BWA. See below for more info about data mapping with BWA.
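If you want to reproduce the concatenation step yourself, a sketch along the following lines may help. It assumes gzip-compressed files whose names contain Illumina-style `_L001_R1_` tags; the function name `concatenate_lanes` and the output naming are hypothetical. Gzip files can be concatenated byte by byte, and the result is still a valid gzip file.

```python
import re
import shutil
from pathlib import Path

# Matches the lane and read tags, e.g. "_L001_R1" in "plate1_L001_R1_001.fastq.gz".
LANE_READ = re.compile(r"_L(\d{3})_(R[12])")


def concatenate_lanes(fastq_dir: str, out_prefix: str) -> dict:
    """Concatenate per-lane FASTQ files into one file per read type (R1, R2)."""
    groups = {"R1": [], "R2": []}
    for path in Path(fastq_dir).glob("*.fastq.gz"):
        match = LANE_READ.search(path.name)
        if match:
            groups[match.group(2)].append((match.group(1), path))

    outputs = {}
    for read, files in groups.items():
        if not files:
            continue
        out_path = Path(f"{out_prefix}_{read}.fastq.gz")
        with open(out_path, "wb") as out:
            # Sort by lane so R1 and R2 records stay in the same order.
            for _lane, path in sorted(files):
                with open(path, "rb") as src:
                    shutil.copyfileobj(src, out)
        outputs[read] = out_path
    return outputs
```

Keeping the lane order identical for both read types matters, because paired-end mappers expect the n-th record of the Read 1 file to belong to the n-th record of the Read 2 file.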

The Clustering folder contains a preliminary clustering analysis performed with Seurat.

 

Data files | Bulk RNA sequencing

The Diagnostics file contains basic Quality Control and diagnostic plots for your samples. It contains the following information:

  • Reads per sample – this shows the total number of mapped reads for all samples.
  • Genes per sample – this shows the total number of genes detected for all samples.
  • Correlation heatmap – this shows a correlation heatmap indicating how similar samples are to each other (Red = similar, blue = different). Samples are clustered by unsupervised clustering based on their similarity.

The Counts folder contains a count table with the expression matrix of your data. Rows are genes and columns are samples.

The *_raw file contains the raw, non-normalized read counts.

Fastq contains the raw sequencing FASTQ files for all samples.
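The sample similarity behind a correlation heatmap can be recomputed from a count table with a few lines of Python. The sketch below assumes a tab-separated table with gene names in the first column and one column per sample, and it uses the Pearson correlation; the correlation measure used for our heatmap is not specified here, so this choice is an assumption. The example values are made up.

```python
import csv
import io
import math


def read_count_table(handle) -> dict:
    """Read a tab-separated count table into {sample name: list of counts}."""
    reader = csv.reader(handle, delimiter="\t")
    samples = next(reader)[1:]  # first header cell labels the gene column
    columns = {sample: [] for sample in samples}
    for row in reader:
        for sample, value in zip(samples, row[1:]):
            columns[sample].append(float(value))
    return columns


def pearson(x, y) -> float:
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return float("nan")
    return cov / (sd_x * sd_y)


# Made-up example table with three genes and three samples.
table = io.StringIO("gene\tS1\tS2\tS3\nG1\t10\t12\t90\nG2\t50\t48\t5\nG3\t30\t33\t40\n")
counts = read_count_table(table)
for a in counts:
    print(a, [round(pearson(counts[a], counts[b]), 2) for b in counts])
```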

 

Data files | VASA-seq

The diagnostics file contains the Quality Control and diagnostic plots for each plate. Here you can check how well each cell worked (endogenous reads, UMIs, and genes per cell) and how well the technical handling of the plate worked (ERCC spike-in reads). It also includes a plot to estimate oversequencing, which shows how many times each molecule was sequenced (by comparing UMI-corrected counts with raw reads). This tells us whether enough sequencing depth was assigned to the sample. It also reports the most highly expressed genes and the most variable genes in the library. You can see an example of such a diagnostic plot here.

The Counts folder contains count tables. These come in three flavors:

  • Read counts (raw mapped reads);
  • Barcode counts (UMI-corrected version of the read counts);
  • Transcript counts (Poisson counting statistics corrected version of the barcode counts) – this is the file we use as input for downstream analysis since it comes closest to the real situation in the cell.

Each of these three count tables is also provided separately for three categories of mapped reads:

  • Exonic reads – reads found in exons
  • Intronic reads – reads found in introns
  • Total reads – a combination of both intronic and exonic reads

It is up to you to decide which one you are most interested in. For clustering and downstream analysis, we usually use the transcript counts of the exonic reads.


Fastq contains the raw sequencing FASTQ files. These typically are a set of eight files: two reads (Read 1 and Read 2) from each of the four lanes of the NextSeq 500. R1 and R2 indicate the read type. L00X indicates the lane. To map, we concatenate all Read 1 files and all Read 2 files into two files and use these as input for data mapping with BWA. See below for more info about data mapping with BWA.

The clustering folder contains a preliminary clustering analysis performed with Seurat.

 

 

Data mapping

During mapping, the generated reads are ‘mapped’ (aligned) against the reference genome. The reference genome stores the genetic information of your species of interest, e.g., the sequences and positions of genes on the chromosomes. During this procedure, the algorithm scans the reference genome for the best-matching position for each read. When it finds a match between a read and the reference genome, the associated biological annotation (such as the gene the read originates from) is recorded with your data. This tells you whether transcripts of specific genes are present or absent in your data.

Mapping is most efficient when the mapping software indexes the genome. Two widely used mapping tools are Spliced Transcripts Alignment to a Reference (STAR) and Burrows-Wheeler Aligner (BWA). The main difference between the two is that STAR provides extra information about spliced and unspliced transcripts in your data.
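Why indexing helps can be illustrated with a toy example. The sketch below builds a simple k-mer lookup table for a reference sequence and uses it to find exact matches for a read without scanning the whole reference. This is only an illustration of the idea: STAR and BWA use far more sophisticated index structures and also handle mismatches and gaps, and all names and sequences here are made up.

```python
from collections import defaultdict


def build_kmer_index(reference: str, k: int) -> dict:
    """Map every k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read: str, reference: str, index: dict, k: int) -> list:
    """Return 0-based reference positions where the read matches exactly."""
    if len(read) < k:
        return []
    hits = []
    # Look up the first k bases in the index, then verify the full read.
    for pos in index.get(read[:k], []):
        if reference[pos:pos + len(read)] == read:
            hits.append(pos)
    return hits


reference = "ACGTACGTTAGC"  # made-up toy reference
index = build_kmer_index(reference, k=4)
print(map_read("ACGTT", reference, index, k=4))
```

The index is built once per genome and reused for every read, which is where the efficiency gain comes from.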

Database of genomes

You can find an overview of the genomes we currently have available in the table below. Is your species not listed here? Please provide us with the genome so we can map your data. In addition to this list, we work with project-specific genomes. You can upload your sequence during sample submission.

Trivial name | Genome
Mosquito species | Anopheles_gambiae
Arabidopsis | Arabidopsis_thaliana_TAIR10
C. elegans | Caenorhabditis_elegans.WBcel235
Dog | CanFam2011) – ERCC
Chimpanzee | Chimpanzee (PanTro)
Single-celled green alga | Chlamydomonas_reinhardtii
Fruit fly | Drosophila Dm6
Oral bacterium | Fusobacterium_nucleatum
Chicken | Gallus gallus
Hamster | CHO_CriGri-PICR_refseq_LONMF
Human | Human hg38
Mouse | Mouse mm10 + mito
Protozoan parasite | Plasmodium_falciparum
Rabbit | OryCun2.0
Rainbow trout | oncorhynchus_mykiss
Rat | Rattus norvegicus
Salmon | Salmo_salar.ICSASG_v2.105
Wild boar/pig | Sus scrofa
Frog | Xenopus 9.1
Zebrafish | Zebrafish zv9/zv11

 

Exploratory data analysis

Which steps are taken during our standard data analysis?

For our exploratory analysis, we follow the steps indicated in the figure below. This workflow can roughly be divided into three components: pre-processing, dimensionality reduction, and clustering. 

 

Pre-processing

Quality Control

Pre-processing starts with quality control, where several quality metrics are visualized. These metrics include the number of genes per cell, the number of UMIs per cell, the percentage of mitochondrial genes per cell, and the percentage of ribosomal genes per cell. 

Filtering

A cut-off value is selected to filter for cells of adequate quality. We filter cells based on the number of UMIs for our exploratory analysis. The selection of this cut-off is relatively arbitrary, so we recommend re-evaluating this for your own analysis. Cells can also be filtered on other metrics, such as the percentage of mitochondrial genes per cell. 
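The quality metrics and the UMI-based filter can be sketched in plain Python as follows. The cut-off values, the `MT-` prefix used to recognize mitochondrial genes (a human gene-symbol convention), and the function names are assumptions for illustration; choose values that fit your own data.

```python
def qc_metrics(counts: dict, mito_prefix: str = "MT-") -> dict:
    """Compute per-cell QC metrics from a {gene: UMI count} dictionary."""
    total = sum(counts.values())
    mito = sum(c for gene, c in counts.items() if gene.startswith(mito_prefix))
    return {
        "n_umis": total,
        "n_genes": sum(1 for c in counts.values() if c > 0),
        "pct_mito": 100.0 * mito / total if total else 0.0,
    }


def filter_cells(cells: dict, min_umis: int, max_pct_mito: float = 100.0) -> dict:
    """Keep cells with enough UMIs and an acceptable mitochondrial percentage."""
    kept = {}
    for cell_id, counts in cells.items():
        metrics = qc_metrics(counts)
        if metrics["n_umis"] >= min_umis and metrics["pct_mito"] <= max_pct_mito:
            kept[cell_id] = counts
    return kept


# Made-up example with two cells.
cells = {
    "cell_1": {"GAPDH": 900, "MT-CO1": 100, "ACTB": 0},
    "cell_2": {"GAPDH": 40, "MT-CO1": 60},
}
print(sorted(filter_cells(cells, min_umis=500)))
```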

Normalization

For normalization and further steps, we use functionalities provided by the Seurat package. Gene expression values for each cell are divided by the cell's total transcript count and multiplied by a scale factor, following Seurat's default normalization method. The result is then log-transformed.
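Expressed as a formula, Seurat's default log-normalization computes log(1 + count / total × scale factor) for each gene in a cell, with a default scale factor of 10,000. The Python sketch below mirrors that calculation for a single cell; it is an illustration, not the Seurat code itself, and it assumes the default scale factor is used.

```python
import math


def log_normalize(counts, scale_factor: float = 10_000.0):
    """Log-normalize the counts of one cell: log1p(count / total * scale_factor)."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [math.log1p(c / total * scale_factor) for c in counts]


# Made-up example: one cell with four genes.
print([round(v, 3) for v in log_normalize([0, 5, 20, 75])])
```

Dividing by the total count makes cells with different sequencing depths comparable, and the log transform reduces the influence of very highly expressed genes.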

Determining variable features

The genes considered most informative for the variability in the data are selected by applying a variance-stabilizing transformation. The genes with the highest variance-to-mean ratio are selected. The 2,000 most variable genes are used for further analysis. These are the genes that show the largest differences in expression between cells.
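The idea of ranking genes by their variance-to-mean ratio can be sketched as below. This is a deliberately simplified version: Seurat's variance-stabilizing procedure additionally models the relationship between mean and variance before ranking, which this example does not do. The function name and data are made up.

```python
from statistics import mean, variance


def top_variable_genes(expression: dict, n_top: int) -> list:
    """Rank genes by variance-to-mean ratio and return the `n_top` highest."""
    scores = {}
    for gene, values in expression.items():
        gene_mean = mean(values)
        if gene_mean > 0:  # skip genes that are not expressed at all
            scores[gene] = variance(values) / gene_mean
    return sorted(scores, key=scores.get, reverse=True)[:n_top]


# Made-up counts for three genes across four cells.
expression = {
    "geneA": [0, 0, 10, 10],  # strongly variable
    "geneB": [5, 5, 5, 5],    # constant
    "geneC": [4, 5, 6, 5],    # slightly variable
}
print(top_variable_genes(expression, n_top=2))
```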

Scaling

To improve comparison between genes and prevent highly expressed genes from dominating the analysis, the relative gene expression abundances between cells are corrected by a linear transformation. During this step, the expression values of each gene are scaled to have a mean of 0 and a variance of 1 across cells.
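The scaling step is a z-score transformation applied to each gene separately. A minimal sketch, assuming the sample standard deviation (n - 1 in the denominator) is used:

```python
from statistics import mean, stdev


def scale_gene(values):
    """Center the values of one gene to mean 0 and scale them to variance 1."""
    gene_mean = mean(values)
    gene_sd = stdev(values)
    if gene_sd == 0:
        # A constant gene carries no information; return zeros instead of dividing by 0.
        return [0.0 for _ in values]
    return [(v - gene_mean) / gene_sd for v in values]


# Made-up normalized expression of one gene across five cells.
scaled = scale_gene([1.0, 2.0, 3.0, 4.0, 5.0])
print([round(v, 3) for v in scaled])
```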

Dimensionality reduction

Principal Component Analysis

The dimensionality of the dataset is reduced by principal component analysis (PCA). To put it briefly, PCA simplifies further analysis by creating components that reflect the variation in the dataset. 

Selecting principal components

The principal components that explain most of the variance are used for clustering and UMAP/tSNE embedding. The number of principal components to include is determined in an automated fashion. We recommend re-evaluating the selected number of PCs for your own analysis.
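When re-evaluating the number of PCs, one simple heuristic is to keep the smallest number of components whose cumulative explained variance reaches a chosen threshold. The sketch below shows this heuristic only as an example; it is not necessarily the automated method used in our analysis, and the threshold and variances are made up.

```python
def n_components_for_variance(variances, threshold: float = 0.9) -> int:
    """Return the smallest number of PCs explaining at least `threshold` of the variance."""
    total = sum(variances)
    if total <= 0:
        raise ValueError("variances must sum to a positive number")
    cumulative = 0.0
    # Components are considered from most to least variance explained.
    for n, var in enumerate(sorted(variances, reverse=True), start=1):
        cumulative += var
        if cumulative / total >= threshold:
            return n
    return len(variances)


# Made-up variances explained by six principal components.
pc_variances = [40.0, 25.0, 15.0, 10.0, 6.0, 4.0]
print(n_components_for_variance(pc_variances, threshold=0.85))
```

Inspecting an elbow plot of the variance per component is a common complementary check.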

Creating UMAP and tSNE

For visualization, the cells are embedded in a two-dimensional space by UMAP or tSNE. These non-linear algorithms try to capture the underlying manifold of the data and place similar cells close to each other. Cells that are grouped together during clustering are expected to co-localize on UMAP and tSNE plots. In our analysis, the UMAP and tSNE embeddings are used to visualize clusters, libraries, gene expression, and quality metrics.

The most relevant parameter is the number of principal components used. 

Clustering

Determining clusters

The cells are clustered into subpopulations to find biologically meaningful trends in the single-cell data. They are grouped together based on how similar their gene expression profiles are. Relevant parameters are the resolution and the number of principal components used. The number of principal components is the same as the one used to create the UMAP and tSNE.

Differential expression analysis between clusters

Differential expression analysis is performed to find the genes that mark the difference between clusters. In this step, p-values are computed for differentially expressed genes between cell clusters. The enriched genes characterize the cluster. The statistical test used is the Wilcoxon rank-sum test, which ranks the expression values of a gene across cells and compares the ranks between the selected cluster and the remaining cells. Other relevant parameters are:

  • whether to analyze only genes that are upregulated in the selected cluster relative to the rest of the clusters, or both up- and downregulated genes;
  • the minimum percentage of cells in which a gene is expressed in either the selected cluster or the remaining clusters;
  • the threshold for how much log-fold difference a gene needs to have to be included in the results;
  • the threshold for the p-value that a gene needs to have to be included.
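To make the statistical test more concrete, the sketch below implements a Wilcoxon rank-sum (Mann-Whitney U) test for one gene in plain Python, using average ranks for ties and a normal approximation without continuity correction. It is a simplified illustration; the implementation in Seurat may differ in details, and the expression values are made up.

```python
import math


def rank_with_ties(values):
    """Return 1-based ranks (ties get their average rank) and the tie group sizes."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_sizes = []
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        tie_sizes.append(j - i + 1)
        i = j + 1
    return ranks, tie_sizes


def wilcoxon_rank_sum(x, y):
    """Return (U statistic of x, two-sided p-value) via the normal approximation."""
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both groups need at least one value")
    n = n1 + n2
    ranks, tie_sizes = rank_with_ties(list(x) + list(y))
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    tie_term = sum(t ** 3 - t for t in tie_sizes) / (n * (n - 1))
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term)
    if var_u <= 0:
        return u, 1.0  # all values identical: no evidence of a difference
    z = (u - n1 * n2 / 2) / math.sqrt(var_u)
    return u, math.erfc(abs(z) / math.sqrt(2))


# Made-up expression of one gene in a cluster versus all remaining cells.
in_cluster = [2.1, 2.5, 3.0, 2.8, 3.3]
other_cells = [0.0, 0.4, 0.0, 1.1, 0.2, 0.0]
u_stat, p_value = wilcoxon_rank_sum(in_cluster, other_cells)
print(u_stat, p_value < 0.05)
```

In a full analysis this test is repeated for every gene and cluster, and the thresholds listed above then decide which genes are reported.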

 

 

Data storage

Once you have received your data and analysis results, we will store them in an Amazon S3 Standard Bucket for 30 days. This gives us enough time in case we need to help you with any questions you may have. After 30 days, we move your data to Amazon’s Glacier Deep Archive, which provides long-term and secure data storage.
