Introduction

Recent advances in single cell RNA-seq (scRNA-seq) have enabled an unprecedented level of granularity in characterizing gene expression changes in disease models. Multiple single cell analysis methodologies have been developed to detect gene expression changes and to cluster cells by similarity of gene expression. However, the classification of clusters by cell type relies heavily on known marker genes, and the annotation of clusters is performed manually. This strategy suffers from subjectivity and limits adequate differentiation of closely related cell subsets. Here, we present SingleR, a computational method for unbiased cell type recognition of scRNA-seq. SingleR leverages reference transcriptomic datasets of pure cell types to infer the cell of origin of each of the single cells independently. SingleR’s annotations combined with Seurat, a processing and analysis package designed for scRNA-seq, provide a powerful tool for the investigation of scRNA-seq data. We developed an R package to generate annotated scRNA-seq objects that can then use the SingleR web tool for visualization and further analysis of the data – http://comphealth.ucsf.edu/SingleR.

Here we explain in details the SingleR pipeline and present examples of applying SingleR on publicly available mouse and human scRNA-seq datasets.

SingleR specifications

Reference set: A comprehensive transcriptomic dataset (microarray or RNA-seq) of pure cell types, preferably with multiple samples per cell type.

Single-cell set: Single-cell RNA-seq dataset. It is a good practice to filter-out cells with non-sufficient genes identified and genes with non-sufficient expression across cells. As default we filter-out cells containing less than 500 genes and consider genes which were found in at least one sample. The effect of the gene number threshold on accuracy is discussed in Supplementary Information 2.

Annotation: SingleR runs in two modes: (1) Single cell: the annotation is performed for each single cell independently. (2) Cluster: the annotation is performed on predefined clusters, where the expression of a cluster is the sum expression of all cells in the cluster.

Step 1: Spearman correlations

Spearman coefficient is calculated for single cell expression with each of the samples in the reference dataset. The correlation analysis is performed only on variable genes in the reference dataset. The example below shows a correlation between the expression of a single cell (x-axis) and a reference sample (y-axis). Each dot in this scatter plot is a gene:

load (file.path(path,'GSE74923.RData'))
SingleR.DrawScatter(sc_data = singler$seurat@data,cell_id = 10, 
                    ref = immgen, sample_id = 232)

  • Variable genes: SingleR supports two modes for choosing the variable genes in the reference dataset.
  1. ‘sd’ - genes with a standard deviation across all samples in the reference dataset over a threshold. We choose thresholds such that we start with 3000-4000 genes.
  2. ‘de’ - top N genes that have a higher median expression in a cell type compared to each other cell type. We use a varying N, depending on the number of cell types used in the analysis (K):

\(N=round(333 *\frac{2}{3}^{log_2K})\)

This function creates the distribution below. The y-axis shows the number of differentially expressed genes used for each pair of cell types used in the analysis. In the first round, when there are many cell types to scores, SingleR uses only a small number of genes, but in the last rounds of the fine-tuning, this number increases up to 500 in the last iteration:

n.cells = seq(2:500)
n.genes = unlist(lapply(n.cells,FUN=function(x) 
  round(500*(2/3)^(log2(x)))))
df = data.frame(n.cells,n.genes)
ggplot(df,aes(x=n.cells,y=n.genes))+geom_point()+
  xlab('# of cell types')+ylab('# of genes')+theme_classic()

More details can be found in the SingleR code. This mode is the default mode and is used in all the studies presented in the manuscript.

Step 2: Aggregation of scores by cell types

Multiple correlation coefficients per cell type are aggregated according to the named annotations of the reference dataset to provide a single value per cell type per single cell. As described above, the samples are aggregated by broad cell types (‘main’) or with higher granularity of cell subsets. The default is the 80th percentile of the correlation values per cell type.

Below is an example for the annotation process for one human single cell. Here the dots are Spearman coefficients of all reference samples (using the Blueprint+Encode reference) with one single cell. The Spearman coffiecients were aggregated by cell types (here, a reduced set of the main cell types for simplicity). The SingleR score for each cell type is the 80 percentile in each of the boxplots. This cell type is clearly a T-cell or an NK cells, but its not clear what type exactly.

# for visualization purposes we present a subset of cell types (defined in labels.use)
load (file.path(path,'SingleR.Zheng.200g.RData'))
id = 9941
out = SingleR.DrawBoxPlot(sc_data = singler$seurat@data,cell_id = id, 
                          ref = blueprint_encode,main_types = T,
                          labels.use=c('B-cells','CD4+ T-cells','CD8+ T-cells', 'DC', 
                                       'Monocytes', 'NK cells', 'Macrophages',
                                       'Mast cells','Neutrophils',
                                       'Fibroblasts','HSC',
                                       'Endothelial cells'))
## [1] "Number of DE genes:2375"
## [1] "Number of cells: 2"
print(out$plot)

The analysis above grouped together cell subsets and states to main cell types. SingleR allows a more granular view of cell types (showing only top scoring cell types):

out = SingleR.DrawBoxPlot(sc_data = singler$seurat@data,cell_id = id, 
                          ref = blueprint_encode,main_types = F)
## [1] "Number of DE genes:3782"
## [1] "Number of cells: 2"
out$plot$data = out$plot$data[out$plot$data$Spearman>0.27,]
out$plot$data$Types = factor(out$plot$data$Types)
print(out$plot)