Single-Cell Sequencing with Seurat: Full Workflow from Raw Counts to UMAP

March 31, 2022

Single-cell RNA sequencing (scRNA-seq) is among the fastest-growing technologies in the life sciences, and Seurat is one of the most widely used R packages for analyzing it. This guide walks through the complete workflow, from a raw count matrix to UMAP visualization and clustering, with annotations for each step, so it should be approachable for beginners.

Installation and Loading

# Requires R 4.1 or higher, Seurat v5
install.packages("Seurat")
library(Seurat)
library(ggplot2)
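
Before going further, it can help to confirm that the environment matches the requirement noted in the comment above. The following is a minimal sketch using only base R; the variable names (r_ok, seurat_installed, seurat_v5) are illustrative and not part of Seurat.

```r
# Compare the running R version with the stated minimum (R 4.1)
r_ok <- getRversion() >= "4.1.0"

# Check for Seurat without attaching it; returns FALSE if it is not installed
seurat_installed <- requireNamespace("Seurat", quietly = TRUE)

# Only query the package version when the package is actually available
seurat_v5 <- if (seurat_installed) packageVersion("Seurat") >= "5.0.0" else NA

message("R >= 4.1: ", r_ok,
        "; Seurat installed: ", seurat_installed,
        "; Seurat >= 5: ", seurat_v5)
```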

Reading Data

# 10x Genomics output directory should contain:
# barcodes.tsv.gz, features.tsv.gz, matrix.mtx.gz
counts <- Read10X(data.dir = "data/sample1/")

seurat <- CreateSeuratObject(
  counts      = counts,
  project     = "my_project",
  min.cells   = 3,    # gene expressed in at least 3 cells (filter noise)
  min.features = 200  # cell expresses at least 200 genes (filter empty droplets)
)
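
To make the two filters concrete, here is a small base-R sketch of what min.features and min.cells do to a count matrix. The toy matrix and the lowered thresholds (2 instead of 200 and 3) are made up for illustration, and the order of operations (cells first, then genes) is an assumption about how the filters are applied.

```r
# Toy count matrix: 5 genes (rows) x 4 cells (columns); values are invented
counts_toy <- matrix(
  c(1, 0, 3, 0,
    0, 0, 2, 1,
    5, 0, 1, 2,
    0, 1, 0, 0,
    0, 0, 0, 0),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("gene", LETTERS[1:5]), paste0("cell", 1:4))
)

# Analogue of min.features: keep cells with at least 2 detected genes
genes_per_cell <- colSums(counts_toy > 0)
kept_cells <- counts_toy[, genes_per_cell >= 2, drop = FALSE]

# Analogue of min.cells: keep genes detected in at least 2 of the remaining cells
cells_per_gene <- rowSums(kept_cells > 0)
filtered <- kept_cells[cells_per_gene >= 2, , drop = FALSE]

dim(filtered)  # genes x cells remaining after both filters
```

In this toy case cell2 (one detected gene) is dropped, and then geneD and geneE are dropped because no remaining cell expresses them.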

Quality Control (QC)

# Calculate mitochondrial gene percentage (high % often indicates dying/lysed cells)
seurat[["percent.mt"]] <- PercentageFeatureSet(seurat, pattern = "^MT-")
# For mouse data use "^mt-"

# Visualize QC metrics
VlnPlot(seurat, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)

# Filter (thresholds depend on your data; these are examples)
seurat <- subset(seurat,
  subset = nFeature_RNA > 200 & nFeature_RNA < 5000 & percent.mt < 20)
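
The mitochondrial percentage is simply the share of a cell's counts that come from genes matching the pattern. The base-R sketch below reproduces that arithmetic on an invented 4-gene, 3-cell matrix and applies the same percent.mt < 20 cutoff used above; the object names are illustrative.

```r
# Toy counts: two mitochondrial genes and two nuclear genes; values are invented
counts_toy <- matrix(
  c(1, 0, 3,
    0, 0, 3,
    5, 5, 2,
    4, 5, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("MT-CO1", "MT-ND1", "ACTB", "CD3D"),
                  c("cell1", "cell2", "cell3"))
)

# Select genes whose names start with "MT-" (human convention)
is_mt <- grepl("^MT-", rownames(counts_toy))

# Percentage of each cell's total counts that come from mitochondrial genes
percent_mt <- 100 * colSums(counts_toy[is_mt, , drop = FALSE]) / colSums(counts_toy)

# Apply the example cutoff from the text
keep <- percent_mt < 20
names(percent_mt)[keep]  # cells that pass the filter
```

Here cell3 derives 60% of its counts from mitochondrial genes and would be removed.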

Normalization, Variable Features, and Scaling

seurat <- NormalizeData(seurat)  # Normalize total UMI per cell to 10,000, then log-transform

seurat <- FindVariableFeatures(seurat, nfeatures = 2000)  # Identify highly variable genes

seurat <- ScaleData(seurat)  # Center and scale each gene (z-score) so highly expressed genes do not dominate PCA; by default only the variable features are scaled
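
The arithmetic behind normalization and scaling can be sketched in base R. This is an illustration on an invented 2-gene, 3-cell matrix, not Seurat's implementation: it divides each cell by its total, multiplies by 10,000, applies log1p, and then z-scores each gene across cells.

```r
# Toy counts: 2 genes (rows) x 3 cells (columns); values are invented
counts_toy <- matrix(
  c(1, 2, 3,
    9, 0, 5),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("geneA", "geneB"), c("cell1", "cell2", "cell3"))
)

# Log-normalization: scale every cell to 10,000 total counts, then log(1 + x)
per_10k <- sweep(counts_toy, 2, colSums(counts_toy), "/") * 1e4
norm <- log1p(per_10k)

# Scaling: z-score each gene across cells (scale() works on columns, so transpose)
scaled <- t(scale(t(norm)))

rowMeans(scaled)       # approximately 0 for every gene
apply(scaled, 1, sd)   # 1 for every gene
```

Note that library size is handled in the normalization step (per cell), while scaling operates per gene.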

Dimensionality Reduction and Clustering

seurat <- RunPCA(seurat)        # PCA dimensionality reduction
ElbowPlot(seurat)               # Check elbow to decide number of PCs (usually 10-20)
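
Reading the elbow by eye is subjective, so a rough numeric check can help. The sketch below uses a made-up vector of PC standard deviations (in a real analysis these would come from the PCA result, for example via Stdev(seurat, reduction = "pca")) and picks the first PC at which the cumulative share of variance reaches 90%. The 90% cutoff is an arbitrary assumption, and the share is relative to the computed PCs only, not to the total variance in the data.

```r
# Hypothetical standard deviations for 8 PCs; values are invented
pc_sd <- c(10, 6, 4, 2, 1, 1, 1, 1)

# Share of variance per PC, relative to the PCs that were computed
var_explained <- pc_sd^2 / sum(pc_sd^2)
cum_var <- cumsum(var_explained)

# First PC at which the cumulative share reaches the (arbitrary) 90% cutoff
n_pcs <- which(cum_var >= 0.90)[1]
n_pcs
```

A heuristic like this is a starting point; checking that clusters are stable across nearby choices of dims is still worthwhile.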

seurat <- FindNeighbors(seurat, dims = 1:15)     # Build KNN and shared-nearest-neighbor (SNN) graphs from the first 15 PCs
seurat <- FindClusters(seurat, resolution = 0.5) # Higher resolution = more clusters
seurat <- RunUMAP(seurat, dims = 1:15)           # UMAP visualization

DimPlot(seurat, label = TRUE)   # Plot UMAP, each point is a cell
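
Graph-based clustering rests on the idea that cells sharing many nearest neighbors belong together. The base-R sketch below illustrates that idea on six invented 2D points: it finds each point's nearest neighbors and scores every pair by the Jaccard overlap of their neighbor sets. It is a conceptual illustration under simplifying assumptions (k = 2, each cell counted as its own neighbor), not a reproduction of what FindNeighbors computes internally.

```r
# Six toy "cells" in 2D forming two well-separated groups; coordinates are invented
pts <- rbind(c(0, 0), c(0, 1), c(1, 0),
             c(10, 10), c(10, 11), c(11, 10))
rownames(pts) <- paste0("cell", 1:6)

# Pairwise Euclidean distances
d <- as.matrix(dist(pts))

# Neighbor set per cell: itself plus its k closest cells
k <- 2
knn <- lapply(seq_len(nrow(d)), function(i) order(d[i, ])[seq_len(k + 1)])

# Jaccard overlap between two neighbor sets
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Shared-neighbor similarity for every pair of cells
n <- nrow(pts)
snn <- matrix(0, n, n, dimnames = list(rownames(pts), rownames(pts)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    snn[i, j] <- jaccard(knn[[i]], knn[[j]])
  }
}

snn["cell1", "cell2"]  # same group: neighbor sets overlap completely
snn["cell1", "cell4"]  # different groups: no shared neighbors
```

A community-detection step then groups cells that are strongly connected in such a graph; the resolution parameter controls how finely the graph is split.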

Finding Marker Genes and Annotating Cell Types

# Find specifically upregulated genes for each cluster
markers <- FindAllMarkers(
  seurat,
  only.pos = TRUE,   # only upregulated genes
  min.pct  = 0.25,   # gene detected in at least 25% of cells in either of the two groups compared
  logfc.threshold = 0.25
)

# View top 10 markers for cluster 0
head(subset(markers, cluster == 0), 10)
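
Looking at one cluster at a time gets tedious, so a common next step is to pull the top few markers for every cluster at once. The sketch below does this in base R on a small invented data frame that mimics a few columns of the marker table (gene, cluster, avg_log2FC, p_val_adj); the numbers and the 0.05 cutoff are made up for illustration.

```r
# Toy marker table; values are invented and only a few columns are mimicked
markers_toy <- data.frame(
  gene       = c("CD3D", "IL7R", "CCR7", "CD14", "LYZ", "S100A9"),
  cluster    = factor(c(0, 0, 0, 1, 1, 1)),
  avg_log2FC = c(2.1, 1.4, 0.9, 3.0, 2.5, 0.4),
  p_val_adj  = c(1e-50, 1e-20, 0.2, 1e-80, 1e-60, 1e-3),
  stringsAsFactors = FALSE
)

# Keep markers that pass the adjusted p-value cutoff
sig <- markers_toy[markers_toy$p_val_adj < 0.05, ]

# Sort by cluster, then by descending log fold change
sig <- sig[order(sig$cluster, -sig$avg_log2FC), ]

# Take the top 2 rows within each cluster and stack the results
top2 <- do.call(rbind, lapply(split(sig, sig$cluster), head, n = 2))
top2$gene
```

The same pattern applies to the real markers table by swapping markers_toy for markers and adjusting n.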

# Visualize known markers: CD3D (T cells), CD14 (monocytes), MS4A1 (B cells)
FeaturePlot(seurat, features = c("CD3D", "CD14", "MS4A1"))
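
Once marker expression suggests an identity for each cluster, the labels have to be attached to the cells. A minimal base-R sketch of that mapping follows; the cluster-to-type assignments are hypothetical and would need to be decided from your own marker results.

```r
# Cluster assignment per cell, as a stand-in for the clustering output; values are invented
cluster_ids <- factor(c("0", "1", "0", "2", "1"))

# Hypothetical mapping from cluster ID to cell type, chosen from marker expression
cluster_to_type <- c("0" = "T cells", "1" = "Monocytes", "2" = "B cells")

# Look up each cell's label by its cluster ID (as character, not factor code)
cell_type <- unname(cluster_to_type[as.character(cluster_ids)])
cell_type

# With a real object, the same lookup could be stored as metadata, e.g.:
# seurat$cell_type <- unname(cluster_to_type[as.character(seurat$seurat_clusters)])
```

Converting the factor with as.character matters: indexing a named vector by a factor directly would use the integer codes rather than the cluster names.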

Runtime and Memory Reference

10,000 cells: roughly 10-15 minutes on a standard laptop; 8 GB RAM is sufficient
50,000 cells: 32 GB+ RAM recommended; ScaleData and FindClusters are the bottlenecks, so run on a server
100,000+ cells: consider the Seurat v5 sketch workflow, or switch to AnnData/Scanpy (Python)
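
A back-of-the-envelope calculation shows why memory grows quickly with cell number. The sketch below estimates the size of a dense numeric matrix, assuming 8 bytes per value; the helper dense_gb is hypothetical, and the figure of 20,000 genes is an assumed round number rather than a measurement.

```r
# Approximate size in GiB of a dense numeric matrix (8 bytes per double)
dense_gb <- function(n_rows, n_cols, bytes_per_value = 8) {
  n_rows * n_cols * bytes_per_value / 1024^3
}

# Scaled data for 2,000 variable features across 50,000 cells
dense_gb(2000, 50000)

# The same cells if all of an assumed 20,000 genes were stored densely
dense_gb(20000, 50000)
```

Actual usage is higher than these figures because several matrices and intermediate copies coexist during a run, which is one reason to restrict scaling to the variable features.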