SingleCellProjections.jl - Fast Single Cell Expression analysis

07/28/2023, 7:30 PM — 8:00 PM UTC

32-G449 (Kiva)

Abstract:

We present an easy-to-use and powerful package that enables analysis of Single Cell Expression data in Julia. It is faster and uses less memory than existing solutions, since the data is internally represented as expressions of sparse and low-rank matrices instead of being stored as huge dense matrices. In particular, it efficiently performs PCA (Principal Component Analysis), a natural starting point for downstream analysis, and supports both standard workflows and projections onto a base data set.

Description:

Using Single Cell RNA sequencing, it is today possible to generate data sets with gene expression levels for >30k genes and hundreds of thousands of cells. Explorative analysis of these rich data sets is important, but challenging with existing tools that store the transformed and normalized data as dense matrices. (A single dense matrix with 30k genes and 500k cells takes 120 GB of RAM when stored in double precision.) SingleCellProjections.jl is the first comprehensive Julia package for processing Single Cell Expression data. It is at least 5 times faster and uses less than 1/5 of the memory when doing the same analysis as existing packages written in other languages.
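The memory figure above can be checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope estimate: it assumes Float64 values for the dense matrix and, for comparison, a compressed sparse column layout with Float64 values and Int32 row indices at 5% nonzeros (the lower end of the density quoted for raw count matrices). These storage choices are assumptions for illustration, not a statement about what any particular package uses.

```julia
# Back-of-the-envelope memory estimate (illustrative assumptions only).
n_genes = 30_000
n_cells = 500_000

# Dense storage: one Float64 (8 bytes) per matrix entry.
dense_bytes = n_genes * n_cells * sizeof(Float64)
dense_gb = dense_bytes / 1e9

# Sparse (CSC) storage: one value and one row index per nonzero entry.
# The column pointer array is negligible at this scale and is ignored.
density = 0.05
nnz_est = round(Int, density * n_genes * n_cells)
sparse_bytes = nnz_est * (sizeof(Float64) + sizeof(Int32))
sparse_gb = sparse_bytes / 1e9

println("dense:  ", dense_gb, " GB")
println("sparse: ", sparse_gb, " GB")
```

Under these assumptions the sparse counts are more than ten times smaller than a dense matrix of the same shape, which is why densifying the data during normalization is the expensive step.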

SingleCellProjections.jl supports the standard workflow for Single Cell Expression data: Sparse matrices with raw gene expression counts (~5-10% nonzeros) are transformed (e.g. using SCTransform or log transform) and then normalized (e.g. mean-center, regress out covariates). Next, a truncated SVD (i.e. Principal Component Analysis) is computed to bring the data down from 30k dimensions to ~100 dimensions, which also serves as noise reduction. This is a great starting point for downstream analysis, partly because the data set is now much smaller. UMAP and t-SNE visualizations are supported using external packages, and Force Layout (also known as SPRING plot) support is built-in.
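To make the order of the steps concrete, here is a minimal sketch of the workflow on toy data, written with Julia's standard libraries only. It does not use the SingleCellProjections.jl API, the variable names are made up for illustration, and it deliberately materializes the dense centered matrix and uses a full dense SVD, which is exactly what the package is designed to avoid on real data.

```julia
using SparseArrays, LinearAlgebra, Random

Random.seed!(1)
n_genes, n_cells, k = 200, 500, 10   # toy sizes, not realistic ones

# Toy raw counts: sparse matrix with ~5% nonzeros and integer-valued entries.
counts = sprand(n_genes, n_cells, 0.05)
nonzeros(counts) .= ceil.(20 .* nonzeros(counts))

# Transform: log(1 + x) on the stored entries keeps the matrix sparse.
transformed = copy(counts)
nonzeros(transformed) .= log1p.(nonzeros(transformed))

# Normalize: mean-center each gene (row). This makes the matrix dense,
# which is acceptable only because the toy matrix is tiny.
gene_means = vec(sum(transformed, dims=2)) ./ n_cells
centered = Matrix(transformed) .- gene_means

# Reduce: truncated SVD, keeping the k leading components.
F = svd(centered)
scores = Diagonal(F.S[1:k]) * F.Vt[1:k, :]   # k x n_cells reduced data

println(size(scores))
```

The reduced `scores` matrix has one column per cell and only `k` rows, and it is this small matrix that would be handed to UMAP, t-SNE or a force layout.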

As the name indicates, SingleCellProjections.jl is also built for projections. A common use case is that there is a good, well-annotated reference data set (e.g. healthy cells), and the user wants to relate their own, newly generated data set (e.g. cancer cells) to it. By projecting the new data onto the reference data set, similarities and differences can be interpreted in terms of the reference data set. Projection is here used in a broad sense, describing all steps after loading the raw count data until the analysis is done. Computing the projection in SingleCellProjections.jl is very simple:

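# Load the raw counts for the new data set.
new_data = load_counts(filepaths)
# Apply every analysis step stored for the base data set to the new data.
project(new_data, base)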

Under the hood, SingleCellProjections.jl has stored a ProjectionModel for each analysis step that was applied to the base data set, with all the information needed to compute the projection for that step, and applies the models one by one to project the new data. Note that projecting is rarely the same as running the same analysis step independently on the new data, since each model is built from the base data set. This can be subtle and easy to forget, but the simple interface hides this complexity from the user, minimizing the risk of mistakes. At the same time, it is easy to customize some steps if needed.
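The difference between projecting and re-running a step independently is easiest to see for mean-centering. The sketch below uses plain Julia arrays rather than the package's ProjectionModel types, and all names (`base_means`, `base_axes`, `newdata`) are hypothetical. It assumes a toy setting where the new data is shifted relative to the base data: projecting with the stored base means preserves that shift, while re-centering the new data on its own means removes it.

```julia
using LinearAlgebra, Random

Random.seed!(2)
n_genes, k = 50, 3
basedata = randn(n_genes, 400)          # toy reference data, already transformed
newdata  = randn(n_genes, 100) .+ 2.0   # toy new data with shifted expression

# What the steps "learned" from the base data (stand-ins for stored models).
base_means = vec(sum(basedata, dims=2)) ./ size(basedata, 2)
base_axes  = svd(basedata .- base_means).U[:, 1:k]

# Projection: reuse the base means and the base principal axes.
projected = base_axes' * (newdata .- base_means)

# Independent analysis of the centering step: re-estimate means from newdata.
new_means   = vec(sum(newdata, dims=2)) ./ size(newdata, 2)
independent = base_axes' * (newdata .- new_means)

# Average position of the new cells in the reduced space, for both variants.
shift_projected   = norm(sum(projected, dims=2) ./ size(newdata, 2))
shift_independent = norm(sum(independent, dims=2) ./ size(newdata, 2))

println("projected:   ", shift_projected)
println("independent: ", shift_independent)
```

In the independent variant the new cells are centered at the origin by construction, so any systematic difference from the reference is lost, whereas the projected coordinates remain directly comparable to those of the base data.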

The key to performance and low memory usage in SingleCellProjections.jl is to never store large, dense matrices. Instead, matrix expressions are created and manipulated to implicitly represent the same information internally. As a motivating example, consider a sparse matrix S and let A := S - m1ᵀ be the matrix after mean-centering. Here, m is a vector with the mean for each gene (variable) and 1 is a vector of ones. SingleCellProjections.jl avoids materializing the large dense matrix A by working with the expression object directly. This strategy generalizes to more advanced transforms and normalizations, yielding slightly more complicated matrix expressions, consisting of sparse and/or low-rank terms and factors.

Continuing the example from above, note that it is much more efficient to compute AX for some matrix X by distributing over the sum and evaluating SX - m(1ᵀX) than to work with the materialized matrix A. Randomized subspace iterations (Halko et al.) are used to compute the truncated SVD (i.e. PCA), relying only on such matrix-matrix products. To efficiently compute the matrix-matrix products, SingleCellProjections.jl internally solves a generalized Matrix Chain Multiplication problem, taking both size and structure of the matrices into account.
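The following sketch illustrates the idea on toy data using only Julia's standard libraries. It is not the package's implementation: the helper names (`mul_A`, `mul_At`, `randomized_svd`) and the parameters `oversample` and `n_iter` are made up for this example, and the subspace iteration is a basic variant in the spirit of Halko et al. The point is that the mean-centered matrix A = S - m1ᵀ is only ever applied to tall, thin matrices, never formed.

```julia
using SparseArrays, LinearAlgebra, Random

Random.seed!(3)
n_genes, n_cells, k = 300, 1000, 5

S = sprand(n_genes, n_cells, 0.05)      # toy sparse data
m = vec(sum(S, dims=2)) ./ n_cells      # per-gene means

# Products with A = S - m*1ᵀ, evaluated term by term without forming A.
mul_A(X)  = S * X .- m .* sum(X, dims=1)   # A*X  = S*X  - m*(1ᵀX)
mul_At(Y) = S' * Y .- m' * Y               # Aᵀ*Y = Sᵀ*Y - 1*(mᵀY)

# Basic randomized subspace iteration: needs only the two products above.
function randomized_svd(mulA, mulAt, n_cols, k; oversample=10, n_iter=4)
    # Start from the range of A applied to a random test matrix.
    Q = Matrix(qr(mulA(randn(n_cols, k + oversample))).Q)
    for _ in 1:n_iter
        # Alternate between Aᵀ and A, re-orthonormalizing for stability.
        Z = Matrix(qr(mulAt(Q)).Q)
        Q = Matrix(qr(mulA(Z)).Q)
    end
    B = Matrix(mulAt(Q)')               # B = QᵀA, a small (k+oversample) x n_cols matrix
    F = svd(B)
    return (U = Q * F.U[:, 1:k], S = F.S[1:k], Vt = F.Vt[1:k, :])
end

res = randomized_svd(mul_A, mul_At, n_cells, k)
println(res.S)
```

Note the grouping in `mul_A`: evaluating m(1ᵀX) first reduces X to a single row and then scales it by m, whereas evaluating (m1ᵀ)X would first build a dense n_genes x n_cells matrix. Choosing such groupings automatically is the kind of decision a matrix chain ordering makes. The returned singular values are approximations, and their accuracy depends on the spectrum of the data and on the chosen `oversample` and `n_iter`.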

Julia is very well suited for working with complicated, high-dimensional, biological data. In particular, the ability to write both high-level code and efficient low-level code has been immensely useful when implementing this package. Reproducibility, which is very important for scientific analyses, is also greatly improved by Julia, in part by Manifests, but also by execution speed, since the user is more likely to rerun an analysis than to load some partial result from disk. We hope that SingleCellProjections.jl is a good starting point for anyone who wants to perform Single Cell Expression analyses and benefit from the Julia language and its ecosystem.

References: Halko, Martinsson and Tropp, "Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions", SIAM Review, 2011.