scvi-tools 0.9.0 release

Today we officially released scvi-tools version 0.9.0 (changelog). This release marks the culmination of five months of work on the backend of the codebase, which came after three months of work on the frontend. In this short note, we officially introduce scvi-tools as a readily usable codebase that contains many implementations of probabilistic single-cell omics methods, and also features a high-level interface to accelerate the model development process. We start with some historical notes about our previous codebase, which was mostly used for internal development over the last three years. We then describe the obstacles we found to its external adoption, and the foundational idea behind the new scvi-tools work: a high-level deep probabilistic programming library specialized for single-cell omics data.

Hyperparameter search for scVI

While stochastic gradient-based optimization is highly successful for setting weights and other differentiable parameters of a neural network, it is in general useless for setting hyperparameters -- non-differentiable parameters that control the structure of the network (e.g., the number of hidden layers, or the dropout rate) or settings of the optimizer itself (e.g., the learning rate schedule). Yet finding good settings for hyperparameters is essential for good performance of deep methods like scVI. Furthermore, as pointed out by Hu and Greene (2019), selecting hyperparameters is necessary in order to compare different machine learning models, especially if those are substantially sensitive to hyperparameter variations.
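
Because hyperparameters cannot be reached by gradient descent, they are usually chosen by an outer search loop that trains and scores one configuration at a time. The block below is a minimal sketch of random search, one simple strategy of that kind; it is not necessarily the procedure used for scVI. The search space, the `toy_validation_loss` function, and all other names are assumptions made for illustration: in practice the objective would train a model with the given configuration and return its held-out loss.

```python
import math
import random

# Hypothetical search space; names and candidate values are illustrative only.
SEARCH_SPACE = {
    "n_layers": [1, 2, 3],
    "dropout_rate": [0.1, 0.3, 0.5],
    "learning_rate": [1e-2, 5e-3, 1e-3, 5e-4],
}


def sample_config(rng):
    """Draw one configuration uniformly at random from the search space."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}


def toy_validation_loss(config):
    """Synthetic stand-in for 'train the model, return the held-out loss'.

    It is not differentiable in n_layers (an integer), which is why an
    outer search is needed rather than gradient descent.
    """
    return (
        (config["n_layers"] - 2) ** 2
        + (config["dropout_rate"] - 0.3) ** 2
        + (math.log10(config["learning_rate"]) + 3) ** 2
    )


def random_search(objective, n_trials, seed=0):
    """Evaluate n_trials random configurations and return the best one."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        config = sample_config(rng)
        trials.append((objective(config), config))
    best_loss, best_config = min(trials, key=lambda trial: trial[0])
    return best_loss, best_config, trials


best_loss, best_config, trials = random_search(toy_validation_loss, n_trials=20)
print("best configuration:", best_config)
print("best validation loss:", best_loss)
```

When two models are compared, running the same search budget for each is one way to keep the comparison from depending on how carefully either model was tuned.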

Should we zero-inflate scVI?

Droplet-based single-cell RNA sequencing (scRNA-seq) datasets typically contain at least 90% zero entries. How can we best model these zeros? Recent work focused on modeling zeros with a mixture of count distributions. The first component is meant to reflect whether such an entry can be explained solely by the limited amount of sampling (on average ~5% or less of the molecules in the cell). The second component is generally used to reflect "surprising" zeros caused by measurement bias, transient transcriptional noise (e.g., a "bursty" gene with a short mRNA half-life), or true longer-term heterogeneity that cannot be captured by a simplified (low-dimensional) representation of the data. Among others, zero-inflated distributions (e.g., zero-inflated negative binomial) have been widely adopted to model gene expression levels (1, 2).
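
To make the two-component mixture concrete, here is a minimal sketch of a zero-inflated negative binomial (ZINB) probability mass function. It assumes one common parameterization, with mean `mu`, inverse dispersion `theta`, and zero-inflation probability `pi`; these names, the helper functions, and the example values are illustrative assumptions and may differ from the parameterization used in scVI.

```python
import math


def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under a negative binomial with mean mu
    and inverse dispersion theta (variance = mu + mu**2 / theta)."""
    return (
        math.lgamma(x + theta)
        - math.lgamma(theta)
        - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )


def zinb_pmf(x, mu, theta, pi):
    """Probability of count x under a ZINB: with probability pi the entry is
    an extra ("surprising") zero, otherwise it is drawn from the NB."""
    nb = math.exp(nb_log_pmf(x, mu, theta))
    if x == 0:
        return pi + (1 - pi) * nb
    return (1 - pi) * nb


# Illustrative values only: a lowly expressed gene.
mu, theta, pi = 0.5, 2.0, 0.3
p_zero_nb = math.exp(nb_log_pmf(0, mu, theta))  # zeros from limited sampling alone
p_zero_zinb = zinb_pmf(0, mu, theta, pi)        # zeros after adding zero-inflation
print("P(0) under NB:  ", p_zero_nb)
print("P(0) under ZINB:", p_zero_zinb)
```

With these values the plain negative binomial already assigns a probability of about 0.64 to a zero, which shows how limited sampling alone can account for many zeros; the zero-inflation component raises it to about 0.75.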