Today we officially released
scvi-tools version 0.9.0 (changelog). This release marks the culmination of five months of work on the backend of the codebase, which came after three months of work on the frontend.
In this short note, we officially introduce
scvi-tools as a readily usable codebase that contains many implementations of probabilistic single-cell omics methods, and also features a high-level interface to accelerate the model development process. We start with some historical notes about our previous codebase, which was mostly used for internal development over the last three years. We then describe the obstacles we encountered to its external adoption, and the foundational idea behind the new
scvi-tools work: a high-level deep probabilistic programming library specialized for single-cell omics data.
Taking a step back: the original scvi codebase
Many members of the Yosef Lab, and in particular Jeff Regier, Edouard Mehlman, Romain, and Adam, helped conceive, develop, and maintain the
scvi codebase. The initial philosophy was to make the code available for users to run scVI, but also to have a proper codebase for developing novel algorithms for single-cell omics analysis. Over the last three years, we have hosted seven visiting graduate students who wrote their Master's theses in the Yosef Lab by building new functionalities, as well as new algorithms (including scANVI, AutoZI and gimVI) directly into the
scvi codebase. Even our most recent work, such as totalVI for CITE-seq modeling, and our decision-making procedure for differential expression was also developed this way.
However, our ambition at the time was to use this infrastructure solely for internal development. Consequently, we made some design mistakes, chief among them not building on the existing single-cell analysis infrastructure. First, we had built manual boilerplate code to read many different single-cell omics input formats. We had also incorporated customized plotting code while working on the dataset integration problem. Ideally, all of these functions would come from another package, such as
Scanpy. As a second striking example, each algorithm was not straightforward to use: it often required the end user to create multiple objects, such as data loaders, trainers, and models. This was confusing for users and hard to maintain on our end. Indeed, over a year ago we encountered a bug at benchmarking time in which scVI, by default, was trained on a single cell (instead of the whole training set)!
Identifying key improvements, creation of scvi-tools
Even though we were actively branding the
scvi codebase as a framework for creating new probabilistic models, we did not encounter clear success in this area. A notable exception is the
LDVAE model, which was the first model in
scvi developed by an external user of the codebase (thanks Valentine!). Later, we became aware of some suboptimal API choices we had made, and decided to improve the user experience in order to make the codebase more attractive. At this point in time, Galen joined the Yosef Lab and we rethought the foundations of the codebase. We set the following objectives:
- Host reimplementations of existing methods that are currently difficult to use.
- For all methods, provide a consistent and simplified user experience, along with tutorials that walk through a meaningful application.
- Focus on AnnData for the input data format (potentially reducing almost half of the code in the codebase), and use Scanpy for all other processing.
- Build explicit tutorials for interaction with the R ecosystem.
- Rewrite all the training code, relying instead on PyTorch Lightning.
- Add an interface to Pyro, in order to further automate inference.
- Build tutorials for model developers.
Today, most of these features are readily usable. Visit our landing page! We detail below some of the developments present in this release.
In this release, we introduced a new paradigm for building single-cell-focused probabilistic models, in which development is hyper-focused on the model at hand. Based on our experience building variational autoencoders for single-cell data, we identified several opportunities to abstract boilerplate code in a reusable way. We therefore built objects in the scvi-tools codebase to handle auxiliary tasks such as data loading, training, saving/loading, and device management. As an example, we wrote the
scvi.data module to handle
AnnData objects, including registration of model-specific tensors and generic "ann"-data loading into models. Consequently, model development solely focuses on (1) defining a probabilistic model/inference scheme and (2) expressing it in a structured way based on our abstract classes. Learn more with our tutorials.
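To make the registration idea concrete, here is a minimal, self-contained sketch of the pattern. Note that `DataRegistry`, `register`, and `minibatch` are hypothetical names for illustration, not the actual `scvi.data` API: a model declares the tensors it needs by name, and a registry serves consistently sliced minibatches regardless of the underlying data format.

```python
# Hypothetical sketch of the tensor-registration pattern (illustrative
# only; this is not the real scvi.data interface).

import numpy as np


class DataRegistry:
    """Maps model-facing tensor names to arrays in a dataset."""

    def __init__(self):
        self._fields = {}

    def register(self, name, array):
        # A model-specific tensor is registered under a stable name.
        self._fields[name] = np.asarray(array)

    def minibatch(self, indices):
        # Every registered tensor is sliced consistently, so model code
        # never touches the raw data format directly.
        return {name: arr[indices] for name, arr in self._fields.items()}


# Register a counts matrix and a per-cell batch label.
registry = DataRegistry()
registry.register("X", np.random.poisson(2.0, size=(100, 10)))
registry.register("batch", np.random.randint(0, 2, size=100))

batch = registry.minibatch(np.arange(8))
print(batch["X"].shape, batch["batch"].shape)  # (8, 10) (8,)
```

With a registry of this kind, the model class only ever sees named tensors, which is what lets development focus on the probabilistic model itself.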
With these new model building blocks, we were able to implement models external to our lab with relative ease. These included Stereoscope for deconvolving spatial transcriptomics data, Solo for detecting doublets in scRNA-seq data, and CellAssign for reference-based annotation of scRNA-seq data. These implementations required significantly fewer lines of code with scvi-tools.
In the case of Stereoscope, the reimplementation in scvi-tools took one afternoon and nearly 600 lines of code (the original codebase has 6,000 lines of code). The algorithm may also now be run directly from AnnData objects in a Jupyter notebook or in Google Colab. This may be more attractive to certain users than the original implementation, which was only usable from the command line interface.
Another feature we are excited about is the integration with Pyro, which further abstracts the process of manually deriving optimization objectives. The core Pyro team, who joined the Broad Institute a couple of months ago, recently released a simple reimplementation of our scANVI model in Pyro. We therefore highly encourage using Pyro for new model developments, although relying on Pyro to power a model remains completely optional. We anticipate that Pyro will be especially useful for automating inference for complex hierarchical Bayesian models, since writing the automatic differentiation variational inference (ADVI) recipe manually would require many lines of code, and the evidence lower bound would potentially be tedious to write.
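For context, the objective that such automation derives is the standard evidence lower bound. For a model with likelihood $p(x \mid z)$, prior $p(z)$, and variational family $q_\phi(z)$, it reads:

```latex
\mathrm{ELBO}(\phi)
  = \mathbb{E}_{q_\phi(z)}\left[\log p(x \mid z)\right]
  - \mathrm{KL}\left(q_\phi(z) \,\|\, p(z)\right)
  \le \log p(x)
```

Pyro assembles this objective automatically from the model and guide, which is precisely what spares developers from writing the ADVI recipe by hand for deep hierarchies.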
We also wrote a template GitHub repository to accelerate the package creation process. It includes template code to set up documentation, continuous integration testing, and popular code styling practices, along with example implementations of scVI in both PyTorch and Pyro.
Finally, thanks to our refactoring effort while implementing all of these models in the same codebase, we were able to broadcast new features across models. These include support for multiple (continuous or categorical) covariates when integrating data with scVI, scANVI, or totalVI. We are excited to see the impact of non-linear dataset integration extended in this way, and have already seen promising results in correcting, e.g., cell cycle effects. We also extended the scArches method for query/reference dataset integration to the scVI, scANVI, and totalVI models. This required implementing a single mixin class containing the core transfer learning logic.
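The mixin pattern referred to here can be sketched as follows. This is a minimal illustration; `TransferMixin`, `load_query_data`, and the attribute names are hypothetical, not the actual scvi-tools classes. The point is that the transfer-learning logic is written once and gained by any model that lists the mixin as a base.

```python
# Hypothetical sketch of sharing transfer-learning logic via a mixin
# (illustrative only; not the real scvi-tools implementation).


class TransferMixin:
    """Adds query-to-reference adaptation to any compatible model class."""

    @classmethod
    def load_query_data(cls, query_data, reference_model):
        # Initialize a fresh model on the query data, start it from the
        # reference model's parameters, and mark shared parameters as
        # frozen so only query-specific parts get fine-tuned.
        model = cls(query_data)
        model.params = dict(reference_model.params)
        model.frozen = True
        return model


class BaseModel:
    def __init__(self, data):
        self.data = data
        self.params = {"encoder": 0.0}
        self.frozen = False


class SCVILike(TransferMixin, BaseModel):
    """A model gains transfer learning just by inheriting the mixin."""


reference = SCVILike([1, 2, 3])
reference.params["encoder"] = 1.5  # pretend this was learned on reference data

query = SCVILike.load_query_data([4, 5], reference)
print(query.params["encoder"], query.frozen)  # 1.5 True
```

Because the logic lives in one class, extending a new model to support query/reference integration is a one-line change to its base-class list rather than a reimplementation.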
We are actively looking for users, as well as feedback! Integration into scvi-tools can take at least two forms: method developers may either have their method live directly in the external module of our codebase (as Stereoscope, gimVI, and CellAssign do so far), or clone our template and host an independent repository.