Blog | scvi-tools

scvi-tools 1.3 release

July 3, 2025 · 9 min read

Introduction#

We’re proud to introduce scvi‑tools v1.3, encompassing major advances in modeling, data loading, computational scalability, metric integration, and interpretability in single-cell analytics.

The release features ten new or enhanced models, optimized for spatial, cytometry, methylation, perturbation, and multi-omic data. It also introduces custom data loaders for large-scale datasets, multi-GPU model training, on-the-fly metric tuning, and integrated model interpretability.

This article covers each enhancement in depth, with illustrative figures and manuscript references.

1. 🔬 New Models#

ResolVI#

ResolVI¹ is a spatial transcriptomics denoising model that reallocates mis-assigned gene counts among true cells, neighborhood leakage, and background. It employs a Gaussian-mixture latent prior to learn corrected counts and interpretable embeddings. This approach is highly scalable (handling >1 million spots) and offers downstream capabilities like differential expression and transfer learning on corrected data.

In the tutorial, ResolVI is shown to markedly improve spatial expression accuracy in noisy segmentation settings, enabling reliable differential expression, especially in high-throughput spatial transcriptomics (ST) datasets.

Figure 1: ResolVI cell type annotations based on noisily segmented Xenium data of a mouse brain. The left hemisphere was used for model training and the right hemisphere for transfer mapping.

scVIVA#

scVIVA² augments spatial transcriptomics analysis by jointly modeling each cell’s own expression and its micro-environmental context (neighborhood composition and gene counts). This niche-aware VAE embeds both cellular identity and environmental features, revealing tissue-specific patterns and environment-driven variation. Its latent embeddings delineate tissue-specific structures - ideal for spatial differential abundance or niche-focused clustering studies.

A dedicated tutorial showcases how scVIVA enables niche-focused clustering and differential abundance analyses.

Figure 2: scVIVA results. The x-axis shows the median log-fold change (LFC) of upregulated genes in $\textit{G1}$ vs $\textit{G2}$, and the y-axis shows the differential expression computed between $\textit{N1}$ and $\textit{G2}$. Genes are colored by their marker label (yellow = significantly upregulated in $\textit{G1}$ vs $\textit{N1}$, green otherwise). We also display the classifier decision boundary (the predicted probability of being in the yellow class).

CytoVI#

CytoVI³ brings totalVI-inspired modeling to cytometry and mass cytometry data. It models protein-marker distributions, corrects for dropouts and technical batch variation, and generates embeddings for downstream clustering and abundance inference.

An early tutorial already demonstrates clear delineation of immune subpopulations across batch-affected datasets.

VIVS#

Variational Inference for Variable Selection (VIVS⁴) identifies associations across modalities, such as gene–protein couplings, while rigorously controlling false discovery rates using conditional randomization. VIVS achieves interpretable and scalable feature selection, enabling discovery of biologically meaningful links in paired datasets.

The scvi-tools tutorial will be updated soon.

SysVI#

SysVI⁵ tackles major batch effects, such as those arising from cross-species or organoid-versus-tissue studies, using latent cycle-consistency and VampPrior regularization. Compared to Harmony or regular scVI models, SysVI excels at aligning technical systems while preserving true biological variance, producing embedding spaces where analogous cell types across batches cluster coherently.

In the tutorial, we show the power of SysVI by integrating human and mouse immune cells.

Figure 3: Example results of integration between human and mouse immune cells

Decipher#

Decipher⁶ is designed to dissect perturbation effects (e.g., disease versus control). It disentangles shared and condition-specific variation within a VAE framework. Demonstrated on AML (acute myeloid leukemia) data, it uncovers latent axes aligning with known disease signatures and identifies corresponding differentially expressed markers, bridging latent space analysis and functional biology.

MethylVI#

MethylVI⁷ is a VAE tailored for single-cell bisulfite sequencing (scBS-seq). By modeling methylation probabilities at genomic regions, it captures epigenetic heterogeneity and learns latent spaces that integrate multiple batches. The tutorial shows that it outperforms linear methods like PCA in retaining biologically meaningful methylation structures.

Figure 4: MethylVI integration of cell types from different single-cell bisulfite sequencing platforms

MethylANVI#

MethylANVI extends the MethylVI framework with annotation-aware modeling: it jointly integrates methylation profiles with metadata-driven cell-type labels. This supervised model supports both clustering and label transfer, all while capturing latent biological variation across methylome profiles, ideal for epigenetic atlas-building.

totalANVI#

totalANVI brings supervised annotation to CITE‑seq-style multi-omic integration. Leveraging a VAE for RNA + protein, it jointly learns latent embeddings and cell-type classifiers. The model simultaneously performs dimensionality reduction, denoising, differential expression, and accurate cell-type annotation in one cohesive model.

The scvi-tools tutorial will be updated soon.

MrVI in PyTorch#

MrVI⁸ (Multi-resolution Variational Inference) is a model written in JAX for analyzing multi-sample, multi-batch single-cell RNA-seq data. MrVI is particularly suited for single-cell RNA sequencing datasets with comparable observations across many samples. It conducts both exploratory analyses (locally dividing samples into groups based on molecular properties) and comparative analyses (comparing pre-defined groups of samples in terms of differential expression and differential abundance) at single-cell resolution.

In recent scvi-tools releases we added a PyTorch implementation of MrVI alongside the JAX one, to give users more options for running this popular model.

A tutorial running MrVI in PyTorch on a subset of the Tahoe100M cells dataset can be found here.

Figure 5: mrVI Integration on Covid dataset

2. 🧩 Custom Dataloaders#

scvi-tools v1.3 introduces three scalable custom dataloaders: LaminDB, Census, and AnnCollection. They enable out-of-core and federated training without memory overload. Custom dataloaders are currently supported only in the SCVI and SCANVI models, but extending them to other models should be straightforward. These backends are fully compatible with scvi-tools data registration and training workflows, offering both scale and convenience for large projects.

LaminDB#

This dataloader integrates with LaminDB, enabling out-of-core training from disk-backed collections. Users can register collections and seamlessly train models like SCVI using Lamin's MappedCollection, benefiting from disk efficiency while maintaining full API compatibility with in-memory datasets. For more information see this link.

The following tutorial demonstrates a scalable approach to training an scVI model on PBMC data using the Lamin dataloader.

Figure 6: SCVI integration achieved using the LaminDB dataloader on two distinct PBMC datasets.

This tutorial shows an analysis with the PyTorch version of mrVI together with the Lamin custom dataloader on a subset of the Tahoe100M cells dataset.

Figure 7: SCVI (bottom) and MRVI (top) integration achieved using the LaminDB dataloader on a subset of the Tahoe100M cells dataset.

Census#

This dataloader employs TileDB-SOMA for atlas-scale tensor-backed data, offering similar streaming capabilities but enhanced support for multidimensional genomic inputs and federated study designs. It reads CELLxGENE datasets directly from S3 and trains the SCVI model without the need to download them first, which makes it well suited to few-shot learning.

The following tutorial demonstrates a scalable approach to training an scVI model on mus_musculus data using the Census dataloader.

Figure 8: SCVI cell integration achieved using the Census dataloader, based on four types of batch: dataset_id, donor_id, assay, and tissue_general.

AnnCollection#

This dataloader allows training on multiple AnnData objects simultaneously, without merging them into one dataset. AnnCollection handles disparities in features or layers internally and aligns them during training, empowering federated or multi-study analyses.

The following tutorial shows how to apply the AnnCollection wrapper in scvi-tools to load and train a SCANVI model on several AnnData objects that are stored on disk. Another link shows how SCVI was trained on the Tahoe100M cells dataset using the AnnCollection wrapper and how its minified version was stored on scvi-hub for further analysis.
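To make the idea of training on several datasets without merging them more concrete, here is a minimal, library-free sketch. It is not the AnnCollection API: the `LazyCollection` class and the dictionary layout (`var_names`, `X`) are hypothetical stand-ins. It only illustrates two steps the text describes, aligning datasets on their shared features and serving rows on demand through a global index.

```python
import bisect
from itertools import accumulate


class LazyCollection:
    """Hypothetical stand-in: index several datasets as one, without merging."""

    def __init__(self, datasets):
        # Each dataset is a dict with "var_names" (feature names) and "X" (rows).
        self.datasets = datasets
        common = set(datasets[0]["var_names"])
        for d in datasets[1:]:
            common &= set(d["var_names"])
        # Keep shared features in the order of the first dataset.
        self.var_names = [g for g in datasets[0]["var_names"] if g in common]
        # For every dataset, the column positions of the shared features.
        self._columns = [
            [d["var_names"].index(g) for g in self.var_names] for d in datasets
        ]
        # Cumulative row counts, used to map a global index to a dataset.
        self._ends = list(accumulate(len(d["X"]) for d in datasets))

    def __len__(self):
        return self._ends[-1]

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        k = bisect.bisect_right(self._ends, i)   # which dataset holds row i
        start = self._ends[k - 1] if k else 0    # first global index of dataset k
        row = self.datasets[k]["X"][i - start]
        return [row[c] for c in self._columns[k]]


a = {"var_names": ["g1", "g2", "g3"], "X": [[1, 2, 3], [4, 5, 6]]}
b = {"var_names": ["g3", "g1"], "X": [[7, 8]]}
collection = LazyCollection([a, b])
print(collection.var_names, [collection[i] for i in range(len(collection))])
```

Only the requested row is touched on each access, which is what keeps memory use low when the underlying datasets live on disk.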

3. ⚙️ Core Enhancements#

Multi‑GPU Training#

Built on PyTorch Lightning, v1.3 lets all major models run across multiple GPUs with a single API flag. In benchmarks, training time decreases with the number of available GPUs, with full gradient synchronization and no code modifications needed. See the following tutorial and info page.

Figure 9: Comparison of SCVI training time between single-GPU and 2x multi-GPU machines as the dataset size increases.
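As a conceptual illustration of what gradient synchronization means in data-parallel training, the sketch below splits a batch into equal shards, computes a gradient per shard, and averages them. This is not scvi-tools or Lightning code; the toy model `y = w * x` and the function names are assumptions made for the example. With equal shard sizes, the averaged gradient equals the full-batch gradient, which is why training across several devices can follow the same optimization path as a single device.

```python
def mse_gradient(w, xs, ys):
    """Gradient of the mean squared error for the toy 1-D model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def data_parallel_gradient(w, xs, ys, n_workers):
    """Average per-shard gradients, mimicking an all-reduce across workers."""
    if len(xs) % n_workers != 0:
        raise ValueError("this sketch assumes equally sized shards")
    shard = len(xs) // n_workers
    grads = []
    for k in range(n_workers):
        part = slice(k * shard, (k + 1) * shard)
        grads.append(mse_gradient(w, xs[part], ys[part]))
    return sum(grads) / n_workers


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.5]
full = mse_gradient(0.5, xs, ys)
synced = data_parallel_gradient(0.5, xs, ys, n_workers=2)
print(full, synced)
```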

scIB‑Metrics Optimization#

With the integration of ScibCallback and the AutotuneExperiment class, users can now monitor scIB metrics on the validation set during training and automatically tune hyperparameters (model, training, and architecture parameters) based on these metrics. This directly optimizes for clustering and batch-mixing performance, making the model training process more principled and outcome-driven. See the following tutorial.

Explainability & Interpretability#

v1.3 brings native support for Captum's Integrated Gradients (IG) across semi-supervised generative models like SCANVI and totalANVI. Users can compute marker-level attribution scores tied to latent dimensions or differential axes. This complements get_normalized_expression, delivering a full pipeline that links model representations back to biologically interpretable molecular mechanisms, offering transparency and interpretability in deep generative modeling.

See the following tutorial for an example of a SCANVI model run on a PBMC dataset from 10x.

Figure 10: Integrated gradients total contribution per gene per cell type on PBMC data.
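For readers unfamiliar with Integrated Gradients, the following library-free sketch shows the computation on a toy scoring function. It does not use Captum or scvi-tools; the function `score`, its `weights`, and the all-zero baseline are assumptions chosen for illustration. The attribution of feature i is (x_i - baseline_i) times the average gradient along the straight path from the baseline to the input, approximated here with a midpoint Riemann sum.

```python
def integrated_gradients(grad_fn, x, baseline, n_steps=50):
    """Approximate Integrated Gradients with a midpoint Riemann sum."""
    avg_grad = [0.0] * len(x)
    for step in range(n_steps):
        alpha = (step + 0.5) / n_steps
        # Point on the straight line between the baseline and the input.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_fn(point)):
            avg_grad[i] += g / n_steps
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


# Toy differentiable "classifier score" (hypothetical, for illustration only).
weights = [0.5, -1.0, 2.0]


def score(x):
    return sum(w * xi ** 2 for w, xi in zip(weights, x))


def score_grad(x):
    return [2 * w * xi for w, xi in zip(weights, x)]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(score_grad, x, baseline)

# Completeness: attributions sum to score(x) - score(baseline).
print(attributions, sum(attributions), score(x) - score(baseline))
```

The completeness check at the end is a useful sanity test when applying the method to a real model: if the attributions do not sum to the difference in model output, the number of integration steps is probably too small.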

Summary#

scvi‑tools v1.3 is a landmark release that advances the field across three foundational pillars:

Innovative modeling across ten tailored VAEs for spatial, protein, methylation, perturbation, and multi‑omic data.
Scalable data processing with three new custom dataloaders in the backend, enabling efficient handling of federated, out-of-core, atlas-scale datasets, such as the Tahoe100M cells.
Infrastructure and transparency with multi-GPU training, metric-aware tuning, and demonstrable model interpretability.

Together, these developments empower researchers to build, train, and interpret probabilistic models at scale, in a reproducible, transparent, and biologically meaningful way.

References#

1. ResolVI: addressing noise and bias in spatial transcriptomics / Ergen et al.
2. scVIVA: a probabilistic framework for representation of cells and their environments in spatial transcriptomics / Levy et al.
3. CytoVI: Deep generative modeling of antibody-based single cell technologies / Ingelfinger et al.
4. VI-VS: calibrated identification of feature dependencies in single-cell multiomics / Boyeau et al.
5. sysVI: Integrating single-cell RNA-seq datasets with substantial batch effects / Hrovatin et al.
6. Decipher: Joint representation and visualization of derailed cell states with Decipher / Nazaret et al.
7. MethylVI: A deep generative model of single-cell methylomic data / Weinberger et al.
8. MrVI: Deep generative modeling of sample-level heterogeneity in single-cell genomics / Boyeau et al.

Mini-batch size in destVI

May 29, 2022 · 15 min read

Can Ergen, Romain Lopez, Nir Yosef

The task of deconvolution of spot-based spatial transcriptomics (ST) data consists of finding the cellular composition of spots in ST assays. By design, each spot contains several individual cells. Our lab developed destVI as a tool to study the cell-type composition of spots (1). Where within-cell-type variation is expected (i.e., activation states), DestVI was designed to infer the activation state of individual cell types in each spot, and therefore gives additional resolution of cell composition over competing algorithms. In our own benchmarking of destVI, we found performance in predicting cell-type proportions comparable to other state-of-the-art algorithms like Cell2Location (2). A recent benchmarking study compared the performance of several gene imputation and deconvolution methods. Given the ever-increasing number of deconvolution methods, the effort to benchmark those tools on a variety of simulated data is timely (3). In two of the benchmarking cases, DestVI had the worst performance. Along with the recent acceptance of the manuscript, we published several fixes to the codebase in scvi-tools v0.16.0, repeated those experiments, and analyzed the reasons for the failure of spot deconvolution in the given experiments. Briefly, we show that the poor performance was due to the mini-batch size used during optimization, and that resolving this helps recover competitive accuracy. Additionally, we provide the user with a heuristic for future use.

More specifically, variational autoencoders are a specific type of neural network (NN) that, like most NNs, are trained by iteratively presenting the network with mini-batches of data, that is, small fractions of the whole dataset. The mini-batch size refers to the number of distinct samples used for updating the parameters of the neural network during one training step.
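A small sketch may help to see why the mini-batch size matters on a small dataset. The helper names below are hypothetical and the code is independent of scvi-tools. It shuffles the sample indices, cuts them into mini-batches, and counts the parameter updates in one epoch, using the 189 pseudo-spots of the benchmarking dataset discussed later in this post.

```python
import math
import random


def iterate_minibatches(n_samples, batch_size, seed=0):
    """Yield lists of sample indices, one shuffled mini-batch at a time."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    for start in range(0, n_samples, batch_size):
        yield indices[start:start + batch_size]


def updates_per_epoch(n_samples, batch_size):
    """Number of parameter updates in one pass over the data."""
    return math.ceil(n_samples / batch_size)


n_spots = 189  # pseudo-spots in the benchmarking dataset
for batch_size in (8, 128):
    n_batches = sum(1 for _ in iterate_minibatches(n_spots, batch_size))
    print(batch_size, n_batches)  # 8 -> 24 updates, 128 -> 2 updates
```

With a mini-batch size of 128, each epoch performs only two updates, and each update already averages over most of the dataset.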

Methodology#

Models#

The DestVI model (1) uses a first training phase on a reference single-cell dataset to learn a cell-type-specific latent space. Then, it learns a second model that takes the spot data as input and encodes them in the reference latent space. Finally, the model learns cell-type proportions for each spot. In this blog post, we only address the cell-type proportion estimates, as was done in the original benchmarking study (3). We haven't checked the activation state of those cells because the simulation was designed in a way where the mean activation state of each cell type in each spot is not known, so ground truth is missing. For all experiments here, we kept the parameters of the CondSCVI model, which is trained on the single-cell data, constant, and we highlight every parameter we changed in the respective run. The only two parameters that we highlight here are the amortization scheme, where destVI either uses a neural network to amortize cell-type proportions and activation states or models them as free parameters, and the mini-batch size used for training.

The selection of genes for destVI in the original benchmarking study deserves a mention. The authors first took the intersection of the genes in the spatial assay and the single-cell assay (882 overlapping genes for the STARmap dataset) and then selected the 2,000 most highly variable genes. This second step had no effect (2,000 is higher than 882). Nevertheless, when a FISH-based experiment was designed specifically for an organ and all genes were carefully selected, we generally recommend against additional filtering for over-dispersed genes in the single-cell reference. Instead, train the model directly on the overlap of spatial and single-cell genes, skipping the highly variable gene step.
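The gene selection logic described above can be sketched in a few lines. The function name, the toy gene names, and the variability ranking are all hypothetical; the point is only that a top-2,000 filter applied after intersecting the gene sets is a no-op whenever the overlap is smaller than 2,000.

```python
def select_genes(spatial_genes, sc_genes, hvg_ranking, n_top=2000):
    """Intersect the two gene sets, then keep the n_top most variable genes.

    hvg_ranking lists single-cell genes from most to least variable.
    """
    shared = set(spatial_genes) & set(sc_genes)
    ranked_shared = [g for g in hvg_ranking if g in shared]
    return ranked_shared[:n_top]


# Toy example: 10 spatial genes, 100 single-cell genes, 5 shared.
spatial_genes = [f"g{i}" for i in range(10)]
sc_genes = [f"g{i}" for i in range(5, 105)]
hvg_ranking = list(reversed(sc_genes))  # pretend ranking, most variable first

selected = select_genes(spatial_genes, sc_genes, hvg_ranking, n_top=2000)
print(len(selected))  # all shared genes survive because n_top exceeds the overlap
```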

We compare destVI throughout this blog post with the benchmark version of Cell2Location. Our purpose is not to prove that we outperform other existing methods but to analyze why destVI showed poor performance in the benchmarking study. Cell2Location showed overall good performance in the benchmarking study and is implemented using the scvi-tools framework but relies on a marker gene related approach instead of an unbiased deconvolution approach.

Hyperparameter selection#

Most experiments before the destVI publication, as well as the benchmarking study, were performed with amortization of the cell-type activation state only (called latent amortization hereafter), treating the cell-type proportions as free parameters. However, with the recent version of the code we generally recommend amortizing both (called both amortization hereafter). In this blog post, we checked performance for both amortization schemes in all experiments. The training mini-batch size was varied over {4, 8, 16, 32, 64, 128, 256, 512, 1024}. For the comparison with the benchmarking study, we left out batch sizes of 256 or higher because the number of spots was 189. For our comparison on a large-scale dataset, we left out batch sizes of 16 and lower because the training time increases drastically when using small batch sizes on large datasets.

Results#

Datasets#

We reused the STARmap dataset from the original publication (mouse brain cortex). The single-cell reference contains 14,249 cells by 34,041 genes. It contains data from several mouse lines, both sexes, and various time points, which were sorted by flow cytometry and sequentially sequenced. The STARmap dataset contains 1,523 cells by 981 genes. The simulation sums all counts over a window of 750 pixels. This gives a pseudo-spot count matrix with 189 spots and 1-17 cells per spot. For further details, we refer to the original publication (3). To see how destVI performs with an even lower number of spots, we subset this dataset to 57 spots by keeping the first three columns. The diversity of cell types is mainly along the cortical axis, so subsetting to columns keeps the complexity of the dataset similar.
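The pseudo-spot simulation can be sketched as a simple binning step. The code below is an illustration under assumptions, not the benchmarking study's implementation: the function name, the tuple layout of `cells`, and the toy coordinates are hypothetical. Each cell is assigned to a square window by its coordinates, gene counts are summed per window, and the ground-truth cell-type proportions are the fraction of cells of each type in the window.

```python
from collections import defaultdict


def make_pseudo_spots(cells, window):
    """Sum counts of all cells falling into the same square window.

    cells: iterable of (x, y, cell_type, gene_counts) tuples.
    Returns per-spot summed counts and ground-truth cell-type proportions.
    """
    counts = {}
    composition = defaultdict(lambda: defaultdict(int))
    for x, y, cell_type, gene_counts in cells:
        key = (int(x // window), int(y // window))  # grid position of the spot
        if key not in counts:
            counts[key] = [0] * len(gene_counts)
        counts[key] = [a + b for a, b in zip(counts[key], gene_counts)]
        composition[key][cell_type] += 1
    proportions = {}
    for key, per_type in composition.items():
        total = sum(per_type.values())
        proportions[key] = {t: n / total for t, n in per_type.items()}
    return counts, proportions


cells = [
    (10.0, 20.0, "Astro", [1, 0, 2]),
    (700.0, 100.0, "Olig", [0, 3, 1]),
    (800.0, 100.0, "Astro", [5, 5, 5]),
]
spot_counts, spot_proportions = make_pseudo_spots(cells, window=750)
print(spot_counts)
print(spot_proportions)
```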

For a regime with more spots, we decided to keep the organ of interest (mouse brain) and use a dataset with a much higher number of cells captured. For this purpose, we reproduced the analysis provided by Vizgen using MERFISH technology, which provides a walk-through tutorial on Google Colab.

Figure 1: Overview of MERFISH brain dataset from (https://colab.research.google.com/drive/1OxJRO19cPsDW0JGAh4tLJjgOl7EMxQbP?usp=sharing).

This analysis yields a single-cell reference dataset with 160,796 cells by 27,998 genes and a spatial dataset with 83,546 cells by 483 genes. The simulation here again sums all counts of cells whose center falls within a window of 40 µm. We chose the size of this window to obtain a number of cells per pseudo-spot equal to that of the STARmap dataset. This gives a pseudo-spot count matrix with 27,395 spots and 1-16 cells per spot.

Results on STARmap dataset#

First, we verified that by using the updated version of the code in scvi-tools v0.16.0, we get similar results to the benchmarking study. In agreement with the original benchmarking study, the layer structure of neurons in different cortical layers wasn't visible.

Then, we ran different amortization variants of DestVI. Modeling cell-type proportions as a free parameter (as was done in the study) leads to no visible structure at all. With amortized cell-type proportions, where a neural network estimates the cell-type proportions, destVI predicted astrocytes correctly, but it could not differentiate the different excitatory neurons and instead classified all neurons as a single mixture.

By changing the batch size, we noticed that decreasing the training mini-batch size drastically improves the performance of both amortization variants. Notably, the model with a mini-batch size of 32 and both parameters amortized performs badly. We see good deconvolution of the different neuronal layers with batch sizes of 8, 12, and 16. Additionally, for these batch sizes there was no qualitative difference between both amortization and latent amortization.

When comparing the results of destVI with Cell2Location, it becomes clear that Cell2Location outperforms destVI for cell types like Pvalb or Smc cells, while both algorithms fail on microglia. The better performance of Cell2Location is most likely explained by the low number of those cell types in the spatial dataset and therefore their low percentage in the respective spots. The bad performance for microglia might be due to the selection of FISH probes, which gives low coverage of myeloid cell heterogeneity.

Figure 2: Results on benchmarking dataset. Left-to-right ground truth, Cell2Location, DestVI with both amortization and batch-size 8, 16, 48, 128 and DestVI with latent amortization with same batch size. Increase in matching proportions for destVI with decreasing mini-batch size. Cell2Location outperforms destVI for Oligodendrocytes. Quantitative measures (PCC=Pearson Correlation Coefficient, SSIM=Structural Similarity, RMSE=Root Mean Squared Error, JSD=Jensen-Shannon-Divergence) show on par performance for destVI with a mini-batch size below 16.

cell-type	freq_sc	freq_spatial
ExcitatoryL6	3190	287
ExcitatoryL5	1786	94
Sst	1741	42
Vip	1728	15
ExcitatoryL4	1401	198
Pvalb	1337	42
ExcitatoryL2and3	982	258
Astro	368	141
Endo	94	150
Olig	91	200
Smc	55	13
Micro	51	23

When checking quantitative results, we find on-par performance of destVI and Cell2Location in Pearson Correlation Coefficient based on spots, while Cell2Location outperforms destVI when checking correlation across cell types. As described above, this improved performance is driven by lowly abundant cell types.
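As a reference for the quantitative measures, here is a minimal, library-free sketch of PCC, RMSE, and JSD applied to a matrix of cell-type proportions (rows are spots, columns are cell types). The toy matrices are made up, and reading "based on spots" as a row-wise correlation and "across cell types" as a column-wise correlation is our assumption. SSIM is omitted because it requires the spatial image layout.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient; NaN if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else float("nan")


def rmse(x, y):
    """Root mean squared error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


def jsd(p, q):
    """Jensen-Shannon divergence (natural log) of two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(u, v):
        return sum(a * math.log(a / b) for a, b in zip(u, v) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Toy proportions: 3 spots (rows) x 3 cell types (columns).
truth = [[0.5, 0.5, 0.0], [0.2, 0.3, 0.5], [0.0, 0.1, 0.9]]
pred = [[0.4, 0.6, 0.0], [0.3, 0.3, 0.4], [0.1, 0.1, 0.8]]

pcc_per_spot = [pearson(t, p) for t, p in zip(truth, pred)]
pcc_per_cell_type = [pearson(t, p) for t, p in zip(zip(*truth), zip(*pred))]
jsd_per_spot = [jsd(t, p) for t, p in zip(truth, pred)]
print(pcc_per_spot, pcc_per_cell_type, jsd_per_spot)
```

Rare cell types contribute very little to a per-spot score but get a full column of their own in the per-cell-type score, which is one way the two views can disagree.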

As demonstrated here, reducing the size of the training mini-batch lets destVI reach overall similar performance to Cell2Location for cell-type deconvolution. We next asked whether this is also the case when further reducing the number of spots. For this study, as described above, we subset the pseudo-spots and retrained Cell2Location and destVI. Overall, we find better agreement with both amortization across different training batch sizes. Oligodendrocytes and astrocytes are correctly predicted in all versions with both amortization. Only the models with a batch size of 4 differentiate between the different layers of excitatory neurons. Of note, latent amortization outperforms both amortization here for a mini-batch size of 4. This might be an effect of the small training set size, which prevents the amortization network from being trained well. We generally wouldn't recommend using spatial deconvolution techniques for such a low number of spots. Given its higher stability across mini-batch sizes, we prefer to recommend the both amortization scheme. In cases with very few examples and known ground truth, we advise training both models and comparing the results.

Figure 3: Results on subset of benchmarking dataset. Subset on first three columns in original dataset. Displayed are only neuron layers as structure in other celltypes is hardly detected with only three columns. For ground-truth, we display all columns to allow easier comparison. Left-to-right ground truth, Cell2Location, DestVI with both amortization and batch-size 4, 8, 12, 16, 32 and DestVI with latent amortization with same batch size. On par performance with mini-batch size 4 and latent amortization is visible with slightly reduced performance in both amortization.

Results on MERFISH dataset#

To check the effect of the training mini-batch size on performance for large datasets, we simulated a second pseudo-spot matrix. The training time increases drastically as the mini-batch size is reduced because the GPU is used less efficiently. We therefore restricted ourselves to models with a mini-batch size of 32 or above (the model with batch size 32 trained for more than 3 hours on an Nvidia RTX 3090). We therefore think that the mini-batch size should be 128, in which case we haven't seen major speed improvement (most likely depends on the GPU architecture).

model	computation time
Cell2location	2h 3min 13s
batchsize_32 both_amortization	3h 31min 27s
batchsize_48 both_amortization	1h 44min 21s
batchsize_64 both_amortization	1h 26min 5s
batchsize_128 both_amortization	36min 17s
batchsize_256 both_amortization	20min 48s
batchsize_512 both_amortization	10min 55s
batchsize_1024 both_amortization	10min 03s

Overall, for all combinations of parameters we see improved performance of destVI over standard Cell2Location. This is especially visible in di- and mesencephalon excitatory neurons and telencephalon inhibitory interneurons, where Cell2Location doesn't uncover the tissue distribution of these cell types. We see no correlation with the small size of those cell types here, and the reason for this reduced performance is not clear. As we did not set out to perform a benchmarking study here but to study the performance of destVI, we haven't changed the hyperparameters of Cell2Location to increase its performance.

cell-type	freq_sc	freq_spatial
Oligodendrocytes	30253	10244
Astrocytes	19377	9476
Telencephalon projecting exc. neurons	18799	22345
Telencephalon inh. interneurons	8637	4451
Mesencephalon exc. neurons	6455	8066
TE proj. inh. neurons	5691	3569
Microglia	5425	477
Vascular endothelial cells	3805	6188
Vascular smooth muscle cells	1628	2018
Vascular and leptomeningeal cells	1501	1905
Ependymal cells	1257	900
Hindbrain neurons	1144	43
Cholinergic and monoaminergic neurons	1071	6163
Oligodendrocyte precursor cells	820	4524
Choroid epithelial cells	458	477

Overall, we see that performance is stable up to a batch size of 256, with decreasing performance for both amortization at batch sizes 512 and 1,024, while the performance of latent amortization is stable with increasing batch size. We postulate that mini-batch training of the cell-type amortization network is essential for performance. As we saw speed improvements above when using a bigger batch size, we asked whether bigger batch sizes perform well when trained for more epochs. Indeed, increasing the number of epochs for mini-batch size 512 led to on-par performance using 5,000 instead of the default 2,500 training epochs, while the performance of mini-batch size 1,024 was still inferior with 10,000 training epochs.

Figure 4: Results on MERFISH brain dataset. Left-to-right ground truth, Cell2Location, DestVI with both amortization and batch-size 32, 128, 256, 1,024 and DestVI with latent amortization with same batch size. Cell type proportion estimates are improved over Cell2Location in all destVI models. There is a decrease in performance for models with batch_size 1,024 for endothelial cells, that are low abundant in every spot.

Conclusion#

In the analysis of destVI from the main paper, we had limited our benchmarking to standard spot-based assays, in which both spatial and single-cell datasets are based on whole-transcriptome sequencing and, in particular, contain more than 1,000 spots. We found that DestVI performance in the newly published benchmarking study was mediocre because the number of spots was close to the training mini-batch size, and therefore the underlying composition of the spots was not learned adequately. We verified this by showing that, with a smaller training mini-batch size, destVI can yield performance on par with other methods for cell-type deconvolution.

DestVI yields not only cell-type proportion estimates but also cell-type activation estimates. The benchmarking study was designed to study only cell-type proportion estimates, and we kept the same design here. Generally, we think the additional output of cell-type activation is a major benefit of DestVI over competing algorithms. We nevertheless thank the authors of the original benchmarking study for discovering deficiencies of destVI with a small number of spots.

Practical recommendations#

We demonstrated that destVI also yields those results with a subset of the original dataset with just 57 spots. We also checked 19 spots, and neither destVI nor Cell2Location discovered the different layers of cortical neurons. We therefore recommend that users not run destVI with fewer than 50 spots.

Over the course of our experiments, we found a training mini-batch size of min(dataset_size/10, 128) to perform well in deconvolution. We set the maximum batch size to 128 as we saw decreasing performance with a batch size of 512 for the brain dataset. Most likely it is safe to increase the mini-batch size for big datasets to gain runtime benefits. However, we have the most experience from experiments with a batch size of 128 and limit the maximum batch size to this value. If runtime is a big concern, this parameter can be increased manually. The version with batch size 128 was already several times faster than Cell2Location.
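The heuristic can be written as a small helper. Because the text sets 128 as the maximum batch size, we read the rule as one tenth of the number of spots, capped at 128. The function name, the use of integer division, and the lower bound of 1 are our assumptions. The resulting value would then be passed as the batch size when training the spatial model (for example through a `batch_size` argument of the training call, which should be checked against the scvi-tools version in use).

```python
def suggested_batch_size(n_spots, cap=128, fraction=10):
    """One tenth of the dataset size, capped at `cap` and at least 1."""
    return max(1, min(n_spots // fraction, cap))


# The three dataset sizes discussed in this post.
for n_spots in (57, 189, 27395):
    print(n_spots, suggested_batch_size(n_spots))
```

For the 189-spot benchmarking dataset this gives 18, close to the batch sizes of 8 to 16 that performed well above, and for the 27,395-spot brain dataset it gives the cap of 128.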

DestVI with latent amortization showed superior performance in small datasets at the optimal mini-batch size, but its performance was inferior for other mini-batch sizes. We continue to suggest both amortization; in the case of the brain dataset, both amortization schemes were similar in performance.

Please share any feedback with us via Twitter (@YosefLab), through the comment section below, or through the scverse Discourse webpage (https://discourse.scverse.org/).

Acknowledgements#

We acknowledge members of the Yosef Lab. We thank Adam Gayoso for reviewing the changes to destVI and bringing the benchmarking study to our attention.

Bibliography#

(1) Romain Lopez, Baoguo Li, Hadas Keren-Shaul, Pierre Boyeau, Merav Kedmi, David Pilzer, Adam Jelinski, Ido Yofe, Eyal David, Allon Wagner, Can Ergen, Yoseph Addadi, Ofra Golani, Franca Ronchese, Michael I. Jordan, Ido Amit and Nir Yosef. DestVI identifies continuums of cell types in spatial transcriptomics data. Nature Biotechnology. 2022.

(2) Vitalii Kleshchevnikov, Artem Shmatko, Emma Dann, Alexander Aivazidis, Hamish W. King, Tong Li, Rasa Elmentaite, Artem Lomakin, Veronika Kedlian, Adam Gayoso, Mika Sarkin Jain, Jun Sung Park, Lauma Ramona, Elizabeth Tuck, Anna Arutyunyan, Roser Vento-Tormo, Moritz Gerstung, Louisa James, Oliver Stegle and Omer Ali Bayraktar. Cell2location maps fine-grained cell types in spatial transcriptomics. Nature Biotechnology. 2022.

(3) Bin Li, Wen Zhang, Chuang Guo, Hao Xu, Longfei Li, Minghao Fang, Yinlei Hu, Xinye Zhang, Xinfeng Yao, Meifang Tang, Ke Liu, Xuetong Zhao, Jun Lin, Linzhao Cheng, Falai Chen, Tian Xue and Kun Qu. Benchmarking spatial and single-cell transcriptomics integration methods for transcript distribution prediction and cell type deconvolution. Nature Methods. 2022.

scvi-tools 0.9.0 release

March 3, 2021 · 7 min read

Adam Gayoso, Romain Lopez, Galen Xing, Nir Yosef

Today we officially released scvi-tools version 0.9.0 (changelog). This release marks the culmination of five months of work on the backend of the codebase, which came after three months of work on the frontend. In this short note, we officially introduce scvi-tools as a readily usable codebase that contains many implementations of probabilistic single-cell omics methods, and also features a high-level interface to accelerate the model development process. We start with some historical notes about our previous codebase, which was mostly used for internal developments in the last three years. We then describe the obstacles we found to its external adoption, and the foundational idea behind the new scvi-tools work: a high-level deep probabilistic programming library specialized for single-cell omics data.

Hyperparameter search for scVI

July 5, 2019 · 20 min read

Gabriel Misrachi, Jeffrey Regier, Romain Lopez, Nir Yosef

While stochastic gradient-based optimization is highly successful for setting weights and other differentiable parameters of a neural network, it is in general useless for setting hyperparameters -- non-differentiable parameters that control the structure of the network (e.g., the number of hidden layers or the dropout rate) or settings of the optimizer itself (e.g., the learning rate schedule). Yet finding good settings for hyperparameters is essential for good performance for deep methods like scVI. Furthermore, as pointed out by Hu and Greene (2019), selecting hyperparameters is necessary in order to compare different machine learning models, especially if those are substantially sensitive to hyperparameter variations.

Should we zero-inflate scVI?

June 25, 2019 · 22 min read

Oscar Clivio, Pierre Boyeau, Romain Lopez, Jeffrey Regier, Nir Yosef

Droplet-based single-cell RNA sequencing (scRNA-seq) datasets typically contain at least 90% zero entries. How can we best model these zeros? Recent work focused on modeling zeros with a mixture of count distributions. The first component is meant to reflect whether such an entry can be explained solely by the limited amount of sampling (on average ~5% or less of the molecules in the cell). The second component is generally used to reflect "surprising" zeros caused by measurement bias, transient transcriptional noise (e.g., a "bursty" gene with a short mRNA half-life), or true longer-term heterogeneity that cannot be captured by a simplified (low-dimensional) representation of the data. Among others, zero-inflated distributions (i.e., zero-inflated negative binomial) have been widely adopted to model gene expression levels (1, 2).
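To make the mixture concrete, the sketch below computes the probability of observing a zero under a negative binomial (NB) and under its zero-inflated version (ZINB). It assumes the common mean and inverse-dispersion parameterization (mean `mu`, inverse dispersion `theta`) and a zero-inflation probability `pi`; the function names and the example values are ours and are not taken from the post.

```python
def nb_zero_prob(mu, theta):
    """P(count = 0) under a negative binomial with mean mu, inverse dispersion theta."""
    return (theta / (theta + mu)) ** theta


def zinb_zero_prob(mu, theta, pi):
    """P(count = 0) under a zero-inflated NB with inflation probability pi."""
    return pi + (1 - pi) * nb_zero_prob(mu, theta)


def prob_zero_is_inflated(mu, theta, pi):
    """Posterior probability that an observed zero came from the inflation component."""
    return pi / zinb_zero_prob(mu, theta, pi)


mu, theta, pi = 1.0, 1.0, 0.2
print(nb_zero_prob(mu, theta))               # zeros explained by limited sampling
print(zinb_zero_prob(mu, theta, pi))         # total probability of a zero
print(prob_zero_is_inflated(mu, theta, pi))  # share attributed to "surprising" zeros
```

For lowly expressed genes the NB component alone already predicts many zeros, so the inflation component has little left to explain.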