Hyperparameter search for scVI

While stochastic gradient-based optimization is highly successful for setting weights and other differentiable parameters of a neural network, it is in general useless for setting hyperparameters -- non-differentiable parameters that control the structure of the network (e.g. the number of hidden layers, or the dropout rate) or settings of the optimizer itself (e.g., the learning rate schedule). Yet finding good settings for hyperparameters is essential for good performance for deep methods like scVI. Furthermore, as pointed out by Hu and Greene (2019) selecting hyperparameters is nessary in order to compare different machine learning models, especially if those are substantially sensitive to hyperparameter variations.

Generally, hyperparameter search is known to be challenging for the end-user and time-consuming. Therefore, we added in scVI a module based on the Bayesian optimization framework hyperopt. This new feature makes effective use of multiple GPUs and is a ready-to-use solution for tuning hyperparameters in scVI. This blog post presents this feature and investigates how end-users will benefit from it.

Theory : Bayesian Optimization and Hyperopt#

First and foremost, one needs to carefully select a metric for which to search for optimal hyperparameters, keeping in mind that this one might depend on the downstream task, as explained in Theis et al. (2016). Since we train a generative model, the held-out negative log-likelihood is a natural criterion for model selection. We therefore rely on it for hyperparameters tuning.

Historically, the very first methods introduced to tackle hyperparameter search were grid search and random search. Such approaches are limited in the sense that they scale exponentially with the size of the search space. A more scalable and elegant approach to search is Bayesian Optimization (BO), whose main idea is to learn and use the structure of the objective function. The overall process can be decomposed into two unordered tasks. One is to define a tractable surrogate prior for the underlying black-box objective (e.g., a Gaussian Process). The other is to define a criterion for which to optimize this surrogate process.

In the special case of hyperopt the criterion used is the expectation of the improvement over a certain baseline which can be the current best or lower. The surrogate is called a Tree-strctured Parzen Estimator. The TPE divides the sample set into two subsets based on their value compared to the current baseline and estimates two corresponding conditional densities over the search space using non-parametric kernel-density estimators. Then, good samples should have a high conditional density for improving samples and vice versa.

As suggested in the original article Bergsta et al. (2011), this hyper-optimization process can be used asynchronously by mocking the result for the pending trials with the mean of the available samples. Still, it should be noted that if the budget is a single evaluation per process running in parallel, then a random search should be preferred.

There are a certain number of recent approaches competing with hyperopt. Li et al. (2018) proposes Hyperband: an alternative to BO with bandit-based principled early stopping, which scales better with the number of workers. These can be also combined for trial selection and principled early stopping, as explained in Wang et al. (2018). Interested readers might also take a look at two black-box optimizers based on different paradigms, namely the CMA-ES and DFO-TR algorithms - both have available code. BO enthusiasts can take a look at scikit-optimize as well as Ax, Facebook's brand new pytorch-powered ML platform .

Practice : scVI's autotune module#

We are excited to release our new autotune module for scVI, based on hyperopt. We thank hyperopt developers for their maintained codebase as well as their intelligent MongoDb-based communication between workers. Using this new feature on the Cortex dataset is as simple as running the code snippet below.

import logging
logger = logging.getLogger("scvi.inference.autotune")
logger.setLevel(logging.DEBUG)
if __name__ == "__main__":
best_trainer, trials = auto_tune_scvi_model(
gene_dataset=dataset,
parallel=True,
exp_key="cortex_test",
max_evals=100,
)

Note that the default behaviour of auto_tune_scvi_model is as follows:

  • A parameter search space containing the learning rate, the network architecture and the dropout rate is provided.
  • The objective metric is set to be the held-out log-likelihood of the data computed on a test set which size equals 25% of the whole dataset.
  • All available GPUs are used.

Of course, the behaviour of the auto-tuning process can be widely customized. For example, one can select the number of CPU processes, the ids of the GPUs to use, one can also input their own hyperparameter search space or even their own objective function. For a more advanced usage, we expose functions to launch your own workers, which could be used to run the search on multiple machines.

Results#

In order to evaluate the autotune procedure, we ran it on several standard datasets, directly available using our codebase. Namely, we used the Cortex, Pbmc and Brain Large datasets.

Datasets#

Let us first describe these three datasets in more detail.

  • CORTEX The mouse cortex dataset from Zeisel et al. (2015) contains 3005 mouse cortex cells and gold-standard labels for seven distinct cell types. We retain the top 558 genes ordered by variance.
  • PBMC As in the scVI manuscript, we concatenated two datasets of peripheral blood mononuclear cells (PBMCs) from a healthy donor from Zheng et al. (2017). After filtering, we extract 11,990 cells with 3446 genes.
  • BRAIN LARGE This dataset contains 1.3 million brain cells from 10x Genomics. We retain the 720 most variable genes.

For each of these, we ran our autotune process using a default search space and a budget of 100 trainings. The raw outcome of these experiments can be found here in the form of a table containing the parameters used and the resulting performance.

Runtime information#

We report the runtime metadata of our experiments in Table 1. This should give the user an idea of the time it takes to use our new feature, depending on the size of the dataset, the number of evaluations, the max number of epochs allowed and the number of GPUs available.

Dataset NameNb cellsNb genesWall timeAvg epochsN° of GPUsMax epoch
cortex30055589 hours53211000
pbmc1199033463 hours387161000
brain large130318272021.5 hours431650

Table 1: Runtime Table for 100 trainings. Wall time is the total time the experiment took. Avg epoch is the average number of epochs per training.

Hyperparameter sensitivity#

Our first concern was to investigate scVI's sensitivity to hyperparameters. One ideal way to investigate this matter would have been to study the variance of the performance for each set of hyperparameters and collapse the runs who are not significantly different before analyzing the remaining representants. As this is too computationally intensive, we instead focus on the ten best and ten worst runs for each dataset.

marginal_lln_layersn_hiddenn_latentreconstruction_lossdropout_ratelrn_epochsn_paramsrun index
11218.52125610zinb0.10.0124829081692
21218.7112812zinb0.10.0138214592080
31219.7125610zinb0.10.0136529081685
41220.06125610zinb0.10.0127529081691
51223.09112810zinb0.10.0144014540883
61223.2112812zinb0.50.00570314592038
71223.53125610zinb0.10.00151429081697
81223.94112812zinb0.50.0154214592074
91224.37112812zinb0.50.0152414592076
101224.37112812zinb0.50.0149714592071
911554.9826410zinb0.70.0052568089645
921559.015647nb0.50.000145710508837
931601.5336410nb0.70.001888908815
941612.946414zinb0.70.005719779249
951615.2222569nb0.90.000119742137620
961746.25312812zinb0.90.00113421145652
971818.8216412zinb0.90.0005547296060
986574.5711288zinb0.50.0001414489661
9910680.456412zinb0.30.000121057281
100NaN2646zinb0.90.0001318038413

Table 2: Hyperoptimization results for the Cortex dataset.

marginal_lln_layersn_hiddenn_latentreconstruction_lossdropout_ratelrn_epochsn_paramsrun index
11323.44125614zinb0.50.01170172032070
21323.8812568zinb0.50.01130171724854
31324.12125614zinb0.50.01178172032085
41324.1812568zinb0.50.01155171724865
51324.2125615zinb0.50.01140172083284
61324.2312568zinb0.50.005170171724842
71324.24125614zinb0.50.01227172032087
81324.2512568zinb0.50.01176171724867
91324.29112815zinb0.30.0117686041673
101324.3212568zinb0.50.01133171724869
911347.33646nb0.70.0128544544037
921350.02212813zinb0.90.0013508926728
931350.541645zinb0.90.00054924289280
941350.59112813nb0.90.01558599045
951350.9212813zinb0.90.000571089267215
961352.32325614zinb0.90.01201198246488
971355.6832565zinb0.90.001668197785649
981365.46325610zinb0.90.0142198041675
991367.7212815zinb0.90.000199989318413
1001370.332568nb0.90.0155197939281

Table 3: Hyperoptimization results for the Pbmc dataset.

marginal_lln_layersn_hiddenn_latentreconstruction_lossdropout_ratelrn_epochsn_paramsrun index
11143.79125615zinb0.10.00055037632092
21144.11125615nb0.10.00054237632072
31144.22125614nb0.10.0054837580870
41144.84125614nb0.10.0054337580886
51144.97125614nb0.10.0054837580868
61145.08125614nb0.10.0054337580851
71145.2125614nb0.10.0054737580885
81145.86225613nb0.10.00014950636855
91146.05225613nb0.10.00014950636842
101146.11225613nb0.10.00014950636858
911185.27512813zinb0.50.00054931872040
921188.06525614nb0.70.00014490009624
931188.98412815nb0.50.005492864644
941193.2152568zinb0.70.0054789702430
951202.7512811nb0.50.0014631820811
961203.3526413nb0.70.00014910201656
971205.62649nb0.70.00014810150482
981206.69412812nb0.70.00014728569661
991211.3152567nb0.70.0053889651274
1001232.1256415nb0.70.00012612684852

Table 4: Hyperoptimization results for the Brain Large dataset.

From Table 2, 3 and 4 we note that the obtained parameters agree on certain aspects such as the number of layers or the learning rate. Additionally, it is worthwhile noting that the best results often come from configurations with the maximal number of hidden neurons (256), in particular for the PBMC and Brain Large datasets. Perhaps, better performance could be obtained by increasing the search space on this component.

We purposedly do not use these results to conclude on which type of conditional distribution for the count distribution (NB, ZINB or other; parameterized by reconstruction_loss). Surprisingly, another tuning process run on the PBMC dataset consistently selected the negative binomial loss in the ten best runs. This likely is a bias introduced by the exploration process during the optimization procedure. See our group's blog post by O. Clivio and P. Boyeau for a complete analysis of the conditional distribution choice.

Overall, the best performances are stable even though we notice that small changes can have large impacts when we look at the full results (e.g, the number of layers can lead to a sensible difference in held-out log-likelihood). Such stability is in accordance with our extended use of scVI and scANVI in our harmonization preprint Xu et al. (2019), where we keep the hyperparameters fixed and obtain competitive performance with state-of-the-art methods. However, we recognize that hyperparameter tuning is an important practice in machine learning method developements and benchmarking and advocate the use of our autotune module for model selection and comparaisons.

The case of the Brain large dataset#

Given the amount of data available for this dataset, we were expecting to reach optimality with more hidden layers in our neural networks (as shallow networks with bounded width represent a potentially small function class). Suprisingly enough, single-hidden-layer models came out as the best contenders. In Figure 1, we investigate in further details the relationship between our n_layers parameter and the observed performance in terms of marginal negative log-likelihood.

Synthetic

Figure 1: Marginal negative log-likelihood as a function of the number of hidden layers in scVI's encoder/decoder, colored by the number of epochs. The number of epochs was limited to 50 or less if the early stopping criterion was triggered.

It is clear from Figure 1 that the shallow versions of scVI reach better performances on average. Still, a significant portion of hyperparameter configurations reach the maximum number of epochs. Potentially, allowing for a higher budget in terms of the number of epochs could change the outcome of our experiment. Additionally, the fact that we keep only the 720 most variable genes may be too big of a restriction on the amount of signal we keep, making this filtered version of the dataset simple enough to be better and faster captured by a single-hidden-layer model.

Benchmarking#

Our next concern was to investigate the kind of performance uplift we could yield with auto-tuning. Namely, we investigated how optimizing log likelihood could increase other metrics like imputation for example. Our results are reported in the table below.

DatasetRunLikelihoodImputation score
Marginal llELBO trainELBO testMedianMean
cortextuned1218.521178.521231.162.081552.87502
default1256.031224.181274.022.303173.25738
pbmctuned1323.441314.071328.040.839420.924637
default1327.611309.91334.010.8406170.925628
brain largetuned1143.791150.271150.71.035421.48743
default1160.681164.831165.121.045191.48591

Table 5: Comparison between the best and default runs of scVI on the Cortex, Pbmc and Brain Large datasets. Held-out marginal log-likelihood and imputation benchmarks are reported.

Note that the median (resp. mean) imputation score is the median (resp. mean) of the median absolute error per cell. Also, the held-out marginal negative log-likelihood is an importance sampling estimate of the negative marginal log-likelihood logp(x)-\log p(x) computed on a 25% test set and is the metric we optimize for.

From Table 5, we see that the tuning process improves the held-out marginal negative log-likelihood on all three datasets. This model improvement, in turn, ameliorates the imputation metrics for all three datasets, albeit not significantly for the Pbmc and Brain Large datasets.

Efficiency of the tuning process#

Our last concern was to investigate the efficiency of the tuning process. To do so we wanted to check how the held-out negative Evidence Lower BOund (ELBO) histories of the runs were evolving with the number of runs. Below is a plot of these histories.

Synthetic

Figure 2: Held-out negative ELBO history for each of the 100 trainings performed for each dataset. The lines are colored from red to green where red represents the first trial and green the last one.

In Figure 2, we see that most green runs (the last trials) are concentrated in the low negative ELBO regions which indicates that the TPE algorithm provides relevant trial suggestions. This would suggest that allowing a slightly higher budget might yield better results. The user should consider this when setting the number of evaluations to run.

Discussion#

As we illustrated, hyper-parameter tuning provides a improvement for the data fit and also for any metrics that are correlated with the held-out marginal log-likelihood. It is possible that some other scores such as clustering might not be improved by this process since scVI is not per se separating any form of clusters in its latent space. A case where improvement on log-likelihood should be correlated with better clustering scores is for example when using scANVI, which has a Gaussian mixture model prior over the latent space.

In this set of experiments, the hyperparameter selection procedure retained for all datasets an architecture with a unique hidden layer. To make sure that scVI benefits from non-linearity (compared to its deterministic homolog ZINB-WaVE), we investigated whether hyperopt would select a model with one layer compared to a linear model. We report our results in this table and conclude that a non-linear model sensibly outperforms the linear alternative on both of the Cortex and Pbmc datasets. Brain Large was not part of this experiment as it would have required a considerable amount of additional computing ressources. Besides, it is worth mentioning that some contemporary results tend to indicate that deep architectures are good approximators for gene expression measurements. For instance, Chen et al. (2016) compare linear regression to their deep neural network D-GEX on gene expression inference. They show architectures with three fully-connected layers outperform shallower architectures and linear regression by a significant margin. Such evidence could very well motivate a more in-depth study of scVI's optimization procedure. Perhaps, techniques to improve and accelerate deep neural-networks training could have a significant impact on the results we are able to reach. For example, between-layer connections such as those proposed in the DenseNet architecture Huang et al. (2018) might lead a path towards future improvement. In the meantime, we also plan to release a patch to scVI allowing for assymetric encoders and decoders as well as variable number of hidden units for each layer of each of scVI's neural networks.

Finally, we have seen how sensible scVI can be with respect to its set of hyperparameters on certain datasets. Thus, our new hyperparameter tuning feature should be used whenever possible - especially when trying to compare other methods to scVI. We note Qiwen Hu and Casey Greene's remarks on hyperparameter selection as well as their comment on Nature Methods and aknowledge their work being our main motivation for incorporating this new tool into scVI.

Acknowledgements#

We acknowledge Valentine Svensson for proposing the linear decoder and contributing to the scVI codebase.