I'm a Research Scientist at Samsung AI Center, and I'm finishing my PhD, which focuses on applications and understanding of stochastic deep learning models. Before that, I received a master's degree from the Moscow Institute of Physics and Technology and the Yandex School of Data Analysis, where I studied machine learning, and a bachelor's degree from Bauman Moscow State Technical University with a major in applied mathematics and computer science.

SparseVD works in practice and has been used for network sparsification in leading IT companies. However, later studies showed that careful use of pruning-based methods can produce better results.

Training deep models with noise is known to be hard and unstable. That is less of an issue for SparseVD: all variances are initialized with small values and do not change much during training. Since small variances do not hurt performance, SparseVD can be viewed as a fancy regularizer with (almost) no noise.
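As a rough illustration, here is a minimal NumPy sketch of the per-weight Gaussian noise model and the approximate KL term used in sparse variational dropout (the function and variable names are mine; the KL approximation and its constants follow Molchanov et al., 2017, but this is a sketch, not the paper's implementation):

```python
import numpy as np

# Each weight has a mean theta and a log-variance log_sigma2; the noisy
# weight used in the forward pass is w = theta + sigma * eps, eps ~ N(0, 1).
rng = np.random.default_rng(0)

def sample_weights(theta, log_sigma2, rng):
    """Reparameterized sample from the Gaussian posterior N(theta, sigma^2)."""
    sigma = np.exp(0.5 * log_sigma2)
    return theta + sigma * rng.standard_normal(theta.shape)

def neg_kl_approx(theta, log_sigma2):
    """Approximate -KL(q || log-uniform prior), written in terms of
    log alpha = log(sigma^2 / theta^2), per Molchanov et al. (2017)."""
    k1, k2, k3 = 0.63576, 1.87320, 1.48695
    log_alpha = log_sigma2 - np.log(theta ** 2 + 1e-8)
    return (k1 / (1.0 + np.exp(-(k2 + k3 * log_alpha)))
            - 0.5 * np.log1p(np.exp(-log_alpha)) - k1)
```

With small initial log-variances (e.g. log_sigma2 ≈ -10), sampled weights stay very close to their means, so early in training the layer behaves almost deterministically, which matches the "regularizer with (almost) no noise" view above.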

The sparse solution is just a local optimum: better ELBO values can be achieved with a less flexible, zero-mean variational posterior q(w_ij) = N(w_ij | 0, σ_ij).

Variational dropout secretly trains highly sparsified deep neural networks: the sparsity pattern is learned jointly with the weights during training.
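After training, the learned sparsity pattern is typically read off by thresholding the per-weight dropout rate. A minimal sketch (the threshold of 3 on log alpha is the commonly used rule; variable names are mine):

```python
import numpy as np

def sparsity_mask(theta, log_sigma2, threshold=3.0):
    """Keep a weight only if its inferred dropout rate is low enough.
    log alpha = log(sigma^2) - log(theta^2); large alpha means the weight
    is dominated by noise and can be pruned."""
    log_alpha = log_sigma2 - np.log(theta ** 2 + 1e-8)
    return log_alpha < threshold

# toy example: the middle weight has a tiny mean and large variance,
# so it gets pruned, while the other two are kept
theta = np.array([1.0, 0.01, -0.5])
log_sigma2 = np.array([-6.0, 0.0, -1.0])
mask = sparsity_mask(theta, log_sigma2)
```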

The work shows that i) a simple ensemble of independently trained networks performs significantly better than recent techniques; ii) simple test-time augmentation applied to a conventional network outperforms low-parameter ensembles (e.g. Dropout) and also improves all ensembles for free; iii) comparisons of the uncertainty estimation abilities of algorithms are often done incorrectly in the literature.
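Test-time augmentation itself is very simple: average the predicted probabilities over several augmented views of each input. A hedged sketch with a toy stand-in model and a horizontal flip as the only augmentation (both are placeholders, not the paper's exact setup):

```python
import numpy as np

def tta_predict(model, x, augmentations):
    """Average class probabilities over the original input and its
    augmented copies (test-time augmentation)."""
    views = [x] + [aug(x) for aug in augmentations]
    probs = np.stack([model(v) for v in views])
    return probs.mean(axis=0)

def toy_model(x):
    """Toy classifier: softmax over per-channel mean intensity,
    just so the sketch runs end to end."""
    logits = x.mean(axis=(1, 2))
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

hflip = lambda x: x[:, :, ::-1]  # flip along the width axis

x = np.random.default_rng(0).normal(size=(2, 8, 8, 3))  # batch of 2 "images"
p = tta_predict(toy_model, x, [hflip])
```

The same wrapper also composes with ensembles: averaging TTA predictions across independently trained members is what gives the "for free" improvement noted above.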

The deep weight prior is a generative model of the kernels of convolutional neural networks that acts as a prior distribution when training on new datasets.

It is possible to learn a zero-centered Gaussian distribution over the weights of a neural network by learning only the variances, and it works surprisingly well.
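Concretely, such a layer draws fresh zero-mean weights on every forward pass and averages predictions over several samples at test time. A minimal sketch of the idea (names and the single linear layer are mine; a real network would stack such layers with nonlinearities):

```python
import numpy as np

rng = np.random.default_rng(1)

def zero_mean_linear(x, log_sigma2, rng):
    """Linear layer whose weights are sampled as w = sigma * eps with
    eps ~ N(0, 1); only log_sigma2 is a learnable parameter."""
    sigma = np.exp(0.5 * log_sigma2)
    w = sigma * rng.standard_normal(sigma.shape)
    return x @ w

def predict_mean(x, log_sigma2, rng, n_samples=32):
    """Test-time prediction: average the stochastic forward passes."""
    return np.mean(
        [zero_mean_linear(x, log_sigma2, rng) for _ in range(n_samples)],
        axis=0,
    )
```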