CS 201: Batch Normalization Causes Gradient Explosion in Deep Randomly Initialized Networks, GREG YANG, Microsoft Research

Speaker: Greg Yang
Affiliation: Microsoft Research

ABSTRACT:

Batch Normalization (batchnorm) has become a staple in deep learning since its introduction in 2015. Its authors conjectured that “Batch Normalization may lead the layer Jacobians to have singular values close to 1,” and recent works suggest it benefits optimization by smoothing the optimization landscape during training. We disprove the “Jacobian singular value” conjecture for randomly initialized networks, showing that batchnorm causes gradient explosion that is exponential in depth. This implies that at initialization, batchnorm in fact “roughens” the optimization landscape. Empirically, this explosion prevents one from training ReLU networks with more than 50 layers without skip connections. We discuss several ways of mitigating this explosion and their relevance in practice. This work is a collaboration with Jeffrey Pennington, Vinay Rao, Jascha Sohl-Dickstein, and Sam S. Schoenholz, and it appeared in ICLR 2019 (https://openreview.net/forum?id=SyMDXnCcF7).
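The effect described in the abstract can be observed directly at initialization. Below is a minimal PyTorch sketch (not taken from the talk or the paper; the depth, width, batch size, and loss are illustrative choices) that compares the first-layer weight-gradient norm of a deep batchnorm + ReLU MLP against the same network without batchnorm, where the batchnorm network typically shows a much larger gradient at the early layers.

```python
# Illustrative sketch only: a randomly initialized deep MLP with and without
# batchnorm, comparing the gradient norm reaching the first layer.
import torch
import torch.nn as nn

torch.manual_seed(0)
depth, width, batch = 60, 256, 128  # hypothetical sizes, not values from the paper

def make_mlp(use_bn: bool) -> nn.Sequential:
    layers = []
    for _ in range(depth):
        layers.append(nn.Linear(width, width))
        if use_bn:
            layers.append(nn.BatchNorm1d(width))
        layers.append(nn.ReLU())
    return nn.Sequential(*layers)

def first_layer_grad_norm(model: nn.Sequential) -> float:
    x = torch.randn(batch, width)
    loss = model(x).pow(2).mean()  # arbitrary scalar loss at initialization
    loss.backward()
    # Gradient norm of the weights of the very first Linear layer.
    return model[0].weight.grad.norm().item()

print("with batchnorm:   ", first_layer_grad_norm(make_mlp(True)))
print("without batchnorm:", first_layer_grad_norm(make_mlp(False)))
```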

BIO:

Greg Yang is a Researcher at Microsoft Research. He holds a Bachelor's degree in Mathematics and a Master's degree in Computer Science from Harvard University. Greg is broadly interested in artificial intelligence, theoretical computer science, and mathematics.

Hosted by Professor Stefano Soatto

Date/Time:
Date(s) - Jun 06, 2019
4:15 pm - 5:45 pm

Location:
3400 Boelter Hall
420 Westwood Plaza, Los Angeles, California 90095