CS 201: ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning, MICHAEL MAHONEY, ICSI & Department of Statistics – UC Berkeley



Second order optimization algorithms have a long history in scientific computing, but they tend not to be used much in machine learning.  This is despite the fact that they gracefully handle step size issues, poor conditioning, communication-computation tradeoffs, etc., all problems that are increasingly important in large-scale and high-performance machine learning.  A large part of the reason is that their implementation requires some care, e.g., a good implementation isn't possible in a few lines of Python after taking a data science boot camp, and a naive implementation typically performs worse than heavily parameterized/hyperparameterized stochastic first order methods.

We describe ADAHESSIAN, a second order stochastic optimization algorithm that dynamically incorporates the curvature of the loss function via ADAptive estimates of the Hessian.  ADAHESSIAN includes several novel performance-improving features: (i) a fast Hutchinson-based method to approximate the curvature matrix with low computational overhead; (ii) spatial averaging to reduce the variance of the second-derivative estimates; and (iii) a root-mean-square exponential moving average to smooth out variations of the second derivative across iterations.  Extensive tests on natural language processing, computer vision, and recommendation system tasks demonstrate that ADAHESSIAN achieves state-of-the-art results.  Its cost per iteration is comparable to that of first-order methods, and it exhibits improved robustness to variations in hyperparameter values.
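To give a flavor of component (i): Hutchinson's method estimates the diagonal of a Hessian H using only Hessian-vector products, via the identity E[z ⊙ Hz] = diag(H) for Rademacher probe vectors z. The sketch below is not the authors' implementation; the function names and the toy quadratic loss are illustrative assumptions:

```python
import numpy as np

def hutchinson_diag(hvp, dim, n_samples=1000, seed=None):
    """Estimate diag(H) via Hutchinson's method.

    hvp: a function computing the Hessian-vector product H @ z.
    Uses Rademacher probes z (entries +/-1), for which
    E[z * (H @ z)] equals the diagonal of H.
    """
    rng = np.random.default_rng(seed)
    est = np.zeros(dim)
    for _ in range(n_samples):
        z = rng.choice([-1.0, 1.0], size=dim)
        est += z * hvp(z)  # elementwise product of probe and H @ z
    return est / n_samples

# Toy example: quadratic loss f(x) = 0.5 * x.T @ H @ x,
# whose Hessian H is known, so the estimate can be checked.
H = np.diag([1.0, 2.0, 3.0]) + 0.1  # symmetric, with off-diagonal 0.1
d = hutchinson_diag(lambda z: H @ z, dim=3, n_samples=5000, seed=0)
# d is close to diag(H) = [1.1, 2.1, 3.1]
```

In a deep learning setting, the Hessian-vector product would come from a double backward pass rather than an explicit matrix, and ADAHESSIAN then applies variance reduction and an exponential moving average to these per-iteration estimates before using them to precondition the update, analogously to Adam's second-moment term.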


Michael Mahoney is an Associate Adjunct Professor at ICSI and the Department of Statistics at UC Berkeley, and is a faculty member of the RISELab (formerly the AMPLab) in the Department of EECS.

Michael Mahoney’s work focuses on the applied mathematics of data, in particular on the theory and practice of what is now called big data. On the theory side, his group develops algorithmic and statistical methods for matrix, graph, regression, optimization, and related problems. On the implementation side, they provide implementations (e.g., on single-machine, distributed data system, and supercomputer environments) of a range of matrix, graph, and optimization algorithms. On the applied side, they apply these methods to problems in internet and social media analysis and social network analysis, as well as genetics, mass spectrometry imaging, astronomy, climate, and a range of other scientific applications.

Hosted by Professor Quanquan Gu

Location: Via Zoom

Date(s) - Oct 06, 2020
4:00 pm - 5:45 pm

Zoom Webinar
404 Westwood Plaza, Los Angeles