Séminaire Images Optimisation et Probabilités
(Proba-Stat) Stochastic Approximation: Finite-time analyses and Variance Reduction
Gersende Fort
(Institut de Mathématiques de Toulouse, CNRS)
Conference room
November 14, 2024, at 11:15
In statistical learning, many analyses and methods rely on optimization, including its stochastic versions, which are introduced, for example, to overcome the intractability of the objective function or to reduce the computational cost of the deterministic optimization step.
In 1951, H. Robbins and S. Monro introduced a novel iterative algorithm, named "Stochastic Approximation", for computing the zeros of a function defined by an expectation with no closed-form expression. The algorithm produces a sequence of iterates by replacing, at each iteration, the unknown expectation with a Monte Carlo approximation based on a single sample. The method was later generalized into a stochastic algorithm designed to find the zeros of a vector field when only stochastic oracles of this vector field are available.
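As a minimal sketch of the iteration described above, assume a toy vector field h(theta) = E[X] - theta with X Gaussian, so that the unique zero is the mean of X. The function names, the Gaussian model, and the step-size rule 1/n are illustrative assumptions, not taken from the talk.

```python
import random


def robbins_monro(oracle, theta0, n_iter, step=lambda n: 1.0 / n, seed=0):
    """Run theta_{n+1} = theta_n + gamma_{n+1} * H(theta_n, X_{n+1}).

    `oracle(theta, rng)` returns one noisy evaluation of the vector field,
    built from a single fresh sample (hypothetical interface).
    """
    rng = random.Random(seed)
    theta = theta0
    for n in range(1, n_iter + 1):
        theta = theta + step(n) * oracle(theta, rng)
    return theta


def mean_field_oracle(theta, rng, mu=2.0, sigma=1.0):
    """One-sample oracle for h(theta) = E[X] - theta, with X ~ N(mu, sigma^2)."""
    x = rng.gauss(mu, sigma)
    return x - theta


# The zero of h is mu = 2.0 (toy setting, assumed for illustration).
theta_hat = robbins_monro(mean_field_oracle, theta0=0.0, n_iter=20000)
print(theta_hat)
```

With the step size 1/n, this particular recursion reduces to the running average of the samples, which is why it settles near the mean.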
Stochastic Gradient Descent algorithms are the most popular examples of Stochastic Approximation: their oracles come from a Monte Carlo approximation of a large sum. Possibly less popular are the examples said to be "beyond the gradient case", so named for at least two reasons. First, they rely on oracles that are biased approximations of the vector field, as occurs when biased Monte Carlo sampling is used to define the oracles. Second, the vector field is not necessarily a gradient vector field. Many examples in Statistics, and more generally in statistical learning, are "beyond the gradient case": among them are compressed stochastic gradient descent, stochastic Majorize-Minimization methods such as the Expectation-Maximization algorithm, and the Temporal Difference algorithm in reinforcement learning.
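To illustrate the gradient case only, here is a small sketch of Stochastic Gradient Descent written as a Stochastic Approximation scheme: the vector field is the negative gradient of a finite-sum least-squares objective, and each oracle uses one randomly drawn term of the sum. The synthetic data, the one-dimensional parameter, and the step size 1/(n + 10) are assumptions chosen for the example.

```python
import random

rng = random.Random(0)

# Synthetic finite-sum problem (assumed for illustration):
# f(theta) = (1 / (2N)) * sum_i (b_i - a_i * theta)^2
N = 1000
a = [rng.uniform(0.5, 1.5) for _ in range(N)]
b = [3.0 * a_i + rng.gauss(0.0, 0.1) for a_i in a]

# Exact minimizer of f, used only to check the stochastic iterates.
theta_star = sum(a_i * b_i for a_i, b_i in zip(a, b)) / sum(a_i * a_i for a_i in a)


def sgd(theta0, n_iter):
    """SGD as Stochastic Approximation: the oracle is one term of the sum."""
    theta = theta0
    for n in range(1, n_iter + 1):
        i = rng.randrange(N)                     # draw one index uniformly
        oracle = a[i] * (b[i] - a[i] * theta)    # unbiased estimate of -f'(theta)
        theta += oracle / (n + 10)               # decreasing step size
    return theta


theta_sgd = sgd(theta0=0.0, n_iter=20000)
print(theta_sgd, theta_star)
```

Here the oracle is unbiased and the field is a gradient. The "beyond the gradient case" examples listed above relax one or both of these properties.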
In this talk, we will show that these "beyond the gradient case" Stochastic Approximation algorithms still converge, even when the oracles are biased, provided that some parameters of the algorithm are tuned well enough. We will discuss what "tuned well enough" means when the quality criterion relies on epsilon-approximate stationarity. We will also comment on the efficiency of the algorithm through its sample complexity. Such analyses are based on non-asymptotic convergence bounds in expectation: we will present a unified method to obtain such bounds for a large class of Stochastic Approximation methods, covering both the gradient case and the beyond the gradient case. Finally, a Variance Reduction technique will be described and its efficiency illustrated.
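The abstract does not say which Variance Reduction technique is presented. As a generic illustration only, the sketch below uses a standard control-variate construction in the style of SVRG, which may differ from the technique in the talk: the one-sample oracle is corrected by its value at a reference point, whose full field is computed once. The data, the reference point, and all function names are hypothetical.

```python
import random

rng = random.Random(0)

# Synthetic finite-sum vector field (assumed): h(theta) = (1/N) * sum_i h_i(theta)
N = 1000
a = [rng.uniform(0.5, 1.5) for _ in range(N)]
b = [3.0 * a_i + rng.gauss(0.0, 0.5) for a_i in a]


def h_i(i, theta):
    """i-th term of the field: a_i * (b_i - a_i * theta)."""
    return a[i] * (b[i] - a[i] * theta)


def full_field(theta):
    """Exact field: costs N term evaluations."""
    return sum(h_i(i, theta) for i in range(N)) / N


def make_cv_oracle(theta_ref):
    """Control-variate oracle: h_i(theta) - h_i(theta_ref) + h(theta_ref)."""
    h_ref = full_field(theta_ref)  # computed once, then reused
    def oracle(i, theta):
        return h_i(i, theta) - h_i(i, theta_ref) + h_ref
    return oracle


def population_variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


# Compare both oracles at a point close to the reference point,
# enumerating every index so the variances are exact for this data set.
theta, theta_ref = 2.9, 3.0
cv_oracle = make_cv_oracle(theta_ref)
plain_values = [h_i(i, theta) for i in range(N)]
cv_values = [cv_oracle(i, theta) for i in range(N)]

var_plain = population_variance(plain_values)
var_cv = population_variance(cv_values)
mean_cv = sum(cv_values) / N
print(var_plain, var_cv)
```

Both oracles have the same mean, the exact field at theta. The corrected oracle has a smaller variance here because theta is close to the reference point; the benefit shrinks as the two points move apart, which is why such schemes refresh the reference point periodically.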