Optimising high-dimensional non-convex objectives is the central computational challenge in modern machine learning. Despite the empirical success of first-order methods, theoretical guarantees remain elusive outside restricted problem classes. A core difficulty is that standard convergence proofs assume Lipschitz-smooth gradients and bounded stochastic noise, assumptions routinely violated in practice by batch normalisation layers and heavy-tailed gradient distributions.
Early convergence results for SGD in non-convex settings established O(1/√T) rates for the minimum expected squared gradient norm over the iterates, matching known lower bounds up to logarithmic factors. Subsequent work extended these guarantees to variance-reduced estimators, showing that finite-sum structure can be exploited to achieve faster rates without computing a full-batch gradient at every iteration. See Theorem 2.1 for a precise statement.
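To make the variance-reduced estimator concrete, the following is a minimal sketch of a control-variate estimator in the style of SVRG on a toy finite-sum objective. The quadratic components f_i, the step size, and the loop lengths are illustrative assumptions, not the setting analysed in this paper.

```python
# Sketch of an SVRG-style control-variate gradient estimator on a toy
# finite-sum objective f(x) = (1/n) * sum_i 0.5 * (a_i . x)^2.
# All problem data and hyperparameters below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 10
A = rng.standard_normal((n, d))          # rows a_i define the components f_i
grad_i = lambda x, i: A[i] * (A[i] @ x)  # gradient of the i-th component
full_grad = lambda x: A.T @ (A @ x) / n  # full-batch gradient (1/n) * sum_i grad_i(x)

x = rng.standard_normal(d)
eta, epochs, m = 0.01, 20, n             # step size, outer epochs, inner-loop length

for _ in range(epochs):
    x_snap = x.copy()                    # snapshot point anchoring the control variate
    mu = full_grad(x_snap)               # one full gradient per epoch, at the snapshot
    for _ in range(m):
        i = rng.integers(n)
        # Unbiased estimator whose variance shrinks as x approaches x_snap,
        # which is what allows faster rates than plain SGD on finite sums.
        v = grad_i(x, i) - grad_i(x_snap, i) + mu
        x -= eta * v
    print(f"||grad f(x)|| = {np.linalg.norm(full_grad(x)):.2e}")
```

Note the design choice: the full gradient is recomputed only at each snapshot, so most iterations touch a single component yet still use a low-variance estimate.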
Adam and its variants introduce per-parameter learning rates derived from second-moment estimates, dramatically accelerating early training. However, their convergence in non-convex settings requires careful choice of the ε stabilisation term: values that are too small reintroduce instability near saddle points, while large values wash out the adaptive scaling and recover the behaviour of SGD with momentum.
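The role of the ε term is easiest to see in the update itself. The sketch below is a minimal NumPy rendering of the standard Adam step; the hyperparameters and the toy gradient in the usage example are illustrative assumptions, not values used in our experiments.

```python
# Minimal sketch of one Adam update, to make the role of eps explicit.
import numpy as np

def adam_step(x, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step; eps stabilises the division by the second-moment estimate."""
    m = beta1 * m + (1 - beta1) * grad           # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias corrections (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    # If eps dominates sqrt(v_hat), the per-parameter scaling disappears and the
    # step reduces to lr/eps times the momentum estimate, i.e. momentum SGD.
    x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v

# Illustrative usage with an assumed constant gradient.
x, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
g = np.array([0.1, -0.2, 0.05])
for t in range(1, 4):
    x, m, v = adam_step(x, g, m, v, t)
```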
Let f : ℝ^d → ℝ be an L-smooth function bounded below. Under mild assumptions on the stochastic gradient oracle, we show that the iterates {x_t} produced by SGD with step size η_t = η/√t satisfy 𝔼[‖∇f(x_τ)‖²] = O(log T/√T), where τ is an index drawn uniformly at random from {1, …, T}; this improves the constant in prior bounds by a factor of 2Lη.
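A standard route to a bound of this form combines the descent lemma with a telescoping sum. The sketch below assumes an unbiased oracle with variance bounded by σ² and η ≤ 1/L, one common instantiation of the "mild assumptions" above; it is only an outline, not the paper's proof or its improved constant.

```latex
% Sketch under the assumptions E[g_t | x_t] = \nabla f(x_t),
% E\|g_t - \nabla f(x_t)\|^2 \le \sigma^2, and \eta \le 1/L.
\begin{align*}
\mathbb{E}[f(x_{t+1})]
  &\le \mathbb{E}[f(x_t)]
     - \eta_t\Bigl(1 - \tfrac{L\eta_t}{2}\Bigr)\,\mathbb{E}\|\nabla f(x_t)\|^2
     + \tfrac{L\eta_t^2\sigma^2}{2}. \\
\intertext{Summing over $t = 1,\dots,T$ with $\eta_t = \eta/\sqrt{t}$, using
$\sum_t \eta_t \ge \eta\sqrt{T}$, $\sum_t \eta_t^2 \le \eta^2(1+\log T)$, and
$1 - L\eta_t/2 \ge 1/2$ for $\eta \le 1/L$, then rearranging gives}
\min_{1 \le t \le T}\mathbb{E}\|\nabla f(x_t)\|^2
  &\le \frac{f(x_1) - \inf f + \tfrac{L\sigma^2\eta^2}{2}(1+\log T)}
            {\sum_{t=1}^{T}\eta_t\bigl(1 - \tfrac{L\eta_t}{2}\bigr)}
   = O\!\Bigl(\tfrac{\log T}{\sqrt{T}}\Bigr).
\end{align*}
```

The left-hand minimum is what the uniformly sampled index τ converts into the expectation 𝔼[‖∇f(x_τ)‖²], up to the same order in T.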