Generalisation, Averaging Schemes, Privacy
Essential Bibliography
Analysis of generalisation within the Teacher/Student setup
Original paper introducing the teacher/student framework for generalisation (and gradient flow) analysis:
Also the basis of more modern works on generalisation - mostly in the continual learning setting, e.g.:
Stochastic Weight Averaging
Original paper introducing Stochastic Weight Averaging as a generalisation-enhancing training scheme:
Improvement upon SWA with clever weight sampling and learning rate scheduling:
Re-purposing SWA as a surrogate-Bayesian uncertainty estimation tool:
Paper casting constant-LR SGD as a sampling scheme; interesting from the theoretical viewpoint:
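For intuition, a minimal NumPy sketch of the SWA idea, i.e. averaging the tail of constant-LR SGD iterates; the quadratic objective, noise model and all hyperparameters below are made up purely for illustration:

```python
import numpy as np

# Toy quadratic objective: f(w) = 0.5 * ||w - w_star||^2
rng = np.random.default_rng(0)
w_star = np.array([1.0, -2.0])

def noisy_grad(w):
    # Exact gradient plus Gaussian noise, mimicking minibatch SGD
    return (w - w_star) + 0.1 * rng.standard_normal(2)

w = np.zeros(2)
lr = 0.1                     # constant learning rate, as in SWA
swa_sum = np.zeros(2)
n_avg = 0

for step in range(1000):
    w -= lr * noisy_grad(w)
    if step >= 500:          # start averaging after a burn-in phase
        swa_sum += w
        n_avg += 1

w_swa = swa_sum / n_avg
print(np.linalg.norm(w - w_star), np.linalg.norm(w_swa - w_star))
```

On noisy quadratics the averaged iterate typically lands closer to the minimiser than any single late iterate, which is the intuition the papers above build on.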
Differentially-private Stochastic Gradient Descent
Original paper introducing the method:
The original paper introducing the link between algorithmic stability and generalisation (note: DP-SGD should be algorithmically stable!):
And a recent take on the problem, for adaptive optimisers:
Workshop paper investigating the interrelationship among SWA, DP-SGD and generalisation, with theory developed up to noise-free quadratic objectives:
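To make the method concrete, a self-contained NumPy sketch of the DP-SGD recipe (per-sample gradient clipping plus calibrated Gaussian noise) on a toy least-squares problem; model, data and hyperparameters are invented for illustration, and no actual privacy accounting is performed:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: noiseless linear model y = X @ w_true, squared loss
X = rng.standard_normal((64, 3))
w_true = np.array([0.5, -1.0, 2.0])
y = X @ w_true

w = np.zeros(3)
lr, clip_norm, noise_mult = 0.1, 1.0, 0.5   # made-up hyperparameters

for _ in range(200):
    # Per-sample gradients of 0.5 * (x @ w - y)^2 w.r.t. w
    residuals = X @ w - y                       # shape (64,)
    per_sample_grads = residuals[:, None] * X   # shape (64, 3)

    # Clip each sample's gradient to L2 norm <= clip_norm
    norms = np.linalg.norm(per_sample_grads, axis=1, keepdims=True)
    clipped = per_sample_grads / np.maximum(1.0, norms / clip_norm)

    # Average, then add Gaussian noise scaled to the clipping bound
    noise = noise_mult * clip_norm * rng.standard_normal(3) / len(X)
    w -= lr * (clipped.mean(axis=0) + noise)

print(np.round(w, 2))
```

Clipping bounds each individual's influence on the update, and the noise masks what remains; together they are what makes the iterates (and hence the final weights) differentially private.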
Flat/sharp minima in loss landscapes
Papers supporting the flatness-generalisation hypothesis:
A paper casting doubt on the flatness-generalisation hypothesis:
A famous optimiser (SAM) directly integrating sharpness-awareness in the optimisation process, at the cost of a double backward pass:
And a paper aiming at deeply understanding how it works:
On the limits of wide minima optimisers:
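For reference, the SAM update itself is simple to state: ascend to the (linearised) worst-case point within an L2 ball of radius rho, then descend using the gradient taken there; the second gradient evaluation is the source of the double backward pass. A toy NumPy sketch on a quadratic, with made-up hyperparameters:

```python
import numpy as np

def loss_grad(w):
    # Gradient of the toy loss f(w) = 0.5 * ||w||^2
    return w.copy()

w = np.array([3.0, -4.0])
lr, rho = 0.1, 0.05          # rho = radius of the perturbation ball

for _ in range(100):
    g = loss_grad(w)
    # Step 1: ascend to the worst-case point within the L2 ball
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    # Step 2: descend with the gradient evaluated at the perturbed point
    w -= lr * loss_grad(w + eps)

print(np.linalg.norm(w))
```

In a deep-learning setting each of the two `loss_grad` calls is a full backward pass, which is exactly the extra cost mentioned above.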
Local gradient/weight averaging schemes (at optimisation-time)
The paper that started them all (Polyak-Ruppert averaging):
- Polyak, Juditsky; 1992
(or the technical report from Ruppert; 1988)
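The scheme itself is just an online running mean of the SGD iterates; a toy NumPy sketch with an invented noisy quadratic and step-size schedule:

```python
import numpy as np

rng = np.random.default_rng(1)
w_star = np.array([2.0, 1.0])

w = np.zeros(2)
w_bar = w.copy()             # Polyak-Ruppert average of the iterates

for t in range(1, 2001):
    g = (w - w_star) + 0.2 * rng.standard_normal(2)  # noisy gradient
    w -= (0.5 / t**0.5) * g                          # decaying step size
    w_bar += (w - w_bar) / t                         # online running mean

print(np.linalg.norm(w_bar - w_star))
```

The running-mean recursion needs no extra memory beyond one weight copy, which is why the same trick reappears in SWA and related schemes.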
The original Lookahead optimiser:
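Its core loop is easy to sketch: run k "fast" SGD steps starting from the current "slow" weights, then move the slow weights a fraction alpha toward the result. A toy NumPy illustration with made-up hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(2)
w_star = np.array([1.0, 3.0])

slow = np.zeros(2)
k, alpha, lr = 5, 0.5, 0.2   # invented Lookahead hyperparameters

for _ in range(200):          # outer (slow-weight) updates
    fast = slow.copy()
    for _ in range(k):        # k fast inner SGD steps
        g = (fast - w_star) + 0.1 * rng.standard_normal(2)
        fast -= lr * g
    # Slow weights interpolate toward the final fast weights
    slow += alpha * (fast - slow)

print(np.round(slow, 2))
```

The interpolation damps the noise of the inner optimiser, which is where the scheme's stability gains come from.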
The new Lookaround optimiser:
- Zhang*, Liu, Song, Zhu, Xu, Song; 2023
(* a different Zhang)
Implementations
The reference PyTorch implementation of DP-SGD can be found as part of the Opacus Project, which is citable as:
As far as the optimisers are concerned, their reference implementations tend, for whatever reason, to become unmaintained some time after publication (Lookahead, SAM) or to rely on an outdated software stack (Lookaround). To ease the situation, I have minimally bugfixed and re-published them.
They should be available as
from ebtorch.optim import Lookahead, Lookaround, SAM
after a simple
pip install ebtorch
The Lookahead one is fairly battle-tested; the Lookaround one is still very experimental; the SAM one closely matches the original and should be OK to use.