This repository provides TensorFlow and PyTorch reference implementations of PAL. PAL is an efficient and effective line search approach for DNNs which exploits the almost parabolic shape of the loss in the negative gradient direction to automatically estimate good step sizes.
If you have any questions or suggestions, please do not hesitate to contact me: maximus.mutschler(at)uni-tuebingen.de
Fig1: PAL's basic idea
PAL is based on the empirical observation that the loss function can be approximated by a one-dimensional parabola in the negative gradient (line) direction.
To fit this parabola, only one additional loss value has to be measured on the line.
PAL performs an update step by jumping to the minimum of the approximated parabola.
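As a minimal sketch of this idea (our own illustrative code, not taken from the reference implementations): the parabola l(s) ≈ a·s² + b·s + c is fitted from the current loss l(0), the directional derivative l'(0) along the normalized line direction, and one additional loss value l(μ) measured at distance μ; its minimum lies at s = -b / (2a).

```python
def parabolic_minimum_step(loss_0, loss_mu, dir_derivative_0, mu):
    """Step size to the minimum of the parabola fitted through l(0), l(mu)
    and l'(0) along the normalized line direction (illustrative sketch).

    loss_0:           current loss l(0)
    loss_mu:          loss measured mu further along the line, l(mu)
    dir_derivative_0: directional derivative l'(0); negative if the line
                      points in the negative gradient direction
    mu:               measuring step size
    """
    b = dir_derivative_0
    a = (loss_mu - loss_0 - b * mu) / mu ** 2   # curvature estimate
    # A proper minimum requires a > 0; the reference implementations
    # handle the degenerate cases (a <= 0) separately.
    return -b / (2.0 * a)
```

For example, with loss_0 = 1.0, loss_mu = 0.9, dir_derivative_0 = -1.2 and mu = 0.1, the estimated step to the parabola's minimum is 0.3.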
PAL surpasses SLS, ALIG, SGD-HD and COCOB and competes against ADAM, SGD and RMSProp on ResNet-32, MobileNetV2, DenseNet-40 and EfficientNet architectures trained on CIFAR-10 and CIFAR-100.
However, the latter are tuned with piecewise constant step size schedules, whereas PAL derives its own learning rate schedule.
PAL surpasses all of these optimizers when they are trained without such a schedule.
Therefore, PAL could be used in scenarios where default schedules fail.
For a detailed explanation, please refer to our paper: https://arxiv.org/abs/1903.11991
Fig2: Exemplary performance of PAL with data augmentation
Fig3: Exemplary performance of PAL without data augmentation (note that omitting augmentation leads to severe overfitting)
PAL introduces the following hyperparameters, whose default values already lead to good training and test errors.
Usually only the measuring step size has to be adapted slightly.
Its sensitivity is not as high as that of SGD's learning rate.
Abbreviation | Name | Description | Default parameter intervals | Sensitivity compared to SGD learning rate |
---|---|---|---|---|
μ | measuring step size | distance to the second sampled training loss value | [0.1, 1] | medium |
α | update step adaptation | multiplier to the update step | [1.0, 1.2, 1.7] | low |
β | direction adaptation factor | adapts the line direction depending on previous line directions | [0, 0.4] | low |
smax | maximum step size | maximum step size on the line | [3.6] | low |
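The following sketch shows roughly where each hyperparameter enters a single PAL-style update. It is our own illustration of the procedure described in the paper (function and variable names are ours), not the reference implementation; `state` is simply a dict that persists between steps and holds the previous line direction.

```python
import torch

def pal_style_step(params, closure, state, mu=0.1, alpha=1.0, beta=0.0, s_max=3.6):
    """One illustrative PAL-style update on an iterable of parameters.

    closure() must recompute the loss for the *same* batch, so that both
    measurements lie on the same deterministic loss function.
    """
    params = list(params)

    # Loss and gradient at the current point.
    for p in params:
        p.grad = None
    loss_0 = closure()
    loss_0.backward()
    grads = [p.grad.detach().clone() for p in params]

    # Line direction: negative gradient, mixed with the previous direction
    # via the direction adaptation factor beta.
    prev = state.get("dirs")
    dirs = [-g + beta * d for g, d in zip(grads, prev)] if prev else [-g for g in grads]
    state["dirs"] = dirs
    norm = float(torch.sqrt(sum((d ** 2).sum() for d in dirs)))

    with torch.no_grad():
        # Measure the loss a distance mu along the normalized direction.
        for p, d in zip(params, dirs):
            p.add_(d, alpha=mu / norm)
        loss_mu = float(closure())

    # Fit the parabola (as in the sketch above) and jump to its minimum,
    # scaled by the update step adaptation alpha and clipped to s_max.
    b = float(sum((g * d).sum() for g, d in zip(grads, dirs))) / norm  # l'(0)
    a = (loss_mu - float(loss_0) - b * mu) / mu ** 2
    s_upd = alpha * (-b / (2.0 * a)) if a > 0 and b < 0 else mu  # crude fallback
    s_upd = min(s_upd, s_max)

    with torch.no_grad():
        # The parameters currently sit at distance mu; move the remainder.
        for p, d in zip(params, dirs):
            p.add_(d, alpha=(s_upd - mu) / norm)
    return loss_0
```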
- No limitations. Can be used in the same way as any other PyTorch optimizer; a usage sketch follows directly after this list.
- Runs with PyTorch 1.4
- Uses tensorboardX for plotting
- Parabola approximations and loss lines can be plotted
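Because PAL needs two loss evaluations per step, the optimizer is driven with a closure that recomputes the loss on the current batch, similar to torch.optim.LBFGS. The class name, import path and constructor arguments below (PalOptimizer, mu, alpha, beta, s_max, train_loader) are assumptions for illustration only; see the repository's example code for the actual interface.

```python
import torch.nn as nn
import torch.nn.functional as F
# Hypothetical import path; check the repository for the actual module name.
from pal_optimizer import PalOptimizer

model = nn.Sequential(nn.Flatten(), nn.Linear(784, 10))
optimizer = PalOptimizer(model.parameters(), mu=0.1, alpha=1.0, beta=0.0, s_max=3.6)

for inputs, targets in train_loader:  # assumes an existing DataLoader
    def closure():
        # Must be deterministic: PAL evaluates the loss twice per step
        # on the same batch (no Dropout or other random components).
        optimizer.zero_grad()
        loss = F.cross_entropy(model(inputs), targets)
        loss.backward()
        return loss

    optimizer.step(closure)
```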
- Limitations:
- The DNN must not contain any random components such as Dropout or ShakeDrop. This is because PAL requires two loss values of the same deterministic function (= two network inferences) to determine an update step. Otherwise the function would not be continuous and a parabolic approximation would not be possible. However, if these random component implementations were changed so that the drawn random numbers can be reused for at least two inferences, PAL would also support these operations.
- If using Dropout, it has to be replaced with the adapted implementation we provide, which works with PAL; a sketch of the underlying idea is given after the lists below.
- With TensorFlow 1.15 and 2.0 it was not possible for us to write a completely graph-based optimizer. Therefore, it has to be used slightly differently than other optimizers; have a look at the example code! This is not the case with PyTorch.
- The TensorFlow implementation does not support the Keras and Estimator APIs.
- Runs with TensorFlow 1.15
- Uses tensorboard for plotting
- Parabola approximations and loss lines can be plotted
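The adapted Dropout idea mentioned in the limitations above boils down to reusing the drawn random mask for both inferences of one step. Below is a minimal sketch of this concept (our own illustration, not the adapted implementation shipped with this repository).

```python
import torch
import torch.nn as nn

class MaskReusingDropout(nn.Module):
    """Dropout whose random mask stays fixed until resample() is called,
    so consecutive inferences see the same deterministic function."""

    def __init__(self, p=0.5):
        super().__init__()
        self.p = p
        self.mask = None

    def resample(self):
        # Call once per optimization step, before the first inference.
        self.mask = None

    def forward(self, x):
        if not self.training or self.p == 0.0:
            return x
        if self.mask is None or self.mask.shape != x.shape:
            keep = 1.0 - self.p
            self.mask = (torch.rand_like(x) < keep).to(x.dtype) / keep
        return x * self.mask
```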
A virtual environment capable of executing the provided code can be created from the provided python_virtual_env_requirements.txt.