## Description¶

Info

Parent class: Optimizer

Derived classes: -

The main drawback of pure adaptive gradient descent is the uncontrolled accumulation of squared gradients, which leads to a constant decrease in the learning rate and, as a result, to a paralysis of the learning process itself.

The first principle of AdaDelta is that instead of the full amount of Gt updates, the gradient square averaged over history will be used. The method resembles the principle used in MomentumSGD - exponential moving average method.

Let us introduce a notation $E[g^2]_t$ - moving average of the square of the gradient at the $t$ time.

E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma)g_t^2

Then, when we insert $E[g^2]_t$ in the formula for updating parameters for AdaGrad instead or $G_t$, we will get the following formula (matrix operations are omitted for simplicity):

\theta_{t + 1} = \theta_t - \frac {\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t

The denominator is the root mean square (RMS) of the gradients :

RMS[g]_t = \sqrt{E[g^2]_t + \epsilon}

Then the expression for the parameter update value is:

\Delta\theta_t = - \frac {\eta}{RMS[g]_t} g_t

Let us note that since the learning rate is nondimensional, the abstract units of measurement of this quantity and parameters do not coincide (however, the same applies to many other stochastic gradient descent algorithms). The next idea when developing AdaDelta was to bring the update value to the same units of measurement that the model parameters have.

This can be done by removing the learning rate from the formula. Let us define for this another moving average - the average of the squares of the values of the update parameters:

E[\Delta\theta^2]_t = \gamma E[\Delta\theta^2]_{t-1} + (1 - \gamma)\Delta\theta_t^2

Then:

RMS[\Delta\theta]_t = \sqrt{E[\Delta\theta^2]_t + \epsilon}

When we insert $RMS[\Delta\theta]_t$ in the formula of updating the parameters, we get the expression for the AdaDelta algorithm:

\begin{equation} \Delta\theta_t = -\frac {RMS[\Delta\theta]_{t-1}} {RMS[g]_t} g_t \end{equation}

\begin{equation} \theta_{t + 1} = \theta_t + \Delta\theta_t \end{equation}

It is recommended to set the $\gamma$ parameter to $0.9$.

## Initializing¶

def __init__(self, rho=0.95, epsilon=1e-6, nodeinfo=None):


Parameters

Parameter Allowed types Description Default
rho float Decay rate 0.95
epsilon float Smoothing parameter 1e-6
nodeinfo NodeInfo Object containing information about the computational node None

Explanations

-

## Examples¶

Necessary imports:

import numpy as np
from PuzzleLib.Backend import gpuarray


Info

gpuarray is required to properly place the tensor in the GPU.

Let us set up a synthetic training dataset:

data = gpuarray.to_gpu(np.random.randn(16, 128).astype(np.float32))
target = gpuarray.to_gpu(np.random.randn(16, 1).astype(np.float32))


Declaring the optimizer:

optimizer = AdaDelta()


Suppose that there is already some net network defined, for example, through Graph, then in order to install the optimizer on the network, the following is required:

optimizer.setupOn(net, useGlobalState=True)


Info

You can read more about optimizer methods and their parameters in the description of the Optimizer parent class

Moreover, let there be some loss error function, inherited from Cost, calculating its gradient as well. Then we get the implementation of the optimization process:

for i in range(100):
... predictions = net(data)
... error, grad = loss(predictions, target)