Axon.Updates (Axon v0.5.1)

Parameter update methods.

Update methods transform the input tensor in some way, usually by scaling or shifting the input with respect to some input state. Update methods are composed to create more advanced optimization methods such as AdaGrad or Adam. Each update returns a tuple:

{init_fn, update_fn}

which represent a state initialization function and a state update function, respectively. While each method in the Updates API is a regular Elixir function, the two functions they return are implemented as defn, so they can be accelerated using any Nx backend or compiler.

Update methods are just combinators that can be arbitrarily composed to create complex optimizers. For example, the Adam optimizer in Axon.Optimizers is implemented as:

def adam(learning_rate, opts \\ []) do
  Updates.scale_by_adam(opts)
  |> Updates.scale(-learning_rate)
end

Updates are maps of updates, often associated with parameters of the same names. Using Axon.Updates.apply_updates/3 will merge updates and parameters by adding associated parameters and updates, and ensuring any given model state is preserved.
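
As a rough sketch of how these pieces fit together (params and grads below are placeholder maps of parameters and gradients; the option values are illustrative):

{init_fn, update_fn} =
  Axon.Updates.scale_by_adam()
  |> Axon.Updates.scale(-1.0e-3)

# Build the optimizer state from the parameter map
optimizer_state = init_fn.(params)

# Transform raw gradients into updates, then apply them to the parameters
{updates, optimizer_state} = update_fn.(grads, optimizer_state, params)
new_params = Axon.Updates.apply_updates(params, updates)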

Custom combinators

You can create your own combinators using the stateless/2 and stateful/3 primitives. Every update method in this module is implemented in terms of one of these two primitives.

stateless/2 represents a stateless update:

def scale(combinator \\ Axon.Updates.identity(), step_size) do
  stateless(combinator, &apply_scale(&1, &2, step_size))
end

defnp apply_scale(x, _params, step) do
  deep_new(x, fn v -> Nx.multiply(v, step) end)
end

Notice how the function given to stateless/2 is defined within defn. This is what allows the anonymous functions returned by Axon.Updates to be used inside defn.

stateful/3 represents a stateful update and follows the same pattern:

def my_stateful_update(updates) do
  Axon.Updates.stateful(updates, &init_my_update/1, &apply_my_update/2)
end

defnp init_my_update(params) do
  state = zeros_like(params, type: :f32)
  %{state: state}
end

defnp apply_my_update(updates, state) do
  new_state = deep_new(state, fn v -> Nx.add(v, 0.01) end)
  updates = deep_merge(updates, state, fn g, z -> Nx.multiply(g, z) end)
  {updates, %{state: new_state}}
end
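
A combinator defined this way composes like the built-in ones; for example (the placement of my_stateful_update in this chain is purely illustrative):

Axon.Updates.scale_by_adam()
|> my_stateful_update()
|> Axon.Updates.scale(-1.0e-3)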

State associated with individual parameters should have keys that match the keys of the parameter. For example, if you have parameters %{kernel: kernel} with associated states mu and nu representing the first and second moments, your state should look something like:

%{
  mu: %{kernel: kernel_mu},
  nu: %{kernel: kernel_nu}
}

Summary

Functions

Adds decayed weights to updates.

Adds random Gaussian noise to the input.

Applies updates to params and updates state parameters with given state map.

Centralizes input by shifting updates by their mean.

Clips input between -delta and delta.

Clips input using input global norm.

Composes two updates. This is useful for extending optimizers without having to reimplement them.

Returns the identity update.

Scales input by a fixed step size.

Scales input according to Adam algorithm.

Scales input according to the AdaBelief algorithm.

Scales input according to the Rectified Adam algorithm.

Scales input by the root of the EMA of squared inputs.

Scales input by the root of all prior squared inputs.

Scales input using the given schedule function.

Scales input by a tunable learning rate which can be manipulated by external APIs such as Axon's Loop API.

Scales input by the root of the centered EMA of squared inputs.

Scales input according to the Yogi algorithm.

Represents a stateful update.

Represents a stateless update.

Traces inputs with past inputs.

Functions

add_decayed_weights(combinator_or_opts \\ [])

Adds decayed weights to updates.

Commonly used as a regularization strategy.

Options

* `:decay` - Rate of decay. Defaults to `0.0`.
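
For example, a decoupled-weight-decay (AdamW-style) chain might look roughly like the following; the decay and step size are illustrative:

Axon.Updates.scale_by_adam()
|> Axon.Updates.add_decayed_weights(decay: 0.01)
|> Axon.Updates.scale(-1.0e-3)
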
add_decayed_weights(combinator, opts)

add_noise(combinator_or_opts \\ [])

Adds random Gaussian noise to the input.

Options

* `:seed` - Random seed to use. Defaults to the
  current system time.

* `:eta` - Controls amount of noise to add.
  Defaults to `0.01`.

* `:gamma` - Controls amount of noise to add.
  Defaults to `0.55`.
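
As a quick sketch, noise can be injected ahead of the usual scaling step (the option values and step size are illustrative):

Axon.Updates.add_noise(eta: 0.01, gamma: 0.55)
|> Axon.Updates.scale(-1.0e-2)
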
add_noise(combinator, opts)

apply_updates(params, updates, state \\ nil)

Applies updates to params and updates state parameters with given state map.
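
A minimal illustration with hand-built maps, since apply_updates/3 adds each parameter to its corresponding update:

params = %{kernel: Nx.tensor([1.0, 2.0])}
updates = %{kernel: Nx.tensor([-0.1, 0.1])}

new_params = Axon.Updates.apply_updates(params, updates)
# new_params[:kernel] is now approximately [0.9, 2.1]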

centralize(combinator_or_opts \\ [])

Centralizes input by shifting updates by their mean.

centralize(combinator, opts)

clip(combinator_or_opts \\ [])

Clips input between -delta and delta.

Options

* `:delta` - maximum absolute value of the input. Defaults to `2.0`

clip_by_global_norm(combinator_or_opts \\ [])

Clips input using input global norm.

Options

* `:max_norm` - maximum norm value of input. Defaults to `1.0`
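
Clipping is typically placed at the front of an optimizer chain so that later transforms see the clipped gradients; a rough sketch (option values are illustrative):

Axon.Updates.clip_by_global_norm(max_norm: 1.0)
|> Axon.Updates.scale_by_adam()
|> Axon.Updates.scale(-1.0e-3)
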
clip_by_global_norm(combinator, opts)

compose(combinator1, combinator2)

Composes two updates. This is useful for extending optimizers without having to reimplement them. For example, you can implement gradient centralization:

import Axon.Updates

Axon.Updates.compose(Axon.Updates.centralize(), Axon.Optimizers.rmsprop())

This is equivalent to:

Axon.Updates.centralize()
|> Axon.Updates.scale_by_rms()

identity()

Returns the identity update.

This is often used as the initial update in many functions in this module.

scale(combinator \\ identity(), step_size)

Scales input by a fixed step size.

$$f(x_i) = \alpha x_i$$
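
Used on its own (with the default identity combinator), scale/2 amounts to plain gradient descent; the step size below is illustrative:

sgd = Axon.Updates.scale(-1.0e-2)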

scale_by_adam(combinator_or_opts \\ [])

Scales input according to the Adam algorithm.

Options

* `:b1` - first moment decay. Defaults to `0.9`

* `:b2` - second moment decay. Defaults to `0.999`

* `:eps` - numerical stability term. Defaults to `1.0e-8`

* `:eps_root` - numerical stability term. Defaults to `1.0e-15`

References

* [Adam: A Method for Stochastic Optimization](https://arxiv.org/abs/1412.6980)

scale_by_adam(combinator, opts)

scale_by_belief(combinator_or_opts \\ [])

Scales input according to the AdaBelief algorithm.

Options

* `:b1` - first moment decay. Defaults to `0.9`.

* `:b2` - second moment decay. Defaults to `0.999`.

* `:eps` - numerical stability term. Defaults to `0.0`.

* `:eps_root` - numerical stability term. Defaults to `1.0e-16`.

References

* [AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients](https://arxiv.org/abs/2010.07468)

scale_by_belief(combinator, opts)

scale_by_radam(combinator_or_opts \\ [])

Scales input according to the Rectified Adam algorithm.

Options

* `:b1` - first moment decay. Defaults to `0.9`

* `:b2` - second moment decay. Defaults to `0.999`

* `:eps` - numerical stability term. Defaults to `1.0e-8`

* `:eps_root` - numerical stability term. Defaults to `0.0`

* `:threshold` - threshold for variance. Defaults to `5.0`

References

* [On the Variance of the Adaptive Learning Rate and Beyond](https://arxiv.org/abs/1908.03265)

scale_by_radam(combinator, opts)

scale_by_rms(combinator_or_opts \\ [])

Scales input by the root of the EMA of squared inputs.

Options

* `:decay` - EMA decay rate. Defaults to `0.9`.

* `:eps` - numerical stability term. Defaults to `1.0e-8`.
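
An RMSProp-style optimizer can be sketched by following this transform with a negative step size (values are illustrative):

Axon.Updates.scale_by_rms(decay: 0.9)
|> Axon.Updates.scale(-1.0e-3)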

References

scale_by_rms(combinator, opts)

scale_by_rss(combinator_or_opts \\ [])

Scales input by the root of all prior squared inputs.

Options

* `:eps` - numerical stability term. Defaults to `1.0e-7`
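
An AdaGrad-style optimizer can be sketched by pairing this transform with a negative step size (the step size is illustrative):

Axon.Updates.scale_by_rss()
|> Axon.Updates.scale(-1.0e-2)
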
scale_by_rss(combinator, opts)

scale_by_schedule(combinator \\ identity(), schedule_fn)

Scales input using the given schedule function.

This can be useful for implementing learning rate schedules. The number of update iterations is tracked by an internal counter that increments once per update step, so you might need to adjust your schedule to operate per batch rather than per epoch.
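
The schedule function receives the current step count and returns the scale to apply to the updates; a rough sketch with a simple 1/(t + 1) decay (the exact schedule and step size are illustrative):

schedule_fn = fn count -> Nx.divide(-1.0e-2, Nx.add(count, 1)) end

Axon.Updates.scale_by_adam()
|> Axon.Updates.scale_by_schedule(schedule_fn)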

scale_by_state(combinator_or_step)

Scales input by a tunable learning rate which can be manipulated by external APIs such as Axon's Loop API.

$$f(x_i) = \alpha x_i$$

scale_by_state(combinator, step)

scale_by_stddev(combinator_or_opts \\ [])

Scales input by the root of the centered EMA of squared inputs.

Options

* `:decay` - EMA decay rate. Defaults to `0.9`.

* `:eps` - numerical stability term. Defaults to `1.0e-8`.

References

scale_by_stddev(combinator, opts)

scale_by_trust_ratio(combinator_or_opts \\ [])

Scales input by trust ratio.

Options

* `:min_norm` - Min norm to clip. Defaults to
  `0.0`.

* `:trust_coefficient` - Trust coefficient. Defaults
  to `1.0`.

* `:eps` - Numerical stability term. Defaults to `0.0`.
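
Trust-ratio scaling is typically layered on top of another transform; a rough sketch of a LAMB-style chain (option values are illustrative):

Axon.Updates.scale_by_adam()
|> Axon.Updates.scale_by_trust_ratio(min_norm: 1.0e-6)
|> Axon.Updates.scale(-1.0e-3)
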
scale_by_trust_ratio(combinator, opts)

scale_by_yogi(combinator_or_opts \\ [])

Scales input according to the Yogi algorithm.

Options

* `:initial_accumulator_value` - Initial state accumulator value.

* `:b1` - first moment decay. Defaults to `0.9`

* `:b2` - second moment decay. Defaults to `0.999`

* `:eps` - numerical stability term. Defaults to `1.0e-8`

* `:eps_root` - numerical stability term. Defaults to `0.0`

References

* [Adaptive Methods for Nonconvex Optimization](https://proceedings.neurips.cc/paper/2018/file/90365351ccc7437a1309dc64e4db32a3-Paper.pdf)

scale_by_yogi(combinator, opts)

stateful(arg \\ identity(), init_fn, apply_fn)

Represents a stateful update.

Stateful updates require some update state, such as momentum or RMS of previous updates. Therefore you must implement some initialization function as well as an update function.

stateless(arg \\ identity(), apply_fn)

Represents a stateless update.

Stateless updates do not depend on an update state and thus only require an implementation of an update function.

trace(combinator_or_opts \\ [])

Traces inputs with past inputs.

Options

* `:decay` - decay rate for tracing past updates. Defaults to `0.9`

* `:nesterov` - whether to use Nesterov momentum. Defaults to `false`
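
Tracing is the building block for momentum; a rough sketch of SGD with momentum (the decay and step size are illustrative):

Axon.Updates.trace(decay: 0.9, nesterov: false)
|> Axon.Updates.scale(-1.0e-2)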