Edifice.RL.PolicyValue (Edifice v0.2.0)


Policy-Value network for reinforcement learning.

Shared-trunk actor-critic architecture with separate policy and value heads. Suitable for PPO, A2C, and other policy gradient methods.

Architecture

Input [batch, input_size]
      |
+==================+
|   Shared Trunk   |
|   dense → GELU   |
|   dense → GELU   |
+==================+
      |
+-----+-----+
|           |
v           v
Policy     Value
Head       Head
|           |
v           v
[batch,    [batch]
action_size]

Action Types

  • :discrete — Policy outputs softmax probabilities over discrete actions
  • :continuous — Policy outputs tanh-squashed values in [-1, 1]
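As a sketch, the two action types differ only in the policy head's final activation; the value head is the same either way. Assuming the trunk output is an Axon graph node and using Axon's standard `dense/3` API (this is illustrative, not the library's source):

    # Hypothetical construction of the policy head for each action type.
    policy_head =
      case action_type do
        :discrete ->
          # probabilities over `action_size` discrete actions
          Axon.dense(trunk, action_size, activation: :softmax)

        :continuous ->
          # tanh squashes each action dimension into [-1, 1]
          Axon.dense(trunk, action_size, activation: :tanh)
      end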

Returns

An Axon model outputting %{policy: ..., value: ...} via Axon.container.

Usage

model = PolicyValue.build(
  input_size: 64,
  action_size: 4,
  action_type: :discrete,
  hidden_size: 128
)

For a complete PPO training loop, see the exphil project, which builds on these primitives.
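To illustrate inference with the built model, here is a hedged sketch using standard Axon/Nx APIs (the dummy observation batch and shapes are assumptions for the example):

    model =
      Edifice.RL.PolicyValue.build(
        input_size: 64,
        action_size: 4,
        action_type: :discrete,
        hidden_size: 128
      )

    # Compile the graph into init/predict functions and run a batch.
    {init_fn, predict_fn} = Axon.build(model)
    params = init_fn.(Nx.template({1, 64}, :f32), %{})

    obs = Nx.broadcast(0.0, {8, 64})
    %{policy: policy, value: value} = predict_fn.(params, obs)
    # policy has shape {8, 4} (action probabilities);
    # value has shape {8} (state-value estimates)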

References

  • Schulman et al., "Proximal Policy Optimization Algorithms" (2017)
  • Mnih et al., "Asynchronous Methods for Deep Reinforcement Learning" (A3C, 2016)

Summary

Types

Options for build/1.

Functions

Build a policy-value network.

Get the output size (action_size for policy head).

Types

build_opt()

@type build_opt() ::
  {:input_size, pos_integer()}
  | {:action_size, pos_integer()}
  | {:action_type, :discrete | :continuous}
  | {:hidden_size, pos_integer()}

Options for build/1.

Functions

build(opts \\ [])

@spec build([build_opt()]) :: Axon.t()

Build a policy-value network.

Options

  • :input_size - Input observation dimension (required)
  • :action_size - Number of actions (discrete) or action dimensions (continuous) (required)
  • :action_type - :discrete or :continuous (default: :discrete)
  • :hidden_size - Hidden layer size (default: 64)

Returns

An Axon model outputting %{policy: [batch, action_size], value: [batch]}.

output_size(opts \\ [])

@spec output_size(keyword()) :: pos_integer()

Get the output size (action_size for policy head).
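For illustration, this presumably reads `:action_size` from the options, so downstream buffers can be sized without building the model (a usage sketch, not the function's source):

    PolicyValue.output_size(action_size: 4)
    # expected to return 4, the policy head's output dimension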