Policy-Value network for reinforcement learning.
Shared-trunk actor-critic architecture with separate policy and value heads. Suitable for PPO, A2C, and other policy gradient methods.
Architecture
Input [batch, input_size]
          |
+==================+
|   Shared Trunk   |
|   dense → GELU   |
|   dense → GELU   |
+==================+
          |
    +-----+-----+
    |           |
    v           v
 Policy       Value
  Head         Head
    |           |
    v           v
[batch,      [batch]
 action_size]

Action Types
:discrete   - Policy outputs softmax probabilities over discrete actions
:continuous - Policy outputs tanh-squashed values in [-1, 1]
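The sketch below shows how build/1 might assemble this architecture in Axon. It is illustrative only: the input name "state", the Nx.squeeze step on the value head, and the exact layer arrangement are assumptions, not this module's actual source.

def build(opts) do
  input_size = Keyword.fetch!(opts, :input_size)
  action_size = Keyword.fetch!(opts, :action_size)
  action_type = Keyword.get(opts, :action_type, :discrete)
  hidden_size = Keyword.get(opts, :hidden_size, 64)

  # Shared trunk: two dense layers with GELU activations.
  trunk =
    Axon.input("state", shape: {nil, input_size})
    |> Axon.dense(hidden_size, activation: :gelu)
    |> Axon.dense(hidden_size, activation: :gelu)

  # Policy head: softmax over actions (:discrete) or tanh squashing (:continuous).
  policy =
    case action_type do
      :discrete -> Axon.dense(trunk, action_size, activation: :softmax)
      :continuous -> Axon.dense(trunk, action_size, activation: :tanh)
    end

  # Value head: scalar state value, squeezed from [batch, 1] to [batch].
  value =
    trunk
    |> Axon.dense(1)
    |> Axon.nx(&Nx.squeeze(&1, axes: [-1]))

  # Single model with two named outputs, as described under Returns.
  Axon.container(%{policy: policy, value: value})
end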
Returns
An Axon model outputting %{policy: ..., value: ...} via Axon.container.
Usage
model = PolicyValue.build(
  input_size: 64,
  action_size: 4,
  action_type: :discrete,
  hidden_size: 128
)

For a complete PPO training loop, see the exphil project, which builds on these primitives.
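Once built, the model runs through Axon's standard init/predict workflow. A minimal sketch, assuming a batch of 8 observations and the hyperparameters from the example above (the empty map passed to init_fn may differ across Axon versions):

{init_fn, predict_fn} = Axon.build(model)
params = init_fn.(Nx.template({1, 64}, :f32), %{})

obs = Nx.broadcast(0.0, {8, 64})
%{policy: probs, value: values} = predict_fn.(params, obs)
# probs:  shape {8, 4}; each row sums to 1 for :discrete
# values: shape {8}, one state-value estimate per observation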
References
- Schulman et al., "Proximal Policy Optimization Algorithms" (2017)
- Mnih et al., "Asynchronous Methods for Deep Reinforcement Learning" (A3C, 2016)
Summary
Types
@type build_opt() ::
        {:input_size, pos_integer()}
        | {:action_size, pos_integer()}
        | {:action_type, :discrete | :continuous}
        | {:hidden_size, pos_integer()}
Options for build/1.
Functions
build(opts)

Build a policy-value network.
Options
:input_size  - Input observation dimension (required)
:action_size - Number of actions (discrete) or action dimensions (continuous) (required)
:action_type - :discrete or :continuous (default: :discrete)
:hidden_size - Hidden layer size (default: 64)
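For contrast with the discrete Usage example above, a hypothetical continuous configuration (the values here are illustrative):

model = PolicyValue.build(
  input_size: 24,
  action_size: 6,
  action_type: :continuous
)
# hidden_size is omitted, so it falls back to the default of 64;
# the policy head now emits 6 tanh-squashed values in [-1, 1].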
Returns
An Axon model outputting %{policy: [batch, action_size], value: [batch]}.
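For :discrete policies, actions can be drawn from the returned probabilities in plain Nx, for example with the Gumbel-max trick. A sketch, where probs is the policy output from a forward pass (nothing here is part of this module):

key = Nx.Random.key(0)
{u, _key} = Nx.Random.uniform(key, shape: Nx.shape(probs))

# argmax(log p + Gumbel noise) samples from Categorical(p).
gumbel = Nx.negate(Nx.log(Nx.negate(Nx.log(u))))
actions = Nx.argmax(Nx.add(Nx.log(probs), gumbel), axis: -1)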
output_size(opts)

@spec output_size(keyword()) :: pos_integer()

Get the output size of the policy head (the :action_size option).
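Presumably this reads :action_size back out of the options; a hypothetical call:

PolicyValue.output_size(action_size: 4)
#=> 4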