Policy-Value network for reinforcement learning.
Shared-trunk actor-critic architecture with separate policy and value heads. Suitable for PPO, A2C, and other policy gradient methods.
Architecture
Input [batch, input_size]
          |
+==================+
|   Shared Trunk   |
|   dense → GELU   |
|   dense → GELU   |
+==================+
          |
    +-----+-----+
    |           |
    v           v
 Policy       Value
  Head         Head
    |           |
    v           v
[batch,      [batch]
 action_size]

Action Types
:discrete   - Policy outputs softmax probabilities over discrete actions
:continuous - Policy outputs tanh-squashed values in [-1, 1]
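The sketch below shows how build/1 might assemble this architecture in Axon. It is illustrative only: the input name "state", the Nx.squeeze step on the value head, and the exact layer arrangement are assumptions, not this module's actual source.

def build(opts) do
  input_size = Keyword.fetch!(opts, :input_size)
  action_size = Keyword.fetch!(opts, :action_size)
  action_type = Keyword.get(opts, :action_type, :discrete)
  hidden_size = Keyword.get(opts, :hidden_size, 64)

  # Shared trunk: two dense layers with GELU activations.
  trunk =
    Axon.input("state", shape: {nil, input_size})
    |> Axon.dense(hidden_size, activation: :gelu)
    |> Axon.dense(hidden_size, activation: :gelu)

  # Policy head: softmax over actions (:discrete) or tanh squashing (:continuous).
  policy =
    case action_type do
      :discrete -> Axon.dense(trunk, action_size, activation: :softmax)
      :continuous -> Axon.dense(trunk, action_size, activation: :tanh)
    end

  # Value head: scalar state value, squeezed from [batch, 1] to [batch].
  value =
    trunk
    |> Axon.dense(1)
    |> Axon.nx(&Nx.squeeze(&1, axes: [-1]))

  # Single model with two named outputs, as described under Returns.
  Axon.container(%{policy: policy, value: value})
end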
Returns
An Axon model outputting %{policy: ..., value: ...} via Axon.container.
Usage
model = PolicyValue.build(
  input_size: 64,
  action_size: 4,
  action_type: :discrete,
  hidden_size: 128
)

For a complete PPO training loop, see the exphil project, which builds on these primitives.
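Once built, the model runs through Axon's standard init/predict workflow. A minimal sketch, assuming a batch of 8 observations and the hyperparameters from the example above (the empty map passed to init_fn may differ across Axon versions):

{init_fn, predict_fn} = Axon.build(model)
params = init_fn.(Nx.template({1, 64}, :f32), %{})

obs = Nx.broadcast(0.0, {8, 64})
%{policy: probs, value: values} = predict_fn.(params, obs)
# probs:  shape {8, 4}; each row sums to 1 for :discrete
# values: shape {8}, one state-value estimate per observation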
References
- Schulman et al., "Proximal Policy Optimization Algorithms" (2017)
- Mnih et al., "Asynchronous Methods for Deep Reinforcement Learning" (A3C, 2016)
Summary
Types
@type build_opt() ::
        {:input_size, pos_integer()}
        | {:action_size, pos_integer()}
        | {:action_type, :discrete | :continuous}
        | {:hidden_size, pos_integer()}
Options for build/1.
Functions
build(opts)

Build a policy-value network.
Options
:input_size  - Input observation dimension (required)
:action_size - Number of actions (discrete) or action dimensions (continuous) (required)
:action_type - :discrete or :continuous (default: :discrete)
:hidden_size - Hidden layer size (default: 64)
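For contrast with the discrete Usage example above, a hypothetical continuous configuration (the values here are illustrative):

model = PolicyValue.build(
  input_size: 24,
  action_size: 6,
  action_type: :continuous
)
# hidden_size is omitted, so it falls back to the default of 64;
# the policy head now emits 6 tanh-squashed values in [-1, 1].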
Returns
An Axon model outputting %{policy: [batch, action_size], value: [batch]}.
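For :discrete policies, actions can be drawn from the returned probabilities in plain Nx, for example with the Gumbel-max trick. A sketch, where probs is the policy output from a forward pass (nothing here is part of this module):

key = Nx.Random.key(0)
{u, _key} = Nx.Random.uniform(key, shape: Nx.shape(probs))

# argmax(log p + Gumbel noise) samples from Categorical(p).
gumbel = Nx.negate(Nx.log(Nx.negate(Nx.log(u))))
actions = Nx.argmax(Nx.add(Nx.log(probs), gumbel), axis: -1)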
output_size(opts)

@spec output_size(keyword()) :: pos_integer()

Get the output size of the policy head (the :action_size option).
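Presumably this reads :action_size back out of the options; a hypothetical call:

PolicyValue.output_size(action_size: 4)
#=> 4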