RLHF heads: reward model and DPO preference heads for alignment.
Provides composable head modules for Reinforcement Learning from Human Feedback (RLHF) pipelines. Two head types are supported:
Reward Head (:reward)
Maps a sequence to a scalar reward value per batch element:
Input [batch, seq, input_size]
|
v
Dense(hidden) -> SiLU -> Dense(1) -> Squeeze -> Mean Pool
|
v
Output [batch]
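For orientation, the diagram above maps onto raw Axon roughly as follows. This is a sketch, not the module's exact internals: the dropout layer controlled by the :dropout option is omitted, and the hidden size and shapes are illustrative.
input = Axon.input("state_sequence", shape: {nil, nil, 256})

reward =
  input
  |> Axon.dense(256, activation: :silu)    # Dense(hidden) -> SiLU
  |> Axon.dense(1)                         # Dense(1): [batch, seq, 1]
  |> Axon.nx(&Nx.squeeze(&1, axes: [2]))   # Squeeze: [batch, seq]
  |> Axon.nx(&Nx.mean(&1, axes: [1]))      # Mean pool over seq: [batch]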
DPO Head (:dpo)
Takes two inputs ("chosen" and "rejected") and computes the preference logit (chosen_reward - rejected_reward) for Direct Preference Optimization:
Chosen [batch, seq, input_size] -> Reward Head -> chosen_score
Rejected [batch, seq, input_size] -> Reward Head -> rejected_score
|
v
Output = chosen_score - rejected_score [batch]
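That [batch] preference logit is what the DPO objective consumes: the loss is -log sigmoid(logit), which pushes chosen rewards above rejected ones. A minimal Nx sketch (the usual beta temperature and reference-model terms of full DPO are omitted):
defmodule DPOLoss do
  import Nx.Defn

  # loss = mean over batch of -log(sigmoid(chosen_score - rejected_score))
  defn loss(preference_logit) do
    Nx.negate(Nx.mean(Nx.log(Nx.sigmoid(preference_logit))))
  end
end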
Usage
# Reward head
model = RLHFHead.build(input_size: 256, head_type: :reward)
# DPO head
model = RLHFHead.build(input_size: 256, head_type: :dpo)
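Running the reward head end to end looks like this (a sketch; batch size 8 and sequence length 32 are illustrative):
model = RLHFHead.build(input_size: 256, head_type: :reward)
{init_fn, predict_fn} = Axon.build(model)

# Template shape [batch, seq, input_size]
template = %{"state_sequence" => Nx.template({8, 32, 256}, :f32)}
params = init_fn.(template, %{})

states = Nx.broadcast(0.0, {8, 32, 256})
rewards = predict_fn.(params, %{"state_sequence" => states})  # shape [8]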
References
- Ouyang et al., "Training language models to follow instructions with human feedback" (2022)
- Rafailov et al., "Direct Preference Optimization: Your Language Model Is Secretly a Reward Model" (2023)
Summary
Functions
Build an RLHF head.
Build a DPO preference head that computes chosen_reward - rejected_reward.
Build a reward head that maps sequences to scalar rewards.
Get the output size of an RLHF head (always scalar per batch).
Get recommended defaults for RLHF heads.
Types
@type build_opt() :: {:dropout, float()} | {:head_type, :reward | :dpo} | {:hidden_size, pos_integer()} | {:input_size, pos_integer()}
Options for build/1.
Functions
Build an RLHF head.
Options
:input_size - Input feature dimension (required)
:hidden_size - Hidden layer dimension (default: 256)
:head_type - Head type: :reward or :dpo (default: :reward)
:dropout - Dropout rate (default: 0.1)
Returns
An Axon model. For :reward, input is "state_sequence" and output is [batch].
For :dpo, inputs are "chosen" and "rejected", output is [batch].
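For the :dpo case, both named inputs are supplied as a map. A sketch (batch and sequence sizes are illustrative):
model = RLHFHead.build(input_size: 256, head_type: :dpo)
{init_fn, predict_fn} = Axon.build(model)

templates = %{
  "chosen" => Nx.template({4, 32, 256}, :f32),
  "rejected" => Nx.template({4, 32, 256}, :f32)
}
params = init_fn.(templates, %{})

logit =
  predict_fn.(params, %{
    "chosen" => Nx.broadcast(0.0, {4, 32, 256}),
    "rejected" => Nx.broadcast(0.0, {4, 32, 256})
  })
# shape [4]; positive entries mean "chosen" scored higher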
Build a DPO preference head that computes chosen_reward - rejected_reward.
Inputs: "chosen" and "rejected" [batch, seq, input_size]
Output: [batch] (preference logit)
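Because -log sigmoid(logit) equals binary cross-entropy against a target of 1 computed from logits, the head trains with a stock Axon loop. A sketch, assuming a recent Axon paired with Polaris optimizers; the constant data stream below stands in for a real chosen/rejected dataset:
model = RLHFHead.build(input_size: 256, head_type: :dpo)

# -log(sigmoid(logit)) == binary cross-entropy with target 1, from logits
loss = fn y_true, y_pred ->
  Axon.Losses.binary_cross_entropy(y_true, y_pred,
    from_logits: true,
    reduction: :mean
  )
end

batch = %{
  "chosen" => Nx.broadcast(0.0, {4, 32, 256}),
  "rejected" => Nx.broadcast(0.0, {4, 32, 256})
}
data = Stream.repeatedly(fn -> {batch, Nx.broadcast(1.0, {4})} end) |> Stream.take(16)

loop = Axon.Loop.trainer(model, loss, Polaris.Optimizers.adam(learning_rate: 1.0e-5))
state = Axon.Loop.run(loop, data, %{}, epochs: 1)
# trained parameters live in the returned loop state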
Build a reward head that maps sequences to scalar rewards.
Input: "state_sequence" [batch, seq, input_size]
Output: [batch]
@spec output_size(keyword()) :: pos_integer()
Get the output size of an RLHF head (always scalar per batch).
@spec recommended_defaults() :: keyword()
Get recommended defaults for RLHF heads.
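Since recommended_defaults/0 returns a keyword list shaped like build/1's options, the two compose naturally. A sketch, assuming the defaults leave the required :input_size to the caller:
opts = Keyword.merge(RLHFHead.recommended_defaults(), input_size: 512)
model = RLHFHead.build(opts)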