Edifice.Meta.RLHFHead (Edifice v0.2.0)


RLHF heads: reward model and DPO preference heads for alignment.

Provides composable head modules for Reinforcement Learning from Human Feedback (RLHF) pipelines. Two head types are supported:

Reward Head (:reward)

Maps a sequence to a scalar reward value per batch element:

Input [batch, seq, input_size]
      |
      v
Dense(hidden) -> SiLU -> Dense(1) -> Squeeze -> Mean Pool
      |
      v
Output [batch]
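For intuition, here is a minimal Axon sketch of the same graph. The layer composition mirrors the diagram above; the dropout placement and the use of Axon.nx for the squeeze and mean pool are assumptions, not Edifice internals:

# Hedged sketch of the :reward graph (not Edifice's actual implementation)
input = Axon.input("state_sequence", shape: {nil, nil, 256})

reward =
  input
  |> Axon.dense(256, activation: :silu)   # Dense(hidden) -> SiLU
  |> Axon.dropout(rate: 0.1)              # assumed placement of :dropout
  |> Axon.dense(1)                        # Dense(1): [batch, seq, 1]
  |> Axon.nx(fn t ->
    t
    |> Nx.squeeze(axes: [-1])             # Squeeze: [batch, seq]
    |> Nx.mean(axes: [-1])                # Mean Pool: [batch]
  end)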

DPO Head (:dpo)

Takes two inputs ("chosen" and "rejected") and computes the preference logit chosen_reward - rejected_reward for Direct Preference Optimization (DPO):

Chosen   [batch, seq, input_size] -> Reward Head -> chosen_score
Rejected [batch, seq, input_size] -> Reward Head -> rejected_score
      |
      v
Output = chosen_score - rejected_score  [batch]
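This logit is exactly what the DPO objective from Rafailov et al. consumes: loss = -log(sigmoid(beta * logit)). A hedged Nx sketch (beta is the DPO temperature from the paper, not an Edifice option):

# -mean(log(sigmoid(beta * (chosen_score - rejected_score))))
dpo_loss = fn preference_logits, beta ->
  preference_logits
  |> Nx.multiply(beta)
  |> Nx.sigmoid()
  |> Nx.log()
  |> Nx.negate()
  |> Nx.mean()
end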

Usage

# Reward head
model = RLHFHead.build(input_size: 256, head_type: :reward)

# DPO head
model = RLHFHead.build(input_size: 256, head_type: :dpo)
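A quick smoke test of the reward head (shapes are arbitrary dummies; the init/predict pattern here is standard Axon usage, not Edifice-specific):

model = RLHFHead.build(input_size: 256, head_type: :reward)
{init_fn, predict_fn} = Axon.build(model)

# Initialize parameters from a shape template, then run a dummy batch.
template = %{"state_sequence" => Nx.template({4, 16, 256}, :f32)}
params = init_fn.(template, %{})

input = %{"state_sequence" => Nx.broadcast(0.5, {4, 16, 256})}
rewards = predict_fn.(params, input)   # shape {4}: one scalar reward per batch element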

References

  • Ouyang et al., "Training language models to follow instructions with human feedback" (2022)
  • Rafailov et al., "Direct Preference Optimization" (2023)

Summary

Types

build_opt()
Options for build/1.

Functions

build(opts \\ [])
Build an RLHF head.

build_dpo_head(opts)
Build a DPO preference head that computes chosen_reward - rejected_reward.

build_reward_head(opts)
Build a reward head that maps sequences to scalar rewards.

output_size(opts \\ [])
Get the output size of an RLHF head (always scalar per batch).

Get recommended defaults for RLHF heads.

Types

build_opt()

@type build_opt() ::
  {:dropout, float()}
  | {:head_type, :reward | :dpo}
  | {:hidden_size, pos_integer()}
  | {:input_size, pos_integer()}

Options for build/1.

Functions

build(opts \\ [])

@spec build([build_opt()]) :: Axon.t()

Build an RLHF head.

Options

  • :input_size - Input feature dimension (required)
  • :hidden_size - Hidden layer dimension (default: 256)
  • :head_type - Head type: :reward or :dpo (default: :reward)
  • :dropout - Dropout rate (default: 0.1)

Returns

An Axon model. For :reward, the input is "state_sequence" and the output is [batch]. For :dpo, the inputs are "chosen" and "rejected", and the output is [batch] (chosen_reward - rejected_reward).

build_dpo_head(opts)

@spec build_dpo_head(keyword()) :: Axon.t()

Build a DPO preference head that computes chosen_reward - rejected_reward.

Inputs: "chosen" and "rejected" [batch, seq, input_size] Output: [batch] (preference logit)

build_reward_head(opts)

@spec build_reward_head(keyword()) :: Axon.t()

Build a reward head that maps sequences to scalar rewards.

Input: "state_sequence" [batch, seq, input_size] Output: [batch]

output_size(opts \\ [])

@spec output_size(keyword()) :: pos_integer()

Get the output size of an RLHF head (always scalar per batch).
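Presumably this returns 1 for either head type, since both emit one scalar per batch element:

RLHFHead.output_size()   #=> 1 (assumed from "always scalar per batch")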