# `Edifice.Meta.RLHFHead`
[🔗](https://github.com/blasphemetheus/edifice/blob/main/lib/edifice/meta/rlhf_head.ex#L1)

RLHF heads: reward model and DPO preference heads for alignment.

Provides composable head modules for Reinforcement Learning from Human
Feedback (RLHF) pipelines. Two head types are supported:

## Reward Head (`:reward`)

Maps a sequence to a scalar reward value per batch element:

```
Input [batch, seq, input_size]
      |
      v
Dense(hidden) -> SiLU -> Dense(1) -> Squeeze -> Mean Pool
      |
      v
Output [batch]
```
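
For readers wiring this up by hand, here is a minimal sketch of the same pipeline in raw Axon. The layer names and the squeeze/pool implementation are assumptions for illustration, not the module's actual internals:

```elixir
input_size = 256
hidden_size = 256

reward_head =
  Axon.input("state_sequence", shape: {nil, nil, input_size})
  |> Axon.dense(hidden_size)
  |> Axon.activation(:silu)
  |> Axon.dense(1)
  # [batch, seq, 1] -> squeeze -> [batch, seq] -> mean over seq -> [batch]
  |> Axon.nx(fn t -> t |> Nx.squeeze(axes: [2]) |> Nx.mean(axes: [1]) end)
```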

## DPO Head (`:dpo`)

Takes two inputs (`"chosen"` and `"rejected"`) and computes the preference
logit `chosen_reward - rejected_reward` for Direct Preference Optimization:

```
Chosen   [batch, seq, input_size] -> Reward Head -> chosen_score
Rejected [batch, seq, input_size] -> Reward Head -> rejected_score
      |
      v
Output = chosen_score - rejected_score  [batch]
```
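
A hedged sketch of this wiring in Axon. Reusing the same layer names on both branches is meant to tie the two towers' weights, so that a single reward model scores both sequences; the names themselves are illustrative:

```elixir
input_size = 256
hidden_size = 256

# Shared scoring tower: same layer names on both branches tie the weights
# (an assumption about this module's design, not confirmed internals).
score = fn input ->
  input
  |> Axon.dense(hidden_size, name: "reward_hidden")
  |> Axon.activation(:silu)
  |> Axon.dense(1, name: "reward_out")
  |> Axon.nx(fn t -> t |> Nx.squeeze(axes: [2]) |> Nx.mean(axes: [1]) end)
end

chosen = Axon.input("chosen", shape: {nil, nil, input_size})
rejected = Axon.input("rejected", shape: {nil, nil, input_size})

# Preference logit: chosen_score - rejected_score, shape [batch]
dpo_head = Axon.subtract(score.(chosen), score.(rejected))
```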

## Usage

    alias Edifice.Meta.RLHFHead

    # Reward head
    model = RLHFHead.build(input_size: 256, head_type: :reward)

    # DPO head
    model = RLHFHead.build(input_size: 256, head_type: :dpo)
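
A hedged sketch of running the `:reward` variant end to end with stock Axon; the shapes are arbitrary and the input name follows `build/1` below:

    # Using the :reward model from the first example above
    {init_fn, predict_fn} = Axon.build(model)

    template = %{"state_sequence" => Nx.template({8, 32, 256}, :f32)}
    params = init_fn.(template, %{})

    inputs = %{"state_sequence" => Nx.iota({8, 32, 256}, type: :f32)}
    rewards = predict_fn.(params, inputs)
    # rewards has shape {8}: one scalar reward per batch element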

## References
- Ouyang et al., "Training language models to follow instructions with human feedback" (2022)
- Rafailov et al., "Direct Preference Optimization: Your Language Model Is Secretly a Reward Model" (2023)

# `build_opt`

```elixir
@type build_opt() ::
  {:dropout, float()}
  | {:head_type, :reward | :dpo}
  | {:hidden_size, pos_integer()}
  | {:input_size, pos_integer()}
```

Options for `build/1`.

# `build`

```elixir
@spec build([build_opt()]) :: Axon.t()
```

Build an RLHF head.

## Options
  - `:input_size` - Input feature dimension (required)
  - `:hidden_size` - Hidden layer dimension (default: 256)
  - `:head_type` - Head type: `:reward` or `:dpo` (default: `:reward`)
  - `:dropout` - Dropout rate (default: 0.1)

## Returns
  An Axon model. For `:reward`, input is `"state_sequence"` and output is `[batch]`.
  For `:dpo`, inputs are `"chosen"` and `"rejected"`, output is `[batch]`.

# `build_dpo_head`

```elixir
@spec build_dpo_head(keyword()) :: Axon.t()
```

Build a DPO preference head that computes `chosen_reward - rejected_reward`.

- Inputs: `"chosen"` and `"rejected"`, each `[batch, seq, input_size]`
- Output: `[batch]` (preference logit)
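
The preference logit plugs straight into a Bradley-Terry style objective. A minimal sketch, assuming a `beta` temperature that is not part of this module's API:

```elixir
defmodule MyDPOLoss do
  import Nx.Defn

  # -log(sigmoid(beta * (chosen_score - rejected_score))), averaged over batch.
  defn compute(preference_logit, beta) do
    -Nx.mean(Nx.log(Nx.sigmoid(beta * preference_logit)))
  end
end
```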

# `build_reward_head`

```elixir
@spec build_reward_head(keyword()) :: Axon.t()
```

Build a reward head that maps sequences to scalar rewards.

- Input: `"state_sequence"` `[batch, seq, input_size]`
- Output: `[batch]`
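
Assuming `build_reward_head/1` accepts the same options as `build/1`, one way to confirm the scalar-per-batch output is Axon's shape tracing:

```elixir
head = Edifice.Meta.RLHFHead.build_reward_head(input_size: 256)

Axon.get_output_shape(head, %{"state_sequence" => Nx.template({4, 32, 256}, :f32)})
#=> {4}
```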

# `output_size`

```elixir
@spec output_size(keyword()) :: pos_integer()
```

Get the output size of an RLHF head. Both head types emit one scalar per batch element, so the size is always `1`.

# `recommended_defaults`

```elixir
@spec recommended_defaults() :: keyword()
```

Get recommended defaults for RLHF heads.
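
A likely usage pattern, assuming the returned keyword list holds `build/1` options missing only the required `:input_size`:

```elixir
opts = Keyword.merge(Edifice.Meta.RLHFHead.recommended_defaults(), input_size: 256)
model = Edifice.Meta.RLHFHead.build(opts)
```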

---

*Consult [api-reference.md](api-reference.md) for complete listing*
