# `Edifice.Robotics.OpenVLA`
[🔗](https://github.com/blasphemetheus/edifice/blob/main/lib/edifice/robotics/openvla.ex#L1)

OpenVLA: Open-Source Vision-Language-Action Model.

OpenVLA is a vision-language-action model for robot manipulation that combines
a DINOv2 vision encoder with a Llama-style language model backbone to predict
discretized robot actions from images and text instructions.

## Key Innovation: Action Tokenization

Instead of predicting continuous actions directly, OpenVLA discretizes each
action dimension into bins (default 256), treating robot control as a
sequence-to-sequence problem. The model autoregressively generates action
tokens conditioned on visual and language tokens.

## Architecture

```
Image [batch, C, H, W]           Text "pick up the red block"
      |                                    |
      v                                    v
+================+                  +================+
|   DINOv2 ViT   |                  |  LLM Tokenizer |
+================+                  +================+
      |                                    |
      v                                    v
Visual Tokens                       Text Token IDs
[batch, num_patches, vision_dim]   [batch, text_len]
      |                                    |
      v                                    v
[Vision-LM Projection]             [LLM Embedding]
[batch, num_patches, hidden_dim]   [batch, text_len, hidden_dim]
      |                                    |
      +---------------+--------------------+
                      |
                      v
            [Concatenate: visual | text]
            [batch, num_patches + text_len, hidden_dim]
                      |
                      v
            +====================+
            |  Llama-style LLM   |
            |  (Decoder-Only)    |
            |  with Causal Mask  |
            +====================+
                      |
                      v
            Action Token Logits
            [batch, action_dim, num_bins]
                      |
                      v
            Argmax -> Detokenize
                      |
                      v
            Continuous Actions
            [batch, action_dim]
```
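
The fusion step maps both modalities into the LLM embedding space and
concatenates them along the sequence axis. A minimal Axon sketch of just
that step, with illustrative shapes and layer names (the module's actual
internals may differ):

```elixir
hidden_dim = 2048

# Illustrative inputs: 256 pre-computed DINOv2 patch tokens, 64 text token ids
visual_tokens = Axon.input("visual_tokens", shape: {nil, 256, 384})
text_ids = Axon.input("text_tokens", shape: {nil, 64})

# Vision-LM projection: vision_dim (384) -> hidden_dim (2048)
projected = Axon.dense(visual_tokens, hidden_dim, name: "vision_lm_projection")

# LLM embedding lookup (32_000 is a placeholder vocab size)
embedded = Axon.embedding(text_ids, 32_000, hidden_dim, name: "llm_embedding")

# [batch, 256 + 64, hidden_dim], ready for the causal decoder stack
fused = Axon.concatenate([projected, embedded], axis: 1)
```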

## Action Tokenization

For action dimension `d` with range `[a_min, a_max]`:
- Tokenize: `bin = floor((action - a_min) / (a_max - a_min) * (num_bins - 1))`
- Detokenize: `action = a_min + bin / (num_bins - 1) * (a_max - a_min)`

Default: 256 bins per dimension, covering the normalized range [-1, 1].
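
For example, with the defaults (256 bins over `[-1, 1]`), the value 0.5
round-trips as:

```
bin    = floor((0.5 - (-1)) / 2 * 255) = floor(191.25) = 191
decode = -1 + 191 / 255 * 2           ≈ 0.498
```

The reconstruction error is bounded by one bin step,
`(a_max - a_min) / (num_bins - 1) ≈ 0.0078`.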

## Usage

    model = OpenVLA.build(
      image_size: 224,
      vision_encoder: :dino_v2,
      action_dim: 7,
      num_action_bins: 256,
      hidden_dim: 2048,
      num_layers: 24,
      num_heads: 16
    )

    # Forward pass: image + text -> action logits
    action_logits = Axon.predict(model, params, %{"image" => img, "text_tokens" => tokens})

    # Tokenize ground truth actions for training
    action_tokens = OpenVLA.tokenize_actions(actions, 256)

    # Compute loss (cross-entropy on action tokens only)
    loss = OpenVLA.vla_loss(action_logits, action_tokens)

    # Detokenize predictions for inference
    actions = OpenVLA.detokenize_actions(predicted_tokens, 256)
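
For training, these pieces compose with `Axon.Loop`. A hedged sketch, assuming
`train_data` streams `{inputs_map, action_tokens}` batches and that the
optimizer comes from Polaris (as bundled with recent Axon releases):

    # Axon passes losses as (y_true, y_pred); vla_loss/2 takes (logits, targets)
    loss_fn = fn targets, logits -> OpenVLA.vla_loss(logits, targets) end

    loop = Axon.Loop.trainer(model, loss_fn, Polaris.Optimizers.adam(learning_rate: 1.0e-4))
    trained_params = Axon.Loop.run(loop, train_data, %{}, epochs: 10)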

## References

- Paper: "OpenVLA: An Open-Source Vision-Language-Action Model"
  (Kim et al., 2024) - https://arxiv.org/abs/2406.09246
- Project: https://openvla.github.io/

# `build_opt`

```elixir
@type build_opt() ::
  {:image_size, pos_integer()}
  | {:patch_size, pos_integer()}
  | {:in_channels, pos_integer()}
  | {:vision_dim, pos_integer()}
  | {:hidden_dim, pos_integer()}
  | {:num_layers, pos_integer()}
  | {:num_heads, pos_integer()}
  | {:num_kv_heads, pos_integer()}
  | {:action_dim, pos_integer()}
  | {:num_action_bins, pos_integer()}
  | {:max_text_len, pos_integer()}
  | {:dropout, float()}
  | {:rope, boolean()}
```

Options for `build/1`.

# `build`

```elixir
@spec build([build_opt()]) :: Axon.t()
```

Build an OpenVLA model.

## Options

  - `:image_size` - Input image size, square (default: 224)
  - `:patch_size` - ViT patch size (default: 14)
  - `:in_channels` - Number of input channels (default: 3)
  - `:vision_dim` - Vision encoder output dimension (default: 384)
  - `:hidden_dim` - LLM hidden dimension (default: 2048)
  - `:num_layers` - Number of LLM layers (default: 24)
  - `:num_heads` - Number of attention heads (default: 16)
  - `:num_kv_heads` - Number of KV heads for GQA (default: 4)
  - `:action_dim` - Robot action dimension (default: 7)
  - `:num_action_bins` - Number of bins for action discretization (default: 256)
  - `:max_text_len` - Maximum text sequence length (default: 64)
  - `:dropout` - Dropout rate (default: 0.1)
  - `:rope` - Apply RoPE to attention (default: false)

## Returns

  An Axon model with inputs `"image"` and `"text_tokens"`, outputting
  `[batch, action_dim, num_action_bins]` logits.
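
A quick shape check, initializing on dummy inputs (a sketch; option values
are the documented defaults):

```elixir
model = OpenVLA.build(image_size: 224, action_dim: 7, num_action_bins: 256)
{init_fn, predict_fn} = Axon.build(model)

inputs = %{
  "image" => Nx.broadcast(0.0, {1, 3, 224, 224}),
  "text_tokens" => Nx.broadcast(0, {1, 64})
}

params = init_fn.(inputs, %{})
logits = predict_fn.(params, inputs)
# logits has shape {1, 7, 256}: [batch, action_dim, num_action_bins]
```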

# `detokenize_actions`

```elixir
@spec detokenize_actions(Nx.Tensor.t(), pos_integer(), keyword()) :: Nx.Tensor.t()
```

Detokenize discrete bin indices back to continuous actions.

Converts bin indices to the center of each bin.

## Parameters

  - `action_tokens` - Integer tensor of bin indices `[batch, action_dim]`
  - `num_bins` - Number of bins per dimension (default: 256)

## Options

  - `:action_min` - Minimum action value (default: -1.0)
  - `:action_max` - Maximum action value (default: 1.0)

## Returns

  Float tensor of continuous actions with same shape as input.
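
A round-trip sketch; the decoded values below follow the module-doc formula,
and would shift by up to half a bin if the implementation decodes to bin
centers instead:

```elixir
tokens = OpenVLA.tokenize_actions(Nx.tensor([[-0.25, 0.8]]), 256)
#=> bin indices [[95, 229]]

OpenVLA.detokenize_actions(tokens, 256)
#=> approximately [[-0.255, 0.796]]
```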

# `encode_image`

```elixir
@spec encode_image(
  Axon.t(),
  keyword()
) :: Axon.t()
```

Encode an image through the vision encoder (DINOv2-style ViT).

Returns visual tokens without the DINO projection head.

## Parameters

  - `image` - Axon node for image input
  - `opts` - Build options

## Returns

  Axon node with shape `[batch, num_patches, vision_dim]`.
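
For example, wiring the encoder onto an image input node (options shown are
the documented defaults):

```elixir
image = Axon.input("image", shape: {nil, 3, 224, 224})
visual = OpenVLA.encode_image(image, image_size: 224, patch_size: 14, vision_dim: 384)
# visual: [batch, num_patches, 384]; a 224x224 image with 14x14 patches
# yields (224 / 14)^2 = 256 grid tokens (plus a class token, if kept)
```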

# `output_size`

```elixir
@spec output_size(keyword()) :: pos_integer()
```

Get the output size of an OpenVLA model.

# `param_count`

```elixir
@spec param_count(keyword()) :: pos_integer()
```

Calculate approximate parameter count for an OpenVLA model.

# `recommended_defaults`

```elixir
@spec recommended_defaults() :: keyword()
```

Recommended default configuration for OpenVLA.

# `small_config`

```elixir
@spec small_config() :: keyword()
```

Get a small model configuration (for testing/prototyping).
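
Since it returns a keyword list, it composes with `build/1` (and
`Keyword.merge/2` for overrides):

```elixir
model =
  OpenVLA.small_config()
  |> Keyword.merge(action_dim: 7)
  |> OpenVLA.build()
```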

# `tokenize_actions`

```elixir
@spec tokenize_actions(Nx.Tensor.t(), pos_integer(), keyword()) :: Nx.Tensor.t()
```

Tokenize continuous actions into discrete bin indices.

Discretizes each action dimension into `num_bins` uniform bins.

## Parameters

  - `actions` - Continuous actions tensor `[batch, action_dim]` or `[action_dim]`
  - `num_bins` - Number of bins per dimension (default: 256)

## Options

  - `:action_min` - Minimum action value (default: -1.0)
  - `:action_max` - Maximum action value (default: 1.0)

## Returns

  Integer tensor of bin indices with same shape as input.
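
For example, an unbatched `[action_dim]` vector (bin values assume the
module-doc formula):

```elixir
OpenVLA.tokenize_actions(Nx.tensor([-1.0, 0.0, 1.0]), 256)
#=> s64 tensor [0, 127, 255]
```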

# `vla_loss`

```elixir
@spec vla_loss(Nx.Tensor.t(), Nx.Tensor.t()) :: Nx.Tensor.t()
```

Compute VLA loss (cross-entropy on action tokens).

## Parameters

  - `logits` - Predicted action logits `[batch, action_dim, num_bins]`
  - `target_tokens` - Ground truth action tokens `[batch, action_dim]`

## Returns

  Scalar loss tensor.
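
A hedged Nx sketch of the cross-entropy this likely computes; the actual
implementation may differ in masking, label smoothing, or reduction:

```elixir
defmodule VLALossSketch do
  import Nx.Defn

  defn action_cross_entropy(logits, targets) do
    # Numerically stable log-softmax over the bin axis
    shifted = logits - Nx.reduce_max(logits, axes: [-1], keep_axes: true)
    log_probs = shifted - Nx.log(Nx.sum(Nx.exp(shifted), axes: [-1], keep_axes: true))

    # Gather the log-probability of each ground-truth bin and average
    picked = Nx.take_along_axis(log_probs, Nx.new_axis(targets, -1), axis: 2)
    -Nx.mean(picked)
  end
end
```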

---

*Consult [api-reference.md](api-reference.md) for the complete listing.*
