# `ExTorch.Export`

Read and introspect PyTorch ExportedProgram `.pt2` archives.

This module provides a pure-Elixir reader for `.pt2` files produced by
`torch.export.save()`. It can extract the model graph, weight metadata,
and raw weight tensors without requiring Python or C++ ExportedProgram support.

## Python export workflow

    import torch

    model = MyModel()  # any torch.nn.Module
    model.eval()
    # example_input: a sample tensor with the shape the model expects
    exported = torch.export.export(model, (example_input,))
    torch.export.save(exported, "model.pt2")

## Elixir usage

    # Load and run inference directly
    model = ExTorch.Export.load("model.pt2")
    output = ExTorch.Export.forward(model, [input])

    # Or read schema and weights separately
    schema = ExTorch.Export.read_schema("model.pt2")
    weights = ExTorch.Export.read_weights("model.pt2")

    # Generate DSL source code
    IO.puts(ExTorch.Export.to_elixir("model.pt2", "MyModel"))

## Note

This reads `.pt2` files from `torch.export.save`, NOT from
`aoti_compile_and_package`. AOTI-compiled `.pt2` files don't contain
the graph or separable weights -- use `ExTorch.AOTI` for those.

# `forward`

```elixir
@spec forward(ExTorch.Export.Model.t(), [ExTorch.Tensor.t()]) ::
  ExTorch.Tensor.t() | [ExTorch.Tensor.t()]
```

Run inference on a loaded Export model.

Interprets the ATen computation graph, dispatching each operation to
the corresponding ExTorch tensor function.

## Args
  * `model` (`ExTorch.Export.Model`) - the loaded model.
  * `inputs` (`[ExTorch.Tensor]`) - input tensors, matching the model's user inputs.

## Returns
The output tensor (or list of tensors for multi-output models).

## Example

    model = ExTorch.Export.load("model.pt2")
    input = ExTorch.randn({1, 10})
    output = ExTorch.Export.forward(model, [input])
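
The interpretation step described above can be pictured with a toy example. This is a minimal sketch, not the library's implementation: the node keys (`:target`, `:args`, `:output`), the op names, and the `ToyInterpreter` module are all hypothetical, and plain numbers stand in for tensors. It only shows the general shape of walking a graph in order and dispatching each node by its target.

```elixir
defmodule ToyInterpreter do
  # Hypothetical op table: plain arithmetic stands in for tensor functions.
  defp apply_op("add", [a, b]), do: a + b
  defp apply_op("mul", [a, b]), do: a * b

  # Walks the nodes in order, storing each result in an environment keyed by name.
  def run(%{graph: graph, inputs: input_names, outputs: output_names}, inputs) do
    env = input_names |> Enum.zip(inputs) |> Map.new()

    env =
      Enum.reduce(graph, env, fn %{target: target, args: args, output: out}, env ->
        values = Enum.map(args, &Map.fetch!(env, &1))
        Map.put(env, out, apply_op(target, values))
      end)

    # A single output is returned bare; several outputs are returned as a list.
    case Enum.map(output_names, &Map.fetch!(env, &1)) do
      [single] -> single
      many -> many
    end
  end
end

# Made-up graph computing (x + y) * x.
toy = %{
  inputs: ["x", "y"],
  outputs: ["out"],
  graph: [
    %{target: "add", args: ["x", "y"], output: "sum"},
    %{target: "mul", args: ["sum", "x"], output: "out"}
  ]
}

IO.inspect(ToyInterpreter.run(toy, [2, 3]))
```

Each node here costs one dispatch, which is why a graph with many nodes pays a per-node cost on this path.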

# `forward_compiled`

```elixir
@spec forward_compiled(ExTorch.Export.Model.t(), [ExTorch.Tensor.t()]) ::
  ExTorch.Tensor.t() | [ExTorch.Tensor.t()]
```

Run inference using the pre-compiled graph executor.

The fastest Export inference path. All op schemas are resolved and
argument templates are pre-built at `load/2` time, so this function only
passes tensors to C++ and receives tensors back, with no per-call
argument encoding.

Falls back to `forward_native/2` if the graph couldn't be pre-compiled.

## Example

    model = ExTorch.Export.load("model.pt2", device: :cuda)
    output = ExTorch.Export.forward_compiled(model, [input])

# `forward_native`

```elixir
@spec forward_native(ExTorch.Export.Model.t(), [ExTorch.Tensor.t()]) ::
  ExTorch.Tensor.t() | [ExTorch.Tensor.t()]
```

Run inference using the native graph executor.

Compiles the schema graph into an instruction stream and executes the
entire graph in a single NIF call via `execute_graph`, eliminating
per-node NIF boundary crossings. This is significantly faster than
`forward/2` for high-node-count models (e.g., ViT with 430 nodes)
while still supporting all ops through the `c10::Dispatcher`.

Falls back gracefully for ops registered via `ExTorch.Export.OpRegistry`
since those are also dispatched through the same C++ dispatcher.

## Example

    model = ExTorch.Export.load("vit_b_16.pt2", device: :cuda)
    input = ExTorch.Tensor.to(input, device: :cuda)
    output = ExTorch.Export.forward_native(model, [input])

# `forward_profiled`

```elixir
@spec forward_profiled(ExTorch.Export.Model.t(), [ExTorch.Tensor.t()]) ::
  {ExTorch.Tensor.t() | [ExTorch.Tensor.t()], map()}
```

Run `forward/2` with per-node timing instrumentation. Returns
`{output, %{op_target => %{count: N, total_us: T}}}`, aggregated by op
target so you can see which ops dominate inference time.

Intended for diagnostics only. Each node incurs roughly 1 μs of
measurement overhead from `:erlang.monotonic_time/1`.
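
To find the ops that dominate inference time, the returned map can be sorted by `total_us`. The sketch below assumes only the `%{op_target => %{count: N, total_us: T}}` shape documented above. The `ProfileReport` module is a hypothetical helper, and the op names and timings in `sample` are made up for illustration.

```elixir
defmodule ProfileReport do
  # Returns [{op_target, count, total_us, share_percent}], slowest first.
  def rank(profile) when is_map(profile) do
    grand_total =
      profile
      |> Map.values()
      |> Enum.map(& &1.total_us)
      |> Enum.sum()

    profile
    |> Enum.map(fn {target, %{count: count, total_us: total_us}} ->
      share =
        if grand_total > 0,
          do: Float.round(total_us * 100 / grand_total, 1),
          else: 0.0

      {target, count, total_us, share}
    end)
    |> Enum.sort_by(fn {_target, _count, total_us, _share} -> total_us end, :desc)
  end
end

# Made-up numbers; a real map would be the second element of the
# tuple returned by forward_profiled/2.
sample = %{
  "torch.ops.aten.linear.default" => %{count: 4, total_us: 600},
  "torch.ops.aten.relu.default" => %{count: 3, total_us: 150},
  "torch.ops.aten.softmax.int" => %{count: 1, total_us: 250}
}

for {target, count, total_us, share} <- ProfileReport.rank(sample) do
  IO.puts("#{target}: #{count} calls, #{total_us} us (#{share}%)")
end
```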

# `load`

```elixir
@spec load(
  String.t(),
  keyword()
) :: ExTorch.Export.Model.t()
```

Load an exported `.pt2` model for inference.

Reads the graph and weights, and prepares the model for `forward/2`.

## Args
  * `path` (`String`) - path to the `.pt2` file from `torch.export.save`.
  * `opts` (`keyword`) - optional:
    * `:device` (`:cpu | :cuda | {:cuda, index}`) - device to place all
      weight tensors on. Defaults to `:cpu`. When set to `:cuda`, every
      loaded parameter/buffer is moved to the GPU at load time, so
      subsequent `forward/2` calls run entirely on the GPU (as long as
      the user input is also on the GPU).

## Returns
An `%ExTorch.Export.Model{}` struct.

## Example

    # CPU (default)
    model = ExTorch.Export.load("model.pt2")
    output = ExTorch.Export.forward(model, [input_tensor])

    # GPU
    model = ExTorch.Export.load("model.pt2", device: :cuda)
    input = ExTorch.Tensor.to(cpu_input, device: :cuda)
    output = ExTorch.Export.forward(model, [input])

# `read_schema`

```elixir
@spec read_schema(String.t()) :: map()
```

Read the model schema from an exported `.pt2` archive.

Returns a map with:
  * `:graph` - the computation graph as a list of node maps
  * `:inputs` - graph input names
  * `:outputs` - graph output names
  * `:weights` - weight metadata (name → shape, dtype, requires_grad)
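
As an illustration of working with the `:weights` metadata, the sketch below totals the number of elements per dtype. It rests on assumptions the description above does not confirm: that each metadata entry is a map with `:shape` and `:dtype` keys, and that a shape is a list of integers. `SchemaSummary` and `sample_schema` are hypothetical; adjust the field access to match what `read_schema/1` actually returns.

```elixir
defmodule SchemaSummary do
  # Number of elements implied by a shape given as a list of dimensions.
  # An empty shape (a scalar) counts as one element.
  def numel(shape) when is_list(shape), do: Enum.reduce(shape, 1, &(&1 * &2))

  # Total element count across all weights, plus a per-dtype breakdown.
  def summarize(%{weights: weights}) do
    per_dtype =
      Enum.reduce(weights, %{}, fn {_name, meta}, acc ->
        Map.update(acc, meta.dtype, numel(meta.shape), &(&1 + numel(meta.shape)))
      end)

    %{total: per_dtype |> Map.values() |> Enum.sum(), per_dtype: per_dtype}
  end
end

# Hand-written stand-in for a schema map (assumed layout, not real output).
sample_schema = %{
  graph: [],
  inputs: ["x"],
  outputs: ["linear"],
  weights: %{
    "fc.weight" => %{shape: [5, 10], dtype: :float32, requires_grad: true},
    "fc.bias" => %{shape: [5], dtype: :float32, requires_grad: true}
  }
}

IO.inspect(SchemaSummary.summarize(sample_schema))
```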

# `read_weights`

```elixir
@spec read_weights(String.t()) :: %{required(String.t()) => ExTorch.Tensor.t()}
```

Load weight tensors from an exported `.pt2` archive.

Returns a map of `%{fqn => %ExTorch.Tensor{}}`.
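
Because the keys are fully qualified names, the map can be grouped by submodule using ordinary string handling. The following is a minimal sketch with a hypothetical `WeightIndex` helper. Placeholder atoms stand in for the `%ExTorch.Tensor{}` values so the example runs without a model file, and the names in `sample` are invented.

```elixir
defmodule WeightIndex do
  # Groups fully qualified names by their first dotted component.
  def group_by_prefix(weights) when is_map(weights) do
    weights
    |> Map.keys()
    |> Enum.group_by(fn fqn -> fqn |> String.split(".", parts: 2) |> hd() end)
    |> Map.new(fn {prefix, fqns} -> {prefix, Enum.sort(fqns)} end)
  end
end

# Placeholder atoms stand in for tensors; real values come from read_weights/1.
sample = %{
  "encoder.layer0.weight" => :tensor,
  "encoder.layer0.bias" => :tensor,
  "head.weight" => :tensor
}

IO.inspect(WeightIndex.group_by_prefix(sample))
```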

# `to_elixir`

```elixir
@spec to_elixir(String.t(), String.t()) :: String.t()
```

Generate an `ExTorch.NN.Module` DSL definition from an exported `.pt2` archive.

Maps ATen operations in the graph to ExTorch NN layer types where possible.

## Args
  * `path` - path to the `.pt2` file.
  * `module_name` - name for the generated Elixir module.

---

*Consult [api-reference.md](api-reference.md) for the complete listing*
