# `Dala.Gpu.Compute`
[🔗](https://github.com/manhvu/dala/blob/main/lib/dala/gpu/compute.ex#L1)

High-level GPU compute orchestration for Dala.

Wraps [EXCubeCL](https://hexdocs.pm/ex_cubecl/readme.html) with Dala-native
patterns: GenServer-managed lifecycle, dirty-CPU scheduling, and integration
with `Dala.Gpu` surfaces, `Dala.Media` pipelines, and `Dala.ML` inference.

## Architecture

```
┌──────────────────────────────────────────────────────┐
│  Dala.Gpu.Compute                                    │
│  ├── buffer management (create, read, free)           │
│  ├── kernel execution (sync + async)                  │
│  ├── pipeline orchestration (multi-stage)             │
│  └── stream scheduler (mobile-optimized)              │
├──────────────────────────────────────────────────────┤
│  EXCubeCL (Elixir NIF stubs)                         │
├──────────────────────────────────────────────────────┤
│  Rust NIF → CubeCL Runtime → Metal / OpenGL ES / CPU │
└──────────────────────────────────────────────────────┘
```

## Quick Start

    # Check GPU availability
    Dala.Gpu.Compute.device_info()
    # %{name: "ExCubecl CPU (Rust NIF)", gpu: false, version: "0.3.0"}

    # Create buffers
    a = Dala.Gpu.Compute.buffer([1.0, 2.0, 3.0], {3}, :f32)
    b = Dala.Gpu.Compute.buffer([4.0, 5.0, 6.0], {3}, :f32)
    c = Dala.Gpu.Compute.buffer([0.0, 0.0, 0.0], {3}, :f32)

    # Run a kernel
    Dala.Gpu.Compute.run_kernel(:elementwise_add, [a, b], c, %{})

    # Read results
    Dala.Gpu.Compute.read(c)
    # a 12-byte binary holding the f32 values 5.0, 7.0, 9.0

    # Cleanup (optional — buffers auto-freed by ResourceArc)
    Dala.Gpu.Compute.free(a)
    Dala.Gpu.Compute.free(b)
    Dala.Gpu.Compute.free(c)

## Async Execution

    cmd_id = Dala.Gpu.Compute.submit(%{
      op: :run_kernel,
      kernel: "relu",
      inputs: [a.ref],
      output: b.ref,
      params: %{}
    })

    Dala.Gpu.Compute.poll(cmd_id)  # :pending | :completed | {:error, reason}
    Dala.Gpu.Compute.wait(cmd_id)   # blocks until done
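
`wait/1` blocks until the command finishes. When a caller needs a deadline instead, one option is to call `poll/1` in a loop. The sketch below is not part of Dala: `PollSketch` and `await/3` are hypothetical names, and the poll call is passed in as a zero-arity function so the loop can be tried without a GPU.

```elixir
defmodule PollSketch do
  # Calls `poll_fun` until it stops returning :pending or the deadline passes.
  # `poll_fun` is any zero-arity function returning
  # :pending | :completed | {:error, reason}.
  def await(poll_fun, timeout_ms \\ 1_000, interval_ms \\ 5)
      when is_function(poll_fun, 0) do
    deadline = System.monotonic_time(:millisecond) + timeout_ms
    loop(poll_fun, deadline, interval_ms)
  end

  defp loop(poll_fun, deadline, interval_ms) do
    case poll_fun.() do
      :completed ->
        :ok

      {:error, _reason} = error ->
        error

      :pending ->
        if System.monotonic_time(:millisecond) >= deadline do
          {:error, :timeout}
        else
          # Sleep briefly so the loop does not spin on the scheduler.
          Process.sleep(interval_ms)
          loop(poll_fun, deadline, interval_ms)
        end
    end
  end
end
```

With a real command this would be called as `PollSketch.await(fn -> Dala.Gpu.Compute.poll(cmd_id) end, 500)`.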

## Pipeline Orchestration

    # Bind the result of the chain so pipeline_run/1 receives the populated pipeline
    pipeline =
      Dala.Gpu.Compute.pipeline()
      |> Dala.Gpu.Compute.pipeline_add(%{
        op: :run_kernel,
        kernel: :blur,
        inputs: [input_buf],
        output: temp_buf,
        params: %{radius: 3}
      })
      |> Dala.Gpu.Compute.pipeline_add(%{
        op: :run_kernel,
        kernel: :relu,
        inputs: [temp_buf],
        output: output_buf,
        params: %{}
      })

    Dala.Gpu.Compute.pipeline_run(pipeline)

## Integration with Dala.Gpu surfaces

For rendering results to screen, pair a compute buffer with a `Dala.Gpu.Surface`:

    {:ok, surface} = Dala.Gpu.create_surface(640, 480)
    # Run compute → read buffer → upload to surface
    Dala.Gpu.Compute.run_kernel(:generate_gradient, [], output_buf, %{})
    pixels = Dala.Gpu.Compute.read(output_buf)
    Dala.Gpu.set_pixels(surface, pixels)
    Dala.Gpu.present(surface)

## Supported Types

| Type   | Description                   |
|--------|-------------------------------|
| `:f32` | 32-bit float                  |
| `:f64` | 64-bit float                  |
| `:s32` | 32-bit signed integer         |
| `:s64` | 64-bit signed integer         |
| `:u32` | 32-bit unsigned integer       |
| `:u8`  | 8-bit unsigned integer        |
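
The bit widths in this table determine how many bytes a buffer occupies. A small sketch of that arithmetic, assuming elements are tightly packed with no padding (an assumption, not something this page states); `DtypeSketch` is a hypothetical helper, not a Dala module.

```elixir
defmodule DtypeSketch do
  # Bytes per element, derived from the bit widths in the table above.
  @bytes_per_element %{f32: 4, f64: 8, s32: 4, s64: 8, u32: 4, u8: 1}

  def bytes_per_element(dtype), do: Map.fetch!(@bytes_per_element, dtype)

  # Total bytes for a buffer of the given shape, e.g. {640, 480, 4}.
  def byte_size_for(shape, dtype) when is_tuple(shape) do
    element_count = shape |> Tuple.to_list() |> Enum.product()
    element_count * bytes_per_element(dtype)
  end
end
```

For example, `DtypeSketch.byte_size_for({3}, :f32)` gives 12 and `DtypeSketch.byte_size_for({640, 480, 4}, :u8)` gives 1,228,800.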

## Mobile Notes

On iOS, CubeCL kernels compile to Metal shaders at runtime.
On Android, they compile to OpenGL ES compute shaders.
On desktop (dev), a CPU fallback is used.

GPU compute calls run on dirty CPU schedulers, so they do not block
the BEAM's normal schedulers.

# `add`

```elixir
@spec add(
  Dala.Gpu.Compute.Buffer.t(),
  Dala.Gpu.Compute.Buffer.t(),
  Dala.Gpu.Compute.Buffer.t()
) ::
  :ok | {:error, term()}
```

Elementwise addition: output = a + b

## Example

    c = Dala.Gpu.Compute.buffer_zeros({3}, :f32)
    Dala.Gpu.Compute.add(a, b, c)
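
When testing, it can help to compare GPU output against a plain-Elixir reference. The sketch below covers the elementwise operations documented on this page (`add`, `multiply`, `scale`, `relu`) on ordinary lists. `ElementwiseReference` and `close?/3` are hypothetical names, and the tolerance default is an arbitrary choice for `:f32` data.

```elixir
defmodule ElementwiseReference do
  # Plain-Elixir versions of the elementwise operations, for comparison only.
  def add(a, b) when length(a) == length(b), do: Enum.zip_with(a, b, &(&1 + &2))
  def multiply(a, b) when length(a) == length(b), do: Enum.zip_with(a, b, &(&1 * &2))
  def scale(a, scalar), do: Enum.map(a, &(&1 * scalar))
  def relu(a), do: Enum.map(a, &max(&1, 0.0))

  # True when every pair of elements differs by at most `tolerance`.
  def close?(expected, actual, tolerance \\ 1.0e-5)
      when length(expected) == length(actual) do
    expected
    |> Enum.zip(actual)
    |> Enum.all?(fn {e, x} -> abs(e - x) <= tolerance end)
  end
end
```

A test could decode the GPU result with `read_list/1` and pass it to `close?/3` together with the reference output.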

# `buffer`

```elixir
@spec buffer(list(), tuple(), atom()) :: Dala.Gpu.Compute.Buffer.t()
```

Create a GPU buffer from a list of values.

## Parameters

- `shape` — tuple describing dimensions, e.g. `{3}` for a 1D vector of 3 elements
- `dtype` — data type atom (`:f32`, `:f64`, `:s32`, `:s64`, `:u32`, `:u8`)

## Example

    buf = Dala.Gpu.Compute.buffer([1.0, 2.0, 3.0], {3}, :f32)

# `buffer_from_binary`

```elixir
@spec buffer_from_binary(binary(), tuple(), atom()) :: Dala.Gpu.Compute.Buffer.t()
```

Create a GPU buffer from a raw binary.

## Example

    buf = Dala.Gpu.Compute.buffer_from_binary(binary_data, {640, 480, 4}, :u8)
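
The binary passed in has to be built somewhere. A sketch of two ways to pack one in plain Elixir, assuming native byte order for floats and a tightly packed, row-major RGBA layout for images (both are assumptions; this page does not specify the expected layout). `PackSketch` is a hypothetical helper.

```elixir
defmodule PackSketch do
  # Packs numbers as 32-bit floats in the host's native byte order.
  def f32_binary(values) do
    for v <- values, into: <<>>, do: <<v::float-32-native>>
  end

  # Builds a width x height RGBA image (one byte per channel)
  # filled with a single colour.
  def solid_rgba(width, height, {r, g, b, a}) do
    pixel = <<r, g, b, a>>
    :binary.copy(pixel, width * height)
  end
end
```

`PackSketch.solid_rgba(640, 480, {255, 0, 0, 255})` yields 1,228,800 bytes, which matches the element count of the `{640, 480, 4}` `:u8` shape in the example above.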

# `buffer_zeros`

```elixir
@spec buffer_zeros(tuple(), atom()) :: Dala.Gpu.Compute.Buffer.t()
```

Create a zero-filled GPU buffer with the given shape and dtype.

## Example

    buf = Dala.Gpu.Compute.buffer_zeros({256, 256}, :f32)

# `device_info`

```elixir
@spec device_info() :: map()
```

Return GPU device information.

# `dtype`

```elixir
@spec dtype(Dala.Gpu.Compute.Buffer.t()) :: atom()
```

Return the data type of a buffer.

# `free`

```elixir
@spec free(Dala.Gpu.Compute.Buffer.t()) :: :ok
```

Free a GPU buffer and release all associated GPU memory.
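
Buffers are also released automatically, but on memory-constrained devices it can be useful to free them at a known point even when the code in between raises. One way to sketch that is a generic acquire/use/release wrapper built on `try/after`. `CleanupSketch` and `with_resource/3` are hypothetical names, not part of Dala.

```elixir
defmodule CleanupSketch do
  # Acquires a resource, passes it to `fun`, and always calls `release`,
  # even if `fun` raises.
  def with_resource(acquire, release, fun)
      when is_function(acquire, 0) and is_function(release, 1) and is_function(fun, 1) do
    resource = acquire.()

    try do
      fun.(resource)
    after
      release.(resource)
    end
  end
end
```

With Dala this might be called with `fn -> Dala.Gpu.Compute.buffer_zeros({256, 256}, :f32) end` as `acquire` and `&Dala.Gpu.Compute.free/1` as `release`.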

# `free_many`

```elixir
@spec free_many([Dala.Gpu.Compute.Buffer.t()]) :: :ok
```

Free multiple GPU buffers at once.

# `free_pipeline`

```elixir
@spec free_pipeline(Dala.Gpu.Compute.Pipeline.t()) :: :ok
```

Free a pipeline and its internal resources.

# `from_nx`

```elixir
@spec from_nx(Nx.Tensor.t()) :: Dala.Gpu.Compute.Buffer.t()
```

Convert an Nx tensor to a GPU buffer.

## Example

    tensor = Nx.tensor([1.0, 2.0, 3.0])
    buf = Dala.Gpu.Compute.from_nx(tensor)
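
Nx describes element types as tuples such as `{:f, 32}` (the default for float tensors), while this module uses atoms such as `:f32`. How `from_nx/1` translates between the two is not documented here; the sketch below shows one plausible mapping restricted to the types in the Supported Types table. `NxTypeSketch` is a hypothetical helper and does not require Nx to run.

```elixir
defmodule NxTypeSketch do
  # Nx type tuple => dtype atom, limited to the Supported Types table.
  @mapping %{
    {:f, 32} => :f32,
    {:f, 64} => :f64,
    {:s, 32} => :s32,
    {:s, 64} => :s64,
    {:u, 32} => :u32,
    {:u, 8} => :u8
  }

  # Returns {:ok, dtype} or an error tuple for types outside the table.
  def to_dtype(nx_type) do
    case Map.fetch(@mapping, nx_type) do
      {:ok, dtype} -> {:ok, dtype}
      :error -> {:error, {:unsupported_type, nx_type}}
    end
  end
end
```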

# `gpu?`

```elixir
@spec gpu?() :: boolean()
```

Return `true` if a real GPU is available, or `false` when only the CPU fallback is in use.

# `matmul`

```elixir
@spec matmul(
  Dala.Gpu.Compute.Buffer.t(),
  Dala.Gpu.Compute.Buffer.t(),
  Dala.Gpu.Compute.Buffer.t()
) ::
  :ok | {:error, term()}
```

Matrix multiplication: output = a * b

Both input buffers must be 2D, with shapes `{m, k}` and `{k, n}`; the output buffer holds the `{m, n}` result. Shape validation is performed by the kernel.

## Example

    a = Dala.Gpu.Compute.buffer([1.0, 2.0, 3.0, 4.0], {2, 2}, :f32)
    b = Dala.Gpu.Compute.buffer([5.0, 6.0, 7.0, 8.0], {2, 2}, :f32)
    c = Dala.Gpu.Compute.buffer_zeros({2, 2}, :f32)
    Dala.Gpu.Compute.matmul(a, b, c)
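
To make the shape rule concrete, here is a plain-Elixir reference that multiplies two flat lists and rejects mismatched inner dimensions. It assumes row-major storage, which this page does not confirm for GPU buffers, and `MatmulReference` is a hypothetical name. It is meant for checking small results, not as a description of the kernel.

```elixir
defmodule MatmulReference do
  # Multiplies an {m, k} matrix by a {k, n} matrix. Both inputs are flat,
  # row-major lists; the result is a flat, row-major {m, n} list.
  def matmul(a, {m, k}, b, {k2, n}) do
    cond do
      k != k2 ->
        {:error, {:shape_mismatch, {m, k}, {k2, n}}}

      length(a) != m * k or length(b) != k2 * n ->
        {:error, :bad_length}

      true ->
        rows = Enum.chunk_every(a, k)
        # Transpose b so that each column becomes a list.
        cols = b |> Enum.chunk_every(n) |> Enum.zip() |> Enum.map(&Tuple.to_list/1)
        result = for row <- rows, col <- cols, do: dot(row, col)
        {:ok, result}
    end
  end

  defp dot(xs, ys), do: xs |> Enum.zip_with(ys, &(&1 * &2)) |> Enum.sum()
end
```

For the 2x2 inputs `[1.0, 2.0, 3.0, 4.0]` and `[5.0, 6.0, 7.0, 8.0]`, the reference returns `{:ok, [19.0, 22.0, 43.0, 50.0]}`.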

# `multiply`

```elixir
@spec multiply(
  Dala.Gpu.Compute.Buffer.t(),
  Dala.Gpu.Compute.Buffer.t(),
  Dala.Gpu.Compute.Buffer.t()
) ::
  :ok | {:error, term()}
```

Elementwise multiply: output = a * b

## Example

    Dala.Gpu.Compute.multiply(a, b, output)

# `pipeline`

```elixir
@spec pipeline() :: Dala.Gpu.Compute.Pipeline.t()
```

Create a new empty GPU compute pipeline.

# `pipeline_add`

```elixir
@spec pipeline_add(Dala.Gpu.Compute.Pipeline.t(), map()) ::
  Dala.Gpu.Compute.Pipeline.t()
```

Add a stage to a pipeline. Returns the pipeline for chaining.

# `pipeline_run`

```elixir
@spec pipeline_run(Dala.Gpu.Compute.Pipeline.t()) :: :ok | {:error, term()}
```

Execute all stages in a pipeline sequentially.
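
Sequential execution can be pictured as a fold over the stage list. The model below is a conceptual sketch, not Dala's implementation: `PipelineSketch` is a hypothetical name, the stage runner is passed in as a function, and stopping at the first failing stage is an assumption suggested by the `:ok | {:error, term()}` return type.

```elixir
defmodule PipelineSketch do
  # Runs stages in order and stops at the first failure.
  # `run_stage` is any one-arity function returning :ok | {:error, reason}.
  def run(stages, run_stage) when is_list(stages) and is_function(run_stage, 1) do
    Enum.reduce_while(stages, :ok, fn stage, :ok ->
      case run_stage.(stage) do
        :ok -> {:cont, :ok}
        # Tag the error with the kernel name so the failing stage is visible.
        {:error, reason} -> {:halt, {:error, {stage.kernel, reason}}}
      end
    end)
  end
end
```

Because later stages read the buffers written by earlier ones (as with `temp_buf` in the Pipeline Orchestration example), running them strictly in order keeps those dependencies intact.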

# `poll`

```elixir
@spec poll(non_neg_integer()) :: :pending | :completed | {:error, term()}
```

Poll an async command. Returns `:pending`, `:completed`, or `{:error, reason}`.

# `read`

```elixir
@spec read(Dala.Gpu.Compute.Buffer.t()) :: binary()
```

Read data from a GPU buffer back to an Elixir binary.

EXCubeCL 0.3+ returns `{:ok, binary()}` from its `read/1`; this function
unwraps the tuple and returns the binary directly.

## Example

    data = Dala.Gpu.Compute.read(buf)
    # a 12-byte binary holding the f32 values 5.0, 7.0, 9.0
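
`read/1` hands back raw bytes, so turning them into numbers needs a decoding step (the module also offers `read_list/1` for this). A minimal sketch of that step for `:f32` data, assuming the binary is tightly packed in the host's native byte order, which this page does not state. `DecodeSketch` is a hypothetical name.

```elixir
defmodule DecodeSketch do
  # Decodes a binary of native-endian 32-bit floats into a list of floats.
  # The guard rejects binaries whose size is not a multiple of 4 bytes.
  def f32_list(binary) when is_binary(binary) and rem(byte_size(binary), 4) == 0 do
    for <<value::float-32-native <- binary>>, do: value
  end
end
```

Applied to the 12-byte result above, `DecodeSketch.f32_list(data)` would give `[5.0, 7.0, 9.0]` under the stated assumptions.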

# `read_binary`

```elixir
@spec read_binary(Dala.Gpu.Compute.Buffer.t()) :: binary()
```

Read data from a GPU buffer as a raw binary (zero-copy when possible).

# `read_list`

```elixir
@spec read_list(Dala.Gpu.Compute.Buffer.t()) :: list()
```

Read data from a GPU buffer and convert to an Elixir list.

# `relu`

```elixir
@spec relu(Dala.Gpu.Compute.Buffer.t(), Dala.Gpu.Compute.Buffer.t()) ::
  :ok | {:error, term()}
```

Elementwise ReLU activation: output = max(0, input)

## Example

    Dala.Gpu.Compute.relu(input, output)

# `run_kernel`

```elixir
@spec run_kernel(
  atom(),
  [Dala.Gpu.Compute.Buffer.t()],
  Dala.Gpu.Compute.Buffer.t(),
  map()
) ::
  :ok | {:error, term()}
```

Run a named kernel synchronously.

## Parameters

- `kernel` — kernel atom (e.g. `:elementwise_add`, `:relu`, `:blur`)
- `inputs` — list of input `Buffer` structs
- `output` — output `Buffer` struct
- `params` — map of kernel-specific parameters

## Example

    Dala.Gpu.Compute.run_kernel(:elementwise_add, [a, b], c, %{})
    Dala.Gpu.Compute.run_kernel(:relu, [a], b, %{slope: 0.1})
    Dala.Gpu.Compute.run_kernel(:blur, [image_buf], out_buf, %{radius: 3, sigma: 1.5})

# `run_kernel_async`

```elixir
@spec run_kernel_async(
  atom(),
  [Dala.Gpu.Compute.Buffer.t()],
  Dala.Gpu.Compute.Buffer.t(),
  map()
) ::
  :ok | {:error, term()}
```

Submit a kernel for asynchronous execution, then block until it completes.

# `run_to_surface`

```elixir
@spec run_to_surface(
  atom(),
  [Dala.Gpu.Compute.Buffer.t()],
  Dala.Gpu.Compute.Buffer.t(),
  pid(),
  map()
) ::
  :ok | {:error, term()}
```

Run a compute kernel and upload the result directly to a GPU surface.

This is a convenience function that combines kernel execution with
surface pixel upload, avoiding an intermediate read-back to the CPU.

## Example

    {:ok, surface} = Dala.Gpu.create_surface(640, 480)
    Dala.Gpu.Compute.run_to_surface(kernel, [input_buf], output_buf, surface, %{})

# `scale`

```elixir
@spec scale(Dala.Gpu.Compute.Buffer.t(), number(), Dala.Gpu.Compute.Buffer.t()) ::
  :ok | {:error, term()}
```

Scalar multiply: output = input * scalar

## Example

    Dala.Gpu.Compute.scale(input, 2.5, output)

# `shape`

```elixir
@spec shape(Dala.Gpu.Compute.Buffer.t()) :: tuple()
```

Return the shape of a buffer.

# `size`

```elixir
@spec size(Dala.Gpu.Compute.Buffer.t()) :: non_neg_integer()
```

Return the size of a buffer in bytes.

# `submit`

```elixir
@spec submit(map()) :: non_neg_integer()
```

Submit a compute command asynchronously. Returns a command ID for polling.

The spec map is encoded as a string before it is passed to EXCubeCL 0.3+.

## Example

    cmd_id = Dala.Gpu.Compute.submit(%{
      op: :run_kernel,
      kernel: "relu",
      inputs: [a.ref],
      output: b.ref,
      params: %{}
    })

    # Later...
    case Dala.Gpu.Compute.poll(cmd_id) do
      :completed -> Dala.Gpu.Compute.read(b)
      {:error, reason} -> handle_error(reason)
      :pending -> retry_later()
    end

# `to_nx`

```elixir
@spec to_nx(Dala.Gpu.Compute.Buffer.t(), tuple(), atom()) :: Nx.Tensor.t()
```

Convert a GPU buffer to an Nx tensor.

## Example

    tensor = Dala.Gpu.Compute.to_nx(buf, {3}, :f32)

# `version`

```elixir
@spec version() :: String.t()
```

Return the EXCubeCL version string.

# `wait`

```elixir
@spec wait(non_neg_integer()) :: :ok | {:error, term()}
```

Block until an async command completes. Returns `:ok` or `{:error, reason}`.
