# `HuggingfaceClient.Inference.TGI`
[🔗](https://github.com/huggingface/huggingface_client/blob/v0.1.0/lib/huggingface_client/inference/tgi.ex#L1)

Client for HuggingFace Text Generation Inference (TGI) servers.

TGI is a high-performance inference server for deploying large language models.
This client works with both:
- Self-hosted TGI servers (`docker run ghcr.io/huggingface/text-generation-inference`)
- HuggingFace Inference Endpoints powered by TGI

See: https://huggingface.co/docs/text-generation-inference

## Quick start

    # Connect to a local TGI server
    client = HuggingfaceClient.Inference.TGI.new("http://localhost:8080")

    # Or an Inference Endpoint
    client = HuggingfaceClient.Inference.TGI.new(
      "https://xxx.aws.endpoints.huggingface.cloud",
      token: "hf_..."
    )

    # Generate text
    {:ok, resp} = HuggingfaceClient.Inference.TGI.generate(client,
      inputs: "What is deep learning?",
      max_new_tokens: 200
    )
    IO.puts(resp["generated_text"])

    # Chat completion (OpenAI-compatible)
    {:ok, resp} = HuggingfaceClient.Inference.TGI.chat_completion(client,
      messages: [%{"role" => "user", "content" => "Hello!"}],
      max_tokens: 100
    )

    # Streaming
    HuggingfaceClient.Inference.TGI.generate_stream(client,
      inputs: "Tell me a story about",
      max_new_tokens: 500
    )
    |> Enum.each(fn token -> IO.write(token["token"]["text"]) end)

# `t`

```elixir
@type t() :: %HuggingfaceClient.Inference.TGI{
  base_url: String.t(),
  timeout: pos_integer(),
  token: String.t() | nil
}
```

# `chat_completion`

```elixir
@spec chat_completion(
  t(),
  keyword()
) :: {:ok, map()} | {:error, Exception.t()}
```

OpenAI-compatible chat completion via TGI.

## Options

- `:messages` — list of message maps with `"role"` and `"content"` (required)
- `:max_tokens` — maximum tokens to generate
- `:temperature` — sampling temperature
- `:top_p` — nucleus sampling
- `:stop` — stop sequences
- `:stream` — if `true`, returns a streaming response instead of a single map (default: `false`); see the streaming sketch below

## Example

    {:ok, resp} = HuggingfaceClient.Inference.TGI.chat_completion(client,
      messages: [
        %{"role" => "system",  "content" => "You are a helpful assistant."},
        %{"role" => "user",    "content" => "What is 2+2?"}
      ],
      max_tokens: 100
    )
    IO.puts(resp["choices"] |> hd() |> get_in(["message", "content"]))

# `decode`

```elixir
@spec decode(
  t(),
  keyword()
) :: {:ok, map()} | {:error, Exception.t()}
```

Decodes token IDs back to text.

## Example

    {:ok, result} = HuggingfaceClient.Inference.TGI.decode(client, ids: [1, 2, 3, 4])
    IO.puts(result["decoded_text"])

# `embed`

```elixir
@spec embed(
  t(),
  keyword()
) :: {:ok, [[float()]]} | {:error, Exception.t()}
```

Generates embeddings (for endpoints compatible with TEI, HuggingFace's Text Embeddings Inference server).

## Example

    {:ok, [embedding]} = HuggingfaceClient.Inference.TGI.embed(client,
      inputs: "Hello, world!"
    )
    IO.puts("Embedding dimensions: #{length(embedding)}")

# `generate`

```elixir
@spec generate(
  t(),
  keyword()
) :: {:ok, map()} | {:error, Exception.t()}
```

Generates text from an input prompt.

## Options

- `:inputs` — input text prompt (required)
- `:max_new_tokens` — maximum number of tokens to generate (default: 20)
- `:temperature` — sampling temperature (0.0 = greedy)
- `:top_p` — nucleus sampling probability
- `:top_k` — top-k sampling
- `:repetition_penalty` — repetition penalty; values > 1.0 discourage repeated tokens
- `:stop` — list of stop sequences
- `:seed` — random seed for reproducibility
- `:do_sample` — if `false`, use greedy decoding
- `:return_full_text` — if `true`, include the input prompt in the returned `generated_text`
- `:best_of` — generate `N` candidate sequences and return the best one (increases latency)
- `:watermark` — add a watermark to the output

## Example

    {:ok, resp} = HuggingfaceClient.Inference.TGI.generate(client,
      inputs: "What is the capital of France?",
      max_new_tokens: 50,
      temperature: 0.7
    )
    IO.puts(resp["generated_text"])

# `generate_batch`

```elixir
@spec generate_batch(
  t(),
  keyword()
) :: {:ok, [map()]} | {:error, Exception.t()}
```

Batch text generation.

## Options

- `:inputs` — list of input prompts (required)
- Same generation parameters as `generate/2`

## Example

    {:ok, results} = HuggingfaceClient.Inference.TGI.generate_batch(client,
      inputs: ["Hello world!", "What is AI?"],
      max_new_tokens: 100
    )
    Enum.each(results, fn r -> IO.puts(r["generated_text"]) end)

# `generate_stream`

```elixir
@spec generate_stream(
  t(),
  keyword()
) :: Enumerable.t()
```

Streams text generation token by token.

Returns an enumerable of token maps, each containing:
- `"token"` — map with `"id"`, `"text"`, `"logprob"`, `"special"`
- `"generated_text"` — full text so far (only on last token)
- `"details"` — generation details (only on last token)

## Example

    HuggingfaceClient.Inference.TGI.generate_stream(client,
      inputs: "Once upon a time",
      max_new_tokens: 200
    )
    |> Enum.each(fn token ->
      IO.write(token["token"]["text"])
    end)
    IO.puts("")  # newline at end

# `health`

```elixir
@spec health(t()) :: :ok | {:error, Exception.t()}
```

Checks server health. Returns `:ok` if healthy.
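
## Example

A simple readiness check before issuing requests:

    case HuggingfaceClient.Inference.TGI.health(client) do
      :ok -> IO.puts("TGI server is healthy")
      {:error, err} -> IO.puts("Health check failed: #{Exception.message(err)}")
    end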

# `info`

```elixir
@spec info(t()) :: {:ok, map()} | {:error, Exception.t()}
```

Gets information about the running TGI server (model, config, etc.).

## Example

    {:ok, info} = HuggingfaceClient.Inference.TGI.info(client)
    IO.puts("Model: #{info["model_id"]}")
    IO.puts("Max tokens: #{info["max_total_tokens"]}")

# `new`

```elixir
@spec new(
  String.t(),
  keyword()
) :: t()
```

Creates a new TGI client.

## Parameters

- `base_url` — URL of the TGI server (e.g. `"http://localhost:8080"`)

## Options

- `:token` — Bearer token for authentication
- `:timeout` — request timeout in milliseconds (default: 60_000)

## Example

    # Local server
    client = HuggingfaceClient.Inference.TGI.new("http://localhost:8080")

    # Inference endpoint with auth
    client = HuggingfaceClient.Inference.TGI.new(
      "https://my-endpoint.aws.endpoints.huggingface.cloud",
      token: "hf_..."
    )
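
Long generations can exceed the default 60-second timeout; raise it with `:timeout`:

    # Allow up to two minutes per request
    client = HuggingfaceClient.Inference.TGI.new("http://localhost:8080",
      timeout: 120_000
    )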

# `tokenize`

```elixir
@spec tokenize(
  t(),
  keyword()
) :: {:ok, map()} | {:error, Exception.t()}
```

Tokenizes text with the model's tokenizer, returning the resulting tokens and their IDs.

## Example

    {:ok, result} = HuggingfaceClient.Inference.TGI.tokenize(client, inputs: "Hello, world!")
    IO.puts("Token count: #{length(result["tokens"])}")

---

*Consult [api-reference.md](api-reference.md) for the complete listing*
