HfHub.InferenceEndpoints (HfHub v0.2.0)


Inference Endpoints API for dedicated model hosting.

Provides management of Hugging Face Inference Endpoints: dedicated infrastructure for model inference with auto-scaling and GPU support.

Accelerator Options

  • :cpu - CPU-based inference
  • :gpu - GPU-based inference

Instance Sizes

  • :x1 - 1x resources
  • :x2 - 2x resources
  • :x4 - 4x resources
  • :x8 - 8x resources

Cloud Vendors

  • :aws - Amazon Web Services
  • :azure - Microsoft Azure
  • :gcp - Google Cloud Platform

Endpoint Types

  • :public - Publicly accessible
  • :protected - Requires authentication (default)
  • :private - Private VPC endpoint

Examples

# List all endpoints
{:ok, endpoints} = HfHub.InferenceEndpoints.list()

# Create a GPU endpoint
{:ok, endpoint} = HfHub.InferenceEndpoints.create("my-endpoint",
  repository: "bert-base-uncased",
  accelerator: :gpu,
  instance_size: :x1,
  instance_type: "g5.xlarge",
  region: "us-east-1",
  vendor: :aws,
  task: "text-classification"
)

# Pause endpoint to save costs
{:ok, endpoint} = HfHub.InferenceEndpoints.pause("my-endpoint")

# Resume when needed
{:ok, endpoint} = HfHub.InferenceEndpoints.resume("my-endpoint")
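
Every call returns a tagged tuple per the typespecs below, so failures can be handled with an ordinary case. A sketch — the shape of the error term depends on the Hub response:

```elixir
case HfHub.InferenceEndpoints.get("my-endpoint") do
  {:ok, endpoint} ->
    endpoint

  {:error, reason} ->
    # e.g. log and fall back; `reason` is API-dependent
    IO.inspect(reason, label: "endpoint lookup failed")
    nil
end
```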

Summary

Functions

Creates a new inference endpoint.

Deletes an endpoint.

Gets an endpoint by name.

Lists all inference endpoints.

Pauses an endpoint.

Resumes a paused endpoint.

Scales endpoint to zero replicas.

Updates an existing endpoint.

Types

accelerator()

@type accelerator() :: :cpu | :gpu

endpoint_type()

@type endpoint_type() :: :public | :protected | :private

instance_size()

@type instance_size() :: :x1 | :x2 | :x4 | :x8

vendor()

@type vendor() :: :aws | :azure | :gcp

Functions

create(name, opts)

@spec create(
  String.t(),
  keyword()
) :: {:ok, HfHub.InferenceEndpoints.Endpoint.t()} | {:error, term()}

Creates a new inference endpoint.

Arguments

  • name - Endpoint name

Required Options

  • :repository - Model repository ID (e.g., "bert-base-uncased")
  • :accelerator - :cpu or :gpu
  • :instance_size - :x1, :x2, :x4, or :x8
  • :instance_type - Instance type (e.g., "g5.xlarge")
  • :region - Cloud region (e.g., "us-east-1")
  • :vendor - Cloud vendor: :aws, :azure, or :gcp

Optional

  • :framework - "pytorch", "tensorflow", etc. (default: "pytorch")
  • :task - ML task (e.g., "text-classification")
  • :namespace - Organization namespace (default: current user)
  • :min_replica - Minimum replicas (default: 0)
  • :max_replica - Maximum replicas (default: 1)
  • :scale_to_zero_timeout - Seconds before scaling to zero
  • :type - :public, :protected, or :private (default: :protected)
  • :custom_image - Custom Docker image configuration
  • :token - Authentication token

Examples

{:ok, endpoint} = HfHub.InferenceEndpoints.create("my-endpoint",
  repository: "bert-base-uncased",
  accelerator: :gpu,
  instance_size: :x1,
  instance_type: "g5.xlarge",
  region: "us-east-1",
  vendor: :aws,
  task: "text-classification"
)

{:ok, endpoint} = HfHub.InferenceEndpoints.create("my-endpoint",
  repository: "sentence-transformers/all-MiniLM-L6-v2",
  accelerator: :cpu,
  instance_size: :x2,
  instance_type: "c6i.xlarge",
  region: "eu-west-1",
  vendor: :aws,
  min_replica: 1,
  max_replica: 4,
  scale_to_zero_timeout: 300
)

delete(name, opts \\ [])

@spec delete(
  String.t(),
  keyword()
) :: :ok | {:error, term()}

Deletes an endpoint.

Warning: this operation is destructive and cannot be undone.

Arguments

  • name - Endpoint name

Options

  • :namespace - Organization namespace
  • :token - Authentication token

Examples

:ok = HfHub.InferenceEndpoints.delete("my-endpoint")

get(name, opts \\ [])

@spec get(
  String.t(),
  keyword()
) :: {:ok, HfHub.InferenceEndpoints.Endpoint.t()} | {:error, term()}

Gets an endpoint by name.

Arguments

  • name - Endpoint name

Options

  • :namespace - Organization namespace (default: current user)
  • :token - Authentication token

Examples

{:ok, endpoint} = HfHub.InferenceEndpoints.get("my-endpoint")
{:ok, endpoint} = HfHub.InferenceEndpoints.get("my-endpoint", namespace: "my-org")

list(opts \\ [])

@spec list(keyword()) ::
  {:ok, [HfHub.InferenceEndpoints.Endpoint.t()]} | {:error, term()}

Lists all inference endpoints.

Options

  • :namespace - Organization namespace (default: current user)
  • :token - Authentication token

Examples

{:ok, endpoints} = HfHub.InferenceEndpoints.list()
{:ok, endpoints} = HfHub.InferenceEndpoints.list(namespace: "my-org")

pause(name, opts \\ [])

@spec pause(
  String.t(),
  keyword()
) :: {:ok, HfHub.InferenceEndpoints.Endpoint.t()} | {:error, term()}

Pauses an endpoint.

Paused endpoints don't incur compute costs but retain configuration. They must be resumed before they can serve requests.

Arguments

  • name - Endpoint name

Options

  • :namespace - Organization namespace
  • :token - Authentication token

Examples

{:ok, endpoint} = HfHub.InferenceEndpoints.pause("my-endpoint")

resume(name, opts \\ [])

@spec resume(
  String.t(),
  keyword()
) :: {:ok, HfHub.InferenceEndpoints.Endpoint.t()} | {:error, term()}

Resumes a paused endpoint.

Arguments

  • name - Endpoint name

Options

  • :namespace - Organization namespace
  • :token - Authentication token

Examples

{:ok, endpoint} = HfHub.InferenceEndpoints.resume("my-endpoint")

scale_to_zero(name, opts \\ [])

@spec scale_to_zero(
  String.t(),
  keyword()
) :: {:ok, HfHub.InferenceEndpoints.Endpoint.t()} | {:error, term()}

Scales endpoint to zero replicas.

Unlike pause/1, a scaled-to-zero endpoint can auto-wake on incoming requests, while a paused endpoint must be explicitly resumed.

Arguments

  • name - Endpoint name

Options

  • :namespace - Organization namespace
  • :token - Authentication token

Examples

{:ok, endpoint} = HfHub.InferenceEndpoints.scale_to_zero("my-endpoint")
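
The contrast with pause/1 in practice, sketched with only the calls documented on this page:

```elixir
# Cold-stop: replicas drop to zero, but the endpoint
# wakes automatically on the next incoming request.
{:ok, _endpoint} = HfHub.InferenceEndpoints.scale_to_zero("my-endpoint")

# Hard-stop: the endpoint stays down until resumed explicitly.
{:ok, _endpoint} = HfHub.InferenceEndpoints.pause("my-endpoint")
{:ok, _endpoint} = HfHub.InferenceEndpoints.resume("my-endpoint")
```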

update(name, opts \\ [])

@spec update(
  String.t(),
  keyword()
) :: {:ok, HfHub.InferenceEndpoints.Endpoint.t()} | {:error, term()}

Updates an existing endpoint.

Only provided options are updated; others remain unchanged.

Arguments

  • name - Endpoint name

Options

  • :namespace - Organization namespace
  • :accelerator - :cpu or :gpu
  • :instance_size - :x1, :x2, :x4, or :x8
  • :instance_type - Instance type
  • :min_replica - Minimum replicas
  • :max_replica - Maximum replicas
  • :scale_to_zero_timeout - Seconds before scaling to zero
  • :repository - Model repository ID
  • :framework - Framework ("pytorch", "tensorflow", etc.)
  • :revision - Model revision
  • :task - ML task
  • :token - Authentication token

Examples

{:ok, endpoint} = HfHub.InferenceEndpoints.update("my-endpoint",
  instance_size: :x2,
  max_replica: 4
)

{:ok, endpoint} = HfHub.InferenceEndpoints.update("my-endpoint",
  repository: "bert-large-uncased"
)