Manage Dedicated Inference Endpoints on the HuggingFace Hub.
Inference Endpoints let you deploy models on fully managed, dedicated infrastructure (CPU or GPU) with private endpoints and autoscaling.
See: https://huggingface.co/docs/inference-endpoints
Example

# Create an endpoint
{:ok, endpoint} = HuggingfaceClient.create_inference_endpoint(
  name: "my-bert-endpoint",
  repository: "sentence-transformers/all-MiniLM-L6-v2",
  framework: "pytorch",
  task: "sentence-similarity",
  accelerator: "cpu",
  vendor: "aws",
  region: "us-east-1",
  type: "protected",
  instance_size: "small",
  instance_type: "c6i",
  access_token: "hf_..."
)

# Get endpoint status
{:ok, ep} = HuggingfaceClient.get_inference_endpoint("my-bert-endpoint",
  namespace: "my-org",
  access_token: "hf_..."
)

IO.puts("Status: #{ep["status"]["state"]}")
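Creation is asynchronous: the call returns while the endpoint is still provisioning. Below is a minimal polling sketch, assuming the API reports a "running" state once the endpoint is up (the status path is the one used above; EndpointHelper is a hypothetical helper module, not part of this client):

defmodule EndpointHelper do
  # Poll every 10 seconds until the endpoint reports "running",
  # giving up after `attempts` tries.
  def wait_until_running(name, opts, attempts \\ 30)

  def wait_until_running(_name, _opts, 0), do: {:error, :timeout}

  def wait_until_running(name, opts, attempts) do
    {:ok, ep} = HuggingfaceClient.get_inference_endpoint(name, opts)

    if ep["status"]["state"] == "running" do
      {:ok, ep}
    else
      Process.sleep(10_000)
      wait_until_running(name, opts, attempts - 1)
    end
  end
end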
Summary

Functions

create(opts) - Creates a new dedicated inference endpoint.
delete(name, opts) - Deletes an inference endpoint.
get(name, opts) - Gets details about a specific inference endpoint.
list(opts) - Lists all inference endpoints for a namespace (user or org).
pause(name, opts) - Pauses an inference endpoint (stops billing while keeping its configuration).
resume(name, opts) - Resumes a paused inference endpoint.
scale_to_zero(name, opts) - Scales an endpoint to zero replicas (equivalent to pause, with a more explicit name).
update(name, opts) - Updates an existing inference endpoint's configuration.

Functions
@spec create(keyword()) :: {:ok, map()} | {:error, Exception.t()}
Creates a new dedicated inference endpoint.
Required options

:name - endpoint name (unique within the namespace)
:repository - model repo ID (e.g. "gpt2")
:accelerator - "cpu" or "gpu"
:vendor - cloud vendor: "aws", "azure", "gcp"
:region - cloud region (e.g. "us-east-1")
:instance_size - size: "small", "medium", "large", "xlarge"
:instance_type - instance type (vendor-specific, e.g. "c6i" for AWS CPU)
:type - endpoint type: "public", "protected", or "private"

Optional options

:namespace - organization namespace (defaults to the authenticated user)
:framework - ML framework (e.g. "pytorch")
:task - pipeline task (e.g. "text-generation")
:revision - model revision/branch
:image - custom Docker image config
:env - environment variables map
:scaling - scaling config map
:access_token - HuggingFace access token
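For illustration, here is a create call that also uses some of the optional keys. The :env and :scaling map shapes and the GPU instance type shown are assumptions, not guaranteed by this client; check the Inference Endpoints API reference for the exact values your vendor supports:

# The :scaling keys (minReplica/maxReplica) and the instance type are
# illustrative assumptions.
{:ok, endpoint} = HuggingfaceClient.create_inference_endpoint(
  name: "my-gpu-endpoint",
  repository: "gpt2",
  accelerator: "gpu",
  vendor: "aws",
  region: "us-east-1",
  instance_size: "medium",
  instance_type: "g5.2xlarge",
  type: "protected",
  task: "text-generation",
  env: %{"MAX_INPUT_LENGTH" => "1024"},
  scaling: %{"minReplica" => 0, "maxReplica" => 2},
  access_token: "hf_..."
)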
@spec delete(String.t(), keyword()) :: :ok | {:error, Exception.t()}
Deletes an inference endpoint.
Example
:ok = HuggingfaceClient.delete_inference_endpoint("my-endpoint",
  namespace: "my-org",
  access_token: "hf_..."
)
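Because the return type is :ok | {:error, Exception.t()}, callers that should not crash can branch on the result:

case HuggingfaceClient.delete_inference_endpoint("my-endpoint",
       namespace: "my-org",
       access_token: "hf_..."
     ) do
  :ok -> IO.puts("Endpoint deleted")
  {:error, err} -> IO.puts("Delete failed: #{Exception.message(err)}")
end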
@spec get(String.t(), keyword()) :: {:ok, map()} | {:error, Exception.t()}
Gets details about a specific inference endpoint.
Example
{:ok, ep} = HuggingfaceClient.get_inference_endpoint("my-endpoint",
  namespace: "my-org",
  access_token: "hf_..."
)
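The returned map mirrors the API's JSON response. Continuing the example, and assuming the usual field layout (state under "status", with the endpoint URL appearing once it is running):

# "url" is only present once the endpoint is running; the field names
# here assume the API's JSON shape.
IO.puts("State: #{ep["status"]["state"]}")
IO.puts("URL:   #{ep["status"]["url"] || "(not yet available)"}")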
@spec list(keyword()) :: {:ok, [map()]} | {:error, Exception.t()}
Lists all inference endpoints for a namespace (user or org).
Example
{:ok, endpoints} = HuggingfaceClient.list_inference_endpoints(access_token: "hf_...")
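Each element is a map in the same shape as the get result, so a quick status overview looks like this (field names assume the API's JSON shape):

for ep <- endpoints do
  IO.puts("#{ep["name"]}: #{ep["status"]["state"]}")
end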
@spec pause(String.t(), keyword()) :: {:ok, map()} | {:error, Exception.t()}
Pauses an inference endpoint (stops billing while keeping configuration).
Example
{:ok, ep} = HuggingfaceClient.pause_inference_endpoint("my-endpoint",
  namespace: "my-org",
  access_token: "hf_..."
)
@spec resume(String.t(), keyword()) :: {:ok, map()} | {:error, Exception.t()}
Resumes a paused inference endpoint.
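Example

No example is given in the source docs, so this is a sketch; the delegate name resume_inference_endpoint/2 is assumed from the naming pattern of the other functions.

# Delegate name assumed from the *_inference_endpoint pattern.
{:ok, ep} = HuggingfaceClient.resume_inference_endpoint("my-endpoint",
  namespace: "my-org",
  access_token: "hf_..."
)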
@spec scale_to_zero(String.t(), keyword()) :: {:ok, map()} | {:error, Exception.t()}
Scales an endpoint to zero replicas (equivalent to pause, with a more explicit name).
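Example

Likewise a sketch; scale_to_zero_inference_endpoint/2 is a hypothetical delegate name inferred from the same pattern.

# Hypothetical delegate name, following the *_inference_endpoint pattern.
{:ok, ep} = HuggingfaceClient.scale_to_zero_inference_endpoint("my-endpoint",
  namespace: "my-org",
  access_token: "hf_..."
)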
@spec update(String.t(), keyword()) :: {:ok, map()} | {:error, Exception.t()}
Updates an existing inference endpoint's configuration.
Example
{:ok, ep} = HuggingfaceClient.update_inference_endpoint("my-endpoint",
  instance_size: "large",
  namespace: "my-org",
  access_token: "hf_..."
)
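Changing compute settings typically triggers a redeploy, so an update pairs naturally with the polling sketch from the module example above:

# Wait for the resized endpoint to come back up, reusing the
# hypothetical EndpointHelper from the module example.
{:ok, ep} = EndpointHelper.wait_until_running("my-endpoint",
  namespace: "my-org", access_token: "hf_...")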