Inference Endpoints API for dedicated model hosting.
Provides management of Hugging Face Inference Endpoints - dedicated infrastructure for model inference with autoscaling and GPU support.
Accelerator Options
- `:cpu` - CPU-based inference
- `:gpu` - GPU-based inference
Instance Sizes
- `:x1` - 1x resources
- `:x2` - 2x resources
- `:x4` - 4x resources
- `:x8` - 8x resources
Cloud Vendors
- `:aws` - Amazon Web Services
- `:azure` - Microsoft Azure
- `:gcp` - Google Cloud Platform
Endpoint Types
- `:public` - Publicly accessible
- `:protected` - Requires authentication (default)
- `:private` - Private VPC endpoint
Examples
```elixir
# List all endpoints
{:ok, endpoints} = HfHub.InferenceEndpoints.list()

# Create a GPU endpoint
{:ok, endpoint} = HfHub.InferenceEndpoints.create("my-endpoint",
  repository: "bert-base-uncased",
  accelerator: :gpu,
  instance_size: :x1,
  instance_type: "g5.xlarge",
  region: "us-east-1",
  vendor: :aws,
  task: "text-classification"
)

# Pause endpoint to save costs
{:ok, endpoint} = HfHub.InferenceEndpoints.pause("my-endpoint")

# Resume when needed
{:ok, endpoint} = HfHub.InferenceEndpoints.resume("my-endpoint")
```
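The examples above use happy-path matches, which raise a `MatchError` if a call fails; every function returns `{:ok, result}` or `{:error, reason}` per its spec, so application code typically branches on both outcomes. A minimal sketch (the option values are illustrative only):

```elixir
case HfHub.InferenceEndpoints.create("my-endpoint",
       repository: "bert-base-uncased",
       accelerator: :cpu,
       instance_size: :x1,
       instance_type: "c6i.xlarge",
       region: "us-east-1",
       vendor: :aws
     ) do
  {:ok, endpoint} ->
    # Creation accepted; the endpoint may still be provisioning.
    {:ok, endpoint}

  {:error, reason} ->
    # e.g. a duplicate name, an invalid option, or an auth failure.
    {:error, reason}
end
```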
Summary
Functions
Creates a new inference endpoint.
Deletes an endpoint.
Gets an endpoint by name.
Lists all inference endpoints.
Pauses an endpoint.
Resumes a paused endpoint.
Scales endpoint to zero replicas.
Updates an existing endpoint.
Functions
@spec create(String.t(), keyword()) :: {:ok, HfHub.InferenceEndpoints.Endpoint.t()} | {:error, term()}
Creates a new inference endpoint.
Arguments
- `name` - Endpoint name
Required Options
- `:repository` - Model repository ID (e.g., "bert-base-uncased")
- `:accelerator` - `:cpu` or `:gpu`
- `:instance_size` - `:x1`, `:x2`, `:x4`, or `:x8`
- `:instance_type` - Instance type (e.g., "g5.xlarge")
- `:region` - Cloud region (e.g., "us-east-1")
- `:vendor` - Cloud vendor: `:aws`, `:azure`, or `:gcp`
Optional
- `:framework` - "pytorch", "tensorflow", etc. (default: "pytorch")
- `:task` - ML task (e.g., "text-classification")
- `:namespace` - Organization namespace (default: current user)
- `:min_replica` - Minimum replicas (default: 0)
- `:max_replica` - Maximum replicas (default: 1)
- `:scale_to_zero_timeout` - Seconds before scaling to zero
- `:type` - `:public`, `:protected`, or `:private` (default: `:protected`)
- `:custom_image` - Custom Docker image configuration
- `:token` - Authentication token
Examples
```elixir
{:ok, endpoint} = HfHub.InferenceEndpoints.create("my-endpoint",
  repository: "bert-base-uncased",
  accelerator: :gpu,
  instance_size: :x1,
  instance_type: "g5.xlarge",
  region: "us-east-1",
  vendor: :aws,
  task: "text-classification"
)

{:ok, endpoint} = HfHub.InferenceEndpoints.create("my-endpoint",
  repository: "sentence-transformers/all-MiniLM-L6-v2",
  accelerator: :cpu,
  instance_size: :x2,
  instance_type: "c6i.xlarge",
  region: "eu-west-1",
  vendor: :aws,
  min_replica: 1,
  max_replica: 4,
  scale_to_zero_timeout: 300
)
```
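Provisioning typically completes asynchronously after `create/2` returns, so callers often poll `get/2` until the endpoint is ready. A hedged sketch - the `:status` field and its `"running"` value are assumptions about `Endpoint.t()`, not confirmed by this page:

```elixir
defmodule MyApp.EndpointHelpers do
  # Hypothetical helper: polls get/2 every 10 seconds until the endpoint
  # reports a running status or the attempt budget is exhausted.
  def wait_until_running(name, attempts \\ 30)

  def wait_until_running(_name, 0), do: {:error, :timeout}

  def wait_until_running(name, attempts) do
    case HfHub.InferenceEndpoints.get(name) do
      # NOTE: :status and "running" are assumed struct details;
      # check Endpoint.t() before relying on them.
      {:ok, %{status: "running"} = endpoint} ->
        {:ok, endpoint}

      {:ok, _still_provisioning} ->
        Process.sleep(10_000)
        wait_until_running(name, attempts - 1)

      {:error, reason} ->
        {:error, reason}
    end
  end
end
```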
@spec delete(String.t(), keyword()) :: :ok | {:error, term()}
Deletes an endpoint.
Warning: This is destructive and cannot be undone.
Arguments
- `name` - Endpoint name
Options
- `:namespace` - Organization namespace
- `:token` - Authentication token
Examples
```elixir
:ok = HfHub.InferenceEndpoints.delete("my-endpoint")
```
@spec get(String.t(), keyword()) :: {:ok, HfHub.InferenceEndpoints.Endpoint.t()} | {:error, term()}
Gets an endpoint by name.
Arguments
- `name` - Endpoint name
Options
- `:namespace` - Organization namespace (default: current user)
- `:token` - Authentication token
Examples
```elixir
{:ok, endpoint} = HfHub.InferenceEndpoints.get("my-endpoint")

{:ok, endpoint} = HfHub.InferenceEndpoints.get("my-endpoint", namespace: "my-org")
```
@spec list(keyword()) :: {:ok, [HfHub.InferenceEndpoints.Endpoint.t()]} | {:error, term()}
Lists all inference endpoints.
Options
- `:namespace` - Organization namespace (default: current user)
- `:token` - Authentication token
Examples
```elixir
{:ok, endpoints} = HfHub.InferenceEndpoints.list()

{:ok, endpoints} = HfHub.InferenceEndpoints.list(namespace: "my-org")
```
@spec pause(String.t(), keyword()) :: {:ok, HfHub.InferenceEndpoints.Endpoint.t()} | {:error, term()}
Pauses an endpoint.
Paused endpoints don't incur compute costs but retain their configuration. A paused endpoint must be explicitly resumed before it can serve requests.
Arguments
- `name` - Endpoint name
Options
- `:namespace` - Organization namespace
- `:token` - Authentication token
Examples
```elixir
{:ok, endpoint} = HfHub.InferenceEndpoints.pause("my-endpoint")
```
@spec resume(String.t(), keyword()) :: {:ok, HfHub.InferenceEndpoints.Endpoint.t()} | {:error, term()}
Resumes a paused endpoint.
Arguments
- `name` - Endpoint name
Options
- `:namespace` - Organization namespace
- `:token` - Authentication token
Examples
```elixir
{:ok, endpoint} = HfHub.InferenceEndpoints.resume("my-endpoint")
```
@spec scale_to_zero(String.t(), keyword()) :: {:ok, HfHub.InferenceEndpoints.Endpoint.t()} | {:error, term()}
Scales endpoint to zero replicas.
Different from `pause/2`: a scaled-to-zero endpoint wakes automatically on the next incoming request, while a paused endpoint must be explicitly resumed (see the sketch after the example below).
Arguments
- `name` - Endpoint name
Options
- `:namespace` - Organization namespace
- `:token` - Authentication token
Examples
```elixir
{:ok, endpoint} = HfHub.InferenceEndpoints.scale_to_zero("my-endpoint")
```
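To make the difference between the two idle strategies concrete, a minimal sketch using only the functions documented on this page:

```elixir
# Strategy 1: scale to zero - no replicas run, but the endpoint
# wakes automatically when the next inference request arrives.
{:ok, _endpoint} = HfHub.InferenceEndpoints.scale_to_zero("my-endpoint")

# Strategy 2: pause - the endpoint cannot serve requests
# until it is explicitly resumed.
{:ok, _endpoint} = HfHub.InferenceEndpoints.pause("my-endpoint")
{:ok, _endpoint} = HfHub.InferenceEndpoints.resume("my-endpoint")
```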
@spec update(String.t(), keyword()) :: {:ok, HfHub.InferenceEndpoints.Endpoint.t()} | {:error, term()}
Updates an existing endpoint.
Only provided options are updated; others remain unchanged.
Arguments
- `name` - Endpoint name
Options
- `:namespace` - Organization namespace
- `:accelerator` - `:cpu` or `:gpu`
- `:instance_size` - `:x1`, `:x2`, `:x4`, or `:x8`
- `:instance_type` - Instance type
- `:min_replica` - Minimum replicas
- `:max_replica` - Maximum replicas
- `:scale_to_zero_timeout` - Seconds before scaling to zero
- `:repository` - Model repository ID
- `:framework` - Framework ("pytorch", "tensorflow", etc.)
- `:revision` - Model revision
- `:task` - ML task
- `:token` - Authentication token
Examples
```elixir
{:ok, endpoint} = HfHub.InferenceEndpoints.update("my-endpoint",
  instance_size: :x2,
  max_replica: 4
)

{:ok, endpoint} = HfHub.InferenceEndpoints.update("my-endpoint",
  repository: "bert-large-uncased"
)
```
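Because `update/2` only touches the options you pass, a targeted change followed by `get/2` is a simple way to confirm the new configuration; a minimal sketch (which fields you inspect on the result depends on the actual `Endpoint.t()` struct):

```elixir
# Scale up for a traffic spike without touching any other settings.
{:ok, _endpoint} = HfHub.InferenceEndpoints.update("my-endpoint", instance_size: :x4)

# Re-fetch to confirm the change took effect.
{:ok, endpoint} = HfHub.InferenceEndpoints.get("my-endpoint")
```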