HfHub

Elixir client for HuggingFace Hub — dataset/model metadata, file downloads, caching, and authentication. An Elixir port of Python's huggingface_hub.

hf_hub_ex provides a robust, production-ready interface to the HuggingFace Hub API, enabling Elixir applications to seamlessly access models, datasets, and spaces. This library is designed to be the foundational layer for porting Python HuggingFace libraries (datasets, evaluate, transformers) to the BEAM ecosystem.

Features

  • Hub API Client — Fetch metadata for models, datasets, and spaces
  • Bumblebee Compatible — Drop-in integration with Elixir ML pipelines via a tuple-based repository API
  • Repo Tree Listing — Recursive tree listing with pagination
  • File Downloads — Stream files from HuggingFace repositories with resume support
  • Archive Extraction — Optional extraction for zip/tar.gz/tgz/tar.xz/gz files
  • Smart Caching — Local file caching with LRU eviction and ETag-based validation
  • Filesystem Utilities — Manage local HuggingFace cache directory structure
  • Authentication — Token-based authentication for private repositories
  • Structured Errors — 30+ exception types matching Python's huggingface_hub
  • BEAM-native — Leverages OTP, GenServers, and supervision trees for reliability
  • Type-safe — Comprehensive typespecs and pattern matching

Installation

Add hf_hub to your dependencies in mix.exs:

def deps do
  [
    {:hf_hub, "~> 0.2.0"}
  ]
end

Then run:

mix deps.get

Quick Start

Authentication

Set your HuggingFace token as an environment variable or in config:

export HF_TOKEN="hf_..."

Or in config/config.exs:

config :hf_hub,
  token: System.get_env("HF_TOKEN"),
  cache_dir: Path.expand("~/.cache/huggingface")

Fetching Model Metadata

# Get model information
{:ok, model_info} = HfHub.Api.model_info("bert-base-uncased")

IO.inspect(model_info.id)          # "bert-base-uncased"
IO.inspect(model_info.downloads)   # 123456789
IO.inspect(model_info.tags)        # ["pytorch", "bert", "fill-mask"]

Downloading Files

# Download a model file
{:ok, path} = HfHub.Download.hf_hub_download(
  repo_id: "bert-base-uncased",
  filename: "config.json",
  repo_type: :model
)

# Read the downloaded file
{:ok, config} = File.read(path)

# Download and extract an archive (returns extracted path)
{:ok, extracted_path} = HfHub.Download.hf_hub_download(
  repo_id: "albertvillanova/tmp-tests-zip",
  filename: "ds.zip",
  repo_type: :dataset,
  extract: true
)

# Download with progress tracking
{:ok, path} = HfHub.Download.hf_hub_download(
  repo_id: "some/model",
  filename: "model.bin",
  progress_callback: fn downloaded, total ->
    if total, do: IO.puts("#{round(downloaded / total * 100)}%")
  end
)

# Download with SHA256 verification
{:ok, path} = HfHub.Download.hf_hub_download(
  repo_id: "some/model",
  filename: "model.bin",
  verify_checksum: true,
  expected_sha256: "abc123..."  # Optional: fails if hash doesn't match
)

Offline Mode

# Check if offline mode is enabled (via HF_HUB_OFFLINE=1 or config)
if HfHub.offline_mode?() do
  IO.puts("Running in offline mode - only cached files available")
end

# Try to load a file from cache without network requests
case HfHub.try_to_load_from_cache("bert-base-uncased", "config.json") do
  {:ok, path} ->
    # File is cached, use it directly
    File.read!(path)
  {:error, :not_cached} ->
    # File not cached, decide whether to download
    {:ok, path} = HfHub.Download.hf_hub_download(
      repo_id: "bert-base-uncased",
      filename: "config.json"
    )
    File.read!(path)
end

Accessing Datasets

# Get dataset information
{:ok, dataset_info} = HfHub.Api.dataset_info("squad")

# Download dataset files
{:ok, path} = HfHub.Download.hf_hub_download(
  repo_id: "squad",
  filename: "train-v1.1.json",
  repo_type: :dataset
)

# Discover configs and splits
{:ok, configs} = HfHub.Api.dataset_configs("dpdl-benchmark/caltech101")
{:ok, splits} = HfHub.Api.dataset_splits("dpdl-benchmark/caltech101", config: "default")

# Resolve file paths for a config + split
{:ok, files} =
  HfHub.DatasetFiles.resolve("dpdl-benchmark/caltech101", "default", "train")

Bumblebee-Compatible API

Use the tuple-based repository API for seamless integration with Elixir ML pipelines:

# Repository reference types
repo = {:hf, "bert-base-uncased"}
repo_with_opts = {:hf, "bert-base-uncased", revision: "v1.0", auth_token: "hf_xxx"}
local_repo = {:local, "/path/to/model"}

# List files with ETags for cache validation
{:ok, files} = HfHub.get_repo_files({:hf, "bert-base-uncased"})
# => %{"config.json" => "\"abc123\"", "pytorch_model.bin" => "\"def456\"", ...}

# ETag-based cached download
{:ok, path} = HfHub.cached_download(
  "https://huggingface.co/bert-base-uncased/resolve/main/config.json"
)

# Build file URLs
url = HfHub.file_url("bert-base-uncased", "config.json", "main")

Repository Management

# Create a new repository
{:ok, url} = HfHub.Repo.create("my-org/my-model", private: true)

# Create a Space with Gradio
{:ok, url} = HfHub.Repo.create("my-space", repo_type: :space, space_sdk: "gradio")

# Delete a repository
:ok = HfHub.Repo.delete("my-org/old-model")

# Update settings
:ok = HfHub.Repo.update_settings("my-model", private: true, gated: :auto)

# Move/rename
{:ok, url} = HfHub.Repo.move("old-name", "new-org/new-name")

# Check existence
true = HfHub.Repo.exists?("bert-base-uncased")

File Upload

# Upload a small file (< 10MB uses base64, >= 10MB uses LFS automatically)
{:ok, info} = HfHub.Commit.upload_file(
  "/path/to/model.bin",
  "model.bin",
  "my-org/my-model",
  token: token,
  commit_message: "Add model weights"
)

# Upload from binary content
{:ok, info} = HfHub.Commit.upload_file(
  Jason.encode!(%{hidden_size: 768}),
  "config.json",
  "my-model",
  token: token
)

# Delete a file
{:ok, info} = HfHub.Commit.delete_file("old_model.bin", "my-model", token: token)

# Multiple operations in one commit
alias HfHub.Commit.Operation

{:ok, info} = HfHub.Commit.create("my-model", [
  Operation.add("config.json", config_content),
  Operation.add("model.bin", "/path/to/model.bin"),
  Operation.delete("old_config.json")
], token: token, commit_message: "Update model")

Folder Upload

# Upload entire folder
{:ok, info} = HfHub.Commit.upload_folder(
  "/path/to/model_dir",
  "my-org/my-model",
  token: token,
  commit_message: "Upload model"
)

# With pattern filtering
{:ok, info} = HfHub.Commit.upload_folder(
  "/path/to/model_dir",
  "my-model",
  token: token,
  ignore_patterns: ["*.pyc", "__pycache__/**"],
  allow_patterns: ["*.safetensors", "*.json"]
)

# Large folder with automatic batching
{:ok, infos} = HfHub.Commit.upload_large_folder(
  "/path/to/huge_model",
  "my-model",
  token: token,
  multi_commits: true
)

Git Operations

# Create a branch
{:ok, info} = HfHub.Git.create_branch("my-org/my-model", "feature-branch", token: token)

# Create branch from specific revision
{:ok, info} = HfHub.Git.create_branch("my-model", "hotfix", revision: "v1.0", token: token)

# Delete a branch
:ok = HfHub.Git.delete_branch("my-model", "old-branch", token: token)

# Create a tag
{:ok, info} = HfHub.Git.create_tag("my-model", "v1.0", token: token)

# Create annotated tag with message
{:ok, info} = HfHub.Git.create_tag("my-model", "v2.0",
  revision: "abc123",
  message: "Release v2.0",
  token: token
)

# List all refs (branches, tags)
{:ok, refs} = HfHub.Git.list_refs("bert-base-uncased")
refs.branches  # [%BranchInfo{name: "main", ...}]
refs.tags      # [%TagInfo{name: "v1.0", ...}]

# List commits
{:ok, commits} = HfHub.Git.list_commits("bert-base-uncased", revision: "main")

# Super squash (destructive - squashes all commits)
:ok = HfHub.Git.super_squash("my-model", message: "Squashed history", token: token)

User & Organization Profiles

# Get user profile
{:ok, user} = HfHub.Users.get("username")
IO.inspect(user.num_followers)

# List followers/following
{:ok, followers} = HfHub.Users.list_followers("username")
{:ok, following} = HfHub.Users.list_following("username")

# Like/unlike repos
:ok = HfHub.Users.like("bert-base-uncased")
:ok = HfHub.Users.unlike("bert-base-uncased")

# Organization info
{:ok, org} = HfHub.Organizations.get("huggingface")
{:ok, members} = HfHub.Organizations.list_members("huggingface")

Model & Dataset Cards

# Load and parse cards
{:ok, card} = HfHub.Cards.load_model_card("bert-base-uncased")
card.data.license  # "apache-2.0"
card.data.tags     # ["pytorch", "bert", "fill-mask"]

{:ok, card} = HfHub.Cards.load_dataset_card("squad")
card.data.task_categories  # ["question-answering"]

# Parse from content
{:ok, card} = HfHub.Cards.parse_model_card(readme_content)

# Create and render cards
card = HfHub.Cards.create_model_card(%{
  language: "en",
  license: "mit",
  tags: ["text-classification"]
})
markdown = HfHub.Cards.render(card)

Cache Management

# Check if a file is cached
cached? = HfHub.Cache.cached?(
  repo_id: "bert-base-uncased",
  filename: "pytorch_model.bin"
)

# Clear cache for a specific repo
:ok = HfHub.Cache.clear_cache(repo_id: "bert-base-uncased")

# Get cache statistics
{:ok, stats} = HfHub.Cache.cache_stats()
IO.inspect(stats.total_size)  # Total bytes in cache
IO.inspect(stats.file_count)  # Number of cached files

Examples

The examples/ directory contains runnable scripts demonstrating common use cases:

# Run all examples at once
./examples/run_all.sh

# Or run individual examples:
mix run examples/list_datasets.exs      # List top datasets
mix run examples/list_models.exs        # List popular models
mix run examples/dataset_info.exs       # Get dataset metadata
mix run examples/list_repo_tree.exs     # List repo tree entries
mix run examples/dataset_configs_splits.exs  # Dataset configs + splits
mix run examples/dataset_files_resolver.exs  # Resolve dataset files by config + split
mix run examples/download_file.exs      # Download a single file
mix run examples/download_with_extract.exs   # Download + extract archives
mix run examples/cache_demo.exs         # Cache management demo
mix run examples/stream_download.exs    # Stream large files
mix run examples/snapshot_download.exs  # Download entire repo
mix run examples/auth_demo.exs          # Authentication flow

See the examples README for detailed documentation.

API Overview

HfHub.Api

Interact with the HuggingFace Hub API:

  • model_info/2 — Fetch model metadata
  • dataset_info/2 — Fetch dataset metadata
  • space_info/2 — Fetch space metadata
  • list_models/1 — List models with filters
  • list_datasets/1 — List datasets with filters
  • list_repo_tree/2 — List repo tree entries (files + folders)
  • list_files/2 — List files in a repository
  • dataset_configs/2 — Get dataset configuration/subset names
  • dataset_splits/2 — Get dataset split names for a config
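
For example, listing models with a search filter. A minimal sketch; the option keys accepted by list_models/1 are assumptions modeled on the Python client:

# Hypothetical option keys; check the module docs for the supported filters
{:ok, models} = HfHub.Api.list_models(search: "bert", limit: 10)

# Assumes each entry exposes an :id field, like model_info/2 results
Enum.each(models, fn model -> IO.puts(model.id) end)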

HfHub.Download

Download files from HuggingFace repositories:

  • hf_hub_download/1 — Download a single file (with caching, optional extraction)
  • snapshot_download/1 — Download entire repository snapshot
  • download_stream/1 — Stream download for large files
  • resume_download/1 — Resume interrupted downloads
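
snapshot_download/1 fetches every file in a repository. A sketch, assuming it accepts the same keyword-list style as hf_hub_download/1:

{:ok, snapshot_path} = HfHub.Download.snapshot_download(
  repo_id: "bert-base-uncased",
  repo_type: :model
)

# The returned directory contains all repo files at the resolved revision
File.ls!(snapshot_path)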

HfHub.DatasetFiles

Resolve dataset files by config and split:

  • resolve/4 — Resolve file paths by config + split
  • resolve_from_tree/3 — Resolve file paths from a repo tree

HfHub.Cache

Manage local file cache:

  • cached?/1 — Check if file exists in cache
  • cache_path/1 — Get local path for cached file
  • clear_cache/1 — Remove cached files
  • cache_stats/0 — Get cache usage statistics
  • evict_lru/1 — Evict least recently used files
  • validate_integrity/0 — Validate checksums of cached files
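
A housekeeping sketch; the argument to evict_lru/1 (a target size in bytes) and the return shapes are assumptions:

# Assumed: evict least recently used files down toward a 1 GB target
HfHub.Cache.evict_lru(1 * 1024 * 1024 * 1024)

# Re-check checksums of everything in the cache
HfHub.Cache.validate_integrity()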

HfHub.FS

Filesystem utilities for HuggingFace cache:

  • ensure_cache_dir/0 — Create cache directory structure
  • repo_path/2 — Get local path for repository
  • file_path/4 — Get local path for file in repository
  • lock_file/2 — Acquire file lock for concurrent downloads
  • unlock_file/1 — Release a file lock
  • cache_dir/0 — Get configured cache directory
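
For example, preparing the cache layout and guarding a concurrent download. A sketch; the lock_file/2 argument and return shapes are assumptions:

HfHub.FS.ensure_cache_dir()
dir = HfHub.FS.cache_dir()

# Hypothetical lock name serializing concurrent downloads of one file
{:ok, lock} = HfHub.FS.lock_file(dir, "bert-base-uncased__config.json.lock")
# ... perform the download ...
HfHub.FS.unlock_file(lock)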

HfHub.Config

Configuration utilities:

  • endpoint/0 — Get HuggingFace Hub endpoint URL
  • cache_dir/0 — Get cache directory path
  • http_opts/0 — Get HTTP client options
  • cache_opts/0 — Get cache options

HfHub.Auth

Authentication and authorization:

  • get_token/0 — Retrieve HuggingFace token
  • set_token/1 — Set authentication token
  • login/1 — Interactive login flow
  • logout/0 — Remove stored credentials
  • whoami/0 — Get current user information
  • validate_token/1 — Validate token format
  • auth_headers/1 — Build HTTP authorization headers
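
A quick credential check. A sketch, assuming whoami/0 wraps the user data in an {:ok, ...} tuple:

case HfHub.Auth.whoami() do
  {:ok, user} -> IO.puts("Authenticated as #{inspect(user)}")
  {:error, reason} -> IO.puts("Not authenticated: #{inspect(reason)}")
end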

HfHub.Hub

Bumblebee-compatible ETag-based caching:

  • cached_download/2 — Download with ETag-based cache validation
  • file_url/3 — Build file URL for repository
  • file_listing_url/3 — Build tree listing URL

HfHub.Repository

Repository reference types and helpers:

  • normalize!/1 — Normalize repository tuples
  • file_url/2 — Build file URL from repository reference
  • file_listing_url/1 — Build listing URL from repository reference
  • cache_scope/1 — Convert repo ID to cache scope string
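
For example, going from a repository tuple to a URL and a cache scope. A sketch; the exact return values are assumptions:

repo = HfHub.Repository.normalize!({:hf, "bert-base-uncased"})

# Assumes the second argument is the file path within the repo
url = HfHub.Repository.file_url(repo, "config.json")

# A filesystem-safe string derived from the repo ID
scope = HfHub.Repository.cache_scope("bert-base-uncased")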

HfHub.RepoFiles

Repository file listing with ETags:

  • get_repo_files/1 — Get map of files to ETags for cache validation

HfHub.Constants

Constants matching Python's huggingface_hub.constants:

  • File names: config_name/0, pytorch_weights_name/0, safetensors_single_file/0
  • Timeouts: default_etag_timeout/0, default_download_timeout/0
  • Repository types: repo_types/0, repo_type_url_prefix/1
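
These pair naturally with the download API. Python's CONFIG_NAME is "config.json", so the same value is assumed here:

{:ok, path} = HfHub.Download.hf_hub_download(
  repo_id: "bert-base-uncased",
  filename: HfHub.Constants.config_name()
)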

HfHub.Errors

Structured exceptions for error handling:

  • Repository: RepositoryNotFound, RevisionNotFound, EntryNotFound, GatedRepo
  • HTTP: HTTPError, BadRequest, OfflineMode
  • Cache: CacheNotFound, CorruptedCache, LocalEntryNotFound
  • Inference: InferenceTimeout, InferenceEndpointError
  • Storage: XetError, DDUFError, SafetensorsParsing
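
A minimal sketch of matching on these structs, assuming failures are returned as exception structs inside {:error, ...} tuples:

case HfHub.Api.model_info("no-such-org/no-such-model") do
  {:ok, info} ->
    info

  {:error, %HfHub.Errors.RepositoryNotFound{} = err} ->
    IO.puts("Repository missing: #{Exception.message(err)}")

  {:error, other} ->
    IO.puts("Request failed: #{inspect(other)}")
end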

HfHub.LFS

LFS (Large File Storage) utilities:

  • UploadInfo.from_path/1 — Create upload info from file
  • UploadInfo.from_binary/1 — Create upload info from binary
  • sha256_hex/1 — Get hex-encoded SHA256 hash
  • oid/1 — Get LFS object identifier
  • lfs_headers/0 — Get standard LFS headers
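
For example, computing LFS identifiers for a payload. A sketch; whether oid/1 takes raw content (rather than an UploadInfo) is an assumption:

content = File.read!("/path/to/model.bin")

info = HfHub.LFS.UploadInfo.from_binary(content)
digest = HfHub.LFS.sha256_hex(content)

# In Git LFS the object ID is derived from the SHA256 of the content
oid = HfHub.LFS.oid(content)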

HfHub.Commit

Commit operations for file uploads:

  • create/3 — Create commit with multiple operations
  • upload_file/4 — Upload single file (regular or LFS)
  • upload_folder/3 — Upload entire directory with pattern filtering
  • upload_large_folder/3 — Upload large directories with automatic batching
  • delete_file/3 — Delete file from repository
  • delete_folder/3 — Delete folder from repository
  • matches_pattern?/2 — Check if path matches gitignore-style pattern
  • needs_lfs?/1 — Check if file needs LFS upload
  • lfs_threshold/0 — Get LFS size threshold (10MB)
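
The helpers are usable standalone, e.g. to preview what an upload_folder/3 call would pick up. A sketch; the argument order for matches_pattern?/2 is an assumption:

# Which local files survive the ignore pattern?
"/path/to/model_dir/**"
|> Path.wildcard()
|> Enum.reject(&HfHub.Commit.matches_pattern?(&1, "*.pyc"))

# Would this file be routed through LFS?
HfHub.Commit.needs_lfs?("/path/to/model.bin")
IO.puts("LFS threshold: #{HfHub.Commit.lfs_threshold()} bytes")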

HfHub.Git

Git operations for branch, tag, and commit management:

  • create_branch/3 — Create a new branch from a revision
  • delete_branch/3 — Delete a branch
  • create_tag/3 — Create a tag (lightweight or annotated)
  • delete_tag/3 — Delete a tag
  • list_refs/2 — List all refs (branches, tags, converts, pull requests)
  • list_commits/2 — List commit history for a revision
  • super_squash/2 — Squash all commits (destructive)

HfHub.Users

User profile and activity API:

  • get/2 — Get user profile by username
  • list_followers/2 — List users who follow a user
  • list_following/2 — List users a user is following
  • list_liked_repos/2 — List repositories liked by a user
  • like/2, unlike/2 — Like/unlike repositories
  • list_likers/2 — List users who liked a repository

HfHub.Organizations

Organization profile API:

  • get/2 — Get organization profile by name
  • list_members/2 — List organization members

HfHub.Cards

Model and Dataset card parsing and creation:

  • load_model_card/2 — Load and parse model card from repository
  • load_dataset_card/2 — Load and parse dataset card from repository
  • parse_model_card/1 — Parse model card from markdown content
  • parse_dataset_card/1 — Parse dataset card from markdown content
  • create_model_card/1 — Create model card from data
  • create_dataset_card/1 — Create dataset card from data
  • render/1 — Render card to markdown with YAML frontmatter

Configuration

Configure hf_hub in your config/config.exs:

config :hf_hub,
  # Authentication token (defaults to HF_TOKEN env var)
  token: System.get_env("HF_TOKEN"),

  # Cache directory (defaults to ~/.cache/huggingface)
  cache_dir: Path.expand("~/.cache/huggingface"),

  # Hub endpoint (defaults to https://huggingface.co)
  endpoint: "https://huggingface.co",

  # HTTP client options
  http_opts: [
    receive_timeout: 30_000,
    pool_timeout: 5_000
  ],

  # Cache options
  cache_opts: [
    max_size: 10 * 1024 * 1024 * 1024,  # 10 GB
    eviction_policy: :lru
  ]

Comparison to Python's huggingface_hub

hf_hub_ex aims for feature parity with the Python library while embracing Elixir idioms:

Feature                  Python huggingface_hub   Elixir hf_hub_ex
API Client               ✅                        ✅
File Downloads           ✅                        ✅
Caching                  ✅                        ✅ (OTP-based)
Authentication           ✅                        ✅
Repository Management    ✅                        ✅
Upload Files             ✅                        ✅
Inference API            ✅                        🚧 (Planned)

Key Differences

  • Concurrency — Leverages OTP for parallel downloads and supervision (see the sketch below)
  • Caching — GenServer-based cache with configurable eviction policies
  • Error Handling — Pattern matching with {:ok, result} / {:error, reason} tuples
  • Type Safety — Comprehensive typespecs and Dialyzer integration
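
As an illustration of the concurrency point, fetching several files in parallel with Task.async_stream, built on the documented hf_hub_download/1:

files = ["config.json", "tokenizer.json", "vocab.txt"]

paths =
  files
  |> Task.async_stream(
    fn filename ->
      HfHub.Download.hf_hub_download(
        repo_id: "bert-base-uncased",
        filename: filename
      )
    end,
    max_concurrency: 3,
    timeout: 60_000
  )
  # Each stream element is {:ok, download_result}
  |> Enum.map(fn {:ok, {:ok, path}} -> path end)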

Roadmap

  • [x] Core API client (models, datasets, spaces)
  • [x] File download with caching
  • [x] Authentication support
  • [x] Repository management (create, delete, update)
  • [x] File uploads (single file, LFS support)
  • [x] Folder uploads (with pattern filtering and batching)
  • [ ] Inference API client
  • [ ] WebSocket support for real-time inference
  • [ ] Integration with crucible_datasets for dataset loading

See docs/ROADMAP.md for detailed feature-parity status with Python's huggingface_hub.

Contributing

Contributions are welcome! Please follow these guidelines:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/my-new-feature)
  3. Write tests for your changes
  4. Ensure all tests pass (mix test)
  5. Run code quality checks (mix format && mix credo && mix dialyzer)
  6. Commit your changes (git commit -am 'Add new feature')
  7. Push to the branch (git push origin feature/my-new-feature)
  8. Create a Pull Request

Testing

# Run all tests
mix test

# Run with coverage
mix test --cover

# Run specific test file
mix test test/hf_hub/api_test.exs

License

MIT License - See LICENSE for details.

Acknowledgments

An Elixir port of Python's huggingface_hub; thanks to the Hugging Face team for the original library and API design.

Built with ❤️ by the North-Shore-AI team