ElixirDatasets.Repository (ElixirDatasets v0.1.0)

Functions for managing dataset repositories (local and Hugging Face).

Summary

Types

A location to fetch dataset files from. Can be either a Hugging Face repository or a local resource.

Functions

Downloads a file from a repository.

Gets the list of files in a repository.

Normalizes a repository specification to a consistent format.

Converts a repository ID to a cache scope string.

Types

t_repository()

@type t_repository() ::
  {:hf, String.t()} | {:hf, String.t(), keyword()} | {:local, Path.t()}

A location to fetch dataset files from. Can be either a Hugging Face repository or a local resource:

  • {:hf, repository_id} - the Hugging Face repository ID

  • {:hf, repository_id, options} - the Hugging Face repository ID with additional options

  • {:local, path} - a local directory or file path containing the datasets

Functions

download(arg, filename, etag)

@spec download(t_repository(), String.t(), String.t() | nil) ::
  {:ok, String.t()} | {:error, String.t()}

Downloads a file from a repository.

For local repositories, verifies the file exists. For Hugging Face repositories, downloads the file using the Hub API.

Returns

{:ok, path} where path is the local file path, or {:error, reason} if the download fails.
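
The two return shapes are usually handled with a case expression. The following is a minimal sketch, not the library's implementation: LocalDownloadSketch is a hypothetical stand-in for the local branch, and it assumes the {:local, path} value points at a directory and that the etag argument is ignored for local files.

```elixir
defmodule LocalDownloadSketch do
  # Hypothetical stand-in for the local branch of download/3; the real
  # implementation may differ. It only checks that the file exists.
  # Assumption: the local path is a directory and the etag is unused.
  def download({:local, dir}, filename, _etag) do
    path = Path.join(dir, filename)

    if File.regular?(path) do
      {:ok, path}
    else
      {:error, "file not found: #{path}"}
    end
  end
end

# Create a throwaway directory with one file to exercise both outcomes.
dir = Path.join(System.tmp_dir!(), "repo_sketch_#{System.unique_integer([:positive])}")
File.mkdir_p!(dir)
File.write!(Path.join(dir, "train.csv"), "a,b\n1,2\n")

# Branch on the documented {:ok, path} | {:error, reason} result.
case LocalDownloadSketch.download({:local, dir}, "train.csv", nil) do
  {:ok, path} -> IO.puts("using #{path}")
  {:error, reason} -> IO.puts("download failed: #{reason}")
end
```

Matching on both tuples keeps the failure path explicit instead of letting a missing file surface later as a read error.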

get_files(arg)

@spec get_files(t_repository()) :: {:ok, map()} | {:error, String.t()}

Gets the list of files in a repository.

For local repositories, lists files in the directory. For Hugging Face repositories, fetches the file listing from the API.

Returns

{:ok, repo_files} where repo_files is a map of %{filename => etag}, or {:error, reason} if the operation fails.
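
Because the result is a plain map, the usual Map and Enum functions apply to it. The sketch below works on a hand-written value in the documented %{filename => etag} shape; the file names and etag strings are placeholders chosen for illustration, not output from a real repository.

```elixir
# Hypothetical result in the documented shape {:ok, %{filename => etag}}.
# The file names and etag values are placeholders, not real data.
result =
  {:ok,
   %{
     "README.md" => "etag-1",
     "data/train.csv" => "etag-2",
     "data/test.csv" => "etag-3"
   }}

# Keep only the CSV files, sorted by name; raise on the error tuple.
csv_files =
  case result do
    {:ok, repo_files} ->
      repo_files
      |> Map.keys()
      |> Enum.filter(&String.ends_with?(&1, ".csv"))
      |> Enum.sort()

    {:error, reason} ->
      raise "could not list files: #{reason}"
  end

IO.inspect(csv_files)
```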

normalize!(other)

@spec normalize!(t_repository()) :: t_repository()

Normalizes a repository specification to a consistent format.

Examples

iex> ElixirDatasets.Repository.normalize!({:hf, "repo/name"})
{:hf, "repo/name", []}

iex> ElixirDatasets.Repository.normalize!({:local, "/path/to/data"})
{:local, "/path/to/data"}
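
A normalization consistent with these examples can be sketched with one function clause per shape. NormalizeSketch is a hypothetical module, not the library's code, and the raising clause is an assumption based only on the trailing ! in the function name.

```elixir
defmodule NormalizeSketch do
  # Hypothetical re-implementation consistent with the documented examples;
  # the library's actual normalize!/1 may validate more or differ in detail.

  # A bare Hugging Face ID gains an empty options list.
  def normalize!({:hf, repository_id}) when is_binary(repository_id),
    do: {:hf, repository_id, []}

  # A Hugging Face ID that already carries options is returned unchanged.
  def normalize!({:hf, repository_id, options})
      when is_binary(repository_id) and is_list(options),
      do: {:hf, repository_id, options}

  # Local paths are returned unchanged.
  def normalize!({:local, path}) when is_binary(path),
    do: {:local, path}

  # Assumption: the trailing ! signals that invalid input raises.
  def normalize!(other),
    do: raise(ArgumentError, "invalid repository: #{inspect(other)}")
end

IO.inspect(NormalizeSketch.normalize!({:hf, "repo/name"}))
```

Normalizing once at the boundary means later code only has to match the three-element {:hf, id, options} tuple and the {:local, path} tuple.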

repository_id_to_cache_scope(repository_id)

@spec repository_id_to_cache_scope(String.t()) :: String.t()

Converts a repository ID to a cache scope string.

Replaces slashes with double dashes and removes other non-word characters. Hyphens are kept, as the example below shows.
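
The transformation can be approximated with two String.replace/3 calls. This is a sketch under assumptions: CacheScopeSketch is a hypothetical module, and the character class [^\w-] is a guess chosen so that hyphens survive, which matches the documented result for "user/repo-name".

```elixir
defmodule CacheScopeSketch do
  # Hypothetical re-implementation, not the library's code.
  # Assumption: hyphens are preserved; every other character outside
  # letters, digits, and underscore is dropped.
  def repository_id_to_cache_scope(repository_id) do
    repository_id
    |> String.replace("/", "--")
    |> String.replace(~r/[^\w-]/, "")
  end
end

IO.puts(CacheScopeSketch.repository_id_to_cache_scope("user/repo-name"))
```

The slash replacement runs first so that the separator between owner and repository name is converted to dashes rather than stripped by the second step.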

Examples

iex> ElixirDatasets.Repository.repository_id_to_cache_scope("user/repo-name")
"user--repo-name"