ElixirDatasets.Info (ElixirDatasets v0.1.0)

Functions for fetching and parsing dataset metadata from Hugging Face Hub.

Summary

Functions

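get_dataset_config_names(repository_id, opts \\ [])
  Gets the configuration names available for a dataset.

get_dataset_info(repository_id, opts \\ [])
  Fetches dataset information from the Hugging Face API.

get_dataset_infos(repository_id, opts \\ [])
  Fetches dataset information from the Hugging Face API and returns a list of DatasetInfo structs.

get_dataset_split_names(repository_id, opts \\ [])
  Gets the split names (e.g., "train", "test", "validation") for a dataset.

parse_dataset_infos(data)
  Parses a raw dataset info map into a list of DatasetInfo structs.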

Functions

get_dataset_config_names(repository_id, opts \\ [])

@spec get_dataset_config_names(
  String.t(),
  keyword()
) :: {:ok, [String.t()]} | {:error, String.t()}

Gets the configuration names available for a dataset.

Parameters

  • repository_id - the Hugging Face dataset repository ID (e.g., "glue")
  • opts - optional keyword list with the following options:
    • :auth_token - the token to use as HTTP bearer authorization

Returns

Returns {:ok, config_names} where config_names is a list of configuration names, or {:error, reason} if the request fails.

Examples

iex> {:ok, configs} = ElixirDatasets.Info.get_dataset_config_names("glue")
iex> Enum.member?(configs, "cola")
true
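
Because the function returns either {:ok, config_names} or {:error, reason}, callers usually want to handle both shapes. The following is a minimal sketch of one way to do that; ConfigCheck and require_config/2 are hypothetical names introduced for illustration and are not part of ElixirDatasets.

```elixir
defmodule ConfigCheck do
  # Hypothetical helper, not part of ElixirDatasets.
  # Takes the tuple returned by get_dataset_config_names/2 and checks
  # that the wanted configuration is listed.
  @spec require_config({:ok, [String.t()]} | {:error, String.t()}, String.t()) ::
          {:ok, String.t()} | {:error, String.t()}
  def require_config({:ok, configs}, wanted) do
    if wanted in configs do
      {:ok, wanted}
    else
      {:error, "unknown config #{inspect(wanted)}, available: #{Enum.join(configs, ", ")}"}
    end
  end

  # Request failures are passed through unchanged.
  def require_config({:error, reason}, _wanted), do: {:error, reason}
end
```

It could be used as "glue" |> ElixirDatasets.Info.get_dataset_config_names() |> ConfigCheck.require_config("cola"), which keeps the error case in the same {:error, reason} shape the library already uses.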

get_dataset_info(repository_id, opts \\ [])

@spec get_dataset_info(
  String.t(),
  keyword()
) :: {:ok, map()} | {:error, String.t()}

Fetches dataset information from the Hugging Face API.

Parameters

  • repository_id - the Hugging Face dataset repository ID (e.g., "aaaaa32r/elixirDatasets")
  • opts - optional keyword list with the following options:
    • :auth_token - the token to use as HTTP bearer authorization

Returns

Returns {:ok, dataset_info} where dataset_info is a map containing the dataset metadata, or {:error, reason} if the request fails.
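
The :auth_token option is only needed for private or gated repositories, so it can help to build opts conditionally. The sketch below assumes the token is supplied by the caller (for example from an environment variable); AuthOpts and build/1 are hypothetical names, not part of ElixirDatasets.

```elixir
defmodule AuthOpts do
  # Hypothetical helper, not part of ElixirDatasets.
  # Builds the opts keyword list from an optional token, so public
  # datasets can still be requested when no token is configured.
  @spec build(String.t() | nil) :: keyword()
  def build(nil), do: []
  def build(""), do: []
  def build(token) when is_binary(token), do: [auth_token: token]
end
```

A call might then look like ElixirDatasets.Info.get_dataset_info("aaaaa32r/elixirDatasets", AuthOpts.build(System.get_env("HF_TOKEN"))), where HF_TOKEN is an assumed variable name rather than one this module is documented to read.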

get_dataset_infos(repository_id, opts \\ [])

@spec get_dataset_infos(
  String.t(),
  keyword()
) :: {:ok, [ElixirDatasets.DatasetInfo.t()]} | {:error, String.t()}

Fetches dataset information from the Hugging Face API and returns a list of DatasetInfo structs.

This function retrieves all available dataset configurations for a given repository.

Parameters

  • repository_id - the Hugging Face dataset repository ID (e.g., "aaaaa32r/elixirDatasets")
  • opts - optional keyword list with the following options:
    • :auth_token - the token to use as HTTP bearer authorization

Returns

Returns {:ok, dataset_infos} where dataset_infos is a list of DatasetInfo structs, or {:error, reason} if the request fails.

Examples

iex> {:ok, infos} = ElixirDatasets.Info.get_dataset_infos("aaaaa32r/elixirDatasets")
iex> Enum.map(infos, & &1.config_name)
["csv", "default"]
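
Once the list is available, a common next step is picking the entry for one configuration. The sketch below relies only on the config_name field shown in the example above; InfoLookup and fetch_config/2 are hypothetical names, not part of ElixirDatasets.

```elixir
defmodule InfoLookup do
  # Hypothetical helper, not part of ElixirDatasets.
  # Finds the entry whose config_name matches. It works with any struct
  # or map that has a :config_name key, which is the only DatasetInfo
  # field this page shows.
  def fetch_config(infos, name) when is_list(infos) do
    case Enum.find(infos, &(&1.config_name == name)) do
      nil -> {:error, "no config named #{inspect(name)}"}
      info -> {:ok, info}
    end
  end
end
```

With the infos from the example above, InfoLookup.fetch_config(infos, "csv") would return {:ok, info} for the "csv" entry, and an {:error, reason} tuple for a name that is not in the list.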

get_dataset_split_names(repository_id, opts \\ [])

@spec get_dataset_split_names(
  String.t(),
  keyword()
) :: {:ok, [String.t()]} | {:error, String.t()}

Gets the split names (e.g., "train", "test", "validation") for a dataset.

Parameters

  • repository_id - the Hugging Face dataset repository ID (e.g., "cornell-movie-review-data/rotten_tomatoes")
  • opts - optional keyword list with the following options:
    • :auth_token - the token to use as HTTP bearer authorization

Returns

Returns {:ok, split_names} where split_names is a list of strings representing the available splits, or {:error, reason} if the request fails.

Examples

iex> {:ok, splits} = ElixirDatasets.Info.get_dataset_split_names("cornell-movie-review-data/rotten_tomatoes")
iex> splits
["train", "validation", "test"]
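
Split names vary between datasets, so code that needs an evaluation split may want to try several candidates in order. This is a minimal sketch under that assumption; SplitCheck and pick_split/2 are hypothetical names, not part of ElixirDatasets.

```elixir
defmodule SplitCheck do
  # Hypothetical helper, not part of ElixirDatasets.
  # Returns the first preferred split that the dataset actually has.
  def pick_split({:ok, splits}, preferred) do
    case Enum.find(preferred, &(&1 in splits)) do
      nil -> {:error, "none of #{inspect(preferred)} found in #{inspect(splits)}"}
      split -> {:ok, split}
    end
  end

  # Request failures are passed through unchanged.
  def pick_split({:error, reason}, _preferred), do: {:error, reason}
end
```

Piping the result of get_dataset_split_names/2 into SplitCheck.pick_split(["dev", "validation"]) would select "validation" for the rotten_tomatoes splits listed above.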

parse_dataset_infos(data)

@spec parse_dataset_infos(map()) :: [ElixirDatasets.DatasetInfo.t()]

Parses raw dataset info map into a list of DatasetInfo structs.

Extracts the dataset_info list from the cardData field of the Hugging Face API response and converts each entry into a DatasetInfo struct.

Parameters

  • data - the raw response map from the Hugging Face API

Returns

Returns a list of DatasetInfo structs, one for each entry in dataset_info.
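
To make the extraction step concrete, the sketch below pulls the raw entries out of a response map before any conversion to structs. It is illustrative only and may differ from the library's actual parsing: it assumes string keys (as produced by decoding JSON), assumes dataset_info may be a single map as well as a list, and uses the hypothetical names CardDataSketch and raw_entries/1.

```elixir
defmodule CardDataSketch do
  # Illustrative only; not the implementation of parse_dataset_infos/1.
  # Assumes string keys, as produced by decoding a JSON response.
  def raw_entries(data) when is_map(data) do
    data
    |> get_in(["cardData", "dataset_info"])
    # List.wrap/1 turns nil into [] and a single map into a one-element list.
    |> List.wrap()
  end
end
```

Normalizing to a list first means the later conversion step only has to handle one shape, and a response without cardData yields an empty list instead of raising.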