ElixirDatasets.Loader (ElixirDatasets v0.1.0)


Functions for loading datasets from repositories.

Summary

Functions

load_dataset(repository, opts \\ [])

Loads a dataset from the given repository.

load_dataset!(repository, opts \\ [])

Similar to load_dataset/2 but raises an error if loading fails.

load_spec(repository, repo_files, num_proc)

Loads the specification of files to download from a repository.

Functions

load_dataset(repository, opts \\ [])

@spec load_dataset(
  ElixirDatasets.Repository.t_repository(),
  keyword()
) :: {:ok, [Explorer.DataFrame.t()] | Enumerable.t()} | {:error, Exception.t()}

Loads a dataset from the given repository.

The repository can be either a local directory or a Hugging Face repository.

Options

Data Loading Options

  • :split - which split of the data to load (e.g., "train", "test", "validation"). If not specified, all splits are loaded. Files are matched by name patterns (e.g., "train.csv", "test-00000.parquet", "validation.jsonl").

  • :name - the name of the dataset configuration to load. For datasets with multiple configurations, this specifies which one to use. Files are matched by looking for the config name in the file path (e.g., "sst2/train.parquet").

  • :streaming - if true, returns an enumerable that progressively yields data rows (maps) without loading the entire dataset into memory. Data is fetched on-demand as you iterate. Useful for large datasets. Default is false.

HuggingFace Hub Options

  • :auth_token - the token to use as HTTP bearer authorization for remote files. If not provided, the token from the ELIXIR_DATASETS_HF_TOKEN environment variable is used.

  • :cache_dir - the directory to store downloaded files in. Defaults to the standard cache location for the operating system.

  • :offline - if true, only cached files are used and no network requests are made. Returns an error if the file is not cached.

  • :etag - if provided, skips the HEAD request to fetch the latest ETag value and uses this value instead.

  • :download_mode - controls download/cache behavior. Can be:

    • :reuse_dataset_if_exists (default) - reuse cached data if available
    • :force_redownload - always download, even if cached
  • :verification_mode - controls verification checks. Can be:

    • :basic_checks (default) - basic validation
    • :no_checks - skip all validation
  • :num_proc - number of processes to use for parallel dataset processing. Default is 1 (no parallelization). Set to a higher number to speed up dataset downloading and loading. For example, num_proc: 4 will use 4 parallel processes.
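The options above can be combined in a single call. A sketch, reusing the "glue"/"sst2" repository and configuration names from the examples in this documentation:

iex> ElixirDatasets.Loader.load_dataset(
...>   {:hf, "glue"},
...>   name: "sst2",
...>   split: "train",
...>   num_proc: 4,
...>   download_mode: :reuse_dataset_if_exists
...> )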

Returns

  • When streaming: false (default): {:ok, datasets} where datasets is a list of Explorer.DataFrame.t()
  • When streaming: true: {:ok, stream} where stream is an Enumerable that yields rows progressively
  • On error: {:error, reason}

Examples

iex> ElixirDatasets.Loader.load_dataset({:hf, "dataset_name"}, split: "train")

iex> ElixirDatasets.Loader.load_dataset({:hf, "glue"}, name: "sst2")

iex> {:ok, stream} = ElixirDatasets.Loader.load_dataset(
...>  {:hf, "cornell-movie-review-data/rotten_tomatoes"},
...>  split: "train",
...>  streaming: true
...> )
iex> stream |> Enum.take(100)
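Because the return value is a tagged tuple, failures can be handled explicitly with a case expression. For instance, with offline: true an uncached dataset yields the {:error, exception} branch (a minimal sketch; "dataset_name" is a placeholder):

iex> case ElixirDatasets.Loader.load_dataset({:hf, "dataset_name"}, offline: true) do
...>   {:ok, datasets} -> datasets
...>   {:error, exception} -> IO.puts(Exception.message(exception))
...> end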

load_dataset!(repository, opts \\ [])

Similar to load_dataset/2 but raises an error if loading fails.

Accepts the same options as load_dataset/2.

Returns

  • a list of loaded datasets (or an enumerable if streaming: true)
  • raises an error if loading fails

Examples

iex> datasets = ElixirDatasets.Loader.load_dataset!({:hf, "cornell-movie-review-data/rotten_tomatoes"}, split: "train")

iex> stream = ElixirDatasets.Loader.load_dataset!({:hf, "cornell-movie-review-data/rotten_tomatoes"}, streaming: true)
iex> stream |> Enum.take(10)

load_spec(repository, repo_files, num_proc)

@spec load_spec(tuple(), map(), pos_integer()) ::
  {:ok, [{String.t(), String.t()}]} | {:error, String.t()}

Loads the specification of files to download from a repository.

Filters files by valid extensions and downloads them in parallel if num_proc > 1.

Parameters

  • repository - normalized repository tuple
  • repo_files - map of files from repository
  • num_proc - number of parallel processes to use

Returns

{:ok, paths_with_extensions} where each element is {path, extension}, or {:error, reason} if download fails.
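This low-level function is normally driven by load_dataset/2, but it can also be called directly. A hedged sketch, assuming repository is an already-normalized repository tuple and repo_files is the file map obtained from the repository listing (both produced elsewhere; their exact shapes are not shown here):

iex> {:ok, paths_with_extensions} = ElixirDatasets.Loader.load_spec(repository, repo_files, 4)
iex> Enum.each(paths_with_extensions, fn {path, ext} -> IO.puts("#{path} (#{ext})") end)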