ElixirDatasets.Loader (ElixirDatasets v0.1.0)
Functions for loading datasets from repositories.
Summary
Functions
Loads a dataset from the given repository.
Similar to load_dataset/2 but raises an error if loading fails.
Loads the specification of files to download from a repository.
Functions
@spec load_dataset(ElixirDatasets.Repository.t_repository(), keyword()) ::
        {:ok, [Explorer.DataFrame.t()] | Enumerable.t()} | {:error, Exception.t()}
Loads a dataset from the given repository.
The repository can be either a local directory or a Hugging Face repository.
Options
Data Loading Options
* :split - which split of the data to load (e.g., "train", "test", "validation"). If not specified, all splits are loaded. Files are matched by name patterns (e.g., "train.csv", "test-00000.parquet", "validation.jsonl").

* :name - the name of the dataset configuration to load. For datasets with multiple configurations, this specifies which one to use. Files are matched by looking for the config name in the file path (e.g., "sst2/train.parquet").

* :streaming - if true, returns an enumerable that progressively yields data rows (maps) without loading the entire dataset into memory. Data is fetched on demand as you iterate. Useful for large datasets. Defaults to false.
HuggingFace Hub Options
* :auth_token - the token to use as HTTP bearer authorization for remote files. If not provided, the token from the ELIXIR_DATASETS_HF_TOKEN environment variable is used.

* :cache_dir - the directory to store downloaded files in. Defaults to the standard cache location for the operating system.

* :offline - if true, only cached files are used and no network requests are made. Returns an error if the file is not cached.

* :etag - if provided, skips the HEAD request to fetch the latest ETag value and uses this value instead.

* :download_mode - controls download/cache behavior. Can be:
  * :reuse_dataset_if_exists (default) - reuse cached data if available
  * :force_redownload - always download, even if cached

* :verification_mode - controls verification checks. Can be:
  * :basic_checks (default) - basic validation
  * :no_checks - skip all validation

* :num_proc - number of processes to use for parallel dataset processing. Default is 1 (no parallelization). Set to a higher number to speed up dataset downloading and loading. For example, num_proc: 4 will use 4 parallel processes.
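Taken together, the Hub options above can be combined to prime a cache once and then reuse it offline. This is a hedged sketch: the dataset name and cache path are illustrative, and the behavior follows the option descriptions above rather than tested output.

```elixir
# First run: download with 4 parallel processes into an explicit cache
# directory (path is an illustrative placeholder).
{:ok, datasets} =
  ElixirDatasets.Loader.load_dataset(
    {:hf, "cornell-movie-review-data/rotten_tomatoes"},
    split: "train",
    cache_dir: "/tmp/elixir_datasets_cache",
    download_mode: :reuse_dataset_if_exists,
    num_proc: 4
  )

# Later run: reuse the cached files without any network requests.
{:ok, _cached} =
  ElixirDatasets.Loader.load_dataset(
    {:hf, "cornell-movie-review-data/rotten_tomatoes"},
    split: "train",
    cache_dir: "/tmp/elixir_datasets_cache",
    offline: true
  )
```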
Returns
- When streaming: false (default): {:ok, datasets} where datasets is a list of Explorer.DataFrame.t()
- When streaming: true: {:ok, stream} where stream is an Enumerable that yields rows progressively
- On error: {:error, reason}
Examples
iex> ElixirDatasets.Loader.load_dataset({:hf, "dataset_name"}, split: "train")
iex> ElixirDatasets.Loader.load_dataset({:hf, "glue"}, name: "sst2")
iex> {:ok, stream} = ElixirDatasets.Loader.load_dataset(
...> {:hf, "cornell-movie-review-data/rotten_tomatoes"},
...> split: "train",
...> streaming: true
...> )
...> stream |> Stream.take(100) |> IO.inspect()
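The return shapes listed above can be handled with an ordinary case expression. A minimal sketch, assuming a non-streaming call (the dataset name is illustrative):

```elixir
case ElixirDatasets.Loader.load_dataset({:hf, "dataset_name"}, split: "train") do
  {:ok, [df | _rest]} ->
    # Non-streaming: each element is an Explorer.DataFrame
    Explorer.DataFrame.n_rows(df)

  {:error, reason} ->
    # reason is an Exception struct; Exception.message/1 renders it
    IO.puts(Exception.message(reason))
end
```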
@spec load_dataset!(ElixirDatasets.Repository.t_repository(), keyword()) ::
        [Explorer.DataFrame.t()] | Enumerable.t()
Similar to load_dataset/2 but raises an error if loading fails.
Accepts the same options as load_dataset/2.
Returns
- a list of loaded datasets (or a Stream if streaming is enabled)
- raises an error if loading fails
Examples
iex> datasets = ElixirDatasets.Loader.load_dataset!({:hf, "cornell-movie-review-data/rotten_tomatoes"}, split: "train")
iex> stream = ElixirDatasets.Loader.load_dataset!({:hf, "cornell-movie-review-data/rotten_tomatoes"}, streaming: true)
iex> stream |> Enum.take(10)
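Because load_dataset!/2 raises instead of returning an error tuple, callers that expect failures can wrap it in try/rescue. A sketch under the assumption that loading a nonexistent repository raises (the repository name is an illustrative placeholder):

```elixir
try do
  ElixirDatasets.Loader.load_dataset!({:hf, "no/such_dataset"})
rescue
  error ->
    IO.puts("loading failed: " <> Exception.message(error))
end
```

When the caller can recover from a missing dataset, load_dataset/2 with its {:ok, _}/{:error, _} tuple is usually the more idiomatic choice; load_dataset!/2 fits pipelines where a failure should crash the process.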
@spec load_spec(tuple(), map(), pos_integer()) ::
        {:ok, [{String.t(), String.t()}]} | {:error, String.t()}
Loads the specification of files to download from a repository.
Filters files by valid extensions and downloads them in parallel if num_proc > 1.
Parameters
- repository - normalized repository tuple
- repo_files - map of files from the repository
- num_proc - number of parallel processes to use
Returns
{:ok, paths_with_extensions} where each element is {path, extension},
or {:error, reason} if download fails.
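A sketch of consuming that return value, built only from the parameter and return descriptions above: the repository tuple and the shape of the repo_files map are illustrative placeholders, not the library's documented internals.

```elixir
repository = {:hf, "dataset_name"}
# Hypothetical file map; the real structure comes from the repository listing.
repo_files = %{"train.parquet" => %{}, "README.md" => %{}}

case ElixirDatasets.Loader.load_spec(repository, repo_files, 1) do
  {:ok, paths_with_extensions} ->
    # Each entry pairs a path with its extension
    Enum.each(paths_with_extensions, fn {path, ext} ->
      IO.puts("#{path} (#{ext})")
    end)

  {:error, reason} ->
    IO.puts("download failed: " <> reason)
end
```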