ElixirDatasets.Utils.Loader (ElixirDatasets v0.1.0)
Utility functions for loading datasets from various file formats.
Supports loading from CSV, Parquet, and JSONL files. Each format is handled by its corresponding loader function that returns a dataframe or decoded data.
Summary
Functions
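load_datasets_from_paths(paths_with_extensions, num_proc \\ 1)
Loads datasets from multiple file paths with optional parallel processing.
load_datasets_from_paths!(paths_with_extensions, num_proc \\ 1)
Similar to load_datasets_from_paths/2 but raises an error if loading fails.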
Functions
@spec load_datasets_from_paths([{Path.t(), String.t()}], pos_integer()) :: {:ok, [Explorer.DataFrame.t()]} | {:error, Exception.t()}
Loads datasets from multiple file paths with optional parallel processing.
Automatically detects the file format based on the extension and loads each file accordingly.
When num_proc is greater than 1, files are loaded in parallel across up to num_proc processes,
which can reduce total loading time when there are several files to load.
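The extension-based dispatch described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's actual implementation: the module `LoaderSketch` and its functions `loader_for/1` and `load/1` are hypothetical names. The Explorer functions it refers to (`Explorer.DataFrame.from_csv/1`, `from_parquet/1`, `from_ndjson/1`) exist in Explorer, but mapping "jsonl" to `from_ndjson` is an assumption; the real loader may decode JSONL differently.

```elixir
defmodule LoaderSketch do
  # Hypothetical lookup table: extension => {module, function}.
  # Storing the pair (rather than calling it) keeps the lookup side-effect free.
  @loaders %{
    "csv" => {Explorer.DataFrame, :from_csv},
    "parquet" => {Explorer.DataFrame, :from_parquet},
    # Assumption: JSONL files are read as newline-delimited JSON.
    "jsonl" => {Explorer.DataFrame, :from_ndjson}
  }

  # Returns {:ok, {module, function}} for a known extension,
  # or {:error, exception} for an unsupported one.
  def loader_for(extension) when is_binary(extension) do
    case Map.fetch(@loaders, String.downcase(extension)) do
      {:ok, loader} ->
        {:ok, loader}

      :error ->
        {:error, ArgumentError.exception("unsupported extension: #{inspect(extension)}")}
    end
  end

  # Loads a single {path, extension} tuple by applying the selected loader.
  # Requires Explorer to be available at runtime.
  def load({path, extension}) do
    with {:ok, {mod, fun}} <- loader_for(extension) do
      apply(mod, fun, [path])
    end
  end
end
```

Returning an exception struct in the error tuple mirrors the `{:error, Exception.t()}` shape in the spec, so callers can either pattern match on it or raise it directly.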
Parameters
- paths_with_extensions - list of {path, extension} tuples to load
- num_proc - number of processes to use for parallel loading (default: 1). When set to 1, files are loaded sequentially. When greater than 1, files are loaded in parallel using Task.async_stream with the specified concurrency.
Returns
- {:ok, [datasets]} - a list of loaded datasets in the same order as input
- {:error, reason} - if any file fails to load
Examples
# Sequential loading
paths = [{"data1.csv", "csv"}, {"data2.parquet", "parquet"}]
{:ok, datasets} = load_datasets_from_paths(paths)
# Parallel loading with 4 processes
{:ok, datasets} = load_datasets_from_paths(paths, 4)
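The ordered, fail-fast behavior described for this function (results in input order, `{:error, reason}` as soon as one file fails) can be approximated with `Task.async_stream/3`. The sketch below is a minimal illustration, not the library's code: `ParallelSketch.load_all/3` and its `load_fun` argument are hypothetical, and unlike the documented function it always goes through `Task.async_stream`, even when `num_proc` is 1.

```elixir
defmodule ParallelSketch do
  # Hypothetical helper: applies load_fun to every item using at most
  # num_proc concurrent tasks. load_fun must return {:ok, dataset} or
  # {:error, reason}. If load_fun raises, the linked task crash
  # propagates to the caller instead of becoming an error tuple.
  def load_all(items, load_fun, num_proc \\ 1)
      when is_integer(num_proc) and num_proc > 0 do
    items
    # ordered: true (the default) yields results in input order.
    |> Task.async_stream(load_fun,
      max_concurrency: num_proc,
      ordered: true,
      timeout: :infinity
    )
    # Stop at the first failed load and return its error tuple.
    |> Enum.reduce_while({:ok, []}, fn
      {:ok, {:ok, dataset}}, {:ok, acc} -> {:cont, {:ok, [dataset | acc]}}
      {:ok, {:error, reason}}, _acc -> {:halt, {:error, reason}}
    end)
    |> case do
      {:ok, datasets} -> {:ok, Enum.reverse(datasets)}
      {:error, _reason} = error -> error
    end
  end
end

# Usage with a stand-in loader that only echoes the path:
paths = [{"data1.csv", "csv"}, {"data2.parquet", "parquet"}]
fake_loader = fn {path, _extension} -> {:ok, path} end
{:ok, _datasets} = ParallelSketch.load_all(paths, fake_loader, 4)
```

Halting the reduction early also stops the underlying stream, so loads that have not started yet are skipped once a failure is seen.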
@spec load_datasets_from_paths!([{Path.t(), String.t()}], pos_integer()) :: [Explorer.DataFrame.t()]
Similar to load_datasets_from_paths/2 but raises an error if loading fails.
Loads datasets from multiple file paths with optional parallel processing and returns the list of datasets directly, rather than wrapped in an {:ok, datasets} tuple.
Parameters
- paths_with_extensions - list of {path, extension} tuples to load
- num_proc - number of processes to use for parallel loading (default: 1). When set to 1, files are loaded sequentially. When greater than 1, files are loaded in parallel.
Returns
- a list of loaded datasets in the same order as input
- raises an error if any file fails to load
Examples
# Sequential loading
iex> paths = [{"data1.csv", "csv"}, {"data2.parquet", "parquet"}]
iex> datasets = load_datasets_from_paths!(paths)
# Parallel loading with 4 processes
iex> datasets = load_datasets_from_paths!(paths, 4)
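The relationship between the two variants follows the usual Elixir convention: the bang function unwraps the `{:ok, value}` tuple and raises on `{:error, reason}`. A minimal sketch of that pattern, assuming nothing about how ElixirDatasets implements it (`BangSketch.unwrap!/1` is a hypothetical helper):

```elixir
defmodule BangSketch do
  # Hypothetical helper: turns a tagged result into a value or an exception.
  def unwrap!({:ok, value}), do: value

  # The spec gives the error as an exception struct, so it can be raised as-is.
  def unwrap!({:error, exception}) when is_exception(exception), do: raise(exception)

  # Fallback for non-exception reasons (an assumption, not documented behavior).
  def unwrap!({:error, reason}) do
    raise RuntimeError, "loading failed: #{inspect(reason)}"
  end
end

# Callers that prefer not to crash can rescue around the bang call:
result =
  try do
    BangSketch.unwrap!({:error, ArgumentError.exception("unsupported extension")})
  rescue
    error in ArgumentError -> {:rescued, Exception.message(error)}
  end
```

In practice, the non-raising load_datasets_from_paths/2 is usually the better fit when a failed load is an expected condition to handle, and the bang variant when it should abort the caller.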