ElixirDatasets.Utils.Loader (ElixirDatasets v0.1.0)

Utility functions for loading datasets from various file formats.

Supports loading from CSV, Parquet, and JSONL files. Each format is handled by its corresponding loader function that returns a dataframe or decoded data.

Functions

load_datasets_from_paths(paths_with_extensions, num_proc \\ 1)

@spec load_datasets_from_paths([{Path.t(), String.t()}], pos_integer()) ::
  {:ok, [Explorer.DataFrame.t()]} | {:error, Exception.t()}

Loads datasets from multiple file paths with optional parallel processing.

Automatically detects the file format based on the extension and loads each file accordingly. When num_proc is greater than 1, files are loaded in parallel using multiple processes, which can significantly speed up loading when dealing with multiple files.

Parameters

  • paths_with_extensions - list of {path, extension} tuples to load
  • num_proc - number of processes to use for parallel loading (default: 1). When set to 1, files are loaded sequentially. When greater than 1, files are loaded in parallel using Task.async_stream with the specified concurrency.
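The two modes described for num_proc can be sketched with the standard library alone. This is not the library's implementation, only the general Task.async_stream pattern with a concurrency limit. The module name LoaderSketch, the function load_all/3, and the load_fun argument (a stand-in for the per-format loaders) are hypothetical.

```elixir
defmodule LoaderSketch do
  # Hypothetical helper: `load_fun` stands in for a per-format loader
  # and receives one {path, extension} tuple at a time.
  def load_all(paths_with_extensions, load_fun, num_proc \\ 1)

  def load_all(paths_with_extensions, load_fun, 1) do
    # Sequential: files are loaded one after another in the calling process.
    Enum.map(paths_with_extensions, load_fun)
  end

  def load_all(paths_with_extensions, load_fun, num_proc)
      when is_integer(num_proc) and num_proc > 1 do
    paths_with_extensions
    |> Task.async_stream(load_fun,
      # At most `num_proc` files are loaded at the same time.
      max_concurrency: num_proc,
      # Results are emitted in input order (this is also the default).
      ordered: true,
      # Assumption: large files may take longer than the 5 second default.
      timeout: :infinity
    )
    # Each successful task yields {:ok, result}; unwrap to the loader's value.
    |> Enum.map(fn {:ok, result} -> result end)
  end
end

# Usage with a fake loader that touches no files.
fake_loader = fn {path, ext} -> {:ok, {path, ext}} end
paths = [{"data1.csv", "csv"}, {"data2.parquet", "parquet"}]
LoaderSketch.load_all(paths, fake_loader, 4)
```

Because the stream is ordered, the parallel result list lines up with the input list exactly as the sequential one does.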

Returns

  • {:ok, datasets} - datasets is the list of loaded datasets, in the same order as the input
  • {:error, reason} - if any file fails to load; per the spec, reason is an exception struct
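One way to obtain this all-or-nothing return shape from a list of per-file results is Enum.reduce_while/3. The sketch below is an assumption about how such a reduction could look, not the library's code; the module ResultSketch and the function collect/1 are hypothetical names.

```elixir
defmodule ResultSketch do
  # Collapses a list of {:ok, value} | {:error, reason} results into
  # {:ok, values} in input order, or the first {:error, reason} found.
  def collect(results) do
    results
    |> Enum.reduce_while({:ok, []}, fn
      {:ok, value}, {:ok, acc} -> {:cont, {:ok, [value | acc]}}
      {:error, reason}, _acc -> {:halt, {:error, reason}}
    end)
    |> case do
      # Values were prepended, so reverse to restore input order.
      {:ok, values} -> {:ok, Enum.reverse(values)}
      {:error, _reason} = error -> error
    end
  end
end

ResultSketch.collect([{:ok, :first}, {:ok, :second}])
ResultSketch.collect([{:ok, :first}, {:error, :enoent}])
```

Halting on the first error means later results are never inspected, which matches the "if any file fails to load" wording above.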

Examples

alias ElixirDatasets.Utils.Loader

# Sequential loading
paths = [{"data1.csv", "csv"}, {"data2.parquet", "parquet"}]
{:ok, datasets} = Loader.load_datasets_from_paths(paths)

# Parallel loading with 4 processes
{:ok, datasets} = Loader.load_datasets_from_paths(paths, 4)

load_datasets_from_paths!(paths_with_extensions, num_proc \\ 1)

@spec load_datasets_from_paths!([{Path.t(), String.t()}], pos_integer()) :: [
  Explorer.DataFrame.t()
]

Same as load_datasets_from_paths/2, but returns the list of datasets directly and raises an exception if any file fails to load.

Parameters

  • paths_with_extensions - list of {path, extension} tuples to load
  • num_proc - number of processes to use for parallel loading (default: 1). When set to 1, files are loaded sequentially. When greater than 1, files are loaded in parallel.

Returns

  • a list of loaded datasets in the same order as input
  • raises an error if any file fails to load
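The relationship between a tuple-returning function and its raising counterpart is commonly expressed as a small unwrapping step. The following is a minimal sketch of that convention, assuming the error value is an exception struct as the spec of the non-raising function states; BangSketch and unwrap!/1 are hypothetical names, not part of the library.

```elixir
defmodule BangSketch do
  # Success: hand back the bare list of datasets.
  def unwrap!({:ok, datasets}), do: datasets

  # Failure carrying an exception struct: re-raise it unchanged.
  def unwrap!({:error, exception}) when is_exception(exception) do
    raise exception
  end

  # Assumption: any other reason is wrapped in a RuntimeError.
  def unwrap!({:error, reason}) do
    raise "loading failed: #{inspect(reason)}"
  end
end

BangSketch.unwrap!({:ok, [:dataset_a, :dataset_b]})
```

Callers who prefer to handle failures without rescue can use the non-raising variant and pattern match on its result instead.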

Examples

iex> alias ElixirDatasets.Utils.Loader

# Sequential loading
iex> paths = [{"data1.csv", "csv"}, {"data2.parquet", "parquet"}]
iex> datasets = Loader.load_datasets_from_paths!(paths)

# Parallel loading with 4 processes
iex> datasets = Loader.load_datasets_from_paths!(paths, 4)