Nasty.Statistics.Neural.DataLoader (Nasty v0.3.0)


Data loading utilities for neural models.

Converts various corpus formats (CoNLL-U, raw text) into batches suitable for neural network training.

Features

  • Load Universal Dependencies CoNLL-U format
  • Convert to neural-friendly tensors
  • Automatic batching and padding
  • Vocabulary building from corpus
  • Train/validation/test splits
  • Streaming for large datasets

Example

# Load CoNLL-U corpus
{:ok, data} = DataLoader.load_conllu("en_ewt-ud-train.conllu")

# Split into train/valid/test
{train, valid, test} = DataLoader.split(data, [0.8, 0.1, 0.1])

# Create batches for training and validation
train_batches = DataLoader.create_batches(train, batch_size: 32)
valid_batches = DataLoader.create_batches(valid, batch_size: 32)

# Use in training
Trainer.train(model, train_batches, valid_batches, opts)

Summary

Functions

Analyzes corpus statistics.

Alias for analyze/1.

Builds vocabulary from a list of sentences.

Wrapper for create_batches/4 with a simpler signature for raw data batching.

Creates batches from sentences for neural network training.

Extracts the tag vocabulary from sentences.

Loads a CoNLL-U corpus file.

Alias for load_conllu that reads from a file path.

Splits data into train/validation/test sets.

Wrapper for split/2 with a default validation split.

Wrapper for split/2 that returns train/valid/test sets.

Wrapper for stream_batches/4 with a simpler signature for streaming raw data.

Streams batches from a large corpus file.

Types

batch()

@type batch() :: {inputs :: map(), targets :: map()}

sentence()

@type sentence() :: {words :: [String.t()], tags :: [atom()]}

Functions

analyze(sentences)

@spec analyze([sentence()]) :: map()

Analyzes corpus statistics.

Parameters

  • sentences - List of sentences

Returns

Map with corpus statistics:

  • :num_sentences - Total sentences
  • :num_tokens - Total tokens
  • :avg_length - Average sentence length
  • :max_length - Maximum sentence length
  • :min_length - Minimum sentence length
  • :vocab_size - Unique word count
  • :tag_counts - Frequency of each tag
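Illustratively, the statistics above can be computed from a toy corpus in plain Elixir. This is a sketch of the figures analyze/1 reports, not its actual implementation:

```elixir
# Toy corpus of {words, tags} tuples, matching the sentence() type.
sentences = [
  {["the", "cat", "sat"], [:det, :noun, :verb]},
  {["dogs", "bark"], [:noun, :verb]}
]

lengths = Enum.map(sentences, fn {words, _tags} -> length(words) end)
words = Enum.flat_map(sentences, fn {words, _tags} -> words end)
tags = Enum.flat_map(sentences, fn {_words, tags} -> tags end)

stats = %{
  num_sentences: length(sentences),
  num_tokens: length(words),
  avg_length: Enum.sum(lengths) / length(lengths),
  max_length: Enum.max(lengths),
  min_length: Enum.min(lengths),
  vocab_size: words |> Enum.uniq() |> length(),
  tag_counts: Enum.frequencies(tags)
}
# stats.num_tokens => 5, stats.avg_length => 2.5
```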

analyze_corpus(sentences)

Alias for analyze/1.

build_vocabularies(sentences, opts \\ [])

@spec build_vocabularies(
  [sentence()],
  keyword()
) :: {:ok, map(), map()}

Builds vocabulary from a list of sentences.

Parameters

  • sentences - List of {words, tags} tuples
  • opts - Vocabulary options (passed to Embeddings.build_vocabulary)

Returns

  • {:ok, vocab, tag_vocab} - Word and tag vocabularies
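A minimal sketch of the kind of mapping this produces, assuming IDs start at 1 with 0 reserved for padding (the actual work is delegated to Embeddings.build_vocabulary and may differ):

```elixir
sentences = [
  {["the", "cat"], [:det, :noun]},
  {["the", "dog"], [:det, :noun]}
]

# Assign an integer ID to each unique item, starting at 1 (0 reserved for padding).
build = fn items ->
  items
  |> Enum.uniq()
  |> Enum.with_index(1)
  |> Map.new()
end

vocab = build.(Enum.flat_map(sentences, fn {words, _tags} -> words end))
tag_vocab = build.(Enum.flat_map(sentences, fn {_words, tags} -> tags end))
# vocab     => %{"the" => 1, "cat" => 2, "dog" => 3}
# tag_vocab => %{det: 1, noun: 2}
```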

create_batches(data, batch_opts)

Wrapper for create_batches/4 with a simpler signature for raw data batching.

When called with just data and options (no vocab), returns simple chunked batches. When called with vocab and tag_vocab, delegates to the full implementation.

Examples

# Simple batching (no vocab conversion)
batches = DataLoader.create_batches(data, batch_size: 32)

# Full neural batching (with vocab conversion)
batches = DataLoader.create_batches(sentences, vocab, tag_vocab, batch_size: 32)

create_batches(sentences, vocab, tag_vocab, opts \\ [])

@spec create_batches([sentence()], map(), map(), keyword()) :: [batch()]

Creates batches from sentences for neural network training.

Parameters

  • sentences - List of {words, tags} tuples
  • vocab - Vocabulary for word-to-ID mapping
  • tag_vocab - Tag vocabulary
  • opts - Batching options

Options

  • :batch_size - Batch size (default: 32)
  • :shuffle - Shuffle batches (default: true)
  • :drop_last - Drop incomplete last batch (default: false)
  • :pad_value - Padding value for sequences (default: 0)

Returns

List of batches, where each batch is {inputs, targets}.
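To make the padding behavior concrete, here is a hypothetical sketch of the conversion a batch goes through. The real function returns {inputs, targets} maps per the batch() type; this only illustrates ID mapping plus right-padding with :pad_value:

```elixir
sentences = [
  {["the", "cat"], [:det, :noun]},
  {["dogs", "bark", "loudly"], [:noun, :verb, :adv]}
]
vocab = %{"the" => 1, "cat" => 2, "dogs" => 3, "bark" => 4, "loudly" => 5}
tag_vocab = %{det: 1, noun: 2, verb: 3, adv: 4}
pad_value = 0

# Map words and tags to integer IDs.
ids =
  Enum.map(sentences, fn {words, tags} ->
    {Enum.map(words, &Map.fetch!(vocab, &1)),
     Enum.map(tags, &Map.fetch!(tag_vocab, &1))}
  end)

# Right-pad every sequence to the longest length in the batch.
max_len = ids |> Enum.map(fn {word_ids, _} -> length(word_ids) end) |> Enum.max()
pad = fn seq -> seq ++ List.duplicate(pad_value, max_len - length(seq)) end

inputs = Enum.map(ids, fn {word_ids, _} -> pad.(word_ids) end)
targets = Enum.map(ids, fn {_, tag_ids} -> pad.(tag_ids) end)
# inputs  => [[1, 2, 0], [3, 4, 5]]
# targets => [[1, 2, 0], [2, 3, 4]]
```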

extract_tag_vocab(sentences)

Extracts the tag vocabulary from sentences.

extract_vocabulary(sentences, opts \\ [])

Extracts the word vocabulary.

load_conllu(path_or_content, opts \\ [])

@spec load_conllu(
  Path.t() | String.t(),
  keyword()
) :: {:ok, [sentence()]} | {:error, term()}

Loads a CoNLL-U corpus file.

Parameters

  • path_or_content - Path to a CoNLL-U file, or CoNLL-U content as a string
  • opts - Loading options

Options

  • :max_sentences - Maximum sentences to load (default: unlimited)
  • :min_length - Minimum sentence length (default: 1)
  • :max_length - Maximum sentence length (default: 100)

Returns

  • {:ok, sentences} - List of {words, tags} tuples
  • {:error, reason} - Loading failed
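For illustration, extracting {words, tags} from the FORM and UPOS columns of CoNLL-U content might look like the sketch below. The real loader also handles comment lines, multiword tokens, and the remaining columns:

```elixir
# Tab-separated CoNLL-U columns: ID, FORM, LEMMA, UPOS, ... (rest elided as "_").
content = "1\tThe\tthe\tDET\t_\n2\tcat\tcat\tNOUN\t_\n3\tsleeps\tsleep\tVERB\t_"

{words, tags} =
  content
  |> String.split("\n", trim: true)
  |> Enum.map(fn line ->
    [_id, form, _lemma, upos | _rest] = String.split(line, "\t")
    {form, upos |> String.downcase() |> String.to_atom()}
  end)
  |> Enum.unzip()
# words => ["The", "cat", "sleeps"]
# tags  => [:det, :noun, :verb]
```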

load_conllu_file(path, opts \\ [])

Alias for load_conllu that reads from a file path.

split(data, ratios)

@spec split([sentence()], [float()]) :: tuple()

Splits data into train/validation/test sets.

Parameters

  • data - List of sentences
  • ratios - List of split ratios (must sum to 1.0)

Examples

# 80% train, 10% valid, 10% test
{train, valid, test} = DataLoader.split(data, [0.8, 0.1, 0.1])

# 90% train, 10% valid
{train, valid} = DataLoader.split(data, [0.9, 0.1])

Returns

Tuple of split datasets matching the number of ratios provided.
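The ratio logic can be sketched as follows (illustrative only; the actual implementation may shuffle the data before splitting):

```elixir
data = Enum.to_list(1..10)
ratios = [0.8, 0.1, 0.1]
total = length(data)

# Walk the ratios, carving each slice off the front of what remains.
{splits, _rest} =
  Enum.map_reduce(ratios, data, fn ratio, remaining ->
    Enum.split(remaining, round(ratio * total))
  end)

[train, valid, test] = splits
# train => [1, 2, 3, 4, 5, 6, 7, 8], valid => [9], test => [10]
```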

split_data(data, opts \\ [])

Wrapper for split/2 with a default validation split.

split_train_valid_test(data, opts \\ [])

Wrapper for split/2 that returns train/valid/test sets.

stream_batches(data, batch_opts)

Wrapper for stream_batches/4 with a simpler signature for streaming raw data.

When called with just data and options (no vocab), returns simple chunked stream. When called with path/vocab, delegates to the full file-based streaming implementation.

Examples

# Simple streaming (no vocab conversion)
stream = DataLoader.stream_batches(data, batch_size: 32)

# Full neural streaming from file (with vocab conversion)
stream = DataLoader.stream_batches(path, vocab, tag_vocab, batch_size: 32)
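The no-vocab path behaves like lazy chunking; a rough equivalent using Stream.chunk_every (a sketch, not the actual implementation):

```elixir
data = Enum.to_list(1..100)

# Chunks are produced lazily; nothing is materialized until the stream is consumed.
stream = Stream.chunk_every(data, 32)
batch_count = Enum.count(stream)
# => 4 batches (three of 32 items, one final batch of 4)
```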

stream_batches(path, vocab, tag_vocab, opts \\ [])

@spec stream_batches(Path.t(), map(), map(), keyword()) :: Enumerable.t()

Streams batches from a large corpus file.

Useful for datasets that don't fit in memory.

Parameters

  • path - Path to CoNLL-U file
  • vocab - Vocabulary
  • tag_vocab - Tag vocabulary
  • opts - Streaming options

Returns

A stream of batches.

Example

DataLoader.stream_batches("large_corpus.conllu", vocab, tag_vocab, batch_size: 64)
|> Enum.take(100)  # Process first 100 batches