Nasty.Statistics.Neural.DataLoader (Nasty v0.3.0)
Data loading utilities for neural models.
Converts various corpus formats (CoNLL-U, raw text) into batches suitable for neural network training.
Features
- Load Universal Dependencies CoNLL-U format
- Convert to neural-friendly tensors
- Automatic batching and padding
- Vocabulary building from corpus
- Train/validation/test splits
- Streaming for large datasets
Example
# Load CoNLL-U corpus
{:ok, data} = DataLoader.load_conllu("en_ewt-ud-train.conllu")
# Split into train/valid/test
{train, valid, test} = DataLoader.split(data, [0.8, 0.1, 0.1])
# Create batches for training
train_batches = DataLoader.create_batches(train, batch_size: 32)
valid_batches = DataLoader.create_batches(valid, batch_size: 32)
# Use in training
Trainer.train(model, train_batches, valid_batches, opts)
Summary
Functions
Analyzes corpus statistics.
Alias for analyze/1.
Builds vocabulary from a list of sentences.
Wrapper for create_batches/4 with simple signature for raw data batching.
Creates batches from sentences for neural network training.
Extracts the tag vocabulary from sentences.
Extracts the word vocabulary from sentences.
Loads a CoNLL-U corpus file.
Alias for load_conllu/2 that reads from a file path.
Splits data into train/validation/test sets.
Wrapper for split/2 with a default validation split.
Wrapper for split/2 that returns train/valid/test sets.
Wrapper for stream_batches/4 with simpler signature for streaming raw data.
Streams batches from a large corpus file.
Types
Functions
Analyzes corpus statistics.
Parameters
- sentences - List of sentences
Returns
Map with corpus statistics:
- :num_sentences - Total sentences
- :num_tokens - Total tokens
- :avg_length - Average sentence length
- :max_length - Maximum sentence length
- :min_length - Minimum sentence length
- :vocab_size - Unique word count
- :tag_counts - Frequency of each tag
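For instance, a usage sketch (the field accesses assume the map keys listed above; the corpus variable comes from a prior load_conllu/2 call):
# Compute and inspect corpus statistics
stats = DataLoader.analyze(sentences)
stats.num_sentences
stats.avg_length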
Alias for analyze/1.
Builds vocabulary from a list of sentences.
Parameters
- sentences - List of {words, tags} tuples
- opts - Vocabulary options (passed to Embeddings.build_vocabulary)
Returns
{:ok, vocab, tag_vocab} - Word and tag vocabularies
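A sketch of typical usage (the file name is illustrative; options are forwarded to Embeddings.build_vocabulary):
# Load a corpus, then build word and tag vocabularies from it
{:ok, sentences} = DataLoader.load_conllu("train.conllu")
{:ok, vocab, tag_vocab} = DataLoader.build_vocabulary(sentences)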
Wrapper for create_batches/4 with simple signature for raw data batching.
When called with just data and options (no vocab), returns simple chunked batches. When called with vocab and tag_vocab, delegates to the full implementation.
Examples
# Simple batching (no vocab conversion)
batches = DataLoader.create_batches(data, batch_size: 32)
# Full neural batching (with vocab conversion)
batches = DataLoader.create_batches(sentences, vocab, tag_vocab, batch_size: 32)
Creates batches from sentences for neural network training.
Parameters
- sentences - List of {words, tags} tuples
- vocab - Vocabulary for word-to-ID mapping
- tag_vocab - Tag vocabulary
- opts - Batching options
Options
- :batch_size - Batch size (default: 32)
- :shuffle - Shuffle batches (default: true)
- :drop_last - Drop incomplete last batch (default: false)
- :pad_value - Padding value for sequences (default: 0)
Returns
List of batches, where each batch is {inputs, targets}.
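A sketch combining this with build_vocabulary/2 (batch contents depend on the padding and shuffle options above):
# Build vocabularies, then create padded {inputs, targets} batches
{:ok, vocab, tag_vocab} = DataLoader.build_vocabulary(train)
batches = DataLoader.create_batches(train, vocab, tag_vocab, batch_size: 16, drop_last: true)
# Pattern-match the first batch
[{inputs, targets} | _rest] = batches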
Extracts the tag vocabulary from sentences.
Extracts the word vocabulary from sentences.
Loads a CoNLL-U corpus file.
Parameters
- path - Path to CoNLL-U file
- opts - Loading options
Options
- :max_sentences - Maximum sentences to load (default: unlimited)
- :min_length - Minimum sentence length (default: 1)
- :max_length - Maximum sentence length (default: 100)
Returns
- {:ok, sentences} - List of {words, tags} tuples
- {:error, reason} - Loading failed
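A sketch handling both return shapes (the file path and option values are illustrative):
# Load at most 10,000 sentences, skipping any longer than 50 tokens
case DataLoader.load_conllu("corpus.conllu", max_sentences: 10_000, max_length: 50) do
  {:ok, sentences} -> IO.puts("Loaded #{length(sentences)} sentences")
  {:error, reason} -> IO.inspect(reason)
end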
Alias for load_conllu/2 that reads from a file path.
Splits data into train/validation/test sets.
Parameters
- data - List of sentences
- ratios - List of split ratios (must sum to 1.0)
Examples
# 80% train, 10% valid, 10% test
{train, valid, test} = DataLoader.split(data, [0.8, 0.1, 0.1])
# 90% train, 10% valid
{train, valid} = DataLoader.split(data, [0.9, 0.1])
Returns
Tuple of split datasets matching the number of ratios provided.
Wrapper for split/2 with a default validation split.
Wrapper for split/2 that returns train/valid/test sets.
Wrapper for stream_batches/4 with simpler signature for streaming raw data.
When called with just data and options (no vocab), returns simple chunked stream. When called with path/vocab, delegates to the full file-based streaming implementation.
Examples
# Simple streaming (no vocab conversion)
stream = DataLoader.stream_batches(data, batch_size: 32)
# Full neural streaming from file (with vocab conversion)
stream = DataLoader.stream_batches(path, vocab, tag_vocab, batch_size: 32)
@spec stream_batches(Path.t(), map(), map(), keyword()) :: Enumerable.t()
Streams batches from a large corpus file.
Useful for datasets that don't fit in memory.
Parameters
- path - Path to CoNLL-U file
- vocab - Vocabulary
- tag_vocab - Tag vocabulary
- opts - Streaming options
Returns
A stream of batches.
Example
DataLoader.stream_batches("large_corpus.conllu", vocab, tag_vocab, batch_size: 64)
|> Enum.take(100) # Process first 100 batches