Nasty.Data.Corpus (Nasty v0.3.0)
Corpus loading and management with caching.
Handles loading training data from various formats (CoNLL-U, raw text) and provides utilities for train/validation/test splitting.
Examples
alias Nasty.Data.Corpus
# Load a UD corpus
{:ok, corpus} = Corpus.load_ud("data/en_ewt-ud-train.conllu")
# Split into train/dev/test
{train, dev, test} = Corpus.split(corpus, ratios: [0.8, 0.1, 0.1])
# Extract POS tagging training data
pos_data = Corpus.extract_pos_sequences(train)
Types
@type corpus() :: %{sentences: [Nasty.Data.CoNLLU.sentence()], metadata: map()}
Functions
Extract dependency relations from corpus.
Returns list of sentences with dependency information.
Extract POS tagging sequences from corpus.
Returns list of {words, tags} tuples suitable for POS tagger training.
Examples
pos_data = Corpus.extract_pos_sequences(corpus)
# => [{["The", "cat", "sat"], [:det, :noun, :verb]}, ...]
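To make the output shape concrete, here is a minimal sketch of how such an extraction could work. It assumes each sentence is a map with a `:tokens` list whose entries carry `:form` and `:upos` keys; that shape and the `PosSketch` module are hypothetical, and the real `Nasty.Data.CoNLLU.sentence()` type may differ.

```elixir
defmodule PosSketch do
  # Hypothetical sentence shape: %{tokens: [%{form: String.t(), upos: atom()}]}.
  # This is an assumption for illustration, not the library's actual type.
  def extract_pos_sequences(%{sentences: sentences}) do
    Enum.map(sentences, fn sentence ->
      # Pair each word with its tag, then unzip into {words, tags}.
      sentence.tokens
      |> Enum.map(fn token -> {token.form, token.upos} end)
      |> Enum.unzip()
    end)
  end
end

corpus = %{
  sentences: [
    %{
      tokens: [
        %{form: "The", upos: :det},
        %{form: "cat", upos: :noun},
        %{form: "sat", upos: :verb}
      ]
    }
  ],
  metadata: %{}
}

pos_data = PosSketch.extract_pos_sequences(corpus)
```

Keeping words and tags in two parallel lists means the tag at index i always labels the word at index i, which is the layout most sequence taggers expect.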
Load a Universal Dependencies corpus from a CoNLL-U file.
Parameters
- path - Path to .conllu file
- opts - Options
  - :cache - Enable caching (default: true)
  - :language - Language code (default: :en)
Returns
- {:ok, corpus} - Loaded corpus
- {:error, reason} - Load failed
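Because the loader returns a tagged tuple, callers will usually pattern match on both outcomes. The sketch below shows one way to do that. `LoadSketch` is a hypothetical helper, and `:enoent` is only an assumed example of a failure reason, since the documentation does not list the possible values of `reason`.

```elixir
defmodule LoadSketch do
  # Turns the documented result tuple into a short status message.
  def describe({:ok, %{sentences: sentences}}) do
    "loaded #{length(sentences)} sentences"
  end

  def describe({:error, reason}) do
    # inspect/1 is used because the shape of `reason` is not documented.
    "load failed: #{inspect(reason)}"
  end
end

ok_message = LoadSketch.describe({:ok, %{sentences: [], metadata: %{}}})
error_message = LoadSketch.describe({:error, :enoent})
```

In an application, the tuple would come from a call such as `Corpus.load_ud(path, cache: false)` rather than being built by hand.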
Split corpus into train/validation/test sets.
Parameters
- corpus - The corpus to split
- opts - Options
  - :ratios - Split ratios [train, val, test] (default: [0.8, 0.1, 0.1])
  - :shuffle - Shuffle before splitting (default: true)
  - :seed - Random seed for shuffling (default: :rand.uniform(10000))
Returns
- {train_corpus, val_corpus, test_corpus} - Three corpora
Examples
{train, dev, test} = Corpus.split(corpus, ratios: [0.8, 0.1, 0.1])
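The options above can be illustrated with a small stand-alone sketch of a seeded three-way split. It works on a plain list rather than a corpus map, and the `SplitSketch` module, the rounding rule, and the way the seed is fed to `:rand.seed/2` are all assumptions made for illustration. The library's own implementation may differ.

```elixir
defmodule SplitSketch do
  # Splits `items` into {train, val, test} according to `:ratios`.
  def split(items, opts \\ []) do
    [train_ratio, val_ratio, _test_ratio] = Keyword.get(opts, :ratios, [0.8, 0.1, 0.1])
    seed = Keyword.get_lazy(opts, :seed, fn -> :rand.uniform(10000) end)

    items =
      if Keyword.get(opts, :shuffle, true) do
        # Seeding the process-local generator makes the shuffle reproducible.
        :rand.seed(:exsss, {seed, seed, seed})
        Enum.shuffle(items)
      else
        items
      end

    total = length(items)
    n_train = round(total * train_ratio)
    n_val = round(total * val_ratio)

    # Whatever is left after train and val becomes the test set.
    {train, rest} = Enum.split(items, n_train)
    {val, test} = Enum.split(rest, n_val)
    {train, val, test}
  end
end

unshuffled = SplitSketch.split(Enum.to_list(1..10), shuffle: false)
first_run = SplitSketch.split(Enum.to_list(1..10), seed: 7)
second_run = SplitSketch.split(Enum.to_list(1..10), seed: 7)
```

Passing an explicit `:seed` is what makes a split repeatable across runs. With the documented default, the seed is itself random, so each call may produce a different partition.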
Get corpus statistics.
Returns
- Map with corpus statistics:
  - :num_sentences - Number of sentences
  - :num_tokens - Total tokens
  - :num_types - Unique word types
  - :pos_distribution - POS tag counts
  - :avg_sentence_length - Average sentence length
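As a rough illustration of what these keys mean, the sketch below computes them from a list of {words, tags} tuples like the one returned by the POS extraction. `StatsSketch` is a hypothetical module, and treating word types as case-sensitive is an assumption. The library may normalise tokens differently.

```elixir
defmodule StatsSketch do
  # Computes the documented statistics from a list of {words, tags} tuples.
  def statistics(pos_data) do
    words = Enum.flat_map(pos_data, fn {words, _tags} -> words end)
    tags = Enum.flat_map(pos_data, fn {_words, tags} -> tags end)
    num_sentences = length(pos_data)
    num_tokens = length(words)

    # Guard against division by zero for an empty corpus.
    avg_length = if num_sentences == 0, do: 0.0, else: num_tokens / num_sentences

    %{
      num_sentences: num_sentences,
      num_tokens: num_tokens,
      num_types: words |> Enum.uniq() |> length(),
      pos_distribution: Enum.frequencies(tags),
      avg_sentence_length: avg_length
    }
  end
end

stats =
  StatsSketch.statistics([
    {["The", "cat", "sat"], [:det, :noun, :verb]},
    {["The", "dog"], [:det, :noun]}
  ])
```

For this two-sentence input there are 5 tokens but only 4 types, because "The" occurs twice.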