Nasty.Data.Corpus (Nasty v0.3.0)

Corpus loading and management with caching.

Handles loading training data from various formats (CoNLL-U, raw text) and provides utilities for train/validation/test splitting.

Examples

alias Nasty.Data.Corpus

# Load a UD corpus
{:ok, corpus} = Corpus.load_ud("data/en_ewt-ud-train.conllu")

# Split into train/dev/test
{train, dev, test} = Corpus.split(corpus, ratios: [0.8, 0.1, 0.1])

# Extract POS tagging training data
pos_data = Corpus.extract_pos_sequences(train)

Summary

Types

corpus()

Functions

extract_dependencies(corpus)

Extract dependency relations from corpus.

extract_pos_sequences(corpus)

Extract POS tagging sequences from corpus.

load_ud(path, opts \\ [])

Load a Universal Dependencies corpus from a CoNLL-U file.

split(corpus, opts \\ [])

Split corpus into train/validation/test sets.

statistics(corpus)

Get corpus statistics.

Types

corpus()

@type corpus() :: %{sentences: [Nasty.Data.CoNLLU.sentence()], metadata: map()}

Functions

extract_dependencies(corpus)

@spec extract_dependencies(corpus()) :: [map()]

Extract dependency relations from corpus.

Returns a list of sentences with dependency information.
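The exact shape of the returned maps is not specified here. As a rough illustration of where dependency information comes from, the sketch below parses a single CoNLL-U token line, whose ten tab-separated columns include HEAD (column 7) and DEPREL (column 8). The DepSketch module and the map keys :id, :form, :head and :deprel are hypothetical and may differ from what extract_dependencies/1 actually returns.

```elixir
defmodule DepSketch do
  # Hypothetical helper: parse one CoNLL-U token line into a small map.
  # Columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC.
  # Assumes a plain token line; multiword tokens (IDs like "1-2") and
  # empty nodes (IDs like "1.1") are not handled by this sketch.
  def parse_token(line) do
    [id, form, _lemma, _upos, _xpos, _feats, head, deprel, _deps, _misc] =
      String.split(line, "\t")

    %{
      id: String.to_integer(id),
      form: form,
      head: String.to_integer(head),
      deprel: deprel
    }
  end
end

line = "2\tcat\tcat\tNOUN\tNN\tNumber=Sing\t3\tnsubj\t_\t_"
IO.inspect(DepSketch.parse_token(line))
```

Here the token "cat" (ID 2) attaches to the token with ID 3 through the nsubj relation; a HEAD of 0 would mark the sentence root.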

extract_pos_sequences(corpus)

@spec extract_pos_sequences(corpus()) :: [{[String.t()], [atom()]}]

Extract POS tagging sequences from corpus.

Returns a list of {words, tags} tuples suitable for POS tagger training.

Examples

pos_data = Corpus.extract_pos_sequences(corpus)
# => [{["The", "cat", "sat"], [:det, :noun, :verb]}, ...]
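One way the {words, tags} tuples might be consumed is to count how often each word occurs with each tag, a common first step for a frequency-based tagger. This is a minimal sketch over hand-written data in the documented shape; the TagCountSketch module, emission_counts/1 and the choice to lowercase words are assumptions for illustration, not part of Nasty.

```elixir
defmodule TagCountSketch do
  # Hypothetical helper: count {lowercased word, tag} co-occurrences
  # across a list of {words, tags} tuples.
  def emission_counts(sequences) do
    for {words, tags} <- sequences,
        {word, tag} <- Enum.zip(words, tags),
        reduce: %{} do
      acc -> Map.update(acc, {String.downcase(word), tag}, 1, &(&1 + 1))
    end
  end
end

# Sample data in the shape returned by extract_pos_sequences/1.
pos_data = [
  {["The", "cat", "sat"], [:det, :noun, :verb]},
  {["the", "dog"], [:det, :noun]}
]

IO.inspect(TagCountSketch.emission_counts(pos_data))
```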

load_ud(path, opts \\ [])

@spec load_ud(
  Path.t(),
  keyword()
) :: {:ok, corpus()} | {:error, term()}

Load a Universal Dependencies corpus from a CoNLL-U file.

Parameters

path - Path to .conllu file
opts - Options
- :cache - Enable caching (default: true)
- :language - Language code (default: :en)

Returns

{:ok, corpus} - Loaded corpus
{:error, reason} - Load failed
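Because load_ud/2 returns a tagged tuple, callers will usually branch on the result instead of asserting {:ok, corpus}. The sketch below relies only on the return shapes documented above; the LoadResultSketch module and describe/1 are hypothetical helpers, and :enoent is just an illustrative reason, since the actual error terms are not listed here.

```elixir
defmodule LoadResultSketch do
  # Hypothetical helper: turn a load_ud/2-style result into a log message.
  def describe({:ok, %{sentences: sentences}}) do
    "loaded #{length(sentences)} sentences"
  end

  def describe({:error, reason}) do
    "load failed: #{inspect(reason)}"
  end
end

# Stand-in results with the documented shapes (no file is read here).
IO.puts(LoadResultSketch.describe({:ok, %{sentences: [], metadata: %{}}}))
IO.puts(LoadResultSketch.describe({:error, :enoent}))
```

In real code the argument would be the value returned by a call such as Corpus.load_ud(path, cache: false).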

split(corpus, opts \\ [])

@spec split(
  corpus(),
  keyword()
) :: {corpus(), corpus(), corpus()}

Split corpus into train/validation/test sets.

Parameters

corpus - The corpus to split
opts - Options
- :ratios - Split ratios [train, val, test] (default: [0.8, 0.1, 0.1])
- :shuffle - Shuffle before splitting (default: true)
- :seed - Random seed for shuffling (default: :rand.uniform(10000), a random integer from 1 to 10000; pass an explicit seed for reproducible splits)

Returns

{train_corpus, val_corpus, test_corpus} - Three corpora

Examples

{train, dev, test} = Corpus.split(corpus, ratios: [0.8, 0.1, 0.1])
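To make the interaction of :ratios, :shuffle and :seed concrete, here is a minimal sketch of a seeded shuffle followed by a ratio-based cut. It is not taken from the Nasty source: the SplitSketch module, the use of the :exsss algorithm, and rounding partition sizes down with floor/1 are all assumptions.

```elixir
defmodule SplitSketch do
  # Hypothetical helper: split a list into {train, val, test}.
  # Seeding the process-local generator makes Enum.shuffle/1 deterministic.
  def split(items, ratios \\ [0.8, 0.1, 0.1], seed \\ 42) do
    [train_ratio, val_ratio, _test_ratio] = ratios

    :rand.seed(:exsss, {seed, seed, seed})
    shuffled = Enum.shuffle(items)

    n = length(shuffled)
    n_train = floor(n * train_ratio)
    n_val = floor(n * val_ratio)

    {train, rest} = Enum.split(shuffled, n_train)
    # Whatever remains after the validation slice becomes the test set.
    {val, test} = Enum.split(rest, n_val)
    {train, val, test}
  end
end

{train, val, test} = SplitSketch.split(Enum.to_list(1..100))
IO.inspect({length(train), length(val), length(test)})
```

Applied to a corpus, the same idea would operate on corpus.sentences and wrap each partition back into a map with :sentences and :metadata keys. Fixing the seed is what makes two runs yield the same partitions.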

statistics(corpus)

@spec statistics(corpus()) :: map()

Get corpus statistics.

Returns

Map with corpus statistics:
- :num_sentences - Number of sentences
- :num_tokens - Total tokens
- :num_types - Unique word types
- :pos_distribution - POS tag counts
- :avg_sentence_length - Average sentence length
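The statistics listed above can be approximated from {words, tags} tuples alone. The sketch below computes the same keys over hand-written data; StatsSketch is a hypothetical module, and details such as case-sensitive type counting and returning 0.0 for an empty corpus are assumptions that may not match statistics/1.

```elixir
defmodule StatsSketch do
  # Hypothetical helper: compute corpus-style statistics from a list of
  # {words, tags} tuples. Word types are counted case-sensitively here.
  def statistics(sequences) do
    words = Enum.flat_map(sequences, fn {ws, _tags} -> ws end)
    tags = Enum.flat_map(sequences, fn {_ws, ts} -> ts end)

    num_sentences = length(sequences)
    num_tokens = length(words)

    avg =
      if num_sentences == 0, do: 0.0, else: num_tokens / num_sentences

    %{
      num_sentences: num_sentences,
      num_tokens: num_tokens,
      num_types: words |> Enum.uniq() |> length(),
      pos_distribution: Enum.frequencies(tags),
      avg_sentence_length: avg
    }
  end
end

pos_data = [
  {["The", "cat", "sat"], [:det, :noun, :verb]},
  {["The", "dog"], [:det, :noun]}
]

IO.inspect(StatsSketch.statistics(pos_data))
```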