Nasty.Data.Corpus (Nasty v0.3.0)

Corpus loading and management with caching.

Handles loading training data from various formats (CoNLL-U, raw text) and provides utilities for train/validation/test splitting.

Examples

# Alias the module so it can be called as Corpus
alias Nasty.Data.Corpus

# Load a UD corpus
{:ok, corpus} = Corpus.load_ud("data/en_ewt-ud-train.conllu")

# Split into train/dev/test
{train, dev, test} = Corpus.split(corpus, ratios: [0.8, 0.1, 0.1])

# Extract POS tagging training data
pos_data = Corpus.extract_pos_sequences(train)

Summary

Functions

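extract_dependencies(corpus)

Extract dependency relations from a corpus.

extract_pos_sequences(corpus)

Extract POS tagging sequences from a corpus.

load_ud(path, opts \\ [])

Load a Universal Dependencies corpus from a CoNLL-U file.

split(corpus, opts \\ [])

Split a corpus into train/validation/test sets.

statistics(corpus)

Get corpus statistics.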

Types

corpus()

@type corpus() :: %{sentences: [Nasty.Data.CoNLLU.sentence()], metadata: map()}

Functions

extract_dependencies(corpus)

@spec extract_dependencies(corpus()) :: [map()]

Extract dependency relations from a corpus.

Returns a list of maps, one per sentence, each carrying that sentence's dependency information.

extract_pos_sequences(corpus)

@spec extract_pos_sequences(corpus()) :: [{[String.t()], [atom()]}]

Extract POS tagging sequences from a corpus.

Returns a list of {words, tags} tuples, one per sentence, suitable for POS tagger training.

Examples

pos_data = Corpus.extract_pos_sequences(corpus)
# => [{["The", "cat", "sat"], [:det, :noun, :verb]}, ...]
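As a rough illustration of how these {words, tags} tuples might feed a tagger, the sketch below counts how often each word occurs with each tag. The module name EmissionCounts and the sample data are invented for the example; nothing here is part of Nasty's API.

```elixir
defmodule EmissionCounts do
  # Hypothetical helper, not part of Nasty: counts {word, tag}
  # co-occurrences across a list of {words, tags} tuples.
  def count(pos_data) do
    pos_data
    |> Enum.flat_map(fn {words, tags} -> Enum.zip(words, tags) end)
    |> Enum.frequencies()
  end
end

# Sample data in the shape returned by extract_pos_sequences/1
# (invented for illustration).
pos_data = [
  {["The", "cat", "sat"], [:det, :noun, :verb]},
  {["The", "dog", "sat"], [:det, :noun, :verb]}
]

counts = EmissionCounts.count(pos_data)
IO.inspect(counts[{"The", :det}])
```

Counts of this kind are a common starting point for frequency-based taggers; whether Nasty's own taggers use them is not stated here.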

load_ud(path, opts \\ [])

@spec load_ud(
  Path.t(),
  keyword()
) :: {:ok, corpus()} | {:error, term()}

Load a Universal Dependencies corpus from a CoNLL-U file.

Parameters

  • path - Path to .conllu file
  • opts - Options
    • :cache - Enable caching (default: true)
    • :language - Language code (default: :en)

Returns

  • {:ok, corpus} - Loaded corpus
  • {:error, reason} - Load failed
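Because load_ud/2 returns a tagged tuple, callers will usually pattern-match on both shapes. The sketch below shows one way to do that. LoadResult is a hypothetical helper, and the :enoent reason is only an assumed example of what reason might contain; the stand-in values let the snippet run without Nasty installed.

```elixir
defmodule LoadResult do
  # Hypothetical helper, not part of Nasty: turns a load_ud/2 result
  # into a short human-readable message.
  def describe({:ok, %{sentences: sentences}}) do
    "loaded #{length(sentences)} sentences"
  end

  def describe({:error, reason}) do
    "load failed: #{inspect(reason)}"
  end
end

# With Nasty available, the result of
#   Nasty.Data.Corpus.load_ud("data/en_ewt-ud-train.conllu", cache: false)
# could be piped straight into LoadResult.describe/1.
# Stand-in values with the documented shapes are used here instead.
ok_message = LoadResult.describe({:ok, %{sentences: [], metadata: %{}}})
error_message = LoadResult.describe({:error, :enoent})

IO.puts(ok_message)
IO.puts(error_message)
```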

split(corpus, opts \\ [])

@spec split(
  corpus(),
  keyword()
) :: {corpus(), corpus(), corpus()}

Split corpus into train/validation/test sets.

Parameters

  • corpus - The corpus to split
  • opts - Options
    • :ratios - Split ratios [train, val, test] (default: [0.8, 0.1, 0.1])
    • :shuffle - Shuffle before splitting (default: true)
    • :seed - Random seed for shuffling (default: :rand.uniform(10000), so pass a fixed value if the split needs to be reproducible)

Returns

  • {train_corpus, val_corpus, test_corpus} - Three corpora

Examples

{train, dev, test} = Corpus.split(corpus, ratios: [0.8, 0.1, 0.1])
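To make the interaction of :ratios, :shuffle, and :seed concrete, here is a minimal sketch of a ratio-based split with a seeded shuffle over a plain list. SplitSketch is an illustrative stand-in; Nasty's actual implementation may differ, for example in how it rounds split sizes or seeds the generator.

```elixir
defmodule SplitSketch do
  # Illustrative only: one way a ratio-based split with a seeded
  # shuffle could work. Not Nasty's implementation.
  def split(items, [train_ratio, val_ratio, _test_ratio], seed) do
    # Seed the process-local random generator so the shuffle is repeatable.
    :rand.seed(:exsss, {seed, seed, seed})
    shuffled = Enum.shuffle(items)

    total = length(shuffled)
    train_size = floor(total * train_ratio)
    val_size = floor(total * val_ratio)

    # Whatever is left after train and validation becomes the test set.
    {train, rest} = Enum.split(shuffled, train_size)
    {val, test} = Enum.split(rest, val_size)
    {train, val, test}
  end
end

items = Enum.to_list(1..20)
{train_set, val_set, test_set} = SplitSketch.split(items, [0.8, 0.1, 0.1], 42)
IO.inspect({length(train_set), length(val_set), length(test_set)})
```

Giving the remainder to the test set means no item is dropped when the ratios do not divide the corpus evenly.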

statistics(corpus)

@spec statistics(corpus()) :: map()

Get corpus statistics.

Returns

  • Map with corpus statistics:
    • :num_sentences - Number of sentences
    • :num_tokens - Total tokens
    • :num_types - Unique word types
    • :pos_distribution - POS tag counts
    • :avg_sentence_length - Average sentence length
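The keys above can be illustrated with a small stand-alone computation over {words, tags} tuples. StatsSketch is a hypothetical module: Nasty's statistics/1 takes a corpus() rather than tuples, and details such as whether word types are case-normalised are assumptions here.

```elixir
defmodule StatsSketch do
  # Illustrative only: computes the documented statistics from
  # {words, tags} tuples. Not Nasty's implementation.
  def statistics(pos_data) do
    words = Enum.flat_map(pos_data, fn {words, _tags} -> words end)
    tags = Enum.flat_map(pos_data, fn {_words, tags} -> tags end)
    num_sentences = length(pos_data)
    num_tokens = length(words)

    # Avoid dividing by zero for an empty corpus.
    avg_length =
      if num_sentences == 0, do: 0.0, else: num_tokens / num_sentences

    %{
      num_sentences: num_sentences,
      num_tokens: num_tokens,
      # Types are counted case-sensitively in this sketch.
      num_types: words |> Enum.uniq() |> length(),
      pos_distribution: Enum.frequencies(tags),
      avg_sentence_length: avg_length
    }
  end
end

stats =
  StatsSketch.statistics([
    {["The", "cat", "sat"], [:det, :noun, :verb]},
    {["The", "dog", "barked", "loudly"], [:det, :noun, :verb, :adv]}
  ])

IO.inspect(stats)
```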