Nasty.Data.CoNLLU (Nasty v0.3.0)


Parser for the CoNLL-U format used by Universal Dependencies.

CoNLL-U Format

CoNLL-U is a tab-separated format with 10 columns:

  1. ID - Word index
  2. FORM - Word form
  3. LEMMA - Lemma
  4. UPOS - Universal POS tag
  5. XPOS - Language-specific POS tag
  6. FEATS - Morphological features
  7. HEAD - ID of the token's syntactic head, or 0 for the root
  8. DEPREL - Dependency relation
  9. DEPS - Enhanced dependencies
  10. MISC - Miscellaneous annotations

Lines starting with # are comments (sentence-level metadata). Blank lines separate sentences.
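
As a rough illustration of the layout described above, a line can be classified as blank, comment, or token, and a token line can be split into its 10 named columns. This is only a minimal sketch and not Nasty's actual parser; the module `ColumnSketch` and both of its functions are hypothetical names introduced here.

```elixir
defmodule ColumnSketch do
  # Hypothetical helper for illustration; not part of Nasty.
  @columns ~w(id form lemma upos xpos feats head deprel deps misc)a

  # Classify a raw line according to the CoNLL-U rules described above.
  def classify(line) do
    cond do
      String.trim(line) == "" -> :blank
      String.starts_with?(line, "#") -> :comment
      true -> :token
    end
  end

  # Split a token line on tabs and pair each field with its column name.
  # Values are left as raw strings; no type conversion is attempted.
  def split_columns(line) do
    fields = String.split(line, "\t")

    if length(fields) == length(@columns) do
      {:ok, Map.new(Enum.zip(@columns, fields))}
    else
      {:error, {:expected_10_columns, length(fields)}}
    end
  end
end

{:ok, row} =
  ColumnSketch.split_columns("1\tThe\tthe\tDET\tDT\tDefinite=Def|PronType=Art\t2\tdet\t_\t_")
```

Rejecting lines that do not have exactly 10 fields keeps malformed input from silently shifting columns.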

Examples

# Parse a file
alias Nasty.Data.CoNLLU

{:ok, sentences} = CoNLLU.parse_file("en_ewt-ud-train.conllu")

# Parse a string
conllu_text = """
# sent_id = 1
# text = The cat sat.
1\tThe\tthe\tDET\t...
2\tcat\tcat\tNOUN\t...
3\tsat\tsit\tVERB\t...
"""
{:ok, sentences} = CoNLLU.parse_string(conllu_text)

Summary

Functions

format(sentences)

Convert parsed sentences back to CoNLL-U format.

parse_file(path)

Parse a CoNLL-U file.

parse_string(content)

Parse a CoNLL-U formatted string.

Types

sentence()

@type sentence() :: %{
  id: String.t() | nil,
  text: String.t() | nil,
  tokens: [token()],
  metadata: map()
}
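
To make the shape of `sentence()` more concrete, the sketch below shows one way `# key = value` comment lines could be collected into the `id`, `text`, and `metadata` fields. It assumes that `sent_id` maps to `:id` and `text` maps to `:text`, which the example input suggests but this page does not confirm; `CommentSketch` is a hypothetical module, not part of Nasty.

```elixir
defmodule CommentSketch do
  # Hypothetical helper for illustration; not part of Nasty.

  # Turn "# key = value" into {"key", "value"}; a comment without "=" gets a nil value.
  def parse_comment("#" <> rest) do
    case String.split(rest, "=", parts: 2) do
      [key, value] -> {String.trim(key), String.trim(value)}
      [flag] -> {String.trim(flag), nil}
    end
  end

  # Collect all comment lines of one sentence block into a partial sentence map.
  # Assumption: sent_id and text are also kept in the metadata map.
  def collect(lines) do
    metadata =
      lines
      |> Enum.filter(&String.starts_with?(&1, "#"))
      |> Map.new(&parse_comment/1)

    %{id: metadata["sent_id"], text: metadata["text"], metadata: metadata}
  end
end

partial = CommentSketch.collect(["# sent_id = 1", "# text = The cat sat."])
```

Splitting with `parts: 2` keeps any further `=` characters inside the value, which matters for `text` lines that contain an equals sign.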

token()

@type token() :: %{
  id: pos_integer(),
  form: String.t(),
  lemma: String.t(),
  upos: atom(),
  xpos: String.t() | nil,
  feats: map(),
  head: non_neg_integer(),
  deprel: String.t(),
  deps: String.t() | nil,
  misc: map()
}
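
The following sketch builds a sentence by hand to match the two types above and resolves each token's `head` to the form of its governing token. The field values are illustrative and written by hand, not parser output; in particular, the atom spelling used for `upos` (`:det`, `:noun`, `:verb`) is an assumption. `EdgeSketch` is a hypothetical module.

```elixir
defmodule EdgeSketch do
  # Hypothetical helper for illustration; not part of Nasty.

  # Returns {dependent_form, deprel, head_form} triples.
  # A head of 0 marks the root and is rendered as "ROOT".
  def edges(%{tokens: tokens}) do
    forms_by_id = Map.new(tokens, &{&1.id, &1.form})

    Enum.map(tokens, fn token ->
      head_form =
        if token.head == 0, do: "ROOT", else: Map.fetch!(forms_by_id, token.head)

      {token.form, token.deprel, head_form}
    end)
  end
end

# Hand-written sample data shaped like sentence() and token().
sentence = %{
  id: "1",
  text: "The cat sat.",
  metadata: %{},
  tokens: [
    %{id: 1, form: "The", lemma: "the", upos: :det, xpos: nil, feats: %{},
      head: 2, deprel: "det", deps: nil, misc: %{}},
    %{id: 2, form: "cat", lemma: "cat", upos: :noun, xpos: nil, feats: %{},
      head: 3, deprel: "nsubj", deps: nil, misc: %{}},
    %{id: 3, form: "sat", lemma: "sit", upos: :verb, xpos: nil, feats: %{},
      head: 0, deprel: "root", deps: nil, misc: %{}}
  ]
}

edges = EdgeSketch.edges(sentence)
```

`Map.fetch!/2` raises if a `head` points at a token ID that does not exist in the sentence, which surfaces inconsistent annotations early.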

Functions

format(sentences)

@spec format([sentence()]) :: String.t()

Convert parsed sentences back to CoNLL-U format.

Parameters

  • sentences - List of sentence maps

Returns

  • CoNLL-U formatted string
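
Converting back to text means turning structured fields such as the `feats` map into their column form again. The sketch below shows one plausible encoding and its inverse, assuming string keys and values, `Key=Value` pairs joined with `|`, and `_` for an empty map. It is not taken from Nasty's source, and `FeatsSketch` is a hypothetical module.

```elixir
defmodule FeatsSketch do
  # Hypothetical helper for illustration; not part of Nasty.

  # An empty feature map is written as the CoNLL-U placeholder "_".
  def encode(feats) when map_size(feats) == 0, do: "_"

  # Otherwise emit Key=Value pairs joined by "|", sorted case-insensitively by key.
  def encode(feats) do
    feats
    |> Enum.sort_by(fn {key, _value} -> String.downcase(to_string(key)) end)
    |> Enum.map_join("|", fn {key, value} -> "#{key}=#{value}" end)
  end

  # Inverse of encode/1: "_" becomes an empty map.
  def decode("_"), do: %{}

  def decode(text) do
    text
    |> String.split("|")
    |> Map.new(fn pair ->
      [key, value] = String.split(pair, "=", parts: 2)
      {key, value}
    end)
  end
end

column = FeatsSketch.encode(%{"PronType" => "Art", "Definite" => "Def"})
```

Sorting the keys makes the output deterministic, so decoding and re-encoding a FEATS value that was already sorted yields the same string.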

parse_file(path)

@spec parse_file(Path.t()) :: {:ok, [sentence()]} | {:error, term()}

Parse a CoNLL-U file.

Parameters

  • path - Path to the .conllu file

Returns

  • {:ok, sentences} - List of parsed sentences
  • {:error, reason} - Parse error
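
Both return shapes can be handled by pattern matching on the tuple. The sketch below takes the result as an argument so that it runs without the library; `ResultSketch.summarize/1` is a hypothetical helper, and the error message wording is an assumption.

```elixir
defmodule ResultSketch do
  # Hypothetical helper for illustration; not part of Nasty.

  # Count sentences and tokens from a successful parse result.
  def summarize({:ok, sentences}) do
    token_count =
      sentences
      |> Enum.map(&length(&1.tokens))
      |> Enum.sum()

    {:ok, %{sentences: length(sentences), tokens: token_count}}
  end

  # Pass failures through with a readable description of the reason.
  def summarize({:error, reason}) do
    {:error, "could not load CoNLL-U data: #{inspect(reason)}"}
  end
end

# Hand-written stand-ins for what a parse call might return.
ok_summary = ResultSketch.summarize({:ok, [%{tokens: [%{}, %{}]}, %{tokens: [%{}]}]})
error_summary = ResultSketch.summarize({:error, :enoent})
```

In real use, the result of `Nasty.Data.CoNLLU.parse_file(path)` could be piped straight into `ResultSketch.summarize/1`.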

parse_string(content)

@spec parse_string(String.t()) :: {:ok, [sentence()]} | {:error, term()}

Parse a CoNLL-U formatted string.

Parameters

  • content - CoNLL-U formatted text

Returns

  • {:ok, sentences} - List of parsed sentences
  • {:error, reason} - Parse error