Nasty.Data.OntoNotes (Nasty v0.3.0)

Loader for OntoNotes 5.0 coreference data in CoNLL-2012 format.

CoNLL-2012 is a column-based format with one token per line and coreference annotations in the last column. The coreference column indicates which entity chain(s) each token belongs to.

Format

CoNLL-2012 has the following tab-separated columns:

1. Document ID
2. Part number
3. Word number
4. Word itself
5. POS tag
6. Parse bit
7. Predicate lemma
8. Predicate sense
9. Word sense
10. Speaker
11. Named entities
12. Coreference chains (e.g., "(0)", "(0|(1", or "0)")
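As an illustration of the column layout above, here is a minimal sketch that splits one row into named fields. The module name ConllRowSketch and the field atoms are hypothetical and not part of Nasty. The sketch assumes exactly the twelve columns listed and splits on any whitespace, so it accepts both tab-separated and space-aligned rows.

```elixir
defmodule ConllRowSketch do
  # Hypothetical field names, in the column order listed above.
  @columns [
    :document_id, :part_number, :word_number, :word, :pos_tag, :parse_bit,
    :predicate_lemma, :predicate_sense, :word_sense, :speaker,
    :named_entities, :coref
  ]

  # Splits a row on whitespace and pairs each field with its column name.
  def parse_row(line) do
    fields = String.split(line)

    if length(fields) == length(@columns) do
      {:ok, @columns |> Enum.zip(fields) |> Map.new()}
    else
      {:error, {:unexpected_column_count, length(fields)}}
    end
  end
end
```

Every value stays a string here; converting the part and word numbers to integers is left to the caller.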

Example

# Begin document doc1; part 000
doc1  0   0   John    NNP  ...  -  -  -  -  *  (0)
doc1  0   1   works   VBZ  ...  -  -  -  -  *  -
doc1  0   2   at      IN   ...  -  -  -  -  *  -
doc1  0   3   Google  NNP  ...  -  -  -  -  *  (1)
doc1  0   4   .       .    ...  -  -  -  -  *  -
# ...
doc1  0   10  He      PRP  ...  -  -  -  -  *  (0)
# End document
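To make the bracket notation concrete, the sketch below turns a list of coreference-column values into mention spans. It is not Nasty's parser: CorefSpanSketch and its tuple output are assumptions made for illustration. A value such as "(0)" marks a single-token mention, while "(0" on one token and "0)" on a later token delimit a multi-token mention of chain 0.

```elixir
defmodule CorefSpanSketch do
  # cells: coreference-column values, one per token, in document order.
  # Returns {chain_id, start_index, end_index} tuples with inclusive,
  # 0-based token indices, in the order the mentions close.
  def spans(cells) do
    {spans, _open} =
      cells
      |> Enum.with_index()
      |> Enum.reduce({[], %{}}, fn {cell, index}, acc ->
        cell
        |> parts()
        |> Enum.reduce(acc, &apply_part(&1, index, &2))
      end)

    Enum.reverse(spans)
  end

  defp parts("-"), do: []
  defp parts(cell), do: String.split(cell, "|", trim: true)

  defp apply_part(part, index, {spans, open}) do
    id = part |> String.trim_leading("(") |> String.trim_trailing(")") |> String.to_integer()
    opens? = String.starts_with?(part, "(")
    closes? = String.ends_with?(part, ")")

    cond do
      opens? and closes? ->
        {[{id, index, index} | spans], open}

      opens? ->
        # Mentions of one chain may nest, so keep a stack of starts per chain.
        {spans, Map.update(open, id, [index], &[index | &1])}

      closes? ->
        # Raises on a close bracket without a matching open bracket.
        [start | rest] = Map.fetch!(open, id)
        {[{id, start, index} | spans], Map.put(open, id, rest)}
    end
  end
end
```

For example, CorefSpanSketch.spans(["(2", "-", "2)"]) yields one three-token mention of chain 2.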

Usage

alias Nasty.Data.OntoNotes

# Load training data
{:ok, documents} = OntoNotes.load_documents("data/ontonotes/train")

# Extract mention pairs for training
pairs = OntoNotes.extract_mention_pairs(documents, max_distance: 3)

# Create balanced training data
training_data = OntoNotes.create_training_data(documents,
  positive_negative_ratio: 1.0,
  max_distance: 3
)

Summary

Types

coref_document()

coref_sentence()

coref_token()

mention_pair()

Functions

create_antecedent_data(documents, opts \\ [])

Create antecedent training data for end-to-end coreference.

create_span_training_data(documents, opts \\ [])

Create span-based training data for end-to-end coreference.

create_training_data(documents, opts \\ [])

Create training data from documents.

extract_mention_pairs(documents, opts \\ [])

Extract mention pairs from documents for training.

load_document(path)

Load a single OntoNotes document file.

load_documents(path)

Load OntoNotes documents from a directory.

Types

coref_document()

@type coref_document() :: %{
  id: String.t(),
  sentences: [coref_sentence()],
  chains: [Nasty.AST.Semantic.CorefChain.t()]
}

coref_sentence()

@type coref_sentence() :: %{
  tokens: [coref_token()],
  mentions: [Nasty.AST.Semantic.Mention.t()]
}

coref_token()

@type coref_token() :: %{
  id: pos_integer(),
  text: String.t(),
  pos_tag: atom(),
  coref_ids: [non_neg_integer()]
}

mention_pair()

@type mention_pair() :: %{
  mention1: Nasty.AST.Semantic.Mention.t(),
  mention2: Nasty.AST.Semantic.Mention.t(),
  label: 0 | 1,
  document_id: String.t()
}

Functions

create_antecedent_data(documents, opts \\ [])

@spec create_antecedent_data(
  [coref_document()],
  keyword()
) :: [{map(), map(), 0 | 1}]

Create antecedent training data for end-to-end coreference.

For each mention, generates (mention, antecedent, label) triples. The label is 1 if the antecedent is coreferent with the mention, and 0 otherwise.

Options

:max_antecedent_distance - Maximum distance to a candidate antecedent, counted in mentions (default: 50)
:negative_antecedent_ratio - Ratio of negative to positive antecedents (default: 1.5)

Returns

List of {mention_span, antecedent_span, label} tuples
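The triple generation described above can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: AntecedentSketch is a hypothetical module, mentions are plain maps with a hypothetical :chain_id key, and the negative subsampling controlled by :negative_antecedent_ratio is left out.

```elixir
defmodule AntecedentSketch do
  # mentions: maps with a :chain_id key, in document order (assumed shape).
  # Pairs each mention with every earlier mention that lies within
  # max_antecedent_distance mentions and labels the pair.
  def triples(mentions, max_antecedent_distance \\ 50) do
    indexed = Enum.with_index(mentions)

    for {mention, i} <- indexed,
        {antecedent, j} <- indexed,
        j < i,
        i - j <= max_antecedent_distance do
      label = if mention.chain_id == antecedent.chain_id, do: 1, else: 0
      {mention, antecedent, label}
    end
  end
end
```

Distance is measured in mentions here, matching the option description, so a limit of 1 only considers the immediately preceding mention.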

create_span_training_data(documents, opts \\ [])

@spec create_span_training_data(
  [coref_document()],
  keyword()
) :: [{map(), 0 | 1}]

Create span-based training data for end-to-end coreference.

Generates (span, label) pairs, where the label is 1 if the span is a mention and 0 otherwise. Candidate spans are produced by enumerating token spans.

Options

:max_span_width - Maximum span width in tokens (default: 10)
:negative_span_ratio - Ratio of negative to positive spans (default: 3.0)

Returns

List of {span, label} tuples
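A minimal sketch of span enumeration with labels, assuming spans are represented as {start, stop} tuples of inclusive token indices (the actual function returns maps, whose shape is not documented here). SpanSketch is a hypothetical module, and the subsampling controlled by :negative_span_ratio is not shown.

```elixir
defmodule SpanSketch do
  # Enumerates every span of at most max_span_width tokens in a sentence of
  # token_count tokens and labels it 1 if it appears in gold_spans, else 0.
  def labelled_spans(token_count, gold_spans, max_span_width \\ 10) do
    gold = MapSet.new(gold_spans)

    for start <- 0..(token_count - 1)//1,
        stop <- start..min(start + max_span_width - 1, token_count - 1)//1 do
      span = {start, stop}
      {span, if(MapSet.member?(gold, span), do: 1, else: 0)}
    end
  end
end
```

The explicit //1 step keeps the ranges empty for a zero-token sentence instead of counting downwards; it requires Elixir 1.12 or later.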

create_training_data(documents, opts \\ [])

@spec create_training_data(
  [coref_document()],
  keyword()
) :: [{Nasty.AST.Semantic.Mention.t(), Nasty.AST.Semantic.Mention.t(), 0 | 1}]

Create training data from documents.

This is a convenience function that extracts mention pairs and formats them for training a neural coreference model.

Options

:positive_negative_ratio - Ratio of positive to negative samples (default: 1.0)
:max_distance - Maximum sentence distance (default: 3)
:shuffle - Whether to shuffle the data (default: true)
:seed - Random seed for shuffling (default: :os.system_time())

Returns

List of {mention1, mention2, label} tuples ready for training
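The balancing and shuffling options might work roughly as in the sketch below. BalanceSketch is a hypothetical module, the fixed default seed of 42 is chosen only to keep the example deterministic, and the way the seed is fed to :rand is an assumption rather than the library's behaviour.

```elixir
defmodule BalanceSketch do
  # pairs: list of {mention1, mention2, label} tuples.
  def balance(pairs, opts \\ []) do
    ratio = Keyword.get(opts, :positive_negative_ratio, 1.0)
    seed = Keyword.get(opts, :seed, 42)

    # Seed the process-local generator so the shuffles are reproducible.
    :rand.seed(:exsss, {seed, seed, seed})

    {positives, negatives} =
      Enum.split_with(pairs, fn {_m1, _m2, label} -> label == 1 end)

    # ratio = positives / negatives, so keep at most positives / ratio negatives.
    keep = min(length(negatives), round(length(positives) / ratio))
    sampled = negatives |> Enum.shuffle() |> Enum.take(keep)

    Enum.shuffle(positives ++ sampled)
  end
end
```

With a ratio of 1.0 the result contains as many negatives as positives, provided enough negatives exist. The sketch does not guard against a ratio of zero.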

extract_mention_pairs(documents, opts \\ [])

@spec extract_mention_pairs(
  [coref_document()],
  keyword()
) :: [mention_pair()]

Extract mention pairs from documents for training.

Generates both positive pairs (mentions in the same chain) and negative pairs (mentions in different chains).

Options

:max_distance - Maximum sentence distance between mentions (default: 3)
:positive_negative_ratio - Ratio of positive to negative samples (default: 1.0)
:window_size - Number of sentences to consider for negative sampling (default: 5)

Returns

List of mention pairs with labels (1 for coref, 0 for non-coref)
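To show how :max_distance limits pairing by sentence distance, here is a small sketch that builds maps in the mention_pair() shape. MentionPairSketch is hypothetical, mentions are assumed to carry :chain_id and :sentence keys, and the negative sampling governed by :positive_negative_ratio and :window_size is omitted.

```elixir
defmodule MentionPairSketch do
  # mentions: maps with :chain_id and :sentence keys, in document order
  # (assumed shape). Only pairs whose sentences are at most max_distance
  # apart are kept.
  def pairs(mentions, document_id, max_distance \\ 3) do
    indexed = Enum.with_index(mentions)

    for {m1, i} <- indexed,
        {m2, j} <- indexed,
        i < j,
        abs(m2.sentence - m1.sentence) <= max_distance do
      %{
        mention1: m1,
        mention2: m2,
        label: if(m1.chain_id == m2.chain_id, do: 1, else: 0),
        document_id: document_id
      }
    end
  end
end
```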

load_document(path)

@spec load_document(Path.t()) :: {:ok, coref_document()} | {:error, term()}

Load a single OntoNotes document file.

Parameters

path - Path to a .coref or .v4_gold_conll file

Returns

{:ok, document} - Parsed document with coreference annotations
{:error, reason} - Parse error

load_documents(path)

@spec load_documents(Path.t()) :: {:ok, [coref_document()]} | {:error, term()}

Load OntoNotes documents from a directory.

Recursively searches for .coref files in the given directory.

Parameters

path - Path to a directory containing CoNLL-2012 files

Returns

{:ok, documents} - List of parsed documents with coreference annotations
{:error, reason} - Load error
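The recursive search could be approximated with Path.wildcard/1, as in the sketch below. CorefFilesSketch and the :enotdir error reason are assumptions for illustration; how load_documents/1 actually reports a missing directory is not documented here.

```elixir
defmodule CorefFilesSketch do
  # Returns every .coref file at or below dir, sorted for a stable load order.
  def find(dir) do
    if File.dir?(dir) do
      {:ok, dir |> Path.join("**/*.coref") |> Path.wildcard() |> Enum.sort()}
    else
      {:error, :enotdir}
    end
  end
end
```

The "**" segment matches zero or more directory levels, so files directly inside dir are found as well as files in nested folders.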