Nasty.Data.OntoNotes (Nasty v0.3.0)

Loader for OntoNotes 5.0 coreference data in CoNLL-2012 format.

CoNLL-2012 is a column-based format in the same family as CoNLL-U, with coreference annotations in the last column. Each token's coreference column indicates which entity chain(s) it belongs to.

Format

CoNLL-2012 has the following tab-separated columns:

  1. Document ID
  2. Part number
  3. Word number
  4. Word itself
  5. POS tag
  6. Parse bit
  7. Predicate lemma
  8. Predicate sense
  9. Word sense
  10. Speaker
  11. Named entities
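  12. Coreference chains: "(0)" marks a single-token mention of chain 0, "(0" opens and "0)" closes a multi-token mention, "|" separates several annotations on one token (e.g., "(0|(1"), and "-" means no annotation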
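The bracket notation in the last column can be decoded with a few string operations. The following is a minimal sketch, not the parser used by this module; the module name `CorefColumnSketch` and the `{:open | :close | :single, id}` result shape are assumptions made for illustration.

```elixir
defmodule CorefColumnSketch do
  @moduledoc false

  # Hypothetical helper: decodes one coreference cell such as
  # "(0)", "(0|(1", "0)" or "-" into a list of {kind, chain_id} tuples.
  def parse("-"), do: []

  def parse(cell) when is_binary(cell) do
    cell
    |> String.split("|", trim: true)
    |> Enum.map(&parse_part/1)
  end

  defp parse_part(part) do
    opens? = String.starts_with?(part, "(")
    closes? = String.ends_with?(part, ")")

    # Strip the brackets; what remains is the numeric chain id.
    id =
      part
      |> String.trim_leading("(")
      |> String.trim_trailing(")")
      |> String.to_integer()

    cond do
      opens? and closes? -> {:single, id}
      opens? -> {:open, id}
      closes? -> {:close, id}
      true -> raise ArgumentError, "not a coreference annotation: #{inspect(part)}"
    end
  end
end
```

For example, `CorefColumnSketch.parse("(0|(1")` returns `[{:open, 0}, {:open, 1}]`, and `CorefColumnSketch.parse("-")` returns `[]`.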

Example

# Begin document doc1; part 000
doc1  0   0   John    NNP  ...  -  -  -  -  *  (0)
doc1  0   1   works   VBZ  ...  -  -  -  -  *  -
doc1  0   2   at      IN   ...  -  -  -  -  *  -
doc1  0   3   Google  NNP  ...  -  -  -  -  *  (1)
doc1  0   4   .       .    ...  -  -  -  -  *  -
# ...
doc1  0   10  He      PRP  ...  -  -  -  -  *  (0)
# End document
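Turning the per-token annotations into mention spans amounts to matching each closing bracket with the most recent unmatched opening bracket of the same chain. The sketch below shows one way to do this with a per-chain stack; `MentionSpanSketch` and its `{chain_id, start_index, end_index}` output are hypothetical and may differ from how this module builds its `Mention` structs.

```elixir
defmodule MentionSpanSketch do
  @moduledoc false

  # cells: the coreference column of one sentence, one string per token.
  # Returns {chain_id, start_index, end_index} tuples in closing order.
  def spans(cells) do
    {_open, done} =
      cells
      |> Enum.with_index()
      |> Enum.reduce({%{}, []}, fn {cell, idx}, acc ->
        cell
        |> String.split("|", trim: true)
        |> Enum.reject(&(&1 == "-"))
        |> Enum.reduce(acc, &step(&1, idx, &2))
      end)

    Enum.reverse(done)
  end

  defp step(part, idx, {open, done}) do
    id =
      part
      |> String.trim_leading("(")
      |> String.trim_trailing(")")
      |> String.to_integer()

    # An opening bracket pushes the token index onto the chain's stack.
    open =
      if String.starts_with?(part, "(") do
        Map.update(open, id, [idx], &[idx | &1])
      else
        open
      end

    # A closing bracket pops the latest start index for that chain.
    # Raises on unbalanced input, which is acceptable for a sketch.
    if String.ends_with?(part, ")") do
      [start | rest] = Map.fetch!(open, id)
      {Map.put(open, id, rest), [{id, start, idx} | done]}
    else
      {open, done}
    end
  end
end
```

With the cells `["(0|(1", "1)", "0)"]`, this returns `[{1, 0, 1}, {0, 0, 2}]`: chain 1 covers tokens 0 to 1, nested inside chain 0, which covers tokens 0 to 2.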

Usage

alias Nasty.Data.OntoNotes

# Load training data
{:ok, documents} = OntoNotes.load_documents("data/ontonotes/train")

# Extract mention pairs for training
pairs = OntoNotes.extract_mention_pairs(documents, max_distance: 3)

# Create balanced training data
training_data = OntoNotes.create_training_data(documents,
  positive_negative_ratio: 1.0,
  max_distance: 3
)

Summary

Functions

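create_antecedent_data(documents, opts \\ [])

Create antecedent training data for end-to-end coreference.

create_span_training_data(documents, opts \\ [])

Create span-based training data for end-to-end coreference.

create_training_data(documents, opts \\ [])

Create training data from documents.

extract_mention_pairs(documents, opts \\ [])

Extract mention pairs from documents for training.

load_document(path)

Load a single OntoNotes document file.

load_documents(path)

Load OntoNotes documents from a directory.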

Types

coref_document()

@type coref_document() :: %{
  id: String.t(),
  sentences: [coref_sentence()],
  chains: [Nasty.AST.Semantic.CorefChain.t()]
}

coref_sentence()

@type coref_sentence() :: %{
  tokens: [coref_token()],
  mentions: [Nasty.AST.Semantic.Mention.t()]
}

coref_token()

@type coref_token() :: %{
  id: pos_integer(),
  text: String.t(),
  pos_tag: atom(),
  coref_ids: [non_neg_integer()]
}

mention_pair()

@type mention_pair() :: %{
  mention1: Nasty.AST.Semantic.Mention.t(),
  mention2: Nasty.AST.Semantic.Mention.t(),
  label: 0 | 1,
  document_id: String.t()
}

Functions

create_antecedent_data(documents, opts \\ [])

@spec create_antecedent_data(
  [coref_document()],
  keyword()
) :: [{map(), map(), 0 | 1}]

Create antecedent training data for end-to-end coreference.

For each mention, generates (mention, antecedent, label) triples. The label is 1 if the antecedent is coreferent with the mention, and 0 otherwise.

Options

  • :max_antecedent_distance - Maximum distance in mentions (default: 50)
  • :negative_antecedent_ratio - Ratio of negative to positive antecedents (default: 1.5)

Returns

List of {mention_span, antecedent_span, label} tuples
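To illustrate the triple format, the sketch below pairs every mention with the mentions that precede it within a distance limit. It is an illustration under stated assumptions, not this function's implementation: mentions are plain maps with a hypothetical `:chain` key, the module name `AntecedentSketch` is made up, and no negative subsampling (as controlled by `:negative_antecedent_ratio`) is applied.

```elixir
defmodule AntecedentSketch do
  @moduledoc false

  # mentions: document-ordered list of maps with a hypothetical :chain key.
  # max_distance is counted in mentions, mirroring :max_antecedent_distance.
  def triples(mentions, max_distance \\ 50) do
    indexed = Enum.with_index(mentions)

    for {mention, i} <- indexed,
        {antecedent, j} <- indexed,
        j < i,
        i - j <= max_distance do
      # Label 1 when both mentions belong to the same chain.
      label = if mention.chain == antecedent.chain, do: 1, else: 0
      {mention, antecedent, label}
    end
  end
end
```

For three mentions with chains 0, 1, and 0, the labels come out as `[0, 1, 0]`: only the pair formed by the third and first mention is positive.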

create_span_training_data(documents, opts \\ [])

@spec create_span_training_data(
  [coref_document()],
  keyword()
) :: [{map(), 0 | 1}]

Create span-based training data for end-to-end coreference.

Generates (span, label) pairs, where the label is 1 if the span is a mention and 0 otherwise. Candidate spans are generated by enumeration.

Options

  • :max_span_width - Maximum span width in tokens (default: 10)
  • :negative_span_ratio - Ratio of negative to positive spans (default: 3.0)

Returns

List of {span, label} tuples
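Span enumeration can be sketched as listing every contiguous token range up to a maximum width and labelling those that match a gold mention. The code below is a hypothetical illustration, not this function's implementation: spans are `{start, end}` index tuples rather than the maps this function returns, `SpanEnumSketch` is an invented name, and negatives are not subsampled by `:negative_span_ratio`. It assumes Elixir 1.12 or later for stepped ranges.

```elixir
defmodule SpanEnumSketch do
  @moduledoc false

  # Lists all {start, end} token index pairs (inclusive) whose width
  # does not exceed max_width, for a sentence of token_count tokens.
  def enumerate(token_count, max_width \\ 10) do
    for start <- 0..(token_count - 1)//1,
        width <- 1..max_width,
        start + width <= token_count do
      {start, start + width - 1}
    end
  end

  # Labels each candidate 1 if it is a gold mention span, else 0.
  def label(spans, gold_spans) do
    gold = MapSet.new(gold_spans)

    Enum.map(spans, fn span ->
      {span, if(MapSet.member?(gold, span), do: 1, else: 0)}
    end)
  end
end
```

A three-token sentence with `max_width` 2 yields five candidates: `{0, 0}`, `{0, 1}`, `{1, 1}`, `{1, 2}`, and `{2, 2}`.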

create_training_data(documents, opts \\ [])

@spec create_training_data(
  [coref_document()],
  keyword()
) :: [{Nasty.AST.Semantic.Mention.t(), Nasty.AST.Semantic.Mention.t(), 0 | 1}]

Create training data from documents.

This is a convenience function that extracts mention pairs and formats them for training a neural coreference model.

Options

  • :positive_negative_ratio - Ratio of positive to negative samples (default: 1.0)
  • :max_distance - Maximum sentence distance (default: 3)
  • :shuffle - Whether to shuffle the data (default: true)
  • :seed - Random seed for shuffling (default: :os.system_time())

Returns

List of {mention1, mention2, label} tuples ready for training
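The interaction of the ratio, shuffle, and seed options can be pictured with a small balancing routine. This is a sketch only: it assumes that a ratio of 1.0 means one negative kept per positive, and `BalanceSketch` is a hypothetical name. The actual sampling strategy of this function may differ.

```elixir
defmodule BalanceSketch do
  @moduledoc false

  # pairs: list of {mention1, mention2, label} tuples.
  # ratio: desired positives per negative (assumed meaning), must be > 0.
  def balance(pairs, ratio \\ 1.0, seed \\ 42) do
    # Seed the process-local generator so the shuffles are repeatable.
    :rand.seed(:exsss, seed)

    {positives, negatives} =
      Enum.split_with(pairs, fn {_m1, _m2, label} -> label == 1 end)

    keep = round(length(positives) / ratio)
    sampled = negatives |> Enum.shuffle() |> Enum.take(keep)

    Enum.shuffle(positives ++ sampled)
  end
end
```

Calling `BalanceSketch.balance/3` twice with the same seed returns the same list, which is useful when a training run needs to be reproduced.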

extract_mention_pairs(documents, opts \\ [])

@spec extract_mention_pairs(
  [coref_document()],
  keyword()
) :: [mention_pair()]

Extract mention pairs from documents for training.

Generates both positive pairs (mentions in the same chain) and negative pairs (mentions in different chains).

Options

  • :max_distance - Maximum sentence distance between mentions (default: 3)
  • :positive_negative_ratio - Ratio of positive to negative samples (default: 1.0)
  • :window_size - Number of sentences to consider for negative sampling (default: 5)

Returns

List of mention pairs with labels (1 for coreferent, 0 for non-coreferent)
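A quick way to check the class balance of the returned pairs is to count them by label. The helper below is a hypothetical convenience, not part of this module; it relies only on the `:label` key of the `mention_pair()` type.

```elixir
defmodule PairStatsSketch do
  @moduledoc false

  # Counts pairs per label, e.g. %{0 => negatives, 1 => positives}.
  def label_counts(pairs) do
    Enum.frequencies_by(pairs, & &1.label)
  end
end

# Intended use (requires loaded documents):
#   documents
#   |> Nasty.Data.OntoNotes.extract_mention_pairs(max_distance: 3)
#   |> PairStatsSketch.label_counts()
```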

load_document(path)

@spec load_document(Path.t()) :: {:ok, coref_document()} | {:error, term()}

Load a single OntoNotes document file.

Parameters

  • path - Path to .coref or .v4_gold_conll file

Returns

  • {:ok, document} - Parsed document with coreference annotations
  • {:error, reason} - Parse error

load_documents(path)

@spec load_documents(Path.t()) :: {:ok, [coref_document()]} | {:error, term()}

Load OntoNotes documents from a directory.

Recursively searches for .coref files in the given directory.

Parameters

  • path - Path to directory containing CoNLL-2012 files

Returns

  • {:ok, documents} - List of parsed documents with coreference annotations
  • {:error, reason} - Load error
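A recursive search for files with a given extension can be expressed with `Path.wildcard/1`. The sketch below shows the idea; `CorefFilesSketch` is a hypothetical name, and this is not necessarily how this function discovers files.

```elixir
defmodule CorefFilesSketch do
  @moduledoc false

  # Returns all files under dir (at any depth) ending in ext, sorted.
  def find(dir, ext \\ ".coref") do
    dir
    |> Path.join("**/*" <> ext)
    |> Path.wildcard()
    |> Enum.sort()
  end
end
```

Each path returned by `CorefFilesSketch.find/2` could then be passed to `load_document/1`, matching on `{:ok, document}` and `{:error, reason}` to decide whether to skip or abort on a malformed file.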