Nasty.Data.OntoNotes (Nasty v0.3.0)
Loader for OntoNotes 5.0 coreference data in CoNLL-2012 format.
The CoNLL-2012 format is a column-based, one-token-per-line format (similar in layout to CoNLL-U) that carries coreference annotations in the last column. Each token's coreference column indicates which entity chain(s) it belongs to.
Format
CoNLL-2012 has the following tab-separated columns:
- Document ID
- Part number
- Word number
- Word itself
- POS tag
- Parse bit
- Predicate lemma
- Predicate sense
- Word sense
- Speaker
- Named entities
- Coreference chains (e.g., "(0)" or "(0|(1" or "0)")
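As a rough illustration of how one row in this layout could be read, the sketch below splits a line into the twelve columns listed above and decodes the coreference column. The names `CoNLLSketch`, `parse_line/1`, and `parse_coref/1` are hypothetical and not part of Nasty, and splitting on any whitespace (rather than strictly on tabs) is an assumption made for convenience; the library's own parser may work differently.

```elixir
defmodule CoNLLSketch do
  # Hypothetical helper, not part of Nasty: illustrates the column layout only.
  @columns [
    :doc_id, :part, :word_num, :word, :pos, :parse_bit,
    :pred_lemma, :pred_sense, :word_sense, :speaker, :named_entities, :coref
  ]

  # Split a row on whitespace (tabs or spaces) and name the 12 columns.
  def parse_line(line) do
    fields = String.split(line)

    if length(fields) == length(@columns) do
      {:ok, @columns |> Enum.zip(fields) |> Map.new()}
    else
      {:error, {:unexpected_column_count, length(fields)}}
    end
  end

  # Decode the coreference column into a list of boundary events:
  #   "(0)" -> [{:single, 0}]  mention starts and ends on this token
  #   "(0"  -> [{:open, 0}]    mention starts on this token
  #   "0)"  -> [{:close, 0}]   mention ends on this token
  #   "-"   -> []              token is not at a mention boundary
  def parse_coref("-"), do: []

  def parse_coref(column) do
    column
    |> String.split("|")
    |> Enum.map(&parse_part/1)
  end

  defp parse_part(part) do
    opens? = String.starts_with?(part, "(")
    closes? = String.ends_with?(part, ")")
    id = part |> String.trim_leading("(") |> String.trim_trailing(")") |> String.to_integer()

    case {opens?, closes?} do
      {true, true} -> {:single, id}
      {true, false} -> {:open, id}
      {false, true} -> {:close, id}
      {false, false} -> raise ArgumentError, "malformed coreference part: #{inspect(part)}"
    end
  end
end

{:ok, row} = CoNLLSketch.parse_line("doc1 0 3 Google NNP ... - - - - * (1)")
events = CoNLLSketch.parse_coref(row.coref)
```

Keeping the events as tagged tuples leaves the pairing of `:open` and `:close` boundaries into multi-token spans to a later step.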
Example
# Begin document doc1; part 000
doc1 0 0 John NNP ... - - - - * (0)
doc1 0 1 works VBZ ... - - - - * -
doc1 0 2 at IN ... - - - - * -
doc1 0 3 Google NNP ... - - - - * (1)
doc1 0 4 . . ... - - - - * -
# ...
doc1 0 10 He PRP ... - - - - * (0)
# End document
Usage
alias Nasty.Data.OntoNotes

# Load training data
{:ok, documents} = OntoNotes.load_documents("data/ontonotes/train")

# Extract mention pairs for training
pairs = OntoNotes.extract_mention_pairs(documents, max_distance: 3)

# Create balanced training data
training_data =
  OntoNotes.create_training_data(documents,
    positive_negative_ratio: 1.0,
    max_distance: 3
  )
Summary
Functions
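- create_antecedent_data - Create antecedent training data for end-to-end coreference.
- create_span_training_data - Create span-based training data for end-to-end coreference.
- create_training_data - Create training data from documents.
- extract_mention_pairs - Extract mention pairs from documents for training.
- load_document - Load a single OntoNotes document file.
- load_documents - Load OntoNotes documents from a directory.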
Types
@type coref_document() :: %{
        id: String.t(),
        sentences: [coref_sentence()],
        chains: [Nasty.AST.Semantic.CorefChain.t()]
      }

@type coref_sentence() :: %{
        tokens: [coref_token()],
        mentions: [Nasty.AST.Semantic.Mention.t()]
      }

@type coref_token() :: %{
        id: pos_integer(),
        text: String.t(),
        pos_tag: atom(),
        coref_ids: [non_neg_integer()]
      }

@type mention_pair() :: %{
        mention1: Nasty.AST.Semantic.Mention.t(),
        mention2: Nasty.AST.Semantic.Mention.t(),
        label: 0 | 1,
        document_id: String.t()
      }
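Since tokens are plain maps, they can be inspected with ordinary `Enum` and `Map` calls. The sketch below, which assumes tokens shaped like `coref_token()`, groups token ids by the entity ids in `coref_ids`. `CorefTokenSketch` is a hypothetical name, and the `pos_tag` atoms are illustrative values rather than the tag set Nasty actually produces.

```elixir
defmodule CorefTokenSketch do
  # Hypothetical helper: map each entity id to the ids of the tokens carrying it.
  def token_ids_by_entity(tokens) do
    Enum.reduce(tokens, %{}, fn %{id: token_id, coref_ids: entity_ids}, acc ->
      Enum.reduce(entity_ids, acc, fn entity, inner ->
        Map.update(inner, entity, [token_id], &(&1 ++ [token_id]))
      end)
    end)
  end
end

# Illustrative tokens; "New York" is a two-token mention of entity 1.
tokens = [
  %{id: 1, text: "John", pos_tag: :nnp, coref_ids: [0]},
  %{id: 2, text: "visited", pos_tag: :vbd, coref_ids: []},
  %{id: 3, text: "New", pos_tag: :nnp, coref_ids: [1]},
  %{id: 4, text: "York", pos_tag: :nnp, coref_ids: [1]}
]

by_entity = CorefTokenSketch.token_ids_by_entity(tokens)
```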
Functions
@spec create_antecedent_data([coref_document()], keyword()) :: [{map(), map(), 0 | 1}]
Create antecedent training data for end-to-end coreference.
For each mention, generates (mention, antecedent, label) triples. The label is 1 if the antecedent is coreferent with the mention, and 0 otherwise.
Options
- :max_antecedent_distance - Maximum distance in mentions (default: 50)
- :negative_antecedent_ratio - Ratio of negative to positive examples (default: 1.5)
Returns
List of {mention_span, antecedent_span, label} tuples
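The triple generation described above can be pictured with a small sketch. It is not the library's implementation: mentions are modeled as plain maps with hypothetical `:text` and `:chain` keys, `AntecedentSketch.triples/2` is an invented name, and the negative downsampling controlled by `:negative_antecedent_ratio` is left out for brevity.

```elixir
defmodule AntecedentSketch do
  # Hypothetical stand-in: mentions are plain maps listed in document order.
  def triples(mentions, opts \\ []) do
    max_distance = Keyword.get(opts, :max_antecedent_distance, 50)
    indexed = Enum.with_index(mentions)

    # Pair every mention with each earlier mention that is at most
    # max_distance mentions back, and label the pair by chain equality.
    for {mention, i} <- indexed,
        {antecedent, j} <- indexed,
        j < i,
        i - j <= max_distance do
      label = if mention.chain == antecedent.chain, do: 1, else: 0
      {mention, antecedent, label}
    end
  end
end

john = %{text: "John", chain: 0}
google = %{text: "Google", chain: 1}
he = %{text: "He", chain: 0}

all_triples = AntecedentSketch.triples([john, google, he])
near_triples = AntecedentSketch.triples([john, google, he], max_antecedent_distance: 1)
```

With the default distance, "He" is paired with both earlier mentions and only the pair with "John" is labeled 1. With a distance of 1, the "He"/"John" pair is dropped because "John" is two mentions back.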
@spec create_span_training_data([coref_document()], keyword()) :: [{map(), 0 | 1}]
Create span-based training data for end-to-end coreference.
Generates (span, label) pairs, where the label is 1 if the span is a mention and 0 otherwise. Candidate spans are generated by enumeration.
Options
- :max_span_width - Maximum span width in tokens (default: 10)
- :negative_span_ratio - Ratio of negative to positive spans (default: 3.0)
Returns
List of {span, label} tuples
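One plausible reading of span enumeration is sketched below: every contiguous token span up to a maximum width is listed and labeled against a set of gold mention spans. The span representation (`%{first: ..., last: ...}` with 0-based inclusive indices) and the name `SpanSketch.labeled_spans/3` are assumptions for illustration, and sampling negatives according to `:negative_span_ratio` is omitted.

```elixir
defmodule SpanSketch do
  # Hypothetical sketch: enumerate all spans up to max_span_width tokens in a
  # sentence of token_count tokens and label them against the gold spans.
  def labeled_spans(token_count, gold_spans, opts \\ []) do
    max_width = Keyword.get(opts, :max_span_width, 10)
    gold = MapSet.new(gold_spans)

    # The explicit step (//1) keeps the ranges empty instead of counting
    # downwards when token_count or max_width is 0.
    for first <- 0..(token_count - 1)//1,
        last <- first..min(first + max_width - 1, token_count - 1)//1 do
      span = %{first: first, last: last}
      label = if MapSet.member?(gold, span), do: 1, else: 0
      {span, label}
    end
  end
end

# "John works at Google ." has 5 tokens; "John" and "Google" are mentions.
gold = [%{first: 0, last: 0}, %{first: 3, last: 3}]
labeled = SpanSketch.labeled_spans(5, gold, max_span_width: 2)
```

Because the number of candidates grows with both sentence length and span width, most enumerated spans are negatives, which is presumably why a negative-to-positive ratio option exists.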
@spec create_training_data([coref_document()], keyword()) ::
        [{Nasty.AST.Semantic.Mention.t(), Nasty.AST.Semantic.Mention.t(), 0 | 1}]
Create training data from documents.
This is a convenience function that extracts mention pairs and formats them for training a neural coreference model.
Options
- :positive_negative_ratio - Ratio of positive to negative samples (default: 1.0)
- :max_distance - Maximum sentence distance (default: 3)
- :shuffle - Whether to shuffle the data (default: true)
- :seed - Random seed for shuffling (default: :os.system_time())
Returns
List of {mention1, mention2, label} tuples ready for training
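To illustrate how a positive-to-negative ratio and a seed might interact, here is a minimal sketch of ratio-based balancing followed by a seeded shuffle. `BalanceSketch.balance/2` is a hypothetical function, the examples are placeholder tuples rather than real `Mention` structs, and a ratio of 0 is not handled. The library's actual sampling strategy may differ.

```elixir
defmodule BalanceSketch do
  # Hypothetical sketch. `examples` is a list of {mention1, mention2, label}.
  def balance(examples, opts \\ []) do
    ratio = Keyword.get(opts, :positive_negative_ratio, 1.0)
    seed = Keyword.get(opts, :seed, :os.system_time())

    {positives, negatives} =
      Enum.split_with(examples, fn {_m1, _m2, label} -> label == 1 end)

    # ratio = positives / negatives, so keep at most positives / ratio negatives.
    max_negatives = round(length(positives) / ratio)

    # Seed the process-local generator so that Enum.shuffle/1 is reproducible.
    :rand.seed(:exsss, seed)
    kept_negatives = negatives |> Enum.shuffle() |> Enum.take(max_negatives)
    Enum.shuffle(positives ++ kept_negatives)
  end
end

# Two positive and five negative placeholder examples.
examples = [{:a, :b, 1}, {:c, :d, 1}] ++ Enum.map(1..5, &{:neg, &1, 0})

first_run = BalanceSketch.balance(examples, seed: 42)
second_run = BalanceSketch.balance(examples, seed: 42)
```

Passing the same `:seed` twice yields the same selection and order, which makes training runs repeatable. The default of `:os.system_time()` gives a different shuffle on each call.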
@spec extract_mention_pairs([coref_document()], keyword()) :: [mention_pair()]
Extract mention pairs from documents for training.
Generates both positive pairs (mentions in same chain) and negative pairs (mentions not in same chain).
Options
- :max_distance - Maximum sentence distance between mentions (default: 3)
- :positive_negative_ratio - Ratio of positive to negative samples (default: 1.0)
- :window_size - Number of sentences to consider for negative sampling (default: 5)
Returns
List of mention pairs with labels (1 for coref, 0 for non-coref)
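The effect of `:max_distance` can be sketched with simplified data. Below, mentions are plain maps with hypothetical `:text`, `:sentence`, and `:chain` keys (the real `Nasty.AST.Semantic.Mention` struct may look different), and `PairSketch.pairs/3` is an invented name. Ratio-based sampling and `:window_size` are not modeled.

```elixir
defmodule PairSketch do
  # Hypothetical sketch: pair each mention with every later mention that lies
  # within max_distance sentences, shaped like the mention_pair() type.
  def pairs(mentions, document_id, opts \\ []) do
    max_distance = Keyword.get(opts, :max_distance, 3)
    indexed = Enum.with_index(mentions)

    for {m1, i} <- indexed,
        {m2, j} <- indexed,
        i < j,
        abs(m2.sentence - m1.sentence) <= max_distance do
      %{
        mention1: m1,
        mention2: m2,
        label: if(m1.chain == m2.chain, do: 1, else: 0),
        document_id: document_id
      }
    end
  end
end

mentions = [
  %{text: "John", sentence: 0, chain: 0},
  %{text: "Google", sentence: 0, chain: 1},
  %{text: "He", sentence: 4, chain: 0}
]

close_pairs = PairSketch.pairs(mentions, "doc1")
wide_pairs = PairSketch.pairs(mentions, "doc1", max_distance: 5)
```

With the default distance of 3, "He" in sentence 4 is too far from both sentence-0 mentions, so only the negative "John"/"Google" pair survives. Widening the distance to 5 admits the positive "John"/"He" pair.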
@spec load_document(Path.t()) :: {:ok, coref_document()} | {:error, term()}
Load a single OntoNotes document file.
Parameters
- path - Path to a .coref or .v4_gold_conll file
Returns
- {:ok, document} - Parsed document with coreference annotations
- {:error, reason} - Parse error
@spec load_documents(Path.t()) :: {:ok, [coref_document()]} | {:error, term()}
Load OntoNotes documents from a directory.
Recursively searches for .coref files in the given directory.
Parameters
- path - Path to a directory containing CoNLL-2012 files
Returns
- {:ok, documents} - List of parsed documents with coreference annotations
- {:error, reason} - Load error
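A recursive search for .coref files can be expressed with the standard library alone. The sketch below shows one way to do it with `Path.wildcard/1`; `FindSketch.coref_files/1` is a hypothetical helper, and both the sorted output and the `:enoent` error reason are assumptions rather than documented behavior of `load_documents`.

```elixir
defmodule FindSketch do
  # Hypothetical helper: list every .coref file under `dir`, sorted so the
  # order is reproducible. "**" in Path.wildcard/1 also descends into
  # subdirectories.
  def coref_files(dir) do
    if File.dir?(dir) do
      {:ok, dir |> Path.join("**/*.coref") |> Path.wildcard() |> Enum.sort()}
    else
      {:error, :enoent}
    end
  end
end

# Build a small throwaway tree to try it on.
tmp = Path.join(System.tmp_dir!(), "coref_sketch_#{System.unique_integer([:positive])}")
File.mkdir_p!(Path.join(tmp, "nested"))
File.write!(Path.join(tmp, "a.coref"), "")
File.write!(Path.join([tmp, "nested", "b.coref"]), "")
File.write!(Path.join(tmp, "c.txt"), "")

{:ok, files} = FindSketch.coref_files(tmp)
names = Enum.map(files, &Path.basename/1)
missing = FindSketch.coref_files(Path.join(tmp, "missing"))
File.rm_rf!(tmp)
```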