Information Extraction
This document describes Nasty's information extraction capabilities, which identify and extract structured information from unstructured text.
Overview
Nasty provides four main information extraction features:
- Named Entity Recognition (NER) - Identifies entities like people, organizations, locations, dates
- Relation Extraction - Discovers semantic relationships between entities
- Event Extraction - Identifies events with participants, time, and location
- Coreference Resolution - Resolves pronouns to their antecedents
Named Entity Recognition (NER)
NER identifies and classifies entities mentioned in text into predefined categories.
Supported Entity Types
- PERSON - Individual person names ("John Smith", "Mary")
- ORG - Organizations ("Google Inc.", "Harvard University")
- LOC - Physical locations ("Mount Everest", "Pacific Ocean")
- GPE - Geopolitical entities ("France", "California", "New York")
- DATE - Temporal expressions ("January 5", "2026", "March")
- TIME - Time expressions ("3:00 PM", "noon", "midnight")
- MONEY - Monetary values ("$100", "50 euros")
- PERCENT - Percentages ("25%")
- QUANTITY - Measurements ("5 kg", "10 meters")
- EVENT - Named events ("World War II", "Olympics")
- PRODUCT - Products/services ("iPhone", "Windows")
- LANGUAGE - Language names ("English", "Spanish")
Usage
alias Nasty.Language.English.{Tokenizer, POSTagger, EntityRecognizer}
# Parse and tag text
{:ok, tokens} = Tokenizer.tokenize("John works at Google in California.")
{:ok, tagged} = POSTagger.tag_pos(tokens)
# Recognize entities
entities = EntityRecognizer.recognize(tagged)
# Inspect results
Enum.each(entities, fn entity ->
  IO.puts("#{entity.type}: #{entity.text}")
end)
# Output:
# person: John
# org: Google
# gpe: California
NER Models
Nasty supports multiple NER approaches:
# Rule-based (default) - Fast, ~85% accuracy
entities = EntityRecognizer.recognize(tagged)
# Statistical CRF - ~90-95% accuracy
entities = EntityRecognizer.recognize(tagged, model: :crf)
Entity Structure
%Entity{
  type: :person,             # Entity type
  text: "John Smith",        # Surface text
  tokens: [token1, token2],  # Token list
  canonical_form: nil,       # Normalized form
  confidence: 0.85,          # Confidence score
  span: %{...}               # Position info
}
Relation Extraction
Relation extraction identifies semantic relationships between entities in text.
Supported Relation Types
- Employment: :works_at, :employed_by, :member_of
- Organization: :founded, :acquired_by, :subsidiary_of
- Location: :located_in, :based_in, :headquarters_in
- Personal: :born_in, :educated_at, :ceo_of
- Structure: :part_of
- Temporal: :occurred_on, :founded_in
Usage
alias Nasty.Language.English.RelationExtractor
# Parse document
{:ok, document} = Nasty.parse("John works at Google in California.")
# Extract relations
{:ok, relations} = RelationExtractor.extract(document)
# Inspect results
Enum.each(relations, fn rel ->
  IO.puts("#{rel.subject.text} -[#{rel.type}]-> #{rel.object.text}")
  IO.puts(" Confidence: #{rel.confidence}")
end)
# Output:
# John -[works_at]-> Google
# Confidence: 0.8
# Google -[located_in]-> California
# Confidence: 0.7
Options
# Filter by confidence threshold
{:ok, relations} = RelationExtractor.extract(document, min_confidence: 0.7)
# Limit number of results
{:ok, relations} = RelationExtractor.extract(document, max_relations: 10)
# Filter by relation type (post-processing)
employment = Enum.filter(relations, fn r -> r.type == :works_at end)
Relation Structure
%Relation{
  type: :works_at,             # Relation type
  subject: %Entity{...},       # Source entity
  object: %Entity{...},        # Target entity
  confidence: 0.8,             # Confidence score
  evidence: "John works...",   # Supporting text
  span: %{...},                # Position info
  language: :en                # Language code
}
Pattern Matching
Relations are detected using:
- Verb patterns: "works at", "founded", "acquired"
- Preposition patterns: "X at Y", "X in Y", "X of Y"
- Dependency paths: Subject-verb-object relationships
- Entity type constraints: PERSON + ORG → works_at
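How a verb pattern and an entity type constraint can work together is easier to see in a small standalone sketch. This is not Nasty's implementation: the module name `RelationRuleSketch`, the rule table, and the plain-map entities are all assumptions made for illustration. A relation type is proposed only when the text between two entities contains the verb pattern and both entity types match the rule.

```elixir
defmodule RelationRuleSketch do
  # Hypothetical rule table: {verb pattern, subject type, object type, relation}.
  # These rules are illustrative assumptions, not Nasty's actual rule set.
  @rules [
    {"works at", :person, :org, :works_at},
    {"founded", :person, :org, :founded},
    {"based in", :org, :gpe, :based_in}
  ]

  # Returns the first relation type whose verb pattern occurs in the text
  # between the two entities and whose type constraints match, or nil.
  def classify(subject, between, object) do
    normalized = String.downcase(between)

    Enum.find_value(@rules, fn {pattern, subj_type, obj_type, relation} ->
      if String.contains?(normalized, pattern) and
           subject.type == subj_type and
           object.type == obj_type do
        relation
      end
    end)
  end
end

# Plain maps stand in for entity structs in this sketch.
john = %{type: :person, text: "John"}
google = %{type: :org, text: "Google"}

IO.inspect(RelationRuleSketch.classify(john, " works at ", google))
# => :works_at
IO.inspect(RelationRuleSketch.classify(google, " works at ", john))
# => nil
```

The second call returns nil because the type constraint fails even though the verb pattern matches, which is the role the PERSON + ORG constraint plays in the list above.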
Event Extraction
Event extraction identifies actions, states, or processes with their participants and circumstances.
Supported Event Types
Business Events
- :business_acquisition - Mergers and acquisitions
- :business_merger - Company mergers
- :product_launch - Product releases
- :company_founding - Company establishments
Employment Events
- :employment_start - Hiring, joining
- :employment_end - Resignation, firing
Communication Events
- :announcement - Public announcements
- :meeting - Meetings, discussions
Movement Events
- :movement - Travel, arrival, departure
Transaction Events
- :transaction - Sales, trades, exchanges
Usage
alias Nasty.Language.English.EventExtractor
# Parse document
{:ok, document} = Nasty.parse("Google acquired YouTube in October 2006.")
# Extract events
{:ok, events} = EventExtractor.extract(document)
# Inspect results
Enum.each(events, fn event ->
  IO.puts("Event: #{event.type}")
  IO.puts(" Trigger: #{event.trigger.text}")
  IO.puts(" Participants: #{inspect(event.participants)}")
  IO.puts(" Time: #{event.time}")
end)
# Output:
# Event: business_acquisition
# Trigger: acquired
# Participants: %{agent: google_entity, patient: youtube_entity}
# Time: October 2006
Options
# Filter by confidence
{:ok, events} = EventExtractor.extract(document, min_confidence: 0.7)
# Limit results
{:ok, events} = EventExtractor.extract(document, max_events: 5)
# Filter by event type (post-processing)
acquisitions = Enum.filter(events, fn e -> e.type == :business_acquisition end)
Event Structure
%Event{
  type: :business_acquisition,  # Event type
  trigger: %Token{...},         # Trigger word (verb/noun)
  participants: %{              # Event participants
    agent: %Entity{...},        # Who performed action
    patient: %Entity{...},      # Who/what was affected
    location: "California"      # Where it occurred
  },
  time: "October 2006",         # When it occurred
  confidence: 0.8,              # Confidence score
  span: %{...},                 # Position info
  language: :en                 # Language code
}
Event Detection
Events are detected through:
- Verb triggers: "acquired", "launched", "announced"
- Nominalizations: "acquisition", "merger", "announcement"
- Semantic roles: Agent, patient, beneficiary extraction
- Temporal expressions: DATE/TIME entity recognition
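The first two signals, verb triggers and nominalizations, can be pictured as a lookup from surface words to event types. The sketch below is a simplified illustration, not Nasty's detector: the module name `EventTriggerSketch` and the contents of the lexicon are assumptions, and a real pipeline would work on tagged tokens and lemmas instead of raw lowercase words.

```elixir
defmodule EventTriggerSketch do
  # Hypothetical trigger lexicon (an assumption for illustration only).
  # Verb triggers and their nominalizations map to the same event type.
  @triggers %{
    "acquired" => :business_acquisition,
    "acquisition" => :business_acquisition,
    "merger" => :business_merger,
    "launched" => :product_launch,
    "announced" => :announcement,
    "announcement" => :announcement
  }

  # Lowercases the text, splits it on non-letter characters, and returns
  # {trigger_word, event_type} pairs in the order they appear.
  def detect(text) do
    text
    |> String.downcase()
    |> String.split(~r/[^a-z]+/, trim: true)
    |> Enum.flat_map(fn word ->
      case Map.fetch(@triggers, word) do
        {:ok, type} -> [{word, type}]
        :error -> []
      end
    end)
  end
end

IO.inspect(EventTriggerSketch.detect("Google acquired YouTube in October 2006."))
# => [{"acquired", :business_acquisition}]
```

Mapping "acquired" and "acquisition" to the same type is what lets a verb and its nominalization trigger the same kind of event.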
Coreference Resolution
Coreference resolution identifies when different expressions refer to the same entity, building chains of mentions across sentences.
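A common baseline for pronoun resolution links each pronoun to the nearest preceding mention that agrees with it in gender and number. The sketch below illustrates that heuristic only; it is not presented as Nasty's algorithm. The module name `CorefSketch` is hypothetical, and plain maps stand in for mentions, using the field names from the Mention Structure described later in this document.

```elixir
defmodule CorefSketch do
  # Links a pronoun to the closest preceding non-pronoun mention that
  # agrees with it in gender and number. Returns nil when no candidate
  # qualifies. A simplified heuristic assumed for illustration.
  def antecedent(pronoun, mentions) do
    mentions
    |> Enum.filter(fn m ->
      m.type != :pronoun and precedes?(m, pronoun) and agrees?(m, pronoun)
    end)
    |> Enum.max_by(fn m -> {m.sentence_idx, m.token_idx} end, fn -> nil end)
  end

  # Document order is compared as {sentence, token} position.
  defp precedes?(a, b) do
    {a.sentence_idx, a.token_idx} < {b.sentence_idx, b.token_idx}
  end

  defp agrees?(a, b) do
    a.gender == b.gender and a.number == b.number
  end
end

mentions = [
  %{text: "John", type: :proper_name, sentence_idx: 0, token_idx: 0,
    gender: :male, number: :singular},
  %{text: "Google", type: :proper_name, sentence_idx: 0, token_idx: 3,
    gender: :unknown, number: :singular}
]

he = %{text: "He", type: :pronoun, sentence_idx: 1, token_idx: 0,
       gender: :male, number: :singular}

IO.inspect(CorefSketch.antecedent(he, mentions).text)
# => "John"
```

"Google" is closer to the pronoun than "John" but is skipped because its gender does not agree, so agreement takes priority over distance in this sketch.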
Usage
alias Nasty.Language.English.CoreferenceResolver
# Parse document with multiple sentences
text = """
John works at Google. He is an engineer.
The company is based in California.
"""
{:ok, document} = Nasty.parse(text)
# Resolve coreferences
{:ok, chains} = CoreferenceResolver.resolve(document)
# Inspect results
Enum.each(chains, fn chain ->
  IO.puts("Entity chain:")
  Enum.each(chain.mentions, fn mention ->
    IO.puts(" - #{mention.text} (#{mention.type})")
  end)
end)
# Output:
# Entity chain:
# - John (proper_name)
# - He (pronoun)
# Entity chain:
# - Google (proper_name)
# - The company (definite_np)
Mention Types
- :proper_name - Proper nouns ("John", "Google")
- :pronoun - Pronouns ("he", "she", "it", "they")
- :definite_np - Definite noun phrases ("the company", "the president")
- :demonstrative - Demonstrative references ("this", "that")
Coreference Chain Structure
%CorefChain{
  id: "chain_1",                  # Unique chain ID
  representative: %Mention{...},  # Most informative mention
  mentions: [                     # All mentions in chain
    %Mention{text: "John", type: :proper_name, ...},
    %Mention{text: "He", type: :pronoun, ...}
  ],
  entity_type: :person            # Entity type for chain
}
Mention Structure
%Mention{
  text: "he",         # Surface text
  type: :pronoun,     # Mention type
  sentence_idx: 1,    # Sentence number
  token_idx: 0,       # Token position
  gender: :male,      # Gender (male/female/unknown)
  number: :singular,  # Number (singular/plural)
  span: %{...}        # Position info
}
Complete Pipeline Example
Here's a complete example using all information extraction features:
alias Nasty.Language.English.{
  Tokenizer,
  POSTagger,
  Morphology,
  SentenceParser,
  EntityRecognizer,
  RelationExtractor,
  EventExtractor,
  CoreferenceResolver
}
alias Nasty.AST.{Document, Paragraph}
text = """
Google acquired YouTube in October 2006 for $1.65 billion.
The company announced the deal in San Francisco.
It was the largest acquisition in Google's history.
"""
# 1. Parse text into document structure
{:ok, tokens} = Tokenizer.tokenize(text)
{:ok, tagged} = POSTagger.tag_pos(tokens)
{:ok, analyzed} = Morphology.analyze(tagged)
{:ok, sentences} = SentenceParser.parse_sentences(analyzed)
paragraph = %Paragraph{
  sentences: sentences,
  span: %{...},
  language: :en
}
document = %Document{
  paragraphs: [paragraph],
  span: %{...},
  language: :en
}
# 2. Extract entities
entities = EntityRecognizer.recognize(tagged)
# => [%Entity{type: :org, text: "Google"}, ...]
# 3. Extract relations
{:ok, relations} = RelationExtractor.extract(document)
# => [%Relation{type: :acquired_by, subject: youtube, object: google}, ...]
# 4. Extract events
{:ok, events} = EventExtractor.extract(document)
# => [%Event{type: :business_acquisition, trigger: "acquired", ...}, ...]
# 5. Resolve coreferences
{:ok, chains} = CoreferenceResolver.resolve(document)
# => [%CorefChain{mentions: [google, "the company"], ...}, ...]
Best Practices
Performance
- Reuse tagged tokens: Parse once, extract multiple times
- Set confidence thresholds: Filter low-confidence results
- Limit results: Use max_relations/max_events options
- Choose appropriate model: Rule-based for speed, CRF for accuracy
Accuracy
- Use domain-specific lexicons: Extend entity recognizer with domain terms
- Validate results: Check confidence scores
- Combine features: Use relations + events together for richer extraction
- Handle ambiguity: Month names like "May" can be dates or names
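The last point can be made concrete with a small context check. The sketch below is only an illustration of how neighbouring tokens might disambiguate a word like "May"; the module name `MonthAmbiguitySketch`, the word lists, and the cues are assumptions, and it works on plain strings where a real pipeline would consult POS tags.

```elixir
defmodule MonthAmbiguitySketch do
  # Month names that are also common given names (illustrative list).
  @ambiguous ~w(May April June August)

  # Prepositions that often introduce a date (illustrative list).
  @date_prepositions ~w(in on since until by during)

  # Classifies the token at `index` as :date, :person, or :other by
  # looking at its immediate neighbours.
  def classify(tokens, index) do
    token = Enum.at(tokens, index)
    prev = if index > 0, do: Enum.at(tokens, index - 1)
    next = Enum.at(tokens, index + 1)

    cond do
      token not in @ambiguous -> :other
      # A following number ("May 5", "May 2026") suggests a date.
      is_binary(next) and next =~ ~r/^\d/ -> :date
      # A preceding date preposition ("in June") suggests a date.
      is_binary(prev) and String.downcase(prev) in @date_prepositions -> :date
      # Otherwise fall back to treating the token as a name.
      true -> :person
    end
  end
end

IO.inspect(MonthAmbiguitySketch.classify(~w(The meeting is on May 5), 4))
# => :date
IO.inspect(MonthAmbiguitySketch.classify(~w(May works at Google), 0))
# => :person
```

A check like this could run as post-processing over recognized entities whose text is an ambiguous month name.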
Common Patterns
# Filter high-confidence relations
high_conf = Enum.filter(relations, fn r -> r.confidence > 0.8 end)
# Group events by type
events_by_type = Enum.group_by(events, & &1.type)
# Find entity mentions across coreference chains
all_mentions = Enum.flat_map(chains, & &1.mentions)
# Extract date/time entities
temporal = Enum.filter(entities, fn e -> e.type in [:date, :time] end)
Limitations
Current Limitations
- Numeric patterns: Years, times with colons, and currency symbols are not fully supported in rule-based NER
- Complex relations: Multi-hop relations not extracted
- Nested events: Sub-events not represented separately
- Cross-document: Coreference limited to single documents
Future Enhancements
- Neural NER models for better accuracy
- Transformer-based relation extraction
- Temporal relation extraction (before/after events)
- Cross-document entity linking
- Multi-lingual information extraction
References
- docs/PARSING_GUIDE.md - Parsing algorithms
- docs/languages/ENGLISH_GRAMMAR.md - Grammar specification
- Entity types specification - Universal Dependencies