Nasty.Language.Behaviour behaviour (Nasty v0.3.0)
Behaviour that all natural language implementations must implement.
This provides a language-agnostic interface for parsing, tagging, and rendering natural language text. Each language (English, Spanish, Catalan, etc.) implements this behaviour with language-specific rules and processing.
Example Implementation
defmodule Nasty.Language.English do
  @behaviour Nasty.Language.Behaviour

  @impl true
  def language_code, do: :en

  @impl true
  def tokenize(text, _opts) do
    # English-specific tokenization
    {:ok, tokens}
  end

  @impl true
  def tag_pos(tokens, _opts) do
    # English-specific POS tagging
    {:ok, tagged_tokens}
  end

  @impl true
  def parse(tokens, _opts) do
    # English-specific parsing
    {:ok, document_ast}
  end

  @impl true
  def render(ast, _opts) do
    # English-specific text generation
    {:ok, text}
  end
end
Summary
Types
options() - Options passed to language processing functions.
parse_result() - Parse result containing the AST and optional metadata.
render_result() - Render result.
tokenize_result() - Tokenization result.
Callbacks
language_code() - Returns the ISO 639-1 language code for this implementation.
metadata() - Returns metadata about the language implementation.
parse(tokens, opts) - Parses tokens into a complete AST (Document structure).
render(ast, opts) - Renders an AST back to natural language text.
tag_pos(tokens, opts) - Tags tokens with part-of-speech information.
tokenize(text, opts) - Tokenizes text into a list of tokens.
Functions
Validates that a module implements the Language.Behaviour correctly.
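As a rough illustration of what such a validation could look like, the sketch below checks that a module exports every required callback listed on this page. The module names `BehaviourCheckSketch` and `IncompleteLanguage`, the function `check/1`, and the error shape are all hypothetical; this is not the validation function that Nasty itself ships.

```elixir
defmodule BehaviourCheckSketch do
  # Required callbacks as listed on this page; metadata/0 is optional and skipped.
  @required [language_code: 0, tokenize: 2, tag_pos: 2, parse: 2, render: 2]

  # Returns :ok, or an error tuple naming the callbacks the module does not export.
  def check(module) when is_atom(module) do
    if Code.ensure_loaded?(module) do
      missing =
        Enum.reject(@required, fn {name, arity} ->
          function_exported?(module, name, arity)
        end)

      case missing do
        [] -> :ok
        _ -> {:error, {:missing_callbacks, missing}}
      end
    else
      {:error, :module_not_loaded}
    end
  end
end

# Hypothetical module that implements only two of the five required callbacks.
defmodule IncompleteLanguage do
  def language_code, do: :xx
  def tokenize(_text, _opts), do: {:ok, []}
end

BehaviourCheckSketch.check(IncompleteLanguage)
# => {:error, {:missing_callbacks, [tag_pos: 2, parse: 2, render: 2]}}
```

A check like this only confirms that functions with the right names and arities exist; it says nothing about whether their return values match the documented result types.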
Types
@type options() :: keyword()
Options passed to language processing functions.
Common options:
- :generate_embeddings - Generate semantic embeddings (default: false)
- :parse_dependencies - Extract dependency relations (default: true)
- :extract_entities - Perform named entity recognition (default: false)
- :resolve_coreferences - Resolve coreferences (default: false)
- Custom language-specific options
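Since options are a plain keyword list, one way an implementation might apply the documented defaults is with `Keyword.merge/2`. The sketch below assumes that approach; `OptionsSketch`, `with_defaults/1`, and the `:dialect` key are hypothetical names used only for illustration, and Nasty may resolve defaults differently.

```elixir
defmodule OptionsSketch do
  # Defaults copied from the option list above.
  @defaults [
    generate_embeddings: false,
    parse_dependencies: true,
    extract_entities: false,
    resolve_coreferences: false
  ]

  # Caller-supplied values override the defaults; unknown
  # (language-specific) keys pass through untouched.
  def with_defaults(opts) when is_list(opts), do: Keyword.merge(@defaults, opts)
end

opts = OptionsSketch.with_defaults(extract_entities: true, dialect: :us)

Keyword.fetch!(opts, :extract_entities)   # => true
Keyword.fetch!(opts, :parse_dependencies) # => true
Keyword.fetch!(opts, :dialect)            # => :us
```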
@type parse_result() :: {:ok, Nasty.AST.Document.t()} | {:error, term()}
Parse result containing the AST and optional metadata.
Render result.
@type tokenize_result() :: {:ok, [Nasty.AST.Token.t()]} | {:error, term()}
Tokenization result.
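Because each processing step returns either an `{:ok, value}` or an `{:error, reason}` tuple, the steps can be chained with `with`, which stops at the first error and returns it unchanged. The sketch below uses a stand-in module, `StubLanguage`, whose tokens are plain strings rather than `Nasty.AST.Token` structs; both it and `PipelineSketch.analyze/3` are hypothetical.

```elixir
# Stand-in for a real language implementation (assumption for illustration).
defmodule StubLanguage do
  def tokenize("", _opts), do: {:error, :empty_input}
  def tokenize(text, _opts), do: {:ok, String.split(text)}

  # Tags every token with the placeholder tag :x.
  def tag_pos(tokens, _opts), do: {:ok, Enum.map(tokens, &{&1, :x})}

  def parse(tagged, _opts), do: {:ok, %{tokens: tagged}}
end

defmodule PipelineSketch do
  # Runs tokenize -> tag_pos -> parse; the first {:error, reason} short-circuits.
  def analyze(language, text, opts \\ []) do
    with {:ok, tokens} <- language.tokenize(text, opts),
         {:ok, tagged} <- language.tag_pos(tokens, opts),
         {:ok, document} <- language.parse(tagged, opts) do
      {:ok, document}
    end
  end
end

PipelineSketch.analyze(StubLanguage, "the cat")
# => {:ok, %{tokens: [{"the", :x}, {"cat", :x}]}}

PipelineSketch.analyze(StubLanguage, "")
# => {:error, :empty_input}
```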
Callbacks
@callback language_code() :: atom()
Returns the ISO 639-1 language code for this implementation.
Examples
iex> Nasty.Language.English.language_code()
:en
iex> Nasty.Language.Spanish.language_code()
:es
@callback metadata() :: map()
Returns metadata about the language implementation.
Optional callback providing information about the implementation:
- Version
- Supported features
- Performance characteristics
- Dependencies
Examples
iex> Nasty.Language.English.metadata()
%{
  version: "1.0.0",
  features: [:tokenization, :pos_tagging, :parsing, :ner],
  parser_type: :nimble_parsec
}
@callback parse(tokens :: [Nasty.AST.Token.t()], opts :: options()) :: parse_result()
Parses tokens into a complete AST (Document structure).
Parsing includes:
- Phrase structure building (NP, VP, PP, etc.)
- Clause and sentence identification
- Dependency relation extraction (if enabled)
- Semantic analysis (if enabled)
Parameters
- tokens - POS-tagged tokens
- opts - Parsing options
  - :parse_dependencies - Extract dependency relations (default: true)
  - :extract_entities - Perform NER (default: false)
  - :resolve_coreferences - Resolve references (default: false)
Returns
- {:ok, document} - Complete Document AST
- {:error, reason} - Parse error with details
Examples
iex> tokens = [tagged_tokens...]
iex> Nasty.Language.English.parse(tokens, parse_dependencies: true)
{:ok, %Document{paragraphs: [...], ...}}
@callback render(ast :: struct(), opts :: options()) :: render_result()
Renders an AST back to natural language text.
Rendering includes:
- Surface realization (choosing word forms)
- Agreement (subject-verb, determiner-noun, etc.)
- Word order (language-specific ordering rules)
- Punctuation insertion
- Formatting (capitalization, spacing)
Parameters
- ast - AST node to render (Document, Sentence, Phrase, etc.)
- opts - Rendering options
Returns
- {:ok, text} - Rendered natural language text
- {:error, reason} - Rendering error
Examples
iex> doc = %Document{...}
iex> Nasty.Language.English.render(doc, [])
{:ok, "The cat sat on the mat."}
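To make the punctuation and formatting steps concrete, here is a deliberately naive sketch that joins a flat list of word strings, attaches punctuation without a leading space, and capitalizes the first character. It does not handle agreement or word order, and it works on plain strings rather than Nasty's AST; `RenderSketch` and its punctuation list are assumptions made for this example.

```elixir
defmodule RenderSketch do
  # Punctuation marks that attach to the preceding word (illustrative list).
  @punct [".", ",", "!", "?", ";", ":"]

  def render(words, _opts \\ []) when is_list(words) do
    text =
      words
      |> Enum.reduce("", fn
        word, "" -> word
        word, acc when word in @punct -> acc <> word
        word, acc -> acc <> " " <> word
      end)
      |> capitalize_first()

    {:ok, text}
  end

  # Upcases only the first grapheme and leaves the rest untouched.
  defp capitalize_first(""), do: ""

  defp capitalize_first(text) do
    {first, rest} = String.split_at(text, 1)
    String.upcase(first) <> rest
  end
end

RenderSketch.render(["the", "cat", "sat", "on", "the", "mat", "."])
# => {:ok, "The cat sat on the mat."}
```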
@callback tag_pos(tokens :: [Nasty.AST.Token.t()], opts :: options()) :: tokenize_result()
Tags tokens with part-of-speech information.
POS tagging assigns Universal Dependencies tags to each token and extracts morphological features.
Parameters
- tokens - List of tokens from tokenization
- opts - Tagging options
Returns
- {:ok, tagged_tokens} - Tokens with pos_tag and morphology filled
- {:error, reason} - Error during tagging
Examples
iex> tokens = [%Token{text: "cat", ...}]
iex> Nasty.Language.English.tag_pos(tokens, [])
{:ok, [%Token{text: "cat", pos_tag: :noun, ...}]}
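The simplest possible tagger is a dictionary lookup, sketched below to show the input and output shape of this callback. It uses plain maps instead of `Nasty.AST.Token` structs, a four-word hand-written lexicon, and a fallback tag of `:x` for unknown words; all of these, along with the module name `TagSketch`, are assumptions and say nothing about how Nasty's English tagger actually works.

```elixir
defmodule TagSketch do
  # Tiny hand-written lexicon; anything missing falls back to :x.
  @lexicon %{"the" => :det, "cat" => :noun, "sat" => :verb, "." => :punct}

  def tag_pos(tokens, _opts \\ []) do
    tagged =
      Enum.map(tokens, fn %{text: text} = token ->
        # Lookup is case-insensitive; the original token text is preserved.
        Map.put(token, :pos_tag, Map.get(@lexicon, String.downcase(text), :x))
      end)

    {:ok, tagged}
  end
end

TagSketch.tag_pos([%{text: "The"}, %{text: "cat"}, %{text: "purrs"}])
# => {:ok,
#     [
#       %{text: "The", pos_tag: :det},
#       %{text: "cat", pos_tag: :noun},
#       %{text: "purrs", pos_tag: :x}
#     ]}
```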
@callback tokenize(text :: String.t(), opts :: options()) :: tokenize_result()
Tokenizes text into a list of tokens.
Tokenization includes:
- Sentence boundary detection
- Word segmentation
- Handling of contractions, hyphenation, compounds
- Position tracking for each token
Parameters
- text - Raw text to tokenize
- opts - Tokenization options
Returns
- {:ok, tokens} - List of Token structs with position information
- {:error, reason} - Error during tokenization
Examples
iex> Nasty.Language.English.tokenize("Hello world.", [])
{:ok, [
  %Token{text: "Hello", ...},
  %Token{text: "world", ...},
  %Token{text: ".", ...}
]}
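Word segmentation with position tracking can be sketched with `Regex.scan/3` and its `return: :index` option, which yields byte offsets instead of matched strings. The sketch below covers only those two steps (no sentence boundary detection or contraction handling) and returns plain maps; the module name `TokenizeSketch`, the regular expression, and the `:byte_offset` and `:byte_length` keys are assumptions, not the fields of `Nasty.AST.Token`.

```elixir
defmodule TokenizeSketch do
  def tokenize(text, _opts \\ []) when is_binary(text) do
    # A run of letters, digits, apostrophes, or hyphens is one token;
    # any other non-space character is a token on its own.
    pattern = ~r/[\p{L}\p{N}'-]+|[^\s\p{L}\p{N}]/u

    tokens =
      pattern
      |> Regex.scan(text, return: :index)
      |> Enum.map(fn [{start, length}] ->
        # Offsets and lengths are in bytes, not graphemes.
        %{text: binary_part(text, start, length), byte_offset: start, byte_length: length}
      end)

    {:ok, tokens}
  end
end

TokenizeSketch.tokenize("Hello world.")
# => {:ok,
#     [
#       %{text: "Hello", byte_offset: 0, byte_length: 5},
#       %{text: "world", byte_offset: 6, byte_length: 5},
#       %{text: ".", byte_offset: 11, byte_length: 1}
#     ]}
```

Byte offsets differ from character positions as soon as the text contains multi-byte UTF-8 characters, so a real implementation would need to decide which unit its position fields use.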