Nasty.Semantic.Coreference.Neural.MentionEncoder (Nasty v0.3.0)

Neural mention encoder using BiLSTM with attention.

Encodes mentions into fixed-size vector representations by processing the mention tokens and their surrounding context through a bidirectional LSTM with an attention mechanism.

Architecture

  1. Token embeddings (GloVe or trainable)
  2. BiLSTM over context tokens
  3. Attention over mention span
  4. Concatenate: [mention_repr, head_word, context_repr]
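
A minimal Axon sketch of these four stages (assuming Axon ~> 0.5). The attention step is stood in for by mean pooling, and the final [mention_repr, head_word, context_repr] concatenation is collapsed into a single dense projection; the real module's layers and names will differ:

defmodule MentionEncoderSketch do
  def build(vocab_size, embedding_dim, hidden_dim) do
    tokens = Axon.input("tokens", shape: {nil, nil})

    # 1. Token embeddings
    embedded = Axon.embedding(tokens, vocab_size, embedding_dim)

    # 2. BiLSTM: one LSTM over the sequence, one over its reverse
    {fwd, _} = Axon.lstm(embedded, hidden_dim)

    {bwd, _} =
      embedded
      |> Axon.nx(&Nx.reverse(&1, axes: [1]))
      |> Axon.lstm(hidden_dim)

    bwd = Axon.nx(bwd, &Nx.reverse(&1, axes: [1]))
    hidden = Axon.concatenate(fwd, bwd, axis: -1)

    # 3. Attention over the mention span (mean pooling as a stand-in)
    pooled = Axon.nx(hidden, &Nx.mean(&1, axes: [1]))

    # 4. Project down to the final mention encoding
    pooled
    |> Axon.dropout(rate: 0.3)
    |> Axon.dense(hidden_dim * 2)
  end
end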

Example

# Build model
model = MentionEncoder.build_model(
  vocab_size: 50_000,
  embedding_dim: 100,
  hidden_dim: 128
)

# Encode a mention (params come from training the model)
encoding = MentionEncoder.encode_mention(
  model,
  params,
  mention,
  context_tokens,
  vocab
)

Summary

Functions

batch_encode_mentions(model, params, mention_context_pairs, vocab)
Batch encode multiple mentions.

build_model(opts \\ [])
Build the mention encoder model.

build_vocab(documents, opts \\ [])
Build vocabulary from training data.

encode_mention(model, params, mention, context_tokens, vocab)
Encode a mention with its context.

load_glove_embeddings(path, vocab, embedding_dim)
Load pre-trained GloVe embeddings.

Types

encoding()

@type encoding() :: Nx.Tensor.t()

model()

@type model() :: Axon.t()

params()

@type params() :: map()

Functions

batch_encode_mentions(model, params, mention_context_pairs, vocab)

@spec batch_encode_mentions(
  model(),
  params(),
  [{Nasty.AST.Semantic.Mention.t(), [Nasty.AST.Token.t()]}],
  map()
) :: Nx.Tensor.t()

Batch encode multiple mentions.

More efficient than calling encode_mention/5 once per mention.

Parameters

  • model - Trained Axon model
  • params - Model parameters
  • mention_context_pairs - List of {mention, context_tokens} pairs
  • vocab - Token to ID mapping

Returns

Tensor of shape [batch_size, hidden_dim * 2]
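
A hypothetical usage sketch; `mentions`, `contexts`, `model`, `params`, and `vocab` are assumed to come from earlier pipeline stages:

# Pair each mention with its context tokens
pairs = Enum.zip(mentions, contexts)

encodings =
  MentionEncoder.batch_encode_mentions(model, params, pairs, vocab)

# One row per mention: {length(pairs), hidden_dim * 2}
Nx.shape(encodings)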

build_model(opts \\ [])

@spec build_model(keyword()) :: model()

Build the mention encoder model.

Options

  • :vocab_size - Vocabulary size (required)
  • :embedding_dim - Embedding dimension (default: 100)
  • :hidden_dim - LSTM hidden dimension (default: 128)
  • :context_window - Context window size (default: 10)
  • :dropout - Dropout rate (default: 0.3)
  • :use_pretrained - Use pre-trained embeddings (default: false)

Returns

Axon model that takes token IDs and returns mention encodings
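
For illustration, a call exercising the full option set (values other than the documented defaults are arbitrary):

model =
  MentionEncoder.build_model(
    vocab_size: map_size(vocab),
    embedding_dim: 100,
    hidden_dim: 128,
    context_window: 10,
    dropout: 0.3,
    use_pretrained: true
  )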

build_vocab(documents, opts \\ [])

@spec build_vocab(
  [map()],
  keyword()
) :: map()

Build vocabulary from training data.

Parameters

  • documents - OntoNotes documents
  • opts - Keyword list of options

Options

  • :min_count - Minimum token frequency (default: 2)
  • :max_vocab_size - Maximum vocabulary size (default: 50_000)

Returns

Map from token text to ID
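
A short usage sketch, assuming `documents` holds parsed OntoNotes documents:

vocab = MentionEncoder.build_vocab(documents, min_count: 2, max_vocab_size: 50_000)

# Look up a token's ID (nil if the token was filtered out)
Map.get(vocab, "president")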

encode_mention(model, params, mention, context_tokens, vocab)

@spec encode_mention(
  model(),
  params(),
  Nasty.AST.Semantic.Mention.t(),
  [Nasty.AST.Token.t()],
  map()
) :: encoding()

Encode a mention with its context.

Parameters

  • model - Trained Axon model
  • params - Model parameters
  • mention - Mention struct
  • context_tokens - List of context tokens
  • vocab - Token to ID mapping

Returns

Tensor of shape [hidden_dim * 2] encoding the mention
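
As a usage sketch, two encodings can be compared with cosine similarity, a common coreference scoring baseline (the scoring step itself is not part of this module):

e1 = MentionEncoder.encode_mention(model, params, mention_a, context_tokens, vocab)
e2 = MentionEncoder.encode_mention(model, params, mention_b, context_tokens, vocab)

cosine =
  Nx.divide(
    Nx.dot(e1, e2),
    Nx.multiply(Nx.LinAlg.norm(e1), Nx.LinAlg.norm(e2))
  )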

load_glove_embeddings(path, vocab, embedding_dim)

@spec load_glove_embeddings(Path.t(), map(), pos_integer()) ::
  {:ok, Nx.Tensor.t()} | {:error, term()}

Load pre-trained GloVe embeddings.

Parameters

  • path - Path to GloVe file (e.g., "glove.6B.100d.txt")
  • vocab - Vocabulary map
  • embedding_dim - Embedding dimension

Returns

{:ok, tensor} where tensor has shape [vocab_size, embedding_dim] and holds the pre-trained embeddings, or {:error, reason} if loading fails
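
A sketch of loading the vectors and handling both result shapes; the file path follows the example above:

case MentionEncoder.load_glove_embeddings("glove.6B.100d.txt", vocab, 100) do
  {:ok, embeddings} ->
    # {map_size(vocab), 100}
    Nx.shape(embeddings)

  {:error, reason} ->
    IO.warn("could not load GloVe vectors: #{inspect(reason)}")
end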