Nasty.Statistics.Neural.Embeddings (Nasty v0.3.0)


Word and character embedding utilities for neural models.

Provides:

  • Pre-trained embedding loading (GloVe, FastText)
  • Random embedding initialization
  • Vocabulary management
  • Efficient embedding lookup
  • Embedding caching

Example

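alias Nasty.Statistics.Neural.Embeddings

# Create vocabulary from corpus (return_struct: true is required to get {:ok, vocab})
{:ok, vocab} = Embeddings.build_vocabulary(corpus, min_freq: 2, return_struct: true)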

# Initialize random embeddings
{:ok, embeddings} = Embeddings.init_random(vocab, embedding_dim: 300)

# Load pre-trained GloVe embeddings
{:ok, embeddings} = Embeddings.load_glove("glove.6B.300d.txt", vocab)

# Look up word embeddings
{:ok, vector} = Embeddings.lookup(embeddings, "cat")

Summary

Functions

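build_char_vocabulary(words_nested, opts \\ [])

Builds a character vocabulary from a list of words.

build_vocabulary(corpus, opts \\ [])

Builds a vocabulary from a corpus of sentences.

create_char_embedding_layer(char_vocab, opts \\ [])

Creates a character embedding layer (placeholder for Axon integration).

create_embedding_layer(vocab, opts \\ [])

Creates an embedding layer (placeholder for Axon integration).

ids_to_words(vocab, id_tensor, opts \\ [])

Converts a tensor of word IDs back to words.

init_random(vocab, opts \\ [])

Initializes random embeddings for a vocabulary.

load_glove(path, vocab, opts \\ [])

Loads pre-trained GloVe embeddings.

lookup(embeddings, word, opts \\ [])

Looks up the embedding vector for a word.

special_token_ids(vocab)

Returns special token IDs.

word_to_index(word, vocab, unk_value \\ nil)

Converts a single word to its vocabulary index.

words_to_ids(vocab, words, opts \\ [])

Converts a list of words to a tensor of word IDs.

words_to_indices(words, vocab)

Converts a list of words to a list of indices.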

Types

embeddings()

@type embeddings() :: %{
  vocab: vocabulary(),
  vectors: Nx.Tensor.t(),
  embedding_dim: pos_integer()
}

vocabulary()

@type vocabulary() :: %{
  word_to_id: map(),
  id_to_word: map(),
  frequencies: map(),
  size: non_neg_integer()
}

Functions

build_char_vocabulary(words_nested, opts \\ [])

@spec build_char_vocabulary(
  [[String.t()]] | [String.t()],
  keyword()
) :: map()

Builds a character vocabulary from a list of words.

Parameters

  • words_nested - List of words, or a list of word lists (nested)
  • opts - Vocabulary options

Returns

  • {:ok, char_vocab} - Character to ID mapping (note that the @spec above declares a bare map() return)

build_vocabulary(corpus, opts \\ [])

@spec build_vocabulary(
  [[String.t()]],
  keyword()
) :: map() | {:ok, vocabulary()}

Builds a vocabulary from a corpus of sentences.

By default, returns a simple word -> id map. When return_struct: true is passed, returns {:ok, vocab} with the full vocabulary struct instead.

Parameters

  • corpus - List of sentences (each sentence is a list of words)
  • opts - Vocabulary options

Options

  • :min_freq - Minimum word frequency to include (default: 1)
  • :max_size - Maximum vocabulary size (default: unlimited)
  • :special_tokens - Include special tokens (default: true)
  • :lowercase - Convert all words to lowercase (default: false)
  • :return_struct - Return {:ok, vocabulary} with the full vocabulary struct instead of a plain map (default: false)

Returns

  • Simple map %{word => id} by default
  • {:ok, vocabulary} when return_struct: true
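
The effect of :min_freq, :max_size, and :lowercase can be pictured with a minimal sketch in plain Elixir. This is not Nasty's implementation: the module name VocabSketch, the token names "<PAD>" and "<UNK>", the ordering by descending frequency, and the choice to apply :max_size before adding special tokens are all assumptions made for illustration.

```elixir
defmodule VocabSketch do
  # Hypothetical helper, not part of Nasty: builds a word => id map
  # from a corpus given as a list of tokenized sentences.
  @special_tokens ["<PAD>", "<UNK>"]  # assumed token names

  def build(corpus, opts \\ []) do
    min_freq = Keyword.get(opts, :min_freq, 1)
    max_size = Keyword.get(opts, :max_size, :infinity)
    lowercase? = Keyword.get(opts, :lowercase, false)

    words =
      corpus
      |> List.flatten()
      |> Enum.map(fn w -> if lowercase?, do: String.downcase(w), else: w end)
      |> Enum.frequencies()
      |> Enum.filter(fn {_word, count} -> count >= min_freq end)
      # Most frequent first; ties broken alphabetically for determinism.
      |> Enum.sort_by(fn {word, count} -> {-count, word} end)
      |> Enum.map(fn {word, _count} -> word end)
      |> take(max_size)

    # Special tokens take the lowest IDs (an assumption of this sketch).
    (@special_tokens ++ words)
    |> Enum.with_index()
    |> Map.new()
  end

  defp take(words, :infinity), do: words
  defp take(words, n), do: Enum.take(words, n)
end

corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["a", "dog"]]
vocab = VocabSketch.build(corpus, min_freq: 2)
IO.inspect(vocab)
```

With min_freq: 2, only "the" and "cat" survive the frequency filter, so the resulting map holds the two assumed special tokens plus those two words.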

create_char_embedding_layer(char_vocab, opts \\ [])

Creates a character embedding layer (placeholder for Axon integration).

Parameters

  • char_vocab - Character vocabulary map
  • opts - Layer options

Options

  • :embedding_dim - Embedding dimension (default: 50)

Returns

A function that can be used to create character embeddings.

create_embedding_layer(vocab, opts \\ [])

Creates an embedding layer (placeholder for Axon integration).

Parameters

  • vocab - Vocabulary map
  • opts - Layer options

Options

  • :embedding_dim - Embedding dimension (default: 300)

Returns

A function that can be used to create embeddings.

ids_to_words(vocab, id_tensor, opts \\ [])

@spec ids_to_words(vocabulary(), Nx.Tensor.t(), keyword()) :: {:ok, [String.t()]}

Converts a tensor of word IDs back to words.

Parameters

  • vocab - Vocabulary struct
  • id_tensor - Tensor of word IDs
  • opts - Conversion options

Returns

  • {:ok, words} - List of words
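
As a rough illustration of the reverse mapping, the sketch below decodes a plain list of IDs with an id => word map. It is not the library's code: DecodeSketch, the "<UNK>" fallback string, and the use of a list instead of a tensor are assumptions. A real Nx tensor would first need converting to a list, for example with Nx.to_flat_list/1.

```elixir
defmodule DecodeSketch do
  # Hypothetical helper, not part of Nasty: maps IDs back to words using an
  # id => word map, substituting `unknown` for IDs that are not present.
  def decode(id_to_word, ids, unknown \\ "<UNK>") do
    Enum.map(ids, fn id -> Map.get(id_to_word, id, unknown) end)
  end
end

word_to_id = %{"<PAD>" => 0, "<UNK>" => 1, "cat" => 2, "sat" => 3}

# Invert the word => id map to obtain the id => word direction.
id_to_word = Map.new(word_to_id, fn {word, id} -> {id, word} end)

words = DecodeSketch.decode(id_to_word, [2, 3, 99])
IO.inspect(words)
```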

init_random(vocab, opts \\ [])

@spec init_random(
  vocabulary(),
  keyword()
) :: {:ok, embeddings()}

Initializes random embeddings for a vocabulary.

Parameters

  • vocab - Vocabulary struct
  • opts - Embedding options

Options

  • :embedding_dim - Embedding dimensionality (default: 300)
  • :init_method - Initialization method: :uniform, :normal, :xavier (default: :uniform)
  • :scale - Initialization scale (default: 0.1)

Returns

  • {:ok, embeddings} - Embeddings struct with random vectors
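
The three :init_method values correspond to common initialization schemes. The sketch below shows one plausible reading of each in plain Elixir, using nested lists where the library stores an Nx tensor. InitSketch is a hypothetical module, and the exact meaning of :scale, as well as the choice of vocabulary size and embedding dimension as fan-in and fan-out for Xavier, are assumptions rather than documented behavior.

```elixir
defmodule InitSketch do
  # Hypothetical helper, not part of Nasty: returns vocab_size rows of
  # embedding_dim random floats as nested lists.
  def init(vocab_size, embedding_dim, opts \\ []) do
    method = Keyword.get(opts, :init_method, :uniform)
    scale = Keyword.get(opts, :scale, 0.1)

    for _row <- 1..vocab_size//1 do
      for _col <- 1..embedding_dim//1 do
        sample(method, scale, vocab_size, embedding_dim)
      end
    end
  end

  # Uniform in [-scale, scale).
  defp sample(:uniform, scale, _rows, _cols), do: (:rand.uniform() * 2 - 1) * scale

  # Gaussian with mean 0 and standard deviation `scale`.
  defp sample(:normal, scale, _rows, _cols), do: :rand.normal() * scale

  # Xavier/Glorot uniform: limit = sqrt(6 / (fan_in + fan_out)); `scale` is unused.
  defp sample(:xavier, _scale, rows, cols) do
    limit = :math.sqrt(6 / (rows + cols))
    (:rand.uniform() * 2 - 1) * limit
  end
end

vectors = InitSketch.init(4, 3, init_method: :xavier)
IO.inspect(vectors)
```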

load_glove(path, vocab, opts \\ [])

@spec load_glove(Path.t(), vocabulary(), keyword()) ::
  {:ok, embeddings()} | {:error, term()}

Loads pre-trained GloVe embeddings.

Parameters

  • path - Path to GloVe file (e.g., "glove.6B.300d.txt")
  • vocab - Vocabulary to load embeddings for
  • opts - Loading options

Options

  • :embedding_dim - Expected embedding dimension (auto-detected if not provided)
  • :lowercase - Lowercase words when matching (default: true)

Returns

  • {:ok, embeddings} - Embeddings struct with pre-trained vectors
  • {:error, reason} - Loading error

GloVe Format

Each line: word val1 val2 ... valn
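
To make the line format concrete, here is a small sketch that parses a single GloVe line into a word and a list of floats. GloveLineSketch and its error atoms are hypothetical and do not reflect how load_glove is implemented. The length of the parsed vector is one way the embedding dimension could be inferred when :embedding_dim is not given.

```elixir
defmodule GloveLineSketch do
  # Hypothetical helper, not part of Nasty: parses "word val1 val2 ... valn"
  # into {:ok, word, [float]} or returns an error tuple.
  def parse_line(line) do
    case String.split(String.trim(line), " ", trim: true) do
      [word | values] when values != [] ->
        parsed = Enum.map(values, &Float.parse/1)

        # Every value must parse completely, with no trailing characters.
        if Enum.all?(parsed, &match?({_value, ""}, &1)) do
          {:ok, word, Enum.map(parsed, fn {value, ""} -> value end)}
        else
          {:error, :invalid_value}
        end

      _ ->
        {:error, :malformed_line}
    end
  end
end

result = GloveLineSketch.parse_line("cat 0.1 -0.2 0.3\n")
IO.inspect(result)
```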

lookup(embeddings, word, opts \\ [])

@spec lookup(embeddings(), String.t(), keyword()) ::
  {:ok, Nx.Tensor.t()} | {:error, term()}

Looks up the embedding vector for a word.

Parameters

  • embeddings - Embeddings struct
  • word - Word to look up
  • opts - Lookup options

Options

  • :default - Return this if word not found (default: UNK embedding)

Returns

  • {:ok, vector} - Embedding vector (Nx.Tensor)
  • {:error, :not_found} - Word not in vocabulary
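
One way to read the two return values together is that an unknown word falls back to the UNK vector when the vocabulary has one, and yields {:error, :not_found} otherwise. The sketch below encodes that reading with plain maps and lists. It is an interpretation, not confirmed library behavior, and LookupSketch and the "<UNK>" token name are assumptions.

```elixir
defmodule LookupSketch do
  # Hypothetical helper, not part of Nasty. `word_to_id` maps words to row
  # indices and `vectors` is a list of rows (the library uses an Nx tensor).
  @unk "<UNK>"  # assumed name of the unknown-word token

  def lookup(word_to_id, vectors, word) do
    # Try the word itself, then the UNK token; ID 0 is truthy in Elixir.
    case Map.get(word_to_id, word) || Map.get(word_to_id, @unk) do
      nil -> {:error, :not_found}
      id -> {:ok, Enum.at(vectors, id)}
    end
  end
end

word_to_id = %{"<UNK>" => 0, "cat" => 1}
vectors = [[0.0, 0.0], [0.5, -0.5]]

{:ok, cat} = LookupSketch.lookup(word_to_id, vectors, "cat")
{:ok, unknown} = LookupSketch.lookup(word_to_id, vectors, "zebra")
IO.inspect({cat, unknown})
```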

special_token_ids(vocab)

@spec special_token_ids(vocabulary()) :: map()

Returns special token IDs.

word_to_index(word, vocab, unk_value \\ nil)

@spec word_to_index(String.t(), map() | vocabulary(), integer()) :: integer()

Converts a single word to its vocabulary index.

Parameters

  • word - Word to look up
  • vocab - Vocabulary map or vocabulary struct
  • unk_value - Value to return if word not found (default: nil, which falls back to the UNK id)

Returns

Integer index.
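
Accepting either a plain map or a vocabulary struct can be handled with pattern matching on the :word_to_id key. The following is a sketch under that assumption. IndexSketch and the "<UNK>" token name are hypothetical, and the real function may resolve the fallback differently.

```elixir
defmodule IndexSketch do
  # Hypothetical helper, not part of Nasty.
  @unk "<UNK>"  # assumed name of the unknown-word token

  def word_to_index(word, vocab, unk_value \\ nil)

  # Vocabulary struct shape: delegate to its :word_to_id map.
  def word_to_index(word, %{word_to_id: word_to_id}, unk_value) when is_map(word_to_id) do
    word_to_index(word, word_to_id, unk_value)
  end

  # Plain word => id map.
  def word_to_index(word, vocab, unk_value) when is_map(vocab) do
    # With no explicit unk_value, fall back to the ID of the UNK token.
    fallback = if is_nil(unk_value), do: Map.get(vocab, @unk), else: unk_value
    Map.get(vocab, word, fallback)
  end
end

plain = %{"<UNK>" => 1, "cat" => 2}
struct_like = %{word_to_id: plain, id_to_word: %{}, frequencies: %{}, size: 2}

IO.inspect(IndexSketch.word_to_index("cat", plain))
IO.inspect(IndexSketch.word_to_index("dog", struct_like))
```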

words_to_ids(vocab, words, opts \\ [])

@spec words_to_ids(vocabulary(), [String.t()], keyword()) :: {:ok, Nx.Tensor.t()}

Converts a list of words to a tensor of word IDs.

Parameters

  • vocab - Vocabulary struct
  • words - List of words
  • opts - Conversion options

Options

  • :max_length - Truncate or pad to this length (default: no padding)
  • :pad_value - Value to use for padding (default: PAD token ID)

Returns

  • {:ok, tensor} - Tensor of word IDs
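
The :max_length and :pad_value options describe a truncate-or-pad step. A minimal sketch of that step on a plain list of IDs follows. PadSketch is a hypothetical module, and padding on the right is an assumption, since the documentation does not say which side is padded.

```elixir
defmodule PadSketch do
  # Hypothetical helper, not part of Nasty: truncates or right-pads a list of
  # IDs so that the result has exactly max_length entries.
  def fit(ids, max_length, pad_value) do
    kept = Enum.take(ids, max_length)
    kept ++ List.duplicate(pad_value, max_length - length(kept))
  end
end

padded = PadSketch.fit([5, 6, 7], 5, 0)
truncated = PadSketch.fit([5, 6, 7], 2, 0)
IO.inspect({padded, truncated})
```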

words_to_indices(words, vocab)

@spec words_to_indices([String.t()], map() | vocabulary()) :: [integer()]

Converts a list of words to a list of indices.

Parameters

  • words - List of words
  • vocab - Vocabulary map or struct

Returns

List of indices.