Nasty.Statistics.Neural.Embeddings (Nasty v0.3.0)

Word and character embedding utilities for neural models.

Provides:

Pre-trained embedding loading (GloVe, FastText)
Random embedding initialization
Vocabulary management
Efficient embedding lookup
Embedding caching

Example

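alias Nasty.Statistics.Neural.Embeddings

# Create vocabulary from corpus (return_struct: true yields {:ok, vocab})
{:ok, vocab} = Embeddings.build_vocabulary(corpus, min_freq: 2, return_struct: true)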

# Initialize random embeddings
{:ok, embeddings} = Embeddings.init_random(vocab, embedding_dim: 300)

# Load pre-trained GloVe embeddings
{:ok, embeddings} = Embeddings.load_glove("glove.6B.300d.txt", vocab)

# Look up word embeddings
{:ok, vector} = Embeddings.lookup(embeddings, "cat")

Summary

Types

embeddings()

vocabulary()

Functions

build_char_vocabulary(words_nested, opts \\ [])

Builds a character vocabulary from a list of words.

build_vocabulary(corpus, opts \\ [])

Builds a vocabulary from a corpus of sentences.

create_char_embedding_layer(char_vocab, opts \\ [])

Creates a character embedding layer (placeholder for Axon integration).

create_embedding_layer(vocab, opts \\ [])

Creates an embedding layer (placeholder for Axon integration).

ids_to_words(vocab, id_tensor, opts \\ [])

Converts a tensor of word IDs back to words.

init_random(vocab, opts \\ [])

Initializes random embeddings for a vocabulary.

load_glove(path, vocab, opts \\ [])

Loads pre-trained GloVe embeddings.

lookup(embeddings, word, opts \\ [])

Looks up the embedding vector for a word.

special_token_ids(vocab)

Returns special token IDs.

word_to_index(word, vocab, unk_value \\ nil)

Converts a single word to its vocabulary index.

words_to_ids(vocab, words, opts \\ [])

Converts a list of words to a tensor of word IDs.

words_to_indices(words, vocab)

Converts a list of words to a list of indices.

Types

embeddings()

@type embeddings() :: %{
  vocab: vocabulary(),
  vectors: Nx.Tensor.t(),
  embedding_dim: pos_integer()
}

vocabulary()

@type vocabulary() :: %{
  word_to_id: map(),
  id_to_word: map(),
  frequencies: map(),
  size: non_neg_integer()
}

Functions

build_char_vocabulary(words_nested, opts \\ [])

@spec build_char_vocabulary(
  [[String.t()]] | [String.t()],
  keyword()
) :: map()

Builds a character vocabulary from a list of words.

Parameters

words_nested - List of words, or a list of word lists (nested input is accepted)
opts - Vocabulary options

Returns

{:ok, char_vocab} - Character to ID mapping
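As a rough illustration of what a character vocabulary looks like, the sketch below flattens nested word lists, splits each word into graphemes, and assigns IDs. It is not Nasty's implementation: the module name CharVocabSketch, the "<PAD>" and "<UNK>" token names, and their IDs (0 and 1) are assumptions made for the example.

```elixir
defmodule CharVocabSketch do
  # Hypothetical helper, not part of Nasty.
  # Assumption: IDs 0 and 1 are reserved for padding and unknown characters.
  @pad "<PAD>"
  @unk "<UNK>"

  # Accepts either a flat list of words or a list of word lists.
  def build(words_nested) do
    chars =
      words_nested
      |> List.flatten()
      |> Enum.flat_map(&String.graphemes/1)
      |> Enum.uniq()
      |> Enum.sort()

    # Special tokens come first, then characters in sorted order.
    [@pad, @unk | chars]
    |> Enum.with_index()
    |> Map.new()
  end
end

char_vocab = CharVocabSketch.build([["cat", "act"], ["tab"]])
IO.inspect(char_vocab)
```

Sorting the characters before assigning IDs keeps the mapping deterministic across runs, which matters if the vocabulary is rebuilt at inference time.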

build_vocabulary(corpus, opts \\ [])

@spec build_vocabulary(
  [[String.t()]],
  keyword()
) :: map() | {:ok, vocabulary()}

Builds a vocabulary from a corpus of sentences.

By default, returns a plain word => id map. When return_struct: true is passed, returns {:ok, vocab}, where vocab is the full vocabulary() map.

Parameters

corpus - List of sentences (each sentence is a list of words)
opts - Vocabulary options

Options

:min_freq - Minimum word frequency to include (default: 1)
:max_size - Maximum vocabulary size (default: unlimited)
:special_tokens - Include special tokens (default: true)
:lowercase - Convert all words to lowercase (default: false)
:return_struct - Return {:ok, vocabulary} with the full vocabulary() map instead of a plain word => id map (default: false)

Returns

Simple map %{word => id} by default
{:ok, vocabulary} when return_struct: true
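The options above can be pictured with a small frequency-counting sketch. This is a hypothetical re-implementation for illustration only, not Nasty's source: the module name VocabSketch, the special token names, the ordering by descending frequency, and the choice to apply :max_size before adding special tokens are all assumptions.

```elixir
defmodule VocabSketch do
  # Hypothetical illustration; not Nasty's implementation.
  # Assumption: two special tokens occupy IDs 0 and 1.
  @special ["<PAD>", "<UNK>"]

  def build(corpus, opts \\ []) do
    min_freq = Keyword.get(opts, :min_freq, 1)
    max_size = Keyword.get(opts, :max_size, :infinity)
    lowercase? = Keyword.get(opts, :lowercase, false)

    frequencies =
      corpus
      |> List.flatten()
      |> Enum.map(fn w -> if lowercase?, do: String.downcase(w), else: w end)
      |> Enum.frequencies()

    words =
      frequencies
      |> Enum.filter(fn {_w, n} -> n >= min_freq end)
      # Most frequent first; ties broken alphabetically for determinism.
      |> Enum.sort_by(fn {w, n} -> {-n, w} end)
      |> Enum.map(&elem(&1, 0))
      |> take(max_size)

    word_to_id = (@special ++ words) |> Enum.with_index() |> Map.new()

    # Same four keys as the vocabulary() type.
    %{
      word_to_id: word_to_id,
      id_to_word: Map.new(word_to_id, fn {w, id} -> {id, w} end),
      frequencies: frequencies,
      size: map_size(word_to_id)
    }
  end

  defp take(words, :infinity), do: words
  defp take(words, n), do: Enum.take(words, n)
end

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["The", "end"]]
vocab = VocabSketch.build(corpus, min_freq: 2, lowercase: true)
IO.inspect(vocab.word_to_id)
```

With min_freq: 2 and lowercase: true, only "the" and "sat" survive the frequency filter in this toy corpus; every other word would map to the unknown token at lookup time.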

create_char_embedding_layer(char_vocab, opts \\ [])

Creates a character embedding layer (placeholder for Axon integration).

Parameters

char_vocab - Character vocabulary map
opts - Layer options

Options

:embedding_dim - Embedding dimension (default: 50)

Returns

A function that can be used to create character embeddings.

create_embedding_layer(vocab, opts \\ [])

Creates an embedding layer (placeholder for Axon integration).

Parameters

vocab - Vocabulary map
opts - Layer options

Options

:embedding_dim - Embedding dimension (default: 300)

Returns

A function that can be used to create embeddings.

ids_to_words(vocab, id_tensor, opts \\ [])

@spec ids_to_words(vocabulary(), Nx.Tensor.t(), keyword()) :: {:ok, [String.t()]}

Converts a tensor of word IDs back to words.

Parameters

vocab - Vocabulary struct
id_tensor - Tensor of word IDs
opts - Conversion options

Returns

{:ok, words} - List of words

init_random(vocab, opts \\ [])

@spec init_random(
  vocabulary(),
  keyword()
) :: {:ok, embeddings()}

Initializes random embeddings for a vocabulary.

Parameters

vocab - Vocabulary struct
opts - Embedding options

Options

:embedding_dim - Embedding dimensionality (default: 300)
:init_method - Initialization method: :uniform, :normal, :xavier (default: :uniform)
:scale - Initialization scale (default: 0.1)

Returns

{:ok, embeddings} - Embeddings struct with random vectors
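To make the three :init_method values concrete, here is a minimal sketch that fills a vocab_size x embedding_dim matrix using Erlang's :rand module. It returns plain lists rather than an Nx tensor so that it runs without dependencies. The module name InitSketch is hypothetical, and the reading of :xavier as Glorot uniform initialization (limit sqrt(6 / (rows + cols))) is an assumption; Nasty may define it differently.

```elixir
defmodule InitSketch do
  # Hypothetical helper: returns a vocab_size x dim list of lists,
  # not an Nx tensor as Nasty's init_random/2 does.
  def init(vocab_size, dim, method \\ :uniform, scale \\ 0.1) do
    for _ <- 1..vocab_size do
      for _ <- 1..dim, do: sample(method, scale, vocab_size, dim)
    end
  end

  # Uniform in [-scale, scale).
  defp sample(:uniform, scale, _rows, _cols), do: (:rand.uniform() * 2 - 1) * scale

  # Gaussian with mean 0 and standard deviation `scale`.
  defp sample(:normal, scale, _rows, _cols), do: :rand.normal() * scale

  # Assumed Glorot/Xavier uniform: the limit depends on the matrix shape,
  # not on `scale`.
  defp sample(:xavier, _scale, rows, cols) do
    limit = :math.sqrt(6 / (rows + cols))
    (:rand.uniform() * 2 - 1) * limit
  end
end

vectors = InitSketch.init(4, 3, :uniform, 0.1)
IO.inspect(vectors)
```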

load_glove(path, vocab, opts \\ [])

@spec load_glove(Path.t(), vocabulary(), keyword()) ::
  {:ok, embeddings()} | {:error, term()}

Loads pre-trained GloVe embeddings.

Parameters

path - Path to GloVe file (e.g., "glove.6B.300d.txt")
vocab - Vocabulary to load embeddings for
opts - Loading options

Options

:embedding_dim - Expected embedding dimension (auto-detected if not provided)
:lowercase - Lowercase words when matching (default: true)

Returns

{:ok, embeddings} - Embeddings struct with pre-trained vectors
{:error, reason} - Loading error

GloVe Format

Each line: word val1 val2 ... valn
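Parsing that line format is straightforward; the sketch below shows one way to do it in plain Elixir and to keep only the words that appear in a vocabulary. The module GloveLineSketch and its functions are hypothetical and are not Nasty's loader; vectors are returned as lists of floats rather than Nx tensors.

```elixir
defmodule GloveLineSketch do
  # Hypothetical parser for GloVe text lines; not Nasty's loader.

  # Parses "word v1 v2 ... vn" into {:ok, {word, [floats]}}.
  def parse_line(line) do
    with [word | [_ | _] = values] <- String.split(String.trim(line), " ", trim: true),
         {:ok, vector} <- parse_values(values) do
      {:ok, {word, vector}}
    else
      _ -> {:error, :malformed_line}
    end
  end

  # Keeps only vectors for words present in `word_to_id`; skips bad lines.
  def load(lines, word_to_id) do
    for line <- lines,
        {:ok, {word, vector}} <- [parse_line(line)],
        Map.has_key?(word_to_id, word),
        into: %{},
        do: {word, vector}
  end

  # Every field must parse fully as a float (no trailing characters).
  defp parse_values(values) do
    parsed = Enum.map(values, &Float.parse/1)

    if Enum.all?(parsed, &match?({_, ""}, &1)) do
      {:ok, Enum.map(parsed, &elem(&1, 0))}
    else
      :error
    end
  end
end

lines = ["cat 0.1 -0.2 0.3\n", "dog 0.4 0.5 0.6\n", "broken\n"]
IO.inspect(GloveLineSketch.load(lines, %{"cat" => 2}))
```

For a real file, the list of lines could be replaced with File.stream!/1 so the whole file is never held in memory at once.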

lookup(embeddings, word, opts \\ [])

@spec lookup(embeddings(), String.t(), keyword()) ::
  {:ok, Nx.Tensor.t()} | {:error, term()}

Looks up the embedding vector for a word.

Parameters

embeddings - Embeddings struct
word - Word to look up
opts - Lookup options

Options

:default - Return this if word not found (default: UNK embedding)

Returns

{:ok, vector} - Embedding vector (Nx.Tensor)
{:error, :not_found} - Word not in vocabulary
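One plausible reading of the fallback order described above is: use the word's own vector if present, otherwise an explicit :default, otherwise the UNK vector, and only then report :not_found. The sketch below encodes that reading with plain lists in place of Nx tensors. The module LookupSketch, the "<UNK>" token name, and the precedence itself are assumptions, not confirmed behavior of Nasty.

```elixir
defmodule LookupSketch do
  # Hypothetical: `vectors` is a list of rows indexed by word ID,
  # standing in for the Nx tensor in the embeddings() type.
  def lookup(%{vocab: %{word_to_id: ids}, vectors: rows}, word, opts \\ []) do
    case Map.fetch(ids, word) do
      {:ok, id} ->
        {:ok, Enum.at(rows, id)}

      :error ->
        cond do
          # An explicit :default wins over the UNK vector.
          Keyword.has_key?(opts, :default) -> {:ok, Keyword.fetch!(opts, :default)}
          # Assumed UNK token name.
          Map.has_key?(ids, "<UNK>") -> {:ok, Enum.at(rows, ids["<UNK>"])}
          true -> {:error, :not_found}
        end
    end
  end
end

embeddings = %{
  vocab: %{word_to_id: %{"<UNK>" => 0, "cat" => 1}},
  vectors: [[0.0, 0.0], [1.0, 2.0]],
  embedding_dim: 2
}

IO.inspect(LookupSketch.lookup(embeddings, "cat"))
IO.inspect(LookupSketch.lookup(embeddings, "dog"))
```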

special_token_ids(vocab)

@spec special_token_ids(vocabulary()) :: map()

Returns special token IDs.

word_to_index(word, vocab, unk_value \\ nil)

@spec word_to_index(String.t(), map() | vocabulary(), integer() | nil) :: integer()

Converts a single word to its vocabulary index.

Parameters

word - Word to look up
vocab - Vocabulary map or vocabulary struct
unk_value - Value to return if the word is not found (default: nil, which falls back to the UNK ID)

Returns

Integer index.

words_to_ids(vocab, words, opts \\ [])

@spec words_to_ids(vocabulary(), [String.t()], keyword()) :: {:ok, Nx.Tensor.t()}

Converts a list of words to a tensor of word IDs.

Parameters

vocab - Vocabulary struct
words - List of words
opts - Conversion options

Options

:max_length - Truncate or pad to this length (default: no padding)
:pad_value - Value to use for padding (default: PAD token ID)

Returns

{:ok, tensor} - Tensor of word IDs
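The interaction of :max_length and :pad_value can be sketched on plain lists as follows. This is an illustration under assumptions, not Nasty's code: the module PadSketch is hypothetical, the special tokens are assumed to be named "<PAD>" and "<UNK>", and the result is a list rather than an Nx tensor.

```elixir
defmodule PadSketch do
  # Hypothetical helper working on plain lists; Nasty returns an Nx tensor.
  # Assumption: the vocabulary contains "<PAD>" and "<UNK>" entries.
  def to_ids(word_to_id, words, opts \\ []) do
    unk_id = Map.fetch!(word_to_id, "<UNK>")
    pad_id = Keyword.get(opts, :pad_value, Map.fetch!(word_to_id, "<PAD>"))

    # Unknown words map to the UNK ID.
    ids = Enum.map(words, &Map.get(word_to_id, &1, unk_id))

    case Keyword.get(opts, :max_length) do
      # No :max_length means no truncation and no padding.
      nil -> ids
      max -> ids |> Enum.take(max) |> pad(max, pad_id)
    end
  end

  # Right-pads to exactly `max` entries; a no-op when already long enough.
  defp pad(ids, max, pad_id), do: ids ++ List.duplicate(pad_id, max - length(ids))
end

word_to_id = %{"<PAD>" => 0, "<UNK>" => 1, "the" => 2, "cat" => 3}
IO.inspect(PadSketch.to_ids(word_to_id, ["the", "cat", "sat"], max_length: 5))
```

Fixing the length this way lets sentences of different sizes be stacked into a single batch tensor.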

words_to_indices(words, vocab)

@spec words_to_indices([String.t()], map() | vocabulary()) :: [integer()]

Converts a list of words to a list of indices.

Parameters

words - List of words
vocab - Vocabulary map or struct

Returns

List of indices.
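Both word_to_index/3 and words_to_indices/2 accept either a plain word => id map or the full vocabulary map. A minimal sketch of that dual dispatch is shown below. The module IndexSketch is hypothetical, and the fallback to an "<UNK>" entry (or 1 when none exists) for a nil unk_value is an assumption about how the default is resolved.

```elixir
defmodule IndexSketch do
  # Hypothetical illustration of accepting two vocabulary shapes.
  def word_to_index(word, vocab, unk_value \\ nil)

  # Full vocabulary map: unwrap :word_to_id and recurse.
  def word_to_index(word, %{word_to_id: mapping}, unk_value),
    do: word_to_index(word, mapping, unk_value)

  # Plain word => id map.
  def word_to_index(word, mapping, unk_value) when is_map(mapping) do
    # Assumption: nil means "use the UNK ID", looked up under "<UNK>".
    fallback = if is_nil(unk_value), do: Map.get(mapping, "<UNK>", 1), else: unk_value
    Map.get(mapping, word, fallback)
  end

  def words_to_indices(words, vocab), do: Enum.map(words, &word_to_index(&1, vocab))
end

plain = %{"<PAD>" => 0, "<UNK>" => 1, "cat" => 3}
IO.inspect(IndexSketch.words_to_indices(["cat", "zebra"], plain))
IO.inspect(IndexSketch.words_to_indices(["cat", "zebra"], %{word_to_id: plain}))
```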