Nasty.Statistics.Neural.Embeddings (Nasty v0.3.0)
Word and character embedding utilities for neural models.
Provides:
- Pre-trained embedding loading (GloVe, FastText)
- Random embedding initialization
- Vocabulary management
- Efficient embedding lookup
- Embedding caching
Example
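alias Nasty.Statistics.Neural.Embeddings
# Create vocabulary from corpus; return_struct: true yields {:ok, vocab}
{:ok, vocab} = Embeddings.build_vocabulary(corpus, min_freq: 2, return_struct: true)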
# Initialize random embeddings
{:ok, embeddings} = Embeddings.init_random(vocab, embedding_dim: 300)
# Load pre-trained GloVe embeddings
{:ok, embeddings} = Embeddings.load_glove("glove.6B.300d.txt", vocab)
# Look up word embeddings
{:ok, vector} = Embeddings.lookup(embeddings, "cat")
Types
@type embeddings() :: %{ vocab: vocabulary(), vectors: Nx.Tensor.t(), embedding_dim: pos_integer() }
@type vocabulary() :: %{ word_to_id: map(), id_to_word: map(), frequencies: map(), size: non_neg_integer() }
Functions
Builds character vocabulary from a list of words.
Parameters
- words - List of words (can be nested lists)
- opts - Vocabulary options
Returns
- {:ok, char_vocab} - Character to ID mapping
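The character-to-ID mapping described above can be sketched in plain Elixir. This is an illustration only, not Nasty's implementation: the module name CharVocabSketch, the "<PAD>" and "<UNK>" token names, and the choice to reserve IDs 0 and 1 for them are all assumptions.

```elixir
defmodule CharVocabSketch do
  # Hypothetical helper, not part of Nasty.
  # Assumed special tokens; they receive IDs 0 and 1.
  @specials ["<PAD>", "<UNK>"]

  def build(words) do
    chars =
      words
      # Accept nested lists of words
      |> List.flatten()
      |> Enum.flat_map(&String.graphemes/1)
      |> Enum.uniq()
      # Sort so that IDs are deterministic
      |> Enum.sort()

    char_vocab =
      (@specials ++ chars)
      |> Enum.with_index()
      |> Map.new()

    {:ok, char_vocab}
  end
end
```

Sorting the characters before numbering them keeps the IDs stable across runs, which matters if the mapping is saved alongside a trained model.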
@spec build_vocabulary([[String.t()]], keyword()) :: map() | {:ok, vocabulary()}
Builds a vocabulary from a corpus of sentences.
By default, returns a plain word -> id map. When return_struct: true is passed, returns {:ok, vocab} with the full vocabulary struct.
Parameters
- corpus - List of sentences (each sentence is a list of words)
- opts - Vocabulary options
Options
- :min_freq - Minimum word frequency to include (default: 1)
- :max_size - Maximum vocabulary size (default: unlimited)
- :special_tokens - Include special tokens (default: true)
- :lowercase - Convert all words to lowercase (default: false)
- :return_struct - Return full struct (default: false)
Returns
- Simple map %{word => id} by default
- {:ok, vocabulary} when return_struct: true
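As a rough illustration of how the :min_freq, :max_size and :lowercase options could interact, here is a minimal sketch in plain Elixir. It does not call Nasty and is not its implementation. The module name VocabSketch, the "<PAD>"/"<UNK>" token names, the frequency-then-alphabetical ordering, and the decision that :max_size counts regular words only are all assumptions.

```elixir
defmodule VocabSketch do
  # Hypothetical helper, not part of Nasty.
  # Assumed special tokens; they always receive IDs 0 and 1 in this sketch.
  @specials ["<PAD>", "<UNK>"]

  def build(corpus, opts \\ []) do
    min_freq = Keyword.get(opts, :min_freq, 1)
    max_size = Keyword.get(opts, :max_size, :infinity)
    lowercase? = Keyword.get(opts, :lowercase, false)

    frequencies =
      corpus
      |> List.flatten()
      |> Enum.map(fn w -> if lowercase?, do: String.downcase(w), else: w end)
      |> Enum.frequencies()

    kept =
      frequencies
      |> Enum.filter(fn {_word, count} -> count >= min_freq end)
      # Most frequent first; ties broken alphabetically for determinism
      |> Enum.sort_by(fn {word, count} -> {-count, word} end)
      |> Enum.map(fn {word, _count} -> word end)

    # Assumption: max_size limits regular words, not special tokens
    kept = if max_size == :infinity, do: kept, else: Enum.take(kept, max_size)

    (@specials ++ kept)
    |> Enum.with_index()
    |> Map.new()
  end
end
```

With min_freq: 2, words that occur only once are dropped from the map, so they would later resolve to the unknown-word ID.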
Creates a character embedding layer (placeholder for Axon integration).
Parameters
- char_vocab - Character vocabulary map
- opts - Layer options
Options
- :embedding_dim - Embedding dimension (default: 50)
Returns
A function that can be used to create character embeddings.
Creates an embedding layer (placeholder for Axon integration).
Parameters
- vocab - Vocabulary map
- opts - Layer options
Options
- :embedding_dim - Embedding dimension (default: 300)
Returns
A function that can be used to create embeddings.
@spec ids_to_words(vocabulary(), Nx.Tensor.t(), keyword()) :: {:ok, [String.t()]}
Converts a tensor of word IDs back to words.
Parameters
- vocab - Vocabulary struct
- id_tensor - Tensor of word IDs
- opts - Conversion options
Returns
- {:ok, words} - List of words
@spec init_random(vocabulary(), keyword()) :: {:ok, embeddings()}
Initializes random embeddings for a vocabulary.
Parameters
- vocab - Vocabulary struct
- opts - Embedding options
Options
- :embedding_dim - Embedding dimensionality (default: 300)
- :init_method - Initialization method: :uniform, :normal, :xavier (default: :uniform)
- :scale - Initialization scale (default: 0.1)
Returns
- {:ok, embeddings} - Embeddings struct with random vectors
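The three initialization methods named above are commonly defined as follows. This is a sketch under stated assumptions, not Nasty's code. The module InitSketch is hypothetical and returns plain nested lists instead of an Nx tensor. Treating :scale as the half-width of the uniform range and as the standard deviation of the normal distribution, and using the vocabulary size and embedding dimension as fan-in and fan-out for Xavier, are guesses about the intended semantics.

```elixir
defmodule InitSketch do
  # Hypothetical helper, not part of Nasty. Returns `vocab_size` rows of
  # `dim` floats as plain lists; both arguments are assumed to be positive.
  def init(vocab_size, dim, opts \\ []) do
    method = Keyword.get(opts, :init_method, :uniform)
    scale = Keyword.get(opts, :scale, 0.1)

    for _ <- 1..vocab_size do
      for _ <- 1..dim, do: sample(method, scale, vocab_size, dim)
    end
  end

  # Uniform in [-scale, scale)
  defp sample(:uniform, scale, _rows, _cols), do: (:rand.uniform() * 2 - 1) * scale

  # Gaussian with mean 0 and standard deviation `scale`
  defp sample(:normal, scale, _rows, _cols), do: :rand.normal() * scale

  # Xavier/Glorot uniform: limit = sqrt(6 / (fan_in + fan_out))
  defp sample(:xavier, _scale, rows, cols) do
    limit = :math.sqrt(6 / (rows + cols))
    (:rand.uniform() * 2 - 1) * limit
  end
end
```

Small initial values such as the default scale of 0.1 keep early activations in a well-behaved range. Xavier initialization instead derives the range from the matrix shape.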
@spec load_glove(Path.t(), vocabulary(), keyword()) :: {:ok, embeddings()} | {:error, term()}
Loads pre-trained GloVe embeddings.
Parameters
- path - Path to GloVe file (e.g., "glove.6B.300d.txt")
- vocab - Vocabulary to load embeddings for
- opts - Loading options
Options
- :embedding_dim - Expected embedding dimension (auto-detected if not provided)
- :lowercase - Lowercase words when matching (default: true)
Returns
- {:ok, embeddings} - Embeddings struct with pre-trained vectors
- {:error, reason} - Loading error
GloVe Format
Each line: word val1 val2 ... valn
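To make the line format concrete, the following sketch parses a single line into a word and a list of floats. It is illustrative only and does not show how load_glove is implemented. The module GloveLineSketch and its error atoms are hypothetical.

```elixir
defmodule GloveLineSketch do
  # Hypothetical helper, not part of Nasty.
  # Parses one line of the form: word val1 val2 ... valn
  def parse_line(line) do
    case line |> String.trim() |> String.split(" ", trim: true) do
      [word | [_ | _] = values] ->
        parsed = Enum.map(values, &Float.parse/1)

        # Every value must parse completely, with no trailing characters
        if Enum.all?(parsed, &match?({_, ""}, &1)) do
          {:ok, word, Enum.map(parsed, fn {value, ""} -> value end)}
        else
          {:error, :invalid_number}
        end

      _ ->
        {:error, :malformed_line}
    end
  end
end
```

The number of parsed values on the first valid line is one plausible way to auto-detect the embedding dimension when :embedding_dim is not given.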
@spec lookup(embeddings(), String.t(), keyword()) :: {:ok, Nx.Tensor.t()} | {:error, term()}
Looks up the embedding vector for a word.
Parameters
- embeddings - Embeddings struct
- word - Word to look up
- opts - Lookup options
Options
- :default - Return this if word not found (default: UNK embedding)
Returns
- {:ok, vector} - Embedding vector (Nx.Tensor)
- {:error, :not_found} - Word not in vocabulary
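One possible reading of how :default, the UNK fallback, and {:error, :not_found} fit together is sketched below. Nasty may resolve these cases differently. The module LookupSketch, the "<UNK>" token name, and the flattened map shape with plain list vectors (instead of an Nx tensor) are assumptions made to keep the example self-contained.

```elixir
defmodule LookupSketch do
  # Hypothetical helper, not part of Nasty. `vectors` is a list of rows
  # indexed by word ID rather than an Nx tensor.
  def lookup(%{word_to_id: word_to_id, vectors: vectors}, word, opts \\ []) do
    case Map.fetch(word_to_id, word) do
      {:ok, id} ->
        {:ok, Enum.at(vectors, id)}

      :error ->
        # An explicit :default wins; otherwise fall back to the UNK row
        case Keyword.fetch(opts, :default) do
          {:ok, default} -> {:ok, default}
          :error -> fallback_to_unk(word_to_id, vectors)
        end
    end
  end

  # Assumption: the error is returned only when no UNK entry exists
  defp fallback_to_unk(word_to_id, vectors) do
    case Map.fetch(word_to_id, "<UNK>") do
      {:ok, unk_id} -> {:ok, Enum.at(vectors, unk_id)}
      :error -> {:error, :not_found}
    end
  end
end
```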
@spec special_token_ids(vocabulary()) :: map()
Returns special token IDs.
@spec word_to_index(String.t(), map() | vocabulary(), integer()) :: integer()
Converts a single word to its vocabulary index.
Parameters
- word - Word to look up
- vocab - Vocabulary map or vocabulary struct
- unk_value - Value to return if word not found (default: UNK id)
Returns
Integer index.
@spec words_to_ids(vocabulary(), [String.t()], keyword()) :: {:ok, Nx.Tensor.t()}
Converts a list of words to a tensor of word IDs.
Parameters
- vocab - Vocabulary struct
- words - List of words
- opts - Conversion options
Options
- :max_length - Truncate or pad to this length (default: no padding)
- :pad_value - Value to use for padding (default: PAD token ID)
Returns
- {:ok, tensor} - Tensor of word IDs
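The truncate-or-pad behavior of :max_length can be illustrated with plain lists. This sketch is not Nasty's implementation. The module PadSketch, the "<PAD>" and "<UNK>" token names, and right-side padding are assumptions, and the result is a list rather than an Nx tensor.

```elixir
defmodule PadSketch do
  # Hypothetical helper, not part of Nasty. Takes a plain word => id map
  # and returns a list of IDs.
  def words_to_ids(word_to_id, words, opts \\ []) do
    unk_id = Map.fetch!(word_to_id, "<UNK>")

    # Only look up the PAD ID when no explicit :pad_value is given
    pad_value =
      Keyword.get_lazy(opts, :pad_value, fn -> Map.fetch!(word_to_id, "<PAD>") end)

    # Unknown words map to the UNK ID
    ids = Enum.map(words, &Map.get(word_to_id, &1, unk_id))

    case Keyword.get(opts, :max_length) do
      nil ->
        ids

      max_length ->
        truncated = Enum.take(ids, max_length)
        # Pad on the right up to max_length (assumed padding side)
        truncated ++ List.duplicate(pad_value, max_length - length(truncated))
    end
  end
end
```

Fixing the length this way lets sequences of different sizes be stacked into a single rectangular batch.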
@spec words_to_indices([String.t()], map() | vocabulary()) :: [integer()]
Converts list of words to list of indices.
Parameters
- words - List of words
- vocab - Vocabulary map or struct
Returns
List of indices.