Penelope v0.5.0 API Reference
Modules
This is the library application for the Penelope framework. It starts the supervision tree for the library’s processes
The CRF tagger is a thin wrapper over the CRFSuite library for sequence inference. It provides the ability to train sequence models, use them for inference, and import/export them
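A training/inference sketch; the alias and function names here are assumptions for illustration, not the confirmed API:

```elixir
# hypothetical sketch: names and signatures are assumed, not confirmed API
alias Penelope.ML.CRF.Tagger

# one sample = one sequence of per-token feature maps, one label per token
x = [[%{"word" => "san"}, %{"word" => "francisco"}]]
y = [["B-LOC", "I-LOC"]]

model = Tagger.fit(%{}, x, y)
Tagger.predict_sequence(model, %{}, hd(x))
# => the most likely label sequence for the tokens
```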
This is a sequence featurizer that extracts a constant value from the prediction context and adds it as a feature to each element of every sequence in each sample. This is useful for biasing a sequence classifier at the sample level
This sequence featurizer invokes a set of inner featurizers and merges their results into a single map per sequence element
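The merge step itself is simple to picture in plain Elixir (an illustration of the idea, not the library's code):

```elixir
# per-element feature maps from two inner featurizers, merged pairwise
a = [%{"word" => "hello"}, %{"word" => "world"}]
b = [%{"len" => 5}, %{"len" => 5}]

Enum.zip_with(a, b, &Map.merge/2)
# => [%{"len" => 5, "word" => "hello"}, %{"len" => 5, "word" => "world"}]
```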
This vectorizer horizontally stacks the results of a sequence of inner vectorizers applied to an incoming feature matrix. This is analogous to the behavior of the FeatureUnion component in sklearn
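Horizontal stacking can be pictured in plain Elixir (illustration only): each inner vectorizer emits a row per sample, and the stack concatenates those rows feature-wise.

```elixir
rows_a = [[1.0, 0.0], [0.0, 1.0]]  # output of inner vectorizer A
rows_b = [[3.0], [7.0]]            # output of inner vectorizer B

Enum.zip_with(rows_a, rows_b, &Kernel.++/2)
# => [[1.0, 0.0, 3.0], [0.0, 1.0, 7.0]]
```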
The linear classifier uses liblinear for multi-class classification. It provides support for training a model, compiling/extracting model parameters to/from erlang data structures, and predicting classes or probabilities
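As a sketch only (the alias, function names, and options below are assumptions, not the confirmed API), a fit/predict round trip might look like:

```elixir
# hypothetical sketch: module/function names and options are assumed
alias Penelope.ML.Linear.Classifier

x = [[1.0, 0.0], [0.0, 1.0]]  # one feature vector per sample
y = ["spam", "ham"]

model = Classifier.fit(%{}, x, y, c: 1.0)
Classifier.predict_probability(model, %{}, x)
# => per-sample maps of class => probability
```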
The ML pipeline provides the ability to express an inference graph as a data structure, and to fit/export/compile/predict based on the graph. A pipeline is represented as a sequence of stages, each of which is a component module that supports the pipeline interface. This structure is modeled after sklearn’s pipeline
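A sketch of the data-structure representation; the stage names, options, and function signatures here are assumptions for illustration:

```elixir
# hypothetical sketch: stage names and signatures are assumed
alias Penelope.ML.Pipeline

pipeline = [
  {:count_vectorizer, []},         # tokens -> counts
  {:linear_classifier, [c: 1.0]}   # counts -> class
]

x = [["to", "be", "or", "not", "to", "be"], ["brevity"]]
y = ["long", "short"]

model = Pipeline.fit(%{}, x, y, pipeline)  # fit each stage in sequence
Pipeline.predict_class(model, %{}, x)      # run the fitted graph forward
```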
The ML pipeline registry decouples the names of pipeline components from their module names, so that modules can be refactored without breaking stored models. The built-in Penelope components are registered automatically, but custom components can be added via the register function
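Registering a custom component might look like the following; the register function is documented above, but the module path and arity shown here are assumptions:

```elixir
defmodule MyApp.EmojiFeaturizer do
  # implements the pipeline component interface (fit/transform/etc.)
end

# hypothetical module path and arity for the documented register function
Penelope.ML.Registry.register(:emoji_featurizer, MyApp.EmojiFeaturizer)
```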
The SVM classifier uses libsvm for multi-class classification. It provides support for training a model, compiling/extracting model parameters to/from erlang data structures, and predicting classes or probabilities
The CountVectorizer simply counts the number of tokens in the incoming documents. It assumes that samples have already been tokenized into a list per sample. This vectorizer is useful for biasing a model for longer/shorter documents
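In plain Elixir, the count it produces per pre-tokenized sample amounts to (illustration only):

```elixir
docs = [["to", "be", "or", "not", "to", "be"], ["brevity"]]
Enum.map(docs, fn tokens -> [length(tokens)] end)
# => [[6], [1]]
```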
Downcasing document preprocessor
The POS featurizer converts a list of lists of tokens into nested lists containing feature maps relevant to POS tagging for each token
This pipeline component adapts the treebank tokenizer + the digit token preprocessor to the pipeline transformer conventions. It produces a sequence of tokens for each incoming document string
This pipeline component adapts the treebank tokenizer to the pipeline transformer conventions. It produces a sequence of tokens for each incoming document string
The regex vectorizer applies a list of regexes to each incoming document and produces an output vector of 0/1 values based on the results
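The 0/1 encoding is easy to picture in plain Elixir (an illustration of the idea, not the library's code):

```elixir
regexes = [~r/\d/, ~r/\?$/]
doc = "how many are there?"

Enum.map(regexes, fn re -> if Regex.match?(re, doc), do: 1, else: 0 end)
# => [0, 1]  (no digit; ends with a question mark)
```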
The token featurizer converts a list of tokenized documents into a map per token, in the format used for sequence classification
The token filter removes tokens from a token list in the pipeline
This is the vector library used by the ML modules. It provides an interface to an efficient binary representation of 32-bit floating point values. Math is done via the BLAS interface, wrapped in a NIF module
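The packed float32 representation can be pictured in plain Elixir (illustration only; the library does its math through BLAS in a NIF):

```elixir
# pack a list of numbers as contiguous 32-bit native-endian floats
pack = fn floats ->
  for f <- floats, into: <<>>, do: <<f::float-native-size(32)>>
end

v = pack.([1.0, 2.5, -3.0])
byte_size(v)  # => 12 (3 elements x 4 bytes each)
```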
This module represents a word2vec-style vectorset, compiled into a set of hash-partitioned DETS files. Each record is a tuple consisting of the term (word) and a set of weights (vector). This module also supports parsing the standard text representation of word vectors via the compile function
This module vectorizes a list of tokens using word vectors. Token vectors are retrieved from the word2vec index (see index.ex). These are combined into a single document vector by taking their vector mean
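The combining step, taking the mean of the token vectors, looks like this in plain Elixir (an illustration of the idea, not the library's code):

```elixir
token_vectors = [[1.0, 4.0], [3.0, 0.0]]  # looked up per token
n = length(token_vectors)

token_vectors
|> Enum.zip_with(&Enum.sum/1)
|> Enum.map(&(&1 / n))
# => [2.0, 2.0]
```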
NIF wrapper module
The intent classifier transforms a natural language utterance into a named intent and a set of named parameters. It uses an ML classifier to infer the intent name and an entity recognizer to extract named entities as parameters. These components are both represented as ML pipelines
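A sketch of the input/output shape; the module, function, and example values here are assumptions, not confirmed API:

```elixir
# hypothetical sketch: names and result shape are assumed for illustration
alias Penelope.NLP.IntentClassifier

# classifier: an intent classifier fit/loaded elsewhere
{intent, params} =
  IntentClassifier.classify(classifier, %{}, "turn on the kitchen light")

# intent => "light_on"
# params => %{"room" => "kitchen"}
```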
The part-of-speech tagger transforms a tokenized sentence into a list of {token, pos_tag} tuples. The tagger takes no responsibility for tokenization; this means that callers must be careful to maintain the same tokenization scheme between training and evaluating to ensure the best results
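The {token, pos_tag} output shape is documented above; the module and function names in this sketch are assumptions:

```elixir
# hypothetical call; only the output shape below is documented
alias Penelope.NLP.POSTagger

# tagger: a part-of-speech tagger fit/loaded elsewhere
POSTagger.tag(tagger, %{}, ["the", "quick", "brown", "fox"])
# => [{"the", "DT"}, {"quick", "JJ"}, {"brown", "JJ"}, {"fox", "NN"}]
```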
This is a BERT-compatible wordpiece tokenizer/vectorizer implementation. It provides the ability to encode a text string into an integer vector containing values derived from a wordpiece vocabulary. The encoded results can also be converted back to the original text or a substring of it
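The core wordpiece technique, greedy longest-match against a subword vocabulary, can be sketched in plain Elixir (an illustration of the technique with a toy vocabulary, not the library's implementation):

```elixir
defmodule WordpieceSketch do
  @vocab %{"un" => 1, "##aff" => 2, "##able" => 3, "[UNK]" => 0}

  def encode(word), do: encode(word, true, [])

  defp encode("", _first?, acc), do: Enum.reverse(acc)

  defp encode(rest, first?, acc) do
    # continuation pieces carry the "##" prefix
    prefix = if first?, do: "", else: "##"

    # try the longest matching vocabulary entry first
    match =
      rest
      |> length_range()
      |> Enum.map(&(prefix <> String.slice(rest, 0, &1)))
      |> Enum.find(&Map.has_key?(@vocab, &1))

    case match do
      # a word that cannot be fully segmented maps to [UNK]
      nil ->
        [Map.fetch!(@vocab, "[UNK]")]

      piece ->
        taken = String.length(piece) - String.length(prefix)
        encode(String.slice(rest, taken..-1//1), false, [@vocab[piece] | acc])
    end
  end

  defp length_range(s), do: Enum.to_list(String.length(s)..1//-1)
end

WordpieceSketch.encode("unaffable")  # => [1, 2, 3]
```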
The tokenization scheme used for the creation of the Penn Treebank corpus. See ftp://ftp.cis.upenn.edu/pub/treebank/public_html/tokenization.html
The behaviour implemented by all tokenizers
Exceptions
DETS index processing error
Mix Tasks
Common utility functions for part-of-speech tagger Mix tasks
This task tests a pretrained part-of-speech tagger model using a file containing tokenized text and POS tags
This task trains and optionally tests a part-of-speech tagger using files containing tokenized text and POS tags
This task compiles a word vector text file into a set of DETS indexes