Penelope v0.5.0 API Reference

Modules

This is the library application for the Penelope framework. It starts the supervision tree for the library’s processes

The CRF tagger is a thin wrapper over the CRFSuite library for sequence inference. It provides the ability to train sequence models, use them for inference, and import/export them
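
For illustration, training and tagging might look like the following; the module path and function names are assumptions, not the library's confirmed API:

    # hypothetical usage; module and function names are assumptions
    x = [[%{"word" => "cold"}, %{"word" => "front"}]]  # feature maps per token
    y = [["B-WEATHER", "I-WEATHER"]]                   # one label per token
    model = Penelope.ML.CRF.Tagger.fit(%{}, x, y)
    {tags, _probability} = Penelope.ML.CRF.Tagger.predict_sequence(model, %{}, hd(x))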

This is a sequence featurizer that extracts a constant value from the prediction context and adds it as a feature to every element of every sequence in each sample. This is useful for biasing a sequence classifier at the sample level
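
Conceptually, the featurizer merges a constant drawn from the context into every token's feature map. The sketch below uses only illustrative shapes and standard library calls:

    # illustrative shapes only: one sequence of two token feature maps
    context  = %{bias: 1.0}
    sequence = [%{"word" => "hello"}, %{"word" => "world"}]
    Enum.map(sequence, &Map.merge(&1, %{"bias" => context.bias}))
    # => [%{"bias" => 1.0, "word" => "hello"}, %{"bias" => 1.0, "word" => "world"}]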

This sequence featurizer invokes a set of inner featurizers and merges their results into a single map per sequence element

This vectorizer horizontally stacks the results of a sequence of inner vectorizers applied to an incoming feature matrix. This is analogous to the behavior of the FeatureUnion component in sklearn
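
The effect is horizontal concatenation: if two inner vectorizers emit [1.0, 2.0] and [0.0] for a sample, the union emits [1.0, 2.0, 0.0]. In plain Elixir terms:

    # conceptual: per-sample outputs of two inner vectorizers, concatenated
    v1 = [1.0, 2.0]
    v2 = [0.0]
    v1 ++ v2  # => [1.0, 2.0, 0.0]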

The linear classifier uses liblinear for multi-class classification. It provides support for training a model, compiling/extracting model parameters to/from Erlang data structures, and predicting classes or probabilities
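
A hedged sketch of the train/predict flow; the module path and function names are assumptions:

    # hypothetical usage; module and function names are assumptions
    x = [[1.0, 0.0], [0.0, 1.0]]  # feature vectors
    y = ["spam", "ham"]           # class labels
    model = Penelope.ML.Linear.Classifier.fit(%{}, x, y)
    Penelope.ML.Linear.Classifier.predict_class(model, %{}, [1.0, 0.0])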

The ML pipeline provides the ability to express an inference graph as a data structure, and to fit/export/compile/predict based on the graph. A pipeline is represented as a sequence of stages, each of which is a component module that supports the pipeline interface. This structure is modeled after sklearn’s pipeline
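
A pipeline spec is plain data: a list of stages, each naming a component and its options. A minimal sketch, assuming hypothetical stage names and a fit function in the sklearn style:

    # hypothetical pipeline; stage names and signatures are assumptions
    x = [["hello", "world"]]  # pre-tokenized documents
    y = ["greeting"]          # target labels
    pipeline = [
      {:count_vectorizer, []},
      {:linear_classifier, [c: 1.0]}
    ]
    model = Penelope.ML.Pipeline.fit(pipeline, x, y)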

The ML pipeline registry decouples the names of pipeline components from their module names, so that modules can be refactored without breaking stored models. The built-in Penelope components are registered automatically, but custom components can be added via the register function
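
The register function is the documented extension point; its exact signature in this sketch is an assumption:

    # hypothetical: map a stable stage name to a custom component module
    Penelope.ML.Registry.register(:my_featurizer, MyApp.MyFeaturizer)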

The SVM classifier uses libsvm for multi-class classification. It provides support for training a model, compiling/extracting model parameters to/from Erlang data structures, and predicting classes or probabilities

The CountVectorizer simply counts the number of tokens in the incoming documents. It assumes that samples have already been tokenized into a list per sample. This vectorizer is useful for biasing a model for longer/shorter documents
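
Conceptually, each pre-tokenized sample maps to a one-element count vector:

    # conceptual: token count per pre-tokenized sample
    docs = [["the", "cat", "sat"], ["hello"]]
    Enum.map(docs, &[length(&1) * 1.0])  # => [[3.0], [1.0]]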

Downcasing document preprocessor

The POS featurizer converts a list of lists of tokens into nested lists containing feature maps relevant to POS tagging for each token

This pipeline component adapts the treebank tokenizer and the digit token preprocessor to the pipeline transformer conventions. It produces a sequence of tokens for each incoming document string

This pipeline component adapts the treebank tokenizer to the pipeline transformer conventions. It produces a sequence of tokens for each incoming document string

The regex vectorizer applies a list of regexes to each incoming document and produces an output vector of 0/1 values based on the results
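
Conceptually, each regex contributes one 0/1 component per document; in plain Elixir:

    # conceptual: one 0/1 component per regex
    regexes = [~r/\d/, ~r/@/]
    doc = "call 555-1234"
    Enum.map(regexes, fn re -> if Regex.match?(re, doc), do: 1.0, else: 0.0 end)
    # => [1.0, 0.0]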

The token featurizer converts a list of tokenized documents into a map per token, in the format used for sequence classification
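
Illustratively, each token becomes a feature map; the actual feature keys here are assumptions:

    # illustrative: one feature map per token, per document
    docs = [["hello", "world"]]
    Enum.map(docs, fn tokens -> Enum.map(tokens, &%{"word" => &1}) end)
    # => [[%{"word" => "hello"}, %{"word" => "world"}]]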

The token filter removes tokens from a token list in the pipeline

This is the vector library used by the ML modules. It provides an interface to an efficient binary representation of 32-bit floating point values. Math is done via the BLAS interface, wrapped in a NIF module
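
A hedged sketch of round-tripping a list through the packed representation; the function names are assumptions:

    # hypothetical usage; function names are assumptions
    v = Penelope.ML.Vector.from_list([1.0, 2.0, 3.0])
    Penelope.ML.Vector.to_list(v)  # => [1.0, 2.0, 3.0]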

This module represents a word2vec-style vectorset, compiled into a set of hash-partitioned DETS files. Each record is a tuple consisting of the term (word) and a set of weights (vector). This module also supports parsing the standard text representation of word vectors via the compile function
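
The compile function is documented above; the argument shapes in this sketch are assumptions:

    # hypothetical: parse a text vector file into hash-partitioned
    # DETS files; the paths and arity are assumptions
    Penelope.ML.Word2vec.Index.compile("vectors.txt", "priv/word2vec")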

This module vectorizes a list of tokens using word vectors. Token vectors are retrieved from the word2vec index (see index.ex). These are combined into a single document vector by taking their vector mean
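
The combining rule is a component-wise mean; in plain Elixir (Enum.zip_with/2 requires Elixir 1.12+):

    # conceptual: document vector = component-wise mean of token vectors
    token_vectors = [[1.0, 2.0], [3.0, 4.0]]
    n = length(token_vectors)
    Enum.zip_with(token_vectors, fn comps -> Enum.sum(comps) / n end)
    # => [2.0, 3.0]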

NIF wrapper module

The intent classifier transforms a natural language utterance into a named intent and a set of named parameters. It uses an ML classifier to infer the intent name and an entity recognizer to extract named entities as parameters. These components are both represented as ML pipelines
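
A hedged sketch of the call shape; the function name and the output shown are illustrative assumptions:

    # hypothetical usage; classifier is a previously fitted model
    {_intent, _params} =
      Penelope.NLP.IntentClassifier.classify(classifier, %{}, "turn on the kitchen light")
    # e.g. intent "light_on" with params %{"room" => "kitchen"}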

The part-of-speech tagger transforms a tokenized sentence into a list of {token, pos_tag} tuples. The tagger takes no responsibility for tokenization, so callers must maintain the same tokenization scheme between training and evaluation to ensure the best results
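
Illustratively, tagging a pre-tokenized sentence yields token/tag tuples; the function name is an assumption, and the tags shown are standard Penn Treebank tags:

    # hypothetical usage; model is a previously trained tagger
    Penelope.NLP.POSTagger.tag(model, %{}, ["the", "cat", "sat"])
    # => [{"the", "DT"}, {"cat", "NN"}, {"sat", "VBD"}]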

This is a BERT-compatible wordpiece tokenizer/vectorizer implementation. It provides the ability to encode a text string into an integer vector containing values derived from a wordpiece vocabulary. The encoded results can also be converted back to the original text or a substring of it
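
A hedged sketch of encoding; the module path and function name are assumptions:

    # hypothetical usage; tokenizer is a previously loaded vocabulary
    ids = Penelope.NLP.Tokenize.Wordpiece.encode(tokenizer, "unaffable")
    # wordpiece splits out-of-vocabulary words into subword units,
    # e.g. ["un", "##aff", "##able"], before mapping them to integer ids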

The tokenization scheme used for the creation of the Penn Treebank corpus. See ftp://ftp.cis.upenn.edu/pub/treebank/public_html/tokenization.html
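
For example, treebank tokenization splits contractions and trailing punctuation; the module path and function name here are assumptions:

    # hypothetical usage; module/function names are assumptions
    Penelope.NLP.Tokenize.Treebank.tokenize("They'll save and invest more.")
    # => ["They", "'ll", "save", "and", "invest", "more", "."]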

The behaviour implemented by all tokenizers
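
A minimal sketch of implementing the behaviour; the behaviour's module path and callback name are assumptions:

    # hypothetical implementation; behaviour path and callback are assumptions
    defmodule MyApp.WhitespaceTokenizer do
      @behaviour Penelope.NLP.Tokenize.Tokenizer

      @impl true
      def tokenize(text), do: String.split(text)
    end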

Exceptions

DETS index processing error

Mix Tasks

Common utility functions for part-of-speech tagger Mix tasks

This task tests a pretrained part-of-speech tagger model using a file containing tokenized text and POS tags

This task trains and optionally tests a part-of-speech tagger using files containing tokenized text and POS tags

This task compiles a word vector text file into a set of DETS indexes