Penelope v0.5.0 API Reference
Modules
This is the library application for the Penelope framework. It starts the supervision tree for the library’s processes
The CRF tagger is a thin wrapper over the CRFSuite library for sequence inference. It provides the ability to train sequence models, use them for inference, and import/export them
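A training/inference sketch; the alias and function names here are assumptions for illustration, not the confirmed API:

```elixir
# hypothetical sketch: names and signatures are assumed, not confirmed API
alias Penelope.ML.CRF.Tagger

# one sample = one sequence of per-token feature maps, one label per token
x = [[%{"word" => "san"}, %{"word" => "francisco"}]]
y = [["B-LOC", "I-LOC"]]

model = Tagger.fit(%{}, x, y)
Tagger.predict_sequence(model, %{}, hd(x))
# => the most likely label sequence for the tokens
```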
This is a sequence featurizer that extracts a constant value from the prediction context and adds it as a feature to each element of every sequence in each sample. This is useful for biasing a sequence classifier at the sample level
This sequence featurizer invokes a set of inner featurizers and merges their results into a single map per sequence element
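The merge step itself is simple to picture in plain Elixir (an illustration of the idea, not the library's code):

```elixir
# per-element feature maps from two inner featurizers, merged pairwise
a = [%{"word" => "hello"}, %{"word" => "world"}]
b = [%{"len" => 5}, %{"len" => 5}]

Enum.zip_with(a, b, &Map.merge/2)
# => [%{"len" => 5, "word" => "hello"}, %{"len" => 5, "word" => "world"}]
```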
This vectorizer horizontally stacks the results of a sequence of inner vectorizers applied to an incoming feature matrix. This is analogous to the behavior of the FeatureUnion component in sklearn
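Horizontal stacking can be pictured in plain Elixir (illustration only): each inner vectorizer emits a row per sample, and the stack concatenates those rows feature-wise.

```elixir
rows_a = [[1.0, 0.0], [0.0, 1.0]]  # output of inner vectorizer A
rows_b = [[3.0], [7.0]]            # output of inner vectorizer B

Enum.zip_with(rows_a, rows_b, &Kernel.++/2)
# => [[1.0, 0.0, 3.0], [0.0, 1.0, 7.0]]
```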
The linear classifier uses liblinear for multi-class classification. It provides support for training a model, compiling/extracting model parameters to/from erlang data structures, and predicting classes or probabilities
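As a sketch only (the alias, function names, and options below are assumptions, not the confirmed API), a fit/predict round trip might look like:

```elixir
# hypothetical sketch: module/function names and options are assumed
alias Penelope.ML.Linear.Classifier

x = [[1.0, 0.0], [0.0, 1.0]]  # one feature vector per sample
y = ["spam", "ham"]

model = Classifier.fit(%{}, x, y, c: 1.0)
Classifier.predict_probability(model, %{}, x)
# => per-sample maps of class => probability
```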
The ML pipeline provides the ability to express an inference graph as a data structure, and to fit/export/compile/predict based on the graph. A pipeline is represented as a sequence of stages, each of which is a component module that supports the pipeline interface. This structure is modeled after sklearn’s pipeline
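A sketch of the data-structure representation; the stage names, options, and function signatures here are assumptions for illustration:

```elixir
# hypothetical sketch: stage names and signatures are assumed
alias Penelope.ML.Pipeline

pipeline = [
  {:count_vectorizer, []},         # tokens -> counts
  {:linear_classifier, [c: 1.0]}   # counts -> class
]

x = [["to", "be", "or", "not", "to", "be"], ["brevity"]]
y = ["long", "short"]

model = Pipeline.fit(%{}, x, y, pipeline)  # fit each stage in sequence
Pipeline.predict_class(model, %{}, x)      # run the fitted graph forward
```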
The ML pipeline registry decouples the names of pipeline components from their module names, so that modules can be refactored without breaking stored models. The built-in Penelope components are registered automatically, but custom components can be added via the register function
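Registering a custom component might look like the following; the register function is documented above, but the module path and arity shown here are assumptions:

```elixir
defmodule MyApp.EmojiFeaturizer do
  # implements the pipeline component interface (fit/transform/etc.)
end

# hypothetical module path and arity for the documented register function
Penelope.ML.Registry.register(:emoji_featurizer, MyApp.EmojiFeaturizer)
```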
The SVM classifier uses libsvm for multi-class classification. It provides support for training a model, compiling/extracting model parameters to/from erlang data structures, and predicting classes or probabilities
The CountVectorizer simply counts the number of tokens in the incoming documents. It assumes that samples have already been tokenized into a list per sample. This vectorizer is useful for biasing a model for longer/shorter documents
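In plain Elixir, the count it produces per pre-tokenized sample amounts to (illustration only):

```elixir
docs = [["to", "be", "or", "not", "to", "be"], ["brevity"]]
Enum.map(docs, fn tokens -> [length(tokens)] end)
# => [[6], [1]]
```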
Downcasing document preprocessor
The POS featurizer converts a list of lists of tokens into nested lists containing feature maps relevant to POS tagging for each token
This pipeline component adapts the treebank tokenizer + the digit token preprocessor to the pipeline transformer conventions. It produces a sequence of tokens for each incoming document string
This pipeline component adapts the treebank tokenizer to the pipeline transformer conventions. It produces a sequence of tokens for each incoming document string
The regex vectorizer applies a list of regexes to each incoming document and produces an output vector of 0/1 values based on the results
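The 0/1 encoding is easy to picture in plain Elixir (an illustration of the idea, not the library's code):

```elixir
regexes = [~r/\d/, ~r/\?$/]
doc = "how many are there?"

Enum.map(regexes, fn re -> if Regex.match?(re, doc), do: 1, else: 0 end)
# => [0, 1]  (no digit; ends with a question mark)
```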
The token featurizer converts a list of tokenized documents into a map per token, in the format used for sequence classification
The token filter removes tokens from a token list in the pipeline
This is the vector library used by the ML modules. It provides an interface to an efficient binary representation of 32-bit floating point values. Math is done via the BLAS interface, wrapped in a NIF module
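The packed float32 representation can be pictured in plain Elixir (illustration only; the library does its math through BLAS in a NIF):

```elixir
# pack a list of numbers as contiguous 32-bit native-endian floats
pack = fn floats ->
  for f <- floats, into: <<>>, do: <<f::float-native-size(32)>>
end

v = pack.([1.0, 2.5, -3.0])
byte_size(v)  # => 12 (3 elements x 4 bytes each)
```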
This module represents a word2vec-style vectorset, compiled into a set of hash-partitioned DETS files. Each record is a tuple consisting of the term (word) and a set of weights (vector). This module also supports parsing the standard text representation of word vectors via the compile function
This module vectorizes a list of tokens using word vectors. Token vectors are retrieved from the word2vec index (see index.ex). These are combined into a single document vector by taking their vector mean
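The combining step, taking the mean of the token vectors, looks like this in plain Elixir (an illustration of the idea, not the library's code):

```elixir
token_vectors = [[1.0, 4.0], [3.0, 0.0]]  # looked up per token
n = length(token_vectors)

token_vectors
|> Enum.zip_with(&Enum.sum/1)
|> Enum.map(&(&1 / n))
# => [2.0, 2.0]
```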
NIF wrapper module
The intent classifier transforms a natural language utterance into a named intent and a set of named parameters. It uses an ML classifier to infer the intent name and an entity recognizer to extract named entities as parameters. These components are both represented as ML pipelines
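A sketch of the input/output shape; the module, function, and example values here are assumptions, not confirmed API:

```elixir
# hypothetical sketch: names and result shape are assumed for illustration
alias Penelope.NLP.IntentClassifier

# classifier: an intent classifier fit/loaded elsewhere
{intent, params} =
  IntentClassifier.classify(classifier, %{}, "turn on the kitchen light")

# intent => "light_on"
# params => %{"room" => "kitchen"}
```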
The part-of-speech tagger transforms a tokenized sentence into a list of {token, pos_tag} tuples. The tagger takes no responsibility for tokenization; this means that callers must be careful to maintain the same tokenization scheme between training and evaluating to ensure the best results
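The {token, pos_tag} output shape is documented above; the module and function names in this sketch are assumptions:

```elixir
# hypothetical call; only the output shape below is documented
alias Penelope.NLP.POSTagger

# tagger: a part-of-speech tagger fit/loaded elsewhere
POSTagger.tag(tagger, %{}, ["the", "quick", "brown", "fox"])
# => [{"the", "DT"}, {"quick", "JJ"}, {"brown", "JJ"}, {"fox", "NN"}]
```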
This is a BERT-compatible wordpiece tokenizer/vectorizer implementation. It provides the ability to encode a text string into an integer vector containing values derived from a wordpiece vocabulary. The encoded results can also be converted back to the original text or a substring of it
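The core wordpiece technique, greedy longest-match against a subword vocabulary, can be sketched in plain Elixir (an illustration of the technique with a toy vocabulary, not the library's implementation):

```elixir
defmodule WordpieceSketch do
  @vocab %{"un" => 1, "##aff" => 2, "##able" => 3, "[UNK]" => 0}

  def encode(word), do: encode(word, true, [])

  defp encode("", _first?, acc), do: Enum.reverse(acc)

  defp encode(rest, first?, acc) do
    # continuation pieces carry the "##" prefix
    prefix = if first?, do: "", else: "##"

    # try the longest matching vocabulary entry first
    match =
      rest
      |> length_range()
      |> Enum.map(&(prefix <> String.slice(rest, 0, &1)))
      |> Enum.find(&Map.has_key?(@vocab, &1))

    case match do
      # a word that cannot be fully segmented maps to [UNK]
      nil ->
        [Map.fetch!(@vocab, "[UNK]")]

      piece ->
        taken = String.length(piece) - String.length(prefix)
        encode(String.slice(rest, taken..-1//1), false, [@vocab[piece] | acc])
    end
  end

  defp length_range(s), do: Enum.to_list(String.length(s)..1//-1)
end

WordpieceSketch.encode("unaffable")  # => [1, 2, 3]
```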
The tokenization scheme used for the creation of the Penn Treebank corpus. See ftp://ftp.cis.upenn.edu/pub/treebank/public_html/tokenization.html
The behaviour implemented by all tokenizers
Exceptions
DETS index processing error
Mix Tasks
Common utility functions for part-of-speech tagger Mix tasks
This task tests a pretrained part-of-speech tagger model using a file containing tokenized text and POS tags
This task trains and optionally tests a part-of-speech tagger using files containing tokenized text and POS tags
This task compiles a word vector text file into a set of DETS indexes