Penelope v0.5.0 Penelope.NLP.Tokenize.PennTreebankTokenizer

The tokenization scheme used for the creation of the Penn Treebank corpus. See ftp://ftp.cis.upenn.edu/pub/treebank/public_html/tokenization.html.

Some alterations have been made to the original script to better handle common Unicode replacement characters.
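To give a feel for the scheme, here is a minimal, self-contained sketch of PTB-style splitting. It is not Penelope's implementation: the module name `PtbSketch`, the reduced rule set, and the choice to normalize curly apostrophes (one guess at what a Unicode alteration could look like) are all assumptions made for illustration.

```elixir
defmodule PtbSketch do
  # Hypothetical, heavily simplified illustration of PTB-style tokenization.
  # It covers only a handful of rules and is not the library's code.

  def tokenize(text) do
    text
    # Assumption: map curly apostrophes to ASCII before applying the rules.
    |> String.replace(["‘", "’"], "'")
    # Pad common punctuation with spaces so each mark becomes its own token.
    |> String.replace(~r/([,;:?!()])/, " \\1 ")
    # Split a sentence-final period from the preceding word.
    |> String.replace(~r/\.\s*$/, " .")
    # Split contractions: "don't" -> "do n't", "It's" -> "It 's".
    |> String.replace(~r/n't\b/i, " \\0")
    |> String.replace(~r/'(s|m|d|re|ve|ll)\b/i, " \\0")
    # Whitespace now separates the tokens.
    |> String.split()
  end
end

IO.inspect(PtbSketch.tokenize("They don't know, do they?"))
```

Contractions and punctuation come out as separate tokens, which is the characteristic behavior of the Penn Treebank convention.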

Summary

Functions

Detokenize a string tokenized by the Penn Treebank tokenizer. The PTB tokenization scheme is lossy; attributes like capitalization, multiple spaces, and padding around certain punctuation will be removed from the output.

Separate a string into a list of tokens.

Functions

Detokenize a string tokenized by the Penn Treebank tokenizer. The PTB tokenization scheme is lossy; attributes like capitalization, multiple spaces, and padding around certain punctuation will be removed from the output.
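The lossiness can be illustrated with a small stand-alone sketch. The module `DetokSketch` and its two rules are hypothetical and far simpler than whatever the library does; the sketch also assumes the input is a list of tokens.

```elixir
defmodule DetokSketch do
  # Hypothetical simplified detokenizer, for illustration only.

  def detokenize(tokens) do
    tokens
    # Join with single spaces; the original spacing is not recoverable.
    |> Enum.join(" ")
    # Reattach split contractions: "do n't" -> "don't", "It 's" -> "It's".
    |> String.replace(~r/ (n't|'(?:s|m|d|re|ve|ll))\b/i, "\\1")
    # Drop the space that precedes closing punctuation.
    |> String.replace(~r/ ([.,;:?!])/, "\\1")
  end
end

IO.puts(DetokSketch.detokenize(["They", "do", "n't", "know", ",", "do", "they", "?"]))
```

Because the token list carries no record of the original whitespace, inputs that differ only in spacing (for example, a doubled space or a missing space after the comma) would all detokenize to the same string.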

Separate a string into a list of tokens.

Callback implementation for Penelope.NLP.Tokenize.Tokenizer.tokenize/1.