Penelope v0.5.0 Penelope.NLP.Tokenize.PennTreebankTokenizer
The tokenization scheme used for the creation of the Penn Treebank corpus. See ftp://ftp.cis.upenn.edu/pub/treebank/public_html/tokenization.html.
Some alterations have been made to the original script to better handle common Unicode replacement characters.
Summary
Functions
Detokenize a string tokenized by the Penn Treebank tokenizer. The PTB tokenization scheme is lossy; attributes like capitalization, multiple spaces, and padding around certain punctuation will be removed from the output.
Separate a string into a list of tokens.
Functions
Detokenize a string tokenized by the Penn Treebank tokenizer. The PTB tokenization scheme is lossy; attributes like capitalization, multiple spaces, and padding around certain punctuation will be removed from the output.
Separate a string into a list of tokens.
Callback implementation for Penelope.NLP.Tokenize.Tokenizer.tokenize/1.
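To illustrate why the scheme is lossy, here is a minimal, self-contained sketch of a few Penn Treebank-style rules. It is not Penelope's implementation: the module name PTBSketch and its regular expressions are assumptions made for illustration, and the real tokenizer handles many more cases (quotes, abbreviations, Unicode replacement characters).

```elixir
defmodule PTBSketch do
  # Hypothetical, simplified illustration of Penn Treebank-style rules.
  # This is NOT the code used by Penelope.NLP.Tokenize.PennTreebankTokenizer.
  def tokenize(text) do
    text
    # Pad common punctuation with spaces so it becomes its own token.
    |> String.replace(~r/([,;:?!()"])/, " \\1 ")
    # Split off a period at the very end of the string.
    |> String.replace(~r/\.\s*$/, " .")
    # Split the negative clitic: "don't" becomes "do n't".
    |> String.replace(~r/(n't)\b/i, " \\1")
    # Split other clitics: "I'm" becomes "I 'm", "they're" becomes "they 're".
    |> String.replace(~r/'(s|re|ve|ll|d|m)\b/i, " '\\1")
    # Splitting on whitespace collapses runs of spaces, which is one
    # reason the original string cannot be recovered exactly.
    |> String.split()
  end
end

tokens = PTBSketch.tokenize("I'm   here.")
IO.inspect(tokens)
# Joining with single spaces does not reproduce the input: the extra
# spaces are gone and the padding around "'m" and "." was introduced.
IO.inspect(Enum.join(tokens, " "))
```

A detokenizer has to guess how to undo these splits, which is why details such as repeated spaces and padding around punctuation do not survive a round trip.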