essence v0.1.1 Essence.Tokenizer

The Essence.Tokenizer module exposes functions for transforming text into tokens and for working with tokens.

Summary

Functions

Splits a given String into tokens on punctuation, and includes the punctuation as tokens. This method supports Unicode text.

Splits a given text into tokens on punctuation, but omits the punctuation tokens. This method supports Unicode text.

Splits a given String into tokens. A token is a sequence of characters to be treated as a group. The tokenize method will split on whitespace and punctuation, treating words and punctuation as tokens, and removing whitespace.

Tokenizes a given stream, and returns a list of tokens. Commonly the given stream is a :line stream.

Functions

split_with_punctuation(text)

Specs

split_with_punctuation(String.t) :: List.t

Splits a given String into tokens on punctuation, and includes the punctuation as tokens. This method supports Unicode text.
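
A minimal usage sketch; the result in the comment is illustrative only, since exact whitespace handling depends on the implementation:

iex> Essence.Tokenizer.split_with_punctuation("Hello, world!")
# e.g. ["Hello", ",", " world", "!"]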

split_without_punctuation(text)

Splits a given text into tokens on punctuation, but omits the punctuation tokens. This method supports Unicode text.
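
A hedged usage sketch; the commented result assumes the punctuation tokens are dropped and only the textual segments remain:

iex> Essence.Tokenizer.split_without_punctuation("Hello, world!")
# e.g. ["Hello", " world"] (punctuation omitted; exact output depends on the implementation)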

tokenize(text, opts \\ [])

Splits a given String into tokens. A token is a sequence of characters to be treated as a group. The tokenize method will split on whitespace and punctuation, treating words and punctuation as tokens, and removing whitespace.
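
A hedged usage sketch with opts left at the default; the commented result is illustrative, assuming whitespace is removed and each word and punctuation mark becomes its own token:

iex> Essence.Tokenizer.tokenize("Hello, world!")
# e.g. ["Hello", ",", "world", "!"]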

tokenize_s(stream)

Specs

tokenize_s(File.Stream.t) :: List.t

Tokenizes a given stream, and returns a list of tokens. Commonly the given stream is a :line stream.
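
A hedged usage sketch; corpus.txt is a hypothetical file, and File.stream!/1 yields a line-by-line stream by default:

iex> stream = File.stream!("corpus.txt")   # hypothetical file, streamed line by line
iex> Essence.Tokenizer.tokenize_s(stream)
# e.g. a flat list of tokens for the whole file, such as ["Hello", ",", "world", ...]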