Scrape.IR.Text (scrape v3.1.0)

Collection of text-mining algorithms, such as summarization, classification, and clustering.

Implementation details are hidden within the algorithms so that a clean interface can be provided.

Summary

Functions

Removes all junk from a given text, such as JavaScript, HTML, or mixed whitespace.

Find out in which natural language the given text is written.

Dissect a text into sentences, weight their stemmed keywords against each other and return the 3 semantically most important sentences.

Normalize a text paragraph so that it contains no whitespace other than single spaces between words.

Similar to semantic_tokenize/2, but also determines the n (default: 20) most relevant stemmed tokens from the list.

Dissect a text into word tokens, similar to tokenize/1, but strip words that carry no semantic value.

Dissect a text into word tokens.

Dissect a text into word tokens, keeping common phrase delimiters.

Strip all HTML tags from a text.

Remove all occurrences of JavaScript from an HTML snippet.

Functions

Removes all junk from a given text, such as JavaScript, HTML, or mixed whitespace.

Example

iex> Scrape.IR.Text.clean("\t hello, \r<b>world</b>!")
"hello, world!"
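
The cleaning can be sketched as a pipeline of the techniques documented further down on this page. This is a hypothetical reconstruction, not the library's actual implementation:

```elixir
# Hypothetical cleaning pipeline: strip scripts, strip tags, then
# collapse whitespace. Not the library's actual code.
clean = fn text ->
  text
  |> then(&Regex.replace(~r/<script[^>]*>.*?<\/script>/s, &1, ""))
  |> then(&Regex.replace(~r/<[^>]*>/, &1, ""))
  |> String.replace(~r/\s+/, " ")
  |> String.trim()
end

clean.("\t hello, \r<b>world</b>!")
# => "hello, world!"
```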

detect_language(text)
detect_language(String.t()) :: :de | :en

Find out in which natural language the given text is written.

Currently only German and (as a fallback) English are valid results. Uses the external library Paasaa.

Example

iex> Scrape.IR.Text.detect_language("the quick brown fox jumps over...")
:en

iex> Scrape.IR.Text.detect_language("Es ist ein schönes Wetter heute...")
:de
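
The library delegates detection to Paasaa; a naive stopword-counting sketch can still illustrate the idea. The word lists below are hypothetical and much too small for real use:

```elixir
# Naive language detection by counting stopword hits per language.
# Illustrative word lists only; Paasaa works differently.
de_stopwords = ~w(der die das und ist ein)
en_stopwords = ~w(the and is a of to)

detect = fn text ->
  tokens = text |> String.downcase() |> String.split(~r/\W+/u, trim: true)
  de_hits = Enum.count(tokens, &(&1 in de_stopwords))
  en_hits = Enum.count(tokens, &(&1 in en_stopwords))
  if de_hits > en_hits, do: :de, else: :en
end

detect.("das ist ein schönes Wetter und die Sonne scheint")
# => :de
```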

extract_summary(text, start_words, language \\ :en)

Dissect a text into sentences, weight their stemmed keywords against each other and return the 3 semantically most important sentences.
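
A minimal sketch of this style of extractive summarization, with a hypothetical keyword-overlap score instead of the library's stemming and weighting:

```elixir
# Hypothetical sketch: score each sentence by how many of the given
# keywords it contains, then keep the top 3 sentences.
text = "Cats purr. Dogs bark loudly. Cats and dogs can be friends. Fish swim."
keywords = ["cats", "dogs"]

summary =
  text
  |> String.split(~r/(?<=[.!?])\s+/, trim: true)
  |> Enum.sort_by(fn sentence ->
    tokens = sentence |> String.downcase() |> String.split(~r/\W+/u, trim: true)
    -Enum.count(tokens, &(&1 in keywords))
  end)
  |> Enum.take(3)
```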


normalize_whitespace(text)
normalize_whitespace(String.t()) :: String.t()

Normalize a text paragraph so that it contains no whitespace other than single spaces between words.

Example

iex> Scrape.IR.Text.normalize_whitespace("\r\thello world\r ")
"hello world"
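
Such normalization could be sketched as follows; this is an assumption about the approach, not the library's code:

```elixir
# Collapse any run of whitespace to a single space, then trim the ends.
normalize = fn text ->
  text
  |> String.replace(~r/\s+/, " ")
  |> String.trim()
end

normalize.("\r\thello   world\r ")
# => "hello world"
```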

semantic_keywords(text, n \\ 20, language \\ :en)

Similar to semantic_tokenize/2, but also determines the n (default: 20) most relevant stemmed tokens from the list.
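
A frequency-based sketch of keyword selection. The stopword list is hypothetical, and the stemming step the library performs is omitted here:

```elixir
# Hypothetical sketch: keep the n most frequent semantic tokens.
stopwords = ~w(a an the and is)
n = 2

keywords =
  "the quick brown fox and the lazy brown dog"
  |> String.downcase()
  |> String.split(~r/\W+/u, trim: true)
  |> Enum.reject(&(&1 in stopwords))
  |> Enum.frequencies()
  |> Enum.sort_by(fn {_token, count} -> -count end)
  |> Enum.take(n)
  |> Enum.map(fn {token, _count} -> token end)
```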


semantic_tokenize(text, language \\ :en)
semantic_tokenize(String.t(), :de | :en) :: [String.t()]

Dissect a text into word tokens, similar to tokenize/1, but strip words that carry no semantic value.

Examples

iex> Scrape.IR.Text.semantic_tokenize("A beautiful day!", :en)
["beautiful", "day"]
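
The underlying idea can be sketched as plain tokenization followed by filtering against a stopword list; the list below is hypothetical:

```elixir
# Tokenize, then drop stopwords. Illustrative stopword list only.
stopwords = ~w(a an the is)

tokens =
  "A beautiful day!"
  |> String.downcase()
  |> String.split(~r/\W+/u, trim: true)
  |> Enum.reject(&(&1 in stopwords))
# => ["beautiful", "day"]
```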

tokenize(text)
tokenize(String.t()) :: [String.t()]

Dissect a text into word tokens.

The resulting list is a list of downcased words with all non-word-characters stripped.

Examples

iex> Scrape.IR.Text.tokenize("Hello, world!")
["hello", "world"]
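
Equivalent behavior can be sketched with standard library calls; this is an illustration, not the actual implementation:

```elixir
# Downcase, then split on runs of non-word characters.
tokens =
  "Hello, world!"
  |> String.downcase()
  |> String.split(~r/\W+/u, trim: true)
# => ["hello", "world"]
```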

tokenize_preserve_delimiters(text)
tokenize_preserve_delimiters(String.t()) :: [String.t()]

Dissect a text into word tokens, keeping common phrase delimiters.

The resulting list is a list of downcased words with all non-word-characters stripped, but common phrase delimiters still included.

Examples

iex> Scrape.IR.Text.tokenize_preserve_delimiters("Hello, world!")
["hello", ",", "world", "!"]
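
One way to keep delimiters is a single alternating regex with Regex.scan/2. This sketch assumes a fixed delimiter set, which may differ from the library's:

```elixir
# Match either a word or a single delimiter character, in order.
tokens =
  ~r/\w+|[,;:.!?]/u
  |> Regex.scan(String.downcase("Hello, world!"))
  |> List.flatten()
# => ["hello", ",", "world", "!"]
```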

without_html(text)
without_html(String.t()) :: String.t()

Strip all HTML tags from a text.

Example

iex> Scrape.IR.Text.without_html("<p>stuff</p>")
"stuff"
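
Tag stripping of this kind is often a simple regex replace. This is a sketch; messy real-world HTML may call for an actual parser such as Floki:

```elixir
# Remove anything that looks like a tag: "<" ... ">".
stripped = Regex.replace(~r/<[^>]*>/, "<p>stuff</p>", "")
# => "stuff"
```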

without_js(text)
without_js(String.t()) :: String.t()

Remove all occurrences of JavaScript from an HTML snippet.

Uses a regex (!), so unusual or malformed markup may not be handled correctly.

Example

iex> Scrape.IR.Text.without_js("a<script>b</script>c")
"ac"
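
The regex-based approach can be sketched as below, assuming well-formed <script> tags; the exact pattern the library uses may differ:

```elixir
# Non-greedily remove each <script ...>...</script> block, including
# its contents. The /s flag lets "." match newlines inside the script.
without_js = fn html ->
  Regex.replace(~r/<script[^>]*>.*?<\/script>/s, html, "")
end

without_js.("a<script>b</script>c")
# => "ac"
```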