Scrape.IR.Text (scrape v3.1.0)

Collection of text-mining algorithms, such as summarization, classification, and clustering.

Implementation details are hidden within the algorithms so that a clean interface can be provided.

Summary

Functions

Removes all junk from a given text, such as JavaScript, HTML, or mixed whitespace.

Find out in which natural language the given text is written.

Dissect a text into sentences, weight their stemmed keywords against each other and return the 3 semantically most important sentences.

Normalize a text paragraph so that it contains no whitespace other than single spaces between words.

Similar to semantic_tokenize/2, but also determines the n (default: 20) most relevant stemmed tokens from the list.

Dissect a text into word tokens, similar to tokenize/1, but strip words that carry no semantic value.

Dissect a text into word tokens.

Dissect a text into word tokens, keeping common phrase delimiters.

Strip all HTML tags from a text.

Remove all occurrences of JavaScript from an HTML snippet.

Functions

Removes all junk from a given text, such as JavaScript, HTML, or mixed whitespace.

Example

iex> Scrape.IR.Text.clean("\t hello, \r<b>world</b>!")
"hello, world!"
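
The cleaning can be sketched as a pipeline of the techniques documented further down on this page. This is a hypothetical reconstruction, not the library's actual implementation:

```elixir
# Hypothetical cleaning pipeline: strip scripts, strip tags, then
# collapse whitespace. Not the library's actual code.
clean = fn text ->
  text
  |> then(&Regex.replace(~r/<script[^>]*>.*?<\/script>/s, &1, ""))
  |> then(&Regex.replace(~r/<[^>]*>/, &1, ""))
  |> String.replace(~r/\s+/, " ")
  |> String.trim()
end

clean.("\t hello, \r<b>world</b>!")
# => "hello, world!"
```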

detect_language(text)
detect_language(String.t()) :: :de | :en

Find out in which natural language the given text is written.

Currently only German and (as a fallback) English are valid results. Uses the external library Paasaa.

Example

iex> Scrape.IR.Text.detect_language("the quick brown fox jumps over...")
:en

iex> Scrape.IR.Text.detect_language("Es ist ein schönes Wetter heute...")
:de
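
The library delegates detection to Paasaa; a naive stopword-counting sketch can still illustrate the idea. The word lists below are hypothetical and much too small for real use:

```elixir
# Naive language detection by counting stopword hits per language.
# Illustrative word lists only; Paasaa works differently.
de_stopwords = ~w(der die das und ist ein)
en_stopwords = ~w(the and is a of to)

detect = fn text ->
  tokens = text |> String.downcase() |> String.split(~r/\W+/u, trim: true)
  de_hits = Enum.count(tokens, &(&1 in de_stopwords))
  en_hits = Enum.count(tokens, &(&1 in en_stopwords))
  if de_hits > en_hits, do: :de, else: :en
end

detect.("das ist ein schönes Wetter und die Sonne scheint")
# => :de
```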

extract_summary(text, start_words, language \\ :en)

Dissect a text into sentences, weight their stemmed keywords against each other and return the 3 semantically most important sentences.
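
A minimal sketch of this style of extractive summarization, with a hypothetical keyword-overlap score instead of the library's stemming and weighting:

```elixir
# Hypothetical sketch: score each sentence by how many of the given
# keywords it contains, then keep the top 3 sentences.
text = "Cats purr. Dogs bark loudly. Cats and dogs can be friends. Fish swim."
keywords = ["cats", "dogs"]

summary =
  text
  |> String.split(~r/(?<=[.!?])\s+/, trim: true)
  |> Enum.sort_by(fn sentence ->
    tokens = sentence |> String.downcase() |> String.split(~r/\W+/u, trim: true)
    -Enum.count(tokens, &(&1 in keywords))
  end)
  |> Enum.take(3)
```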


normalize_whitespace(text)
normalize_whitespace(String.t()) :: String.t()

Normalize a text paragraph so that it contains no whitespace other than single spaces between words.

Example

iex> Scrape.IR.Text.normalize_whitespace("\r\thello world\r ")
"hello world"
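
Such normalization could be sketched as follows; this is an assumption about the approach, not the library's code:

```elixir
# Collapse any run of whitespace to a single space, then trim the ends.
normalize = fn text ->
  text
  |> String.replace(~r/\s+/, " ")
  |> String.trim()
end

normalize.("\r\thello   world\r ")
# => "hello world"
```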

semantic_keywords(text, n \\ 20, language \\ :en)

Similar to semantic_tokenize/2, but also determines the n (default: 20) most relevant stemmed tokens from the list.
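
A frequency-based sketch of keyword selection. The stopword list is hypothetical, and the stemming step the library performs is omitted here:

```elixir
# Hypothetical sketch: keep the n most frequent semantic tokens.
stopwords = ~w(a an the and is)
n = 2

keywords =
  "the quick brown fox and the lazy brown dog"
  |> String.downcase()
  |> String.split(~r/\W+/u, trim: true)
  |> Enum.reject(&(&1 in stopwords))
  |> Enum.frequencies()
  |> Enum.sort_by(fn {_token, count} -> -count end)
  |> Enum.take(n)
  |> Enum.map(fn {token, _count} -> token end)
```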


semantic_tokenize(text, language \\ :en)
semantic_tokenize(String.t(), :de | :en) :: [String.t()]

Dissect a text into word tokens, similar to tokenize/1, but strip words that carry no semantic value.

Examples

iex> Scrape.IR.Text.semantic_tokenize("A beautiful day!", :en)
["beautiful", "day"]
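
The underlying idea can be sketched as plain tokenization followed by filtering against a stopword list; the list below is hypothetical:

```elixir
# Tokenize, then drop stopwords. Illustrative stopword list only.
stopwords = ~w(a an the is)

tokens =
  "A beautiful day!"
  |> String.downcase()
  |> String.split(~r/\W+/u, trim: true)
  |> Enum.reject(&(&1 in stopwords))
# => ["beautiful", "day"]
```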

tokenize(text)
tokenize(String.t()) :: [String.t()]

Dissect a text into word tokens.

The resulting list is a list of downcased words with all non-word-characters stripped.

Examples

iex> Scrape.IR.Text.tokenize("Hello, world!")
["hello", "world"]
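
Equivalent behavior can be sketched with standard library calls; this is an illustration, not the actual implementation:

```elixir
# Downcase, then split on runs of non-word characters.
tokens =
  "Hello, world!"
  |> String.downcase()
  |> String.split(~r/\W+/u, trim: true)
# => ["hello", "world"]
```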

tokenize_preserve_delimiters(text)
tokenize_preserve_delimiters(String.t()) :: [String.t()]

Dissect a text into word tokens, keeping common phrase delimiters.

The resulting list is a list of downcased words with all non-word-characters stripped, but common phrase delimiters still included.

Examples

iex> Scrape.IR.Text.tokenize_preserve_delimiters("Hello, world!")
["hello", ",", "world", "!"]
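
One way to keep delimiters is a single alternating regex with Regex.scan/2. This sketch assumes a fixed delimiter set, which may differ from the library's:

```elixir
# Match either a word or a single delimiter character, in order.
tokens =
  ~r/\w+|[,;:.!?]/u
  |> Regex.scan(String.downcase("Hello, world!"))
  |> List.flatten()
# => ["hello", ",", "world", "!"]
```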

without_html(text)
without_html(String.t()) :: String.t()

Strip all HTML tags from a text.

Example

iex> Scrape.IR.Text.without_html("<p>stuff</p>")
"stuff"
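
Tag stripping of this kind is often a simple regex replace. This is a sketch; messy real-world HTML may call for an actual parser such as Floki:

```elixir
# Remove anything that looks like a tag: "<" ... ">".
stripped = Regex.replace(~r/<[^>]*>/, "<p>stuff</p>", "")
# => "stuff"
```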

without_js(text)
without_js(String.t()) :: String.t()

Remove all occurrences of JavaScript from an HTML snippet.

Uses a regex (!), so unusual or malformed markup may not be handled correctly.

Example

iex> Scrape.IR.Text.without_js("a<script>b</script>c")
"ac"
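
The regex-based approach can be sketched as below, assuming well-formed <script> tags; the exact pattern the library uses may differ:

```elixir
# Non-greedily remove each <script ...>...</script> block, including
# its contents. The /s flag lets "." match newlines inside the script.
without_js = fn html ->
  Regex.replace(~r/<script[^>]*>.*?<\/script>/s, html, "")
end

without_js.("a<script>b</script>c")
# => "ac"
```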