scrape v3.1.0 Scrape.IR.Text
Collection of text mining algorithms, like summarization, classification and clustering.
Algorithmic details are encapsulated so that a clean interface can be provided.
Summary
Functions
Removes all junk from a given text, such as JavaScript, HTML, or mixed whitespace.
Find out in which natural language the given text is written.
Dissect a text into sentences, weight their stemmed keywords against each other and return the 3 semantically most important sentences.
A text paragraph shall not include any whitespace except single spaces between words.
Similar to semantic_tokenize/2, but also determines the n (default: 20) most relevant stemmed tokens from the list.
Dissect a text into word tokens similar to tokenize/1, but strips words that carry no semantic value.
Dissect a text into word tokens.
Dissect a text into word tokens.
Strip all HTML tags from a text.
Remove all occurrences of JavaScript from an HTML snippet.
Functions
clean(text)
Removes all junk from a given text, such as JavaScript, HTML, or mixed whitespace.
Example
iex> Scrape.IR.Text.clean("\t hello, \r<b>world</b>!")
"hello, world!"
detect_language(text)
detect_language(String.t()) :: :de | :en
Find out in which natural language the given text is written.
Currently only German and (as fallback) English are valid results. Uses the external library Paasaa.
Example
iex> Scrape.IR.Text.detect_language("the quick brown fox jumps over...")
:en
iex> Scrape.IR.Text.detect_language("Es ist ein schönes Wetter heute...")
:de
extract_summary(text, start_words, language \\ :en)
Dissect a text into sentences, weight their stemmed keywords against each other and return the 3 semantically most important sentences.
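No example is given here in the original docs; the following is a hedged usage sketch. The start_words argument is assumed to be a list of seed keywords, and the actual sentences returned depend on the input text, so no literal output is shown.

```elixir
text = """
Elixir runs on the Erlang VM. The VM is known for low latency.
Many web scrapers are written in Elixir. Weather is nice today.
"""

# Returns a list with the 3 semantically most important sentences,
# weighted against the given start words.
Scrape.IR.Text.extract_summary(text, ["elixir"], :en)
```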
normalize_whitespace(text)
A text paragraph shall not include any whitespace except single spaces between words.
Example
iex> Scrape.IR.Text.normalize_whitespace("\r\thello world\r ")
"hello world"
semantic_keywords(text, n \\ 20, language \\ :en)
Similar to semantic_tokenize/2, but also determines the n (default: 20) most relevant stemmed tokens from the list.
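No example is given here in the original docs; a hedged sketch of a call with an explicit n. The exact tokens and their order depend on the stemming and weighting internals, so no output is asserted.

```elixir
# Returns up to n (here: 5) stemmed tokens, ranked by semantic relevance.
Scrape.IR.Text.semantic_keywords("The quick brown fox jumps over the lazy dog", 5, :en)
```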
semantic_tokenize(text, language \\ :en)
Dissect a text into word tokens similar to tokenize/1, but strips words that carry no semantic value.
Examples
iex> Scrape.IR.Text.semantic_tokenize("A beautiful day!", :en)
["beautiful", "day"]
tokenize(text)
Dissect a text into word tokens.
The resulting list is a list of downcased words with all non-word-characters stripped.
Examples
iex> Scrape.IR.Text.tokenize("Hello, world!")
["hello", "world"]
tokenize_preserve_delimiters(text)
Dissect a text into word tokens.
The resulting list is a list of downcased words with all non-word-characters stripped, but common phrase delimiters still included.
Examples
iex> Scrape.IR.Text.tokenize_preserve_delimiters("Hello, world!")
["hello", ",", "world", "!"]
without_html(text)
Strip all HTML tags from a text.
Example
iex> Scrape.IR.Text.without_html("<p>stuff</p>")
"stuff"
without_js(text)
Remove all occurrences of JavaScript from an HTML snippet.
Caution: uses a regex (!), not a proper HTML parser.
Example
iex> Scrape.IR.Text.without_js("a<script>b</script>c")
"ac"