BubbleMatch.Sentence (bubble_match v0.6.3)

A data structure which holds a tokenized sentence.

The struct contains the text of the sentence (in the text property) and a list of tokenizations. Normally, a sentence has just one tokenization, but adding entities to the sentence might cause several tokens to be replaced with a single entity token. This creates the need for multiple tokenizations, as you still might want to match on the original sentence, e.g. in the case of a falsely identified entity.
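To make the idea of multiple tokenizations concrete, here is a minimal sketch. The `SentenceSketch` module, the token maps and their keys are hypothetical stand-ins chosen for illustration; only the `text` property and the list of tokenizations come from the description above.

```elixir
defmodule SentenceSketch do
  # Hypothetical stand-in for the real struct, not BubbleMatch's definition.
  defstruct text: "", tokenizations: []

  def new(text, tokenizations) do
    %__MODULE__{text: text, tokenizations: tokenizations}
  end
end

# The plain tokenization of "I saw May" (token shape is an assumption).
words = [
  %{type: :word, raw: "I"},
  %{type: :word, raw: "saw"},
  %{type: :word, raw: "May"}
]

# Suppose an entity extractor (wrongly) tagged the name "May" as a date.
with_entity = List.replace_at(words, 2, %{type: :entity, raw: "May", kind: "time"})

# Both tokenizations are kept side by side.
sentence = SentenceSketch.new("I saw May", [with_entity, words])

# A rule on the literal word "May" can still succeed via the original tokenization.
word_match =
  Enum.any?(sentence.tokenizations, fn tokens ->
    Enum.any?(tokens, &(&1.type == :word and &1.raw == "May"))
  end)
```

Keeping the original tokenization next to the enriched one is what lets a match fall back to the plain words when the entity was identified by mistake.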

Summary

Functions

add_duckling_entities(sentence, entities)

Enrich the given sentence with entities extracted via Duckling.

from_spacy(spacy_json)

Convert a JSON blob from Spacy NLP data into a sentence.

naive_tokenize(input)

Tokenize an input into individual tokens.

Types

Specs

t() :: BubbleMatch.Sentence

Functions

add_duckling_entities(sentence, entities)

Specs

add_duckling_entities(sentence :: t(), entities :: list()) :: t()

Enrich the given sentence with entities extracted via Duckling.

This function takes Duckling's JSON output and enriches the given sentence with the entities that Duckling found.
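The following is a rough sketch of what such an enrichment step could look like, not BubbleMatch's actual implementation. It assumes a decoded Duckling entity is a map with the string keys `"body"`, `"start"`, `"end"`, `"dim"` and `"value"`, and that tokens are plain maps with character offsets. The `DucklingSketch` module, its function names and the token shape are hypothetical.

```elixir
defmodule DucklingSketch do
  # Hypothetical helper, not part of BubbleMatch.
  # Replaces every token that overlaps the entity's character range with one
  # entity token. A token that overlaps only partially is replaced as well.
  def replace_with_entity(tokens, %{"start" => s, "end" => e} = entity) do
    {before, rest} = Enum.split_while(tokens, fn t -> t.stop <= s end)
    {covered, remaining} = Enum.split_while(rest, fn t -> t.start < e end)

    case covered do
      [] ->
        tokens

      _ ->
        entity_token = %{
          type: :entity,
          raw: entity["body"],
          start: s,
          stop: e,
          value: %{kind: entity["dim"], value: entity["value"]}
        }

        before ++ [entity_token] ++ remaining
    end
  end

  # Puts the enriched tokenization first and keeps the existing ones as alternatives.
  def add_entity([first | _] = tokenizations, entity) do
    case replace_with_entity(first, entity) do
      ^first -> tokenizations
      replaced -> [replaced | tokenizations]
    end
  end
end

# Tokens for "call me tomorrow"; :stop is an exclusive end offset (assumed shape).
tokens = [
  %{type: :word, raw: "call", start: 0, stop: 4},
  %{type: :word, raw: "me", start: 5, stop: 7},
  %{type: :word, raw: "tomorrow", start: 8, stop: 16}
]

# Hand-written, abbreviated example of one decoded Duckling result.
entity = %{
  "body" => "tomorrow",
  "start" => 8,
  "end" => 16,
  "dim" => "time",
  "value" => %{"type" => "value", "grain" => "day"}
}

[enriched, original] = DucklingSketch.add_entity([tokens], entity)
```

Here `enriched` ends in a single `:entity` token, while `original` is the untouched word tokenization.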

from_spacy(spacy_json)

Specs

from_spacy(spacy_json :: map()) :: t()

Convert a JSON blob from Spacy NLP data into a sentence.

This function takes the output of Spacy's Doc.to_json function and converts it into a sentence.

Note that the Spacy tokenizer can split the input into multiple sentences. In many cases that split is suboptimal for our use case of chat messages, so we always construct a single sentence.
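As an illustration of the single-sentence behaviour, the sketch below flattens a decoded `Doc.to_json` map into one token list and ignores the `"sents"` boundaries. It assumes the JSON carries `"text"`, `"sents"` and `"tokens"` keys with `"start"`/`"end"` character offsets. The `SpacySketch` module and the resulting map shape are hypothetical, and the input is hand-written rather than real Spacy output.

```elixir
defmodule SpacySketch do
  # Hypothetical helper, not BubbleMatch's implementation.
  # Builds one tokenization from all tokens, regardless of sentence boundaries.
  def single_sentence(%{"text" => text, "tokens" => tokens}) do
    tokens =
      Enum.map(tokens, fn %{"start" => s, "end" => e} ->
        # String.slice/3 counts graphemes; this matches Spacy's character
        # offsets for plain ASCII input like the example below.
        %{raw: String.slice(text, s, e - s), start: s, stop: e}
      end)

    %{text: text, tokenizations: [tokens]}
  end
end

spacy_json = %{
  "text" => "Hi there. How are you?",
  # Two detected sentences, which the sketch deliberately ignores.
  "sents" => [%{"start" => 0, "end" => 9}, %{"start" => 10, "end" => 22}],
  "tokens" => [
    %{"id" => 0, "start" => 0, "end" => 2},
    %{"id" => 1, "start" => 3, "end" => 8},
    %{"id" => 2, "start" => 8, "end" => 9},
    %{"id" => 3, "start" => 10, "end" => 13},
    %{"id" => 4, "start" => 14, "end" => 17},
    %{"id" => 5, "start" => 18, "end" => 21},
    %{"id" => 6, "start" => 21, "end" => 22}
  ]
}

sentence = SpacySketch.single_sentence(spacy_json)
raws = sentence.tokenizations |> hd() |> Enum.map(& &1.raw)
# raws == ["Hi", "there", ".", "How", "are", "you", "?"]
```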

naive_tokenize(input)

Specs

naive_tokenize(input :: String.t()) :: [t()]

Tokenize an input into individual tokens.

As the name suggests, this tokenization is quite naive. It only splits strings on whitespace and punctuation, disregarding any language-specific information. However, for 'basic' use cases, and for our test suite, it is good enough.
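A whitespace-and-punctuation split of this kind can be approximated with a single regular expression. The sketch below is an assumption about the general approach, not the library's tokenizer, and `NaiveSketch` is a hypothetical name. It returns plain strings rather than token structs.

```elixir
defmodule NaiveSketch do
  # Hypothetical helper, not part of BubbleMatch.
  # A token is either a run of word characters or one punctuation character;
  # whitespace is skipped.
  def split(input) when is_binary(input) do
    ~r/\w+|[^\w\s]/u
    |> Regex.scan(input)
    |> Enum.map(&hd/1)
  end
end

tokens = NaiveSketch.split("Hello, world! What's up?")
# tokens == ["Hello", ",", "world", "!", "What", "'", "s", "up", "?"]
```

The way "What's" falls apart into three tokens shows why this kind of splitting is naive: it has no notion of contractions or any other language-specific rule.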