bubble_match v0.2.6 BubbleMatch

Bubblescript Matching Language (BML)

BML is a rule language for matching natural language against a rule base. Think of it as regular expressions for sentences. Whereas regular expressions work on individual characters, BML rules primarily work on a tokenized representation of the string.

BML ships with a builtin string tokenizer, but for production usage you should use a language-specific tokenizer, for example by feeding BML the output of Spacy's Doc.to_json function.

This project is still in development, and as such, the BML syntax is still subject to change.

The full documentation on the BML syntax and the API reference are available on hexdocs.pm. To try out BML, check out the demo environment, powered by Phoenix LiveView.

Examples

Matching basic sequences of words:

| Match string | Example | Matches? |
|---|---|---|
| hello world | Hello, world! | yes |
| hello world | Well hello world | yes |
| hello world | hello there world | no |
| hello world | world hello | no |
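
The rows of the table above can be turned into a quick self-check. This is only a sketch: the module name ExamplesCheck and the injectable matcher argument are hypothetical, and the default matcher assumes the bubble_match dependency is available so that BubbleMatch.match/2 can be called with a string query and a string input.

```elixir
defmodule ExamplesCheck do
  # Rows copied from the table above: {query, input, expected to match?}
  @rows [
    {"hello world", "Hello, world!", true},
    {"hello world", "Well hello world", true},
    {"hello world", "hello there world", false},
    {"hello world", "world hello", false}
  ]

  # Returns the rows whose actual outcome differs from the table.
  # `matcher` defaults to BubbleMatch.match/2 (needs the bubble_match
  # dependency); any 2-arity function returning :nomatch or
  # {:match, captures} can be passed instead.
  def mismatches(matcher \\ &BubbleMatch.match/2) do
    Enum.reject(@rows, fn {query, input, expected} ->
      (matcher.(query, input) != :nomatch) == expected
    end)
  end
end
```

With the library installed, ExamplesCheck.mismatches() should return an empty list if the behaviour agrees with the table.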

Matching regular expressions:

| Match string | Example | Matches? |
|---|---|---|
| /[a-z]+/ | abcd | yes |

Matching entities, with Spacy and Duckling preprocessing and tokenizing the input:

| Match string | Matches | Does not match |
|---|---|---|
| [person] | George Baker | Hello world |
| [time] | I walked to the store yesterday | My name is John |

Rules overview

The match syntax is composed of adjacent, and optionally nested, rules. Each individual rule takes one of the following forms:

Basic words

hello world

Basic words are rules consisting of only alphanumeric characters.

Matching is done on both the lowercased, normalized version of the word, and on the lemmatization of the word.

Use a dash (-) to match on compound nouns: was-machine matches all of wasmachine, was-machine and was machine.

Literals

"Literal word sequence"

Matches a literal piece of text, which can span multiple tokens. Matching is case insensitive.

Ignoring tokens: _

hello _ world

A standalone occurrence of _ greedily matches 0 to 5 tokens of any kind.

Stand-alone range specifiers

  • [1] match exactly one token; any token
  • [2+] match 2 or more tokens (greedy)
  • [1-3] match 1 to 3 tokens (greedy)
  • [2+?] match 2 or more tokens (non-greedy)
  • [1-3?] match 1 to 3 tokens (non-greedy)

Entities

Entity tokens: [email] matches a token of type :entity with value.kind == email. Entities are extracted by external means, e.g. by an NLP NER engine like Duckling.

Entities are automatically captured under a variable with the same name as the entity's kind.

Regular expressions

/regex/

Matches the given regex against the sentence. Regexes can span multiple tokens, thus you can match on whitespace and other token separators. Regular expressions are case insensitive.

Regular expression named capture groups are also supported, to capture a specific part of a string: /KL(?<flight_number>\d+)/ matches KL12345 and extracts 12345 as the flight_number capture.
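
The regex from this paragraph can be tried on its own in plain Elixir, without BML. The snippet below uses only the standard Regex module; the i flag mirrors BML's case-insensitive matching. How BML itself exposes the capture (for instance the key type in the captures map) is not stated here, so this only illustrates the regex part.

```elixir
# Same pattern as in the text, with the case-insensitive flag.
regex = ~r/KL(?<flight_number>\d+)/i

# Regex.named_captures/2 returns a map keyed by the group name,
# or nil when the pattern does not match.
captures = Regex.named_captures(regex, "My flight is kl12345")
no_match = Regex.named_captures(regex, "no flight here")

IO.inspect(captures)
IO.inspect(no_match)
```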

OR / grouping construct

  • pizza | fries | chicken - OR-clause on the root level without parens; matches any one of the tokens

  • a ( a | b | c ) - use parentheses to group OR-clauses; matches the token a followed by one of a, b or c

  • ( a )[3+] matches 3 or more tokens consisting of a

  • ( hi | hello )[=greeting] matches 1 token and stores it in greeting

Permutation construct

  • < a b c > matches any permutation of the sequence a b c, e.g. a c b, b a c or c a b

Start / end sentence markers

  • [Start] Matches the start of a sentence
  • [End] Matches the end of a sentence

Word collections ("concepts")

  • @food matches any token in the food collection.
  • @food.subcat matches any token in the given subcategory.

Concept compilation is done as part of the parse phase; the concepts compiler must return an {m, f, a} triple. At runtime, this MFA is called while matching, so it must be a fast function.
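
As a rough illustration of what such a triple could look like, the sketch below defines a hypothetical FoodConcepts module. Everything in it is an assumption: the module and function names, the argument order (token text first, followed by the arguments from the triple), and the use of apply/3 to stand in for the call BubbleMatch would make. A MapSet is used so that the membership check stays cheap.

```elixir
defmodule FoodConcepts do
  # Hypothetical concept store; MapSet gives fast membership checks.
  @food MapSet.new(["pizza", "fries", "chicken"])

  # Hypothetical matcher. The argument order (token text first, then the
  # extra args from the triple) is an assumption, not documented behaviour.
  def member?(text, collection) when is_binary(text) do
    collection == :food and MapSet.member?(@food, String.downcase(text))
  end

  # Hypothetical compiler step: returns a {module, function, args} triple
  # for the given concept name.
  def compile(name) when is_atom(name), do: {__MODULE__, :member?, [name]}
end

# Simulates the runtime call by applying the MFA with a token text prepended.
{m, f, a} = FoodConcepts.compile(:food)
result = apply(m, f, ["Pizza" | a])
IO.inspect(result)
```

How a compiler like this is wired in is governed by the concepts_compiler parse option listed under Types; its exact calling convention should be checked in the API reference.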

Part-of-speech tags (word kinds)

  • %VERB matches any verb
  • %NOUN matches any noun
  • Any other Spacy POS tag is valid as well

Rule modifiers

Any rule can have a [] block which contains a repetition modifier and/or a capture expression.

Entity blocks are automatically captured as the entity kind.

Sentence tokenization

Expression matching works on a per-sentence basis; the idea is that it does not make sense to create expressions that span multiple sentences.

The builtin sentence tokenizer (BubbleMatch.Sentence.Tokenizer) has no concept of sentences, and thus treats each input as a single sentence, even when the input contains periods.

However, the preferred way of using this library is to run the input through an NLP preprocessor like Spacy, which does split an input into individual sentences.

Sigil

For use within Elixir, a ~m sigil is available which parses the given BML query at compile time:

defmodule MyModule do
  use BubbleMatch.Sigil

  def greeting?(input) do
    BubbleMatch.match(~m"hello | hi | howdy", input) != :nomatch
  end
end

Installation

If available in Hex, the package can be installed by adding bubble_match to your list of dependencies in mix.exs:

def deps do
  [
    {:bubble_match, "~> 0.2.6"}
  ]
end

Documentation can be generated with ExDoc and published on HexDocs. Once published, the docs can be found at https://hexdocs.pm/bubble_match.

Summary

Functions

Match a given input against a BML query.

Parse a string into a BML expression.

Parse a string into a BML expression, raises on error.

Types

Specs

input() :: [input()] | String.t() | BubbleMatch.Sentence.t()

Specs

match_result() :: :nomatch | {:match, captures :: map()}

Specs

parse_opt() :: {:expand, boolean()} | {:concepts_compiler, (... -> any())}

Specs

parse_opts() :: [parse_opt()]

Specs

t() :: BubbleMatch

Functions

Specs

match(expr :: t() | String.t(), input :: input()) :: match_result()

Match a given input against a BML query.
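
A minimal sketch of handling both shapes of match_result(). The module name MatchDemo and its helpers are hypothetical; only BubbleMatch.match/2 and the result type come from the specs above, and run/2 assumes bubble_match is available as a dependency.

```elixir
defmodule MatchDemo do
  # Hypothetical helper: turns a match_result() into a printable string.
  def describe(:nomatch), do: "no match"

  def describe({:match, captures}) when is_map(captures) do
    "match with #{map_size(captures)} capture(s)"
  end

  # Runs a BML query (given as a string) against a plain-text input.
  # Requires the bubble_match dependency at runtime.
  def run(query, input) do
    query
    |> BubbleMatch.match(input)
    |> describe()
  end
end
```

For example, MatchDemo.run("hello world", "Hello, world!") should take the {:match, captures} branch, since the examples table lists that pair as a match.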

Specs

parse(expr :: String.t(), parse_opts()) :: {:ok, t()} | {:error, String.t()}

Parse a string into a BML expression.

parse!(expr, opts \\ [])

Specs

parse!(expr :: String.t(), parse_opts()) :: t()

Parse a string into a BML expression, raises on error.
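
The two parse functions differ only in how they report errors, which the following sketch tries to show. ParseDemo and its functions are hypothetical names; the calls to BubbleMatch.parse/2 and BubbleMatch.match/2 follow the specs above and assume the bubble_match dependency is available. Parsing once and reusing the expression presumably avoids re-parsing the query for every input.

```elixir
defmodule ParseDemo do
  # Hypothetical helper: normalises the result of BubbleMatch.parse/2,
  # adding context to the error message.
  def unwrap({:ok, expr}), do: {:ok, expr}

  def unwrap({:error, message}) when is_binary(message) do
    {:error, "invalid BML: " <> message}
  end

  # Parses the query once, then matches every input against the parsed
  # expression. Requires the bubble_match dependency at runtime.
  def match_all(query, inputs) do
    with {:ok, expr} <- unwrap(BubbleMatch.parse(query, [])) do
      {:ok, Enum.map(inputs, &BubbleMatch.match(expr, &1))}
    end
  end
end
```

Use parse/2 as above when the query comes from user input and errors should be handled as values; parse!/2 is the more natural choice for queries that are fixed in the source code, where a raise signals a programming mistake.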