Sifter.Query.Lexer (Sifter v0.2.0)


A lexical analyzer for the Sifter query language that tokenizes search queries.

Sifter provides a query language for filtering data with support for:

  • Field-based predicates with various operators (field:value, field>10)
  • Set operations (field IN (value1, value2))
  • Boolean logic with AND, OR, and NOT operators
  • Quoted strings and bare text search
  • Wildcard prefix/suffix matching (field:prefix*, field:*suffix)
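Taken together, these features compose queries such as the following (the field names here are illustrative, not part of Sifter itself):

```text
status:active priority>3                  (implied AND between predicates)
tags.name IN (urgent, "on hold")          (set operation on a dot-path field)
NOT archived:true title:report*           (negation and a prefix wildcard)
"exact phrase" backlog                    (quoted and bare full-text terms)
```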

Grammar

Query         = [ whitespace ] , [ Term , { ( whitespace , Connective , whitespace | whitespace ) , Term } ] , [ whitespace ] ;

Connective    = "AND" | "OR" ;           (* AND has higher precedence than OR *)

Term          = [ Modifier ] , ( "(" , [ whitespace ] , Query , [ whitespace ] , ")" | Predicate | FullText ) ;

Modifier      = "-" | ( "NOT" , whitespace ) ;   (* "-" has no following space *)

Predicate     = Field , ( ColonOp , ValueOrList | SetOp , List ) ;

ColonOp       = ":" | "<" | "<=" | ">" | ">=" ;

SetOp         = whitespace , "IN" , whitespace | whitespace , "NOT" , whitespace , "IN" , whitespace | whitespace , "ALL" , whitespace ;

Field         = Name , { "." , Name } ;      (* dot paths, e.g. tags.name, project.client.name *)

ValueOrList   = List | Value ;
List          = "(" , [ whitespace ] , Value , { [ whitespace ] , "," , [ whitespace ] , Value } , [ whitespace ] , ")" ;  (* non-empty *)

(* STRICT wildcard rules - only for fielded values:
   field:value*  → starts_with match
   field:*value  → ends_with match
   Note: No middle wildcards like *value* - use FullText for contains-across-fields *)
Value         = PrefixValue | SuffixValue | ScalarValue | NullValue ;
PrefixValue   = ScalarNoStar , "*" ;                       (* starts_with *)
SuffixValue   = "*" , ScalarNoStar ;                       (* ends_with *)
ScalarValue   = Quoted | BareNoStar ;
NullValue     = "NULL" ;

(* Bare terms perform FullText search across configured fields *)
FullText      = Quoted | Bare ;

(* Lexical rules *)
Name          = NameStart , { NameCont } ;
NameStart     = ALNUM | "_" ;
NameCont      = ALNUM | "_" | "-" ;                      (* allow hyphen inside names *)
BareNoStar    = { Visible - Special - "*" }- ;          (* one or more visible chars excluding special and asterisk *)
Bare          = { Visible - Special }- ;                 (* one or more visible chars excluding special *)
Quoted        = "'" , { CharEsc | ? not "'" ? } , "'"
              | '"' , { CharEsc | ? not '"' ? } , '"' ;
CharEsc       = "\" , ? any character ? ;
Special       = whitespace | "(" | ")" | ":" | "<" | ">" | "=" | "," ;
whitespace    = { ? space | tab | carriage return | line feed ? }- ;  (* one or more whitespace chars *)
Visible       = ? any visible character ? ;
ALNUM         = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z" | "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z" | "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;

Behavior Notes

  • Implied AND: Adjacent terms without an explicit connective are joined with AND
  • Case-sensitive keywords: AND, OR, NOT, IN, ALL, NULL are case-sensitive (must be uppercase)
  • Bare text search: Unfielded terms perform "contains" search across configured fields
  • Wildcard constraints: Prefix/suffix wildcards (*) only work in fielded values
  • Forward progress: Every tokenization step consumes ≥1 byte or returns an error
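For instance, under the implied-AND and case-sensitivity rules (with hypothetical fields), the first two queries below are equivalent, while the third treats the lowercase and as a bare full-text term rather than a connective:

```text
status:open priority>2
status:open AND priority>2
status:open and priority>2
```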

Token Types

The lexer produces tokens with the following structure: {type, lexeme, literal, location}

  • type: Atom identifying the token type (:STRING_VALUE, :FIELD_IDENTIFIER, etc.)
  • lexeme: Original text from the source
  • literal: Processed/decoded value (e.g., unescaped strings)
  • location: {byte_offset, byte_length} tuple for source position
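As a sketch, a quoted value at the start of a query would produce a token shaped like this (the lexeme and offsets are illustrative; the atom comes from the token() type below):

```elixir
# {type,          lexeme,             literal,       {byte_offset, byte_length}}
{:STRING_VALUE, ~S("hello world"), "hello world", {0, 13}}
```

Note that the lexeme keeps the surrounding quotes while the literal is the decoded value, and the length (13) counts bytes of the original source text, quotes included.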

Summary

Types

Source location (byte-based): {byte_offset, byte_length}.

Token: {type, lexeme, literal, loc}

Functions

Tokenizes a Sifter query string into a list of tokens for parsing.

Types

byte_length()

@type byte_length() :: non_neg_integer()

byte_offset()

@type byte_offset() :: non_neg_integer()

loc()

@type loc() :: {byte_offset(), byte_length()}

Source location (byte-based): {byte_offset, byte_length}.

token()

@type token() ::
  {:STRING_VALUE, binary(), binary(), loc()}
  | {:FIELD_IDENTIFIER, binary(), binary(), loc()}
  | {:EQUALITY_COMPARATOR, binary(), nil, loc()}
  | {:LESS_THAN_COMPARATOR, binary(), nil, loc()}
  | {:LESS_THAN_OR_EQUAL_TO_COMPARATOR, binary(), nil, loc()}
  | {:GREATER_THAN_COMPARATOR, binary(), nil, loc()}
  | {:GREATER_THAN_OR_EQUAL_TO_COMPARATOR, binary(), nil, loc()}
  | {:SET_IN, binary(), atom(), loc()}
  | {:SET_NOT_IN, binary(), atom(), loc()}
  | {:SET_CONTAINS_ALL, binary(), atom(), loc()}
  | {:AND_CONNECTOR, binary(), binary(), loc()}
  | {:OR_CONNECTOR, binary(), binary(), loc()}
  | {:LEFT_PAREN, binary(), nil, loc()}
  | {:RIGHT_PAREN, binary(), nil, loc()}
  | {:COMMA, binary(), nil, loc()}
  | {:NOT_MODIFIER, binary(), nil, loc()}
  | {:EOF, binary(), nil, loc()}

Token: {type, lexeme, literal, loc}

  • type: atom tag
  • lexeme: exact substring
  • literal: unescaped/decoded value
  • loc: {offset_bytes, length_bytes}

Functions

tokenize(src)

@spec tokenize(String.t()) :: {:ok, [token()]} | {:error, term()}

Tokenizes a Sifter query string into a list of tokens for parsing.

This is the main entry point for the lexer. It processes a query string and produces a list of tokens that can be consumed by Sifter.Query.Parser.

Parameters

  • src - The query string to tokenize

Return Values

  • {:ok, tokens} - The list of tokens on success, always ending with an :EOF token
  • {:error, reason} - Tokenization error with details
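A usage sketch of the entry point (the exact token contents are illustrative; see the token() type for the precise shapes):

```elixir
case Sifter.Query.Lexer.tokenize("status:active") do
  {:ok, tokens} ->
    # Expect a field identifier, a comparator, a value, and a trailing :EOF.
    for {type, lexeme, _literal, {offset, length}} <- tokens do
      IO.puts("#{type} #{inspect(lexeme)} at byte #{offset} (#{length} bytes)")
    end

  {:error, reason} ->
    # Tokenization failed; reason carries the error details.
    IO.inspect(reason, label: "lexer error")
end
```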