Sifter.Query.Lexer (Sifter v0.2.0)


A lexical analyzer for the Sifter query language that tokenizes search queries.

Sifter provides a query language for filtering data with support for:

  • Field-based predicates with various operators (field:value, field>10)
  • Set operations (field IN (value1, value2))
  • Boolean logic with AND, OR, and NOT operators
  • Quoted strings and bare text search
  • Wildcard prefix/suffix matching (field:prefix*, field:*suffix)
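Taken together, these features compose queries such as the following (the field names here are illustrative, not part of Sifter itself):

```text
status:active priority>3                  (implied AND between predicates)
tags.name IN (urgent, "on hold")          (set operation on a dot-path field)
NOT archived:true title:report*           (negation and a prefix wildcard)
"exact phrase" backlog                    (quoted and bare full-text terms)
```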

Grammar

Query         = [ whitespace ] , [ Term , { ( whitespace , Connective , whitespace | whitespace ) , Term } ] , [ whitespace ] ;

Connective    = "AND" | "OR" ;           (* AND has higher precedence than OR *)

Term          = [ Modifier ] , ( "(" , [ whitespace ] , Query , [ whitespace ] , ")" | Predicate | FullText ) ;

Modifier      = "-" | ( "NOT" , whitespace ) ;   (* "-" has no following space *)

Predicate     = Field , ( ColonOp , ValueOrList | SetOp , List ) ;

ColonOp       = ":" | "<" | "<=" | ">" | ">=" ;

SetOp         = whitespace , "IN" , whitespace | whitespace , "NOT" , whitespace , "IN" , whitespace | whitespace , "ALL" , whitespace ;

Field         = Name , { "." , Name } ;      (* dot paths, e.g. tags.name, project.client.name *)

ValueOrList   = List | Value ;
List          = "(" , [ whitespace ] , Value , { [ whitespace ] , "," , [ whitespace ] , Value } , [ whitespace ] , ")" ;  (* non-empty *)

(* STRICT wildcard rules - only for fielded values:
   field:value*  → starts_with match
   field:*value  → ends_with match
   Note: No middle wildcards like *value* - use FullText for contains-across-fields *)
Value         = PrefixValue | SuffixValue | ScalarValue | NullValue ;
PrefixValue   = ScalarNoStar , "*" ;                       (* starts_with *)
SuffixValue   = "*" , ScalarNoStar ;                       (* ends_with *)
ScalarValue   = Quoted | BareNoStar ;
NullValue     = "NULL" ;

(* Bare terms perform FullText search across configured fields *)
FullText      = Quoted | Bare ;

(* Lexical rules *)
Name          = NameStart , { NameCont } ;
NameStart     = ALNUM | "_" ;
NameCont      = ALNUM | "_" | "-" ;                      (* allow hyphen inside names *)
BareNoStar    = { Visible - Special - "*" }- ;          (* one or more visible chars excluding special and asterisk *)
Bare          = { Visible - Special }- ;                 (* one or more visible chars excluding special *)
Quoted        = "'" , { CharEsc | ? not "'" ? } , "'"
              | '"' , { CharEsc | ? not '"' ? } , '"' ;
CharEsc       = "\" , ? any character ? ;
Special       = whitespace | "(" | ")" | ":" | "<" | ">" | "=" | "," ;
whitespace    = { ? space | tab | carriage return | line feed ? }- ;  (* one or more whitespace chars *)
Visible       = ? any visible character ? ;
ALNUM         = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z" | "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z" | "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;

Behavior Notes

  • Implied AND: Adjacent terms without an explicit connective are joined with AND
  • Case-sensitive keywords: AND, OR, NOT, IN, ALL, NULL are case-sensitive (must be uppercase)
  • Bare text search: Unfielded terms perform "contains" search across configured fields
  • Wildcard constraints: Prefix/suffix wildcards (*) only work in fielded values
  • Forward progress: Every tokenization step consumes ≥1 byte or returns an error
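For instance, under the implied-AND and case-sensitivity rules (with hypothetical fields), the first two queries below are equivalent, while the third treats the lowercase and as a bare full-text term rather than a connective:

```text
status:open priority>2
status:open AND priority>2
status:open and priority>2
```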

Token Types

The lexer produces tokens with the following structure: {type, lexeme, literal, location}

  • type: Atom identifying the token type (:STRING_VALUE, :FIELD_IDENTIFIER, etc.)
  • lexeme: Original text from the source
  • literal: Processed/decoded value (e.g., unescaped strings)
  • location: {byte_offset, byte_length} tuple for source position
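As a sketch, a quoted value at the start of a query would produce a token shaped like this (the lexeme and offsets are illustrative; the atom comes from the token() type below):

```elixir
# {type,          lexeme,             literal,       {byte_offset, byte_length}}
{:STRING_VALUE, ~S("hello world"), "hello world", {0, 13}}
```

Note that the lexeme keeps the surrounding quotes while the literal is the decoded value, and the length (13) counts bytes of the original source text, quotes included.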

Summary

Types

Source location (byte-based): {byte_offset, byte_length}.

Token: {type, lexeme, literal, loc}

Functions

Tokenizes a Sifter query string into a list of tokens for parsing.

Types

byte_length()

@type byte_length() :: non_neg_integer()

byte_offset()

@type byte_offset() :: non_neg_integer()

loc()

@type loc() :: {byte_offset(), byte_length()}

Source location (byte-based): {byte_offset, byte_length}.

token()

@type token() ::
  {:STRING_VALUE, binary(), binary(), loc()}
  | {:FIELD_IDENTIFIER, binary(), binary(), loc()}
  | {:EQUALITY_COMPARATOR, binary(), nil, loc()}
  | {:LESS_THAN_COMPARATOR, binary(), nil, loc()}
  | {:LESS_THAN_OR_EQUAL_TO_COMPARATOR, binary(), nil, loc()}
  | {:GREATER_THAN_COMPARATOR, binary(), nil, loc()}
  | {:GREATER_THAN_OR_EQUAL_TO_COMPARATOR, binary(), nil, loc()}
  | {:SET_IN, binary(), atom(), loc()}
  | {:SET_NOT_IN, binary(), atom(), loc()}
  | {:SET_CONTAINS_ALL, binary(), atom(), loc()}
  | {:AND_CONNECTOR, binary(), binary(), loc()}
  | {:OR_CONNECTOR, binary(), binary(), loc()}
  | {:LEFT_PAREN, binary(), nil, loc()}
  | {:RIGHT_PAREN, binary(), nil, loc()}
  | {:COMMA, binary(), nil, loc()}
  | {:NOT_MODIFIER, binary(), nil, loc()}
  | {:EOF, binary(), nil, loc()}

Token: {type, lexeme, literal, loc}

  • type: atom tag
  • lexeme: exact substring
  • literal: unescaped/decoded value
  • loc: {offset_bytes, length_bytes}

Functions

tokenize(src)

@spec tokenize(String.t()) :: {:ok, [token()]} | {:error, term()}

Tokenizes a Sifter query string into a list of tokens for parsing.

This is the main entry point for the lexer. It processes a query string and produces a list of tokens that can be consumed by Sifter.Query.Parser.

Parameters

  • src - The query string to tokenize

Return Values

  • {:ok, tokens} - The list of tokens on success, always ending with an :EOF token
  • {:error, reason} - Tokenization error with details
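A usage sketch of the entry point (the exact token contents are illustrative; see the token() type for the precise shapes):

```elixir
case Sifter.Query.Lexer.tokenize("status:active") do
  {:ok, tokens} ->
    # Expect a field identifier, a comparator, a value, and a trailing :EOF.
    for {type, lexeme, _literal, {offset, length}} <- tokens do
      IO.puts("#{type} #{inspect(lexeme)} at byte #{offset} (#{length} bytes)")
    end

  {:error, reason} ->
    # Tokenization failed; reason carries the error details.
    IO.inspect(reason, label: "lexer error")
end
```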