Sifter.Query.Lexer (Sifter v0.2.0)
A lexical analyzer for the Sifter query language that tokenizes search queries.
Sifter provides a query language for filtering data with support for:
- Field-based predicates with various operators (`field:value`, `field>10`)
- Set operations (`field IN (value1, value2)`)
- Boolean logic with `AND`, `OR`, and `NOT` operators
- Quoted strings and bare text search
- Wildcard prefix/suffix matching (`field:prefix*`, `field:*suffix`)
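For instance, the following strings are all valid Sifter queries, one per feature above; the field names (`status`, `priority`, `name`) are hypothetical, chosen only for illustration:

```elixir
# One representative query per supported feature; the field names are
# illustrative and not part of Sifter itself.
queries = [
  "status:active",                  # field predicate with ":"
  "priority>10",                    # relational comparator
  "status IN (open, pending)",      # set membership
  "status:active AND NOT archived", # boolean logic with a NOT modifier
  "\"exact phrase\"",               # quoted full-text search
  "name:report*",                   # prefix wildcard (starts_with)
  "name:*draft"                     # suffix wildcard (ends_with)
]

Enum.each(queries, &IO.puts/1)
```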
Grammar
Query = [ whitespace ] , [ Term , { ( whitespace , Connective , whitespace | whitespace ) , Term } ] , [ whitespace ] ;
Connective = "AND" | "OR" ; (* AND has higher precedence than OR *)
Term = [ Modifier ] , ( "(" , [ whitespace ] , Query , [ whitespace ] , ")" | Predicate | FullText ) ;
Modifier = "-" | "NOT" , whitespace ; (* "-" has no following space *)
Predicate = Field , ( ColonOp , ValueOrList | SetOp , List ) ;
ColonOp = ":" | "<" | "<=" | ">" | ">=" ;
SetOp = whitespace , "IN" , whitespace | whitespace , "NOT" , whitespace , "IN" , whitespace | whitespace , "ALL" , whitespace ;
Field = Name , { "." , Name } ; (* dot paths, e.g. tags.name, project.client.name *)
ValueOrList = List | Value ;
List = "(" , [ whitespace ] , Value , { [ whitespace ] , "," , [ whitespace ] , Value } , [ whitespace ] , ")" ; (* non-empty *)
(* STRICT wildcard rules - only for fielded values:
field:value* → starts_with match
field:*value → ends_with match
Note: No middle wildcards like *value* - use FullText for contains-across-fields *)
Value = PrefixValue | SuffixValue | ScalarValue | NullValue ;
PrefixValue = ScalarNoStar , "*" ; (* starts_with *)
SuffixValue = "*" , ScalarNoStar ; (* ends_with *)
ScalarValue = Quoted | BareNoStar ;
NullValue = "NULL" ;
(* Bare terms perform FullText search across configured fields *)
FullText = Quoted | Bare ;
(* Lexical rules *)
Name = NameStart , { NameCont } ;
NameStart = ALNUM | "_" ;
NameCont = ALNUM | "_" | "-" ; (* allow hyphen inside names *)
BareNoStar = { Visible - Special - "*" }- ; (* one or more visible chars excluding special and asterisk *)
Bare = { Visible - Special }- ; (* one or more visible chars excluding special *)
Quoted = "'" , { CharEsc | ? not "'" ? } , "'"
| '"' , { CharEsc | ? not '"' ? } , '"' ;
CharEsc = "\" , ? any character ? ;
Special = whitespace | "(" | ")" | ":" | "<" | ">" | "=" | "," ;
whitespace = { ? space | tab | carriage return | line feed ? }- ; (* one or more whitespace chars *)
Visible = ? any visible character ? ;
ALNUM = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z" | "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z" | "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;

Behavior Notes
- Implied AND: Missing connectives between terms default to an AND operation
- Case-sensitive keywords: `AND`, `OR`, `NOT`, `IN`, `ALL`, and `NULL` are case-sensitive (must be uppercase)
- Bare text search: Unfielded terms perform a "contains" search across configured fields
- Wildcard constraints: Prefix/suffix wildcards (`*`) only work in fielded values
- Forward progress: Every tokenization step consumes at least one byte or returns an error
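The implied-AND and case-sensitivity rules interact: a lowercase connective is just another bare term. A sketch using hypothetical field names:

```elixir
# These two queries describe the same logical structure, because a
# missing connective between terms defaults to AND.
implicit = "status:active priority>3"
explicit = "status:active AND priority>3"

# Keywords are case-sensitive, so lowercase "and" is NOT a connective
# here: it is lexed as a bare full-text term sitting between the two
# predicates, each joined to it by an implied AND.
lowercase = "status:active and priority>3"

IO.puts(implicit)
IO.puts(explicit)
IO.puts(lowercase)
```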
Token Types
The lexer produces tokens with the following structure: `{type, lexeme, literal, location}`
- `type`: Atom identifying the token type (`:STRING_VALUE`, `:FIELD_IDENTIFIER`, etc.)
- `lexeme`: Original text from the source
- `literal`: Processed/decoded value (e.g., unescaped strings)
- `location`: `{byte_offset, byte_length}` tuple for source position
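A token can be taken apart by pattern matching on the four-tuple. The values below are illustrative (a quoted value `'Ada'` whose decoded literal drops the quotes), not output captured from the lexer:

```elixir
# Illustrative token: the lexeme keeps the original source text
# (including quotes), the literal holds the decoded value, and the
# location is a {byte_offset, byte_length} tuple.
token = {:STRING_VALUE, "'Ada'", "Ada", {5, 5}}

{type, lexeme, literal, {offset, length}} = token
IO.inspect({type, lexeme, literal, offset, length})
```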
Summary
Functions
Tokenizes a Sifter query string into a list of tokens for parsing.
Types
@type byte_length() :: non_neg_integer()
@type byte_offset() :: non_neg_integer()
@type loc() :: {byte_offset(), byte_length()}
Source location (byte-based): {byte_offset, byte_length}.
@type token() ::
        {:STRING_VALUE, binary(), binary(), loc()}
        | {:FIELD_IDENTIFIER, binary(), binary(), loc()}
        | {:EQUALITY_COMPARATOR, binary(), nil, loc()}
        | {:LESS_THAN_COMPARATOR, binary(), nil, loc()}
        | {:LESS_THAN_OR_EQUAL_TO_COMPARATOR, binary(), nil, loc()}
        | {:GREATER_THAN_COMPARATOR, binary(), nil, loc()}
        | {:GREATER_THAN_OR_EQUAL_TO_COMPARATOR, binary(), nil, loc()}
        | {:SET_IN, binary(), atom(), loc()}
        | {:SET_NOT_IN, binary(), atom(), loc()}
        | {:SET_CONTAINS_ALL, binary(), atom(), loc()}
        | {:AND_CONNECTOR, binary(), binary(), loc()}
        | {:OR_CONNECTOR, binary(), binary(), loc()}
        | {:LEFT_PAREN, binary(), nil, loc()}
        | {:RIGHT_PAREN, binary(), nil, loc()}
        | {:COMMA, binary(), nil, loc()}
        | {:NOT_MODIFIER, binary(), nil, loc()}
        | {:EOF, binary(), nil, loc()}
Token: {type, lexeme, literal, loc}
- type: atom tag
- lexeme: exact substring
- literal: unescaped/decoded value
- loc: {offset_bytes, length_bytes}
Functions
Tokenizes a Sifter query string into a list of tokens for parsing.
This is the main entry point for the lexer. It processes a query string and produces
a list of tokens that can be consumed by Sifter.Query.Parser.
Parameters
- `src` - The query string to tokenize
Return Values
- `{:ok, tokens}` - Successfully tokenized list of tokens, always ending with an `:EOF` token
- `{:error, reason}` - Tokenization error with details
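Putting this together, a minimal usage sketch. The function's name is not shown on this page, so `tokenize/1` below is an assumption about the entry point, not a confirmed API:

```elixir
# Hypothetical call; `tokenize/1` is an assumed name for the lexer's
# entry point described above.
case Sifter.Query.Lexer.tokenize("status:active AND priority>3") do
  {:ok, tokens} ->
    # A successful result always ends with an :EOF token.
    {:EOF, _lexeme, nil, _loc} = List.last(tokens)
    tokens

  {:error, reason} ->
    # Tokenization failed; reason carries the error details.
    reason
end
```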