LeXtract.Alignment (lextract v0.1.2)
Aligns extracted entities back to their positions in source text.
This module provides token-based text alignment for matching extracted text (which may be paraphrased or slightly modified by LLMs) back to the original source text positions. It uses multiple matching strategies with fallback to handle various text transformation scenarios.
Matching Strategies
The alignment process uses these strategies in priority order:
- Exact Match - Perfect token sequence match (case-sensitive)
- Case-Insensitive Match - Lowercase token comparison
- Fuzzy Match - Jaro distance-based similarity matching for minor variations
- Partial Match - Substring/contains matching
- No Match - When no strategy succeeds, find_match/4 returns nil and align_extraction/3 sets alignment_status: :none
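The priority order above can be sketched as a simple cascade. This is a minimal illustration that compares two plain strings rather than token sequences; the module `AlignmentSketch`, the function `first_strategy/2`, and the returned atoms are hypothetical names for the strategies and are not LeXtract's actual `AlignmentStatus` values. The threshold of 0.85 is taken from the documented `:fuzzy_threshold` default.

```elixir
defmodule AlignmentSketch do
  # Hypothetical helper, not part of LeXtract. Mirrors the documented
  # default for :fuzzy_threshold.
  @fuzzy_threshold 0.85

  # Returns the name of the first strategy that succeeds, or nil.
  def first_strategy(needle, candidate) do
    lower_needle = String.downcase(needle)
    lower_candidate = String.downcase(candidate)

    cond do
      # 1. Exact, case-sensitive comparison
      needle == candidate ->
        :exact

      # 2. Comparison after lowercasing both sides
      lower_needle == lower_candidate ->
        :case_insensitive

      # 3. Jaro similarity at or above the threshold
      String.jaro_distance(lower_needle, lower_candidate) >= @fuzzy_threshold ->
        :fuzzy

      # 4. Substring/contains check
      String.contains?(lower_candidate, lower_needle) ->
        :partial

      # 5. Nothing matched
      true ->
        nil
    end
  end
end

IO.inspect(AlignmentSketch.first_strategy("JOHN DOE", "john doe"))
IO.inspect(AlignmentSketch.first_strategy("Doe", "John Doe"))
```

Ordering the checks from strictest to loosest means a looser strategy is only consulted when every stricter one has failed, which is the fallback behaviour the list describes.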
Examples
iex> extraction = %LeXtract.Extraction{
...> extraction_class: "person",
...> extraction_text: "John Doe",
...> extraction_index: 0
...> }
iex> {:ok, source_encoding} = LeXtract.Tokenizer.tokenize("The patient John Doe was prescribed...")
iex> aligned = LeXtract.Alignment.align_extraction(extraction, source_encoding)
iex> aligned.alignment_status
:exact
iex> extraction = %LeXtract.Extraction{
...> extraction_class: "person",
...> extraction_text: "JOHN DOE",
...> extraction_index: 0
...> }
iex> {:ok, source_encoding} = LeXtract.Tokenizer.tokenize("Patient: john doe")
iex> aligned = LeXtract.Alignment.align_extraction(extraction, source_encoding)
iex> aligned.alignment_status
:exact
Types
@type encoding() :: LeXtract.Tokenizer.encoding()
@type match_result() :: %{ char_interval: LeXtract.CharInterval.t(), alignment_status: LeXtract.AlignmentStatus.t() }
Functions
@spec align_extraction(LeXtract.Extraction.t(), encoding(), keyword()) :: LeXtract.Extraction.t()
Aligns an extraction to its position in the source text.
Takes an extraction and source encoding, attempts to find the extraction text in the source using multiple matching strategies, and returns an updated extraction with character interval and alignment status.
Parameters
- extraction - The extraction to align
- source_encoding - Token encoding of the source text from Tokenizer.tokenize/1
Options
- :fuzzy_threshold - Minimum Jaro similarity for fuzzy matching, 0.0-1.0 (default: 0.85)
- :min_partial_length - Minimum token overlap for partial matching (default: 2)
- :max_text_length - Maximum allowed extraction text length (default: 10_000)
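As an illustration of how these options and their defaults combine, the sketch below merges caller-supplied options over the documented defaults. The module `AlignmentOptsSketch` and its `resolve/1` function are hypothetical; how LeXtract itself resolves options is not shown in this page. The sketch relies on `Keyword.validate!/2`, which is available from Elixir 1.13.

```elixir
defmodule AlignmentOptsSketch do
  # Defaults copied from the documented option list.
  @defaults [fuzzy_threshold: 0.85, min_partial_length: 2, max_text_length: 10_000]

  # Hypothetical helper: merges caller options over the defaults and
  # raises ArgumentError when an unknown key is passed.
  def resolve(opts \\ []) do
    Keyword.validate!(opts, @defaults)
  end
end

opts = AlignmentOptsSketch.resolve(fuzzy_threshold: 0.9)
IO.inspect(Keyword.fetch!(opts, :fuzzy_threshold))
IO.inspect(Keyword.fetch!(opts, :min_partial_length))
```

Overriding a single key leaves the other defaults in place, so a caller who only wants stricter fuzzy matching passes `fuzzy_threshold:` alone.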
Returns
Returns an updated %Extraction{} with char_interval and alignment_status
fields populated. If no match is found, returns the extraction with
alignment_status: :none and char_interval: nil.
Examples
iex> extraction = %LeXtract.Extraction{
...> extraction_class: "medication",
...> extraction_text: "aspirin",
...> extraction_index: 0
...> }
iex> {:ok, encoding} = LeXtract.Tokenizer.tokenize("Patient takes aspirin daily")
iex> aligned = LeXtract.Alignment.align_extraction(extraction, encoding)
iex> aligned.char_interval
%LeXtract.CharInterval{start_pos: 14, end_pos: 21}
iex> aligned.alignment_status
:exact
@spec find_match(String.t(), encoding(), non_neg_integer(), keyword()) :: match_result() | nil
Finds a match for extraction text in source encoding using multiple strategies.
Attempts to find the extraction text in the source text using various matching
strategies. The occurrence_index parameter allows selecting a specific
occurrence when the text appears multiple times (0-based).
Parameters
- extraction_text - The text to find in the source
- source_encoding - Token encoding of the source text
- occurrence_index - Which occurrence to match (0-based, default: 0)
- opts - Options (see align_extraction/3)
Returns
Returns a map with :char_interval and :alignment_status, or nil if no
match is found.
Examples
iex> {:ok, encoding} = LeXtract.Tokenizer.tokenize("John loves John")
iex> LeXtract.Alignment.find_match("John", encoding, 0)
%{char_interval: %LeXtract.CharInterval{start_pos: 0, end_pos: 4}, alignment_status: :exact}
iex> {:ok, encoding} = LeXtract.Tokenizer.tokenize("John loves John")
iex> result = LeXtract.Alignment.find_match("John", encoding, 1)
iex> result.char_interval.start_pos
11
@spec search_tokens([String.t()], [String.t()], encoding(), non_neg_integer(), keyword()) :: LeXtract.CharInterval.t() | nil
Searches for a token sequence in source tokens.
Helper function that finds all occurrences of a token sequence and returns the character interval for a specific occurrence.
Parameters
- needle_tokens - Token sequence to search for
- haystack_tokens - Source token sequence to search in
- source_encoding - Source encoding for offset mapping
- occurrence_index - Which occurrence to return (0-based)
- opts - Matching options
Returns
Returns a %CharInterval{} for the requested occurrence, or nil if not found.
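The idea of locating the nth occurrence of a token sequence and mapping it back to character positions can be sketched as follows. This is an assumption-laden illustration, not the library's implementation: `TokenSearchSketch` and `find_occurrence/4` are hypothetical, the `offsets` argument (a list of `{start, end}` character positions, one per token) stands in for whatever offset data the real encoding carries, and the result is a plain tuple rather than a `%CharInterval{}`.

```elixir
defmodule TokenSearchSketch do
  # Hypothetical helper, not part of LeXtract.
  # An empty needle can never match.
  def find_occurrence([], _haystack, _offsets, _occurrence_index), do: nil

  def find_occurrence(needle, haystack, offsets, occurrence_index) do
    n = length(needle)

    haystack
    # Slide a window of the needle's length over the source tokens.
    |> Enum.chunk_every(n, 1, :discard)
    |> Enum.with_index()
    # Keep only the windows that equal the needle exactly.
    |> Enum.filter(fn {window, _start} -> window == needle end)
    # Pick the requested occurrence (0-based); nil when out of range.
    |> Enum.at(occurrence_index)
    |> case do
      nil ->
        nil

      {_window, start} ->
        # Span from the first token's start to the last token's end.
        {first_start, _} = Enum.at(offsets, start)
        {_, last_end} = Enum.at(offsets, start + n - 1)
        {first_start, last_end}
    end
  end
end

tokens = ["John", "loves", "John"]
offsets = [{0, 4}, {5, 10}, {11, 15}]
IO.inspect(TokenSearchSketch.find_occurrence(["John"], tokens, offsets, 1))
```

With the tokens and offsets for "John loves John", occurrence 0 spans characters 0 to 4 and occurrence 1 starts at character 11, which agrees with the find_match/4 examples above.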