LeXtract.Alignment (lextract v0.1.2)

Aligns extracted entities back to their positions in source text.

This module provides token-based text alignment: it maps extracted text, which an LLM may have paraphrased or slightly modified, back to its position in the original source text. It tries several matching strategies in order, falling back to looser ones when stricter ones find nothing.

Matching Strategies

The alignment process uses these strategies in priority order:

1. Exact Match - Perfect token sequence match (case-sensitive)
2. Case-Insensitive Match - Lowercase token comparison
3. Fuzzy Match - Jaro similarity at or above :fuzzy_threshold, for minor variations
4. Partial Match - Substring/contains matching
5. No Match - find_match/4 returns nil; align_extraction/3 returns the extraction with alignment_status: :none
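
The priority order above can be sketched as a simple fallback chain. This is a minimal illustration on whole strings, not LeXtract's token-based implementation: the module name `StrategySketch`, the function `classify/3`, and the returned atoms are all hypothetical. The atoms name the strategy that matched and are not meant to mirror the values of LeXtract.AlignmentStatus.t().

```elixir
defmodule StrategySketch do
  # Hypothetical illustration; not LeXtract's implementation.
  # Tries each strategy in priority order and stops at the first hit.
  def classify(needle, candidate, fuzzy_threshold \\ 0.85) do
    lower_needle = String.downcase(needle)
    lower_candidate = String.downcase(candidate)

    cond do
      # 1. Exact, case-sensitive comparison
      needle == candidate ->
        :exact

      # 2. Comparison after lowercasing both sides
      lower_needle == lower_candidate ->
        :case_insensitive

      # 3. Jaro similarity (0.0 to 1.0) at or above the threshold
      String.jaro_distance(lower_needle, lower_candidate) >= fuzzy_threshold ->
        :fuzzy

      # 4. The needle appears somewhere inside the candidate
      String.contains?(lower_candidate, lower_needle) ->
        :partial

      # 5. Nothing matched
      true ->
        nil
    end
  end
end
```

Because `cond` evaluates its clauses top to bottom, a stricter strategy always wins over a looser one that would also succeed.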

Examples

iex> extraction = %LeXtract.Extraction{
...>   extraction_class: "person",
...>   extraction_text: "John Doe",
...>   extraction_index: 0
...> }
iex> {:ok, source_encoding} = LeXtract.Tokenizer.tokenize("The patient John Doe was prescribed...")
iex> aligned = LeXtract.Alignment.align_extraction(extraction, source_encoding)
iex> aligned.alignment_status
:exact

iex> extraction = %LeXtract.Extraction{
...>   extraction_class: "person",
...>   extraction_text: "JOHN DOE",
...>   extraction_index: 0
...> }
iex> {:ok, source_encoding} = LeXtract.Tokenizer.tokenize("Patient: john doe")
iex> aligned = LeXtract.Alignment.align_extraction(extraction, source_encoding)
iex> aligned.alignment_status
:exact

Summary

Types

encoding()

match_result()

Functions

align_extraction(extraction, source_encoding, opts \\ [])

Aligns an extraction to its position in the source text.

find_match(extraction_text, source_encoding, occurrence_index \\ 0, opts \\ [])

Finds a match for extraction text in source encoding using multiple strategies.

search_tokens(needle_tokens, haystack_tokens, source_encoding, occurrence_index \\ 0, opts \\ [])

Searches for a token sequence in source tokens.

Types

encoding()

@type encoding() :: LeXtract.Tokenizer.encoding()

match_result()

@type match_result() :: %{
  char_interval: LeXtract.CharInterval.t(),
  alignment_status: LeXtract.AlignmentStatus.t()
}

Functions

align_extraction(extraction, source_encoding, opts \\ [])

@spec align_extraction(LeXtract.Extraction.t(), encoding(), keyword()) ::
  LeXtract.Extraction.t()

Aligns an extraction to its position in the source text.

Takes an extraction and source encoding, attempts to find the extraction text in the source using multiple matching strategies, and returns an updated extraction with character interval and alignment status.

Parameters

extraction - The extraction to align
source_encoding - Token encoding of the source text, as returned by LeXtract.Tokenizer.tokenize/1
opts - Keyword list of options (see Options)

Options

:fuzzy_threshold - Minimum Jaro similarity for fuzzy matching, 0.0-1.0 (default: 0.85)
:min_partial_length - Minimum token overlap for partial matching (default: 2)
:max_text_length - Maximum allowed extraction text length (default: 10_000)
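
To give a feel for what a :fuzzy_threshold of 0.85 accepts, the sketch below compares two strings with Elixir's built-in `String.jaro_distance/2`, which returns a similarity between 0.0 and 1.0. It assumes the threshold is applied as a simple "greater than or equal" check; the module `ThresholdSketch` and `accept?/3` are hypothetical and only illustrate the idea.

```elixir
defmodule ThresholdSketch do
  # Hypothetical helper, not part of LeXtract: accepts a candidate only when
  # its Jaro similarity to the extraction text reaches the threshold.
  def accept?(extraction_text, candidate, threshold \\ 0.85) do
    String.jaro_distance(extraction_text, candidate) >= threshold
  end
end

# "Dwayne" vs "Duane" scores roughly 0.82, so it is rejected at the
# default threshold and accepted once the threshold is lowered to 0.8.
ThresholdSketch.accept?("Dwayne", "Duane")
ThresholdSketch.accept?("Dwayne", "Duane", 0.8)
```

Raising the threshold trades recall for precision: fewer paraphrased extractions align, but those that do are closer to the source wording.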

Returns

Returns an updated %Extraction{} with char_interval and alignment_status fields populated. If no match is found, returns the extraction with alignment_status: :none and char_interval: nil.

Examples

iex> extraction = %LeXtract.Extraction{
...>   extraction_class: "medication",
...>   extraction_text: "aspirin",
...>   extraction_index: 0
...> }
iex> {:ok, encoding} = LeXtract.Tokenizer.tokenize("Patient takes aspirin daily")
iex> aligned = LeXtract.Alignment.align_extraction(extraction, encoding)
iex> aligned.char_interval
%LeXtract.CharInterval{start_pos: 14, end_pos: 21}
iex> aligned.alignment_status
:exact

find_match(extraction_text, source_encoding, occurrence_index \\ 0, opts \\ [])

@spec find_match(String.t(), encoding(), non_neg_integer(), keyword()) ::
  match_result() | nil

Finds a match for extraction text in source encoding using multiple strategies.

Attempts to find the extraction text in the source text using various matching strategies. The occurrence_index parameter allows selecting a specific occurrence when the text appears multiple times (0-based).

Parameters

extraction_text - The text to find in the source
source_encoding - Token encoding of the source text
occurrence_index - Which occurrence to match (0-based, default: 0)
opts - Options (see align_extraction/3)

Returns

Returns a map with :char_interval and :alignment_status, or nil if no match is found.

Examples

iex> {:ok, encoding} = LeXtract.Tokenizer.tokenize("John loves John")
iex> LeXtract.Alignment.find_match("John", encoding, 0)
%{char_interval: %LeXtract.CharInterval{start_pos: 0, end_pos: 4}, alignment_status: :exact}

iex> {:ok, encoding} = LeXtract.Tokenizer.tokenize("John loves John")
iex> result = LeXtract.Alignment.find_match("John", encoding, 1)
iex> result.char_interval.start_pos
11
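
Since find_match/4 returns either a map or nil, callers typically branch on both shapes. The following sketch shows one way to do that with pattern matching. `MatchResultSketch.describe/1` is a hypothetical helper, and plain maps stand in for the library's structs so the snippet runs without LeXtract installed.

```elixir
defmodule MatchResultSketch do
  # Hypothetical helper, not part of LeXtract. Works on any map shaped like
  # match_result(), including one holding a %LeXtract.CharInterval{} struct,
  # because map patterns also match structs.
  def describe(nil), do: "no match"

  def describe(%{alignment_status: status, char_interval: %{start_pos: s, end_pos: e}}) do
    "#{status} at #{s}..#{e}"
  end
end

# Plain maps stand in for the library's structs so the sketch runs on its own.
MatchResultSketch.describe(%{
  alignment_status: :exact,
  char_interval: %{start_pos: 0, end_pos: 4}
})

MatchResultSketch.describe(nil)
```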

search_tokens(needle_tokens, haystack_tokens, source_encoding, occurrence_index \\ 0, opts \\ [])

@spec search_tokens(
  [String.t()],
  [String.t()],
  encoding(),
  non_neg_integer(),
  keyword()
) ::
  LeXtract.CharInterval.t() | nil

Searches for a token sequence in source tokens.

Helper function that finds all occurrences of a token sequence and returns the character interval for a specific occurrence.

Parameters

needle_tokens - Token sequence to search for
haystack_tokens - Source token sequence to search in
source_encoding - Source encoding for offset mapping
occurrence_index - Which occurrence to return (0-based)
opts - Matching options

Returns

Returns a %CharInterval{} for the requested occurrence, or nil if not found.
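
The idea of locating the nth occurrence of a token sequence and mapping it to character positions can be sketched as follows. This is an assumption-laden illustration, not the library's code: `TokenSearchSketch.nth_occurrence/4` is hypothetical, it performs exact token comparison only, and it takes a plain list of `{start, end}` character offsets (end exclusive, one pair per haystack token) in place of the real source encoding.

```elixir
defmodule TokenSearchSketch do
  # Hypothetical illustration; LeXtract's search_tokens/5 may work differently.
  # `offsets` is an assumed list of {start, end} character offsets, one per
  # haystack token, with the end position exclusive.
  def nth_occurrence(needle, haystack, offsets, occurrence_index \\ 0)

  # An empty needle never matches.
  def nth_occurrence([], _haystack, _offsets, _occurrence_index), do: nil

  def nth_occurrence(needle, haystack, offsets, occurrence_index) do
    n = length(needle)

    haystack
    # Slide a window of n tokens across the haystack, one token at a time.
    |> Enum.chunk_every(n, 1, :discard)
    |> Enum.with_index()
    # Keep only the windows equal to the needle.
    |> Enum.filter(fn {window, _start} -> window == needle end)
    # Pick the requested occurrence (0-based); nil if there are too few.
    |> Enum.at(occurrence_index)
    |> case do
      nil ->
        nil

      {_window, start} ->
        # Span from the first matched token's start to the last one's end.
        {first_char, _} = Enum.at(offsets, start)
        {_, last_char} = Enum.at(offsets, start + n - 1)
        {first_char, last_char}
    end
  end
end

# Tokens and offsets for "John loves John", written out by hand.
tokens = ["John", "loves", "John"]
offsets = [{0, 4}, {5, 10}, {11, 15}]

TokenSearchSketch.nth_occurrence(["John"], tokens, offsets, 0)
TokenSearchSketch.nth_occurrence(["John"], tokens, offsets, 1)
```

With these inputs the two calls select the first and second "John", which correspond to the start positions 0 and 11 shown in the find_match/4 examples.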