LeXtract.Alignment (lextract v0.1.2)

Aligns extracted entities back to their positions in source text.

This module provides token-based alignment that maps extracted text, which an LLM may have paraphrased or slightly modified, back to its position in the original source text. It tries several matching strategies in turn, falling back to looser ones when the stricter ones find nothing.

Matching Strategies

The alignment process uses these strategies in priority order:

  1. Exact Match - Perfect token sequence match (case-sensitive)
  2. Case-Insensitive Match - Lowercase token comparison
  3. Fuzzy Match - Jaro distance-based similarity matching for minor variations
  4. Partial Match - Substring/contains matching
  5. No Match - No strategy succeeded: find_match/4 returns nil, and align_extraction/3 sets alignment_status: :none
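
The fallback order above can be illustrated with a minimal sketch. It is not LeXtract's implementation: the StrategySketch module, the first_match/3 function, and the strategy atoms are hypothetical names chosen for this example, and it compares two plain strings rather than token sequences. It relies only on String.jaro_distance/2 and String.contains?/2 from the Elixir standard library.

```elixir
defmodule StrategySketch do
  # Illustrative only: compares two plain strings, whereas the module
  # documented here works on token sequences. The atoms below are
  # hypothetical labels, not necessarily LeXtract.AlignmentStatus values.
  @strategies [:exact, :case_insensitive, :fuzzy, :partial]

  # Returns the first strategy that accepts the pair, or nil if none does.
  def first_match(needle, candidate, fuzzy_threshold \\ 0.85) do
    Enum.find(@strategies, &matches?(&1, needle, candidate, fuzzy_threshold))
  end

  defp matches?(:exact, needle, candidate, _threshold), do: needle == candidate

  defp matches?(:case_insensitive, needle, candidate, _threshold),
    do: String.downcase(needle) == String.downcase(candidate)

  defp matches?(:fuzzy, needle, candidate, threshold),
    do: String.jaro_distance(String.downcase(needle), String.downcase(candidate)) >= threshold

  defp matches?(:partial, needle, candidate, _threshold),
    do: String.contains?(String.downcase(candidate), String.downcase(needle))
end
```

Because Enum.find/2 stops at the first strategy that succeeds, stricter strategies always win over looser ones: in this sketch, StrategySketch.first_match("JOHN DOE", "john doe") yields :case_insensitive and never reaches the fuzzy or partial checks.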

Examples

iex> extraction = %LeXtract.Extraction{
...>   extraction_class: "person",
...>   extraction_text: "John Doe",
...>   extraction_index: 0
...> }
iex> {:ok, source_encoding} = LeXtract.Tokenizer.tokenize("The patient John Doe was prescribed...")
iex> aligned = LeXtract.Alignment.align_extraction(extraction, source_encoding)
iex> aligned.alignment_status
:exact

iex> extraction = %LeXtract.Extraction{
...>   extraction_class: "person",
...>   extraction_text: "JOHN DOE",
...>   extraction_index: 0
...> }
iex> {:ok, source_encoding} = LeXtract.Tokenizer.tokenize("Patient: john doe")
iex> aligned = LeXtract.Alignment.align_extraction(extraction, source_encoding)
iex> aligned.alignment_status
:exact

Types

encoding()

@type encoding() :: LeXtract.Tokenizer.encoding()

match_result()

@type match_result() :: %{
  char_interval: LeXtract.CharInterval.t(),
  alignment_status: LeXtract.AlignmentStatus.t()
}

Functions

align_extraction(extraction, source_encoding, opts \\ [])

@spec align_extraction(LeXtract.Extraction.t(), encoding(), keyword()) ::
  LeXtract.Extraction.t()

Aligns an extraction to its position in the source text.

Takes an extraction and source encoding, attempts to find the extraction text in the source using multiple matching strategies, and returns an updated extraction with character interval and alignment status.

Parameters

  • extraction - The extraction to align
  • source_encoding - Token encoding of the source text, as returned by LeXtract.Tokenizer.tokenize/1
  • opts - Keyword list of options (see Options)

Options

  • :fuzzy_threshold - Minimum Jaro similarity for fuzzy matching, 0.0-1.0 (default: 0.85)
  • :min_partial_length - Minimum token overlap for partial matching (default: 2)
  • :max_text_length - Maximum allowed extraction text length (default: 10_000)
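
As a rough illustration of how such options behave, the sketch below merges caller-supplied options over the documented defaults and uses the resulting threshold to gate a Jaro score. The OptionsSketch module and its resolve/1 and fuzzy_match?/3 functions are hypothetical; how LeXtract itself reads or validates these options is not shown in this documentation.

```elixir
defmodule OptionsSketch do
  # Defaults copied from the option list above.
  @defaults %{fuzzy_threshold: 0.85, min_partial_length: 2, max_text_length: 10_000}

  # Merge caller-supplied keyword options over the defaults.
  def resolve(opts \\ []) do
    Map.merge(@defaults, Map.new(opts))
  end

  # Hypothetical gate: accept a candidate only when its Jaro similarity
  # reaches the configured threshold.
  def fuzzy_match?(a, b, opts \\ []) do
    %{fuzzy_threshold: threshold} = resolve(opts)
    String.jaro_distance(a, b) >= threshold
  end
end
```

With the default threshold, a one-letter slip such as "aspirn" against "aspirin" is accepted in this sketch; raising :fuzzy_threshold to 0.99 rejects it.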

Returns

Returns an updated %Extraction{} with char_interval and alignment_status fields populated. If no match is found, returns the extraction with alignment_status: :none and char_interval: nil.

Examples

iex> extraction = %LeXtract.Extraction{
...>   extraction_class: "medication",
...>   extraction_text: "aspirin",
...>   extraction_index: 0
...> }
iex> {:ok, encoding} = LeXtract.Tokenizer.tokenize("Patient takes aspirin daily")
iex> aligned = LeXtract.Alignment.align_extraction(extraction, encoding)
iex> aligned.char_interval
%LeXtract.CharInterval{start_pos: 14, end_pos: 21}
iex> aligned.alignment_status
:exact

find_match(extraction_text, source_encoding, occurrence_index \\ 0, opts \\ [])

@spec find_match(String.t(), encoding(), non_neg_integer(), keyword()) ::
  match_result() | nil

Finds a match for extraction text in source encoding using multiple strategies.

Attempts to find the extraction text in the source text using various matching strategies. The occurrence_index parameter allows selecting a specific occurrence when the text appears multiple times (0-based).

Parameters

  • extraction_text - The text to find in the source
  • source_encoding - Token encoding of the source text
  • occurrence_index - Which occurrence to match (0-based, default: 0)
  • opts - Options (see align_extraction/3)

Returns

Returns a map with :char_interval and :alignment_status, or nil if no match is found.

Examples

iex> {:ok, encoding} = LeXtract.Tokenizer.tokenize("John loves John")
iex> LeXtract.Alignment.find_match("John", encoding, 0)
%{char_interval: %LeXtract.CharInterval{start_pos: 0, end_pos: 4}, alignment_status: :exact}

iex> {:ok, encoding} = LeXtract.Tokenizer.tokenize("John loves John")
iex> result = LeXtract.Alignment.find_match("John", encoding, 1)
iex> result.char_interval.start_pos
11
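
The effect of occurrence_index can be reproduced without the library. The sketch below is a simplified, string-level stand-in (the OccurrenceSketch module and nth_occurrence/3 are hypothetical names): it collects every occurrence with Erlang's :binary.matches/2 and picks one by 0-based index.

```elixir
defmodule OccurrenceSketch do
  # Returns {start_pos, end_pos} for the nth (0-based) occurrence of `text`
  # in `source`, or nil when there are not that many occurrences.
  # :binary.matches/2 reports byte offsets, which equal character offsets
  # only for ASCII text. `text` must be non-empty.
  def nth_occurrence(source, text, occurrence_index \\ 0) do
    case source |> :binary.matches(text) |> Enum.at(occurrence_index) do
      {start, len} -> {start, start + len}
      nil -> nil
    end
  end
end
```

For "John loves John", index 0 gives {0, 4} and index 1 gives {11, 15}, matching the positions in the examples above; index 2 gives nil because Enum.at/2 returns nil for an out-of-range index.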

search_tokens(needle_tokens, haystack_tokens, source_encoding, occurrence_index \\ 0, opts \\ [])

@spec search_tokens(
  [String.t()],
  [String.t()],
  encoding(),
  non_neg_integer(),
  keyword()
) ::
  LeXtract.CharInterval.t() | nil

Searches for a token sequence in source tokens.

Helper function that finds all occurrences of a token sequence and returns the character interval for a specific occurrence.

Parameters

  • needle_tokens - Token sequence to search for
  • haystack_tokens - Source token sequence to search in
  • source_encoding - Source encoding for offset mapping
  • occurrence_index - Which occurrence to return (0-based, default: 0)
  • opts - Matching options (see align_extraction/3)

Returns

Returns a %CharInterval{} for the requested occurrence, or nil if not found.
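
A token-sequence search of this kind can be sketched as a sliding window over tokens that carry their source offsets. Everything here is an assumption made for illustration: the TokenSearchSketch module, the whitespace tokenizer, and the {text, start_pos, end_pos} tuple shape are hypothetical and are not LeXtract's encoding format, and the sketch does exact token comparison only.

```elixir
defmodule TokenSearchSketch do
  # Hypothetical whitespace tokenizer producing {text, start_pos, end_pos}
  # tuples. Offsets are byte offsets (equal to character offsets for ASCII).
  def tokenize(text) do
    ~r/\S+/
    |> Regex.scan(text, return: :index)
    |> Enum.map(fn [{start, len}] -> {binary_part(text, start, len), start, start + len} end)
  end

  # Returns {start_pos, end_pos} spanning the requested occurrence of the
  # needle token sequence, or nil if that occurrence does not exist.
  def search(needle_tokens, haystack, occurrence_index \\ 0)

  def search([], _haystack, _occurrence_index), do: nil

  def search(needle_tokens, haystack, occurrence_index) do
    haystack
    # Slide a window of the needle's length across the source tokens.
    |> Enum.chunk_every(length(needle_tokens), 1, :discard)
    |> Enum.filter(fn window -> Enum.map(window, &elem(&1, 0)) == needle_tokens end)
    |> Enum.at(occurrence_index)
    |> case do
      nil ->
        nil

      window ->
        {_text, start_pos, _end} = List.first(window)
        {_text, _start, end_pos} = List.last(window)
        {start_pos, end_pos}
    end
  end
end
```

Keeping the offsets alongside each token is what lets a match on tokens be translated back into a character span: the span runs from the first matched token's start to the last matched token's end.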