JsonRemedy.Layer3.HardcodedPatterns (JsonRemedy v0.1.11)

View Source

Hard-coded cleanup patterns ported from json_repair Python library.

This module contains additional normalization patterns beyond standard syntax fixes, addressing common edge cases found in LLM output, legacy systems, and data pipelines.

Pattern Categories

1. Extended String Delimiters

  • Converts smart quotes ("", "", ‹›, «») to standard JSON quotes
  • Handles mixed quote styles in same document
  • Context-aware to preserve string content

2. Enhanced Escape Sequence Normalization

  • Converts literal escape sequences (\t, \n, \r, \b, \f) to actual characters
  • Handles Unicode escape sequences (\uXXXX)
  • Handles hexadecimal escape sequences (\xXX)
  • Only processes sequences within string contexts

3. Number Format Normalization

  • Removes thousands separators from numbers (1,234 → 1234)
  • Handles currency format variations
  • Preserves decimals and scientific notation
  • Context-aware to avoid modifying strings

4. Doubled Quote Detection and Repair

  • Fixes ""value"" → "value" patterns
  • Distinguishes doubled quotes from empty strings
  • Handles nested and multiple occurrences

Source Attribution

These patterns are based on the json_repair Python library by Stefano Baccianella: https://github.com/mangiucugna/json_repair

Patterns have been adapted to Elixir's functional paradigm and integrated into JsonRemedy's multi-layer repair architecture.

Usage

Functions can be called independently or as part of Layer 3 processing:

iex> HardcodedPatterns.normalize_smart_quotes(~s({"key": "value"}))
~s({"key": "value"})

iex> HardcodedPatterns.normalize_escape_sequences(~s({"text": "hello\\nworld"}))
~s({"text": "hello\nworld"})

Performance Considerations

  • String replacements use Elixir's optimized binary pattern matching
  • Context tracking minimizes unnecessary processing
  • Nil handling avoids crashes on edge cases
  • Character-by-character parsing only when necessary

Summary

Functions

Fixes doubled quote patterns that sometimes appear in malformed JSON.

Normalizes escape sequences within JSON strings.

Normalizes number formats by removing thousands separators and handling currency.

Converts smart quotes and alternative quote characters to standard JSON double quotes.

Types

json_string()

@type json_string() :: String.t() | nil

Functions

fix_doubled_quotes(input)

@spec fix_doubled_quotes(json_string()) :: json_string()

Fixes doubled quote patterns that sometimes appear in malformed JSON.

Note: This feature is deferred to Layer 5 (Tolerant Parsing) and this function is currently a no-op pass-through.

Rationale for Deferral

After comprehensive TDD investigation, we determined that doubled quote patterns require context-aware parsing beyond regex capabilities. Attempting to fix them with simple regex preprocessing creates more problems than it solves due to:

  1. Ambiguity: Cannot distinguish ""value"" (malformed) from "He said ""hello""" (quotes inside string content) without full parsing context
  2. False matches: Regex patterns match legitimate empty strings, escaped quotes, and other valid JSON constructs
  3. Pipeline interactions: Early preprocessing can interfere with Layer 2's structural analysis and Layer 3's character-by-character parsing

Patterns Requiring Layer 5 Implementation

The following patterns will be addressed in Layer 5 with full JSON state machine and position tracking, following the json_repair Python library's parse_string.py implementation:

  1. Symmetric doubled quotes: ""value"""value"
  2. Asymmetric doubled quotes: ""value" or "value"""value"
  3. Tripled quotes: """value""""value"
  4. Quadruple quotes (empty string): """"""
  5. Doubled quotes in object keys: {""key"": "value"}
  6. Doubled quotes inside string content: "He said ""hello"" to me"

Test Coverage

Comprehensive test suite in test/missing_patterns/doubled_quotes_test.exs (21 tests). Tests are tagged with :layer5_target and excluded from main test run until Layer 5 implementation.

Examples (Future Layer 5 Behavior)

iex> fix_doubled_quotes(~s({"key": ""value""}))
~s({"key": "value"})  # Layer 5 will handle this

iex> fix_doubled_quotes(~s({"empty": ""}))
~s({"empty": ""})  # Currently: pass-through (no change)

iex> fix_doubled_quotes(nil)
nil

normalize_escape_sequences(input)

@spec normalize_escape_sequences(json_string()) :: json_string()

Normalizes escape sequences within JSON strings.

Converts literal escape sequences (like \t, \n) to their actual character representations. Handles Unicode (\uXXXX) and hex (\xXX) escape sequences.

Examples

iex> normalize_escape_sequences(~s({"text": "hello\\tworld"}))
~s({"text": "hello\tworld"})

iex> normalize_escape_sequences(~s({"emoji": "\\u263a"}))
~s({"emoji": "☺"})

iex> normalize_escape_sequences(nil)
nil

normalize_number_formats(input)

@spec normalize_number_formats(json_string()) :: json_string()

Normalizes number formats by removing thousands separators and handling currency.

Converts numbers with commas (1,234) to standard format (1234) while preserving decimals and scientific notation.

Examples

iex> normalize_number_formats(~s({"amount": 1,234}))
~s({"amount": 1234})

iex> normalize_number_formats(~s({"price": -1,234.56}))
~s({"price": -1234.56})

iex> normalize_number_formats(nil)
nil

normalize_smart_quotes(input)

@spec normalize_smart_quotes(json_string()) :: json_string()

Converts smart quotes and alternative quote characters to standard JSON double quotes.

Supports curly quotes (""), angle quotes (‹›), and guillemets («»).

Note: Regular single quotes (') are NOT converted here as they may be legitimate content inside string values. Single quote normalization happens later in Layer 3 character-by-character parsing where context is available.

Examples

iex> normalize_smart_quotes(~s({"key": "value"}))
~s({"key": "value"})

iex> normalize_smart_quotes(~s(«hello»))
~s("hello")

iex> normalize_smart_quotes(nil)
nil

iex> normalize_smart_quotes("")
""