Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

Unreleased

0.1.11 - 2025-12-03

Added

  • Trailing dots truncation handling: Layer 1 now detects and strips trailing dots from truncated JSON content
    • Pattern: {"key": "value............. โ†’ {"key": "value
    • Handles Gemini API max_output_tokens truncation pattern where remaining tokens are filled with dots
    • Threshold: 10+ consecutive trailing dots trigger cleanup (preserves legitimate ellipsis)
    • Works with mixed whitespace/newlines between dots
    • Test status: ✅ 14/14 new tests passing
  • New public API function: ContentCleaning.strip_trailing_dots/1 for direct access to trailing dots removal (usage sketch below)
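
A minimal usage sketch for the new public function; it assumes strip_trailing_dots/1 takes the raw string and returns the cleaned string (the exact return shape may differ):

```elixir
alias JsonRemedy.Layer1.ContentCleaning

# Truncated output padded with a long run of trailing dots (10+ dots trigger cleanup).
truncated = ~s({"key": "value....................)

ContentCleaning.strip_trailing_dots(truncated)
# => "{\"key\": \"value"   (later layers then close the string and braces)
```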

Technical Details

  • Module: Enhanced JsonRemedy.Layer1.ContentCleaning
  • New functions:
    • strip_trailing_dots/1 - Public API for trailing dots removal
    • strip_trailing_dots_internal/1 - Pipeline integration function
    • find_trailing_dots_start/1 - Position detection
    • find_dots_at_end/1 - End-of-string dot detection
    • count_trailing_dots/4 - Dot counting with whitespace tolerance
  • Integration: Runs in Layer 1 pipeline after content extraction, before encoding normalization
  • Performance: Grapheme-based processing for proper UTF-8 handling

Documentation

  • Test case reference: Based on real Gemini API response in docs/20251203/test_case/gemini_max_tokens_trailing_dots.md
  • Layer 1 enhancements: Added trailing dots stripping to content cleaning pipeline

0.1.10 - 2025-10-28

Fixed

  • Windows structural repairs: Treat CRLF pairs as whitespace when scanning for the next significant character so Layer 2 keeps nested braces open on Windows-style inputs reported by CI.

0.1.9 - 2025-10-28

Fixed

  • Multi-item arrays: Layer 2 now closes missing braces and brackets between array elements, avoiding mismatched delimiter swaps during comma handling and restoring proper validation (#8).
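
Illustrative only, assuming the top-level JsonRemedy.repair/1 entry point and a {:ok, decoded} return; the input is a plausible shape of the reported sample, not the exact one:

```elixir
# First element is missing its closing brace before the next element begins.
malformed = ~s([{"id": 1, "name": "widget", {"id": 2, "name": "gadget"}])

{:ok, repaired} = JsonRemedy.repair(malformed)
# => [%{"id" => 1, "name" => "widget"}, %{"id" => 2, "name" => "gadget"}]
```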

Added

  • Regression coverage: Added LF and CRLF variants of the reporter's samples to ensure multi-element arrays with missing terminators stay repaired.

0.1.8 - 2025-10-27

Added

  • HTML metadata: HtmlHandlers.extract_html_content/2 now returns byte counts alongside grapheme totals so every caller gets accurate offsets without recomputing (#6).
  • Examples: New html_metadata_examples.exs and windows_ci_examples.exs scripts document the metadata workflow and Windows CI parity (#6, #7).

Fixed

  • Syntax normalization: Layer 3 consumes the shared metadata directly, removing duplicated byte math around HTML quoting (#6).

CI

  • Windows coverage: Added a windows-2022 PowerShell job to the GitHub Actions matrix that runs mix deps.get, mix compile --warnings-as-errors, and mix test, ensuring CRLF regressions are caught ahead of releases (#7).

0.1.7 - 2025-10-27

Fixed

  • Windows HTML payloads: Layer 3's binary optimization now trims HTML fragments using byte-accurate offsets, ensuring CRLF-terminated bodies are quoted without leaving stray delimiters.
  • Regression coverage: Added a direct regression test mirroring the reporter's CRLF_html.json sample so Windows-style newlines stay guarded.

Added

  • Examples: Extended examples/html_content_examples.exs with a Windows newline scenario to demonstrate the repaired behaviour via mix run.
  • Documentation: HexDocs now surfaces grouped README, changelog, and license pages for easier navigation.

0.1.6 - 2025-10-24

Added

🔢 Advanced Number Edge Case Handling - Critical Pattern Enhancement

Comprehensive support for non-standard number formats commonly found in real-world malformed JSON, inspired by the json_repair Python library.

New Number Patterns Supported:

  • Fractions: {"ratio": 1/3} โ†’ {"ratio": "1/3"} (convert to string)
  • Ranges: {"years": 1990-2020} โ†’ {"years": "1990-2020"} (convert to string)
  • Invalid decimals: {"version": 1.1.1} โ†’ {"version": "1.1.1"} (convert to string)
  • Leading decimals: {"probability": .25} โ†’ {"probability": 0.25} (prepend zero)
  • Text-number hybrids: {"code": 1notanumber} โ†’ {"code": "1notanumber"} (convert to string)
  • Trailing operators: {"value": 1e} โ†’ {"value": 1} (remove incomplete exponent)
  • Trailing decimals: {"num": 1.} โ†’ {"num": 1.0} (complete decimal)
  • Currency symbols: {"price": $100} โ†’ {"price": "$100"} (quote as string)
  • Thousands separators: {"population": 1,234,567} โ†’ {"population": 1234567} (already supported, now enhanced)
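
A few of the patterns above as a hedged usage sketch (assumes the top-level JsonRemedy.repair/1 entry point and a {:ok, decoded} return; exact shapes may differ):

```elixir
{:ok, ratio} = JsonRemedy.repair(~s({"ratio": 1/3}))
# => %{"ratio" => "1/3"}          # fraction converted to string

{:ok, prob} = JsonRemedy.repair(~s({"probability": .25}))
# => %{"probability" => 0.25}     # leading decimal gets a zero prepended

{:ok, pop} = JsonRemedy.repair(~s({"population": 1,234,567}))
# => %{"population" => 1234567}   # thousands separators removed
```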

Implementation Details:

  • Module: Enhanced JsonRemedy.Layer3.BinaryProcessors
  • New functions:
    • consume_number_with_edge_cases/3 - Extended number consumption with special character support
    • analyze_and_normalize_number/2 - Intelligent pattern detection and conversion
  • Character support: Handles /, -, ., currency symbols ($, €, £, ¥), commas, and text
  • Smart detection: Distinguishes negative numbers from ranges, thousands separators from delimiters
  • Test status: ✅ 42/43 tests passing (98% success rate)

๐Ÿ” Pattern Investigation & Documentation

  • Comprehensive analysis: Deep investigation of json_repair Python library patterns
  • Test infrastructure: Created test/missing_patterns/ directory for pattern validation
  • Layer 5 roadmap: Documented patterns requiring state machine implementation:
    • Doubled quotes detection (""value"" → "value")
    • Misplaced quote detection with lookahead
    • Stream stability mode for incomplete JSON
    • Unicode escape normalization
    • Object merge patterns
    • Array extension patterns

Enhanced

  • Layer 3 Syntax Normalization: Expanded number detection to include . and $ triggers
  • Binary Processors: Character-by-character number consumption with edge case awareness
  • Pipeline Architecture: Early hardcoded pattern preprocessing (before Layer 2) to prevent structural misinterpretation
  • Test organization: New :layer5_target tag for deferred features
  • Documentation: Comprehensive rationale for architectural decisions

Fixed

  • Leading decimal numbers: .25 now correctly normalized to 0.25
  • Negative leading decimals: -.5 now correctly normalized to -0.5
  • Fraction detection: 1/3 properly detected and quoted as string
  • Range vs negative: 10-20 (range) distinguished from -20 (negative number)
  • Scientific notation edge cases: Incomplete exponents (1e, 1e-) handled gracefully
  • Number-text hybrids: 123abc properly detected and quoted
  • Multiple decimal points: 1.1.1 correctly identified as invalid and quoted
  • Thousands separator parsing: Only consumes commas followed by exactly 3 digits

Technical Details

  • Pattern consumption: Enhanced binary pattern matching in consume_number_with_edge_cases/3
  • Context-aware normalization: analyze_and_normalize_number/2 with 9 distinct pattern checks
  • Repair tracking: Detailed repair actions for all number normalizations
  • UTF-8 safe: Proper handling of unicode characters in number-like values
  • Zero regressions: All 82 critical tests remain passing

Deferred to Layer 5 (Tolerant Parsing)

The following patterns require full JSON state machine with position tracking and lookahead:

  • Doubled quotes: Context-sensitive quote repair (21 tests tagged :layer5_target)
  • Misplaced quotes: Lookahead analysis for quote-in-quote detection
  • Stream stability: Handling incomplete streaming JSON from LLMs
  • Complex structural issues: Severe malformations requiring aggressive heuristics

Documentation

  • Pattern analysis: Documented 12 missing pattern categories from json_repair comparison
  • Test coverage: Added 64 new tests (43 number edge cases + 21 doubled quotes)
  • Architectural insights: Documented regex limitations and Layer 5 requirements
  • Known limitations: Clear documentation of deferred features with rationale

Test Suite Status

  • Total tests: 618 tests, 0 failures (100% pass rate)
  • Excluded: 63 tests (38 existing + 25 deferred Layer 5 targets)
  • Critical tests: 82/82 passing (100%)
  • Number edge cases: 42/43 passing (98%)
  • New test infrastructure: test/missing_patterns/ directory established

0.1.5 - 2025-10-24

Added

🔄 Pre-processing Pipeline - Major Architectural Enhancement

A new pre-processing stage now runs before the main layer pipeline to handle complex patterns that would otherwise be broken by subsequent layers. This is inspired by the json_repair Python library.

New Pre-processing Modules:

  • MultipleJsonDetector utility: Detects and aggregates consecutive JSON values

    • Pattern: []{} → [[], {}]
    • Prevents Layer 1 from treating subsequent JSON as "wrapper text"
    • Runs first in the pipeline before any layer processing
    • Test status: ✅ 10/10 tests passing
  • ObjectMerger (Layer 3): Merges key-value pairs after premature closing braces

    • Pattern: {"a":"b"},"c":"d"} โ†’ {"a":"b","c":"d"}
    • Handles malformed objects with extra closing braces
    • Merges additional pairs erroneously placed outside objects
    • Test status: ✅ 10/10 tests passing
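
The two pre-processing patterns in one hedged sketch (again assuming JsonRemedy.repair/1 with a {:ok, decoded} return):

```elixir
# MultipleJsonDetector: consecutive JSON values aggregated into an array.
{:ok, combined} = JsonRemedy.repair("[]{}")
# => [[], %{}]

# ObjectMerger: key-value pair after a premature closing brace merged back in.
{:ok, merged} = JsonRemedy.repair(~s({"a":"b"},"c":"d"}))
# => %{"a" => "b", "c" => "d"}
```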

New Layer 3 Filters:

  • EllipsisFilter: Removes unquoted ellipsis (...) placeholders

    • Pattern: [1,2,3,...] → [1,2,3]
    • Common in LLM-generated content to indicate truncation
    • Preserves quoted "..." as valid string values
    • Test status: ✅ 10/10 tests passing
  • KeywordFilter: Removes unquoted comment-like keywords

    • Pattern: {"a":1, COMMENT "b":2} โ†’ {"a":1,"b":2}
    • Filters: COMMENT, SHOULD_NOT_EXIST, DEBUG_INFO, PLACEHOLDER, TODO, FIXME, etc.
    • Test status: ✅ 10/10 tests passing
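
Both filters shown as a hedged sketch via the same assumed public entry point:

```elixir
# EllipsisFilter: unquoted ellipsis placeholder dropped (a quoted "..." survives).
{:ok, list} = JsonRemedy.repair("[1,2,3,...]")
# => [1, 2, 3]

# KeywordFilter: unquoted comment-like keyword removed.
{:ok, map} = JsonRemedy.repair(~s({"a":1, COMMENT "b":2}))
# => %{"a" => 1, "b" => 2}
```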

๐ŸŒ HTML Content Handling in Layer 3

  • HtmlHandlers module: Intelligent detection and quoting of unquoted HTML values
    • DOCTYPE declarations: <!DOCTYPE HTML ...> properly detected and quoted
    • HTML comments: <!-- ... --> handled correctly without breaking tag depth tracking
    • Void elements: Self-closing tags (<meta>, <br>, <hr>, <img>, etc.) tracked without expecting closing tags
    • Self-closing syntax: <tag /> properly recognized and depth-adjusted
    • Nested structures: HTML with JSON-like content in attributes handled correctly
    • Smart extraction: Tracks HTML tag depth to determine end of HTML block
    • Proper escaping: Quotes, newlines, tabs, backslashes all escaped for valid JSON
    • Array support: HTML values in arrays work correctly
    • Test status: ✅ 15/15 tests passing
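
A hedged example of the HTML handling, assuming the usual JsonRemedy.repair/1 call; the HTML sample itself is illustrative:

```elixir
# Unquoted HTML error body after a colon is detected, quoted, and escaped.
malformed =
  ~s({"status": 503, "body": <html><body><h1>Service Unavailable</h1></body></html>})

{:ok, result} = JsonRemedy.repair(malformed)
# => %{
#      "status" => 503,
#      "body" => "<html><body><h1>Service Unavailable</h1></body></html>"
#    }
```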

📚 Documentation & Examples

  • Professional hex-shaped logo: New assets/json_remedy_logo.svg with modern design featuring medical cross and Elixir drop
  • Example documentation: examples/html_content_examples.exs with 5 challenging real-world scenarios
  • Comprehensive test suite: 65 new tests total (40 pattern tests + 15 HTML tests + 10 pre-processing tests)

Enhanced

  • Pre-processing architecture: New stage before layer pipeline prevents pattern interference
  • Layer 1 (ContentCleaning): Smarter trailing wrapper text removal - checks if trailing content is valid JSON before removing
  • Pipeline orchestration: Integrated pre-processing with main repair pipeline for seamless operation
  • Documentation: Logo integrated in README.md and HexDocs
  • Package assets: Logo included in hex package for professional documentation display
  • README.md: Added pre-processing pipeline documentation, HTML handling, and updated pattern status
  • Test coverage: All critical tests passing (82 + 65 new tests = 147 tests, 100% success rate)

Fixed

  • Multiple JSON values: Consecutive JSON values like []{} now properly aggregated
  • Object boundary issues: Extra key-value pairs after closing braces now merged correctly
  • Ellipsis placeholders: Unquoted ... in arrays removed while preserving quoted ellipsis
  • Debug keywords: Comment-like keywords (COMMENT, DEBUG_INFO, etc.) filtered from output
  • HTML in JSON values: Unquoted HTML after colons (e.g., "body":<!DOCTYPE HTML>) now properly quoted and escaped
  • API error pages: Full HTML error responses from APIs (503, 404, etc.) now handled correctly
  • Complex HTML: Nested tags, attributes with quotes, special entities all work properly

Technical Details

  • Pre-processing stage: Runs before Layer 1 to handle patterns that would otherwise break
  • Smart JSON detection: Parses multiple consecutive JSON values with proper position tracking
  • Object boundary analysis: Tracks brace balance to identify and merge split objects
  • Context-aware filtering: Preserves quoted ellipsis and keywords while removing unquoted ones
  • HTML depth tracking: Monitors both HTML tag depth and JSON-like structure depth
  • Context awareness: Only stops at JSON delimiters when all HTML tags are closed
  • Void element list: 15 HTML5 void elements recognized (area, base, br, col, embed, hr, img, input, link, meta, param, source, track, wbr)
  • Binary optimization: HTML detection integrated into Layer 3's binary processing pipeline
  • Zero regressions: All existing tests remain passing

Cleanup

  • Removed temporary test scripts: test_boolean.exs, test_weiss.exs

0.1.4 - 2025-10-07

Added

  • Hardcoded Patterns Module: New JsonRemedy.Layer3.HardcodedPatterns module with battle-tested cleanup patterns ported from Python's json_repair library
    • Smart quotes normalization: Converts curly quotes (" "), guillemets (« »), and angle quotes (‹ ›) to standard JSON quotes
    • Doubled quotes repair: Fixes ""value"" → "value" while preserving empty strings ""
    • Number format normalization: Removes thousands separators from numbers: 1,234,567 → 1234567
    • Unicode escape sequences: Converts \u263a → ☺ (opt-in via :enable_escape_normalization)
    • Hex escape sequences: Converts \x41 → A (opt-in via :enable_escape_normalization)
  • Comprehensive examples: New examples/hardcoded_patterns_examples.exs with 8 detailed examples demonstrating:
    • Smart quotes with international text (French, Japanese, German)
    • Doubled quotes edge cases
    • Number format cleaning while preserving string content
    • Unicode/hex escape handling
    • Combined patterns for real-world LLM output
    • Full Layer 3 pipeline integration
  • Feature flags: Configurable pattern processing with safe defaults
    • :enable_hardcoded_patterns (default: true)
    • :enable_escape_normalization (default: false for safety)
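
A combined sketch of the smart-quote, number, and escape patterns, assuming the feature flags above are passed as plain options to JsonRemedy.repair/2 (the exact option plumbing may differ):

```elixir
# Guillemet-quoted key, thousands separators, and a unicode escape in one payload.
malformed = ~s({«name»: "Chloé", "population": 1,234,567, "smile": "\\u263a"})

# :enable_hardcoded_patterns defaults to true; escape normalization is opt-in.
{:ok, result} = JsonRemedy.repair(malformed, enable_escape_normalization: true)
# => %{"name" => "Chloé", "population" => 1234567, "smile" => "☺"}
```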

Enhanced

  • Layer 3 integration: Hardcoded patterns run as pre-processing step before main syntax normalization
  • Context-aware processing: Number format normalization preserves commas in string values
  • International support: Full UTF-8 support with smart quotes from multiple languages
  • Documentation: README updated with dedicated hardcoded patterns section and attribution to source library

Technical Details

  • Test coverage: 47 new tests for hardcoded patterns (100% pass rate)
  • Source attribution: Patterns ported from json_repair by Stefano Baccianella
  • Architecture: Cleanly organized as Layer 3 subsection with proper separation of concerns
  • Type safety: Full Dialyzer compliance with zero warnings
  • Performance: Regex-based optimizations with minimal overhead

Performance

  • Test suite: 499 total tests, 0 failures (added 47 new tests)
  • Zero regressions: All existing functionality preserved
  • Efficient processing: Smart quotes and number normalization use optimized regex patterns

0.1.3 - 2025-07-05

Fixed

  • Fixed issue where wrapper text following JSON blocks was not recognized (#1)
    • Added dedicated remove_trailing_wrapper_text/1 function in Layer 1
    • Now properly removes trailing text after valid JSON structures
    • Example: [{"id": 1}]\n1 Volume(s) created โ†’ [{"id": 1}]

0.1.2 - 2025-06-08

Added

  • Advanced debugging capabilities: New repair_with_debug/2 function with comprehensive step-by-step instrumentation (usage sketch after this list)
  • Multi-word unquoted value support: Enhanced Layer 3 to handle complex cases like Weiss Savage → "Weiss Savage"
  • Detailed repair reporting: Enhanced logging with position tracking, original values, and replacement details
  • Layer 3 architecture refactor: Modularized syntax normalization into specialized processors:
    • BinaryProcessors: High-performance binary pattern matching
    • CharacterParsers: Context-aware character-by-character parsing
    • QuoteProcessors: Advanced quote normalization
    • RuleProcessors: Rule-based transformation engine
    • SyntaxDetectors: Pattern detection and classification
    • PostProcessors: Final cleanup and validation
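
A hedged sketch of the debug entry point; the shape of the returned debug metadata is an assumption and may differ:

```elixir
# Multi-word unquoted value plus step-by-step instrumentation.
{:ok, result, debug} = JsonRemedy.repair_with_debug(~s({name: Weiss Savage}))

result
# => %{"name" => "Weiss Savage"}

# `debug` is assumed to carry the per-layer steps, positions, original values,
# and replacements described above; inspect it with IO.inspect(debug).
```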

Enhanced

  • Performance optimization: Improved Layer 3 processing with sophisticated multi-word value detection
  • Type safety: Added comprehensive Dialyzer type specifications for all debug functions
  • Test coverage: Added integration tests for real-world JSON repair scenarios
  • Error handling: Enhanced context sanitization for layer module compatibility

Fixed

  • Complex unquoted values: Fixed handling of multi-word unquoted string values with spaces
  • Boolean preservation: Maintained proper boolean/null normalization while adding quote support
  • Structural repair: Improved missing bracket and brace detection
  • Type specifications: Resolved all Dialyzer warnings for enhanced type safety

Technical Details

  • Debug instrumentation: Added process_through_pipeline_with_debug/3 and process_layer_with_debug/5
  • Context management: Enhanced repair context with debug step tracking
  • Performance metrics: Added processing time tracking and repair counting
  • Real-world testing: Added 271-line test file with complex JSON repair scenarios

Performance

  • Efficient processing: ~48ms for 9KB malformed JSON files
  • Minimal repairs: Optimized to make only necessary changes (2 repairs vs 8+ in previous versions)
  • Layer 3 efficiency: Enhanced binary processing with reduced repair overhead

0.1.1 - 2025-06-07

Changed

  • BREAKING: Complete architectural rewrite - Brand new 5-layer pipeline design
  • New layered approach: Regex → State Machine → Character Parsing → Validation → Tolerant Parsing
  • Improved performance: Significantly faster with intelligent fast-path optimization
  • Better reliability: More robust handling of complex malformed JSON
  • Enhanced API: More intuitive function signatures and options
  • Superior test coverage: Comprehensive test suite with real-world scenarios

Added

  • Layer 1 - Content Cleaning: Advanced code fence removal, comment stripping, encoding normalization
  • Layer 2 - Structural Repair: Sophisticated state machine for delimiter repair and object concatenation
  • Layer 3 - Syntax Normalization: Context-aware quote, boolean, and comma normalization
  • Layer 4 - Fast Validation: Jason.decode optimization with early exit for valid JSON
  • Implementation Status Documentation: Clear roadmap and current capabilities
  • Real-world Examples: Comprehensive examples for LLM output, legacy systems, streaming data
  • Advanced Benchmarking: Performance testing suite with memory profiling

Technical Details

  • Complete codebase rewrite: All modules redesigned from ground up
  • New design patterns: LayerBehaviour protocol, Context tracking, Pipeline architecture
  • Enhanced maintainability: Modular design with clear separation of concerns
  • Production readiness: Comprehensive error handling and edge case coverage

Future

  • Layer 5 - Tolerant Parsing: Planned for next major release (aggressive error recovery)

Note

This is a 100% rewrite - all previous code has been replaced with the new layered architecture. While the API maintains compatibility, the internal implementation is entirely new.

0.1.0 - 2025-06-06

Added

  • Initial release of JsonRemedy
  • Binary pattern matching JSON parser with repair capabilities
  • Support for repairing common JSON malformations:
    • Missing quotes around keys and values
    • Single quotes instead of double quotes
    • Trailing commas in arrays and objects
    • Missing commas between elements
    • Incomplete objects and arrays
    • Boolean variants (True/False/TRUE/FALSE)
    • Null variants (None/NULL/Null)
    • Code fence removal (triple-backtick json fences)
    • Comment stripping (// and /* */)
  • Core API functions (see the usage sketch after this list)
  • Optional repair logging with detailed action tracking
  • Multiple parsing strategies (binary patterns, combinators, streaming)
  • Comprehensive test suite with 10 doctests and 25 unit tests
  • Performance benchmarking suite
  • CLI tool with escript support
  • Complete documentation with examples
  • MIT license
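
A hedged end-to-end example covering several of the malformations listed above (assumes JsonRemedy.repair/1 returns {:ok, decoded}):

```elixir
# Unquoted key, single quotes, Python-style booleans/nil, and a trailing comma.
malformed = ~s({name: 'Alice', active: True, scores: [1, 2, 3,], note: None})

{:ok, result} = JsonRemedy.repair(malformed)
# => %{"name" => "Alice", "active" => true, "scores" => [1, 2, 3], "note" => nil}
```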

Performance

  • 4.32M operations/sec for valid JSON parsing
  • 90,000+ operations/sec for malformed JSON repair
  • Sub-microsecond parsing for small valid JSON
  • Minimal memory overhead (< 8KB for repairs)
  • All operations pass performance thresholds