lexbor_erl

An Erlang wrapper for the Lexbor HTML parser and DOM library via a port-based architecture.

Overview

lexbor_erl provides safe, fast HTML parsing, CSS selector querying, DOM manipulation, and streaming parser capabilities for Erlang applications. It wraps the high-performance Lexbor C library using a port-based worker pool architecture for isolation, safety, and parallel processing.

Features

  • HTML5-tolerant parsing with automatic error recovery
  • CSS selector queries (class, ID, tag, attributes, combinators, pseudo-classes)
  • DOM manipulation - modify attributes, text content, and tree structure
  • Streaming parser - parse large HTML documents incrementally
  • Stateless operations for quick one-off tasks
  • Stateful document management for complex workflows
  • Parallel processing - worker pool architecture for concurrent operations
  • Safe for the BEAM - crashes in native code don't bring down the VM
  • No atom leaks - all user input stays as binaries

Prerequisites

  • Erlang/OTP (tested with OTP 24+)
  • CMake 3.10+
  • Lexbor library installed on your system

Installing Lexbor

On macOS with Homebrew:

brew install lexbor

On Ubuntu/Debian:

sudo apt-get install liblexbor-dev

Or build from source:

git clone https://github.com/lexbor/lexbor.git
cd lexbor
mkdir build && cd build
cmake ..
make
sudo make install

Building

make

Quick Start

1> lexbor_erl:start().
ok

%% Stateless: parse and serialize
2> {ok, Html} = lexbor_erl:parse_serialize(<<"<div>Hello<span>World">>).
{ok,<<"<html><head></head><body><div>Hello<span>World</span></div></body></html>">>}

%% Stateless: select elements
3> {ok, List} = lexbor_erl:select_html(
     <<"<ul><li class=a>A</li><li class=b>B</li></ul>">>, 
     <<"li.b">>).
{ok,[<<"<li class=\"b\">B</li>">>]}

%% Stateful: parse document
4> {ok, Doc} = lexbor_erl:parse(
     <<"<div id=app><ul><li class=a>A</li><li class=b>B</li></ul></div>">>).
{ok,1}

%% Select nodes
5> {ok, Nodes} = lexbor_erl:select(Doc, <<"#app li">>).
{ok,[{node,140735108544752},{node,140735108544896}]}

%% Get node HTML
6> [lexbor_erl:outer_html(Doc, N) || N <- Nodes].
[{ok,<<"<li class=\"a\">A</li>">>},{ok,<<"<li class=\"b\">B</li>">>}]

%% DOM manipulation: modify attributes
7> {ok, [Li]} = lexbor_erl:select(Doc, <<"li.a">>).
{ok,[{node,140735108544752}]}

8> lexbor_erl:set_attribute(Doc, Li, <<"class">>, <<"modified">>).
ok

9> lexbor_erl:get_attribute(Doc, Li, <<"class">>).
{ok,<<"modified">>}

%% DOM manipulation: modify text content
10> lexbor_erl:set_text(Doc, Li, <<"New Text">>).
ok

11> lexbor_erl:get_text(Doc, Li).
{ok,<<"New Text">>}

%% Content manipulation: append HTML to matching elements
12> {ok, NumModified} = lexbor_erl:append_content(Doc, <<"ul">>, <<"<li>New Item</li>">>).
{ok,1}

%% Html is already bound by command 2, so bind the result to a fresh variable
13> {ok, Html2} = lexbor_erl:serialize(Doc).
{ok,<<"<!DOCTYPE html><html><head></head><body><div id=\"app\"><ul><li class=\"modified\">New Text</li><li class=\"b\">B</li><li>New Item</li></ul></div></body></html>">>}

%% Streaming parser: parse incrementally
14> {ok, Session} = lexbor_erl:parse_stream_begin().
{ok,72057594037927937}

15> ok = lexbor_erl:parse_stream_chunk(Session, <<"<div><p>He">>).
ok

16> ok = lexbor_erl:parse_stream_chunk(Session, <<"llo</p></div>">>).
ok

17> {ok, StreamDoc} = lexbor_erl:parse_stream_end(Session).
{ok,72057594037927938}

%% Release documents
18> ok = lexbor_erl:release(Doc).
ok

19> ok = lexbor_erl:release(StreamDoc).
ok

20> lexbor_erl:stop().
ok

Also check out the examples/ directory.

Supported Operations

Document Lifecycle

  • parse/1 - Parse HTML document, returns document handle
  • release/1 - Release document and free resources
  • serialize/1 - Serialize document to HTML5 binary

Stateless Operations

  • parse_serialize/1 - Parse and serialize in one call (convenience function)
  • select_html/2 - Parse, select elements, return HTML fragments

CSS Selectors

  • select/2 - Find elements using CSS selectors
  • Supports: ID (#id), class (.class), tag (div), attributes ([attr=value])
  • Supports: combinators (descendant: space, child: >, adjacent sibling: +, general sibling: ~) and pseudo-classes (:first-child, :nth-child(), etc.)

DOM Queries

  • outer_html/2 - Get outer HTML of element (including the element tag)
  • inner_html/2 - Get inner HTML of element (children only)

Attribute Manipulation

  • get_attribute/3 - Get attribute value
  • set_attribute/4 - Set attribute value
  • remove_attribute/3 - Remove attribute

Text Content

  • get_text/2 - Get text content recursively
  • set_text/3 - Set text content (removes all children, replaces with text)

HTML Content Manipulation

  • set_inner_html/3 - Replace element's children with parsed HTML
  • append_content/3 - Append HTML content to all elements matching selector
  • prepend_content/3 - Prepend HTML content as the first child of all elements matching selector
  • insert_before_content/3 - Insert HTML as sibling before matched elements
  • insert_after_content/3 - Insert HTML as sibling after matched elements
  • replace_content/3 - Replace matched elements with HTML content

DOM Tree Manipulation

  • create_element/2 - Create new element
  • append_child/3 - Append child node to parent
  • insert_before/4 - Insert node before reference node
  • remove_node/2 - Remove node from tree

Streaming Parser

  • parse_stream_begin/0 - Start streaming parse session
  • parse_stream_chunk/2 - Add HTML chunk to stream
  • parse_stream_end/1 - Finalize stream and get document
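
The three streaming calls can be combined into one helper that feeds a document to the parser piece by piece. This is a sketch under stated assumptions: the module lexbor_stream_sketch and its functions chunks/2 and parse_in_chunks/2 are hypothetical, and the split is purely byte-based. The Quick Start shows that a chunk boundary may fall in the middle of a text node; whether a boundary inside a multi-byte UTF-8 character is handled is not confirmed here.

```erlang
-module(lexbor_stream_sketch).
-export([chunks/2, parse_in_chunks/2]).

%% Split a binary into pieces of at most Size bytes, preserving order.
chunks(Bin, Size) when is_binary(Bin), is_integer(Size), Size > 0 ->
    chunks(Bin, Size, []).

chunks(<<>>, _Size, Acc) ->
    lists:reverse(Acc);
chunks(Bin, Size, Acc) when byte_size(Bin) =< Size ->
    lists:reverse([Bin | Acc]);
chunks(Bin, Size, Acc) ->
    <<Chunk:Size/binary, Rest/binary>> = Bin,
    chunks(Rest, Size, [Chunk | Acc]).

%% Feed Html to the streaming parser Size bytes at a time.
%% Returns the result of parse_stream_end/1; on {ok, Doc} the caller
%% is responsible for calling lexbor_erl:release(Doc).
parse_in_chunks(Html, Size) ->
    {ok, Session} = lexbor_erl:parse_stream_begin(),
    lists:foreach(
        fun(Chunk) ->
            ok = lexbor_erl:parse_stream_chunk(Session, Chunk)
        end,
        chunks(Html, Size)),
    lexbor_erl:parse_stream_end(Session).
```

In a real application the chunks would more likely arrive from a socket or a file read loop than from an in-memory binary; the splitting function only stands in for that source.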

Application Management

  • start/0 - Start lexbor_erl application
  • stop/0 - Stop lexbor_erl application
  • alive/0 - Check if service is running

How to use it in your application?

Add to your rebar.config:

{deps, [
    {lexbor_erl, "0.3.0"}
]}.

Then run:

rebar3 get-deps
rebar3 compile

Note: lexbor_erl is a port-based application and cannot be packaged as an escript. It must be used as a library dependency with access to the compiled C port executable.

See the demo/ directory for a complete working application.

Additional configuration

In your sys.config:

{lexbor_erl, [
  {port_cmd, "priv/lexbor_port"},
  {op_timeout_ms, 3000}
]}.

Parallelism and Concurrency

lexbor_erl uses a worker pool architecture to enable true parallel processing of HTML operations:

Architecture

  • Multiple port workers: Configurable pool of independent C port processes
  • Smart routing:
    • Stateless operations (e.g., parse_serialize/1, select_html/2) use time-based hash distribution for load balancing
    • Stateful operations route by DocId to ensure the same worker handles all operations for a given document
  • Isolation: Each worker process is independent with its own document registry
  • Individual supervision: Each worker is supervised independently - if one crashes, only that worker restarts
  • Fault tolerance: Worker crashes don't affect other workers or the BEAM VM; documents on a crashed worker are lost, but other workers continue serving
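
The two routing rules above can be illustrated with a small, self-contained sketch. It is not the library's implementation. The module lexbor_routing_sketch and all of its functions are hypothetical, and the bit layout is an assumption: the sample handle 72057594037927937 from the Quick Start equals 2^56 + 1, which suggests (but does not prove) that the worker index sits above bit 56 and a per-worker counter below it.

```erlang
-module(lexbor_routing_sketch).
-export([make_doc_id/2, worker_for_doc/1, worker_for_stateless/1]).

%% ASSUMPTION: worker index in the bits from 56 upwards, local id below.
-define(WORKER_SHIFT, 56).

%% Build a document id that carries the index of the owning worker.
make_doc_id(WorkerIdx, LocalId)
  when is_integer(WorkerIdx), WorkerIdx >= 0,
       is_integer(LocalId), LocalId >= 0,
       LocalId < (1 bsl ?WORKER_SHIFT) ->
    (WorkerIdx bsl ?WORKER_SHIFT) bor LocalId.

%% Stateful routing: the owning worker is recovered from the id itself,
%% so the same document always maps to the same worker.
worker_for_doc(DocId) when is_integer(DocId), DocId >= 0 ->
    DocId bsr ?WORKER_SHIFT.

%% Stateless routing: hash a timestamp into the range 1..PoolSize so that
%% successive calls spread across the pool.
worker_for_stateless(PoolSize) when is_integer(PoolSize), PoolSize > 0 ->
    erlang:phash2(erlang:monotonic_time(), PoolSize) + 1.
```

Under this assumed layout, worker_for_doc(72057594037927937) yields 1. Encoding the worker in the handle means no lookup table is needed to find a document's owner, at the cost of handles that are only meaningful while that worker is alive.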

Configuration

Set the pool size in your sys.config:

{lexbor_erl, [
  {pool_size, 8},              % Number of parallel workers (default: scheduler count)
  {op_timeout_ms, 3000}        % Timeout per operation
]}.

Or set it in the application environment before starting the application:

application:set_env(lexbor_erl, pool_size, 8).

Thread Safety and Fault Tolerance

  • Safe by design: Each worker is single-threaded, processing one request at a time
  • No shared state: Documents are isolated to their respective workers
  • Concurrent operations: Multiple workers can process different documents simultaneously
  • Deterministic routing: A document always routes to the same worker via the worker ID encoded in the DocId
  • Individual worker restart: If a worker crashes, only that worker is restarted by the supervisor
  • Limited blast radius: Worker crashes only affect documents on that specific worker
  • Automatic recovery: Crashed workers are automatically restarted and can accept new documents

Performance Characteristics

  • Parallelism: Leverages all CPU cores for concurrent HTML parsing and manipulation
  • No contention: No locks or shared mutable state between workers
  • Linear scaling: Throughput scales approximately linearly with the number of workers (up to the CPU core count)
  • Stateless optimization: Stateless operations (parse_serialize, select_html) can use any available worker
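
To actually exercise several workers at once, requests have to be issued from several Erlang processes. The following is a minimal parallel-map sketch for that purpose; the module lexbor_parallel_sketch and the functions pmap/2 and serialize_all/1 are hypothetical and not part of lexbor_erl. It spawns one process per input, which is only reasonable for moderately sized lists.

```erlang
-module(lexbor_parallel_sketch).
-export([pmap/2, serialize_all/1]).

%% Apply Fun to every element of List, each in its own linked process.
%% Results are returned in input order. Because the processes are linked,
%% a crash in Fun also terminates the caller (unless it traps exits).
pmap(Fun, List) when is_function(Fun, 1), is_list(List) ->
    Parent = self(),
    Refs = [begin
                Ref = make_ref(),
                spawn_link(fun() -> Parent ! {Ref, Fun(X)} end),
                Ref
            end || X <- List],
    %% Selective receive on each reference restores the original order.
    [receive {Ref, Result} -> Result end || Ref <- Refs].

%% Normalize many HTML binaries concurrently with the stateless API.
%% Each element of the result is whatever parse_serialize/1 returned.
serialize_all(Htmls) ->
    pmap(fun lexbor_erl:parse_serialize/1, Htmls).
```

How much this gains over a sequential loop depends on the pool size and the size of the documents; for very large input lists a bounded number of caller processes would be a safer design than one process per item.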

License

LGPL-2.1-or-later

Credits

Built on top of the Lexbor HTML parser library.