lexbor_erl

An Erlang wrapper for the Lexbor HTML parser and DOM library, built on a port-based architecture.

Overview

lexbor_erl provides safe, fast HTML parsing, CSS selector querying, DOM manipulation, and streaming parser capabilities for Erlang applications. It wraps the high-performance Lexbor C library using a port-based worker pool architecture for isolation, safety, and parallel processing.

Features

HTML5-tolerant parsing with automatic error recovery
CSS selector queries (class, ID, tag, attributes, combinators, pseudo-classes)
DOM manipulation - modify attributes, text content, and tree structure
Streaming parser - parse large HTML documents incrementally
Stateless operations for quick one-off tasks
Stateful document management for complex workflows
Parallel processing - worker pool architecture for concurrent operations
Safe for the BEAM - crashes in native code don't bring down the VM
No atom leaks - all user input stays as binaries

Prerequisites

Erlang/OTP (tested with OTP 24+)
CMake 3.10+
Lexbor library installed on your system

Installing Lexbor

On macOS with Homebrew:

brew install lexbor

On Ubuntu/Debian:

sudo apt-get install liblexbor-dev

Or build from source:

git clone https://github.com/lexbor/lexbor.git
cd lexbor
mkdir build && cd build
cmake ..
make
sudo make install

Building

make

Quick Start

1> lexbor_erl:start().
ok

%% Stateless: parse and serialize
2> {ok, Html} = lexbor_erl:parse_serialize(<<"<div>Hello<span>World">>).
{ok,<<"<html><head></head><body><div>Hello<span>World</span></div></body></html>">>}

%% Stateless: select elements
3> {ok, List} = lexbor_erl:select_html(
     <<"<ul><li class=a>A</li><li class=b>B</li></ul>">>, 
     <<"li.b">>).
{ok,[<<"<li class=\"b\">B</li>">>]}

%% Stateful: parse document
4> {ok, Doc} = lexbor_erl:parse(
     <<"<div id=app><ul><li class=a>A</li><li class=b>B</li></ul></div>">>).
{ok,1}

%% Select nodes
5> {ok, Nodes} = lexbor_erl:select(Doc, <<"#app li">>).
{ok,[{node,140735108544752},{node,140735108544896}]}

%% Get node HTML
6> [lexbor_erl:outer_html(Doc, N) || N <- Nodes].
[{ok,<<"<li class=\"a\">A</li>">>},{ok,<<"<li class=\"b\">B</li>">>}]

%% DOM manipulation: modify attributes
7> {ok, [Li]} = lexbor_erl:select(Doc, <<"li.a">>).
{ok,[{node,140735108544752}]}

8> lexbor_erl:set_attribute(Doc, Li, <<"class">>, <<"modified">>).
ok

9> lexbor_erl:get_attribute(Doc, Li, <<"class">>).
{ok,<<"modified">>}

%% DOM manipulation: modify text content
10> lexbor_erl:set_text(Doc, Li, <<"New Text">>).
ok

11> lexbor_erl:get_text(Doc, Li).
{ok,<<"New Text">>}

%% Content manipulation: append HTML to matching elements
12> {ok, NumModified} = lexbor_erl:append_content(Doc, <<"ul">>, <<"<li>New Item</li>">>).
{ok,1}

%% Html is already bound by prompt 2, so bind the result to a fresh variable
13> {ok, Html2} = lexbor_erl:serialize(Doc).
{ok,<<"<!DOCTYPE html><html><head></head><body><div id=\"app\"><ul><li class=\"modified\">New Text</li><li class=\"b\">B</li><li>New Item</li></ul></div></body></html>">>}

%% Streaming parser: parse incrementally
14> {ok, Session} = lexbor_erl:parse_stream_begin().
{ok,72057594037927937}

15> ok = lexbor_erl:parse_stream_chunk(Session, <<"<div><p>He">>).
ok

16> ok = lexbor_erl:parse_stream_chunk(Session, <<"llo</p></div>">>).
ok

17> {ok, StreamDoc} = lexbor_erl:parse_stream_end(Session).
{ok,72057594037927938}

%% Release documents
18> ok = lexbor_erl:release(Doc).
ok

19> ok = lexbor_erl:release(StreamDoc).
ok

20> lexbor_erl:stop().
ok

Also check out the examples/ directory.

Supported Operations

Document Lifecycle

parse/1 - Parse HTML document, returns document handle
release/1 - Release document and free resources
serialize/1 - Serialize document to HTML5 binary
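
Since a parsed document holds resources until release/1 is called, it can help to pair parse/1 and release/1 in one place. The following is a minimal sketch, not part of lexbor_erl: the module name lexbor_scoped and the helpers with_document/2 and with_resource/3 are hypothetical. It assumes parse/1 returns {ok, Doc} on success, as in the Quick Start, and passes any other return value through unchanged.

```erlang
-module(lexbor_scoped).
-export([with_document/2, with_resource/3]).

%% Parse Html, run Fun(Doc), and release the document afterwards,
%% even if Fun raises. Hypothetical helper, not a lexbor_erl function.
with_document(Html, Fun) when is_binary(Html), is_function(Fun, 1) ->
    with_resource(
      fun() -> lexbor_erl:parse(Html) end,
      fun(Doc) -> lexbor_erl:release(Doc) end,
      Fun).

%% Generic acquire/use/release pattern built on try ... after.
%% Acquire must return {ok, Resource}; anything else is returned as-is
%% and neither Use nor Release is called.
with_resource(Acquire, Release, Use) ->
    case Acquire() of
        {ok, Resource} ->
            try
                Use(Resource)
            after
                Release(Resource)
            end;
        Other ->
            Other
    end.
```

For example, lexbor_scoped:with_document(Html, fun(Doc) -> lexbor_erl:serialize(Doc) end) would return whatever serialize/1 returns, with the document released before the call returns.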

Stateless Operations

parse_serialize/1 - Parse and serialize in one call (convenience function)
select_html/2 - Parse, select elements, return HTML fragments

CSS Selectors

select/2 - Find elements using CSS selectors
Supports: ID (#id), class (.class), tag (div), attributes ([attr=value])
Supports: combinators (descendant (whitespace), child >, adjacent sibling +, general sibling ~), pseudo-classes (:first-child, :nth-child(), etc.)

DOM Queries

outer_html/2 - Get outer HTML of element (including the element tag)
inner_html/2 - Get inner HTML of element (children only)

Attribute Manipulation

get_attribute/3 - Get attribute value
set_attribute/4 - Set attribute value
remove_attribute/3 - Remove attribute

Text Content

get_text/2 - Get text content recursively
set_text/3 - Set text content (removes all children, replaces with text)

HTML Content Manipulation

set_inner_html/3 - Replace element's children with parsed HTML
append_content/3 - Append HTML content to all elements matching selector
prepend_content/3 - Prepend HTML content as first child of all elements matching selector
insert_before_content/3 - Insert HTML as sibling before matched elements
insert_after_content/3 - Insert HTML as sibling after matched elements
replace_content/3 - Replace matched elements with HTML content

DOM Tree Manipulation

create_element/2 - Create new element
append_child/3 - Append child node to parent
insert_before/4 - Insert node before reference node
remove_node/2 - Remove node from tree

Streaming Parser

parse_stream_begin/0 - Start streaming parse session
parse_stream_chunk/2 - Add HTML chunk to stream
parse_stream_end/1 - Finalize stream and get document
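
The three streaming calls can be combined into a small loop. The sketch below is illustrative only: the module name lexbor_stream_sketch and the functions chunks/2 and parse_in_chunks/2 are hypothetical. It assumes the return values shown in the Quick Start ({ok, Session}, ok, {ok, Doc}). Chunks are split on byte boundaries; whether a chunk may end in the middle of a multi-byte character is not stated here, so treat that as unverified.

```erlang
-module(lexbor_stream_sketch).
-export([parse_in_chunks/2, chunks/2]).

%% Feed Html to the streaming parser in pieces of at most Size bytes
%% and return the result of parse_stream_end/1.
parse_in_chunks(Html, Size) when is_binary(Html) ->
    {ok, Session} = lexbor_erl:parse_stream_begin(),
    lists:foreach(
      fun(Chunk) ->
              ok = lexbor_erl:parse_stream_chunk(Session, Chunk)
      end,
      chunks(Html, Size)),
    lexbor_erl:parse_stream_end(Session).

%% Split a binary into a list of binaries of at most Size bytes each.
%% The last chunk may be shorter; an empty binary yields [].
chunks(Bin, Size) when is_binary(Bin), is_integer(Size), Size > 0 ->
    case Bin of
        <<Chunk:Size/binary, Rest/binary>> when Rest =/= <<>> ->
            [Chunk | chunks(Rest, Size)];
        <<>> ->
            [];
        _ ->
            [Bin]
    end.
```

In a real application the chunks would more likely come from a socket or file read loop than from a binary already in memory; the splitting helper only stands in for that source.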

Application Management

start/0 - Start lexbor_erl application
stop/0 - Stop lexbor_erl application
alive/0 - Check if service is running

How to use it in your application?

Add to your rebar.config:

{deps, [
    {lexbor_erl, "0.3.0"}
]}.

Then run:

rebar3 get-deps
rebar3 compile

Note: lexbor_erl is a port-based application and cannot be packaged as an escript. It must be used as a library dependency with access to the compiled C port executable.

See the demo/ directory for a complete working application.

Additional configuration

In your sys.config:

[
  {lexbor_erl, [
    {port_cmd, "priv/lexbor_port"},
    {op_timeout_ms, 3000}
  ]}
].

Parallelism and Concurrency

lexbor_erl uses a worker pool architecture so that HTML operations can be processed in parallel across multiple port processes.

Architecture

Multiple port workers: Configurable pool of independent C port processes
Smart routing:
- Stateless operations (e.g., parse_serialize/1, select_html/2) use time-based hash distribution for load balancing
- Stateful operations route by DocId to ensure the same worker handles all operations for a given document
Isolation: Each worker process is independent with its own document registry
Individual supervision: Each worker is supervised independently - if one crashes, only that worker restarts
Fault tolerance: Worker crashes don't affect other workers or the BEAM VM. Documents on a crashed worker are lost, but other workers continue serving requests.
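
The routing rules above can be illustrated with a small sketch. The actual DocId layout is not documented in this README, so everything here is an assumption: the module lexbor_routing_sketch, its functions, zero-based worker numbering, and the choice of the top 8 bits of a 64-bit integer for the worker ID. That layout is merely one that is consistent with the stream handle 72057594037927937 (2^56 + 1) shown in the Quick Start. The stateless picker hashes a time-derived value with erlang:phash2/2, which is one possible reading of "time-based hash distribution".

```erlang
-module(lexbor_routing_sketch).
-export([encode/2, worker_of/1, local_id/1, pick_stateless/1]).

%% ASSUMPTION: worker ID lives in the top 8 bits of a 64-bit DocId.
-define(WORKER_SHIFT, 56).
-define(LOCAL_MASK, ((1 bsl ?WORKER_SHIFT) - 1)).

%% Combine a worker ID and a per-worker document counter into one DocId.
encode(WorkerId, LocalId)
  when is_integer(WorkerId), WorkerId >= 0, WorkerId < 256,
       is_integer(LocalId), LocalId >= 0, LocalId =< ?LOCAL_MASK ->
    (WorkerId bsl ?WORKER_SHIFT) bor LocalId.

%% Stateful routing: recover the owning worker from the DocId alone.
worker_of(DocId) when is_integer(DocId), DocId >= 0 ->
    DocId bsr ?WORKER_SHIFT.

%% The per-worker part of the DocId.
local_id(DocId) when is_integer(DocId), DocId >= 0 ->
    DocId band ?LOCAL_MASK.

%% Stateless routing: hash a time-derived value onto 0..PoolSize-1.
pick_stateless(PoolSize) when is_integer(PoolSize), PoolSize > 0 ->
    erlang:phash2(erlang:monotonic_time(), PoolSize).
```

Under these assumptions, worker_of(72057594037927937) returns 1 and local_id(72057594037927937) returns 1. Encoding the worker in the handle means no lookup table is needed to route a stateful call, which fits the "no shared state" design described in this section.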

Configuration

Set the pool size in your sys.config:

[
  {lexbor_erl, [
    {pool_size, 8},              % Number of parallel workers (default: scheduler count)
    {op_timeout_ms, 3000}        % Timeout per operation
  ]}
].

Or set it in the application environment before starting the application:

application:set_env(lexbor_erl, pool_size, 8).
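
Ordering matters here: the value has to be in place before the pool starts. In addition, values set with application:set_env/3 before an application is loaded can be overridden by the application resource file on load; the {persistent, true} option of application:set_env/4 guards against that. The sketch below shows one way to combine the two steps. The module lexbor_boot_sketch and its functions are hypothetical, and it assumes lexbor_erl reads pool_size when lexbor_erl:start/0 runs.

```erlang
-module(lexbor_boot_sketch).
-export([configure/1, start/1]).

%% Store the pool size in the application environment.
%% persistent = true keeps the value from being replaced by defaults
%% in the .app file when the application is loaded later.
configure(PoolSize) when is_integer(PoolSize), PoolSize > 0 ->
    ok = application:set_env(lexbor_erl, pool_size, PoolSize,
                             [{persistent, true}]).

%% Configure first, then start the application.
%% ASSUMPTION: pool_size is read during lexbor_erl:start/0.
start(PoolSize) ->
    ok = configure(PoolSize),
    lexbor_erl:start().
```

A call such as lexbor_boot_sketch:start(erlang:system_info(schedulers_online)) would size the pool to the number of online schedulers.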

Thread Safety and Fault Tolerance

Safe by design: Each worker is single-threaded, processing one request at a time
No shared state: Documents are isolated to their respective workers
Concurrent operations: Multiple workers can process different documents simultaneously
Deterministic routing: A document always routes to the same worker via the worker ID encoded in the DocId
Individual worker restart: If a worker crashes, only that worker is restarted by the supervisor
Limited blast radius: Worker crashes only affect documents on that specific worker
Automatic recovery: Crashed workers are automatically restarted and can accept new documents

Performance Characteristics

Parallelism: Leverages all CPU cores for concurrent HTML parsing and manipulation
No contention: No locks or shared mutable state between workers
Linear scaling: Performance scales linearly with the number of workers (up to CPU core count)
Stateless optimization: Stateless operations (parse_serialize, select_html) can use any available worker
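
To benefit from the pool, requests have to be issued from several Erlang processes at once. A minimal sketch of that pattern follows; the module lexbor_pmap_sketch and both functions are hypothetical, and pmap/2 is a plain spawn-and-collect helper rather than anything provided by lexbor_erl. Workers are linked to the caller, so a crash in Fun also terminates the caller unless it traps exits.

```erlang
-module(lexbor_pmap_sketch).
-export([pmap/2, serialize_all/1]).

%% Apply Fun to every item in its own process and collect the results
%% in the original order.
pmap(Fun, Items) when is_function(Fun, 1), is_list(Items) ->
    Parent = self(),
    Refs = [begin
                Ref = make_ref(),
                spawn_link(fun() -> Parent ! {Ref, Fun(Item)} end),
                Ref
            end || Item <- Items],
    %% Selective receive on each reference preserves input order.
    [receive {Ref, Result} -> Result end || Ref <- Refs].

%% Run the stateless parse_serialize/1 over many inputs concurrently.
serialize_all(Htmls) when is_list(Htmls) ->
    pmap(fun lexbor_erl:parse_serialize/1, Htmls).
```

This spawns one process per input, which is fine for modest batches; for very large lists a bounded number of concurrent callers closer to pool_size is probably a better fit.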

License

LGPL-2.1-or-later

Credits

Built on top of the Lexbor HTML parser library.
