lexbor_erl
An Erlang wrapper for the Lexbor HTML parser and DOM library via a port-based architecture.
Overview
lexbor_erl provides safe, fast HTML parsing, CSS selector querying, DOM manipulation, and streaming parser capabilities for Erlang applications. It wraps the high-performance Lexbor C library using a port-based worker pool architecture for isolation, safety, and parallel processing.
Features
- HTML5-tolerant parsing with automatic error recovery
- CSS selector queries (class, ID, tag, attributes, combinators, pseudo-classes)
- DOM manipulation - modify attributes, text content, and tree structure
- Streaming parser - parse large HTML documents incrementally
- Stateless operations for quick one-off tasks
- Stateful document management for complex workflows
- Parallel processing - worker pool architecture for concurrent operations
- Safe for the BEAM - crashes in native code don't bring down the VM
- No atom leaks - all user input stays as binaries
Prerequisites
- Erlang/OTP (tested with OTP 24+)
- CMake 3.10+
- Lexbor library installed on your system
Installing Lexbor
On macOS with Homebrew:
brew install lexbor
On Ubuntu/Debian:
sudo apt-get install liblexbor-dev
Or build from source:
git clone https://github.com/lexbor/lexbor.git
cd lexbor
mkdir build && cd build
cmake ..
make
sudo make install
Building
make
Quick Start
1> lexbor_erl:start().
ok
%% Stateless: parse and serialize
2> {ok, Html} = lexbor_erl:parse_serialize(<<"<div>Hello<span>World">>).
{ok,<<"<html><head></head><body><div>Hello<span>World</span></div></body></html>">>}
%% Stateless: select elements
3> {ok, List} = lexbor_erl:select_html(
<<"<ul><li class=a>A</li><li class=b>B</li></ul>">>,
<<"li.b">>).
{ok,[<<"<li class=\"b\">B</li>">>]}
%% Stateful: parse document
4> {ok, Doc} = lexbor_erl:parse(
<<"<div id=app><ul><li class=a>A</li><li class=b>B</li></ul></div>">>).
{ok,1}
%% Select nodes
5> {ok, Nodes} = lexbor_erl:select(Doc, <<"#app li">>).
{ok,[{node,140735108544752},{node,140735108544896}]}
%% Get node HTML
6> [lexbor_erl:outer_html(Doc, N) || N <- Nodes].
[{ok,<<"<li class=\"a\">A</li>">>},{ok,<<"<li class=\"b\">B</li>">>}]
%% DOM manipulation: modify attributes
7> {ok, [Li]} = lexbor_erl:select(Doc, <<"li.a">>).
{ok,[{node,140735108544752}]}
8> lexbor_erl:set_attribute(Doc, Li, <<"class">>, <<"modified">>).
ok
9> lexbor_erl:get_attribute(Doc, Li, <<"class">>).
{ok,<<"modified">>}
%% DOM manipulation: modify text content
10> lexbor_erl:set_text(Doc, Li, <<"New Text">>).
ok
11> lexbor_erl:get_text(Doc, Li).
{ok,<<"New Text">>}
%% Content manipulation: append HTML to matching elements
12> {ok, NumModified} = lexbor_erl:append_content(Doc, <<"ul">>, <<"<li>New Item</li>">>).
{ok,1}
13> {ok, Serialized} = lexbor_erl:serialize(Doc).
{ok,<<"<!DOCTYPE html><html><head></head><body><div id=\"app\"><ul><li class=\"modified\">New Text</li><li class=\"b\">B</li><li>New Item</li></ul></div></body></html>">>}
%% Streaming parser: parse incrementally
14> {ok, Session} = lexbor_erl:parse_stream_begin().
{ok,72057594037927937}
15> ok = lexbor_erl:parse_stream_chunk(Session, <<"<div><p>He">>).
ok
16> ok = lexbor_erl:parse_stream_chunk(Session, <<"llo</p></div>">>).
ok
17> {ok, StreamDoc} = lexbor_erl:parse_stream_end(Session).
{ok,72057594037927938}
%% Release documents
18> ok = lexbor_erl:release(Doc).
ok
19> ok = lexbor_erl:release(StreamDoc).
ok
20> lexbor_erl:stop().
ok
Also check out the examples/ directory.
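Outside the shell, the stateful calls shown in the Quick Start are usually easier to manage behind a small helper that always releases the document. The following is a minimal sketch, not part of the library: the module name lexbor_with_doc and both functions are hypothetical, and it assumes only the return shapes shown in the transcript above ({ok, Doc}, {ok, Nodes}, {ok, Binary}). Error tuples are not documented here, so they are passed through unchanged.

```erlang
-module(lexbor_with_doc).
-export([with_document/2, list_items/1]).

%% Parse Html, run Fun(Doc), and release the document afterwards,
%% even if Fun raises. Hypothetical helper, not a lexbor_erl function.
with_document(Html, Fun) when is_binary(Html), is_function(Fun, 1) ->
    case lexbor_erl:parse(Html) of
        {ok, Doc} ->
            try
                Fun(Doc)
            after
                lexbor_erl:release(Doc)
            end;
        Other ->
            %% The error shape is not documented in this README;
            %% return whatever parse/1 gave us.
            Other
    end.

%% Example use: collect the outer HTML of every <li> element.
list_items(Html) ->
    with_document(Html, fun(Doc) ->
        {ok, Nodes} = lexbor_erl:select(Doc, <<"li">>),
        %% Keep only nodes whose outer_html/2 call succeeded.
        [H || N <- Nodes, {ok, H} <- [lexbor_erl:outer_html(Doc, N)]]
    end).
```

The try ... after form keeps document handles from accumulating on a worker when the callback fails partway through.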
Supported Operations
Document Lifecycle
- parse/1 - Parse HTML document, returns document handle
- release/1 - Release document and free resources
- serialize/1 - Serialize document to HTML5 binary
Stateless Operations
- parse_serialize/1 - Parse and serialize in one call (convenience function)
- select_html/2 - Parse, select elements, return HTML fragments
CSS Selectors
- select/2 - Find elements using CSS selectors
  - Supports: ID (#id), class (.class), tag (div), attributes ([attr=value])
  - Supports: combinators (Descendant (space), Child >, Adjacent sibling +, General sibling ~), pseudo-classes (:first-child, :nth-child(), etc.)
DOM Queries
- outer_html/2 - Get outer HTML of element (including the element tag)
- inner_html/2 - Get inner HTML of element (children only)
Attribute Manipulation
- get_attribute/3 - Get attribute value
- set_attribute/4 - Set attribute value
- remove_attribute/3 - Remove attribute
Text Content
- get_text/2 - Get text content recursively
- set_text/3 - Set text content (removes all children, replaces with text)
HTML Content Manipulation
- set_inner_html/3 - Replace element's children with parsed HTML
- append_content/3 - Append HTML content to all elements matching selector
- prepend_content/3 - Prepend as first child
- insert_before_content/3 - Insert HTML as sibling before matched elements
- insert_after_content/3 - Insert HTML as sibling after matched elements
- replace_content/3 - Replace matched elements with HTML content
DOM Tree Manipulation
- create_element/2 - Create new element
- append_child/3 - Append child node to parent
- insert_before/4 - Insert node before reference node
- remove_node/2 - Remove node from tree
Streaming Parser
- parse_stream_begin/0 - Start streaming parse session
- parse_stream_chunk/2 - Add HTML chunk to stream
- parse_stream_end/1 - Finalize stream and get document
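As an illustration of how the three streaming calls fit together, here is a minimal sketch that feeds a binary to the parser in fixed-size pieces. The module name lexbor_stream_sketch and the functions chunks/2 and parse_chunked/2 are hypothetical; only the parse_stream_* calls and their ok / {ok, _} return shapes come from the Quick Start. The sketch crashes on any non-ok reply rather than guessing at an error format, and it splits on byte boundaries; whether a boundary may fall inside a multi-byte character is not stated in this README.

```erlang
-module(lexbor_stream_sketch).
-export([chunks/2, parse_chunked/2]).

%% Split a binary into pieces of at most Size bytes (pure helper).
chunks(Bin, Size) when is_binary(Bin), is_integer(Size), Size > 0 ->
    chunks(Bin, Size, []).

chunks(<<>>, _Size, Acc) ->
    lists:reverse(Acc);
chunks(Bin, Size, Acc) when byte_size(Bin) =< Size ->
    %% Last (possibly short) piece.
    lists:reverse([Bin | Acc]);
chunks(Bin, Size, Acc) ->
    <<Chunk:Size/binary, Rest/binary>> = Bin,
    chunks(Rest, Size, [Chunk | Acc]).

%% Feed Html to the streaming parser Size bytes at a time and
%% return whatever parse_stream_end/1 returns ({ok, Doc} on success).
parse_chunked(Html, Size) ->
    {ok, Session} = lexbor_erl:parse_stream_begin(),
    lists:foreach(
        fun(Chunk) ->
            ok = lexbor_erl:parse_stream_chunk(Session, Chunk)
        end,
        chunks(Html, Size)),
    lexbor_erl:parse_stream_end(Session).
```

In a real application the chunks would more likely come from a socket or file read loop; the in-memory split is only there to keep the example self-contained.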
Application Management
- start/0 - Start lexbor_erl application
- stop/0 - Stop lexbor_erl application
- alive/0 - Check if service is running
How to use it in your application?
Add to your rebar.config:
{deps, [
{lexbor_erl, "0.3.0"}
]}.
Then run:
rebar3 get-deps
rebar3 compile
Note: lexbor_erl is a port-based application and cannot be packaged as an escript. It must be used as a library dependency with access to the compiled C port executable.
See the demo/ directory for a complete working application.
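If your project is itself an OTP application, a common way to have a dependency started for you is to list it under applications in the application resource file. The fragment below is a sketch under that assumption: my_app is a hypothetical application name, and the description and version are placeholders.

```erlang
%% src/my_app.app.src (my_app is a hypothetical application name)
{application, my_app, [
    {description, "Example application using lexbor_erl"},
    {vsn, "0.1.0"},
    %% OTP starts the applications listed here before my_app itself.
    {applications, [kernel, stdlib, lexbor_erl]}
]}.
```

With this in place, an explicit lexbor_erl:start() call from your own code should normally not be needed; the demo/ directory remains the authoritative reference for the intended setup.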
Additional configuration
In your sys.config:
{lexbor_erl, [
{port_cmd, "priv/lexbor_port"},
{op_timeout_ms, 3000}
]}.
Parallelism and Concurrency
lexbor_erl uses a worker pool architecture to enable true parallel processing of HTML operations:
Architecture
- Multiple port workers: Configurable pool of independent C port processes
- Smart routing:
  - Stateless operations (e.g., parse_serialize/1, select_html/2) use time-based hash distribution for load balancing
  - Stateful operations route by DocId to ensure the same worker handles all operations for a given document
- Isolation: Each worker process is independent with its own document registry
- Individual supervision: Each worker is supervised independently - if one crashes, only that worker restarts
- Fault tolerance: Worker crashes don't affect other workers or the BEAM VM; documents on a crashed worker are lost, but other workers continue serving
Configuration
Set the pool size in your sys.config:
{lexbor_erl, [
{pool_size, 8}, % Number of parallel workers (default: scheduler count)
{op_timeout_ms, 3000} % Timeout per operation
]}.
Or set it in the application environment before starting the application:
application:set_env(lexbor_erl, pool_size, 8).
Thread Safety and Fault Tolerance
- Safe by design: Each worker is single-threaded, processing one request at a time
- No shared state: Documents are isolated to their respective workers
- Concurrent operations: Multiple workers can process different documents simultaneously
- Deterministic routing: A document always routes to the same worker via the worker ID encoded in the DocId
- Individual worker restart: If a worker crashes, only that worker is restarted by the supervisor
- Limited blast radius: Worker crashes only affect documents on that specific worker
- Automatic recovery: Crashed workers are automatically restarted and can accept new documents
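To make the routing idea concrete, here is a small sketch of how a worker index could be packed into a document ID and recovered again. The bit layout is an assumption, not something this README specifies: the sketch puts the worker index in the top 8 bits of a 64-bit integer and a per-worker counter in the low 56 bits. The Quick Start values are consistent with that layout (72057594037927937 is 2^56 + 1), but the real encoding may differ. The module doc_id_sketch and all four functions are hypothetical.

```erlang
-module(doc_id_sketch).
-export([encode/2, decode/1, worker_for_doc/1, worker_for_stateless/1]).

%% ASSUMED layout: worker index in bits 56..63, local counter below.
-define(WORKER_SHIFT, 56).
-define(LOCAL_MASK, ((1 bsl ?WORKER_SHIFT) - 1)).

%% Pack a worker index and a per-worker counter into one integer.
encode(WorkerIdx, LocalId)
  when WorkerIdx >= 0, WorkerIdx < 256,
       LocalId >= 0, LocalId =< ?LOCAL_MASK ->
    (WorkerIdx bsl ?WORKER_SHIFT) bor LocalId.

%% Recover {WorkerIdx, LocalId} from a packed ID.
decode(DocId) when is_integer(DocId), DocId >= 0 ->
    {DocId bsr ?WORKER_SHIFT, DocId band ?LOCAL_MASK}.

%% Stateful routing: the same DocId always yields the same worker.
worker_for_doc(DocId) ->
    {WorkerIdx, _LocalId} = decode(DocId),
    WorkerIdx.

%% Stateless routing: hash the current time into 0..PoolSize-1.
worker_for_stateless(PoolSize) when is_integer(PoolSize), PoolSize > 0 ->
    erlang:phash2(erlang:monotonic_time(), PoolSize).
```

Because the worker index travels inside the ID, routing a stateful call needs no lookup table and no shared state between workers.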
Performance Characteristics
- Parallelism: Leverages all CPU cores for concurrent HTML parsing and manipulation
- No contention: No locks or shared mutable state between workers
- Linear scaling: Throughput on independent documents scales roughly linearly with the number of workers (up to CPU core count)
- Stateless optimization: Stateless operations (parse_serialize, select_html) can use any available worker
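The worker pool only helps if requests arrive concurrently, so callers need to issue them from more than one Erlang process. The sketch below shows one simple way to do that with a parallel map; the module lexbor_pmap_sketch and its functions are hypothetical, and only parse_serialize/1 comes from the library. It uses spawn_link without a timeout, so a crash in any call takes the caller down with it; a production version would likely add monitors and timeouts.

```erlang
-module(lexbor_pmap_sketch).
-export([pmap/2, normalize_all/1]).

%% Run Fun on every item in its own process and return the results
%% in input order.
pmap(Fun, Items) when is_function(Fun, 1), is_list(Items) ->
    Parent = self(),
    Refs = [begin
                Ref = make_ref(),
                spawn_link(fun() -> Parent ! {Ref, Fun(Item)} end),
                Ref
            end || Item <- Items],
    %% Selective receive on each Ref keeps the original ordering.
    [receive {Ref, Result} -> Result end || Ref <- Refs].

%% Normalize a list of HTML binaries concurrently. Each element of the
%% result is whatever parse_serialize/1 returned for that input.
normalize_all(Htmls) ->
    pmap(fun lexbor_erl:parse_serialize/1, Htmls).
```

Spawning far more processes than there are workers is harmless on the BEAM side, but requests beyond the pool size will presumably queue at the workers, which process one request at a time.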
License
LGPL-2.1-or-later
Credits
Built on top of the Lexbor HTML parser library.