lexbor_erl (lexbor_erl v0.3.0)

Erlang wrapper for the Lexbor HTML parser, implemented over a port interface.

This module provides a high-level API for parsing, querying, and manipulating HTML documents using the Lexbor C library. It supports both stateless operations (for one-shot parsing) and stateful operations (for multiple queries on the same document).

The implementation uses a pool of worker processes, each managing an independent C port to the Lexbor library. This provides:

- True parallelism across CPU cores
- Fault isolation: a crash in one worker does not affect the others
- Thread safety through message passing
- Automatic recovery from crashes

Example Usage

  % Start the application
  ok = lexbor_erl:start().
 
  % Stateless: parse and normalize HTML
  {ok, CleanHtml} = lexbor_erl:parse_serialize(<<"<p>Hello</p>">>).
 
  % Stateless: parse and select elements
  {ok, Elements} = lexbor_erl:select_html(<<"<div><p>A</p><p>B</p></div>">>, <<"p">>).
 
  % Stateful: parse once, query multiple times
  {ok, Doc} = lexbor_erl:parse(<<"<html><body><p class='a'>Hello</p></body></html>">>),
  {ok, Nodes} = lexbor_erl:select(Doc, <<"p.a">>),
  {ok, Html} = lexbor_erl:outer_html(Doc, hd(Nodes)),
  ok = lexbor_erl:release(Doc).
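
The example above does not cover the streaming functions (parse_stream_begin/0, parse_stream_chunk/2, parse_stream_end/1). The following is a minimal sketch of how they might be combined. The module name stream_example and the helper split_chunks/2 are hypothetical, and the return shapes ({ok, Session}, ok, and {ok, Doc}) are assumptions modelled on the other calls shown here, not confirmed by this page.

```erlang
-module(stream_example).
-export([parse_in_chunks/2, split_chunks/2]).

%% Hypothetical helper: split a binary into pieces of at most Size bytes.
%% The split is byte-based and ignores character boundaries.
split_chunks(Bin, Size) when is_binary(Bin), is_integer(Size), Size > 0 ->
    split_chunks(Bin, Size, []).

split_chunks(<<>>, _Size, Acc) ->
    lists:reverse(Acc);
split_chunks(Bin, Size, Acc) when byte_size(Bin) =< Size ->
    lists:reverse([Bin | Acc]);
split_chunks(Bin, Size, Acc) ->
    <<Chunk:Size/binary, Rest/binary>> = Bin,
    split_chunks(Rest, Size, [Chunk | Acc]).

%% Feed Html to a streaming session, Size bytes at a time.
%% ASSUMPTION: parse_stream_begin/0 returns {ok, Session},
%% parse_stream_chunk/2 returns ok, and parse_stream_end/1
%% returns {ok, Doc}. Check the function docs before relying on this.
parse_in_chunks(Html, Size) ->
    {ok, Session} = lexbor_erl:parse_stream_begin(),
    lists:foreach(
        fun(Chunk) -> ok = lexbor_erl:parse_stream_chunk(Session, Chunk) end,
        split_chunks(Html, Size)),
    lexbor_erl:parse_stream_end(Session).
```

If the assumptions hold, the document returned by parse_in_chunks/2 can be queried with select/2 and must eventually be passed to release/1, just like a document returned by parse/1.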

Summary

Types

doc_id/0

html_bin/0

node_ref/0

result/1

selector/0

session_id/0

Functions

alive()

Check if the lexbor_erl service is alive and ready.

append_child(DocId, _, _)

Append a child node to a parent element.

append_content(DocId, Selector, Html)

Append HTML content to all elements matching a CSS selector.

create_element(DocId, TagName)

Create a new element node.

get_attribute(DocId, _, AttrName)

Get an attribute value from an element node.

get_text(DocId, _)

Get the text content of a node.

inner_html(DocId, _)

Get the inner HTML of a node.

insert_after_content(DocId, Selector, Html)

Insert HTML content after all elements matching a CSS selector.

insert_before(DocId, _, _, _)

Insert a node before a reference node.

insert_before_content(DocId, Selector, Html)

Insert HTML content before all elements matching a CSS selector.

outer_html(DocId, _)

Get the outer HTML of a node.

parse(Html)

Parse HTML and return a document handle (stateful operation).

parse_serialize(Html)

Parse HTML and serialize it back (stateless operation).

parse_stream_begin()

Begin a streaming parse session.

parse_stream_chunk(SessionId, Chunk)

Feed a chunk of HTML to a streaming parse session.

parse_stream_end(SessionId)

Finalize a streaming parse session and get the document.

prepend_content(DocId, Selector, Html)

Prepend HTML content to all elements matching a CSS selector.

release(DocId)

Release a document and free its resources.

remove_attribute(DocId, _, AttrName)

Remove an attribute from an element node.

remove_node(DocId, _)

Remove a node from its parent.

replace_content(DocId, Selector, Html)

Replace all elements matching a CSS selector with HTML content.

select(DocId, Css)

Select nodes from a document using a CSS selector.

select_html(Html, Css)

Parse HTML and select elements by CSS selector (stateless operation).

serialize(DocId)

Serialize the entire document to HTML.

set_attribute(DocId, _, AttrName, Value)

Set an attribute value on an element node.

set_inner_html(DocId, _, Html)

Set the inner HTML of a node.

set_text(DocId, _, Text)

Set the text content of a node.

start()

Start the lexbor_erl application and its dependencies.

stop()

Stop the lexbor_erl application.
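
As an illustration of how the selector-based content functions might fit together with parse/1, serialize/1, and release/1, here is a hedged sketch. The module name content_example and the function add_footer/1 are hypothetical. The success value of append_content/3 is not shown in this summary, so the code only distinguishes an {error, _} tuple (itself an assumed shape) from everything else.

```erlang
-module(content_example).
-export([add_footer/1]).

%% Parse Html, append a footer to every <body> match, and return
%% whatever serialize/1 yields. The document is always released,
%% even if one of the calls raises.
add_footer(Html) ->
    {ok, Doc} = lexbor_erl:parse(Html),
    try
        %% ASSUMPTION: failures are reported as {error, Reason};
        %% any other return value is treated as success.
        case lexbor_erl:append_content(Doc, <<"body">>,
                                       <<"<footer>Generated</footer>">>) of
            {error, _} = Err -> Err;
            _Ok -> lexbor_erl:serialize(Doc)
        end
    after
        ok = lexbor_erl:release(Doc)
    end.
```

The try ... after wrapper is a design choice for this sketch: since release/1 frees the document's resources, it runs on every exit path so that a failed match does not leak the document.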