lexbor_erl (lexbor_erl v0.3.0)
Erlang wrapper for Lexbor HTML parser via port interface.
This module provides a high-level API for parsing, querying, and manipulating HTML documents using the Lexbor C library. It supports both stateless operations (for one-shot parsing) and stateful operations (for multiple queries on the same document).
The implementation uses a pool of worker processes, each managing an independent C port to the Lexbor library. This provides:
- True parallelism across CPU cores
- Fault isolation - worker crashes don't affect other workers
- Thread safety through message passing
- Automatic recovery from crashes
Example Usage
% Start the application
ok = lexbor_erl:start().
% Stateless: parse and normalize HTML
{ok, CleanHtml} = lexbor_erl:parse_serialize(<<"<p>Hello</p>">>).
% Stateless: parse and select elements
{ok, Elements} = lexbor_erl:select_html(<<"<div><p>A</p><p>B</p></div>">>, <<"p">>).
% Stateful: parse once, query multiple times
{ok, Doc} = lexbor_erl:parse(<<"<html><body><p class='a'>Hello</p></body></html>">>),
{ok, Nodes} = lexbor_erl:select(Doc, <<"p.a">>),
{ok, Html} = lexbor_erl:outer_html(Doc, hd(Nodes)),
ok = lexbor_erl:release(Doc).
Functions
-spec alive() -> boolean().
Check if the lexbor_erl service is alive and ready.
Returns true if at least one worker is alive and ready to accept requests.
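A readiness check like this is typically polled right after startup. The following is a minimal sketch, not part of the library: the module name lexbor_ready and the helpers wait_until_alive/2 and wait_until/3 are hypothetical, and only lexbor_erl:alive/0 comes from this page.

```erlang
-module(lexbor_ready).
-export([wait_until_alive/2, wait_until/3]).

%% Hypothetical helper: poll lexbor_erl:alive/0 until it returns true
%% or the given number of attempts is used up.
wait_until_alive(Attempts, IntervalMs) ->
    wait_until(fun lexbor_erl:alive/0, Attempts, IntervalMs).

%% Generic polling loop. Check must be a fun/0 returning a boolean.
wait_until(_Check, 0, _IntervalMs) ->
    {error, timeout};
wait_until(Check, Attempts, IntervalMs) when Attempts > 0 ->
    case Check() of
        true ->
            ok;
        false ->
            timer:sleep(IntervalMs),
            wait_until(Check, Attempts - 1, IntervalMs)
    end.
```

For example, lexbor_ready:wait_until_alive(50, 100) would wait up to roughly five seconds before giving up with {error, timeout}.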
Append a child node to a parent element.
Adds the child node as the last child of the parent. If the child was previously attached elsewhere, it will be moved to the new location.
Example
{ok, Doc} = lexbor_erl:parse(<<"<ul id='list'></ul>">>),
{ok, [List]} = lexbor_erl:select(Doc, <<"#list">>),
{ok, Item} = lexbor_erl:create_element(Doc, <<"li">>),
ok = lexbor_erl:set_text(Doc, Item, <<"Item 1">>),
ok = lexbor_erl:append_child(Doc, List, Item),
ok = lexbor_erl:release(Doc).
-spec append_content(doc_id(), selector(), html_bin()) -> {ok, non_neg_integer()} | {error, term()}.
Append HTML content to all elements matching a CSS selector.
Parses the CSS selector to find all matching elements in the document, then parses the HTML content and appends it as children to each matched element. Returns the number of elements that were modified.
This is a high-level operation that combines selector matching, HTML parsing, and DOM manipulation in a single atomic operation.
Note: This operation works on full HTML5 documents. The document is always serialized as complete HTML5. Scope extraction (body_children, body, head) is handled by ModestEx using regex after receiving the full HTML output.
Example
{ok, Doc} = lexbor_erl:parse(<<"<div><p>Hello</p></div>">>),
{ok, 1} = lexbor_erl:append_content(Doc, <<"div">>, <<"<p>World</p>">>),
{ok, Html} = lexbor_erl:serialize(Doc),
% Html = <<"<!DOCTYPE html><html>...><div><p>Hello</p><p>World</p></div>...</html>">>
ok = lexbor_erl:release(Doc).
Error Handling
Returns errors for:
- doc_not_found - Document ID is invalid
- invalid_selector - CSS selector syntax error
- css_parser_create_failed - Failed to create CSS parser
- css_parser_init_failed - Failed to initialize CSS parser
- selectors_create_failed - Failed to create selectors engine
- selectors_init_failed - Failed to initialize selectors engine
- selector_match_failed - Selector matching failed
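Callers will usually want to branch on these results rather than pattern-match on {ok, N} directly. The sketch below shows one way to do that. The module lexbor_errors and its functions are hypothetical, and it assumes the error reasons arrive as atoms spelled as in the list above (the spec only promises term()), so a fallback clause covers anything else. It also treats a count of 0 as "nothing matched", which the non_neg_integer() return type allows.

```erlang
-module(lexbor_errors).
-export([append_or_explain/3, explain/1]).

%% Hypothetical wrapper around lexbor_erl:append_content/3 that turns
%% the raw result into something easier to log or show to a user.
append_or_explain(Doc, Selector, Html) ->
    case lexbor_erl:append_content(Doc, Selector, Html) of
        {ok, 0} -> {ok, no_match};
        {ok, Count} -> {ok, {modified, Count}};
        {error, Reason} -> {error, explain(Reason)}
    end.

%% Assumption: reasons are atoms named as in the documented error list.
explain(doc_not_found) -> <<"document ID is invalid">>;
explain(invalid_selector) -> <<"CSS selector syntax error">>;
explain(selector_match_failed) -> <<"selector matching failed">>;
explain(Other) ->
    %% Fallback for parser/engine setup failures and unknown terms.
    iolist_to_binary(io_lib:format("unexpected error: ~p", [Other])).
```

The same pattern applies unchanged to the other selector-based operations, since they share the return type and error list.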
Create a new element node.
Creates a new element with the specified tag name. The element is created but not attached to the document tree. Use append_child/3 or insert_before/4 to add it to the document.
Example
{ok, Doc} = lexbor_erl:parse(<<"<html><body></body></html>">>),
{ok, NewDiv} = lexbor_erl:create_element(Doc, <<"div">>),
ok = lexbor_erl:set_attribute(Doc, NewDiv, <<"id">>, <<"new">>),
ok = lexbor_erl:set_text(Doc, NewDiv, <<"New content">>),
{ok, [Body]} = lexbor_erl:select(Doc, <<"body">>),
ok = lexbor_erl:append_child(Doc, Body, NewDiv),
ok = lexbor_erl:release(Doc).
-spec get_attribute(doc_id(), node_ref(), binary()) -> {ok, binary() | undefined} | {error, term()}.
Get an attribute value from an element node.
Retrieves the value of a specified attribute from an element node. Returns {ok, undefined} if the attribute does not exist.
Only works on element nodes - other node types will return an error.
Example
{ok, Doc} = lexbor_erl:parse(<<"<a href='/home' class='link'>Home</a>">>),
{ok, [Link]} = lexbor_erl:select(Doc, <<"a">>),
{ok, Href} = lexbor_erl:get_attribute(Doc, Link, <<"href">>),
% Href: <<"/home">>
{ok, Title} = lexbor_erl:get_attribute(Doc, Link, <<"title">>),
% Title: undefined (attribute doesn't exist)
ok = lexbor_erl:release(Doc).
Get the text content of a node.
Extracts all text content from the node and its descendants, without any HTML tags. This recursively collects all text nodes within the element.
Example
{ok, Doc} = lexbor_erl:parse(<<"<div>Hello <b>World</b>!</div>">>),
{ok, [Div]} = lexbor_erl:select(Doc, <<"div">>),
{ok, Text} = lexbor_erl:get_text(Doc, Div),
% Text: <<"Hello World!">>
ok = lexbor_erl:release(Doc).
Get the inner HTML of a node.
Returns the HTML content of all children, excluding the element's own tags. Similar to outer_html/2 but without the container element.
Example
{ok, Doc} = lexbor_erl:parse(<<"<div><p>Hello</p><p>World</p></div>">>),
{ok, [Div]} = lexbor_erl:select(Doc, <<"div">>),
{ok, Inner} = lexbor_erl:inner_html(Doc, Div),
% Inner: <<"<p>Hello</p><p>World</p>">>
ok = lexbor_erl:release(Doc).
-spec insert_after_content(doc_id(), selector(), html_bin()) -> {ok, non_neg_integer()} | {error, term()}.
Insert HTML content after all elements matching a CSS selector.
Parses the CSS selector to find all matching elements in the document, then parses the HTML content and inserts it AFTER each matched element (as siblings, not as children). Returns the number of elements that were processed.
This is a high-level operation that combines selector matching, HTML parsing, and DOM manipulation in a single atomic operation.
Key difference from insert_before_content: While insert_before_content/3 inserts content BEFORE the matched elements, this function inserts content AFTER them. Both insert as siblings, not as children.
Note: This operation works on full HTML5 documents. The document is always serialized as complete HTML5. Scope extraction (body_children, body, head) is handled by ModestEx using regex after receiving the full HTML output.
Example
{ok, Doc} = lexbor_erl:parse(<<"<div><p>Hello</p></div>">>),
{ok, 1} = lexbor_erl:insert_after_content(Doc, <<"p">>, <<"<p>World</p>">>),
{ok, Html} = lexbor_erl:serialize(Doc),
% Html = <<"<!DOCTYPE html><html>...><div><p>Hello</p><p>World</p></div>...</html>">>
ok = lexbor_erl:release(Doc).
Edge Cases
- Elements without a parent (document root) are skipped
- Multiple consecutive matches are processed in document order
- Inserted nodes keep their order: inserting A and B after a target yields target, A, B
Error Handling
Returns errors for:
- doc_not_found - Document ID is invalid
- invalid_selector - CSS selector syntax error
- css_parser_create_failed - Failed to create CSS parser
- css_parser_init_failed - Failed to initialize CSS parser
- selectors_create_failed - Failed to create selectors engine
- selectors_init_failed - Failed to initialize selectors engine
- selector_match_failed - Selector matching failed
Insert a node before a reference node.
Inserts the new node as a child of parent, positioned before the reference node. The reference node must be a child of the parent.
Example
{ok, Doc} = lexbor_erl:parse(<<"<ul><li>Second</li></ul>">>),
{ok, [List]} = lexbor_erl:select(Doc, <<"ul">>),
{ok, [Second]} = lexbor_erl:select(Doc, <<"li">>),
{ok, First} = lexbor_erl:create_element(Doc, <<"li">>),
ok = lexbor_erl:set_text(Doc, First, <<"First">>),
ok = lexbor_erl:insert_before(Doc, List, First, Second),
% Now: <ul><li>First</li><li>Second</li></ul>
ok = lexbor_erl:release(Doc).
-spec insert_before_content(doc_id(), selector(), html_bin()) -> {ok, non_neg_integer()} | {error, term()}.
Insert HTML content before all elements matching a CSS selector.
Parses the CSS selector to find all matching elements in the document, then parses the HTML content and inserts it BEFORE each matched element (as siblings, not as children). Returns the number of elements that were processed.
This is a high-level operation that combines selector matching, HTML parsing, and DOM manipulation in a single atomic operation.
Key difference from append/prepend: While append_content/3 and prepend_content/3 insert content as CHILDREN of the matched elements, this function inserts content as SIBLINGS (before the matched element in its parent's child list).
Note: This operation works on full HTML5 documents. The document is always serialized as complete HTML5. Scope extraction (body_children, body, head) is handled by ModestEx using regex after receiving the full HTML output.
Example
{ok, Doc} = lexbor_erl:parse(<<"<div><p>World</p></div>">>),
{ok, 1} = lexbor_erl:insert_before_content(Doc, <<"p">>, <<"<p>Hello</p>">>),
{ok, Html} = lexbor_erl:serialize(Doc),
% Html = <<"<!DOCTYPE html><html>...><div><p>Hello</p><p>World</p></div>...</html>">>
ok = lexbor_erl:release(Doc).
Edge Cases
- Elements without a parent (document root) are skipped
- Multiple consecutive matches are processed in document order
- Earlier insertions don't affect positions of later matched elements
Error Handling
Returns errors for:
- doc_not_found - Document ID is invalid
- invalid_selector - CSS selector syntax error
- css_parser_create_failed - Failed to create CSS parser
- css_parser_init_failed - Failed to initialize CSS parser
- selectors_create_failed - Failed to create selectors engine
- selectors_init_failed - Failed to initialize selectors engine
- selector_match_failed - Selector matching failed
Get the outer HTML of a node.
Returns the HTML representation of a node, including the node itself and all its descendants. The node handle must be from a select/2 operation on the same document.
Example
{ok, Doc} = lexbor_erl:parse(<<"<div><p>Hello</p></div>">>),
{ok, [Node]} = lexbor_erl:select(Doc, <<"p">>),
{ok, Html} = lexbor_erl:outer_html(Doc, Node),
% Html: <<"<p>Hello</p>">>
ok = lexbor_erl:release(Doc).
Parse HTML and return a document handle (stateful operation).
Parses HTML and stores the document in a worker's memory. Returns an opaque document ID that can be used for subsequent operations like select/2 and outer_html/2.
Important: You must call release/1 when done with the document to free resources. Documents are stored in C memory and are not garbage collected automatically.
The document is assigned to a worker and all subsequent operations on this document will be routed to the same worker automatically.
Example
{ok, Doc} = lexbor_erl:parse(<<"<html><body><p>Hello</p></body></html>">>),
% ... perform operations on Doc ...
ok = lexbor_erl:release(Doc). % Don't forget to release!
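Because documents are not garbage collected, it can help to tie the release to a try ... after block so that it runs even when the work in between fails. This is a minimal sketch under stated assumptions: the module lexbor_with_doc and the function with_document/2 are hypothetical, and parse/1 is assumed to return {error, Reason} on failure, as the other functions on this page do.

```erlang
-module(lexbor_with_doc).
-export([with_document/2]).

%% Hypothetical helper: parse Html, run Fun on the document handle,
%% and always release the document afterwards.
with_document(Html, Fun) when is_binary(Html), is_function(Fun, 1) ->
    case lexbor_erl:parse(Html) of
        {ok, Doc} ->
            try
                Fun(Doc)
            after
                %% Runs on normal return and on exceptions alike.
                lexbor_erl:release(Doc)
            end;
        {error, _} = Error ->
            Error
    end.
```

The value returned by Fun is passed through, so node handles should be turned into binaries (for example with outer_html/2 or get_text/2) inside Fun, before the document is released.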
Parse HTML and serialize it back (stateless operation).
This is a minimal round-trip operation that parses HTML and returns the normalized/serialized form. Useful for cleaning up malformed HTML.
This is a stateless operation - the document is not retained in memory after the call completes. Use parse/1 if you need to perform multiple operations on the same document.
Example
{ok, Clean} = lexbor_erl:parse_serialize(<<"<p>Hello<br>World">>).
% Returns: {ok, <<"<html><head></head><body><p>Hello<br>World</p></body></html>">>}
-spec parse_stream_begin() -> result(session_id()).
Begin a streaming parse session.
Starts a new streaming parse session that allows you to feed HTML content incrementally as chunks. This is useful for parsing very large documents or when HTML is arriving over a network connection.
After calling this, use parse_stream_chunk/2 to feed chunks, and parse_stream_end/1 to finalize and get the document.
The session is stateful and tied to a specific worker. All chunks for this session will be routed to the same worker automatically.
Example
{ok, Session} = lexbor_erl:parse_stream_begin(),
ok = lexbor_erl:parse_stream_chunk(Session, <<"<html><body>">>),
ok = lexbor_erl:parse_stream_chunk(Session, <<"<p>Hello</p>">>),
ok = lexbor_erl:parse_stream_chunk(Session, <<"</body></html>">>),
{ok, Doc} = lexbor_erl:parse_stream_end(Session),
% Now use Doc like any other document
ok = lexbor_erl:release(Doc).
-spec parse_stream_chunk(session_id(), html_bin()) -> ok | {error, term()}.
Feed a chunk of HTML to a streaming parse session.
Sends a chunk of HTML data to an ongoing parse session. The chunk can be of any size and the HTML does not need to be complete at chunk boundaries (e.g., you can split in the middle of a tag).
The parser will buffer incomplete elements internally and continue parsing as more chunks arrive.
Example
{ok, Session} = lexbor_erl:parse_stream_begin(),
ok = lexbor_erl:parse_stream_chunk(Session, <<"<div><p>Part 1">>),
ok = lexbor_erl:parse_stream_chunk(Session, <<" Part 2</p></div>">>),
{ok, Doc} = lexbor_erl:parse_stream_end(Session).
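Since chunk boundaries may fall anywhere, a file can be fed to the parser in fixed-size reads. The sketch below illustrates this with the standard file module. The module lexbor_stream_file and its functions are hypothetical. On a read or chunk error it finalizes the session and discards the partial document, on the assumption that parse_stream_end/1 is the only documented way to close a session.

```erlang
-module(lexbor_stream_file).
-export([parse_file/2]).

%% Hypothetical helper: stream the file at Path into a parse session,
%% ChunkSize bytes at a time. Returns {ok, Doc} or {error, Reason}.
parse_file(Path, ChunkSize) when is_integer(ChunkSize), ChunkSize > 0 ->
    case file:open(Path, [read, binary, raw]) of
        {ok, Io} ->
            try
                {ok, Session} = lexbor_erl:parse_stream_begin(),
                finish(Session, feed(Io, Session, ChunkSize))
            after
                file:close(Io)
            end;
        {error, _} = Error ->
            Error
    end.

%% Read until eof, forwarding each chunk to the session.
feed(Io, Session, ChunkSize) ->
    case file:read(Io, ChunkSize) of
        {ok, Chunk} ->
            case lexbor_erl:parse_stream_chunk(Session, Chunk) of
                ok -> feed(Io, Session, ChunkSize);
                {error, _} = Error -> Error
            end;
        eof ->
            ok;
        {error, _} = Error ->
            Error
    end.

finish(Session, ok) ->
    lexbor_erl:parse_stream_end(Session);
finish(Session, {error, _} = Error) ->
    %% Assumption: finalizing is how a session gets closed, so do that
    %% and release the partial document to avoid leaking it.
    case lexbor_erl:parse_stream_end(Session) of
        {ok, Doc} -> lexbor_erl:release(Doc);
        _ -> ok
    end,
    Error.
```

The caller owns the returned document and still has to call release/1 on it when done.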
-spec parse_stream_end(session_id()) -> result(doc_id()).
Finalize a streaming parse session and get the document.
Completes the streaming parse, finalizes the DOM tree, and returns a document handle that can be used with all normal document operations like select/2, outer_html/2, etc.
After calling this, the session is closed and cannot accept more chunks. The returned document must be released with release/1 when done.
Example
{ok, Session} = lexbor_erl:parse_stream_begin(),
lists:foreach(
fun(Chunk) -> ok = lexbor_erl:parse_stream_chunk(Session, Chunk) end,
HtmlChunks
),
{ok, Doc} = lexbor_erl:parse_stream_end(Session),
{ok, Nodes} = lexbor_erl:select(Doc, <<"p">>),
ok = lexbor_erl:release(Doc).
-spec prepend_content(doc_id(), selector(), html_bin()) -> {ok, non_neg_integer()} | {error, term()}.
Prepend HTML content to all elements matching a CSS selector.
Parses the CSS selector to find all matching elements in the document, then parses the HTML content and prepends it as children (before existing children) to each matched element. Returns the number of elements that were modified.
This is a high-level operation that combines selector matching, HTML parsing, and DOM manipulation in a single atomic operation.
Note: This operation works on full HTML5 documents. The document is always serialized as complete HTML5. Scope extraction (body_children, body, head) is handled by ModestEx using regex after receiving the full HTML output.
Example
{ok, Doc} = lexbor_erl:parse(<<"<div><p>World</p></div>">>),
{ok, 1} = lexbor_erl:prepend_content(Doc, <<"div">>, <<"<p>Hello</p>">>),
{ok, Html} = lexbor_erl:serialize(Doc),
% Html = <<"<!DOCTYPE html><html>...><div><p>Hello</p><p>World</p></div>...</html>">>
ok = lexbor_erl:release(Doc).
Key Difference from append_content/3
While append_content/3 adds content as the last child, prepend_content/3 adds content as the first child. When multiple nodes are prepended, they are inserted in forward order to maintain their relative positions.
Error Handling
Returns errors for:
- doc_not_found - Document ID is invalid
- invalid_selector - CSS selector syntax error
- css_parser_create_failed - Failed to create CSS parser
- css_parser_init_failed - Failed to initialize CSS parser
- selectors_create_failed - Failed to create selectors engine
- selectors_init_failed - Failed to initialize selectors engine
- selector_match_failed - Selector matching failed
Release a document and free its resources.
Frees the memory associated with a parsed document. After calling this, the DocId becomes invalid and any further operations on it will fail.
This MUST be called for every document created with parse/1 to avoid memory leaks in the C layer.
Example
{ok, Doc} = lexbor_erl:parse(<<"<p>Hello</p>">>),
% ... use Doc ...
ok = lexbor_erl:release(Doc).
% Doc is now invalid
Remove an attribute from an element node.
Removes the specified attribute from the element if it exists. Returns success even if the attribute didn't exist.
Only works on element nodes - other node types will return an error.
Example
{ok, Doc} = lexbor_erl:parse(<<"<a href='/' target='_blank'>Link</a>">>),
{ok, [Link]} = lexbor_erl:select(Doc, <<"a">>),
ok = lexbor_erl:remove_attribute(Doc, Link, <<"target">>),
{ok, Html} = lexbor_erl:outer_html(Doc, Link),
% Html: <<"<a href=\"/\">Link</a>">>
ok = lexbor_erl:release(Doc).
Remove a node from its parent.
Removes the node from its parent in the tree. The node is not destroyed and can potentially be reinserted elsewhere.
Example
{ok, Doc} = lexbor_erl:parse(<<"<div><p>Remove me</p><p>Keep</p></div>">>),
{ok, Paragraphs} = lexbor_erl:select(Doc, <<"p">>),
[ToRemove|_] = Paragraphs,
ok = lexbor_erl:remove_node(Doc, ToRemove),
% Now only one <p> remains
ok = lexbor_erl:release(Doc).
-spec replace_content(doc_id(), selector(), html_bin()) -> {ok, non_neg_integer()} | {error, term()}.
Replace all elements matching a CSS selector with HTML content.
Parses the CSS selector to find all matching elements in the document, then parses the HTML content and replaces each matched element with the parsed content. The matched elements are removed from the document and destroyed to free memory. Returns the number of elements that were replaced.
This is a high-level operation that combines selector matching, HTML parsing, and DOM manipulation in a single atomic operation.
How it works: For each matched element, the new HTML content is inserted as siblings before the matched element, then the matched element is removed. This effectively replaces the element with the new content.
Note: This operation works on full HTML5 documents. The document is always serialized as complete HTML5. Scope extraction (body_children, body, head) is handled by ModestEx using regex after receiving the full HTML output.
Example
{ok, Doc} = lexbor_erl:parse(<<"<div><p>Old</p></div>">>),
{ok, 1} = lexbor_erl:replace_content(Doc, <<"p">>, <<"<span>New</span>">>),
{ok, Html} = lexbor_erl:serialize(Doc),
% Html = <<"<!DOCTYPE html><html>...><div><span>New</span></div>...</html>">>
ok = lexbor_erl:release(Doc).
Replacing with Multiple Elements
You can replace one element with multiple elements:
{ok, Doc} = lexbor_erl:parse(<<"<div><p>Old</p></div>">>),
{ok, 1} = lexbor_erl:replace_content(Doc, <<"p">>, <<"<span>A</span><span>B</span>">>),
{ok, Html} = lexbor_erl:serialize(Doc),
% Html = <<"<!DOCTYPE html><html>...><div><span>A</span><span>B</span></div>...</html>">>
ok = lexbor_erl:release(Doc).
Replacing with Empty Content
Replacing with empty string effectively removes the matched elements:
{ok, Doc} = lexbor_erl:parse(<<"<div><p>Remove me</p></div>">>),
{ok, 1} = lexbor_erl:replace_content(Doc, <<"p">>, <<"">>),
{ok, Html} = lexbor_erl:serialize(Doc),
% Html = <<"<!DOCTYPE html><html>...><div></div>...</html>">>
ok = lexbor_erl:release(Doc).
Edge Cases
- Elements without a parent (document root) are skipped
- Multiple matches are processed in document order
- Replaced elements are destroyed and cannot be recovered
Error Handling
Returns errors for:
- doc_not_found - Document ID is invalid
- invalid_selector - CSS selector syntax error
- css_parser_create_failed - Failed to create CSS parser
- css_parser_init_failed - Failed to initialize CSS parser
- selectors_create_failed - Failed to create selectors engine
- selectors_init_failed - Failed to initialize selectors engine
- selector_match_failed - Selector matching failed
Select nodes from a document using a CSS selector.
Queries a parsed document using a CSS selector and returns handles to all matching nodes. The node handles can be used with outer_html/2 to retrieve the HTML content.
Supports most CSS3 selectors including:
- Element selectors: "p", "div"
- Class selectors: ".class", "p.class"
- ID selectors: "#id", "div#id"
- Attribute selectors: "[attr]", "[attr=value]"
- Combinators: "div p" (descendant), "div > p" (child)
- Pseudo-classes: ":first-child", ":nth-child(n)"
Example
{ok, Doc} = lexbor_erl:parse(<<"<div><p class='a'>A</p><p class='b'>B</p></div>">>),
{ok, Nodes} = lexbor_erl:select(Doc, <<"p.a">>),
% Nodes: [{node, NodeHandle}]
ok = lexbor_erl:release(Doc).
Parse HTML and select elements by CSS selector (stateless operation).
Single-shot operation that parses HTML, selects matching elements using a CSS selector, and returns the outer HTML of each match.
This is stateless: the document is not retained after the call. For multiple queries on the same document, use parse/1 followed by select/2 and outer_html/2, which parses the HTML only once.
Example
Html = <<"<div><p class='a'>First</p><p class='b'>Second</p></div>">>,
{ok, Elements} = lexbor_erl:select_html(Html, <<"p.a">>).
% Returns: {ok, [<<"<p class=\"a\">First</p>">>]}
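When several selectors are run against the same input, the stateful functions avoid re-parsing for each query. The following sketch shows that combination. The module lexbor_multi_select and its functions are hypothetical, and it simply crashes on any error result, which may or may not suit a given caller.

```erlang
-module(lexbor_multi_select).
-export([select_many/2]).

%% Hypothetical helper: parse Html once, run every selector against
%% the same document, and return the outer HTML of each match.
select_many(Html, Selectors) when is_binary(Html), is_list(Selectors) ->
    {ok, Doc} = lexbor_erl:parse(Html),
    try
        [{Selector, outer_htmls(Doc, Selector)} || Selector <- Selectors]
    after
        %% Node handles are only used before the document is released.
        lexbor_erl:release(Doc)
    end.

outer_htmls(Doc, Selector) ->
    {ok, Nodes} = lexbor_erl:select(Doc, Selector),
    [begin
         {ok, NodeHtml} = lexbor_erl:outer_html(Doc, Node),
         NodeHtml
     end || Node <- Nodes].
```

For a single selector, select_html/2 remains the shorter choice.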
Serialize the entire document to HTML.
Returns the complete HTML representation of the document, including the doctype and all elements. Use this after making modifications to get the final HTML output.
Example
{ok, Doc} = lexbor_erl:parse(<<"<div>Hello</div>">>),
{ok, [Div]} = lexbor_erl:select(Doc, <<"div">>),
ok = lexbor_erl:set_attribute(Doc, Div, <<"id">>, <<"main">>),
{ok, FinalHtml} = lexbor_erl:serialize(Doc),
% FinalHtml: <<"<html><head></head><body><div id=\"main\">Hello</div></body></html>">>
ok = lexbor_erl:release(Doc).
Set an attribute value on an element node.
Sets the specified attribute to the given value. If the attribute already exists, its value is updated. If it doesn't exist, it's created.
Only works on element nodes - other node types will return an error.
Example
{ok, Doc} = lexbor_erl:parse(<<"<a href='/old'>Link</a>">>),
{ok, [Link]} = lexbor_erl:select(Doc, <<"a">>),
ok = lexbor_erl:set_attribute(Doc, Link, <<"href">>, <<"/new">>),
ok = lexbor_erl:set_attribute(Doc, Link, <<"target">>, <<"_blank">>),
{ok, Html} = lexbor_erl:outer_html(Doc, Link),
% Html: <<"<a href=\"/new\" target=\"_blank\">Link</a>">>
ok = lexbor_erl:release(Doc).
Set the inner HTML of a node.
Parses the HTML string and replaces all children of the element with the parsed content. The HTML is parsed in the context of the element's tag.
Warning: Be careful with untrusted input as this directly parses HTML. Consider using set_text/3 for plain text content.
Example
{ok, Doc} = lexbor_erl:parse(<<"<div>Old</div>">>),
{ok, [Div]} = lexbor_erl:select(Doc, <<"div">>),
ok = lexbor_erl:set_inner_html(Doc, Div, <<"<p>New</p><p>Content</p>">>),
{ok, Html} = lexbor_erl:outer_html(Doc, Div),
% Html: <<"<div><p>New</p><p>Content</p></div>">>
ok = lexbor_erl:release(Doc).
Set the text content of a node.
Replaces all children of the node with a single text node containing the specified text. Any existing child elements will be removed.
Example
{ok, Doc} = lexbor_erl:parse(<<"<div>Old <b>text</b></div>">>),
{ok, [Div]} = lexbor_erl:select(Doc, <<"div">>),
ok = lexbor_erl:set_text(Doc, Div, <<"New text">>),
{ok, Html} = lexbor_erl:outer_html(Doc, Div),
% Html: <<"<div>New text</div>">>
ok = lexbor_erl:release(Doc).
-spec start() -> ok | {error, term()}.
Start the lexbor_erl application and its dependencies.
This starts the worker pool and all necessary processes. The pool size can be configured via the application environment; by default it equals the number of scheduler threads.
-spec stop() -> ok.
Stop the lexbor_erl application.
This shuts down all workers and releases all resources. Any documents held by the application will be lost.
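For scripts and one-off jobs, start/0 and stop/0 can be paired so that the application is always shut down afterwards. This is a small sketch only: the module lexbor_lifecycle and the function run/1 are hypothetical, and how start/0 behaves when the application is already running is not covered by this page.

```erlang
-module(lexbor_lifecycle).
-export([run/1]).

%% Hypothetical helper: start lexbor_erl, run Fun, then stop the
%% application even if Fun raises.
run(Fun) when is_function(Fun, 0) ->
    case lexbor_erl:start() of
        ok ->
            try
                Fun()
            after
                %% Any documents still held are lost at this point.
                lexbor_erl:stop()
            end;
        {error, _} = Error ->
            Error
    end.
```

A long-running system would normally leave the application running under its supervisor instead of stopping it after each job.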