RustyXML.Native (RustyXML v0.2.3)


Low-level NIF bindings for XML parsing.

This module provides direct access to the Rust NIF functions. For normal use, prefer the higher-level RustyXML module with its ~x sigil support.

Strategies

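The module exposes several parsing strategies:

  • Structural index: parse/1 and parse_strict/1 build a document that xpath_query/2 and related functions can query repeatedly.
  • One-shot XPath: parse_and_xpath/2 and xpath_string_value/2 parse and query in a single call.
  • SimpleForm: parse_to_simple_form/1 and the accumulator_* functions return a {name, attrs, children} tree.
  • SAX: sax_parse/1 and sax_parse_saxy/2 return a list of events.
  • Streaming: the streaming_* functions process XML in chunks with bounded memory usage.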

Memory Efficiency

The structural index built by parse/1 uses roughly 4x the input size in memory, compared with roughly 600x for SweetXml. Strings are stored as byte offsets into the original input rather than as copies.
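As a back-of-envelope illustration of those ratios, the sketch below multiplies an input size by the two approximate factors quoted above. The module and function names are hypothetical and not part of RustyXML; real memory use depends on the document.

```elixir
defmodule MemoryEstimate do
  # Hypothetical helper, not part of RustyXML. The factors are the
  # approximate multipliers quoted in this section, not measurements.
  @structural_index_factor 4
  @sweet_xml_factor 600

  def structural_index(input_bytes), do: input_bytes * @structural_index_factor
  def sweet_xml(input_bytes), do: input_bytes * @sweet_xml_factor
end

mib = 1024 * 1024
input = 10 * mib

IO.puts("structural index: ~#{div(MemoryEstimate.structural_index(input), mib)} MiB")
IO.puts("SweetXml:         ~#{div(MemoryEstimate.sweet_xml(input), mib)} MiB")
```

For a 10 MiB document this prints estimates of about 40 MiB and about 6000 MiB respectively.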

Scheduler Behaviour

NIFs that parse raw XML input run on the dirty CPU scheduler to avoid blocking BEAM schedulers. Query NIFs on pre-parsed documents run on normal schedulers for sub-millisecond lookups.

Summary

Types

document_ref()
Opaque reference to a parsed XML document (structural index)

parser_ref()
Opaque reference to a streaming parser

xml_event()
XML event from parser

Functions

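accumulator_feed(acc, chunk)
Feed a chunk of data to the document accumulator.

accumulator_new()
Create a new document accumulator for streaming SimpleForm parsing.

accumulator_to_simple_form(acc)
Validate, index, and convert accumulated data to SimpleForm.

get_root(doc)
Get the root element of a parsed document.

get_rust_memory()
Get current Rust heap allocation in bytes.

get_rust_memory_peak()
Get peak Rust heap allocation since last reset.

parse(xml)
Parse XML into a structural index document.

parse_and_xpath(xml, xpath)
Parse XML and execute an XPath query in one call.

parse_and_xpath_text(xml, xpath)
Parse and immediately query, returning text values for node sets.

parse_strict(xml)
Parse XML in strict mode (returns {:ok, doc} or {:error, reason}).

parse_to_simple_form(xml)
Parse XML directly into SimpleForm {name, attrs, children} tree.

reset_rust_memory_stats()
Reset memory tracking statistics.

sax_parse(xml)
Parse XML and return SAX events.

sax_parse_saxy(xml, cdata_as_chars)
Parse XML and return SAX events in Saxy-compatible format.

streaming_available_elements(parser)
Get number of available complete elements.

streaming_feed(parser, chunk)
Feed a chunk of XML data to the streaming parser.

streaming_feed_sax(parser, chunk, cdata_as_chars)
Feed a chunk and return SAX events as a compact binary.

streaming_finalize(parser)
Finalize the streaming parser and get remaining events.

streaming_finalize_sax(parser, cdata_as_chars)
Finalize the streaming SAX parser, processing any remaining bytes.

streaming_new()
Create a new streaming XML parser.

streaming_new_with_filter(tag)
Create a streaming parser with a tag filter.

streaming_sax_new()
Create a new streaming SAX parser.

streaming_status(parser)
Get streaming parser status.

streaming_take_elements(parser, max)
Take up to max complete elements from the streaming parser.

streaming_take_events(parser, max)
Take up to max events from the streaming parser.

streaming_take_saxy_events(parser, max, cdata_as_chars)
Take events from streaming parser in Saxy-compatible format.

xpath_query(doc, xpath)
Execute an XPath query on a parsed document.

xpath_query_raw(doc, xpath)
Execute an XPath query returning XML strings for node sets (fast path).

xpath_string_value(xml, xpath)
Execute XPath and return string value of result.

xpath_string_value_doc(doc, xpath)
Execute XPath on document reference and return string value.

xpath_text_list(doc, xpath)
Execute XPath query returning text values for node sets (optimized fast path).

xpath_with_subspecs(xml, parent_xpath, subspecs)
Execute parent XPath and evaluate subspecs for each result node.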

Types

document_ref()

@opaque document_ref()

Opaque reference to a parsed XML document (structural index)

parser_ref()

@opaque parser_ref()

Opaque reference to a streaming parser

xml_event()

@type xml_event() ::
  {:start_element, binary(), [{binary(), binary()}]}
  | {:end_element, binary()}
  | {:empty_element, binary(), [{binary(), binary()}]}
  | {:text, binary()}
  | {:cdata, binary()}
  | {:comment, binary()}

XML event from parser
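To show how the xml_event() shapes can be consumed, here is a minimal sketch that folds over a list of events, counting elements and collecting text. The EventFold module is hypothetical and the event list is hand-written, so the example does not call any NIF.

```elixir
defmodule EventFold do
  # Hypothetical helper, not part of RustyXML: walks a list of xml_event()
  # tuples and returns {element_count, concatenated_text}.
  def summarize(events) do
    {count, text} =
      Enum.reduce(events, {0, []}, fn
        {:start_element, _name, _attrs}, {n, acc} -> {n + 1, acc}
        {:empty_element, _name, _attrs}, {n, acc} -> {n + 1, acc}
        {:text, t}, {n, acc} -> {n, [acc, t]}
        {:cdata, t}, {n, acc} -> {n, [acc, t]}
        {:end_element, _name}, state -> state
        {:comment, _content}, state -> state
      end)

    # Text was gathered as iodata; flatten it once at the end.
    {count, IO.iodata_to_binary(text)}
  end
end

events = [
  {:start_element, "root", []},
  {:empty_element, "item", [{"id", "1"}]},
  {:text, "hi"},
  {:cdata, " there"},
  {:comment, "ignored"},
  {:end_element, "root"}
]

EventFold.summarize(events)
#=> {2, "hi there"}
```

Each clause matches one variant of the type, so an unexpected event shape fails loudly with a FunctionClauseError rather than being silently dropped.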

Functions

accumulator_feed(acc, chunk)

@spec accumulator_feed(reference(), binary()) :: :ok

Feed a chunk of data to the document accumulator.

accumulator_new()

@spec accumulator_new() :: reference()

Create a new document accumulator for streaming SimpleForm parsing.

Returns an opaque accumulator reference.

accumulator_to_simple_form(acc)

@spec accumulator_to_simple_form(reference()) :: {:ok, tuple()} | {:error, binary()}

Validate, index, and convert accumulated data to SimpleForm.

Returns {:ok, tree} or {:error, reason}.

get_root(doc)

@spec get_root(document_ref()) :: term() | nil

Get the root element of a parsed document.

Returns the root element as a tuple: {:element, name, attributes, children}

Examples

doc = RustyXML.Native.parse(~s(<root attr="value"><child/></root>))
RustyXML.Native.get_root(doc)
#=> {:element, "root", [{"attr", "value"}], [...]}

get_rust_memory()

@spec get_rust_memory() :: non_neg_integer()

Get current Rust heap allocation in bytes.

Requires memory_tracking Cargo feature. Returns 0 otherwise.

get_rust_memory_peak()

@spec get_rust_memory_peak() :: non_neg_integer()

Get peak Rust heap allocation since last reset.

parse(xml)

@spec parse(binary()) :: document_ref()

Parse XML into a structural index document.

Runs on the dirty CPU scheduler since parse time scales with input size.

Returns an opaque document reference that can be used with xpath_query/2 and get_root/1. The document is cached and can be queried multiple times.

This is the primary parse function; it uses roughly 4x the input size in memory.

Examples

doc = RustyXML.Native.parse(~s(<root><item id="1"/></root>))
RustyXML.Native.xpath_query(doc, "//item")

parse_and_xpath(xml, xpath)

@spec parse_and_xpath(binary(), binary()) :: term()

Parse XML and execute an XPath query in one call.

Runs on the dirty CPU scheduler since it parses raw XML input.

More efficient than parse/1 + xpath_query/2 for single queries since it doesn't create a persistent document reference.

Examples

RustyXML.Native.parse_and_xpath("<root><item/></root>", "//item")

parse_and_xpath_text(xml, xpath)

@spec parse_and_xpath_text(binary(), binary()) :: [binary()] | term()

Parse and immediately query, returning text values for node sets.

Optimized path for queries with is_value: true. It avoids building element tuples.

parse_strict(xml)

@spec parse_strict(binary()) :: {:ok, document_ref()} | {:error, binary()}

Parse XML in strict mode (returns {:ok, doc} or {:error, reason}).

Runs on the dirty CPU scheduler since parse time scales with input size.

Returns {:ok, document_ref} on success, or {:error, reason} if the document is not well-formed per the XML 1.0 specification.

Examples

{:ok, doc} = RustyXML.Native.parse_strict("<root>valid</root>")

{:error, reason} = RustyXML.Native.parse_strict("<1invalid/>")

parse_to_simple_form(xml)

@spec parse_to_simple_form(binary()) :: {:ok, tuple()} | {:error, binary()}

Parse XML directly into SimpleForm {name, attrs, children} tree.

Bypasses the SAX event pipeline — builds the tree in Rust from the structural index, decoding entities as needed.

Returns {:ok, tree} or {:error, reason}.
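A SimpleForm tree can be walked with plain pattern matching. The sketch below extracts all text from a hand-written tree. It assumes that children are either binaries (text) or nested {name, attrs, children} tuples, as in Saxy's SimpleForm; the SimpleFormText module is hypothetical.

```elixir
defmodule SimpleFormText do
  # Hypothetical helper, not part of RustyXML: concatenates all text found
  # under a SimpleForm node, in document order.
  # Assumption: each child is a binary or a nested {name, attrs, children}.
  def text({_name, _attrs, children}) do
    children
    |> Enum.map(&text/1)
    |> IO.iodata_to_binary()
  end

  def text(binary) when is_binary(binary), do: binary
end

tree = {"root", [], [{"item", [{"id", "1"}], ["a"]}, "b", {"item", [], ["c"]}]}

SimpleFormText.text(tree)
#=> "abc"
```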

reset_rust_memory_stats()

@spec reset_rust_memory_stats() :: {non_neg_integer(), non_neg_integer()}

Reset memory tracking statistics.

Returns {current_bytes, previous_peak_bytes}.

sax_parse(xml)

@spec sax_parse(binary()) :: [tuple()]

Parse XML and return SAX events.

Events are returned as tuples similar to Saxy's format.

sax_parse_saxy(xml, cdata_as_chars)

@spec sax_parse_saxy(binary(), boolean()) :: [tuple()]

Parse XML and return SAX events in Saxy-compatible format.

Events are emitted directly in Saxy format:

  • {:start_element, {name, attrs}}
  • {:end_element, name}
  • {:characters, content}
  • {:cdata, content}

Comments and PIs are skipped. Empty elements emit start+end.
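The event shapes listed above can be reduced directly. As a minimal sketch, the hypothetical SaxyEventCounter module below counts how often each element name is opened, using a hand-written event list instead of a call to the NIF.

```elixir
defmodule SaxyEventCounter do
  # Hypothetical helper, not part of RustyXML: counts :start_element events
  # per element name in a list of Saxy-style events.
  def count_elements(events) do
    Enum.reduce(events, %{}, fn
      {:start_element, {name, _attrs}}, acc -> Map.update(acc, name, 1, &(&1 + 1))
      _other, acc -> acc
    end)
  end
end

events = [
  {:start_element, {"root", []}},
  {:start_element, {"item", [{"id", "1"}]}},
  {:end_element, "item"},
  {:start_element, {"item", []}},
  {:characters, "text"},
  {:end_element, "item"},
  {:end_element, "root"}
]

SaxyEventCounter.count_elements(events)
#=> %{"item" => 2, "root" => 1}
```

Because empty elements emit a start and an end event, counting start events alone covers every element.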

streaming_available_elements(parser)

@spec streaming_available_elements(parser_ref()) ::
  non_neg_integer() | {:error, :mutex_poisoned}

Get number of available complete elements.

streaming_feed(parser, chunk)

@spec streaming_feed(parser_ref(), binary()) ::
  {non_neg_integer(), non_neg_integer()} | {:error, :mutex_poisoned}

Feed a chunk of XML data to the streaming parser.

Returns {available_events, buffer_size} on success, or {:error, :mutex_poisoned} if the parser mutex is poisoned.
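Because success is returned as a bare value rather than an {:ok, value} tuple, callers may find it convenient to normalize the result before using it in a `with` chain. The StreamResult module below is a hypothetical sketch of that idea, not part of the library.

```elixir
defmodule StreamResult do
  # Hypothetical helper, not part of RustyXML: passes the documented error
  # tuple through and wraps any other value in {:ok, value}.
  def wrap({:error, :mutex_poisoned} = error), do: error
  def wrap(value), do: {:ok, value}
end

StreamResult.wrap({3, 128})
#=> {:ok, {3, 128}}

StreamResult.wrap({:error, :mutex_poisoned})
#=> {:error, :mutex_poisoned}
```

With the library loaded, the same helper could wrap a call such as streaming_feed/2 or streaming_status/1.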

streaming_feed_sax(parser, chunk, cdata_as_chars)

@spec streaming_feed_sax(reference(), binary(), boolean()) :: binary()

Feed a chunk and return SAX events as a compact binary.

When the tail buffer is empty (common case), the NIF tokenizes the BEAM binary in-place (zero copy) and writes events directly into an OwnedBinary on the BEAM heap — no intermediate Rust Vec allocation. Only the unprocessed tail (~100 bytes) is saved between calls.

Format: sequence of <<type::8, ...>> where type 1=start, 2=end, 3=chars, 4=cdata.

streaming_finalize(parser)

@spec streaming_finalize(parser_ref()) :: [xml_event()] | {:error, :mutex_poisoned}

Finalize the streaming parser and get remaining events.

Returns {:error, :mutex_poisoned} if the parser mutex is poisoned.

streaming_finalize_sax(parser, cdata_as_chars)

@spec streaming_finalize_sax(reference(), boolean()) :: binary()

Finalize the streaming SAX parser, processing any remaining bytes.

Returns final events as a compact binary (same format as streaming_feed_sax/3).

streaming_new()

@spec streaming_new() :: parser_ref()

Create a new streaming XML parser.

The streaming parser processes XML in chunks with bounded memory usage.

Examples

parser = RustyXML.Native.streaming_new()
RustyXML.Native.streaming_feed(parser, "<root>")
RustyXML.Native.streaming_feed(parser, "<item/></root>")
events = RustyXML.Native.streaming_take_events(parser, 100)
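When the input arrives as one large binary, it can be split into fixed-size chunks before feeding. The sketch below takes the feed function as an argument so it runs without the NIF. ChunkFeeder is a hypothetical module, and it assumes the parser tolerates chunk boundaries that fall in the middle of a tag.

```elixir
defmodule ChunkFeeder do
  # Hypothetical helper, not part of RustyXML: splits `data` into pieces of
  # at most `chunk_size` bytes and passes each to `feed_fun`, stopping at
  # the first {:error, reason} result.
  def feed(data, chunk_size, feed_fun)
      when is_binary(data) and is_integer(chunk_size) and chunk_size > 0 do
    do_feed(data, chunk_size, feed_fun)
  end

  defp do_feed(<<>>, _size, _feed_fun), do: :ok

  defp do_feed(data, size, feed_fun) do
    {chunk, rest} =
      case data do
        <<chunk::binary-size(size), rest::binary>> -> {chunk, rest}
        last -> {last, <<>>}
      end

    case feed_fun.(chunk) do
      {:error, _reason} = error -> error
      _ok -> do_feed(rest, size, feed_fun)
    end
  end
end

# A stub that records chunks instead of calling the NIF. With the library
# loaded, one would pass &RustyXML.Native.streaming_feed(parser, &1).
{:ok, seen} = Agent.start_link(fn -> [] end)

:ok =
  ChunkFeeder.feed("<root><item/></root>", 8, fn chunk ->
    Agent.update(seen, &[chunk | &1])
  end)

Agent.get(seen, &Enum.reverse/1)
#=> ["<root><i", "tem/></r", "oot>"]
```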

streaming_new_with_filter(tag)

@spec streaming_new_with_filter(binary()) :: parser_ref()

Create a streaming parser with a tag filter.

Only events for the specified tag name will be emitted. Useful for extracting specific elements from large documents.

Examples

parser = RustyXML.Native.streaming_new_with_filter("item")
RustyXML.Native.streaming_feed(parser, "<root><item/><other/></root>")
# Only item events will be returned

streaming_sax_new()

@spec streaming_sax_new() :: reference()

Create a new streaming SAX parser.

Returns an opaque parser reference for use with streaming_feed_sax/3.

streaming_status(parser)

@spec streaming_status(parser_ref()) ::
  {non_neg_integer(), non_neg_integer(), boolean()} | {:error, :mutex_poisoned}

Get streaming parser status.

Returns {available_events, buffer_size, has_pending} on success, or {:error, :mutex_poisoned} if the parser mutex is poisoned.

streaming_take_elements(parser, max)

@spec streaming_take_elements(parser_ref(), non_neg_integer()) ::
  [binary()] | {:error, :mutex_poisoned}

Take up to max complete elements from the streaming parser.

Returns a list of XML binaries for complete elements. This is faster than using events because the element strings are built in Rust without needing reconstruction in Elixir.

streaming_take_events(parser, max)

@spec streaming_take_events(parser_ref(), non_neg_integer()) ::
  [xml_event()] | {:error, :mutex_poisoned}

Take up to max events from the streaming parser.

Returns {:error, :mutex_poisoned} if the parser mutex is poisoned.
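A common pattern with "take up to max" functions is to drain in batches until nothing is left. The sketch below does that with an injected take function so it runs without the NIF. Drain is a hypothetical module, and it assumes that an empty list means no more items are currently available.

```elixir
defmodule Drain do
  # Hypothetical helper, not part of RustyXML: calls `take_fun.(max)` until
  # it returns an empty list, then returns all items in order.
  # Assumption: [] signals that nothing more is available right now.
  def all(take_fun, max) when is_integer(max) and max > 0 do
    loop(take_fun, max, [])
  end

  defp loop(take_fun, max, acc) do
    case take_fun.(max) do
      {:error, :mutex_poisoned} = error -> error
      [] -> {:ok, acc |> Enum.reverse() |> Enum.concat()}
      batch when is_list(batch) -> loop(take_fun, max, [batch | acc])
    end
  end
end

# A stub queue standing in for the parser. With the library loaded, one
# would pass &RustyXML.Native.streaming_take_events(parser, &1).
{:ok, queue} =
  Agent.start_link(fn ->
    [{:start_element, "a", []}, {:text, "x"}, {:end_element, "a"}]
  end)

take = fn max -> Agent.get_and_update(queue, &Enum.split(&1, max)) end

Drain.all(take, 2)
#=> {:ok, [{:start_element, "a", []}, {:text, "x"}, {:end_element, "a"}]}
```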

streaming_take_saxy_events(parser, max, cdata_as_chars)

@spec streaming_take_saxy_events(reference(), non_neg_integer(), boolean()) ::
  [tuple()] | {:error, :mutex_poisoned}

Take events from streaming parser in Saxy-compatible format.

xpath_query(doc, xpath)

@spec xpath_query(document_ref(), binary()) :: term()

Execute an XPath query on a parsed document.

Returns the result based on the XPath expression:

  • Node-set queries return a list of element tuples
  • String queries return a string
  • Number queries return a float
  • Boolean queries return true/false

Examples

doc = RustyXML.Native.parse("<root><item>text</item></root>")
RustyXML.Native.xpath_query(doc, "//item")
#=> [{:element, "item", [], ["text"]}]
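Since the return type depends on the XPath expression, callers that accept arbitrary expressions may want to branch on the result shape. The hypothetical XPathResult module below sketches this with guards, using hand-written values in place of real query results.

```elixir
defmodule XPathResult do
  # Hypothetical helper, not part of RustyXML: tags a query result with the
  # XPath result kind implied by its Elixir type, as listed above.
  def classify(result) when is_list(result), do: {:node_set, result}
  def classify(result) when is_binary(result), do: {:string, result}
  def classify(result) when is_float(result), do: {:number, result}
  def classify(result) when is_boolean(result), do: {:boolean, result}
end

XPathResult.classify([{:element, "item", [], ["text"]}])
#=> {:node_set, [{:element, "item", [], ["text"]}]}

XPathResult.classify(2.0)
#=> {:number, 2.0}
```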

xpath_query_raw(doc, xpath)

@spec xpath_query_raw(document_ref(), binary()) :: [binary()] | term()

Execute an XPath query returning XML strings for node sets (fast path).

Instead of building nested Elixir tuples for each element, this returns the serialized XML string for each node. Much faster for queries returning many elements.

Examples

doc = RustyXML.Native.parse("<root><item>text</item></root>")
RustyXML.Native.xpath_query_raw(doc, "//item")
#=> ["<item>text</item>"]

xpath_string_value(xml, xpath)

@spec xpath_string_value(binary(), binary()) :: binary()

Execute XPath and return string value of result.

Runs on the dirty CPU scheduler since it parses raw XML input. For node-sets, returns the text content of the first node.

Examples

RustyXML.Native.xpath_string_value("<root>hello</root>", "//root/text()")
#=> "hello"

xpath_string_value_doc(doc, xpath)

@spec xpath_string_value_doc(document_ref(), binary()) :: binary()

Execute XPath on document reference and return string value.

xpath_text_list(doc, xpath)

@spec xpath_text_list(document_ref(), binary()) :: [binary()] | term()

Execute XPath query returning text values for node sets (optimized fast path).

Instead of building nested Elixir tuples for each element, returns the concatenated text content of each node as a string. Much faster for the common case where is_value: true (no e modifier).

For non-NodeSet results (numbers, strings, booleans), returns as-is.

xpath_with_subspecs(xml, parent_xpath, subspecs)

@spec xpath_with_subspecs(binary(), binary(), [{binary(), binary()}]) :: [map()]

Execute parent XPath and evaluate subspecs for each result node.

Runs on the dirty CPU scheduler since it parses raw XML input.

Returns a list of maps with each subspec evaluated relative to the parent nodes.

Examples

xml = "<items><item><id>1</id><name>A</name></item></items>"
RustyXML.Native.xpath_with_subspecs(xml, "//item", [{"id", "./id/text()"}, {"name", "./name/text()"}])
#=> [%{id: "1", name: "A"}]