EncodingRs (encoding_rs v0.2.3)

High-performance string encoding/decoding using Rust's encoding_rs crate.

This library provides fast character encoding conversion using the same encoding library that powers Firefox. It supports all encodings in the WHATWG Encoding Standard.

Features

High performance: Uses encoding_rs, the same library used by Firefox
Dirty schedulers: Large binaries automatically use dirty CPU schedulers to avoid blocking the BEAM (configurable threshold, default 64KB)
Safe error handling: Returns {:ok, result} or {:error, reason} tuples
WHATWG compliant: Supports all encodings from the WHATWG Encoding Standard

Configuration

Dirty Scheduler Threshold (compile-time)

The dirty scheduler threshold controls when operations are moved to dirty CPU schedulers. The BEAM VM has a limited number of normal schedulers, and long-running NIFs can block them, causing latency for other processes. By offloading large encoding/decoding operations to dirty schedulers, the normal schedulers remain available for other work.

Configure in your config.exs:

# Using multiplication for readability
config :encoding_rs, dirty_threshold: 128 * 1024

# Or using Elixir's underscore notation
config :encoding_rs, dirty_threshold: 131_072

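The default is 64KB (65,536 bytes). This is a compile-time setting, so a changed value takes effect only after the encoding_rs dependency is recompiled (for example with mix deps.compile encoding_rs --force).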

Increasing the threshold reduces context switching overhead, which benefits batch processing and throughput-focused workloads. However, larger operations will block normal schedulers longer, potentially causing latency for other processes.

Decreasing the threshold keeps normal schedulers more available, which benefits latency-sensitive and high-concurrency applications. However, more frequent context switching adds overhead that may reduce throughput.
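
The trade-off above comes down to a single size comparison. The following is a minimal sketch of that rule, not the library's dispatch code; the module SchedulerSketch and its choose/2 function are hypothetical names used only for illustration.

```elixir
defmodule SchedulerSketch do
  # Hypothetical helper, not part of EncodingRs. It mirrors the documented
  # rule: inputs larger than the threshold go to a dirty CPU scheduler.
  @default_threshold 64 * 1024

  @spec choose(binary(), non_neg_integer()) :: :normal | :dirty_cpu
  def choose(input, threshold \\ @default_threshold) when is_binary(input) do
    if byte_size(input) > threshold, do: :dirty_cpu, else: :normal
  end
end

small = :binary.copy("a", 1_024)
large = :binary.copy("a", 100 * 1_024)

SchedulerSketch.choose(small)               # :normal (1 KB is below 64 KB)
SchedulerSketch.choose(large)               # :dirty_cpu (100 KB exceeds 64 KB)
SchedulerSketch.choose(large, 128 * 1_024)  # :normal once the threshold is raised
```

Raising the threshold to 128 KB keeps the 100 KB input on a normal scheduler, which is exactly the throughput-versus-latency choice described above.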

Maximum Input Size (runtime)

A configurable safety limit on the maximum input size accepted by encoding/decoding operations. Inputs exceeding this limit return {:error, :input_too_large} before reaching the NIF. This guards against excessive memory allocation: a single large input can cause up to 3x memory amplification in the NIF (input buffer + output buffer + BEAM binary copy).

Configure in your config.exs or runtime.exs:

config :encoding_rs, max_input_size: 200 * 1024 * 1024

The default is 100MB (104,857,600 bytes). This is a runtime setting, so it can be changed without recompiling.

Set to :infinity to disable the limit entirely:

# Trusted environment — no size cap
config :encoding_rs, max_input_size: :infinity

The value must be a non-negative integer or :infinity. Invalid values (e.g., strings, negative numbers) will raise an ArgumentError on first use.

Warning

Disabling the size limit or setting it very high removes a safety guardrail against memory exhaustion. Only do this when inputs are trusted and bounded by other means (e.g., request body limits, file size checks). For untrusted input, prefer the streaming decoder (EncodingRs.Decoder) with bounded chunk sizes.

See max_input_size/0 for more details.
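
To make the rules in this section concrete, here is a minimal sketch of how such a limit could be validated and applied. It is an illustration under stated assumptions, not the library's implementation; SizeLimitSketch and all of its functions are hypothetical.

```elixir
defmodule SizeLimitSketch do
  # Hypothetical illustration of the documented rules; not EncodingRs code.

  # Accepts a non-negative integer or :infinity, raises otherwise.
  def validate!(:infinity), do: :infinity
  def validate!(limit) when is_integer(limit) and limit >= 0, do: limit

  def validate!(other) do
    raise ArgumentError, "invalid max_input_size: #{inspect(other)}"
  end

  # Rejects oversized input before any encoding work would start.
  def check(input, limit) when is_binary(input) do
    case validate!(limit) do
      :infinity -> :ok
      max when byte_size(input) > max -> {:error, :input_too_large}
      _max -> :ok
    end
  end

  # Rough worst-case memory estimate using the 3x amplification figure.
  def worst_case_bytes(input_size), do: input_size * 3
end

SizeLimitSketch.check("hello", 100 * 1024 * 1024)    # :ok
SizeLimitSketch.check("hello", 4)                    # {:error, :input_too_large}
SizeLimitSketch.check("hello", :infinity)            # :ok
SizeLimitSketch.worst_case_bytes(100 * 1024 * 1024)  # 314_572_800 (about 300 MB)
```

With the default limit, a single maximum-size input could therefore account for roughly 300 MB of memory at peak, which is why the limit is worth keeping for untrusted input.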

Supported Encodings

UTF-8, UTF-16LE, UTF-16BE
Windows code pages: 874, 1250-1258, 949, 932
ISO-8859 family: 2, 3, 4, 5, 6, 7, 8, 8-I, 10, 13, 14, 15, 16
IBM866
KOI8-R, KOI8-U
macintosh, x-mac-cyrillic
Asian encodings: Shift_JIS, EUC-JP, ISO-2022-JP, EUC-KR, GBK, GB18030, Big5
x-user-defined

Examples

iex> EncodingRs.encode("Hello", "windows-1252")
{:ok, "Hello"}

iex> EncodingRs.decode(<<72, 101, 108, 108, 111>>, "windows-1252")
{:ok, "Hello"}

iex> EncodingRs.encode!("¥₪ש", "windows-1255")
<<165, 164, 249>>

iex> EncodingRs.decode!(<<165, 164, 249>>, "windows-1255")
"¥₪ש"

iex> EncodingRs.encoding_exists?("utf-8")
true

iex> EncodingRs.encoding_exists?("not-an-encoding")
false

Summary

Types

batch_result(t)

Result for a single item in a batch operation: {:ok, value} on success, or an error tuple.

bom_result()

Result of BOM detection: encoding name and BOM length in bytes.

decode_batch_item()

Input item for batch decoding: {binary, encoding}

encode_batch_item()

Input item for batch encoding: {string, encoding}

encoding()

An encoding label string (e.g., "utf-8", "shift_jis", "windows-1252").

error_reason()

Error reason atoms returned by encoding/decoding functions.

Functions

canonical_name(encoding)

Returns the canonical name for an encoding label.

decode(binary, encoding)

Decodes a binary from the specified encoding to a UTF-8 string.

decode!(binary, encoding)

Decodes a binary from the specified encoding to a UTF-8 string, raising on failure.

decode_batch(items)

Decodes multiple binaries in a single NIF call.

detect_and_strip_bom(data)

Detects the encoding from a BOM and strips it from the data.

detect_bom(data)

Detects the encoding from a Byte Order Mark (BOM) at the start of the data.

dirty_threshold()

Returns the threshold (in bytes) above which dirty schedulers are used.

encode(string, encoding)

Encodes a UTF-8 string to the specified encoding.

encode!(string, encoding)

Encodes a UTF-8 string to the specified encoding, raising on failure.

encode_batch(items)

Encodes multiple strings in a single NIF call.

encoding_exists?(encoding)

Checks if an encoding label is valid and supported.

list_encodings()

Returns a list of all supported encoding names.

max_input_size()

Returns the maximum input size (in bytes) allowed for encoding/decoding operations.

Types

batch_result(t)

@type batch_result(t) :: {:ok, t} | {:error, :unknown_encoding | :input_too_large}

Result for a single item in a batch operation: {:ok, value} on success, or an error tuple.

bom_result()

@type bom_result() ::
  {:ok, encoding(), bom_length :: non_neg_integer()} | {:error, :no_bom}

Result of BOM detection: encoding name and BOM length in bytes.

decode_batch_item()

@type decode_batch_item() :: {binary(), encoding()}

Input item for batch decoding: {binary, encoding}

encode_batch_item()

@type encode_batch_item() :: {String.t(), encoding()}

Input item for batch encoding: {string, encoding}

encoding()

@type encoding() :: String.t()

An encoding label string (e.g., "utf-8", "shift_jis", "windows-1252").

See list_encodings/0 for all supported encodings, or check the WHATWG Encoding Standard.

error_reason()

@type error_reason() :: :unknown_encoding | :no_bom | :input_too_large

Error reason atoms returned by encoding/decoding functions.

Functions

canonical_name(encoding)

@spec canonical_name(encoding()) :: {:ok, encoding()} | {:error, :unknown_encoding}

Returns the canonical name for an encoding label.

Encoding labels have many aliases (e.g., "latin1", "iso-8859-1", "iso_8859-1"). This function returns the canonical WHATWG name for any valid alias.

Examples

iex> EncodingRs.canonical_name("latin1")
{:ok, "windows-1252"}

iex> EncodingRs.canonical_name("utf8")
{:ok, "UTF-8"}

iex> EncodingRs.canonical_name("invalid")
{:error, :unknown_encoding}

decode(binary, encoding)

@spec decode(binary(), encoding()) ::
  {:ok, String.t()} | {:error, :unknown_encoding | :input_too_large}

Decodes a binary from the specified encoding to a UTF-8 string.

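Returns {:ok, string} on success, or {:error, reason} on failure. Malformed or unmappable byte sequences are replaced with the Unicode replacement character (U+FFFD), so a successful result does not guarantee that the input was valid in the given encoding.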

Automatically uses dirty CPU schedulers for binaries larger than the configured threshold (see dirty_threshold/0).

Examples

iex> EncodingRs.decode(<<72, 101, 108, 108, 111>>, "windows-1252")
{:ok, "Hello"}

iex> EncodingRs.decode(<<0xFF>>, "invalid-encoding")
{:error, :unknown_encoding}
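
Because bad bytes are replaced rather than reported, callers that need to reject damaged input have to inspect the decoded string themselves. The sketch below shows one way to do that on top of the documented decode/2. The module StrictDecodeSketch and the :lossy atom are hypothetical and are not part of the library.

```elixir
defmodule StrictDecodeSketch do
  # Hypothetical helper built on the documented EncodingRs.decode/2.
  # :lossy is an atom made up for this sketch, not a library error reason.
  @replacement "\uFFFD"

  def decode_strict(binary, encoding) do
    case EncodingRs.decode(binary, encoding) do
      {:ok, text} ->
        # Caveat: this also trips when the source text legitimately
        # contained U+FFFD before it was encoded.
        if String.contains?(text, @replacement) do
          {:error, :lossy}
        else
          {:ok, text}
        end

      {:error, reason} ->
        {:error, reason}
    end
  end
end
```

Calling StrictDecodeSketch.decode_strict(<<72, 101, 108, 108, 111>>, "windows-1252") passes the {:ok, "Hello"} result through unchanged, while library errors such as :unknown_encoding are returned as they are.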

decode!(binary, encoding)

@spec decode!(binary(), encoding()) :: String.t()

Decodes a binary from the specified encoding to a UTF-8 string.

Returns the decoded string on success, or raises an ArgumentError on failure.

Examples

iex> EncodingRs.decode!(<<72, 101, 108, 108, 111>>, "windows-1252")
"Hello"

iex> EncodingRs.decode!(<<0xFF>>, "invalid-encoding")
** (ArgumentError) unknown encoding: invalid-encoding

decode_batch(items)

@spec decode_batch([decode_batch_item()]) :: [batch_result(String.t())]

Decodes multiple binaries in a single NIF call.

This is more efficient than calling decode/2 repeatedly when processing many items, as it amortizes the NIF dispatch overhead.

Results are returned in the same order as the input items.

Note: Batch operations always use dirty CPU schedulers, regardless of input size. See the Batch Processing Guide for details.

Arguments

items - List of {binary, encoding} tuples to decode

Returns

List of {:ok, string}, {:error, :unknown_encoding}, or {:error, :input_too_large} tuples.

Examples

iex> items = [{<<72, 101, 108, 108, 111>>, "windows-1252"}, {<<0x82, 0xA0>>, "shift_jis"}]
iex> EncodingRs.decode_batch(items)
[{:ok, "Hello"}, {:ok, "あ"}]

iex> EncodingRs.decode_batch([{<<72>>, "invalid-encoding"}])
[{:error, :unknown_encoding}]
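
Since results come back in input order, each result can be paired with the item that produced it. The following sketch separates successes from failures that way; BatchSketch and decode_all/1 are hypothetical names, and only decode_batch/1 comes from the library.

```elixir
defmodule BatchSketch do
  # Hypothetical helper around the documented EncodingRs.decode_batch/1.
  # Relies on results being returned in the same order as the inputs.
  def decode_all(items) do
    items
    |> Enum.zip(EncodingRs.decode_batch(items))
    |> Enum.reduce({[], []}, fn
      {_item, {:ok, text}}, {oks, errs} ->
        {[text | oks], errs}

      {{_binary, encoding}, {:error, reason}}, {oks, errs} ->
        {oks, [{encoding, reason} | errs]}
    end)
    |> then(fn {oks, errs} -> {Enum.reverse(oks), Enum.reverse(errs)} end)
  end
end
```

Given the two example inputs above combined into one list, BatchSketch.decode_all/1 would be expected to return the decoded strings in the first element and {"invalid-encoding", :unknown_encoding} in the second.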

detect_and_strip_bom(data)

@spec detect_and_strip_bom(binary()) ::
  {:ok, encoding(), binary()} | {:error, :no_bom}

Detects the encoding from a BOM and strips it from the data.

Convenience function that combines BOM detection with stripping the BOM from the input data. Useful when you want to both detect the encoding and get the data without the BOM prefix.

Returns

{:ok, encoding, data_without_bom} - BOM detected and stripped
{:error, :no_bom} - No BOM found, data unchanged

Examples

iex> EncodingRs.detect_and_strip_bom(<<0xEF, 0xBB, 0xBF, "hello">>)
{:ok, "UTF-8", "hello"}

iex> EncodingRs.detect_and_strip_bom("hello")
{:error, :no_bom}
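
A common use is to decode with the BOM's encoding when one is present and fall back to a caller-supplied guess otherwise. A minimal sketch, assuming a hypothetical BomDecodeSketch module and an arbitrary default fallback of "windows-1252":

```elixir
defmodule BomDecodeSketch do
  # Hypothetical helper combining two documented functions:
  # EncodingRs.detect_and_strip_bom/1 and EncodingRs.decode/2.
  # The fallback is the caller's guess for data without a BOM; the
  # default used here is an assumption, not a library setting.
  def decode_with_bom(data, fallback \\ "windows-1252") do
    case EncodingRs.detect_and_strip_bom(data) do
      {:ok, encoding, rest} -> EncodingRs.decode(rest, encoding)
      {:error, :no_bom} -> EncodingRs.decode(data, fallback)
    end
  end
end
```

The return value is whatever decode/2 returns, so callers still handle {:ok, string} and {:error, reason} as usual.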

detect_bom(data)

@spec detect_bom(binary()) :: bom_result()

Detects the encoding from a Byte Order Mark (BOM) at the start of the data.

BOMs are special byte sequences at the beginning of a file that indicate the encoding. This function checks the first few bytes of the input and returns the detected encoding if a BOM is found.

Supported BOMs:

UTF-8: <<0xEF, 0xBB, 0xBF>> (3 bytes)
UTF-16LE: <<0xFF, 0xFE>> (2 bytes)
UTF-16BE: <<0xFE, 0xFF>> (2 bytes)

Returns

{:ok, encoding, bom_length} - BOM detected, returns encoding name and BOM size
{:error, :no_bom} - No BOM found at the start of the data

Examples

iex> EncodingRs.detect_bom(<<0xEF, 0xBB, 0xBF, "hello">>)
{:ok, "UTF-8", 3}

iex> EncodingRs.detect_bom(<<0xFF, 0xFE, 0x48, 0x00>>)
{:ok, "UTF-16LE", 2}

iex> EncodingRs.detect_bom(<<0xFE, 0xFF, 0x00, 0x48>>)
{:ok, "UTF-16BE", 2}

iex> EncodingRs.detect_bom("hello")
{:error, :no_bom}

iex> EncodingRs.detect_bom(<<>>)
{:error, :no_bom}
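
The byte patterns listed under Supported BOMs can be expressed directly as binary pattern matches. The sketch below is a pure-Elixir illustration of that table and of how the reported BOM length is used for stripping. BomSketch is hypothetical and is not how the library implements detection.

```elixir
defmodule BomSketch do
  # Hypothetical pure-Elixir rendering of the BOM table above.
  # Shown only to make the byte patterns concrete; not the library's code.
  def detect(<<0xEF, 0xBB, 0xBF, _rest::binary>>), do: {:ok, "UTF-8", 3}
  def detect(<<0xFF, 0xFE, _rest::binary>>), do: {:ok, "UTF-16LE", 2}
  def detect(<<0xFE, 0xFF, _rest::binary>>), do: {:ok, "UTF-16BE", 2}
  def detect(data) when is_binary(data), do: {:error, :no_bom}

  # Drops the BOM using the length reported by detect/1.
  def strip(data) do
    case detect(data) do
      {:ok, encoding, len} ->
        {:ok, encoding, binary_part(data, len, byte_size(data) - len)}

      {:error, :no_bom} ->
        {:error, :no_bom}
    end
  end
end

BomSketch.detect(<<0xFF, 0xFE, 0x48, 0x00>>)       # {:ok, "UTF-16LE", 2}
BomSketch.strip(<<0xEF, 0xBB, 0xBF, "hello">>)     # {:ok, "UTF-8", "hello"}
BomSketch.detect(<<>>)                             # {:error, :no_bom}
```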

dirty_threshold()

@spec dirty_threshold() :: non_neg_integer()

Returns the threshold (in bytes) above which dirty schedulers are used.

Encode/decode operations on binaries larger than this threshold will automatically use dirty CPU schedulers to avoid blocking the BEAM's normal schedulers. This prevents long-running encoding operations from causing latency for other processes.

This value can be configured in your config.exs:

# Using multiplication for readability
config :encoding_rs, dirty_threshold: 128 * 1024

# Or using Elixir's underscore notation
config :encoding_rs, dirty_threshold: 131_072

The default is 64KB (65,536 bytes).

Examples

iex> EncodingRs.dirty_threshold()
65536

encode(string, encoding)

@spec encode(String.t(), encoding()) ::
  {:ok, binary()} | {:error, :unknown_encoding | :input_too_large}

Encodes a UTF-8 string to the specified encoding.

Returns {:ok, binary} on success, or {:error, reason} on failure. Unmappable characters are replaced with a suitable fallback character.

Automatically uses dirty CPU schedulers for strings larger than the configured threshold (see dirty_threshold/0).

Examples

iex> EncodingRs.encode("Hello", "windows-1252")
{:ok, "Hello"}

iex> EncodingRs.encode("Hello", "invalid-encoding")
{:error, :unknown_encoding}

encode!(string, encoding)

@spec encode!(String.t(), encoding()) :: binary()

Encodes a UTF-8 string to the specified encoding.

Returns the encoded binary on success, or raises an ArgumentError on failure.

Examples

iex> EncodingRs.encode!("Hello", "windows-1252")
"Hello"

iex> EncodingRs.encode!("Hello", "invalid-encoding")
** (ArgumentError) unknown encoding: invalid-encoding

encode_batch(items)

@spec encode_batch([encode_batch_item()]) :: [batch_result(binary())]

Encodes multiple strings in a single NIF call.

This is more efficient than calling encode/2 repeatedly when processing many items, as it amortizes the NIF dispatch overhead.

Results are returned in the same order as the input items.

Note: Batch operations always use dirty CPU schedulers, regardless of input size. See the Batch Processing Guide for details.

Arguments

items - List of {string, encoding} tuples to encode

Returns

List of {:ok, binary}, {:error, :unknown_encoding}, or {:error, :input_too_large} tuples.

Examples

iex> items = [{"Hello", "windows-1252"}, {"あ", "shift_jis"}]
iex> EncodingRs.encode_batch(items)
[{:ok, "Hello"}, {:ok, <<130, 160>>}]

iex> EncodingRs.encode_batch([{"test", "invalid-encoding"}])
[{:error, :unknown_encoding}]

encoding_exists?(encoding)

@spec encoding_exists?(encoding()) :: boolean()

Checks if an encoding label is valid and supported.

Examples

iex> EncodingRs.encoding_exists?("utf-8")
true

iex> EncodingRs.encoding_exists?("UTF-8")
true

iex> EncodingRs.encoding_exists?("not-an-encoding")
false

list_encodings()

@spec list_encodings() :: [encoding()]

Returns a list of all supported encoding names.

Examples

iex> "UTF-8" in EncodingRs.list_encodings()
true

iex> "Shift_JIS" in EncodingRs.list_encodings()
true

max_input_size()

@spec max_input_size() :: non_neg_integer() | :infinity

Returns the maximum input size (in bytes) allowed for encoding/decoding operations.

Inputs larger than this limit will return {:error, :input_too_large} instead of being passed to the NIF. This prevents excessive memory allocation from untrusted or unexpectedly large inputs.

This value is read at runtime via Application.get_env/3, so it can be changed in runtime.exs or dynamically with Application.put_env/3 without recompiling the library.

Configure in your config.exs or runtime.exs:

config :encoding_rs, max_input_size: 200 * 1024 * 1024

The default is 100MB (104,857,600 bytes).

Set to :infinity to disable the size limit entirely. This is appropriate for trusted environments where inputs are known to be safe, but should be avoided when processing untrusted data, because a large input can cause memory amplification of up to 3x in the NIF (input buffer + output buffer + BEAM binary copy).

# Disable size limit (trusted inputs only)
config :encoding_rs, max_input_size: :infinity

The value must be a non-negative integer or :infinity. Invalid values (e.g., strings, negative numbers) will raise an ArgumentError on first use.

Examples

iex> EncodingRs.max_input_size()
104857600
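
Because the value is read through the application environment, it can also be overridden temporarily, for example in a test. The sketch below wraps that pattern; RuntimeLimitSketch and with_limit/2 are hypothetical names, and only the :encoding_rs application and the :max_input_size key come from the documentation above.

```elixir
defmodule RuntimeLimitSketch do
  # Hypothetical helper: runs `fun` with a temporary max_input_size and
  # restores the previous setting afterwards, even if `fun` raises.
  # The application environment is global, so this is not safe to use
  # from concurrently running processes.
  def with_limit(limit, fun) when is_function(fun, 0) do
    previous = Application.fetch_env(:encoding_rs, :max_input_size)
    Application.put_env(:encoding_rs, :max_input_size, limit)

    try do
      fun.()
    after
      case previous do
        {:ok, value} -> Application.put_env(:encoding_rs, :max_input_size, value)
        :error -> Application.delete_env(:encoding_rs, :max_input_size)
      end
    end
  end
end
```

For instance, RuntimeLimitSketch.with_limit(1024, fn -> EncodingRs.decode(:binary.copy("a", 2048), "utf-8") end) should, going by the description above, return {:error, :input_too_large} and leave the configured limit as it was.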