EncodingRs (encoding_rs v0.2.3)
High-performance string encoding/decoding using Rust's encoding_rs crate.
This library provides fast character encoding conversion using the same encoding library that powers Firefox. It supports all encodings in the WHATWG Encoding Standard.
Features
- High performance: Uses encoding_rs, the same library used by Firefox
- Dirty schedulers: Large binaries automatically use dirty CPU schedulers to avoid blocking the BEAM (configurable threshold, default 64KB)
- Safe error handling: Returns {:ok, result} or {:error, reason} tuples
- WHATWG compliant: Supports all encodings from the WHATWG Encoding Standard
Configuration
Dirty Scheduler Threshold (compile-time)
The dirty scheduler threshold controls when operations are moved to dirty CPU schedulers. The BEAM VM has a limited number of normal schedulers, and long-running NIFs can block them, causing latency for other processes. By offloading large encoding/decoding operations to dirty schedulers, the normal schedulers remain available for other work.
Configure in your config.exs:
# Using multiplication for readability
config :encoding_rs, dirty_threshold: 128 * 1024
# Or using Elixir's underscore notation
config :encoding_rs, dirty_threshold: 131_072
The default is 64KB (65,536 bytes). This is a compile-time setting.
Increasing the threshold reduces context switching overhead, which benefits batch processing and throughput-focused workloads. However, larger operations will block normal schedulers longer, potentially causing latency for other processes.
Decreasing the threshold keeps normal schedulers more available, which benefits latency-sensitive and high-concurrency applications. However, more frequent context switching adds overhead that may reduce throughput.
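The size-based routing described above can be pictured with a small sketch. `ThresholdSketch` and `scheduler_for/2` are hypothetical names used only for illustration; they are not part of EncodingRs, and the library's actual dispatch happens inside its own code. The sketch assumes the documented rule that only binaries larger than the threshold go to dirty schedulers.

```elixir
defmodule ThresholdSketch do
  # Hypothetical helper, not part of EncodingRs. It only illustrates the
  # documented rule: inputs larger than the threshold use dirty schedulers.
  @default_threshold 64 * 1024

  def scheduler_for(input, threshold \\ @default_threshold)
      when is_binary(input) and is_integer(threshold) do
    if byte_size(input) > threshold, do: :dirty_cpu, else: :normal
  end
end
```

With the default threshold, a 65,536-byte input stays on a normal scheduler and a 65,537-byte input is the first size that would be moved to a dirty CPU scheduler.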
Maximum Input Size (runtime)
A configurable safety limit on the maximum input size accepted by encoding/decoding
operations. Inputs exceeding this limit return {:error, :input_too_large} before
reaching the NIF. This guards against excessive memory allocation — a single large
input can cause up to 3x memory amplification in the NIF (input buffer + output
buffer + BEAM binary copy).
Configure in your config.exs or runtime.exs:
config :encoding_rs, max_input_size: 200 * 1024 * 1024
The default is 100MB (104,857,600 bytes). This is a runtime setting; it can be changed without recompiling.
Set to :infinity to disable the limit entirely:
# Trusted environment — no size cap
config :encoding_rs, max_input_size: :infinity
The value must be a non-negative integer or :infinity. Invalid values
(e.g., strings, negative numbers) will raise an ArgumentError on first use.
Warning
Disabling the size limit or setting it very high removes a safety guardrail
against memory exhaustion. Only do this when inputs are trusted and bounded
by other means (e.g., request body limits, file size checks). For untrusted
input, prefer the streaming decoder (EncodingRs.Decoder) with bounded
chunk sizes.
See max_input_size/0 for more details.
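As a rough illustration of the size check described in this section, the following sketch applies the same rule in plain Elixir. `SizeGuardSketch` and `check/2` are hypothetical names, not library functions, and the sketch assumes that "exceeding" means strictly greater than the limit.

```elixir
defmodule SizeGuardSketch do
  # Hypothetical illustration of the documented limit; not the library's code.
  # :infinity disables the check entirely.
  def check(input, :infinity) when is_binary(input), do: :ok

  # Inputs strictly larger than the limit are rejected before any conversion.
  def check(input, max) when is_binary(input) and is_integer(max) and max >= 0 do
    if byte_size(input) > max, do: {:error, :input_too_large}, else: :ok
  end
end
```

Given the 3x amplification noted above, an input right at the default 100MB limit could need roughly 300MB of memory while it is being converted, which is a useful figure to keep in mind when raising the limit.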
Supported Encodings
- UTF-8, UTF-16LE, UTF-16BE
- Windows code pages: 874, 1250-1258, 949, 932
- ISO-8859 family: 2, 3, 4, 5, 6, 7, 8, 8-I, 10, 13, 14, 15, 16
- IBM866
- KOI8-R, KOI8-U
- macintosh, x-mac-cyrillic
- Asian encodings: Shift_JIS, EUC-JP, ISO-2022-JP, EUC-KR, GBK, GB18030, Big5
- x-user-defined
Examples
iex> EncodingRs.encode("Hello", "windows-1252")
{:ok, "Hello"}
iex> EncodingRs.decode(<<72, 101, 108, 108, 111>>, "windows-1252")
{:ok, "Hello"}
iex> EncodingRs.encode!("¥₪ש", "windows-1255")
<<165, 164, 249>>
iex> EncodingRs.decode!(<<165, 164, 249>>, "windows-1255")
"¥₪ש"
iex> EncodingRs.encoding_exists?("utf-8")
true
iex> EncodingRs.encoding_exists?("not-an-encoding")
false
Summary
Types
- batch_result(t): Result from batch operations.
- bom_result(): Result of BOM detection: encoding name and BOM length in bytes.
- decode_batch_item(): Input item for batch decoding: {binary, encoding}
- encode_batch_item(): Input item for batch encoding: {string, encoding}
- encoding(): An encoding label string (e.g., "utf-8", "shift_jis", "windows-1252").
- error_reason(): Error reason atoms returned by encoding/decoding functions.
Functions
- canonical_name/1: Returns the canonical name for an encoding label.
- decode/2: Decodes a binary from the specified encoding to a UTF-8 string.
- decode!/2: Decodes a binary from the specified encoding to a UTF-8 string, raising an ArgumentError on failure.
- decode_batch/1: Decodes multiple binaries in a single NIF call.
- detect_and_strip_bom/1: Detects the encoding from a BOM and strips it from the data.
- detect_bom/1: Detects the encoding from a Byte Order Mark (BOM) at the start of the data.
- dirty_threshold/0: Returns the threshold (in bytes) above which dirty schedulers are used.
- encode/2: Encodes a UTF-8 string to the specified encoding.
- encode!/2: Encodes a UTF-8 string to the specified encoding, raising an ArgumentError on failure.
- encode_batch/1: Encodes multiple strings in a single NIF call.
- encoding_exists?/1: Checks if an encoding label is valid and supported.
- list_encodings/0: Returns a list of all supported encoding names.
- max_input_size/0: Returns the maximum input size (in bytes) allowed for encoding/decoding operations.
Types
@type batch_result(t) :: {:ok, t} | {:error, :unknown_encoding | :input_too_large}
Result from batch operations
@type bom_result() :: {:ok, encoding(), bom_length :: non_neg_integer()} | {:error, :no_bom}
Result of BOM detection: encoding name and BOM length in bytes.
Input item for batch decoding: {binary, encoding}
Input item for batch encoding: {string, encoding}
@type encoding() :: String.t()
An encoding label string (e.g., "utf-8", "shift_jis", "windows-1252").
See list_encodings/0 for all supported encodings, or check the
WHATWG Encoding Standard.
@type error_reason() :: :unknown_encoding | :no_bom | :input_too_large
Error reason atoms returned by encoding/decoding functions.
Functions
Returns the canonical name for an encoding label.
Encoding labels have many aliases (e.g., "latin1", "iso-8859-1", "iso_8859-1"). This function returns the canonical WHATWG name for any valid alias.
Examples
iex> EncodingRs.canonical_name("latin1")
{:ok, "windows-1252"}
iex> EncodingRs.canonical_name("utf8")
{:ok, "UTF-8"}
iex> EncodingRs.canonical_name("invalid")
{:error, :unknown_encoding}
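Because aliases resolve to a single canonical name, one way to check whether two labels refer to the same encoding is to compare their canonical names. The module and function names below (`LabelCompare`, `same_encoding?/2`) are hypothetical; only `EncodingRs.canonical_name/1` comes from the library.

```elixir
defmodule LabelCompare do
  # Hypothetical helper built on EncodingRs.canonical_name/1.
  # Returns false if either label is unknown.
  def same_encoding?(label_a, label_b) do
    with {:ok, name_a} <- EncodingRs.canonical_name(label_a),
         {:ok, name_b} <- EncodingRs.canonical_name(label_b) do
      name_a == name_b
    else
      {:error, _reason} -> false
    end
  end
end
```

Treating an unknown label as "not the same" is a design choice of this sketch; a caller that needs to distinguish typos from genuine mismatches may prefer to propagate the error tuple instead.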
@spec decode(binary(), encoding()) :: {:ok, String.t()} | {:error, :unknown_encoding | :input_too_large}
Decodes a binary from the specified encoding to a UTF-8 string.
Returns {:ok, string} on success, or {:error, reason} on failure.
Unmappable bytes are replaced with the Unicode replacement character (U+FFFD).
Automatically uses dirty CPU schedulers for binaries larger than the
configured threshold (see dirty_threshold/0).
Examples
iex> EncodingRs.decode(<<72, 101, 108, 108, 111>>, "windows-1252")
{:ok, "Hello"}
iex> EncodingRs.decode(<<0xFF>>, "invalid-encoding")
{:error, :unknown_encoding}
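Since decode/2 substitutes U+FFFD rather than failing, callers that need to reject damaged input have to look for the replacement character themselves. The sketch below is one possible approach; `LossyCheck`, `decode_strict/2`, and the `:replacement_found` atom are hypothetical names, not part of EncodingRs.

```elixir
defmodule LossyCheck do
  # Hypothetical wrapper around EncodingRs.decode/2 that treats any U+FFFD
  # in the output as a failure.
  def decode_strict(bytes, encoding) do
    case EncodingRs.decode(bytes, encoding) do
      {:ok, text} ->
        if String.contains?(text, "\uFFFD") do
          {:error, :replacement_found}
        else
          {:ok, text}
        end

      {:error, reason} ->
        {:error, reason}
    end
  end
end
```

Note that this check cannot tell a substituted U+FFFD from one that was genuinely present in the source text, so it may reject some valid inputs.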
Decodes a binary from the specified encoding to a UTF-8 string.
Returns the decoded string on success, or raises an ArgumentError on failure.
Examples
iex> EncodingRs.decode!(<<72, 101, 108, 108, 111>>, "windows-1252")
"Hello"
iex> EncodingRs.decode!(<<0xFF>>, "invalid-encoding")
** (ArgumentError) unknown encoding: invalid-encoding
@spec decode_batch([decode_batch_item()]) :: [batch_result(String.t())]
Decodes multiple binaries in a single NIF call.
This is more efficient than calling decode/2 repeatedly when processing
many items, as it amortizes the NIF dispatch overhead.
Results are returned in the same order as the input items.
Note: Batch operations always use dirty CPU schedulers, regardless of input size. See the Batch Processing Guide for details.
Arguments
- items - List of {binary, encoding} tuples to decode
Returns
List of {:ok, string}, {:error, :unknown_encoding}, or
{:error, :input_too_large} tuples.
Examples
iex> items = [{<<72, 101, 108, 108, 111>>, "windows-1252"}, {<<0x82, 0xA0>>, "shift_jis"}]
iex> EncodingRs.decode_batch(items)
[{:ok, "Hello"}, {:ok, "あ"}]
iex> EncodingRs.decode_batch([{<<72>>, "invalid-encoding"}])
[{:error, :unknown_encoding}]
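Because results come back in input order, each result can be paired with the item that produced it. The following sketch splits a batch into successes and failures; `BatchReport` and `decode_all/1` are hypothetical names introduced for illustration.

```elixir
defmodule BatchReport do
  # Hypothetical helper: pairs every input item with its result, then
  # separates successful decodes from failed ones.
  def decode_all(items) do
    items
    |> Enum.zip(EncodingRs.decode_batch(items))
    |> Enum.split_with(fn {_item, result} -> match?({:ok, _}, result) end)
  end
end
```

The return value is a `{successes, failures}` tuple of `{item, result}` pairs, which keeps enough context to log or retry the failed items.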
Detects the encoding from a BOM and strips it from the data.
Convenience function that combines BOM detection with stripping the BOM from the input data. Useful when you want to both detect the encoding and get the data without the BOM prefix.
Returns
- {:ok, encoding, data_without_bom} - BOM detected and stripped
- {:error, :no_bom} - No BOM found, data unchanged
Examples
iex> EncodingRs.detect_and_strip_bom(<<0xEF, 0xBB, 0xBF, "hello">>)
{:ok, "UTF-8", "hello"}
iex> EncodingRs.detect_and_strip_bom("hello")
{:error, :no_bom}
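A common use is to decode with the BOM-indicated encoding when one is present and fall back to a caller-supplied encoding otherwise. This is a minimal sketch under that assumption; `BomAwareDecode` and its `decode/2` are hypothetical names, and the sketch relies on the returned encoding name being accepted as a label by `EncodingRs.decode/2`.

```elixir
defmodule BomAwareDecode do
  # Hypothetical helper combining detect_and_strip_bom/1 with decode/2.
  def decode(data, fallback_encoding) do
    case EncodingRs.detect_and_strip_bom(data) do
      # A BOM was found: decode the remainder with the detected encoding.
      {:ok, encoding, rest} -> EncodingRs.decode(rest, encoding)
      # No BOM: decode the untouched data with the caller's fallback.
      {:error, :no_bom} -> EncodingRs.decode(data, fallback_encoding)
    end
  end
end
```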
@spec detect_bom(binary()) :: bom_result()
Detects the encoding from a Byte Order Mark (BOM) at the start of the data.
BOMs are special byte sequences at the beginning of a file that indicate the encoding. This function checks the first few bytes of the input and returns the detected encoding if a BOM is found.
Supported BOMs:
- UTF-8: <<0xEF, 0xBB, 0xBF>> (3 bytes)
- UTF-16LE: <<0xFF, 0xFE>> (2 bytes)
- UTF-16BE: <<0xFE, 0xFF>> (2 bytes)
Returns
- {:ok, encoding, bom_length} - BOM detected, returns encoding name and BOM size
- {:error, :no_bom} - No BOM found at the start of the data
Examples
iex> EncodingRs.detect_bom(<<0xEF, 0xBB, 0xBF, "hello">>)
{:ok, "UTF-8", 3}
iex> EncodingRs.detect_bom(<<0xFF, 0xFE, 0x48, 0x00>>)
{:ok, "UTF-16LE", 2}
iex> EncodingRs.detect_bom(<<0xFE, 0xFF, 0x00, 0x48>>)
{:ok, "UTF-16BE", 2}
iex> EncodingRs.detect_bom("hello")
{:error, :no_bom}
iex> EncodingRs.detect_bom(<<>>)
{:error, :no_bom}
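The BOM table above maps naturally onto binary pattern matching. The following pure-Elixir sketch reproduces the documented results and shows how the returned BOM length can be used to drop the prefix. `BomSketch` is a hypothetical module written for illustration; it is not how the library itself is implemented.

```elixir
defmodule BomSketch do
  # Illustration only: mirrors the documented BOM table with pattern matching.
  def detect(<<0xEF, 0xBB, 0xBF, _rest::binary>>), do: {:ok, "UTF-8", 3}
  def detect(<<0xFF, 0xFE, _rest::binary>>), do: {:ok, "UTF-16LE", 2}
  def detect(<<0xFE, 0xFF, _rest::binary>>), do: {:ok, "UTF-16BE", 2}
  def detect(data) when is_binary(data), do: {:error, :no_bom}

  # Uses the BOM length to return the data that follows the BOM.
  def strip(data) do
    case detect(data) do
      {:ok, encoding, len} ->
        {:ok, encoding, binary_part(data, len, byte_size(data) - len)}

      {:error, :no_bom} ->
        {:error, :no_bom}
    end
  end
end
```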
@spec dirty_threshold() :: non_neg_integer()
Returns the threshold (in bytes) above which dirty schedulers are used.
Encode/decode operations on binaries larger than this threshold will automatically use dirty CPU schedulers to avoid blocking the BEAM's normal schedulers. This prevents long-running encoding operations from causing latency for other processes.
This value can be configured in your config.exs:
# Using multiplication for readability
config :encoding_rs, dirty_threshold: 128 * 1024
# Or using Elixir's underscore notation
config :encoding_rs, dirty_threshold: 131_072
The default is 64KB (65,536 bytes).
Examples
iex> EncodingRs.dirty_threshold()
65536
@spec encode(String.t(), encoding()) :: {:ok, binary()} | {:error, :unknown_encoding | :input_too_large}
Encodes a UTF-8 string to the specified encoding.
Returns {:ok, binary} on success, or {:error, reason} on failure.
Unmappable characters are replaced with a suitable fallback character.
Automatically uses dirty CPU schedulers for strings larger than the
configured threshold (see dirty_threshold/0).
Examples
iex> EncodingRs.encode("Hello", "windows-1252")
{:ok, "Hello"}
iex> EncodingRs.encode("Hello", "invalid-encoding")
{:error, :unknown_encoding}
Encodes a UTF-8 string to the specified encoding.
Returns the encoded binary on success, or raises an ArgumentError on failure.
Examples
iex> EncodingRs.encode!("Hello", "windows-1252")
"Hello"
iex> EncodingRs.encode!("Hello", "invalid-encoding")
** (ArgumentError) unknown encoding: invalid-encoding
@spec encode_batch([encode_batch_item()]) :: [batch_result(binary())]
Encodes multiple strings in a single NIF call.
This is more efficient than calling encode/2 repeatedly when processing
many items, as it amortizes the NIF dispatch overhead.
Results are returned in the same order as the input items.
Note: Batch operations always use dirty CPU schedulers, regardless of input size. See the Batch Processing Guide for details.
Arguments
- items - List of {string, encoding} tuples to encode
Returns
List of {:ok, binary}, {:error, :unknown_encoding}, or
{:error, :input_too_large} tuples.
Examples
iex> items = [{"Hello", "windows-1252"}, {"あ", "shift_jis"}]
iex> EncodingRs.encode_batch(items)
[{:ok, "Hello"}, {:ok, <<130, 160>>}]
iex> EncodingRs.encode_batch([{"test", "invalid-encoding"}])
[{:error, :unknown_encoding}]
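When every string targets the same encoding, the item list can be built with a single map before the batch call. `BatchEncode` and `encode_all/2` below are hypothetical names; only `EncodingRs.encode_batch/1` is taken from the library.

```elixir
defmodule BatchEncode do
  # Hypothetical helper: encodes a list of strings to one target encoding
  # using a single batch call.
  def encode_all(strings, encoding) when is_list(strings) do
    strings
    |> Enum.map(fn string -> {string, encoding} end)
    |> EncodingRs.encode_batch()
  end
end
```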
Checks if an encoding label is valid and supported.
Examples
iex> EncodingRs.encoding_exists?("utf-8")
true
iex> EncodingRs.encoding_exists?("UTF-8")
true
iex> EncodingRs.encoding_exists?("not-an-encoding")
false
@spec list_encodings() :: [encoding()]
Returns a list of all supported encoding names.
Examples
iex> "UTF-8" in EncodingRs.list_encodings()
true
iex> "Shift_JIS" in EncodingRs.list_encodings()
true
@spec max_input_size() :: non_neg_integer() | :infinity
Returns the maximum input size (in bytes) allowed for encoding/decoding operations.
Inputs larger than this limit will return {:error, :input_too_large} instead
of being passed to the NIF. This prevents excessive memory allocation from
untrusted or unexpectedly large inputs.
This value is read at runtime via Application.get_env/3, so it can be
changed in runtime.exs or dynamically with Application.put_env/3 without
recompiling the library.
Configure in your config.exs or runtime.exs:
config :encoding_rs, max_input_size: 200 * 1024 * 1024
The default is 100MB (104,857,600 bytes).
Set to :infinity to disable the size limit entirely. This is appropriate for
trusted environments where inputs are known to be safe, but should be avoided
when processing untrusted data — a large input can cause memory amplification
of up to 3x in the NIF (input buffer + output buffer + BEAM binary copy).
# Disable size limit (trusted inputs only)
config :encoding_rs, max_input_size: :infinity
The value must be a non-negative integer or :infinity. Invalid values
(e.g., strings, negative numbers) will raise an ArgumentError on first use.
Examples
iex> EncodingRs.max_input_size()
104857600
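Because the limit is read at runtime, it can also be adjusted from a running system, for example in a test. The sketch below temporarily lowers the limit with Application.put_env/3 and restores it afterwards; the 1024-byte limit and 2048-byte input are arbitrary values chosen for illustration.

```elixir
# Remember the current limit so it can be restored afterwards.
previous = EncodingRs.max_input_size()

# Temporarily lower the limit to 1 KB (arbitrary value for this sketch).
Application.put_env(:encoding_rs, :max_input_size, 1024)

result =
  try do
    # A 2 KB input exceeds the temporary limit, so per the description
    # above this call should be rejected with :input_too_large.
    EncodingRs.decode(:binary.copy("a", 2048), "utf-8")
  after
    # Restore the original limit even if the call raises.
    Application.put_env(:encoding_rs, :max_input_size, previous)
  end
```

Application environment changes are global to the node, so this pattern is best kept to tests or controlled maintenance tasks rather than concurrent request handling.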