EncodingRs (encoding_rs v0.2.2)


High-performance string encoding/decoding using Rust's encoding_rs crate.

This library provides fast character encoding conversion using the same encoding library that powers Firefox. It supports all encodings in the WHATWG Encoding Standard.

Features

  • High performance: Uses encoding_rs, the same library used by Firefox
  • Dirty schedulers: Binaries larger than a configurable threshold (default 64 KB) automatically run on dirty CPU schedulers to avoid blocking the BEAM
  • Safe error handling: Returns {:ok, result} or {:error, reason} tuples
  • WHATWG compliant: Supports all encodings from the WHATWG Encoding Standard

Configuration

The dirty scheduler threshold sets the input size, in bytes, above which encode/decode operations are moved to dirty CPU schedulers. The BEAM has a limited number of normal schedulers, and long-running NIFs can block them, causing latency for other processes. Offloading large encoding/decoding operations to dirty schedulers keeps the normal schedulers available for other work.

Configure in your config.exs:

# Using multiplication for readability
config :encoding_rs, dirty_threshold: 128 * 1024

# Or using Elixir's underscore notation
config :encoding_rs, dirty_threshold: 131_072

The default is 64KB (65,536 bytes).

Increasing the threshold reduces context switching overhead, which benefits batch processing and throughput-focused workloads. However, larger operations will block normal schedulers longer, potentially causing latency for other processes.

Decreasing the threshold keeps normal schedulers more available, which benefits latency-sensitive and high-concurrency applications. However, more frequent context switching adds overhead that may reduce throughput.
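The threshold rule described above can be sketched in plain Elixir. This is an illustration only: ThresholdSketch and scheduler_for/2 are hypothetical names, and the strict "larger than" comparison is an assumption based on the wording of this page, not a copy of the library's internals.

```elixir
defmodule ThresholdSketch do
  # Hypothetical helper, not part of EncodingRs.
  # Mirrors the documented default of 64 KB (65,536 bytes).
  @default_threshold 64 * 1024

  # Returns which scheduler type an input of this size would be expected to
  # use, assuming a strict "larger than threshold" comparison.
  def scheduler_for(input, threshold \\ @default_threshold)
      when is_binary(input) and is_integer(threshold) and threshold >= 0 do
    if byte_size(input) > threshold, do: :dirty_cpu, else: :normal
  end
end
```

Under these assumptions, a 100,000-byte binary would take the dirty path with the default threshold, but would stay on a normal scheduler with dirty_threshold: 128 * 1024.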

Supported Encodings

  • UTF-8, UTF-16LE, UTF-16BE
  • Windows code pages: 874, 1250-1258, 949, 932
  • ISO-8859 family: 2, 3, 4, 5, 6, 7, 8, 8-I, 10, 13, 14, 15, 16
  • IBM866
  • KOI8-R, KOI8-U
  • macintosh, x-mac-cyrillic
  • Asian encodings: Shift_JIS, EUC-JP, ISO-2022-JP, EUC-KR, GBK, GB18030, Big5
  • x-user-defined

Examples

iex> EncodingRs.encode("Hello", "windows-1252")
{:ok, "Hello"}

iex> EncodingRs.decode(<<72, 101, 108, 108, 111>>, "windows-1252")
{:ok, "Hello"}

iex> EncodingRs.encode!("¥₪ש", "windows-1255")
<<165, 164, 249>>

iex> EncodingRs.decode!(<<165, 164, 249>>, "windows-1255")
"¥₪ש"

iex> EncodingRs.encoding_exists?("utf-8")
true

iex> EncodingRs.encoding_exists?("not-an-encoding")
false

Summary

Types

  • batch_result(t) - Result from batch operations

  • bom_result() - Result of BOM detection: encoding name and BOM length in bytes.

  • decode_batch_item() - Input item for batch decoding: {binary, encoding}

  • encode_batch_item() - Input item for batch encoding: {string, encoding}

  • encoding() - An encoding label string (e.g., "utf-8", "shift_jis", "windows-1252").

  • error_reason() - Error reason atoms returned by encoding/decoding functions.

Functions

  • canonical_name(encoding) - Returns the canonical name for an encoding label.

  • decode(binary, encoding) - Decodes a binary from the specified encoding to a UTF-8 string.

  • decode!(binary, encoding) - Decodes a binary from the specified encoding to a UTF-8 string.

  • decode_batch(items) - Decodes multiple binaries in a single NIF call.

  • detect_and_strip_bom(data) - Detects the encoding from a BOM and strips it from the data.

  • detect_bom(data) - Detects the encoding from a Byte Order Mark (BOM) at the start of the data.

  • dirty_threshold() - Returns the threshold (in bytes) above which dirty schedulers are used.

  • encode(string, encoding) - Encodes a UTF-8 string to the specified encoding.

  • encode!(string, encoding) - Encodes a UTF-8 string to the specified encoding.

  • encode_batch(items) - Encodes multiple strings in a single NIF call.

  • encoding_exists?(encoding) - Checks if an encoding label is valid and supported.

  • list_encodings() - Returns a list of all supported encoding names.

Types

batch_result(t)

@type batch_result(t) :: {:ok, t} | {:error, :unknown_encoding}

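Result for a single item in a batch operation: {:ok, value} on success, or {:error, :unknown_encoding} if that item's encoding label is not recognized.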

bom_result()

@type bom_result() ::
  {:ok, encoding(), bom_length :: non_neg_integer()} | {:error, :no_bom}

Result of BOM detection: encoding name and BOM length in bytes.

decode_batch_item()

@type decode_batch_item() :: {binary(), encoding()}

Input item for batch decoding: {binary, encoding}

encode_batch_item()

@type encode_batch_item() :: {String.t(), encoding()}

Input item for batch encoding: {string, encoding}

encoding()

@type encoding() :: String.t()

An encoding label string (e.g., "utf-8", "shift_jis", "windows-1252").

See list_encodings/0 for all supported encodings, or check the WHATWG Encoding Standard.

error_reason()

@type error_reason() :: :unknown_encoding | :no_bom

Error reason atoms. :unknown_encoding is returned by the encoding/decoding functions and canonical_name/1; :no_bom is returned by the BOM detection functions.

Functions

canonical_name(encoding)

@spec canonical_name(encoding()) :: {:ok, encoding()} | {:error, :unknown_encoding}

Returns the canonical name for an encoding label.

Many encodings have several labels (aliases). For example, "latin1", "iso-8859-1", and "iso_8859-1" all refer to windows-1252. This function returns the canonical WHATWG name for any valid label.

Examples

iex> EncodingRs.canonical_name("latin1")
{:ok, "windows-1252"}

iex> EncodingRs.canonical_name("utf8")
{:ok, "UTF-8"}

iex> EncodingRs.canonical_name("invalid")
{:error, :unknown_encoding}
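To make the alias-to-canonical-name idea concrete, here is a minimal sketch of a label lookup. It is not how EncodingRs is implemented: LabelSketch is a hypothetical module, the table is a tiny illustrative subset of the WHATWG label list, and String.trim/1 and String.downcase/1 are Unicode-aware simplifications of the standard's ASCII-only whitespace stripping and case folding.

```elixir
defmodule LabelSketch do
  # Illustrative subset only; the WHATWG label table is much larger.
  @labels %{
    "latin1" => "windows-1252",
    "iso-8859-1" => "windows-1252",
    "iso_8859-1" => "windows-1252",
    "utf8" => "UTF-8",
    "utf-8" => "UTF-8"
  }

  # Normalizes the label, then looks it up in the alias table.
  def canonical_name(label) when is_binary(label) do
    key = label |> String.trim() |> String.downcase()

    case Map.fetch(@labels, key) do
      {:ok, name} -> {:ok, name}
      :error -> {:error, :unknown_encoding}
    end
  end
end
```

The return shape matches the spec above, so a caller can pattern match on {:ok, name} and {:error, :unknown_encoding} in the same way as with the real function.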

decode(binary, encoding)

@spec decode(binary(), encoding()) :: {:ok, String.t()} | {:error, :unknown_encoding}

Decodes a binary from the specified encoding to a UTF-8 string.

Returns {:ok, string} on success, or {:error, :unknown_encoding} if the encoding label is not recognized. Malformed or unmappable byte sequences do not cause an error; they are replaced with the Unicode replacement character (U+FFFD).

Automatically uses dirty CPU schedulers for binaries larger than the configured threshold (see dirty_threshold/0).

Examples

iex> EncodingRs.decode(<<72, 101, 108, 108, 111>>, "windows-1252")
{:ok, "Hello"}

iex> EncodingRs.decode(<<0xFF>>, "invalid-encoding")
{:error, :unknown_encoding}
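Because bad input bytes become U+FFFD instead of producing an error, a caller who cares about lossy decoding may want to inspect the result. The sketch below is one possible approach; DecodeCheck and classify/1 are hypothetical names, and the check cannot distinguish a replacement inserted by the decoder from a U+FFFD that was genuinely present in the source text.

```elixir
defmodule DecodeCheck do
  # Hypothetical helper, not part of EncodingRs.
  @replacement "\uFFFD"

  # Takes a result tuple shaped like the return value of decode/2 and
  # flags strings that contain the Unicode replacement character.
  def classify({:ok, string}) when is_binary(string) do
    if String.contains?(string, @replacement) do
      {:lossy, string}
    else
      {:clean, string}
    end
  end

  def classify({:error, :unknown_encoding}), do: :unknown_encoding
end
```

A typical use would be to pipe the decode result into it, for example EncodingRs.decode(data, "shift_jis") |> DecodeCheck.classify().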

decode!(binary, encoding)

@spec decode!(binary(), encoding()) :: String.t()

Decodes a binary from the specified encoding to a UTF-8 string.

Returns the decoded string on success, or raises an ArgumentError if the encoding label is not recognized.

Examples

iex> EncodingRs.decode!(<<72, 101, 108, 108, 111>>, "windows-1252")
"Hello"

iex> EncodingRs.decode!(<<0xFF>>, "invalid-encoding")
** (ArgumentError) unknown encoding: invalid-encoding
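The relationship between the tuple-returning and raising variants can be sketched as a small unwrapping step. This is an assumption about the general pattern, not the library's source: BangSketch and unwrap!/2 are hypothetical names, and the error message is modeled on the example output above.

```elixir
defmodule BangSketch do
  # Hypothetical helper, not part of EncodingRs.
  # Turns a decode/2-style result into a bare value or an exception.
  def unwrap!({:ok, value}, _encoding), do: value

  def unwrap!({:error, :unknown_encoding}, encoding) do
    raise ArgumentError, "unknown encoding: #{encoding}"
  end
end
```

The raising form suits code where an unknown encoding is a programming error; the tuple form suits code where the label comes from untrusted input.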

decode_batch(items)

@spec decode_batch([decode_batch_item()]) :: [batch_result(String.t())]

Decodes multiple binaries in a single NIF call.

This is more efficient than calling decode/2 repeatedly when processing many items, as it amortizes the NIF dispatch overhead.

Results are returned in the same order as the input items.

Note: Batch operations always use dirty CPU schedulers, regardless of input size. See the Batch Processing Guide for details.

Arguments

  • items - List of {binary, encoding} tuples to decode

Returns

A list containing one {:ok, string} or {:error, :unknown_encoding} tuple per input item, in input order.

Examples

iex> items = [{<<72, 101, 108, 108, 111>>, "windows-1252"}, {<<0x82, 0xA0>>, "shift_jis"}]
iex> EncodingRs.decode_batch(items)
[{:ok, "Hello"}, {:ok, "あ"}]

iex> EncodingRs.decode_batch([{<<72>>, "invalid-encoding"}])
[{:error, :unknown_encoding}]
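Since results come back in input order, they can be zipped with the original items to find out which inputs failed. The following is a minimal sketch; BatchSketch and split/2 are hypothetical names, and the results list is expected to have the shape returned by decode_batch/1.

```elixir
defmodule BatchSketch do
  # Hypothetical helper, not part of EncodingRs.
  # Pairs each input item with its result, then separates successes
  # from failures while remembering which item failed.
  def split(items, results) when length(items) == length(results) do
    {oks, errors} =
      items
      |> Enum.zip(results)
      |> Enum.split_with(fn {_item, result} -> match?({:ok, _}, result) end)

    decoded = Enum.map(oks, fn {_item, {:ok, value}} -> value end)
    failed = Enum.map(errors, fn {item, {:error, reason}} -> {item, reason} end)

    {decoded, failed}
  end
end
```

In practice the second argument would be EncodingRs.decode_batch(items); the failed list then tells the caller which {binary, encoding} pairs used an unrecognized label.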

detect_and_strip_bom(data)

@spec detect_and_strip_bom(binary()) ::
  {:ok, encoding(), binary()} | {:error, :no_bom}

Detects the encoding from a BOM and strips it from the data.

Convenience function that combines BOM detection with stripping the BOM from the input data. Useful when you want to both detect the encoding and get the data without the BOM prefix.

Returns

  • {:ok, encoding, data_without_bom} - BOM detected and stripped
  • {:error, :no_bom} - No BOM found, data unchanged

Examples

iex> EncodingRs.detect_and_strip_bom(<<0xEF, 0xBB, 0xBF, "hello">>)
{:ok, "UTF-8", "hello"}

iex> EncodingRs.detect_and_strip_bom("hello")
{:error, :no_bom}

detect_bom(data)

@spec detect_bom(binary()) :: bom_result()

Detects the encoding from a Byte Order Mark (BOM) at the start of the data.

BOMs are special byte sequences at the beginning of a file that indicate the encoding. This function checks the first few bytes of the input and returns the detected encoding if a BOM is found.

Supported BOMs:

  • UTF-8: <<0xEF, 0xBB, 0xBF>> (3 bytes)
  • UTF-16LE: <<0xFF, 0xFE>> (2 bytes)
  • UTF-16BE: <<0xFE, 0xFF>> (2 bytes)

Returns

  • {:ok, encoding, bom_length} - BOM detected, returns encoding name and BOM size
  • {:error, :no_bom} - No BOM found at the start of the data

Examples

iex> EncodingRs.detect_bom(<<0xEF, 0xBB, 0xBF, "hello">>)
{:ok, "UTF-8", 3}

iex> EncodingRs.detect_bom(<<0xFF, 0xFE, 0x48, 0x00>>)
{:ok, "UTF-16LE", 2}

iex> EncodingRs.detect_bom(<<0xFE, 0xFF, 0x00, 0x48>>)
{:ok, "UTF-16BE", 2}

iex> EncodingRs.detect_bom("hello")
{:error, :no_bom}

iex> EncodingRs.detect_bom(<<>>)
{:error, :no_bom}
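The BOM table above maps directly onto Elixir binary pattern matching. The sketch below reimplements the documented behavior of detect_bom/1 and detect_and_strip_bom/1 in plain Elixir for illustration; BomSketch is a hypothetical module and is not how the library itself is implemented.

```elixir
defmodule BomSketch do
  # Hypothetical module, not part of EncodingRs.
  # Each clause matches one of the BOM byte sequences listed above.
  def detect(<<0xEF, 0xBB, 0xBF, _rest::binary>>), do: {:ok, "UTF-8", 3}
  def detect(<<0xFF, 0xFE, _rest::binary>>), do: {:ok, "UTF-16LE", 2}
  def detect(<<0xFE, 0xFF, _rest::binary>>), do: {:ok, "UTF-16BE", 2}
  def detect(data) when is_binary(data), do: {:error, :no_bom}

  # Uses the reported BOM length to drop the prefix from the data.
  def strip(data) when is_binary(data) do
    case detect(data) do
      {:ok, encoding, bom_length} ->
        rest = binary_part(data, bom_length, byte_size(data) - bom_length)
        {:ok, encoding, rest}

      {:error, :no_bom} = error ->
        error
    end
  end
end
```

The strip/1 function shows why the BOM length is part of the result: it is exactly the number of leading bytes to skip before decoding.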

dirty_threshold()

@spec dirty_threshold() :: non_neg_integer()

Returns the threshold (in bytes) above which dirty schedulers are used.

Encode/decode operations on binaries larger than this threshold will automatically use dirty CPU schedulers to avoid blocking the BEAM's normal schedulers. This prevents long-running encoding operations from causing latency for other processes.

This value can be configured in your config.exs:

# Using multiplication for readability
config :encoding_rs, dirty_threshold: 128 * 1024

# Or using Elixir's underscore notation
config :encoding_rs, dirty_threshold: 131_072

The default is 64KB (65,536 bytes).

Examples

iex> EncodingRs.dirty_threshold()
65536

encode(string, encoding)

@spec encode(String.t(), encoding()) :: {:ok, binary()} | {:error, :unknown_encoding}

Encodes a UTF-8 string to the specified encoding.

Returns {:ok, binary} on success, or {:error, :unknown_encoding} if the encoding label is not recognized. Characters that cannot be represented in the target encoding do not cause an error; they are replaced with a suitable fallback.

Automatically uses dirty CPU schedulers for strings larger than the configured threshold (see dirty_threshold/0).

Examples

iex> EncodingRs.encode("Hello", "windows-1252")
{:ok, "Hello"}

iex> EncodingRs.encode("Hello", "invalid-encoding")
{:error, :unknown_encoding}

encode!(string, encoding)

@spec encode!(String.t(), encoding()) :: binary()

Encodes a UTF-8 string to the specified encoding.

Returns the encoded binary on success, or raises an ArgumentError if the encoding label is not recognized.

Examples

iex> EncodingRs.encode!("Hello", "windows-1252")
"Hello"

iex> EncodingRs.encode!("Hello", "invalid-encoding")
** (ArgumentError) unknown encoding: invalid-encoding

encode_batch(items)

@spec encode_batch([encode_batch_item()]) :: [batch_result(binary())]

Encodes multiple strings in a single NIF call.

This is more efficient than calling encode/2 repeatedly when processing many items, as it amortizes the NIF dispatch overhead.

Results are returned in the same order as the input items.

Note: Batch operations always use dirty CPU schedulers, regardless of input size. See the Batch Processing Guide for details.

Arguments

  • items - List of {string, encoding} tuples to encode

Returns

A list containing one {:ok, binary} or {:error, :unknown_encoding} tuple per input item, in input order.

Examples

iex> items = [{"Hello", "windows-1252"}, {"あ", "shift_jis"}]
iex> EncodingRs.encode_batch(items)
[{:ok, "Hello"}, {:ok, <<130, 160>>}]

iex> EncodingRs.encode_batch([{"test", "invalid-encoding"}])
[{:error, :unknown_encoding}]
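When many strings share one target encoding, the batch input can be built with a simple map, and very large lists can be split into several smaller batches. The sketch below shows one way to prepare such input; EncodeBatchSketch, build_batches/3, and the default batch size of 500 are all hypothetical choices, not recommendations from the library.

```elixir
defmodule EncodeBatchSketch do
  # Hypothetical helper, not part of EncodingRs.
  # Pairs every string with the same encoding label, then groups the
  # resulting {string, encoding} tuples into lists of at most batch_size.
  def build_batches(strings, encoding, batch_size \\ 500)
      when is_list(strings) and is_binary(encoding) and
             is_integer(batch_size) and batch_size > 0 do
    strings
    |> Enum.map(fn string -> {string, encoding} end)
    |> Enum.chunk_every(batch_size)
  end
end
```

Each inner list has the shape expected by encode_batch/1, so the batches could be processed with Enum.flat_map(batches, &EncodingRs.encode_batch/1), which also preserves the original order.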

encoding_exists?(encoding)

@spec encoding_exists?(encoding()) :: boolean()

Checks if an encoding label is valid and supported.

Examples

iex> EncodingRs.encoding_exists?("utf-8")
true

iex> EncodingRs.encoding_exists?("UTF-8")
true

iex> EncodingRs.encoding_exists?("not-an-encoding")
false

list_encodings()

@spec list_encodings() :: [encoding()]

Returns a list of all supported encoding names.

Examples

iex> "UTF-8" in EncodingRs.list_encodings()
true

iex> "Shift_JIS" in EncodingRs.list_encodings()
true