Batch Processing Guide

This guide covers the batch API for encoding and decoding multiple items in a single NIF call.

When to Use Batch Operations

Batch operations are useful when you need to process many separate strings or binaries:

  • Decoding/encoding rows from a database
  • Processing lists of filenames or paths
  • Converting multiple user inputs
  • Data migration tasks

For streaming a single large file, use EncodingRs.Decoder instead (see the Streaming Guide).

The Problem

Each NIF call has overhead: scheduler context switching, argument marshalling, and result conversion. When processing many small items, this per-call overhead can dominate the actual conversion work:

# Inefficient: 1000 NIF calls
items
|> Enum.map(fn {data, encoding} ->
  EncodingRs.decode(data, encoding)
end)

The Solution

Batch operations process all items in a single NIF call, amortizing the dispatch overhead:

# Efficient: 1 NIF call
EncodingRs.decode_batch(items)
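
To see the difference on your own data, a rough comparison with :timer.tc looks like this (not a rigorous benchmark; assumes items is a list of {binary, encoding} tuples like those shown under Usage below):

# Rough timing in microseconds; warm up the NIF before measuring for real
{us_individual, _} = :timer.tc(fn ->
  Enum.map(items, fn {data, encoding} -> EncodingRs.decode(data, encoding) end)
end)

{us_batch, _} = :timer.tc(fn -> EncodingRs.decode_batch(items) end)

IO.puts("individual: #{us_individual}µs, batch: #{us_batch}µs")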

Usage

Decoding Multiple Binaries

items = [
  {<<72, 101, 108, 108, 111>>, "windows-1252"},
  {<<0x82, 0xA0>>, "shift_jis"},
  {<<0xC4, 0xE3, 0xBA, 0xC3>>, "gbk"}
]

results = EncodingRs.decode_batch(items)
# => [{:ok, "Hello"}, {:ok, "あ"}, {:ok, "你好"}]

Encoding Multiple Strings

items = [
  {"Hello", "windows-1252"},
  {"あ", "shift_jis"},
  {"你好", "gbk"}
]

results = EncodingRs.encode_batch(items)
# => [{:ok, <<72, 101, 108, 108, 111>>}, {:ok, <<130, 160>>}, {:ok, <<196, 227, 186, 195>>}]

Handling Errors

Results are returned in the same order as input. Check each result individually:

items = [
  {"Hello", "windows-1252"},
  {"Test", "invalid-encoding"},
  {"World", "utf-8"}
]

results = EncodingRs.encode_batch(items)
# => [{:ok, "Hello"}, {:error, :unknown_encoding}, {:ok, "World"}]

# Process results
Enum.zip(items, results)
|> Enum.each(fn {{input, encoding}, result} ->
  case result do
    {:ok, encoded} ->
      IO.puts("Encoded #{inspect(input)} to #{encoding}")
    {:error, reason} ->
      IO.puts("Failed to encode #{inspect(input)}: #{reason}")
  end
end)
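
If you only need to separate successes from failures, Enum.split_with is a convenient alternative (a small sketch using the same items and results as above):

{oks, errors} =
  items
  |> Enum.zip(results)
  |> Enum.split_with(fn {_item, result} -> match?({:ok, _}, result) end)

# Both lists still carry the original inputs, which is useful for
# retries or logging the failing items
encoded = Enum.map(oks, fn {_item, {:ok, binary}} -> binary end)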

Mixed Encodings

Batch operations support different encodings per item:

# Database rows with encoding metadata
rows = [
  %{content: <<...>>, encoding: "shift_jis", id: 1},
  %{content: <<...>>, encoding: "gbk", id: 2},
  %{content: <<...>>, encoding: "windows-1252", id: 3}
]

items = Enum.map(rows, &{&1.content, &1.encoding})
results = EncodingRs.decode_batch(items)

# Combine results back with original data
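# (the clause below assumes every decode succeeded; see the
# error-tolerant variant after this example)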
Enum.zip(rows, results)
|> Enum.map(fn {row, {:ok, decoded}} ->
  Map.put(row, :content_utf8, decoded)
end)
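
The {:ok, decoded} pattern above raises a FunctionClauseError if any row fails to decode. A more defensive variant keeps failed rows alongside their error reason (a sketch; adapt the map keys to your schema):

Enum.zip(rows, results)
|> Enum.map(fn
  {row, {:ok, decoded}} -> Map.put(row, :content_utf8, decoded)
  {row, {:error, reason}} -> Map.put(row, :decode_error, reason)
end)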

Dirty Scheduler Behavior

Batch operations always use dirty CPU schedulers, regardless of input size or item count.

Rationale

Batch operations are typically used for throughput-focused workloads where:

  1. Total work is significant - Even if individual items are small, processing many items adds up
  2. Predictability matters - Consistent dirty scheduler usage avoids variable latency
  3. Simplicity - No threshold logic to tune or understand

Trade-offs

| Aspect          | Batch (always dirty)                 | Single-item (threshold-based) |
| --------------- | ------------------------------------ | ----------------------------- |
| Small workloads | Slight overhead from dirty scheduler | Uses normal scheduler         |
| Large workloads | Optimal                              | Optimal                       |
| Latency         | Consistent                           | Variable based on size        |
| Complexity      | Simple                               | Requires threshold tuning     |

When This Matters

For most use cases, always using dirty schedulers is the right choice. The overhead is minimal and the behavior is predictable.

If you have a latency-sensitive application processing very small batches (< 10 items, each < 1KB), you may see slightly better latency using individual decode/2 or encode/2 calls, which respect the configured dirty threshold.
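
If that trade-off matters, you can route tiny batches to the single-item API yourself. A minimal sketch (MyApp.Decoding is a hypothetical module, and the thresholds are illustrative, not tuned):

defmodule MyApp.Decoding do
  # Send tiny batches through decode/2, which honors the configured
  # dirty threshold; everything else goes through the batch NIF.
  def decode_adaptive(items) do
    small? =
      length(items) < 10 and
        Enum.all?(items, fn {data, _enc} -> byte_size(data) < 1024 end)

    if small? do
      Enum.map(items, fn {data, enc} -> EncodingRs.decode(data, enc) end)
    else
      EncodingRs.decode_batch(items)
    end
  end
end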

Known Limitations

No Batch Streaming

The batch API is for one-shot processing of complete binaries only. It does not support stateful streaming decoding, where a multi-byte character may be split across chunk boundaries.

For streaming use cases, use EncodingRs.Decoder which maintains state between chunks. However, each decoder handles a single stream - there is currently no way to batch process chunks from multiple streams in a single NIF call.

If you need to process multiple streams concurrently, create separate EncodingRs.Decoder instances for each stream.
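
A sketch of that approach (this assumes EncodingRs.Decoder.stream/2 takes a chunk enumerable and an encoding name - check the Streaming Guide for the actual signature; chunks_a and chunks_b stand in for your real chunk sources):

# One decoder-backed stream per source, processed concurrently
[{chunks_a, "shift_jis"}, {chunks_b, "gbk"}]
|> Task.async_stream(fn {chunks, encoding} ->
  chunks
  |> EncodingRs.Decoder.stream(encoding)
  |> Enum.to_list()
end)
|> Enum.map(fn {:ok, decoded} -> decoded end)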

Future Options

The following options may be added in future versions based on user feedback:

  • Batch streaming - Process chunks from multiple decoders in a single NIF call
  • Threshold-based routing - Check total bytes and route to normal/dirty scheduler
  • Item count threshold - Use dirty scheduler only above N items
  • Explicit scheduler choice - decode_batch/2 with options like [scheduler: :normal]

If you have a use case that would benefit from these options, please open an issue.

Performance Tips

  1. Batch similar-sized items - Helps with memory allocation efficiency

  2. Reasonable batch sizes - Batches of 100-10,000 items work well. Extremely large batches (100K+) may cause memory pressure.

  3. Consider chunking very large lists:

    large_list
    |> Enum.chunk_every(1000)
    |> Enum.flat_map(&EncodingRs.decode_batch/1)

  4. Parallel batches - For very large workloads, split across processes:

    items
    |> Enum.chunk_every(1000)
    |> Task.async_stream(&EncodingRs.decode_batch/1, max_concurrency: 4)
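    # async_stream preserves input order by default (ordered: true),
    # so the flattened results stay aligned with items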
    |> Enum.flat_map(fn {:ok, results} -> results end)

Comparison: Batch vs Streaming vs One-Shot

| Scenario            | Best Approach               |
| ------------------- | --------------------------- |
| Single small binary | EncodingRs.decode/2         |
| Single large file   | EncodingRs.Decoder.stream/2 |
| Many separate items | EncodingRs.decode_batch/1   |
| Network stream      | EncodingRs.Decoder          |
| Database rows       | EncodingRs.decode_batch/1   |