Batch Processing Guide

This guide covers the batch API for encoding and decoding multiple items in a single NIF call.

When to Use Batch Operations

Batch operations are useful when you need to process many separate strings or binaries:

  • Decoding/encoding rows from a database
  • Processing lists of filenames or paths
  • Converting multiple user inputs
  • Data migration tasks

For streaming a single large file, use EncodingRs.Decoder instead (see the Streaming Guide).

The Problem

Each NIF call has overhead: scheduler context switching, argument marshalling, and result conversion. When processing many small items, this per-call overhead can dominate the actual conversion work:

# Inefficient: 1000 NIF calls
items
|> Enum.map(fn {data, encoding} ->
  EncodingRs.decode(data, encoding)
end)

The Solution

Batch operations process all items in a single NIF call, amortizing the dispatch overhead:

# Efficient: 1 NIF call
EncodingRs.decode_batch(items)
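
To see the difference on your own data, a rough comparison with :timer.tc looks like this (not a rigorous benchmark; assumes items is a list of {binary, encoding} tuples like those shown under Usage below):

# Rough timing in microseconds; warm up the NIF before measuring for real
{us_individual, _} = :timer.tc(fn ->
  Enum.map(items, fn {data, encoding} -> EncodingRs.decode(data, encoding) end)
end)

{us_batch, _} = :timer.tc(fn -> EncodingRs.decode_batch(items) end)

IO.puts("individual: #{us_individual}µs, batch: #{us_batch}µs")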

Usage

Decoding Multiple Binaries

items = [
  {<<72, 101, 108, 108, 111>>, "windows-1252"},
  {<<0x82, 0xA0>>, "shift_jis"},
  {<<0xC4, 0xE3, 0xBA, 0xC3>>, "gbk"}
]

results = EncodingRs.decode_batch(items)
# => [{:ok, "Hello"}, {:ok, "あ"}, {:ok, "你好"}]

Encoding Multiple Strings

items = [
  {"Hello", "windows-1252"},
  {"あ", "shift_jis"},
  {"你好", "gbk"}
]

results = EncodingRs.encode_batch(items)
# => [{:ok, <<72, 101, 108, 108, 111>>}, {:ok, <<130, 160>>}, {:ok, <<196, 227, 186, 195>>}]

Handling Errors

Results are returned in the same order as input. Check each result individually:

items = [
  {"Hello", "windows-1252"},
  {"Test", "invalid-encoding"},
  {"World", "utf-8"}
]

results = EncodingRs.encode_batch(items)
# => [{:ok, "Hello"}, {:error, :unknown_encoding}, {:ok, "World"}]

# Process results
Enum.zip(items, results)
|> Enum.each(fn {{input, encoding}, result} ->
  case result do
    {:ok, encoded} ->
      IO.puts("Encoded #{inspect(input)} to #{encoding}")
    {:error, reason} ->
      IO.puts("Failed to encode #{inspect(input)}: #{reason}")
  end
end)
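
If you only need to separate successes from failures, Enum.split_with is a convenient alternative (a small sketch using the same items and results as above):

{oks, errors} =
  items
  |> Enum.zip(results)
  |> Enum.split_with(fn {_item, result} -> match?({:ok, _}, result) end)

# Both lists still carry the original inputs, which is useful for
# retries or logging the failing items
encoded = Enum.map(oks, fn {_item, {:ok, binary}} -> binary end)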

Mixed Encodings

Batch operations support different encodings per item:

# Database rows with encoding metadata
rows = [
  %{content: <<...>>, encoding: "shift_jis", id: 1},
  %{content: <<...>>, encoding: "gbk", id: 2},
  %{content: <<...>>, encoding: "windows-1252", id: 3}
]

items = Enum.map(rows, &{&1.content, &1.encoding})
results = EncodingRs.decode_batch(items)

# Combine results back with original data
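# (the clause below assumes every decode succeeded; see the
# error-tolerant variant after this example)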
Enum.zip(rows, results)
|> Enum.map(fn {row, {:ok, decoded}} ->
  Map.put(row, :content_utf8, decoded)
end)
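
The {:ok, decoded} pattern above raises a FunctionClauseError if any row fails to decode. A more defensive variant keeps failed rows alongside their error reason (a sketch; adapt the map keys to your schema):

Enum.zip(rows, results)
|> Enum.map(fn
  {row, {:ok, decoded}} -> Map.put(row, :content_utf8, decoded)
  {row, {:error, reason}} -> Map.put(row, :decode_error, reason)
end)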

Dirty Scheduler Behavior

Batch operations always use dirty CPU schedulers, regardless of input size or item count.

Rationale

Batch operations are typically used for throughput-focused workloads where:

  1. Total work is significant - Even if individual items are small, processing many items adds up
  2. Predictability matters - Consistent dirty scheduler usage avoids variable latency
  3. Simplicity - No threshold logic to tune or understand

Trade-offs

| Aspect          | Batch (always dirty)                 | Single-item (threshold-based) |
| --------------- | ------------------------------------ | ----------------------------- |
| Small workloads | Slight overhead from dirty scheduler | Uses normal scheduler         |
| Large workloads | Optimal                              | Optimal                       |
| Latency         | Consistent                           | Variable based on size        |
| Complexity      | Simple                               | Requires threshold tuning     |

When This Matters

For most use cases, always using dirty schedulers is the right choice. The overhead is minimal and the behavior is predictable.

If you have a latency-sensitive application processing very small batches (< 10 items, each < 1KB), you may see slightly better latency using individual decode/2 or encode/2 calls, which respect the configured dirty threshold.
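
If that trade-off matters, you can route tiny batches to the single-item API yourself. A minimal sketch (MyApp.Decoding is a hypothetical module, and the thresholds are illustrative, not tuned):

defmodule MyApp.Decoding do
  # Send tiny batches through decode/2, which honors the configured
  # dirty threshold; everything else goes through the batch NIF.
  def decode_adaptive(items) do
    small? =
      length(items) < 10 and
        Enum.all?(items, fn {data, _enc} -> byte_size(data) < 1024 end)

    if small? do
      Enum.map(items, fn {data, enc} -> EncodingRs.decode(data, enc) end)
    else
      EncodingRs.decode_batch(items)
    end
  end
end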

Known Limitations

No Batch Streaming

The batch API is for one-shot processing of complete binaries only. It does not support stateful streaming decoding, where a multi-byte character may be split across chunk boundaries.

For streaming use cases, use EncodingRs.Decoder which maintains state between chunks. However, each decoder handles a single stream - there is currently no way to batch process chunks from multiple streams in a single NIF call.

If you need to process multiple streams concurrently, create separate EncodingRs.Decoder instances for each stream.
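
A sketch of that approach (this assumes EncodingRs.Decoder.stream/2 takes a chunk enumerable and an encoding name - check the Streaming Guide for the actual signature; chunks_a and chunks_b stand in for your real chunk sources):

# One decoder-backed stream per source, processed concurrently
[{chunks_a, "shift_jis"}, {chunks_b, "gbk"}]
|> Task.async_stream(fn {chunks, encoding} ->
  chunks
  |> EncodingRs.Decoder.stream(encoding)
  |> Enum.to_list()
end)
|> Enum.map(fn {:ok, decoded} -> decoded end)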

Future Options

The following options may be added in future versions based on user feedback:

  • Batch streaming - Process chunks from multiple decoders in a single NIF call
  • Threshold-based routing - Check total bytes and route to normal/dirty scheduler
  • Item count threshold - Use dirty scheduler only above N items
  • Explicit scheduler choice - decode_batch/2 with options like [scheduler: :normal]

If you have a use case that would benefit from these options, please open an issue.

Performance Tips

  1. Batch similar-sized items - Helps with memory allocation efficiency

  2. Reasonable batch sizes - Batches of 100-10,000 items work well. Extremely large batches (100K+) may cause memory pressure.

  3. Consider chunking very large lists:

    large_list
    |> Enum.chunk_every(1000)
    |> Enum.flat_map(&EncodingRs.decode_batch/1)

  4. Parallel batches - For very large workloads, split across processes:

    items
    |> Enum.chunk_every(1000)
    |> Task.async_stream(&EncodingRs.decode_batch/1, max_concurrency: 4)
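    # async_stream preserves input order by default (ordered: true),
    # so the flattened results stay aligned with items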
    |> Enum.flat_map(fn {:ok, results} -> results end)

Comparison: Batch vs Streaming vs One-Shot

| Scenario            | Best Approach               |
| ------------------- | --------------------------- |
| Single small binary | EncodingRs.decode/2         |
| Single large file   | EncodingRs.Decoder.stream/2 |
| Many separate items | EncodingRs.decode_batch/1   |
| Network stream      | EncodingRs.Decoder          |
| Database rows       | EncodingRs.decode_batch/1   |