EncodingRs.Decoder (encoding_rs v0.2.1)
Stateful streaming decoder for converting encoded byte streams to UTF-8.
This module provides a streaming API for decoding multibyte encodings (such as Shift_JIS, GBK, Big5, and EUC-JP), where a character's bytes may be split across chunk boundaries.
Why Use Streaming Decoding?
Multibyte encodings use variable-length byte sequences to represent characters.
For example, in Shift_JIS, the character "あ" is encoded as two bytes: <<0x82, 0xA0>>.
When processing data in chunks (e.g., from File.stream!/1 or network streams),
a character's bytes may be split across chunks:
# Chunk 1 ends with first byte of "あ"
chunk1 = <<..., 0x82>>
# Chunk 2 starts with second byte of "あ"
chunk2 = <<0xA0, ...>>
The one-shot EncodingRs.decode/2 treats each chunk independently, so:
- Chunk 1's trailing 0x82 is invalid → replaced with �
- Chunk 2's leading 0xA0 is invalid → replaced with �
The streaming decoder maintains state between chunks, buffering an incomplete sequence until the next chunk completes it.
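The boundary problem is not specific to Shift_JIS. As a minimal sketch that needs only the Elixir standard library, the snippet below uses UTF-8 as a stand-in (an assumption made purely for illustration; it does not call EncodingRs). Splitting the three UTF-8 bytes of "あ" leaves two chunks that are each invalid on their own, although their concatenation is valid.

```elixir
# UTF-8 stand-in for the Shift_JIS case above; no EncodingRs calls are made.
# "あ" is three bytes in UTF-8: <<0xE3, 0x81, 0x82>>.
bytes = "あ"

# Split after the first byte, as a chunk boundary might.
<<chunk1::binary-size(1), chunk2::binary>> = bytes

# Each chunk is an incomplete sequence when checked on its own...
chunk1_ok = String.valid?(chunk1)
chunk2_ok = String.valid?(chunk2)

# ...but the same bytes are valid once they are seen together.
joined_ok = String.valid?(chunk1 <> chunk2)

IO.inspect({chunk1_ok, chunk2_ok, joined_ok})
# prints {false, false, true}
```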
Usage
Manual Chunked Decoding
{:ok, decoder} = EncodingRs.Decoder.new("shift_jis")
{:ok, output1, _} = EncodingRs.Decoder.decode_chunk(decoder, chunk1, false)
{:ok, output2, _} = EncodingRs.Decoder.decode_chunk(decoder, chunk2, false)
{:ok, output3, _} = EncodingRs.Decoder.decode_chunk(decoder, chunk3, true)
result = output1 <> output2 <> output3
Stream-Based Decoding
File.stream!("data.txt", [], 4096)
|> EncodingRs.Decoder.stream("shift_jis")
|> Enum.join()
Important Notes
- Always pass is_last: true for the final chunk to flush any buffered bytes
- The decoder resource is mutable; don't share it across concurrent processes
- For single complete binaries, use EncodingRs.decode/2 instead (more efficient)
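When the chunks are already in a list, the first note above can be enforced mechanically. The following is a sketch only: ChunkedDecodeSketch and decode_all are hypothetical names, not part of EncodingRs, and the third argument exists so the helper can be exercised with any function that has the shape of decode_chunk/3.

```elixir
defmodule ChunkedDecodeSketch do
  # Hypothetical helper, not part of EncodingRs.
  # decode_fun must behave like EncodingRs.Decoder.decode_chunk/3:
  # (decoder, chunk, is_last) -> {:ok, output, had_errors}.
  def decode_all(decoder, chunks, decode_fun \\ &EncodingRs.Decoder.decode_chunk/3) do
    last_index = length(chunks) - 1

    {parts, had_errors} =
      chunks
      |> Enum.with_index()
      |> Enum.map_reduce(false, fn {chunk, index}, errors_so_far ->
        # Only the final chunk is decoded with is_last = true.
        {:ok, output, had_errors} = decode_fun.(decoder, chunk, index == last_index)
        {output, errors_so_far or had_errors}
      end)

    # Join the per-chunk outputs and report whether any chunk had errors.
    {IO.iodata_to_binary(parts), had_errors}
  end
end
```

With a decoder from EncodingRs.Decoder.new("shift_jis"), calling ChunkedDecodeSketch.decode_all(decoder, [<<0x82>>, <<0xA0>>]) would issue the same two calls as the decode_chunk/3 example in this module. The decoder is used from a single process, in line with the second note.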
Summary
Types
decode_result() - Result of decoding a chunk: {:ok, decoded_string, had_errors}
Functions
decode_chunk/3 - Decodes a chunk of bytes using the stateful decoder.
decode_chunk!/3 - Decodes a chunk, raising on error.
new/1 - Creates a new stateful decoder for the specified encoding.
new!/1 - Creates a new stateful decoder, raising on error.
stream/2 - Creates a stream that decodes chunks from the given encoding to UTF-8.
stream_with_errors/2 - Creates a stream that decodes chunks, including error information.
Types
Functions
@spec decode_chunk(t(), binary(), boolean()) :: decode_result()
Decodes a chunk of bytes using the stateful decoder.
This function properly handles multibyte characters split across chunk boundaries by maintaining decoder state between calls.
Arguments
- decoder - The decoder reference from new/1
- chunk - The binary chunk to decode
- is_last - Set to true for the final chunk (default: false)
Returns
- {:ok, output, had_errors} on success
  - output - The decoded UTF-8 string for this chunk
  - had_errors - true if any bytes were replaced with U+FFFD
Behavior
- When is_last is false: Incomplete byte sequences at the end of the chunk are buffered internally and completed with the next chunk.
- When is_last is true: Any remaining incomplete sequences are replaced with U+FFFD (the Unicode replacement character).
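The two cases above can be modelled with a toy buffer. This is a sketch under stated assumptions, not the library's implementation: BufferSketch is a hypothetical module, the input is treated as UTF-8 rather than a legacy encoding, and the pending bytes are passed around explicitly instead of living inside a decoder resource. It relies on Erlang's :unicode.characters_to_binary/1, which reports a truncated trailing character as {:incomplete, converted, rest}.

```elixir
defmodule BufferSketch do
  # Toy model of the is_last behaviour, with UTF-8 input as a stand-in.
  # Returns {output, pending}, where pending holds bytes carried to the next call.
  # Invalid (as opposed to incomplete) input is not handled in this sketch.
  def feed(pending, chunk, is_last) do
    case :unicode.characters_to_binary(pending <> chunk) do
      # Everything decoded cleanly; nothing is left to carry over.
      text when is_binary(text) ->
        {text, <<>>}

      # Final chunk: the dangling bytes can never be completed, so emit U+FFFD.
      {:incomplete, text, _rest} when is_last ->
        {text <> "\uFFFD", <<>>}

      # More input may follow: keep the dangling bytes for the next call.
      {:incomplete, text, rest} ->
        {text, rest}
    end
  end
end

# The first byte of "あ" (UTF-8: <<0xE3, 0x81, 0x82>>) arrives alone and is held back.
{out1, pending} = BufferSketch.feed(<<>>, <<0xE3>>, false)
# The remaining bytes complete the character.
{out2, _pending} = BufferSketch.feed(pending, <<0x81, 0x82>>, true)
IO.inspect(out1 <> out2)
```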
Examples
iex> {:ok, decoder} = EncodingRs.Decoder.new("shift_jis")
iex> # Shift_JIS "あ" is <<0x82, 0xA0>> - split across chunks
iex> {:ok, out1, false} = EncodingRs.Decoder.decode_chunk(decoder, <<0x82>>, false)
iex> {:ok, out2, false} = EncodingRs.Decoder.decode_chunk(decoder, <<0xA0>>, true)
iex> out1 <> out2
"あ"
Decodes a chunk, raising on error.
See decode_chunk/3 for details.
Examples
iex> decoder = EncodingRs.Decoder.new!("utf-8")
iex> EncodingRs.Decoder.decode_chunk!(decoder, "hello", true)
{"hello", false}
@spec new(EncodingRs.encoding()) :: {:ok, t()} | {:error, :unknown_encoding}
Creates a new stateful decoder for the specified encoding.
The decoder maintains internal state to handle multibyte characters that may be split across chunk boundaries.
Arguments
- encoding - The source encoding label (e.g., "shift_jis", "gbk", "euc-jp")
Returns
- {:ok, decoder} on success
- {:error, :unknown_encoding} if the encoding is not recognized
Examples
iex> {:ok, decoder} = EncodingRs.Decoder.new("shift_jis")
iex> is_reference(decoder)
true
iex> EncodingRs.Decoder.new("invalid-encoding")
{:error, :unknown_encoding}
@spec new!(EncodingRs.encoding()) :: t()
Creates a new stateful decoder, raising on error.
Examples
iex> decoder = EncodingRs.Decoder.new!("shift_jis")
iex> is_reference(decoder)
true
iex> EncodingRs.Decoder.new!("invalid-encoding")
** (ArgumentError) unknown encoding: invalid-encoding
@spec stream(Enumerable.t(), EncodingRs.encoding()) :: Enumerable.t()
Creates a stream that decodes chunks from the given encoding to UTF-8.
This is the recommended way to process streaming data in multibyte encodings. It properly handles characters split across chunk boundaries.
Arguments
- chunks - An enumerable of binary chunks (e.g., from File.stream!/3)
- encoding - The source encoding label
Returns
A stream of decoded UTF-8 strings, one for each input chunk.
Examples
# Decode a Shift_JIS file
File.stream!("japanese.txt", [], 4096)
|> EncodingRs.Decoder.stream("shift_jis")
|> Enum.join()
# Process line by line (after decoding)
File.stream!("data.csv", [], 8192)
|> EncodingRs.Decoder.stream("gbk")
|> Enum.join()
|> String.split("\n")
# With error tracking
File.stream!("data.txt", [], 4096)
|> EncodingRs.Decoder.stream_with_errors("windows-1252")
|> Enum.reduce({"", false}, fn {chunk, errors}, {acc, had_any} ->
  {acc <> chunk, had_any or errors}
end)
Notes
- The stream automatically handles the is_last flag for the final chunk
- Each output element corresponds to one input chunk
- For better error visibility, use stream_with_errors/2
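A lazy stream cannot know which chunk is the last until the input ends, so flagging the final chunk requires one element of lookahead. The sketch below shows one way this could be done with Stream.transform/3. LastFlagSketch and flag_last are hypothetical names, and nothing here is taken from the library's source; it only illustrates how the first two notes can hold at the same time.

```elixir
defmodule LastFlagSketch do
  # Hypothetical helper, not the library's implementation.
  # Lazily turns an enumerable of binary chunks into {chunk, is_last} pairs by
  # holding each chunk back until the next one (or the end marker) arrives.
  # Assumes no chunk is the atom :end_of_input, which holds for binaries.
  def flag_last(chunks) do
    chunks
    |> Stream.concat([:end_of_input])
    |> Stream.transform(:none, fn
      # Empty input: nothing was held, so nothing is emitted.
      :end_of_input, :none -> {[], :none}
      # Input ended: the held chunk was the last one.
      :end_of_input, {:held, chunk} -> {[{chunk, true}], :none}
      # First chunk: hold it until more is known.
      chunk, :none -> {[], {:held, chunk}}
      # A later chunk arrived, so the held one was not the last.
      chunk, {:held, previous} -> {[{previous, false}], {:held, chunk}}
    end)
  end
end

["a", "b", "c"]
|> LastFlagSketch.flag_last()
|> Enum.to_list()
|> IO.inspect()
# prints [{"a", false}, {"b", false}, {"c", true}]
```

Each pair could then be passed to decode_chunk/3, which keeps the output at one element per input chunk.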
@spec stream_with_errors(Enumerable.t(), EncodingRs.encoding()) :: Enumerable.t()
Creates a stream that decodes chunks, including error information.
Like stream/2, but each element is a tuple {decoded_string, had_errors}.
Examples
# Logger.warning/1 is a macro, so Logger must be required first
require Logger

File.stream!("data.txt", [], 4096)
|> EncodingRs.Decoder.stream_with_errors("shift_jis")
|> Enum.each(fn {chunk, had_errors} ->
  if had_errors, do: Logger.warning("Encountered invalid bytes")
  IO.write(chunk)
end)
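Because each element pairs one chunk's output with its own error flag, the flags can also be used to locate where in the input the invalid bytes occurred. The sketch below works on a hand-written list in the {decoded_string, had_errors} shape; the sample tuples are made up for illustration and would normally come from stream_with_errors/2.

```elixir
# Made-up sample data in the {decoded_string, had_errors} shape.
decoded = [{"abc", false}, {"d\uFFFDf", true}, {"ghi", false}]

# Zero-based positions of the chunks that contained invalid bytes.
bad_chunks =
  decoded
  |> Enum.with_index()
  |> Enum.filter(fn {{_text, had_errors}, _index} -> had_errors end)
  |> Enum.map(fn {_pair, index} -> index end)

# The full decoded text, ignoring the flags.
text = Enum.map_join(decoded, fn {chunk, _had_errors} -> chunk end)

IO.inspect(bad_chunks)
# prints [1]
```

With fixed-size chunks, a chunk index multiplied by the chunk size gives an approximate byte offset for the problem region in the source file.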