EncodingRs.Decoder (encoding_rs v0.2.2)
Stateful streaming decoder for converting encoded byte streams to UTF-8.
This module provides a streaming API for decoding multibyte encodings (like Shift_JIS, GBK, Big5, EUC-JP, etc.) where characters may be split across chunk boundaries.
Why Use Streaming Decoding?
Multibyte encodings use variable-length byte sequences to represent characters.
For example, in Shift_JIS, the character "あ" is encoded as two bytes: <<0x82, 0xA0>>.
When processing data in chunks (e.g., from File.stream!/1 or network streams),
a character's bytes may be split across chunks:
# Chunk 1 ends with first byte of "あ"
chunk1 = <<..., 0x82>>
# Chunk 2 starts with second byte of "あ"
chunk2 = <<0xA0, ...>>
The one-shot EncodingRs.decode/2 treats each chunk independently, so:
- Chunk 1's trailing 0x82 is invalid → replaced with �
- Chunk 2's leading 0xA0 is invalid → replaced with �
The streaming decoder maintains state between chunks, properly buffering incomplete sequences until completed.
Usage
Manual Chunked Decoding
{:ok, decoder} = EncodingRs.Decoder.new("shift_jis")
{:ok, output1, _} = EncodingRs.Decoder.decode_chunk(decoder, chunk1, false)
{:ok, output2, _} = EncodingRs.Decoder.decode_chunk(decoder, chunk2, false)
{:ok, output3, _} = EncodingRs.Decoder.decode_chunk(decoder, chunk3, true)
result = output1 <> output2 <> output3
Stream-Based Decoding
File.stream!("data.txt", [], 4096)
|> EncodingRs.Decoder.stream("shift_jis")
|> Enum.join()
Important Notes
- Always pass is_last as true for the final chunk to flush any buffered bytes
- The decoder resource is mutable; don't share it across concurrent processes
- For single complete binaries, use EncodingRs.decode/2 instead (more efficient)
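The first note can be illustrated with a small wrapper that sets is_last only on the final chunk. This is a minimal sketch, assuming the encoding_rs package is already a dependency; ManualDecodeSketch and decode_all/2 are hypothetical names and not part of the library.

```elixir
defmodule ManualDecodeSketch do
  # Hypothetical helper, not part of EncodingRs: decodes a non-empty list of
  # binary chunks with a single decoder, flagging only the final chunk as last.
  def decode_all([_ | _] = chunks, encoding) do
    with {:ok, decoder} <- EncodingRs.Decoder.new(encoding) do
      last_index = length(chunks) - 1

      {parts, had_errors} =
        chunks
        |> Enum.with_index()
        |> Enum.map_reduce(false, fn {chunk, index}, any_errors ->
          {:ok, output, errors} =
            EncodingRs.Decoder.decode_chunk(decoder, chunk, index == last_index)

          # Keep the decoded text and remember whether any chunk had errors.
          {output, any_errors or errors}
        end)

      {:ok, IO.iodata_to_binary(parts), had_errors}
    end
  end
end
```

The decoder is created and used inside a single function call, so it is never shared between processes. An unknown encoding label falls through the with expression and is returned as {:error, :unknown_encoding}.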
Summary
Types
decode_result() - Result of decoding a chunk: {:ok, decoded_string, had_errors}
Functions
decode_chunk/3 - Decodes a chunk of bytes using the stateful decoder.
decode_chunk!/3 - Decodes a chunk, raising on error.
new/1 - Creates a new stateful decoder for the specified encoding.
new!/1 - Creates a new stateful decoder, raising on error.
stream/2 - Creates a stream that decodes chunks from the given encoding to UTF-8.
stream_with_errors/2 - Creates a stream that decodes chunks, including error information.
Types
Functions
@spec decode_chunk(t(), binary(), boolean()) :: decode_result()
Decodes a chunk of bytes using the stateful decoder.
This function properly handles multibyte characters split across chunk boundaries by maintaining decoder state between calls.
Arguments
- decoder - The decoder reference from new/1
- chunk - The binary chunk to decode
- is_last - Set to true for the final chunk (default: false)
Returns
- {:ok, output, had_errors} on success
  - output - The decoded UTF-8 string for this chunk
  - had_errors - true if any bytes were replaced with U+FFFD
Behavior
- When is_last is false: Incomplete byte sequences at the end of the chunk are buffered internally and completed with the next chunk.
- When is_last is true: Any remaining incomplete sequences are replaced with U+FFFD (the Unicode replacement character).
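The two cases above can be observed by feeding a lone lead byte and then ending the input. This is a sketch under stated assumptions: the encoding_rs package is available, and an empty binary is accepted as the final chunk (the spec only requires binary(), but this is not confirmed by the text).

```elixir
{:ok, decoder} = EncodingRs.Decoder.new("shift_jis")

# <<0x82>> is only the first byte of a two-byte character. With is_last
# set to false, the byte should be buffered rather than decoded.
{:ok, pending, pending_errors} =
  EncodingRs.Decoder.decode_chunk(decoder, <<0x82>>, false)

# Assumption: an empty final chunk is allowed. Ending the input here leaves
# the buffered byte incomplete, so it should surface as U+FFFD.
{:ok, flushed, flushed_errors} =
  EncodingRs.Decoder.decode_chunk(decoder, <<>>, true)

IO.inspect({pending, pending_errors}, label: "buffered")
IO.inspect({flushed, flushed_errors}, label: "flushed")
```

If the behavior matches the description, the first call reports no errors and the second reports had_errors as true.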
Examples
iex> {:ok, decoder} = EncodingRs.Decoder.new("shift_jis")
iex> # Shift_JIS "あ" is <<0x82, 0xA0>> - split across chunks
iex> {:ok, out1, false} = EncodingRs.Decoder.decode_chunk(decoder, <<0x82>>, false)
iex> {:ok, out2, false} = EncodingRs.Decoder.decode_chunk(decoder, <<0xA0>>, true)
iex> out1 <> out2
"あ"
decode_chunk!/3 decodes a chunk, raising on error. On success it returns the bare {output, had_errors} tuple instead of an :ok tuple.
See decode_chunk/3 for details.
Examples
iex> decoder = EncodingRs.Decoder.new!("utf-8")
iex> EncodingRs.Decoder.decode_chunk!(decoder, "hello", true)
{"hello", false}
@spec new(EncodingRs.encoding()) :: {:ok, t()} | {:error, :unknown_encoding}
Creates a new stateful decoder for the specified encoding.
The decoder maintains internal state to handle multibyte characters that may be split across chunk boundaries.
Arguments
- encoding - The source encoding label (e.g., "shift_jis", "gbk", "euc-jp")
Returns
- {:ok, decoder} on success
- {:error, :unknown_encoding} if the encoding is not recognized
Examples
iex> {:ok, decoder} = EncodingRs.Decoder.new("shift_jis")
iex> is_reference(decoder)
true
iex> EncodingRs.Decoder.new("invalid-encoding")
{:error, :unknown_encoding}
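Because new/1 returns a tagged tuple, callers can branch on an unrecognized label instead of rescuing an exception. The following is a minimal sketch, assuming the encoding_rs package is available; DecoderSetupSketch and first_known/1 are hypothetical names used only for illustration.

```elixir
defmodule DecoderSetupSketch do
  # Hypothetical helper, not part of EncodingRs: tries each label in order
  # and returns the first one that new/1 accepts, with its decoder.
  def first_known(labels) do
    Enum.find_value(labels, {:error, :unknown_encoding}, fn label ->
      case EncodingRs.Decoder.new(label) do
        {:ok, decoder} -> {:ok, label, decoder}
        # Returning nil tells find_value/3 to keep searching.
        {:error, :unknown_encoding} -> nil
      end
    end)
  end
end
```

This might be useful when a label comes from untrusted metadata and a fallback encoding is acceptable, for example DecoderSetupSketch.first_known([user_label, "windows-1252"]), where user_label is a placeholder.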
@spec new!(EncodingRs.encoding()) :: t()
Creates a new stateful decoder, raising on error.
Examples
iex> decoder = EncodingRs.Decoder.new!("shift_jis")
iex> is_reference(decoder)
true
iex> EncodingRs.Decoder.new!("invalid-encoding")
** (ArgumentError) unknown encoding: invalid-encoding
@spec stream(Enumerable.t(), EncodingRs.encoding()) :: Enumerable.t()
Creates a stream that decodes chunks from the given encoding to UTF-8.
This is the recommended way to process streaming data in multibyte encodings. It properly handles characters split across chunk boundaries.
Arguments
- chunks - An enumerable of binary chunks (e.g., from File.stream!/3)
- encoding - The source encoding label
Returns
A stream of decoded UTF-8 strings. One element is emitted per input chunk. An additional element may be emitted at the end if the decoder still has buffered bytes (e.g., an incomplete multibyte sequence that is flushed as a replacement character).
Examples
# Decode a Shift_JIS file
File.stream!("japanese.txt", [], 4096)
|> EncodingRs.Decoder.stream("shift_jis")
|> Enum.join()
# Process line by line (after decoding)
File.stream!("data.csv", [], 8192)
|> EncodingRs.Decoder.stream("gbk")
|> Enum.join()
|> String.split("\n")
# With error tracking
File.stream!("data.txt", [], 4096)
|> EncodingRs.Decoder.stream_with_errors("windows-1252")
|> Enum.reduce({"", false}, fn {chunk, errors}, {acc, had_any} ->
  {acc <> chunk, had_any or errors}
end)
Notes
- The stream automatically handles the is_last flag for the final chunk
- The output may contain one more element than the input if buffered bytes are flushed at the end of the stream
- For better error visibility, use stream_with_errors/2
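Since stream/2 accepts any enumerable of binaries, the boundary handling can be tried with an in-memory list instead of a file. This is a small sketch, assuming the encoding_rs package is available; the chunk values reuse the Shift_JIS bytes for "あ" from the decode_chunk/3 example.

```elixir
# "あ" in Shift_JIS, deliberately split across two chunks.
chunks = [<<0x82>>, <<0xA0>>]

parts =
  chunks
  |> EncodingRs.Decoder.stream("shift_jis")
  |> Enum.to_list()

# Element boundaries depend on buffering, so inspect the joined text
# rather than relying on what each individual element contains.
IO.inspect(parts, label: "elements")
IO.inspect(Enum.join(parts), label: "joined")
```

The number of elements may differ from the number of chunks by one, as described in the notes, so downstream code should not assume a one-to-one mapping.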
@spec stream_with_errors(Enumerable.t(), EncodingRs.encoding()) :: Enumerable.t()
Creates a stream that decodes chunks, including error information.
Like stream/2, but each element is a tuple {decoded_string, had_errors}.
Examples
require Logger

File.stream!("data.txt", [], 4096)
|> EncodingRs.Decoder.stream_with_errors("shift_jis")
|> Enum.each(fn {chunk, had_errors} ->
  if had_errors, do: Logger.warning("Encountered invalid bytes")
  IO.write(chunk)
end)
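When invalid input should abort processing rather than be logged, the had_errors flag can drive an early exit. The sketch below assumes the encoding_rs package is available; StrictDecodeSketch, decode_strict/2, and the :invalid_bytes reason are hypothetical and not part of the library.

```elixir
defmodule StrictDecodeSketch do
  # Hypothetical helper, not part of EncodingRs: joins decoded chunks but
  # stops at the first element that reports replaced bytes.
  def decode_strict(chunks, encoding) do
    chunks
    |> EncodingRs.Decoder.stream_with_errors(encoding)
    |> Enum.reduce_while({:ok, []}, fn
      {_text, true}, _acc -> {:halt, {:error, :invalid_bytes}}
      {text, false}, {:ok, acc} -> {:cont, {:ok, [acc, text]}}
    end)
    |> case do
      # Collected pieces are iodata; flatten them into one binary at the end.
      {:ok, iodata} -> {:ok, IO.iodata_to_binary(iodata)}
      error -> error
    end
  end
end
```

Halting early means the rest of the input is never read, which may matter for large files. The trade-off is that the caller gets no partial output once an error is seen.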