Module mochiutf8

Algorithm to convert any binary to a valid UTF-8 sequence by ignoring invalid bytes.

Copyright © 2010 Mochi Media, Inc.

Authors: Bob Ippolito (bob@mochimedia.com).

Data Types

unichar()

unichar() = unichar_low() | unichar_high()

unichar_high()

unichar_high() = 57344..1114111

unichar_low()

unichar_low() = 0..55295
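
The two ranges together cover every Unicode scalar value: 0..55295 is U+0000 to U+D7FF, and 57344..1114111 is U+E000 to U+10FFFF, so the gap between them is exactly the surrogate block. A minimal sketch of the same types and a matching runtime check follows; the module name `unichar_sketch` and the function `is_unichar/1` are hypothetical and are not part of mochiutf8.

```erlang
-module(unichar_sketch).
-export([is_unichar/1]).
-export_type([unichar/0]).

%% Same ranges as the documented types, written in hexadecimal.
-type unichar_low()  :: 0..16#D7FF.          %% 0..55295
-type unichar_high() :: 16#E000..16#10FFFF.  %% 57344..1114111
-type unichar()      :: unichar_low() | unichar_high().

%% Hypothetical helper: true only for integers inside one of the two ranges.
-spec is_unichar(term()) -> boolean().
is_unichar(C) when is_integer(C), C >= 0, C =< 16#D7FF -> true;
is_unichar(C) when is_integer(C), C >= 16#E000, C =< 16#10FFFF -> true;
is_unichar(_) -> false.
```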

Function Index

bytes_foldl/3
bytes_to_codepoints/1
codepoint_foldl/3
codepoint_to_bytes/1: Convert a Unicode codepoint to UTF-8 bytes.
codepoints_to_bytes/1: Convert a list of codepoints to a UTF-8 binary.
len/1
read_codepoint/1
valid_utf8_bytes/1: Return only the bytes in B that represent valid UTF-8.

Function Details

bytes_foldl/3

bytes_foldl(F::fun((binary(), term()) -> term()), Acc::term(), Bin::binary()) -> term()

bytes_to_codepoints/1

bytes_to_codepoints(B::binary()) -> [unichar()]

codepoint_foldl/3

codepoint_foldl(F::fun((unichar(), term()) -> term()), Acc::term(), Bin::binary()) -> term()
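
The signature suggests a left fold that calls F once per decoded codepoint, threading an accumulator through the binary. The following is a minimal sketch of that idea using Erlang's built-in `utf8` segment type. It is not the module's implementation: the module name `fold_sketch` and the helper `count/1` are hypothetical, and this version crashes with `function_clause` on input that is not valid UTF-8.

```erlang
-module(fold_sketch).
-export([codepoint_foldl/3, count/1]).

%% Fold F over the codepoints of a valid UTF-8 binary, left to right.
codepoint_foldl(F, Acc, <<>>) when is_function(F, 2) ->
    Acc;
codepoint_foldl(F, Acc, <<C/utf8, Rest/binary>>) when is_function(F, 2) ->
    codepoint_foldl(F, F(C, Acc), Rest).

%% Hypothetical helper: count codepoints by folding with a counter.
count(Bin) ->
    codepoint_foldl(fun(_C, N) -> N + 1 end, 0, Bin).
```

Collecting the codepoints into a list with such a fold yields them in reverse order, so a caller would typically finish with `lists:reverse/1`.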

codepoint_to_bytes/1

codepoint_to_bytes(C::unichar()) -> binary()

Convert a Unicode codepoint to UTF-8 bytes.
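
As an illustration of what this conversion involves, here is a minimal sketch of UTF-8 encoding written with Erlang bit syntax. It is not taken from the module's source; the module name `encode_sketch` and the function `encode/1` are hypothetical. In everyday code the built-in form `<<C/utf8>>` produces the same bytes.

```erlang
-module(encode_sketch).
-export([encode/1]).

%% Encode one Unicode scalar value as 1 to 4 UTF-8 bytes.
%% Surrogates and values above U+10FFFF match no clause and crash.
encode(C) when is_integer(C) -> enc(C).

%% U+0000..U+007F: one byte, 7 payload bits.
enc(C) when C >= 0, C =< 16#7F ->
    <<C>>;
%% U+0080..U+07FF: two bytes, 5 + 6 payload bits.
enc(C) when C >= 16#80, C =< 16#7FF ->
    <<2#110:3, (C bsr 6):5,
      2#10:2, (C band 16#3F):6>>;
%% U+0800..U+FFFF minus the surrogate block: three bytes, 4 + 6 + 6 bits.
enc(C) when C >= 16#800, C =< 16#D7FF; C >= 16#E000, C =< 16#FFFF ->
    <<2#1110:4, (C bsr 12):4,
      2#10:2, ((C bsr 6) band 16#3F):6,
      2#10:2, (C band 16#3F):6>>;
%% U+10000..U+10FFFF: four bytes, 3 + 6 + 6 + 6 bits.
enc(C) when C >= 16#10000, C =< 16#10FFFF ->
    <<2#11110:5, (C bsr 18):3,
      2#10:2, ((C bsr 12) band 16#3F):6,
      2#10:2, ((C bsr 6) band 16#3F):6,
      2#10:2, (C band 16#3F):6>>.
```

For example, `encode_sketch:encode(16#E9)` gives `<<16#C3, 16#A9>>` and `encode_sketch:encode(16#20AC)` gives `<<16#E2, 16#82, 16#AC>>`.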

codepoints_to_bytes/1

codepoints_to_bytes(L::[unichar()]) -> binary()

Convert a list of codepoints to a UTF-8 binary.
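
A list conversion of this kind can be sketched with a binary comprehension over the built-in `utf8` segment type. The module name `codepoints_sketch` and both function names are hypothetical, and the decoder shown for the round trip assumes valid UTF-8 input (it crashes otherwise).

```erlang
-module(codepoints_sketch).
-export([to_bytes/1, to_codepoints/1]).

%% Encode each codepoint in the list and concatenate the results.
to_bytes(L) when is_list(L) ->
    << <<C/utf8>> || C <- L >>.

%% Inverse direction, for valid UTF-8 input only.
to_codepoints(<<>>) ->
    [];
to_codepoints(<<C/utf8, Rest/binary>>) ->
    [C | to_codepoints(Rest)].
```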

len/1

len(B::binary()) -> non_neg_integer()

read_codepoint/1

read_codepoint(Bin::binary()) -> {unichar(), binary(), binary()}
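
The page does not describe the elements of the returned tuple. A plausible reading of the signature, treated here as an assumption, is `{Codepoint, EncodedBytes, Rest}`: the decoded codepoint, the bytes that encoded it, and the remaining input. A minimal sketch under that assumption, with the hypothetical names `read_sketch` and `read/1`:

```erlang
-module(read_sketch).
-export([read/1]).

%% Decode the first codepoint of a valid UTF-8 binary.
%% Returns {Codepoint, BytesOfThatCodepoint, RemainingBinary} (assumed shape).
read(<<C/utf8, Rest/binary>> = Bin) ->
    %% The encoded length is whatever the utf8 match consumed.
    Size = byte_size(Bin) - byte_size(Rest),
    <<Bytes:Size/binary, _/binary>> = Bin,
    {C, Bytes, Rest}.
```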

valid_utf8_bytes/1

valid_utf8_bytes(B::binary()) -> binary()

Return only the bytes in B that represent valid UTF-8. The algorithm is recursive. If the input at the current position does not follow UTF-8 syntax (a 1-4 byte encoding of some number), one byte is skipped. If a syntactically well-formed sequence of 2-4 bytes is an overlong encoding or encodes a bad code point (a surrogate, U+D800 to U+DFFF, or a value above U+10FFFF), the whole sequence is skipped.
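
The filtering idea can be sketched with Erlang's built-in `utf8` segment type, whose pattern match already fails on overlong encodings, surrogates, and values above U+10FFFF. This is a simplified variant, not the module's implementation: it always skips a single byte on failure instead of a whole 2-4 byte sequence. The output should be the same, because the continuation bytes left behind can never start a valid sequence and are dropped one at a time. The names `utf8_sketch` and `valid_bytes/1` are hypothetical.

```erlang
-module(utf8_sketch).
-export([valid_bytes/1]).

%% Keep only the bytes of Bin that form valid UTF-8 sequences.
valid_bytes(Bin) when is_binary(Bin) ->
    valid_bytes(Bin, <<>>).

%% A valid codepoint at the front: re-emit its bytes and continue.
valid_bytes(<<C/utf8, Rest/binary>>, Acc) ->
    valid_bytes(Rest, <<Acc/binary, C/utf8>>);
%% Anything else: drop one byte and retry at the next position.
valid_bytes(<<_, Rest/binary>>, Acc) ->
    valid_bytes(Rest, Acc);
valid_bytes(<<>>, Acc) ->
    Acc.
```

For instance, the overlong pair `16#C0, 16#80` and the encoded surrogate `16#ED, 16#A0, 16#80` are both removed entirely, while surrounding ASCII and well-formed multi-byte characters pass through unchanged.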

