A much improved version is available here: https://github.com/elixir-unicode/unicode
See Unicode.replace_invalid/3.
UniRecover
A library for substituting illegal bytes in Unicode-encoded data, following the W3C spec as suggested by the Unicode Standard.
This library leverages Erlang sub binaries to scale well with large inputs. This should suffice for most use cases, short of those that necessitate NIF-based solutions.
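The sub binary idea can be sketched roughly as follows. This is a minimal illustration, not UniRecover's actual implementation: the module name `SubBinarySketch` and its `replace_invalid/1` function are hypothetical, it handles UTF-8 only, and it emits one U+FFFD per illegal byte, which may differ from the substitution rules the library follows. The point is that valid runs are referenced with `binary_part/3` rather than copied byte by byte.

```elixir
defmodule SubBinarySketch do
  # Hypothetical module for illustration only; not part of UniRecover.
  @replacement "\uFFFD"

  # Replaces every byte that cannot start a valid UTF-8 sequence with U+FFFD.
  def replace_invalid(bin) when is_binary(bin), do: scan(bin, bin, 0, 0, [])

  # Valid code point: advance the position by its byte width, allocate nothing.
  defp scan(<<_cp::utf8, rest::binary>> = cur, orig, start, pos, acc) do
    scan(rest, orig, start, pos + (byte_size(cur) - byte_size(rest)), acc)
  end

  # Illegal byte: emit the valid run before it as a sub binary of the
  # original input, then the replacement character, and skip the byte.
  defp scan(<<_bad, rest::binary>>, orig, start, pos, acc) do
    valid = binary_part(orig, start, pos - start)
    scan(rest, orig, pos + 1, pos + 1, [@replacement, valid | acc])
  end

  # End of input: emit the trailing valid run and flatten the iodata once.
  defp scan(<<>>, orig, start, pos, acc) do
    [binary_part(orig, start, pos - start) | acc]
    |> Enum.reverse()
    |> IO.iodata_to_binary()
  end
end
```

With this sketch, `SubBinarySketch.replace_invalid(<<"foo", 0xFF, "bar">>)` returns `"foo�bar"`. The accumulator only ever holds references into the original binary plus the replacement markers, so the work done per valid byte is a position update.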
Installation
Add :uni_recover to your list of dependencies in mix.exs:
def deps do
  [
    {:uni_recover, "~> 0.1.2"}
  ]
end
Documentation is available on HexDocs and may also be generated with ExDoc.
Usage
# 0b11111111 = an illegal utf-8 code sequence
UniRecover.sub(<<"foo", 0b11111111, "bar">>)
# "foo�bar"
# 216, 0 = 0xD800, an unpaired high surrogate, which is an illegal utf-16 code sequence
(UniRecover.sub(<<"foo"::utf16, 216, 0, "bar"::utf16>>, :utf16)
|> :unicode.characters_to_binary(:utf16))
# "foo�bar"
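To see why the two inputs above are rejected without UniRecover, the standard library can be asked directly. The sketch below uses only `String.valid?/1` and Erlang's `:unicode.characters_to_binary/1,2`; the variable names are illustrative. On malformed input, `:unicode` returns an error tuple holding the successfully converted prefix and the unconverted rest, instead of substituting anything.

```elixir
utf8_input = <<"foo", 0b11111111, "bar">>
utf16_input = <<"foo"::utf16, 216, 0, "bar"::utf16>>

# String.valid?/1 checks UTF-8 well-formedness; 0xFF can never appear in UTF-8.
IO.inspect(String.valid?(utf8_input))

# Conversion stops at the first illegal byte and reports what is left over.
{:error, converted, rest} = :unicode.characters_to_binary(utf8_input)
IO.inspect({converted, rest})

# Same check for the UTF-16 input, where 0xD800 is not followed by a low surrogate.
IO.inspect(:unicode.characters_to_binary(utf16_input, :utf16))
```

The error tuples show where decoding stopped, which is the position a substitution library then has to recover from.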
Benchmarking
The following benchmark demonstrates how UniRecover leverages sub binaries, only allocating the indexes of illegal bytes. See the benchmarking folder in the repo for details.
Name                                  ips        average  deviation         median         99th %
UniRecover, 207KB Input           1842.84      542.64 μs     ±1.44%      539.67 μs      574.71 μs
Simple Rebuild, 207KB Input        172.02     5813.34 μs    ±13.88%     5534.29 μs     8223.92 μs
Naive 3-liner, 207KB Input          56.59    17670.58 μs     ±6.44%    17377.60 μs    19210.26 μs

Comparison:
UniRecover, 207KB Input           1842.84
Simple Rebuild, 207KB Input        172.02 - 10.71x slower +5270.70 μs
Naive 3-liner, 207KB Input          56.59 - 32.56x slower +17127.94 μs

Memory usage statistics:

Name                           Memory usage
UniRecover, 207KB Input               296 B
Simple Rebuild, 207KB Input       8215208 B - 27754.08x memory usage +8214912 B
Naive 3-liner, 207KB Input       39556040 B - 133635.27x memory usage +39555744 B
For reference, the Simple Rebuild implementation allocated 39.66x the size of the original JSON input, and the Naive 3-liner was even worse at 191x the original.
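The small memory figure for the sub binary approach fits how sub binaries behave on the BEAM: a slice of a large binary is a reference into the original data, not a copy. The following standalone sketch shows this with `:binary.referenced_byte_size/1` from the Erlang standard library. The sizes and variable names are arbitrary and are not taken from the benchmark.

```elixir
# A large reference-counted binary, roughly the size of the benchmark input.
big = :binary.copy("a", 200_000)

# Slicing creates a sub binary that points into `big`.
part = binary_part(big, 1_000, 10)

IO.inspect(byte_size(part))                     # 10
IO.inspect(:binary.referenced_byte_size(part))  # 200000, the slice shares big's storage

# An explicit copy detaches the slice from the original binary.
copy = :binary.copy(part)
IO.inspect(:binary.referenced_byte_size(copy))  # 10
```

The trade-off is that a sub binary keeps the whole original binary alive for as long as the slice is referenced, so `:binary.copy/1` is worth considering when only a small fragment needs to be retained long term.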