A much improved version is available here: https://github.com/elixir-unicode/unicode
See Unicode.replace_invalid/3.
UniRecover
A library for substituting illegal bytes in Unicode-encoded data, following the W3C spec as suggested by the Unicode Standard.
This library leverages Erlang Sub Binaries to scale well with large amounts of data. This should suffice for most use-cases, short of those that may necessitate NIF-based solutions.
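The sub binary idea can be sketched as follows. This is a minimal illustration under stated assumptions, not UniRecover's actual implementation: the module name `SubBinarySketch` and its `scrub/1` function are hypothetical, it handles UTF-8 only, and it replaces every invalid byte individually, which may differ from the W3C behavior for truncated multi-byte sequences. The point is that valid runs are returned as slices of the original binary (via `binary_part/3`, which typically references the original data rather than copying it), so the only new data is the replacement character and the list that ties the slices together.

```elixir
defmodule SubBinarySketch do
  # Hypothetical module for illustration; not part of UniRecover.
  @replacement "\uFFFD"

  # Returns iodata: slices of the original binary interleaved with U+FFFD.
  def scrub(bin) when is_binary(bin), do: scan(bin, bin, 0, [])

  # End of input: emit the final valid run, then restore the original order.
  defp scan(orig, <<>>, start, acc) do
    Enum.reverse([binary_part(orig, start, byte_size(orig) - start) | acc])
  end

  # A well-formed UTF-8 code point: keep scanning, nothing is copied.
  defp scan(orig, <<_cp::utf8, rest::binary>>, start, acc) do
    scan(orig, rest, start, acc)
  end

  # Any other byte is treated as illegal: slice out the valid run that
  # precedes it and emit a replacement character in its place.
  defp scan(orig, <<_bad, rest::binary>>, start, acc) do
    bad_at = byte_size(orig) - byte_size(rest) - 1
    valid = binary_part(orig, start, bad_at - start)
    scan(orig, rest, bad_at + 1, [@replacement, valid | acc])
  end
end

IO.iodata_to_binary(SubBinarySketch.scrub(<<"foo", 0b11111111, "bar">>))
# "foo�bar"
```

Returning iodata defers the single final copy to the caller, who can often pass the result straight to a socket or file without flattening it at all.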
Installation
Add :uni_recover to your list of dependencies in mix.exs:
def deps do
[
{:uni_recover, "~> 0.1.2"}
]
end
Documentation is available on HexDocs and may also be generated with ExDoc.
Usage
# 0b11111111 = an illegal utf-8 code sequence
UniRecover.sub(<<"foo", 0b11111111, "bar">>)
# "foo�bar"
# 216, 0 = an illegal utf-16 code sequence
(UniRecover.sub(<<"foo"::utf16, 216, 0, "bar"::utf16>>, :utf16)
|> :unicode.characters_to_binary(:utf16))
# "foo�bar"
Benchmarking
The following benchmark demonstrates how UniRecover leverages sub binaries, only allocating the indexes of illegal bytes. See the benchmarking folder in the repo for details.
Name                               ips        average  deviation         median         99th %
UniRecover, 207KB Input        1842.84      542.64 μs     ±1.44%      539.67 μs      574.71 μs
Simple Rebuild, 207KB Input     172.02     5813.34 μs    ±13.88%     5534.29 μs     8223.92 μs
Naive 3-liner, 207KB Input       56.59    17670.58 μs     ±6.44%    17377.60 μs    19210.26 μs
Comparison:
UniRecover, 207KB Input 1842.84
Simple Rebuild, 207KB Input 172.02 - 10.71x slower +5270.70 μs
Naive 3-liner, 207KB Input 56.59 - 32.56x slower +17127.94 μs
Memory usage statistics:
Name                          Memory usage
UniRecover, 207KB Input            296 B
Simple Rebuild, 207KB Input    8215208 B - 27754.08x memory usage +8214912 B
Naive 3-liner, 207KB Input    39556040 B - 133635.27x memory usage +39555744 B
For reference, the Simple Rebuild implementation allocated 39.66x the size of the original JSON input, and the Naive 3-liner fared even worse at 191x.
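To make the memory figures more concrete, a rebuild-style approach might look like the sketch below. This is a hypothetical illustration and not the code from the repository's benchmarking folder: the module name `RebuildSketch` is invented here, it covers UTF-8 only, and it replaces each invalid byte individually. Because every code point is appended to a fresh accumulator, the work and allocation grow with the size of the input even when the input contains no illegal bytes at all.

```elixir
defmodule RebuildSketch do
  # Hypothetical module for illustration; not the benchmarked implementation.
  def scrub(bin) when is_binary(bin), do: scrub(bin, <<>>)

  # End of input: the accumulator is the fully rebuilt binary.
  defp scrub(<<>>, acc), do: acc

  # A well-formed code point is re-encoded and appended to the accumulator.
  defp scrub(<<cp::utf8, rest::binary>>, acc) do
    scrub(rest, <<acc::binary, cp::utf8>>)
  end

  # Any other byte is treated as illegal and replaced with U+FFFD.
  defp scrub(<<_bad, rest::binary>>, acc) do
    scrub(rest, <<acc::binary, "\uFFFD">>)
  end
end

RebuildSketch.scrub(<<"foo", 0b11111111, "bar">>)
# "foo�bar"
```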