Codepagex (Codepagex v0.1.11)

Codepagex is an Elixir library for converting strings between utf-8 and other encodings. It is like iconv, but written in pure Elixir.

All the encodings are fetched from unicode.org tables and conversion functions are generated from these at compile time.

Note on the built-in :unicode module

Note that Erlang's built-in :unicode module has some provisions for converting between utf-8 and latin1 code sets. If that is all you need, consider relying on this simpler alternative rather than using codepagex.

Compared to this functionality codepagex provides:

  • More codepage mapping options
  • The ability to handle illegal encoding with custom logic
  • A simpler interface

But please remember that codepagex is comparatively a lot more complex, making extensive use of macro programming.
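
For comparison, the built-in route mentioned above can be sketched as follows. This is a minimal illustration using Erlang's :unicode.characters_to_binary/3; the module name Latin1Sketch and its functions are hypothetical and not part of Codepagex.

```elixir
defmodule Latin1Sketch do
  # Hypothetical helper module for illustration; not part of Codepagex.

  # Encode a utf-8 string as latin1 (ISO 8859-1) bytes.
  def encode(string) when is_binary(string) do
    case :unicode.characters_to_binary(string, :utf8, :latin1) do
      bytes when is_binary(bytes) -> {:ok, bytes}
      # :unicode returns {:error, converted, rest} or
      # {:incomplete, converted, rest} when conversion stops early.
      {reason, _converted, rest} -> {:error, {reason, rest}}
    end
  end

  # Decode latin1 bytes into a utf-8 string. Every byte is a valid
  # latin1 character, so this direction does not fail for binaries.
  def decode(bytes) when is_binary(bytes) do
    :unicode.characters_to_binary(bytes, :latin1, :utf8)
  end
end

Latin1Sketch.encode("æøå")
# => {:ok, <<230, 248, 229>>}
Latin1Sketch.decode(<<230, 248, 229>>)
# => "æøå"
```

Anything beyond latin1, or any custom handling of unrepresentable characters, is where codepagex becomes the better fit.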

Examples

The package is meant to be used only through the Codepagex module.

iex> from_string("æøåÆØÅ", :iso_8859_1)
{:ok, <<230, 248, 229, 198, 216, 197>>}

iex> to_string(<<230, 248, 229, 198, 216, 197>>, :iso_8859_1)
{:ok, "æøåÆØÅ"}

iex> from_string!("æøåÆØÅ", :iso_8859_1)
<<230, 248, 229, 198, 216, 197>>

iex> to_string!(<<230, 248, 229, 198, 216, 197>>, :iso_8859_1)
"æøåÆØÅ"

When a string or an encoded binary contains invalid byte sequences, these functions will not succeed. If you still want to convert such input, you may pass a function that handles the invalid sequences. For example:

iex> from_string("Hello æøå!", :ascii, replace_nonexistent("_"))
{:ok, "Hello ___!", 3}

iex> iso = "Hello æøå!" |> from_string!(:iso_8859_1)
iex> to_string!(iso, :ascii, use_utf_replacement())
"Hello ���!"

Encodings

A full list of encodings is found by running encoding_list/1.

Encodings are best supplied as atoms. A string is also accepted and is converted to an atom for you, but with a somewhat less efficient function lookup. For example:

iex> from_string("æøå", "ISO8859/8859-9")
{:ok, <<230, 248, 229>>}

iex> from_string("æøå", :"ISO8859/8859-9")
{:ok, <<230, 248, 229>>}

For some encodings, an alias is set up for easier dispatch. The list of aliases is found by running aliases/1. The code looks like:

iex> from_string!("Hello æøåÆØÅ!", :iso_8859_1)
<<72, 101, 108, 108, 111, 32, 230, 248, 229, 198, 216, 197, 33>>

Encoding selection

By default all ISO-8859 encodings and ASCII are included. A few more are available, and these must be specified in the config/config.exs file. The specified encodings are then compiled. Adding many encodings may affect compilation times, in particular for the largest ones.

To specify the encodings to use, add the following lines to your config/config.exs and recompile:

use Mix.Config
config :codepagex, :encodings, [:ascii]

This will add only the ASCII encoding, as specified by its shorthand alias. Any number of encodings may be specified like this in the list. The list may contain strings, atoms or regular expressions that match either an alias or a full encoding name, e.g.:

use Mix.Config
config :codepagex, :encodings, [
  :ascii,           # by alias name
  ~r[iso8859]i,     # by a regex matching the full name
  "ETSI/GSM0338",   # by the full name as a string
  :"MISC/CP856"     # by a full name as an atom
]
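
On Elixir 1.9 and later, use Mix.Config is deprecated in favour of import Config. Assuming such a version, a similar selection could be written as in the sketch below. It only swaps the first line and is not something the Codepagex documentation itself prescribes.

```elixir
# config/config.exs
import Config

# Encoding selection for Codepagex, written with the Config module
# that replaced Mix.Config in Elixir 1.9.
config :codepagex, :encodings, [
  :ascii,           # by alias name
  ~r[iso8859]i,     # by a regex matching the full name
  "ETSI/GSM0338"    # by the full name as a string
]
```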

After modifying the encodings list in the configuration, always make sure to run the following or the encodings you specified will not be compiled in:

mix deps.compile codepagex --force

This is necessary due to the fact that Codepagex's configuration changes are not picked up automatically when it's a dependency in another project. Credit for the find goes to @michalmuskala here: https://elixirforum.com/t/sharing-with-the-community-text-transcoding-libraries/17962/2

The encodings that are known to require very long compile times are:

  • VENDORS/MISC/KPS9566
  • VENDORS/MICSFT/WINDOWS/CP932
  • VENDORS/MICSFT/WINDOWS/CP936
  • VENDORS/MICSFT/WINDOWS/CP949
  • VENDORS/MICSFT/WINDOWS/CP950

TODO

  • A few encodings are not yet supported for different reasons. In particular the Asian and Arabic ones with left-right and up-down variations.
  • Test Elixir function specs
  • Benchmarking vs iconv native libraries
  • Support for iolists
  • When converting sections of a string that are unchanged, return the original input. Consider using iolists to return the values so that chunks may be saved continuously
  • Lazy converter to get n characters / codepoints
  • Function to drop n characters and take n characters (and slice?)

Types

encoding()

@type encoding() :: atom() | String.t()

from_s_missing_inner()

@type from_s_missing_inner() ::
  (String.t(), term() -> {:ok, binary(), String.t(), term()} | {:error, term()})

from_s_missing_outer()

@type from_s_missing_outer() ::
  (String.t() -> {:ok, from_s_missing_inner()} | {:error, term()})

to_s_missing_inner()

@type to_s_missing_inner() ::
  (binary(), term() -> {:ok, String.t(), binary(), term()} | {:error, term()})

to_s_missing_outer()

@type to_s_missing_outer() ::
  (String.t() -> {:ok, to_s_missing_inner()} | {:error, term()})

Functions

aliases(selection \\ nil)

@spec aliases(atom()) :: [atom()]

Returns a list of shorthand aliases that may be used instead of the full name of the encoding.

The available aliases are:

Alias          Full name
:ascii         VENDORS/MISC/US-ASCII-QUOTES
:iso_8859_1    ISO8859/8859-1
:iso_8859_2    ISO8859/8859-2
:iso_8859_3    ISO8859/8859-3
:iso_8859_4    ISO8859/8859-4
:iso_8859_5    ISO8859/8859-5
:iso_8859_6    ISO8859/8859-6
:iso_8859_7    ISO8859/8859-7
:iso_8859_8    ISO8859/8859-8
:iso_8859_9    ISO8859/8859-9
:iso_8859_10   ISO8859/8859-10
:iso_8859_11   ISO8859/8859-11
:iso_8859_12   ISO8859/8859-12
:iso_8859_13   ISO8859/8859-13
:iso_8859_14   ISO8859/8859-14
:iso_8859_15   ISO8859/8859-15
:iso_8859_16   ISO8859/8859-16

Some of these may not be available depending on the mix configuration. If the selection parameter is :all, all possible aliases are listed; otherwise only the available aliases are listed.

For a full list of encodings, see encoding_list/1.

encoding_list(selection \\ nil)

@spec encoding_list(atom()) :: [String.t()]

Returns a list of the supported encodings. These are extracted from http://unicode.org/ and the names correspond to an encoding file on that page.

encoding_list/1 is normally called without any parameters to list the encodings that were configured at compile time. To see all possible encodings, including those that are not compiled in, use encoding_list(:all).

The available encodings are:

ETSI/GSM0338                   ISO8859/8859-1                 ISO8859/8859-10
ISO8859/8859-11                ISO8859/8859-13                ISO8859/8859-14
ISO8859/8859-15                ISO8859/8859-16                ISO8859/8859-2
ISO8859/8859-3                 ISO8859/8859-4                 ISO8859/8859-5
ISO8859/8859-6                 ISO8859/8859-7                 ISO8859/8859-8
ISO8859/8859-9                 VENDORS/MICSFT/EBCDIC/CP037    VENDORS/MICSFT/EBCDIC/CP1026
VENDORS/MICSFT/EBCDIC/CP500    VENDORS/MICSFT/MAC/CYRILLIC    VENDORS/MICSFT/MAC/GREEK
VENDORS/MICSFT/MAC/ICELAND     VENDORS/MICSFT/MAC/LATIN2      VENDORS/MICSFT/MAC/ROMAN
VENDORS/MICSFT/MAC/TURKISH     VENDORS/MICSFT/PC/CP437        VENDORS/MICSFT/PC/CP737
VENDORS/MICSFT/PC/CP775        VENDORS/MICSFT/PC/CP850        VENDORS/MICSFT/PC/CP852
VENDORS/MICSFT/PC/CP855        VENDORS/MICSFT/PC/CP857        VENDORS/MICSFT/PC/CP860
VENDORS/MICSFT/PC/CP861        VENDORS/MICSFT/PC/CP862        VENDORS/MICSFT/PC/CP863
VENDORS/MICSFT/PC/CP864        VENDORS/MICSFT/PC/CP865        VENDORS/MICSFT/PC/CP866
VENDORS/MICSFT/PC/CP869        VENDORS/MICSFT/PC/CP874        VENDORS/MICSFT/WINDOWS/CP1250
VENDORS/MICSFT/WINDOWS/CP1251  VENDORS/MICSFT/WINDOWS/CP1252  VENDORS/MICSFT/WINDOWS/CP1253
VENDORS/MICSFT/WINDOWS/CP1254  VENDORS/MICSFT/WINDOWS/CP1255  VENDORS/MICSFT/WINDOWS/CP1256
VENDORS/MICSFT/WINDOWS/CP1257  VENDORS/MICSFT/WINDOWS/CP1258  VENDORS/MICSFT/WINDOWS/CP874
VENDORS/MICSFT/WINDOWS/CP932   VENDORS/MICSFT/WINDOWS/CP936   VENDORS/MICSFT/WINDOWS/CP949
VENDORS/MICSFT/WINDOWS/CP950   VENDORS/MISC/ATARIST           VENDORS/MISC/CP424
VENDORS/MISC/CP856             VENDORS/MISC/KOI8-R            VENDORS/MISC/KOI8-U
VENDORS/MISC/KPS9566           VENDORS/MISC/KZ1048            VENDORS/MISC/US-ASCII-QUOTES

For more information about configuring encodings, refer to Codepagex.

For a list of shorthand names, see aliases/1.
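
The difference between encoding_list() and encoding_list(:all) described above can be used to tell a misspelled name from one that simply was not compiled in. The following is a minimal sketch that assumes the :codepagex dependency is available; the EncodingCheck module and its status/1 function are hypothetical names introduced for illustration.

```elixir
defmodule EncodingCheck do
  # Hypothetical helper; assumes the :codepagex dependency is available.

  # Returns :available if the encoding was compiled in, :not_compiled if
  # it is known but not enabled in the configuration, :unknown otherwise.
  def status(name) when is_binary(name) do
    cond do
      name in Codepagex.encoding_list() -> :available
      name in Codepagex.encoding_list(:all) -> :not_compiled
      true -> :unknown
    end
  end
end
```

A :not_compiled result suggests adding the encoding to config/config.exs and recompiling the dependency.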

from_string(string, encoding)

@spec from_string(String.t(), encoding()) :: {:ok, binary()} | {:error, term()}

Converts an Elixir string in utf-8 encoding to a binary in another encoding.

The encoding parameter should be in encoding_list/0 as an atom or String, or in aliases/0.

Examples

iex> from_string("HÉ¦¦Ó", :iso_8859_1)
{:ok, <<72, 201, 166, 166, 211>>}

iex> from_string("HÉ¦¦Ó", :"ISO8859/8859-1") # without alias
{:ok, <<72, 201, 166, 166, 211>>}

iex> from_string("ʒ", :iso_8859_1)
{:error, "Invalid bytes for encoding"}

from_string(string, encoding, missing_fun, acc \\ nil)

@spec from_string(binary(), encoding(), from_s_missing_outer(), term()) ::
  {:ok, String.t(), integer()} | {:error, term(), integer()}

Convert an Elixir String in utf-8 to a binary in a specified encoding. A function parameter specifies how to deal with codepoints that are not representable in the target encoding.

Compared to from_string/2, you may pass a missing_fun function parameter to handle encoding errors in string. The function replace_nonexistent/1 may be used as a default error handling mechanism.

The encoding parameter should be in encoding_list/0 as an atom or String, or in aliases/0.

Implementing missing_fun

The missing_fun must be an anonymous function that returns a second function. The outer function will receive the encoding used by from_string/4, and must then return {:ok, inner_function} or {:error, reason}. Returning {:error, reason} will cause from_string/4 to fail.

The returned inner function must receive two arguments.

  • a String containing the remainder of the string parameter that is still unprocessed.
  • the accumulator acc

The return value must be

  • {:ok, replacement, new_rest, new_acc} to continue processing
  • {:error, reason, new_acc} to cause from_string/4 to fail

The acc parameter from from_string/4 is passed between every invocation of the inner function and is then returned by from_string/4. In many use cases, acc may be ignored.

Examples

Using the replace_nonexistent/1 function to handle invalid bytes:

iex> from_string("Hello æøå!", :ascii, replace_nonexistent("_"))
{:ok, "Hello ___!", 3}

Defining a custom missing_fun:

iex> missing_fun =
...>   fn encoding ->
...>     case from_string("#", encoding) do
...>       {:ok, replacement} ->
...>         inner_fun =
...>           fn <<_ :: utf8, rest :: binary>>, acc ->
...>             {:ok, replacement, rest, acc + 1}
...>           end
...>         {:ok, inner_fun}
...>       err ->
...>         err
...>     end
...>   end
iex> from_string("Hello æøå!", :ascii, missing_fun, 0)
{:ok, "Hello ###!", 3}

The previous code was included for completeness. If you know your replacement is valid in the target encoding, you might as well do:

iex> missing_fun = fn _encoding ->
...>   inner_fun =
...>     fn <<_ :: utf8, rest :: binary>>, acc ->
...>       {:ok, "#", rest, acc + 1}
...>     end
...>   {:ok, inner_fun}
...> end
iex> from_string("Hello æøå!", :ascii, missing_fun, 10)
{:ok, "Hello ###!", 13}
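
The inner function is free to choose a different replacement for each codepoint. As a further sketch of the contract described above, the following transliterates a few characters instead of using a fixed placeholder. The TranslitSketch module and its replacement table are hypothetical and chosen only for this example; the replacements are assumed to be valid in the target encoding.

```elixir
defmodule TranslitSketch do
  # Hypothetical module for illustration; not part of Codepagex.
  # The replacement table is an assumption chosen for this example.
  @replacements %{"æ" => "ae", "ø" => "oe", "å" => "aa"}

  # Outer function: receives the target encoding and returns the
  # inner function wrapped in an :ok tuple.
  def missing_fun do
    fn _encoding -> {:ok, &replace/2} end
  end

  # Inner function: receives the unprocessed rest of the string and the
  # accumulator. It consumes one codepoint, emits a plain ASCII
  # replacement (falling back to "?"), and counts the replacement.
  defp replace(<<codepoint::utf8, rest::binary>>, acc) do
    replacement = Map.get(@replacements, <<codepoint::utf8>>, "?")
    {:ok, replacement, rest, acc + 1}
  end
end
```

It would be passed as from_string("Hello æøå!", :ascii, TranslitSketch.missing_fun(), 0), with 0 as the initial replacement count.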

from_string!(binary, encoding)

@spec from_string!(String.t(), encoding()) :: binary() | no_return()

Like from_string/2 but raising exceptions on errors.

Examples

iex> from_string!("HÉ¦¦Ó", :iso_8859_1)
<<72, 201, 166, 166, 211>>

iex> from_string!("ʒ", :iso_8859_1)
** (Codepagex.Error) Invalid bytes for encoding

from_string!(string, encoding, missing_fun, acc \\ nil)

@spec from_string!(String.t(), encoding(), from_s_missing_outer(), term()) ::
  binary() | no_return()

Like from_string/4 but raising exceptions on errors.

Examples

iex> missing_fun = replace_nonexistent("_")
iex> from_string!("Hello æøå!", :ascii, missing_fun)
"Hello ___!"

load_atoms()

replace_nonexistent(replace_with)

@spec replace_nonexistent(String.t()) :: from_s_missing_outer()

This function may be used in conjunction with from_string/4 or from_string!/4. If there are utf-8 codepoints in the source string that cannot be represented in the target encoding, they are replaced with the given String.

When using this function, from_string/4 will never return an error if replace_with converts to the target encoding without errors.

The accumulator input acc of from_string/4 is incremented on each replacement done.

Examples

iex> from_string!("Hello æøå!", :ascii, replace_nonexistent("_"))
"Hello ___!"

iex> from_string("Hello æøå!", :ascii, replace_nonexistent("_"), 100)
{:ok, "Hello ___!", 103}

to_string(binary, encoding)

@spec to_string(binary(), encoding()) :: {:ok, String.t()} | {:error, term()}

Converts a binary in a specified encoding to an Elixir string in utf-8 encoding.

The encoding parameter should be in encoding_list/0 (passed as atoms or strings), or in aliases/0.

Examples

iex> to_string(<<72, 201, 166, 166, 211>>, :iso_8859_1)
{:ok, "HÉ¦¦Ó"}

iex> to_string(<<128>>, "ETSI/GSM0338")
{:error, "Invalid bytes for encoding"}

to_string(binary, encoding, missing_fun, acc \\ nil)

@spec to_string(binary(), encoding(), to_s_missing_outer(), term()) ::
  {:ok, String.t(), integer()} | {:error, term(), integer()}

Convert a binary in a specified encoding into an Elixir string in utf-8 encoding.

Compared to to_string/2, you may pass a missing_fun function parameter to handle encoding errors in the binary. The function use_utf_replacement/0 may be used as a default error handling mechanism.

Implementing missing_fun

The missing_fun must be an anonymous function that returns a second function. The outer function will receive the encoding used by to_string/4, and must then return {:ok, inner_function} or {:error, reason}. Returning {:error, reason} will cause to_string/4 to fail.

The returned inner function must receive two arguments.

  • a binary containing the remainder of the binary parameter that is still unprocessed.
  • the accumulator acc

The return value must be

  • {:ok, replacement, new_rest, new_acc} to continue processing
  • {:error, reason, new_acc} to cause to_string/4 to fail

The acc parameter from to_string/4 is passed between every invocation of the inner function then returned by to_string/4. In many use cases, acc may be ignored.

Examples

Using the use_utf_replacement/0 function to handle invalid bytes:

iex> iso = "Hello æøå!" |> from_string!(:iso_8859_1)
iex> to_string(iso, :ascii, use_utf_replacement())
{:ok, "Hello ���!", 3}

Defining a custom missing_fun:

iex> iso = "Hello æøå!" |> from_string!(:iso_8859_1)
iex> missing_fun =
...>   fn encoding ->
...>     case to_string("#", encoding) do
...>       {:ok, replacement} ->
...>         inner_fun =
...>           fn <<_, rest :: binary>>, acc ->
...>             {:ok, replacement, rest, acc + 1}
...>           end
...>         {:ok, inner_fun}
...>       err ->
...>         err
...>     end
...>   end
iex> to_string(iso, :ascii, missing_fun, 0)
{:ok, "Hello ###!", 3}

The previous code was included for completeness. If you know your replacement is valid in the target encoding, you might as well do:

iex> iso = "Hello æøå!" |> from_string!(:iso_8859_1)
iex> missing_fun =
...>   fn _encoding ->
...>     inner_fun =
...>       fn <<_, rest :: binary>>, acc ->
...>         {:ok, "#", rest, acc + 1}
...>       end
...>     {:ok, inner_fun}
...>   end
iex> to_string(iso, :ascii, missing_fun, 10)
{:ok, "Hello ###!", 13}
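
The accumulator does not have to be a counter. As a sketch under the contract described above, the inner function below collects the offending bytes in a list. The CollectSketch module is hypothetical, and using a list as acc is an assumption based on the term() type of the acc argument; the return spec of to_string/4 mentions integer().

```elixir
defmodule CollectSketch do
  # Hypothetical module for illustration; not part of Codepagex.

  # Outer function: ignores the encoding and returns the inner function.
  def missing_fun do
    fn _encoding -> {:ok, &skip_and_collect/2} end
  end

  # Inner function: drops one undecodable byte from the input, emits "?"
  # as its utf-8 replacement, and prepends the byte to the accumulator.
  defp skip_and_collect(<<byte, rest::binary>>, acc) do
    {:ok, "?", rest, [byte | acc]}
  end
end
```

It would be passed as to_string(iso, :ascii, CollectSketch.missing_fun(), []). Because bytes are prepended, the collected list ends up in reverse order of occurrence.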

to_string!(binary, encoding)

@spec to_string!(binary(), encoding()) :: String.t() | no_return()

Like to_string/2 but raises exceptions on errors.

Examples

iex> to_string!(<<72, 201, 166, 166, 211>>, :iso_8859_1)
"HÉ¦¦Ó"

iex> to_string!(<<128>>, "ETSI/GSM0338")
** (Codepagex.Error) Invalid bytes for encoding

to_string!(binary, encoding, missing_fun, acc \\ nil)

@spec to_string!(binary(), encoding(), to_s_missing_outer(), term()) ::
  String.t() | no_return()

Like to_string/4 but raises exceptions on errors.

Examples

iex> iso = "Hello æøå!" |> from_string!(:iso_8859_1)
iex> to_string!(iso, :ascii, use_utf_replacement())
"Hello ���!"

translate(binary, encoding_from, encoding_to)

@spec translate(binary(), encoding(), encoding()) ::
  {:ok, binary()} | {:error, term()}

Convert a binary in one encoding to a binary in another encoding. The string is converted to utf-8 internally in the process.

The encoding parameters should be in encoding_list/0 or aliases/0. They may be passed as atoms, or as strings for full encoding names.

Examples

iex> translate(<<174>>, :iso_8859_1, :iso_8859_15)
{:ok, <<174>>}

iex> translate(<<174>>, :iso_8859_1, :iso_8859_2)
{:error, "Invalid bytes for encoding"}
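
The two-step conversion through utf-8 described above can be pictured as the following composition of to_string/2 and from_string/2. This is only a sketch of the documented behaviour, assuming the :codepagex dependency is available; the TranslateSketch module is hypothetical and this is not necessarily how translate/3 is implemented.

```elixir
defmodule TranslateSketch do
  # Hypothetical module; assumes the :codepagex dependency is available.
  # Mirrors the documented behaviour of translate/3 and is not
  # necessarily how Codepagex implements it.
  def translate(binary, encoding_from, encoding_to) do
    # Decode to utf-8 first; an {:error, reason} tuple falls through.
    with {:ok, utf8} <- Codepagex.to_string(binary, encoding_from) do
      # Then encode the utf-8 string into the target encoding.
      Codepagex.from_string(utf8, encoding_to)
    end
  end
end
```

An error can therefore come from either step: bytes that are invalid in the source encoding, or characters that the target encoding cannot represent.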

translate!(binary, encoding_from, encoding_to)

@spec translate!(binary(), encoding(), encoding()) :: binary()

Like translate/3 but raises exceptions on errors.

Examples

iex> translate!(<<174>>, :iso_8859_1, :iso_8859_15)
<<174>>

iex> translate!(<<174>>, :iso_8859_1, :iso_8859_2)
** (Codepagex.Error) Invalid bytes for encoding

use_utf_replacement()

@spec use_utf_replacement() :: to_s_missing_outer()

This function may be used as a parameter to to_string/4 or to_string!/4 so that any bytes in the input binary that don't have a proper encoding are replaced with the Unicode replacement character (U+FFFD, shown as � in the examples) and the function will not fail.

If this function is used, to_string/4 will never return an error.

The accumulator input acc of to_string/4 is incremented by the number of replacements made.

Examples

iex> iso = "Hello æøå!" |> from_string!(:iso_8859_1)
iex> to_string!(iso, :ascii, use_utf_replacement())
"Hello ���!"

iex> iso = "Hello æøå!" |> from_string!(:iso_8859_1)
iex> to_string(iso, :ascii, use_utf_replacement())
{:ok, "Hello ���!", 3}