Dicom.CharacterSet (Dicom v0.9.1)

DICOM Specific Character Set handling.

Supports decoding of text values according to the character set specified by tag (0008,0005) SpecificCharacterSet. See DICOM PS3.5 Section 6.1.

Supported Character Sets

Default character repertoire (ISO IR 6 / ASCII) — always supported
ISO_IR 100 (Latin-1 / ISO 8859-1)
ISO_IR 101 (Latin-2 / ISO 8859-2)
ISO_IR 109 (Latin-3 / ISO 8859-3)
ISO_IR 110 (Latin-4 / ISO 8859-4)
ISO_IR 144 (Cyrillic / ISO 8859-5)
ISO_IR 127 (Arabic / ISO 8859-6)
ISO_IR 126 (Greek / ISO 8859-7)
ISO_IR 138 (Hebrew / ISO 8859-8)
ISO_IR 148 (Latin-5 / ISO 8859-9)
ISO_IR 13 (JIS X 0201 — Roman + half-width Katakana)
ISO_IR 192 (UTF-8)
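To illustrate what decoding a single-byte repertoire from the list above involves, here is a minimal sketch for ISO_IR 100 and ISO_IR 13. SingleByteSketch and its functions are hypothetical names, not part of Dicom.CharacterSet, and the JIS X 0201 mapping (0x5C as the yen sign, 0x7E as the overline) follows the JIS X 0201 standard rather than anything this library is confirmed to do.

```elixir
defmodule SingleByteSketch do
  # Hypothetical helper module, not part of Dicom.CharacterSet.

  # ISO_IR 100: every Latin-1 byte equals its Unicode code point, so the
  # Erlang standard library can transcode it directly.
  def latin1(binary) when is_binary(binary) do
    case :unicode.characters_to_binary(binary, :latin1, :utf8) do
      utf8 when is_binary(utf8) -> {:ok, utf8}
      _other -> {:error, :invalid_latin1}
    end
  end

  # ISO_IR 13: bytes 0xA1..0xDF are half-width Katakana at a fixed offset
  # from U+FF61..U+FF9F; 0x5C and 0x7E differ from ASCII.
  def jis_x0201(binary) when is_binary(binary), do: decode_jis(binary, [])

  defp decode_jis(<<>>, acc), do: {:ok, acc |> Enum.reverse() |> List.to_string()}
  defp decode_jis(<<0x5C, rest::binary>>, acc), do: decode_jis(rest, [0x00A5 | acc])
  defp decode_jis(<<0x7E, rest::binary>>, acc), do: decode_jis(rest, [0x203E | acc])

  defp decode_jis(<<b, rest::binary>>, acc) when b < 0x80,
    do: decode_jis(rest, [b | acc])

  defp decode_jis(<<b, rest::binary>>, acc) when b in 0xA1..0xDF,
    do: decode_jis(rest, [b + 0xFEC0 | acc])

  # Bytes 0x80..0xA0 and 0xE0..0xFF are unassigned in JIS X 0201.
  defp decode_jis(<<b, _rest::binary>>, _acc), do: {:error, {:invalid_byte, b}}
end
```

The Latin-1 path never fails because every byte value is assigned, whereas the JIS X 0201 path has to reject unassigned bytes.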

ISO 2022 Code Extension Support

The labels ISO 2022 IR 6 and ISO 2022 IR 100 are accepted both with and without ISO 2022 escape sequences.

ISO 2022 escape sequence parsing is supported per DICOM PS3.5 Section 6.1.2.5. Multi-valued Specific Character Set declarations (e.g. "ISO 2022 IR 13\ISO 2022 IR 87") use escape sequences to switch between character repertoires within a single text value. The following ISO 2022 charsets are recognized:

ISO 2022 IR 6 (ASCII, G0)
ISO 2022 IR 13 (JIS X 0201 — Roman G0 + Katakana G1)
ISO 2022 IR 87 (JIS X 0208 — multi-byte Kanji/Kana)
ISO 2022 IR 100, 101, 109, 110, 126, 127, 138, 144, and 148 (ISO 8859 variants, G1)
ISO 2022 IR 149 (KS X 1001 — multi-byte Korean)
ISO 2022 IR 159 (JIS X 0212 — multi-byte, not yet decodable)
ISO 2022 IR 58 (GB2312-80 — multi-byte Simplified Chinese)
GB18030 (Chinese national standard, 1/2/4-byte variable-length encoding; used without ISO 2022 escape sequences)

Decoding status of the multi-byte charsets:

JIS X 0208 (ISO 2022 IR 87): fully decodable, using a 6879-entry lookup table from the Unicode Consortium's JIS0208.TXT mapping
GB2312-80 (ISO 2022 IR 58): fully decodable, using a 7478-entry lookup table from the Unicode Consortium's CP936.TXT mapping (GB2312 subset)
KS X 1001 (ISO 2022 IR 149): fully decodable, using an 8225-entry lookup table generated from Python's euc-kr codec
GB18030: fully decodable, using a 21791-entry GBK 2-byte lookup table plus algorithmic 4-byte decoding for BMP gaps and supplementary planes
JIS X 0212 (ISO 2022 IR 159): parsed at the escape-sequence level only; returns {:error, :not_yet_implemented} when actual decoding is needed

All other character sets return {:error, {:unsupported_charset, term}}.
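The algorithmic 4-byte GB18030 decoding mentioned above can be sketched for the supplementary planes, where the mapping is purely arithmetic. This is an illustration under stated assumptions, not the library's code: Gb18030FourByteSketch is a hypothetical module, and the 4-byte sequences that fill BMP gaps need an extra range table that is left out here.

```elixir
defmodule Gb18030FourByteSketch do
  # Hypothetical helper, not part of Dicom.CharacterSet. Covers only
  # U+10000..U+10FFFF; the BMP four-byte ranges are not handled.

  @first_supplementary 0x10000
  @last_supplementary 0x10FFFF

  def decode_supplementary(<<b1, b2, b3, b4>>)
      when b1 in 0x90..0xE3 and b2 in 0x30..0x39 and
             b3 in 0x81..0xFE and b4 in 0x30..0x39 do
    # The four bytes form a mixed-radix number: bytes 2 and 4 take 10
    # values each and byte 3 takes 126, giving weights 12600, 1260, 10, 1.
    linear =
      (b1 - 0x90) * 12_600 + (b2 - 0x30) * 1_260 + (b3 - 0x81) * 10 + (b4 - 0x30)

    code_point = @first_supplementary + linear

    if code_point <= @last_supplementary do
      {:ok, <<code_point::utf8>>}
    else
      {:error, :out_of_range}
    end
  end

  def decode_supplementary(_other), do: {:error, :not_a_supplementary_sequence}
end
```

For example, the bytes 0x90 0x30 0x81 0x30 map to U+10000 and 0xE3 0x32 0x9A 0x35 map to U+10FFFF, the two ends of the supplementary range.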

Summary

Types

charset()

Functions

decode(binary, charset)

Decodes a binary value according to the given character set.

decode_iso2022(binary, default_encoding)

Decodes a binary containing ISO 2022 escape sequences.

decode_lossy(binary, charset)

Decodes a binary value, returning the original binary on failure instead of an error.

extract(elements)

Extracts the primary character set from a parsed data set's elements map.

extract_all(elements)

Extracts all Specific Character Set values from a parsed data set's elements map.

supported?(charset)

Returns true if the given character set label is recognized by the decoder.

Types

charset()

@type charset() :: String.t()

Functions

decode(binary, charset)

@spec decode(binary(), charset() | nil) :: {:ok, String.t()} | {:error, term()}

Decodes a binary value according to the given character set.

If charset is nil or an empty string, the default character repertoire is assumed (ISO IR 6 / ASCII, which is a subset of Latin-1 and UTF-8).

Returns {:ok, string} or {:error, reason}.

Examples

iex> Dicom.CharacterSet.decode("JOHN", nil)
{:ok, "JOHN"}

iex> Dicom.CharacterSet.decode(<<0xC4, 0xD6, 0xDC>>, "ISO_IR 100")
{:ok, "ÄÖÜ"}
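Callers usually need to tell the documented result shapes apart. The following sketch shows one way to do that; DecodeResultSketch and describe/1 are hypothetical names, and only the tuple shapes ({:ok, string}, {:error, :not_yet_implemented}, {:error, {:unsupported_charset, term}}) are taken from this page.

```elixir
defmodule DecodeResultSketch do
  # Hypothetical helper; only the tuple shapes come from the documentation.

  def describe({:ok, text}) when is_binary(text), do: "decoded: " <> text

  # The label is known, but its multi-byte table is missing.
  def describe({:error, :not_yet_implemented}),
    do: "charset recognized, but decoding is not implemented"

  # The label is not recognized at all.
  def describe({:error, {:unsupported_charset, label}}),
    do: "unsupported charset: " <> inspect(label)

  # Any other reason, since the spec allows {:error, term()}.
  def describe({:error, reason}), do: "decode failed: " <> inspect(reason)
end
```

With the library loaded, the result of Dicom.CharacterSet.decode(bytes, "ISO_IR 100") can be passed straight to DecodeResultSketch.describe/1.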

decode_iso2022(binary, default_encoding)

@spec decode_iso2022(binary(), atom() | tuple()) ::
  {:ok, String.t()} | {:error, term()}

Decodes a binary containing ISO 2022 escape sequences.

Takes a binary and a default encoding (derived from the first value of a multi-valued Specific Character Set). Parses ESC sequences per DICOM PS3.3 Tables C.12-3 and C.12-4, splits the text into segments at each escape sequence, and decodes each segment with the charset that the preceding sequence designates.

Returns {:ok, utf8_string} or {:error, reason}.

Examples

iex> Dicom.CharacterSet.decode_iso2022("HELLO", :ascii)
{:ok, "HELLO"}

iex> Dicom.CharacterSet.decode_iso2022(<<0xB1, 0xB6>>, :jis_x0201)
{:ok, "ｱｶ"}
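The segment-splitting step described for decode_iso2022/2 can be sketched as follows. This is a simplified illustration, not the library's parser: EscapeSplitSketch and the charset atoms other than :ascii are hypothetical, only a subset of escape sequences is listed, and a single active charset is tracked instead of separate G0 and G1 sets.

```elixir
defmodule EscapeSplitSketch do
  # Hypothetical helper, not the library's parser. Returns the raw bytes of
  # each segment tagged with the charset that was active for it.

  def split(binary, default) when is_binary(binary) and is_atom(default) do
    do_split(binary, default, <<>>, [])
  end

  defp do_split(<<>>, charset, current, acc) do
    {:ok, Enum.reverse(push(acc, charset, current))}
  end

  # ESC (0x1B) closes the current segment and designates a new charset.
  defp do_split(<<0x1B, rest::binary>>, charset, current, acc) do
    case designate(rest) do
      {:ok, next, tail} -> do_split(tail, next, <<>>, push(acc, charset, current))
      :error -> {:error, :unknown_escape_sequence}
    end
  end

  defp do_split(<<byte, rest::binary>>, charset, current, acc) do
    do_split(rest, charset, <<current::binary, byte>>, acc)
  end

  # Empty segments (two escape sequences in a row) are dropped.
  defp push(acc, _charset, <<>>), do: acc
  defp push(acc, charset, bytes), do: [{charset, bytes} | acc]

  # Final bytes of the escape sequences for some DICOM defined terms.
  defp designate(<<"(B", tail::binary>>), do: {:ok, :ascii, tail}
  defp designate(<<"(J", tail::binary>>), do: {:ok, :jis_x0201_roman, tail}
  defp designate(<<")I", tail::binary>>), do: {:ok, :jis_x0201_katakana, tail}
  defp designate(<<"-A", tail::binary>>), do: {:ok, :latin1, tail}
  defp designate(<<"$B", tail::binary>>), do: {:ok, :jis_x0208, tail}
  defp designate(<<"$(D", tail::binary>>), do: {:ok, :jis_x0212, tail}
  defp designate(<<"$)C", tail::binary>>), do: {:ok, :ks_x1001, tail}
  defp designate(<<"$)A", tail::binary>>), do: {:ok, :gb2312, tail}
  defp designate(_other), do: :error
end
```

Each returned segment would then be handed to the decoder for its charset, which is where the lookup tables come in.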

decode_lossy(binary, charset)

@spec decode_lossy(binary(), charset() | nil) :: binary()

Decodes a binary value, returning the original binary on failure instead of an error.

This is a convenience function for the parser, which attempts charset decoding but falls back to the undecoded bytes rather than failing. A successful decode returns a UTF-8 Elixir string; a failed decode returns the original binary unchanged, so the result is not guaranteed to be valid UTF-8.
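The fallback behaviour can be sketched as a thin wrapper around any decoder that returns {:ok, string} or {:error, reason}. LossySketch is a hypothetical module, and the decoder is passed in as a function so the example does not assume anything about the library's internals.

```elixir
defmodule LossySketch do
  # Hypothetical wrapper; `decoder` stands in for any one-argument function
  # that returns {:ok, string} or {:error, reason}.
  def decode_lossy(binary, decoder)
      when is_binary(binary) and is_function(decoder, 1) do
    case decoder.(binary) do
      # Decoding worked: hand back the UTF-8 string.
      {:ok, string} -> string
      # Decoding failed: keep the original bytes untouched.
      {:error, _reason} -> binary
    end
  end
end
```

Because the fallback returns raw bytes, a caller that needs printable text may want to check the result with String.valid?/1 before displaying it.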

extract(elements)

@spec extract(map()) :: charset() | nil

Extracts the primary character set from a parsed data set's elements map.

Returns the first (or only) character set value, or nil if absent. Use extract_all/1 when you need the full Specific Character Set list.

extract_all(elements)

@spec extract_all(map()) :: [charset()]

Extracts all Specific Character Set values from a parsed data set's elements map.
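For context, a multi-valued Specific Character Set is stored as one backslash-separated string. The sketch below splits such a raw value; CharsetValueSketch is a hypothetical helper that works on the raw (0008,0005) string, not on the library's elements map, and it does not claim to reproduce what extract_all/1 returns.

```elixir
defmodule CharsetValueSketch do
  # Hypothetical helper operating on the raw (0008,0005) value.
  def split_values(raw) when is_binary(raw) do
    raw
    # DICOM separates multiple values with a backslash.
    |> String.split("\\")
    # Values may carry space padding, which is not part of the label.
    |> Enum.map(&String.trim/1)
  end
end
```

An empty first value, as in "\ISO 2022 IR 87", is legal in DICOM and means the default repertoire (ISO 2022 IR 6) applies until an escape sequence switches it.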

supported?(charset)

@spec supported?(charset() | nil) :: boolean()

Returns true if the given character set label is recognized by the decoder.

ISO 2022 labels are recognized even when their multi-byte lookup tables are not yet implemented (currently only "ISO 2022 IR 159"). decode/2 returns {:error, :not_yet_implemented} for those, but supported?/1 returns true because the charset is known and the escape-sequence infrastructure exists.