# `Dicom.CharacterSet`
[🔗](https://github.com/Balneario-de-Cofrentes/dicom/blob/v0.9.1/lib/dicom/character_set.ex#L1)

DICOM Specific Character Set handling.

Supports decoding of text values according to the character set specified
by tag (0008,0005) SpecificCharacterSet. See DICOM PS3.5 Section 6.1.

## Supported Character Sets

- Default character repertoire (ISO IR 6 / ASCII) — always supported
- `ISO_IR 100` (Latin-1 / ISO 8859-1)
- `ISO_IR 101` (Latin-2 / ISO 8859-2)
- `ISO_IR 109` (Latin-3 / ISO 8859-3)
- `ISO_IR 110` (Latin-4 / ISO 8859-4)
- `ISO_IR 144` (Cyrillic / ISO 8859-5)
- `ISO_IR 127` (Arabic / ISO 8859-6)
- `ISO_IR 126` (Greek / ISO 8859-7)
- `ISO_IR 138` (Hebrew / ISO 8859-8)
- `ISO_IR 148` (Latin-5 / ISO 8859-9)
- `ISO_IR 13` (JIS X 0201 — Roman + half-width Katakana)
- `ISO_IR 192` (UTF-8)
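
To make the `ISO_IR 13` entry concrete, here is a minimal, standalone sketch of a JIS X 0201 mapping. It is not the library's implementation: the module name `JisX0201Sketch` is invented for illustration, and whether `Dicom.CharacterSet` maps `0x5C` and `0x7E` to the yen sign and overline is an assumption based on the JIS X 0201 Roman set.

```elixir
defmodule JisX0201Sketch do
  # Hypothetical helper: decodes JIS X 0201 bytes into a UTF-8 string.
  def decode(binary) when is_binary(binary), do: decode(binary, [])

  defp decode(<<>>, acc), do: {:ok, acc |> Enum.reverse() |> List.to_string()}
  # The Roman set differs from ASCII in two positions (assumed handling).
  defp decode(<<0x5C, rest::binary>>, acc), do: decode(rest, [0x00A5 | acc])
  defp decode(<<0x7E, rest::binary>>, acc), do: decode(rest, [0x203E | acc])
  # Every other 7-bit byte is the same as ASCII.
  defp decode(<<b, rest::binary>>, acc) when b < 0x80, do: decode(rest, [b | acc])
  # Half-width Katakana: 0xA1..0xDF maps onto U+FF61..U+FF9F.
  defp decode(<<b, rest::binary>>, acc) when b in 0xA1..0xDF,
    do: decode(rest, [b + 0xFEC0 | acc])
  # Anything else is outside the repertoire.
  defp decode(<<b, _::binary>>, _acc), do: {:error, {:invalid_byte, b}}
end

IO.inspect(JisX0201Sketch.decode(<<0xB1, 0xB6>>))
```

The Katakana range is a fixed offset from the Unicode half-width block, so no lookup table is needed for this charset.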

## ISO 2022 Code Extension Support

Values declared with the labels `ISO 2022 IR 6` or `ISO 2022 IR 100` are decoded
whether or not the text actually contains ISO 2022 escape sequences.

ISO 2022 escape sequence parsing is supported per DICOM PS3.5 Section 6.1.2.5.
Multi-valued Specific Character Set declarations (e.g. `"ISO 2022 IR 13\ISO 2022 IR 87"`)
use escape sequences to switch between character repertoires within a single
text value. The following ISO 2022 charsets are recognized:

- `ISO 2022 IR 6` (ASCII, G0)
- `ISO 2022 IR 13` (JIS X 0201 — Roman G0 + Katakana G1)
- `ISO 2022 IR 87` (JIS X 0208 — multi-byte Kanji/Kana)
- `ISO 2022 IR 100`, `101`, `109`, `110`, `126`, `127`, `138`, `144`, and `148` (the ISO 8859 variants listed above, G1)
- `ISO 2022 IR 149` (KS X 1001 — multi-byte Korean)
- `ISO 2022 IR 159` (JIS X 0212 — multi-byte, not yet decodable)
- `ISO 2022 IR 58` (GB2312-80 — multi-byte Simplified Chinese)
- `GB18030` (Chinese national standard, 1/2/4-byte variable-length encoding; used without ISO 2022 escape sequences)

Decoding coverage for the multi-byte charsets:

- JIS X 0208 (`ISO 2022 IR 87`): fully decodable, using a 6879-entry lookup
  table from the Unicode Consortium's JIS0208.TXT mapping.
- GB2312-80 (`ISO 2022 IR 58`): fully decodable, using a 7478-entry lookup
  table from the Unicode Consortium's CP936.TXT mapping (GB2312 subset).
- KS X 1001 (`ISO 2022 IR 149`): fully decodable, using an 8225-entry lookup
  table generated from Python's euc-kr codec.
- GB18030: fully decodable, using a 21791-entry GBK 2-byte lookup table plus
  algorithmic 4-byte decoding for BMP gaps and supplementary planes.
- JIS X 0212 (`ISO 2022 IR 159`): parsed at the escape-sequence level, but
  returns `{:error, :not_yet_implemented}` when actual decoding is needed.
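
The algorithmic 4-byte decoding mentioned for GB18030 can be illustrated for the supplementary planes, where the mapping is purely arithmetic. The sketch below is standalone and assumes nothing about the library's internals. `GB18030SupplementarySketch` is a made-up name, and the BMP-gap ranges (which need a range table) are deliberately left out.

```elixir
defmodule GB18030SupplementarySketch do
  # Hypothetical helper: decodes one 4-byte GB18030 sequence that encodes a
  # supplementary-plane code point (U+10000..U+10FFFF).
  def decode4(<<b1, b2, b3, b4>>)
      when b1 in 0x90..0xE3 and b2 in 0x30..0x39 and
             b3 in 0x81..0xFE and b4 in 0x30..0x39 do
    # Bytes 2 and 4 have 10 possible values, byte 3 has 126, so the
    # positional weights are 10 * 126 * 10, 126 * 10, 10, and 1.
    linear =
      (b1 - 0x90) * 12_600 + (b2 - 0x30) * 1_260 + (b3 - 0x81) * 10 + (b4 - 0x30)

    code_point = 0x10000 + linear

    if code_point <= 0x10FFFF do
      {:ok, <<code_point::utf8>>}
    else
      {:error, :out_of_range}
    end
  end

  def decode4(_other), do: {:error, :not_a_supplementary_sequence}
end

IO.inspect(GB18030SupplementarySketch.decode4(<<0x90, 0x30, 0x81, 0x30>>))
```

The first valid sequence, `90 30 81 30`, lands on U+10000, and `E3 32 9A 35` lands on U+10FFFF, which is why the result is range-checked before encoding.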

Any other character set label results in `{:error, {:unsupported_charset, term}}`.

# `charset`

```elixir
@type charset() :: String.t()
```

# `decode`

```elixir
@spec decode(binary(), charset() | nil) :: {:ok, String.t()} | {:error, term()}
```

Decodes a binary value according to the given character set.

If `charset` is nil or empty, the default character repertoire is assumed
(ISO IR 6 / ASCII, which is a subset of Latin-1 and UTF-8).

Returns `{:ok, string}` or `{:error, reason}`.

## Examples

    iex> Dicom.CharacterSet.decode("JOHN", nil)
    {:ok, "JOHN"}

    iex> Dicom.CharacterSet.decode(<<0xC4, 0xD6, 0xDC>>, "ISO_IR 100")
    {:ok, "ÄÖÜ"}
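
For intuition about the `ISO_IR 100` example: ISO 8859-1 bytes map one-to-one onto the first 256 Unicode code points, so the same result can be reproduced with Erlang's standard `:unicode` module. This is a standalone sketch, not the library's implementation, and the module name `Latin1Sketch` is invented for illustration.

```elixir
defmodule Latin1Sketch do
  # Hypothetical helper: converts ISO 8859-1 bytes to a UTF-8 string.
  def decode(binary) when is_binary(binary) do
    case :unicode.characters_to_binary(binary, :latin1, :utf8) do
      utf8 when is_binary(utf8) -> {:ok, utf8}
      # :unicode reports failures as tuples; pass them through as an error.
      other -> {:error, other}
    end
  end
end

IO.inspect(Latin1Sketch.decode(<<0xC4, 0xD6, 0xDC>>))
```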

# `decode_iso2022`

```elixir
@spec decode_iso2022(binary(), atom() | tuple()) ::
  {:ok, String.t()} | {:error, term()}
```

Decodes a binary containing ISO 2022 escape sequences.

Takes a binary and a default encoding (derived from the first value of a
multi-valued Specific Character Set). Parses ESC sequences per
DICOM PS3.3 Tables C.12-3 and C.12-4, splits the text into segments, and
decodes each segment with the appropriate charset.

Returns `{:ok, utf8_string}` or `{:error, reason}`.

## Examples

    iex> Dicom.CharacterSet.decode_iso2022("HELLO", :ascii)
    {:ok, "HELLO"}

    iex> Dicom.CharacterSet.decode_iso2022(<<0xB1, 0xB6>>, :jis_x0201)
    {:ok, "ｱｶ"}
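
The segmentation step described above can be sketched as follows. This is a simplified, standalone illustration and not the library's parser: it tracks a single active charset instead of separate G0 and G1 sets, it does not reset the state at delimiters, and it knows only a handful of escape sequences. The module name and the charset atoms other than `:ascii` are hypothetical.

```elixir
defmodule Iso2022SegmentSketch do
  # Splits `binary` into {charset, bytes} segments, starting in `default`.
  def segments(binary, default) when is_binary(binary) do
    walk(binary, default, <<>>, [])
  end

  # End of input: flush the pending segment and restore original order.
  defp walk(<<>>, current, acc, out) do
    {:ok, Enum.reverse(push(out, current, acc))}
  end

  # ESC byte: close the current segment and switch charset.
  defp walk(<<0x1B, _::binary>> = input, current, acc, out) do
    case escape(input) do
      {charset, rest} -> walk(rest, charset, <<>>, push(out, current, acc))
      :unknown -> {:error, :unknown_escape_sequence}
    end
  end

  # Ordinary byte: append it to the pending segment.
  defp walk(<<byte, rest::binary>>, current, acc, out) do
    walk(rest, current, <<acc::binary, byte>>, out)
  end

  # Empty segments are dropped.
  defp push(out, _charset, <<>>), do: out
  defp push(out, charset, bytes), do: [{charset, bytes} | out]

  # Escape sequences from the DICOM standard; atoms are illustrative labels.
  defp escape(<<0x1B, 0x24, 0x28, 0x44, rest::binary>>), do: {:jis_x0212, rest}
  defp escape(<<0x1B, 0x24, 0x29, 0x43, rest::binary>>), do: {:ks_x1001, rest}
  defp escape(<<0x1B, 0x24, 0x29, 0x41, rest::binary>>), do: {:gb2312, rest}
  defp escape(<<0x1B, 0x24, 0x42, rest::binary>>), do: {:jis_x0208, rest}
  defp escape(<<0x1B, 0x28, 0x42, rest::binary>>), do: {:ascii, rest}
  defp escape(<<0x1B, 0x28, 0x4A, rest::binary>>), do: {:jis_x0201_roman, rest}
  defp escape(<<0x1B, 0x29, 0x49, rest::binary>>), do: {:jis_x0201_katakana, rest}
  defp escape(_other), do: :unknown
end

input = "Yamada" <> <<0x1B, 0x24, 0x42, 0x3B, 0x33, 0x1B, 0x28, 0x42>> <> "^"
IO.inspect(Iso2022SegmentSketch.segments(input, :ascii))
```

Each resulting segment can then be handed to a per-charset decoder, which is where a lookup table such as the JIS X 0208 one would come into play.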

# `decode_lossy`

```elixir
@spec decode_lossy(binary(), charset() | nil) :: binary()
```

Decodes a binary value, returning the original binary on failure instead of an error.

This is a convenience function for use in the parser where we want to
attempt charset decoding but fall back to the undecoded bytes rather than
failing. Successful decodes return a UTF-8 Elixir string; failed decodes
return the original binary unchanged.
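
The fallback behaviour amounts to a small wrapper around a strict decoder. The sketch below shows the pattern in isolation. The `decoder` argument is a hypothetical stand-in for any 2-arity function with the same return shape as `decode/2`, and `LossyFallbackSketch` is an invented name.

```elixir
defmodule LossyFallbackSketch do
  # `decoder` is any function returning {:ok, string} | {:error, reason}.
  def decode_lossy(binary, charset, decoder) when is_function(decoder, 2) do
    case decoder.(binary, charset) do
      {:ok, string} -> string
      # On any failure, hand back the input bytes untouched.
      {:error, _reason} -> binary
    end
  end
end

# Stub decoder for demonstration: accepts only valid UTF-8 input.
stub = fn bin, _charset ->
  if String.valid?(bin), do: {:ok, bin}, else: {:error, :invalid}
end

IO.inspect(LossyFallbackSketch.decode_lossy(<<0xFF, 0xFE>>, nil, stub))
```

Because the return value is a plain binary in both cases, callers cannot tell a successful decode from a fallback; use `decode/2` when that distinction matters.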

# `extract`

```elixir
@spec extract(map()) :: charset() | nil
```

Extracts the primary character set from a parsed data set's elements map.

Returns the first (or only) character set value, or nil if absent.
Use `extract_all/1` when you need the full Specific Character Set list.

# `extract_all`

```elixir
@spec extract_all(map()) :: [charset()]
```

Extracts all Specific Character Set values from a parsed data set's elements map.
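
How the library reads the elements map is not shown here, but the relationship between the raw attribute value and the label list can be sketched on its own. DICOM separates multiple values with a backslash and pads values with trailing spaces. The module below is hypothetical, and treating an empty first value as `nil` is an assumption made for this illustration.

```elixir
defmodule CharsetValueSketch do
  # Splits a raw Specific Character Set value into its individual labels.
  # An empty component is kept as "" so positions are preserved.
  def split(raw) when is_binary(raw) do
    raw
    |> String.split("\\")
    |> Enum.map(&String.trim/1)
  end

  # Returns the first label, or nil when the first component is empty
  # (assumed here to mean the default repertoire).
  def primary(raw) do
    case split(raw) do
      ["" | _] -> nil
      [first | _] -> first
    end
  end
end

IO.inspect(CharsetValueSketch.split("ISO 2022 IR 13\\ISO 2022 IR 87"))
```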

# `supported?`

```elixir
@spec supported?(charset() | nil) :: boolean()
```

Returns `true` if the given character set label is recognized by the decoder.

ISO 2022 labels (e.g. `"ISO 2022 IR 159"`) are recognized even when their
multi-byte lookup tables are not yet implemented. `decode/2` will return
`{:error, :not_yet_implemented}` for those, but `supported?/1` returns `true`
because the charset is known and the escape-sequence infrastructure exists.
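
Since a recognized label does not guarantee a successful decode, callers may want to tell the documented result shapes apart. The following standalone sketch classifies them by pattern matching. `DecodeResultSketch` and the atoms it returns are invented for illustration, and only the input tuples come from this documentation.

```elixir
defmodule DecodeResultSketch do
  # Classifies a result tuple of the shape documented for decode/2.
  def describe({:ok, text}) when is_binary(text), do: {:decoded, text}
  # Charset is known, but its lookup table is missing.
  def describe({:error, :not_yet_implemented}), do: :known_but_not_decodable
  # Charset label is not recognized at all.
  def describe({:error, {:unsupported_charset, label}}), do: {:unknown_charset, label}
  # Any other failure reason.
  def describe({:error, reason}), do: {:failed, reason}
end

IO.inspect(DecodeResultSketch.describe({:error, :not_yet_implemented}))
```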

---

*Consult [api-reference.md](api-reference.md) for complete listing*