# `RustyCSV`
[🔗](https://github.com/jeffhuen/rustycsv/blob/v0.3.10/lib/rusty_csv.ex#L1)

RustyCSV is an ultra-fast CSV parsing and dumping library powered by purpose-built Rust NIFs.

It provides a drop-in replacement for NimbleCSV with the same API, while offering
multiple parsing strategies optimized for different use cases.

## Quick Start

Use the pre-defined `RustyCSV.RFC4180` parser:

    alias RustyCSV.RFC4180, as: CSV

    CSV.parse_string("name,age\njohn,27\n")
    #=> [["john", "27"]]

    CSV.parse_string("name,age\njohn,27\n", skip_headers: false)
    #=> [["name", "age"], ["john", "27"]]

## Defining Custom Parsers

You can define custom CSV parsers with `define/2`:

    RustyCSV.define(MyParser,
      separator: ",",
      escape: "\"",
      line_separator: "\n"
    )

    MyParser.parse_string("a,b\n1,2\n")
    #=> [["1", "2"]]

## Parsing Strategies

RustyCSV accepts a `:strategy` option. For backward compatibility, `:basic`,
`:indexed`, and `:zero_copy` are still accepted, but they are aliases for
`:simd`: all four run the same SIMD structural boundary scanner and hybrid
sub-binary term builder. The only meaningfully distinct strategies are:

  * `:simd` (default) — SIMD structural boundary scan, single-threaded
  * `:parallel` — same SIMD scan, multi-threaded field extraction via rayon
    (best for very large files 500 MB+)

The streaming parser (used automatically with `parse_stream/2`) is a
separate stateful approach for bounded-memory processing.

Example:

    CSV.parse_string(large_csv, strategy: :parallel)

## Scheduling

All parsing NIFs run on BEAM dirty CPU schedulers, so they never block
normal schedulers. Parallel parsing (`:parallel` strategy) additionally
runs on a dedicated `rustycsv-*` rayon thread pool to avoid contention
with other rayon users in the same VM.
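
Because every call lands on a dirty CPU scheduler, parsing can safely be
fanned out across processes. A minimal sketch, assuming `csv_binaries` is a
list of CSV binaries already in memory:

    # Each NIF call runs on a dirty CPU scheduler, so normal schedulers
    # stay responsive even under heavy concurrent parsing.
    csv_binaries
    |> Task.async_stream(&CSV.parse_string/1, max_concurrency: System.schedulers_online())
    |> Enum.flat_map(fn {:ok, rows} -> rows end)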

## Streaming

For large files, use `parse_stream/2` which uses a bounded-memory streaming parser:

    "huge.csv"
    |> File.stream!()
    |> CSV.parse_stream()
    |> Stream.each(&process_row/1)
    |> Stream.run()

Streaming parsers are safe to share across processes — the underlying
Rust resource is protected by a mutex. However, concurrent access is
serialized, so for maximum throughput use one parser per process.
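
For example, to process several files concurrently with one parser per
process, build the stream inside each task (a sketch; `paths` is a
hypothetical list of file paths):

    paths
    |> Task.async_stream(
      fn path ->
        # Each task builds its own stream, so each process gets its own
        # parser resource and never contends on the mutex.
        path
        |> File.stream!()
        |> CSV.parse_stream()
        |> Enum.count()
      end,
      timeout: :infinity
    )
    |> Enum.map(fn {:ok, count} -> count end)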

## Encoding (Dumping)

Convert rows back to CSV format:

    CSV.dump_to_iodata([["name", "age"], ["john", "27"]])
    #=> "name,age\njohn,27\n"

Encoding uses a SIMD-accelerated Rust NIF that writes all CSV bytes into
a single flat binary. The NIF handles four modes: plain UTF-8, UTF-8 with
formula escaping, non-UTF-8 encoding, and both combined.

> **Difference from NimbleCSV:** NimbleCSV's `dump_to_iodata/1` returns an
> iodata list (a nested list of small binaries) that callers typically flatten
> back into a single binary via `IO.iodata_to_binary/1` before writing to a
> file, sending as a download, or passing to an API. RustyCSV skips that
> roundtrip — it returns the final binary directly, ready for use with
> `IO.binwrite/2`, `Conn.send_resp/3`, `:gen_tcp.send/2`, `File.write/2`,
> etc. The output bytes are identical; there is nothing to traverse or flatten.
>
> Code that pattern-matches on the return value expecting a list will need
> adjustment. This is a deliberate trade-off: building an iodata list across
> the NIF boundary requires allocating one Erlang term per field, separator,
> and newline, which is 18–63% slower and uses 3–6x more NIF memory than
> returning the bytes directly.
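
In practice most NimbleCSV call sites keep working unchanged, because a
binary is itself valid iodata; only code that matches on the list shape
breaks. A sketch of both cases:

    # Still works: IO.iodata_to_binary/1 on a binary is a no-op
    csv = rows |> MyCSV.dump_to_iodata() |> IO.iodata_to_binary()

    # Needs adjustment: the result is a binary, not a list
    # [head | _tail] = MyCSV.dump_to_iodata(rows)  # no longer matches
    <<_::binary>> = MyCSV.dump_to_iodata(rows)     # matches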

### Encoding Strategies

`dump_to_iodata/2` accepts a `:strategy` option:

  * *default* (no option) — Single-threaded SIMD-accelerated encoder.
    Writes all CSV bytes into a single flat binary. Best for most workloads.

  * `:parallel` — Multi-threaded encoding via rayon. Copies all field data
    into Rust-owned memory, splits rows into chunks, and encodes each chunk
    on a separate thread. Returns a short list of large binaries. Best for
    quoting-heavy data (user-generated content with embedded commas/quotes/newlines).

Example:

    # Default (recommended for most cases)
    CSV.dump_to_iodata(rows)

    # Parallel (opt in for quoting-heavy data)
    CSV.dump_to_iodata(rows, strategy: :parallel)

### High-Throughput Concurrent Exports

The encoding NIF runs on dirty CPU schedulers with per-thread mimalloc
arenas, making it suitable for concurrent export workloads — e.g.,
thousands of users downloading CSV reports simultaneously:

    # Phoenix controller — each request encodes independently
    rows = MyApp.Reports.fetch_rows(user_id)
    csv = MyCSV.dump_to_iodata(rows)
    send_download(conn, {:binary, csv}, filename: "report.csv")

For very large exports, encode in chunks to keep memory bounded:

    MyApp.Reports.stream_rows(user_id)
    |> Stream.chunk_every(5_000)
    |> Stream.map(&MyCSV.dump_to_iodata/1)
    |> Enum.each(&Conn.chunk(conn, &1))

## NimbleCSV Compatibility

RustyCSV is designed as a drop-in replacement for NimbleCSV. The API is identical:

  * `parse_string/2` - Parse CSV string to list of rows
  * `parse_stream/2` - Lazily parse a stream
  * `parse_enumerable/2` - Parse any enumerable
  * `dump_to_iodata/2` - Convert rows to iodata (returns a flat binary, not an iodata list — see "Encoding" section)
  * `dump_to_stream/1` - Lazily convert rows to iodata stream
  * `to_line_stream/1` - Convert arbitrary chunks to lines
  * `options/0` - Return module configuration

RustyCSV extends NimbleCSV with additional options:

  * `:strategy` on `parse_string/2` - Select the parsing approach (`:simd`,
    `:basic`, `:indexed`, `:parallel`, `:zero_copy`)
  * `:strategy` on `dump_to_iodata/2` - Select the encoding approach
    (default or `:parallel`)
  * `:headers` - Return rows as maps instead of lists

## Headers-to-Maps

Use the `:headers` option to get maps instead of lists:

    CSV.parse_string("name,age\njohn,27\n", headers: true)
    #=> [%{"name" => "john", "age" => "27"}]

    CSV.parse_string("name,age\njohn,27\n", headers: [:name, :age])
    #=> [%{name: "john", age: "27"}]

    CSV.parse_string("name,age\njohn,27\n", headers: ["n", "a"])
    #=> [%{"n" => "john", "a" => "27"}]

Streaming also supports headers:

    "huge.csv"
    |> File.stream!()
    |> CSV.parse_stream(headers: true)
    |> Stream.each(&process_map/1)
    |> Stream.run()

### How `:headers` interacts with `:skip_headers`

With `headers: true`, the first row is always consumed as keys — `:skip_headers`
has no effect.

With `headers: [keys]`, the `:skip_headers` option controls whether the first
row is skipped (default: `true`). Most CSV files have a header row, so skipping
it avoids mapping the header row itself into a map. If your file has no header
row, pass `skip_headers: false`:

    # File with header row (typical) — first row skipped by default
    CSV.parse_string("name,age\njohn,27\n", headers: [:n, :a])
    #=> [%{n: "john", a: "27"}]

    # File without header row — include all rows
    CSV.parse_string("john,27\njane,30\n", headers: [:n, :a], skip_headers: false)
    #=> [%{n: "john", a: "27"}, %{n: "jane", a: "30"}]

### Edge cases

  * Fewer columns than keys — missing values are `nil`
  * More columns than keys — extra columns are ignored
  * Duplicate headers — last column wins
  * Empty header field — key is `""`
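
Illustrative sketches of the first two cases; the outputs shown follow from
the rules above:

    # Fewer columns than keys: missing values are nil
    CSV.parse_string("1\n", headers: [:a, :b], skip_headers: false)
    #=> [%{a: "1", b: nil}]

    # More columns than keys: extras are ignored
    CSV.parse_string("1,2,3\n", headers: [:a, :b], skip_headers: false)
    #=> [%{a: "1", b: "2"}]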

## Multi-Separator Support

Like NimbleCSV, RustyCSV supports multiple separator characters. Separators
can be single-byte or multi-byte:

    RustyCSV.define(MyParser,
      separator: [",", ";"],
      escape: "\""
    )

    # Any separator in the list is recognized when parsing
    MyParser.parse_string("a,b;c\n1;2,3\n", skip_headers: false)
    #=> [["a", "b", "c"], ["1", "2", "3"]]

    # Only the FIRST separator is used when dumping
    MyParser.dump_to_iodata([["a", "b", "c"]]) |> IO.iodata_to_binary()
    #=> "a,b,c\n"

Multi-byte separators are supported:

    RustyCSV.define(MyParser,
      separator: "::",
      escape: "\""
    )

    MyParser.parse_string("a::b::c\n", skip_headers: false)
    #=> [["a", "b", "c"]]

You can also mix single-byte and multi-byte separators:

    RustyCSV.define(MyParser,
      separator: [",", "::"],
      escape: "\""
    )

## Multi-Byte Escape Support

Escape sequences can also be multi-byte:

    RustyCSV.define(MyParser,
      separator: ",",
      escape: "$$"
    )

    MyParser.parse_string("$$hello$$,world\n", skip_headers: false)
    #=> [["hello", "world"]]

## Encoding Support

RustyCSV supports character encoding conversion via the `:encoding` option.
This is useful when exporting CSVs with non-ASCII characters (accents, CJK,
emoji) that need to open correctly in spreadsheet applications:

    alias RustyCSV.Spreadsheet

    # Export data with international characters for Excel/Google Sheets/Numbers
    rows = [["名前", "年齢"], ["田中", "27"], ["Müller", "35"]]
    csv = Spreadsheet.dump_to_iodata(rows) |> IO.iodata_to_binary()
    File.write!("export.csv", csv)

The pre-defined `RustyCSV.Spreadsheet` module outputs UTF-16 LE with BOM,
which spreadsheet applications auto-detect correctly. You can also define
custom encodings:

    RustyCSV.define(MySpreadsheet,
      separator: "\t",
      encoding: {:utf16, :little},
      trim_bom: true,
      dump_bom: true
    )

Supported encodings:
  * `:utf8` - UTF-8 (default, no conversion overhead)
  * `:latin1` - ISO-8859-1 / Latin-1
  * `{:utf16, :little}` - UTF-16 Little Endian
  * `{:utf16, :big}` - UTF-16 Big Endian
  * `{:utf32, :little}` - UTF-32 Little Endian
  * `{:utf32, :big}` - UTF-32 Big Endian

# `define_options`
[🔗](https://github.com/jeffhuen/rustycsv/blob/v0.3.10/lib/rusty_csv.ex#L445)

```elixir
@type define_options() :: [
  separator: String.t() | [String.t()],
  escape: String.t(),
  newlines: [String.t()],
  line_separator: String.t(),
  trim_bom: boolean(),
  dump_bom: boolean(),
  reserved: [String.t()],
  escape_formula: map() | nil,
  encoding: encoding(),
  strategy: strategy(),
  moduledoc: String.t() | false | nil
]
```

Options for `define/2`.

## Parsing Options

  * `:separator` - Field separator character(s). Can be a single string (e.g., `","`)
    or a list of strings for multi-separator support (e.g., `[",", ";"]`).
    When parsing, any separator in the list is recognized as a field delimiter.
    When dumping, only the **first** separator is used for output.
    Defaults to `","`.
  * `:escape` - Escape/quote character. Defaults to `"\""`.
  * `:newlines` - List of recognized line endings. Defaults to `["\r\n", "\n"]`.
  * `:trim_bom` - Remove BOM when parsing strings. Defaults to `false`.
  * `:encoding` - Character encoding. Defaults to `:utf8`. See `t:encoding/0`.

## Dumping Options

  * `:line_separator` - Line separator for output. Defaults to `"\n"`.
  * `:dump_bom` - Include BOM in output. Defaults to `false`.
  * `:reserved` - Additional characters requiring escaping.
  * `:escape_formula` - Map for formula injection prevention. Defaults to `nil`.
    When set, fields starting with trigger characters are prefixed with a
    replacement string inside quotes. Handled natively in the Rust NIF.

## Other Options

  * `:strategy` - Default parsing strategy. Defaults to `:simd`.
  * `:moduledoc` - Documentation for the generated module.

# `dump_options`
[🔗](https://github.com/jeffhuen/rustycsv/blob/v0.3.10/lib/rusty_csv.ex#L400)

```elixir
@type dump_options() :: [{:strategy, :parallel}]
```

Options for `dump_to_iodata/2`.

## Options

  * `:strategy` - Encoding strategy to use. Defaults to the single-threaded
    SIMD-accelerated encoder (no option needed). Pass `:parallel` for
    multi-threaded encoding via rayon, which is faster for quoting-heavy data.

# `encoding`
[🔗](https://github.com/jeffhuen/rustycsv/blob/v0.3.10/lib/rusty_csv.ex#L413)

```elixir
@type encoding() ::
  :utf8 | :latin1 | {:utf16, :little | :big} | {:utf32, :little | :big}
```

Encoding for CSV data.

Supported encodings:
  * `:utf8` - UTF-8 (default, no conversion)
  * `:latin1` - ISO-8859-1 / Latin-1
  * `{:utf16, :little}` - UTF-16 Little Endian
  * `{:utf16, :big}` - UTF-16 Big Endian
  * `{:utf32, :little}` - UTF-32 Little Endian
  * `{:utf32, :big}` - UTF-32 Big Endian

# `parse_options`
[🔗](https://github.com/jeffhuen/rustycsv/blob/v0.3.10/lib/rusty_csv.ex#L381)

```elixir
@type parse_options() :: [
  skip_headers: boolean(),
  strategy: strategy(),
  headers: boolean() | [atom() | String.t()],
  chunk_size: pos_integer(),
  batch_size: pos_integer(),
  max_buffer_size: pos_integer()
]
```

Options for parsing functions.

## Common Options

  * `:skip_headers` - When `true`, skips the first row. Defaults to `true`.
  * `:strategy` - The parsing strategy to use. One of:
    * `:simd` - SIMD structural boundary scan (default)
    * `:basic` - Alias for `:simd`
    * `:indexed` - Alias for `:simd`
    * `:parallel` - Multi-threaded via rayon
    * `:zero_copy` - Alias for `:simd`
  * `:headers` - Controls header handling. Defaults to `false`.
    * `false` - Return rows as lists (default behavior)
    * `true` - Use first row as string keys, return list of maps.
      `:skip_headers` is ignored (first row is always consumed as keys).
    * list of atoms or strings - Use as explicit keys, return list of maps.
      The first row is skipped by default (`:skip_headers` applies). Pass
      `skip_headers: false` if the file has no header row.

## Limits

  * **Input size** - Batch parsing (all strategies except streaming) is limited
    to inputs of at most 4 GiB (`u32::MAX` bytes) because the SIMD structural
    scanner uses 32-bit positions. Passing a larger binary returns
    `{:error, :input_too_large}`. Use the streaming parser for files exceeding
    this limit.
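
A sketch of falling back to streaming when the limit is hit, assuming the
data also exists on disk at a hypothetical `path`:

    rows =
      case CSV.parse_string(data) do
        {:error, :input_too_large} ->
          # Over 4 GiB: stream from disk with bounded memory instead
          path |> File.stream!() |> CSV.parse_stream() |> Enum.to_list()

        rows when is_list(rows) ->
          rows
      end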

## Streaming Options

  * `:chunk_size` - Bytes per IO read for streaming. Defaults to `65536`.
  * `:batch_size` - Rows per batch for streaming. Defaults to `1000`.
  * `:max_buffer_size` - Maximum streaming buffer size in bytes. Defaults to
    `268_435_456` (256 MB). If the internal buffer exceeds this limit during
    `streaming_feed/2`, a `:buffer_overflow` exception is raised. Increase
    this if your data contains rows longer than 256 MB. Decrease it to fail
    faster on malformed input that lacks newlines.
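
For example, larger reads and batches can reduce per-batch overhead at the
cost of memory (the values here are illustrative, not recommendations):

    "huge.csv"
    |> File.stream!()
    |> CSV.parse_stream(chunk_size: 262_144, batch_size: 5_000)
    |> Stream.each(&process_row/1)
    |> Stream.run()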

# `row`
[🔗](https://github.com/jeffhuen/rustycsv/blob/v0.3.10/lib/rusty_csv.ex#L289)

```elixir
@type row() :: [binary()]
```

A single row of CSV data, represented as a list of field binaries.

# `rows`
[🔗](https://github.com/jeffhuen/rustycsv/blob/v0.3.10/lib/rusty_csv.ex#L294)

```elixir
@type rows() :: [row()]
```

Multiple rows of CSV data.

# `strategy`
[🔗](https://github.com/jeffhuen/rustycsv/blob/v0.3.10/lib/rusty_csv.ex#L340)

```elixir
@type strategy() :: :simd | :basic | :indexed | :parallel | :zero_copy
```

Parsing strategy to use.

These strategies apply to `parse_string/2` and other parsing functions.
For encoding strategies, see `t:dump_options/0`.

## Available Strategies

  * `:simd` (default) — SIMD structural boundary scan, single-threaded.
  * `:basic` — alias for `:simd` (retained for backward compatibility).
  * `:indexed` — alias for `:simd` (retained for backward compatibility).
  * `:parallel` — same SIMD scan, multi-threaded field extraction via rayon.
    Best for very large files (500 MB+).
  * `:zero_copy` — alias for `:simd` (retained for backward compatibility).

> #### Strategy equivalence {: .info}
>
> `:simd`, `:basic`, `:indexed`, and `:zero_copy` all execute the same
> code path: a portable-SIMD structural scanner followed by a hybrid
> sub-binary term builder. They produce identical results with identical
> performance. The names are kept for API stability.

## Memory Model

All batch strategies use boundary-based parsing: the NIF scans the input
to find field boundaries, then returns sub-binary references for clean
fields (zero-copy) and only allocates new binaries for fields that
require unescaping. The input binary is kept alive while any sub-binary
references it.
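
The sharing can be inspected with Erlang's `:binary.referenced_byte_size/1`
(a sketch; whether a given field is a shared sub-binary or a fresh copy
depends on field cleanliness and the runtime's small-binary handling):

    data = "aaa,bbb\nccc,ddd\n"
    [[field | _] | _] = CSV.parse_string(data, skip_headers: false)

    # For a shared sub-binary this reports the size of the whole input
    # binary, not just the field itself.
    :binary.referenced_byte_size(field)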

| Strategy | Best When |
|----------|-----------|
| `:simd` | Default, fastest for most files |
| `:parallel` | Large files 500 MB+, complex quoting |

## Examples

    # Default strategy
    CSV.parse_string(data)

    # Parallel for large files
    CSV.parse_string(large_data, strategy: :parallel)

# `dump_to_iodata`
[🔗](https://github.com/jeffhuen/rustycsv/blob/v0.3.10/lib/rusty_csv.ex#L532)

```elixir
@callback dump_to_iodata(Enumerable.t()) :: iodata()
```

Converts rows to iodata in CSV format.

Returns a single flat binary (not an iodata list). A binary is valid
`t:iodata/0`, so it works with `IO.binwrite/2`, `IO.iodata_to_binary/1`,
etc. See "Encoding (Dumping)" in the module doc for details on how this
differs from NimbleCSV.

## Options

  * `:strategy` - Encoding strategy. Defaults to the single-threaded
    SIMD-accelerated encoder. Pass `:parallel` for multi-threaded encoding
    via rayon, which is faster for quoting-heavy data.

# `dump_to_iodata`
[🔗](https://github.com/jeffhuen/rustycsv/blob/v0.3.10/lib/rusty_csv.ex#L533)

```elixir
@callback dump_to_iodata(Enumerable.t(), dump_options()) :: iodata()
```

# `dump_to_stream`
[🔗](https://github.com/jeffhuen/rustycsv/blob/v0.3.10/lib/rusty_csv.ex#L538)

```elixir
@callback dump_to_stream(Enumerable.t()) :: Enumerable.t()
```

Lazily converts rows to a stream of iodata in CSV format.

# `options`
[🔗](https://github.com/jeffhuen/rustycsv/blob/v0.3.10/lib/rusty_csv.ex#L485)

```elixir
@callback options() :: keyword()
```

Returns the options used to define this CSV module.

# `parse_enumerable`
[🔗](https://github.com/jeffhuen/rustycsv/blob/v0.3.10/lib/rusty_csv.ex#L510)

```elixir
@callback parse_enumerable(Enumerable.t()) :: rows()
```

Eagerly parses an enumerable of CSV data into a list of rows.

# `parse_enumerable`
[🔗](https://github.com/jeffhuen/rustycsv/blob/v0.3.10/lib/rusty_csv.ex#L515)

```elixir
@callback parse_enumerable(Enumerable.t(), parse_options()) :: rows()
```

Eagerly parses an enumerable of CSV data into a list of rows with options.

# `parse_stream`
[🔗](https://github.com/jeffhuen/rustycsv/blob/v0.3.10/lib/rusty_csv.ex#L500)

```elixir
@callback parse_stream(Enumerable.t()) :: Enumerable.t()
```

Lazily parses a stream of CSV data into a stream of rows.

# `parse_stream`
[🔗](https://github.com/jeffhuen/rustycsv/blob/v0.3.10/lib/rusty_csv.ex#L505)

```elixir
@callback parse_stream(Enumerable.t(), parse_options()) :: Enumerable.t()
```

Lazily parses a stream of CSV data into a stream of rows with options.

# `parse_string`
[🔗](https://github.com/jeffhuen/rustycsv/blob/v0.3.10/lib/rusty_csv.ex#L490)

```elixir
@callback parse_string(binary()) :: rows()
```

Parses a CSV string into a list of rows.

# `parse_string`
[🔗](https://github.com/jeffhuen/rustycsv/blob/v0.3.10/lib/rusty_csv.ex#L495)

```elixir
@callback parse_string(binary(), parse_options()) :: rows()
```

Parses a CSV string into a list of rows with options.

# `to_line_stream`
[🔗](https://github.com/jeffhuen/rustycsv/blob/v0.3.10/lib/rusty_csv.ex#L543)

```elixir
@callback to_line_stream(Enumerable.t()) :: Enumerable.t()
```

Converts a stream of arbitrary binary chunks into a line-oriented stream.

# `define`
[🔗](https://github.com/jeffhuen/rustycsv/blob/v0.3.10/lib/rusty_csv.ex#L662)

```elixir
@spec define(module(), define_options()) :: :ok
```

Defines a new CSV parser/dumper module.

## Options

### Parsing Options

  * `:separator` - The field separator(s). Can be a single string
    (e.g., `","`, `"::"`) or a list of strings for multi-separator support
    (e.g., `[",", ";"]`, `[",", "::"]`). Separators can be multi-byte.
    Defaults to `","`.

    When multiple separators are specified:
    - **Parsing**: Any separator in the list is recognized as a field delimiter
    - **Dumping**: Only the **first** separator is used for output

    This is useful for parsing files with inconsistent delimiters or mixed
    comma/semicolon separators (common in European locales).

  * `:escape` - The escape/quote sequence. Can be multi-byte (e.g., `"$$"`).
    Defaults to `"\""`.

  * `:newlines` - List of recognized line endings for parsing.
    Defaults to `["\r\n", "\n"]`. Both CRLF and LF are always recognized.

  * `:trim_bom` - When `true`, removes the BOM (byte order marker)
    from the beginning of strings before parsing. Defaults to `false`.

  * `:encoding` - Character encoding for input/output. Defaults to `:utf8`.
    Supported encodings:
    * `:utf8` - UTF-8 (default, no conversion overhead)
    * `:latin1` - ISO-8859-1 / Latin-1
    * `{:utf16, :little}` - UTF-16 Little Endian
    * `{:utf16, :big}` - UTF-16 Big Endian
    * `{:utf32, :little}` - UTF-32 Little Endian
    * `{:utf32, :big}` - UTF-32 Big Endian

    When encoding is not `:utf8`, input data is converted to UTF-8 for
    parsing, and output is converted back to the target encoding.

### Dumping Options

  * `:line_separator` - The line separator for dumped output.
    Defaults to `"\n"`.

  * `:dump_bom` - When `true`, includes the appropriate BOM at the start of
    dumped output. Defaults to `false`.

  * `:reserved` - Additional characters that should trigger field escaping
    when dumping. By default, fields containing the separator, escape
    character, or newlines are escaped.

  * `:escape_formula` - A map of trigger characters for preventing CSV
    formula injection. When set, fields starting with these characters are
    prefixed with a replacement string inside quotes. Defaults to `nil`.

    Example: `%{"=" => true, "+" => true, "-" => true, "@" => true}`

### Strategy Options

  * `:strategy` - The default parsing strategy. One of:
    * `:simd` - SIMD structural boundary scan (default)
    * `:basic` - Alias for `:simd`
    * `:indexed` - Alias for `:simd`
    * `:parallel` - Multi-threaded via rayon
    * `:zero_copy` - Alias for `:simd`

### Documentation

  * `:moduledoc` - The `@moduledoc` for the generated module.
    Set to `false` to disable documentation.

## Examples

    # Define a standard CSV parser
    RustyCSV.define(MyApp.CSV,
      separator: ",",
      escape: "\"",
      line_separator: "\n"
    )

    # Use it
    MyApp.CSV.parse_string("a,b\n1,2\n")
    #=> [["1", "2"]]

    # Define a UTF-16 spreadsheet parser
    RustyCSV.define(MyApp.Spreadsheet,
      separator: "\t",
      encoding: {:utf16, :little},
      trim_bom: true,
      dump_bom: true
    )

    # Define a multi-separator parser (comma or semicolon)
    RustyCSV.define(MyApp.FlexibleCSV,
      separator: [",", ";"],
      escape: "\""
    )

    # Parse files with mixed delimiters
    MyApp.FlexibleCSV.parse_string("a,b;c\n1;2,3\n", skip_headers: false)
    #=> [["a", "b", "c"], ["1", "2", "3"]]

    # Dumping uses the first separator (comma)
    MyApp.FlexibleCSV.dump_to_iodata([["x", "y"]]) |> IO.iodata_to_binary()
    #=> "x,y\n"

    # Get the configuration
    MyApp.CSV.options()
    #=> [separator: ",", escape: "\"", ...]

---

*Consult [api-reference.md](api-reference.md) for complete listing*
