RustyCSV behaviour (RustyCSV v0.3.9)

RustyCSV is an ultra-fast CSV parsing and dumping library powered by purpose-built Rust NIFs.

It provides a drop-in replacement for NimbleCSV with the same API, while offering multiple parsing strategies optimized for different use cases.

Quick Start

Use the pre-defined RustyCSV.RFC4180 parser:

alias RustyCSV.RFC4180, as: CSV

CSV.parse_string("name,age\njohn,27\n")
#=> [["john", "27"]]

CSV.parse_string("name,age\njohn,27\n", skip_headers: false)
#=> [["name", "age"], ["john", "27"]]

Defining Custom Parsers

You can define custom CSV parsers with define/2:

RustyCSV.define(MyParser,
  separator: ",",
  escape: "\"",
  line_separator: "\n"
)

MyParser.parse_string("a,b\n1,2\n")
#=> [["1", "2"]]

Parsing Strategies

RustyCSV supports multiple parsing strategies via the :strategy option:

  • :simd - SIMD-accelerated scanning via memchr (default, fastest for most files)
  • :basic - Simple byte-by-byte parsing (good for debugging)
  • :indexed - Two-phase index-then-extract (good for re-extracting rows)
  • :parallel - Multi-threaded via rayon (best for very large files, 500 MB+, with complex quoting)
  • :zero_copy - Sub-binary references (NimbleCSV-like memory profile, max speed)

Example:

CSV.parse_string(large_csv, strategy: :parallel)

Scheduling

All parsing NIFs run on BEAM dirty CPU schedulers, so they never block the normal schedulers. Parallel parsing (the :parallel strategy) additionally runs on a dedicated rustycsv-* rayon thread pool to avoid contention with other rayon users in the same VM.
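
Because parsing happens off the normal schedulers, many independent parses can be fanned out concurrently without starving the VM. A minimal sketch (the data/*.csv paths and the fan-out pattern are illustrative, not part of the RustyCSV API):

files = Path.wildcard("data/*.csv")

results =
  files
  |> Task.async_stream(
    fn path -> path |> File.read!() |> CSV.parse_string() end,
    max_concurrency: System.schedulers_online(),
    timeout: :infinity
  )
  |> Enum.map(fn {:ok, rows} -> rows end)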

Streaming

For large files, use parse_stream/2 which uses a bounded-memory streaming parser:

"huge.csv"
|> File.stream!()
|> CSV.parse_stream()
|> Stream.each(&process_row/1)
|> Stream.run()

Streaming parsers are safe to share across processes — the underlying Rust resource is protected by a mutex. However, concurrent access is serialized, so for maximum throughput use one parser per process.
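
To keep one parser per process, give each process its own stream. A sketch (the file names are placeholders):

# Each task builds its own stream, so each gets its own parser resource
["a.csv", "b.csv"]
|> Task.async_stream(fn path ->
  path
  |> File.stream!()
  |> CSV.parse_stream()
  |> Enum.count()
end)
|> Enum.to_list()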

Encoding (Dumping)

Convert rows back to CSV format:

CSV.dump_to_iodata([["name", "age"], ["john", "27"]])
#=> "name,age\njohn,27\n"

Encoding uses a SIMD-accelerated Rust NIF that writes all CSV bytes into a single flat binary. The NIF handles four modes: plain UTF-8, UTF-8 with formula escaping, non-UTF-8 encoding, and both combined.

Difference from NimbleCSV: NimbleCSV's dump_to_iodata/1 returns an iodata list (a nested list of small binaries) that callers typically flatten back into a single binary via IO.iodata_to_binary/1 before writing to a file, sending as a download, or passing to an API. RustyCSV skips that roundtrip — it returns the final binary directly, ready for use with IO.binwrite/2, Conn.send_resp/3, :gen_tcp.send/2, File.write/2, etc. The output bytes are identical; there is nothing to traverse or flatten.

Code that pattern-matches on the return value expecting a list will need adjustment. This is a deliberate trade-off: building an iodata list across the NIF boundary requires allocating one Erlang term per field, separator, and newline, which is 18–63% slower and uses 3–6x more NIF memory than returning the bytes directly.
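
For instance, code that previously flattened NimbleCSV output can drop that step. A sketch (rows stands in for any list of rows):

rows = [["name", "age"], ["john", "27"]]
csv = CSV.dump_to_iodata(rows)

true = is_binary(csv)        # flat binary, not a nested iodata list
File.write!("out.csv", csv)  # no IO.iodata_to_binary/1 round trip needed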

Encoding Strategies

dump_to_iodata/2 accepts a :strategy option:

  • default (no option) — Single-threaded SIMD-accelerated encoder. Writes all CSV bytes into a single flat binary. Best for most workloads.

  • :parallel — Multi-threaded encoding via rayon. Copies all field data into Rust-owned memory, splits rows into chunks, and encodes each chunk on a separate thread. Returns a short list of large binaries. Best for quoting-heavy data (user-generated content with embedded commas/quotes/newlines).

Example:

# Default (recommended for most cases)
CSV.dump_to_iodata(rows)

# Parallel (opt in for quoting-heavy data)
CSV.dump_to_iodata(rows, strategy: :parallel)

High-Throughput Concurrent Exports

The encoding NIF runs on dirty CPU schedulers with per-thread mimalloc arenas, making it suitable for concurrent export workloads — e.g., thousands of users downloading CSV reports simultaneously:

# Phoenix controller — each request encodes independently
rows = MyApp.Reports.fetch_rows(user_id)
csv = MyCSV.dump_to_iodata(rows)
send_download(conn, {:binary, csv}, filename: "report.csv")

For very large exports, use chunked NIF encoding for bounded memory:

MyApp.Reports.stream_rows(user_id)
|> Stream.chunk_every(5_000)
|> Stream.map(&MyCSV.dump_to_iodata/1)
|> Enum.reduce(conn, fn chunk, conn ->
  # conn must already be prepared with send_chunked(conn, 200)
  {:ok, conn} = Plug.Conn.chunk(conn, chunk)
  conn
end)

NimbleCSV Compatibility

RustyCSV is designed as a drop-in replacement for NimbleCSV. The API is identical:

  • parse_string/2 - Parse CSV string to list of rows
  • parse_stream/2 - Lazily parse a stream
  • parse_enumerable/2 - Parse any enumerable
  • dump_to_iodata/2 - Convert rows to iodata (returns a flat binary, not an iodata list — see the "Encoding (Dumping)" section)
  • dump_to_stream/1 - Lazily convert rows to iodata stream
  • to_line_stream/1 - Convert arbitrary chunks to lines
  • options/0 - Return module configuration

RustyCSV extends NimbleCSV with additional options:

  • :strategy on parse_string/2 - Select the parsing approach (:simd, :basic, :indexed, :parallel, :zero_copy)
  • :strategy on dump_to_iodata/2 - Select the encoding approach (default or :parallel)
  • :headers - Return rows as maps instead of lists

Headers-to-Maps

Use the :headers option to get maps instead of lists:

CSV.parse_string("name,age\njohn,27\n", headers: true)
#=> [%{"name" => "john", "age" => "27"}]

CSV.parse_string("name,age\njohn,27\n", headers: [:name, :age])
#=> [%{name: "john", age: "27"}]

CSV.parse_string("name,age\njohn,27\n", headers: ["n", "a"])
#=> [%{"n" => "john", "a" => "27"}]

Streaming also supports headers:

"huge.csv"
|> File.stream!()
|> CSV.parse_stream(headers: true)
|> Stream.each(&process_map/1)
|> Stream.run()

How :headers interacts with :skip_headers

With headers: true, the first row is always consumed as keys — :skip_headers has no effect.

With headers: [keys], the :skip_headers option controls whether the first row is skipped (default: true). Most CSV files have a header row, so skipping it avoids mapping the header row itself into a map. If your file has no header row, pass skip_headers: false:

# File with header row (typical) — first row skipped by default
CSV.parse_string("name,age\njohn,27\n", headers: [:n, :a])
#=> [%{n: "john", a: "27"}]

# File without header row — include all rows
CSV.parse_string("john,27\njane,30\n", headers: [:n, :a], skip_headers: false)
#=> [%{n: "john", a: "27"}, %{n: "jane", a: "30"}]

Edge cases

  • Fewer columns than keys — missing values are nil
  • More columns than keys — extra columns are ignored
  • Duplicate headers — last column wins
  • Empty header field — key is ""
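
A sketch of the first two cases, following the semantics listed above:

# Fewer columns than keys: the missing value becomes nil
CSV.parse_string("john\n", headers: [:name, :age], skip_headers: false)
#=> [%{name: "john", age: nil}]

# More columns than keys: the extra column is ignored
CSV.parse_string("john,27,extra\n", headers: [:name, :age], skip_headers: false)
#=> [%{name: "john", age: "27"}]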

Multi-Separator Support

Like NimbleCSV, RustyCSV supports multiple separator characters. Separators can be single-byte or multi-byte:

RustyCSV.define(MyParser,
  separator: [",", ";"],
  escape: "\""
)

# Any separator in the list is recognized when parsing
MyParser.parse_string("a,b;c\\n1;2,3\\n", skip_headers: false)
#=> [["a", "b", "c"], ["1", "2", "3"]]

# Only the FIRST separator is used when dumping
MyParser.dump_to_iodata([["a", "b", "c"]]) |> IO.iodata_to_binary()
#=> "a,b,c\\n"

Multi-byte separators are supported:

RustyCSV.define(MyParser,
  separator: "::",
  escape: "\""
)

MyParser.parse_string("a::b::c\\n", skip_headers: false)
#=> [["a", "b", "c"]]

You can also mix single-byte and multi-byte separators:

RustyCSV.define(MyParser,
  separator: [",", "::"],
  escape: "\""
)
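
Parsing then recognizes either separator (a sketch, assuming the definition above):

MyParser.parse_string("a,b::c\n", skip_headers: false)
#=> [["a", "b", "c"]]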

Multi-Byte Escape Support

Escape sequences can also be multi-byte:

RustyCSV.define(MyParser,
  separator: ",",
  escape: "$$"
)

MyParser.parse_string("$$hello$$,world\\n", skip_headers: false)
#=> [["hello", "world"]]

Encoding Support

RustyCSV supports character encoding conversion via the :encoding option. This is useful when exporting CSVs with non-ASCII characters (accents, CJK, emoji) that need to open correctly in spreadsheet applications:

alias RustyCSV.Spreadsheet

# Export data with international characters for Excel/Google Sheets/Numbers
rows = [["名前", "年齢"], ["田中", "27"], ["Müller", "35"]]
csv = Spreadsheet.dump_to_iodata(rows) |> IO.iodata_to_binary()
File.write!("export.csv", csv)

The pre-defined RustyCSV.Spreadsheet module outputs UTF-16 LE with BOM, which spreadsheet applications auto-detect correctly. You can also define custom encodings:

RustyCSV.define(MySpreadsheet,
  separator: "\t",
  encoding: {:utf16, :little},
  trim_bom: true,
  dump_bom: true
)
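
A round trip through such a module might look like this (a sketch; per the define/2 docs, non-UTF-8 input is converted to UTF-8 for parsing and the BOM is trimmed):

data = MySpreadsheet.dump_to_iodata([["名前", "年齢"], ["田中", "27"]])
MySpreadsheet.parse_string(data, skip_headers: false)
#=> [["名前", "年齢"], ["田中", "27"]]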

Supported encodings:

  • :utf8 - UTF-8 (default, no conversion overhead)
  • :latin1 - ISO-8859-1 / Latin-1
  • {:utf16, :little} - UTF-16 Little Endian
  • {:utf16, :big} - UTF-16 Big Endian
  • {:utf32, :little} - UTF-32 Little Endian
  • {:utf32, :big} - UTF-32 Big Endian

Summary

Types

define_options() - Options for define/2.

dump_options() - Options for dump_to_iodata/2.

encoding() - Encoding for CSV data.

parse_options() - Options for parsing functions.

row() - A single row of CSV data, represented as a list of field binaries.

rows() - Multiple rows of CSV data.

strategy() - Parsing strategy to use.

Callbacks

dump_to_iodata(t) - Converts rows to iodata in CSV format.

dump_to_stream(t) - Lazily converts rows to a stream of iodata in CSV format.

options() - Returns the options used to define this CSV module.

parse_enumerable(t) - Eagerly parses an enumerable of CSV data into a list of rows.

parse_enumerable(t, parse_options) - Eagerly parses an enumerable of CSV data into a list of rows with options.

parse_stream(t) - Lazily parses a stream of CSV data into a stream of rows.

parse_stream(t, parse_options) - Lazily parses a stream of CSV data into a stream of rows with options.

parse_string(binary) - Parses a CSV string into a list of rows.

parse_string(binary, parse_options) - Parses a CSV string into a list of rows with options.

to_line_stream(t) - Converts a stream of arbitrary binary chunks into a line-oriented stream.

Functions

define(module, options \\ []) - Defines a new CSV parser/dumper module.

Types

define_options()

@type define_options() :: [
  separator: String.t() | [String.t()],
  escape: String.t(),
  newlines: [String.t()],
  line_separator: String.t(),
  trim_bom: boolean(),
  dump_bom: boolean(),
  reserved: [String.t()],
  escape_formula: map() | nil,
  encoding: encoding(),
  strategy: strategy(),
  moduledoc: String.t() | false | nil
]

Options for define/2.

Parsing Options

  • :separator - Field separator character(s). Can be a single string (e.g., ",") or a list of strings for multi-separator support (e.g., [",", ";"]). When parsing, any separator in the list is recognized as a field delimiter. When dumping, only the first separator is used for output. Defaults to ",".
  • :escape - Escape/quote character. Defaults to "\"".
  • :newlines - List of recognized line endings. Defaults to ["\r\n", "\n"].
  • :trim_bom - Remove BOM when parsing strings. Defaults to false.
  • :encoding - Character encoding. Defaults to :utf8. See encoding/0.

Dumping Options

  • :line_separator - Line separator for output. Defaults to "\n".
  • :dump_bom - Include BOM in output. Defaults to false.
  • :reserved - Additional characters requiring escaping.
  • :escape_formula - Map for formula injection prevention. Defaults to nil. When set, fields starting with trigger characters are prefixed with a replacement string inside quotes. Handled natively in the Rust NIF.

Other Options

  • :strategy - Default parsing strategy. Defaults to :simd.
  • :moduledoc - Documentation for the generated module.

dump_options()

@type dump_options() :: [{:strategy, :parallel}]

Options for dump_to_iodata/2.

Options

  • :strategy - Encoding strategy to use. Defaults to the single-threaded SIMD-accelerated encoder (no option needed). Pass :parallel for multi-threaded encoding via rayon, which is faster for quoting-heavy data.

encoding()

@type encoding() ::
  :utf8 | :latin1 | {:utf16, :little | :big} | {:utf32, :little | :big}

Encoding for CSV data.

Supported encodings:

  • :utf8 - UTF-8 (default, no conversion)
  • :latin1 - ISO-8859-1 / Latin-1
  • {:utf16, :little} - UTF-16 Little Endian
  • {:utf16, :big} - UTF-16 Big Endian
  • {:utf32, :little} - UTF-32 Little Endian
  • {:utf32, :big} - UTF-32 Big Endian

parse_options()

@type parse_options() :: [
  skip_headers: boolean(),
  strategy: strategy(),
  headers: boolean() | [atom() | String.t()],
  chunk_size: pos_integer(),
  batch_size: pos_integer(),
  max_buffer_size: pos_integer()
]

Options for parsing functions.

Common Options

  • :skip_headers - When true, skips the first row. Defaults to true.
  • :strategy - The parsing strategy to use. One of:
    • :simd - SIMD-accelerated (default)
    • :basic - Simple byte-by-byte
    • :indexed - Two-phase index-then-extract
    • :parallel - Multi-threaded via rayon
    • :zero_copy - Sub-binary references (keeps parent binary alive)
  • :headers - Controls header handling. Defaults to false.
    • false - Return rows as lists (default behavior)
    • true - Use first row as string keys, return list of maps. :skip_headers is ignored (first row is always consumed as keys).
    • list of atoms or strings - Use as explicit keys, return list of maps. The first row is skipped by default (:skip_headers applies). Pass skip_headers: false if the file has no header row.

Streaming Options

  • :chunk_size - Bytes per IO read for streaming. Defaults to 65536.
  • :batch_size - Rows per batch for streaming. Defaults to 1000.
  • :max_buffer_size - Maximum streaming buffer size in bytes. Defaults to 268_435_456 (256 MB). If the internal buffer exceeds this limit during streaming_feed/2, a :buffer_overflow exception is raised. Increase this if your data contains rows longer than 256 MB. Decrease it to fail faster on malformed input that lacks newlines.
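
For example, streaming limits can be tuned per call (the values below are illustrative only):

"huge.csv"
|> File.stream!()
|> CSV.parse_stream(batch_size: 2_000, max_buffer_size: 64 * 1024 * 1024)
|> Stream.run()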

row()

@type row() :: [binary()]

A single row of CSV data, represented as a list of field binaries.

rows()

@type rows() :: [row()]

Multiple rows of CSV data.

strategy()

@type strategy() :: :simd | :basic | :indexed | :parallel | :zero_copy

Parsing strategy to use.

These strategies apply to parse_string/2 and other parsing functions. For encoding strategies, see dump_options/0.

Available Strategies

  • :simd - SIMD-accelerated scanning via memchr (default, fastest for most files)
  • :basic - Simple byte-by-byte parsing (useful for debugging)
  • :indexed - Two-phase index-then-extract (good for re-extracting rows)
  • :parallel - Multi-threaded via rayon (best for very large files, 500 MB+, with complex quoting)
  • :zero_copy - Sub-binary references (maximum speed, keeps parent binary alive)

Memory Model Comparison

All strategies use boundary-based parsing: the NIF scans the input to find field boundaries, then returns sub-binary references for clean fields (zero copy) and only allocates new binaries for fields that require unescaping. The input binary is kept alive while any sub-binary references it.

Strategy      Best When
:simd         Default, fastest for most files
:basic        Debugging, baseline
:indexed      Row range extraction
:parallel     Large files (500 MB+), complex quoting
:zero_copy    Speed-critical, short-lived results

Examples

# Default SIMD strategy
CSV.parse_string(data)

# Parallel for large files
CSV.parse_string(large_data, strategy: :parallel)

# Zero-copy for maximum speed
CSV.parse_string(data, strategy: :zero_copy)

Callbacks

dump_to_iodata(t)

@callback dump_to_iodata(Enumerable.t()) :: iodata()

Converts rows to iodata in CSV format.

Returns a single flat binary (not an iodata list). A binary is valid iodata/0, so it works with IO.binwrite/2, IO.iodata_to_binary/1, etc. See "Encoding (Dumping)" in the module doc for details on how this differs from NimbleCSV.

Options

  • :strategy - Encoding strategy. Defaults to the single-threaded SIMD-accelerated encoder. Pass :parallel for multi-threaded encoding via rayon, which is faster for quoting-heavy data.

dump_to_iodata(t, dump_options)

@callback dump_to_iodata(Enumerable.t(), dump_options()) :: iodata()

dump_to_stream(t)

@callback dump_to_stream(Enumerable.t()) :: Enumerable.t()

Lazily converts rows to a stream of iodata in CSV format.

options()

@callback options() :: keyword()

Returns the options used to define this CSV module.

parse_enumerable(t)

@callback parse_enumerable(Enumerable.t()) :: rows()

Eagerly parses an enumerable of CSV data into a list of rows.

parse_enumerable(t, parse_options)

@callback parse_enumerable(Enumerable.t(), parse_options()) :: rows()

Eagerly parses an enumerable of CSV data into a list of rows with options.

parse_stream(t)

@callback parse_stream(Enumerable.t()) :: Enumerable.t()

Lazily parses a stream of CSV data into a stream of rows.

parse_stream(t, parse_options)

@callback parse_stream(Enumerable.t(), parse_options()) :: Enumerable.t()

Lazily parses a stream of CSV data into a stream of rows with options.

parse_string(binary)

@callback parse_string(binary()) :: rows()

Parses a CSV string into a list of rows.

parse_string(binary, parse_options)

@callback parse_string(binary(), parse_options()) :: rows()

Parses a CSV string into a list of rows with options.

to_line_stream(t)

@callback to_line_stream(Enumerable.t()) :: Enumerable.t()

Converts a stream of arbitrary binary chunks into a line-oriented stream.
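
A sketch of typical use, assuming parse_stream/2 consumes the resulting line-oriented stream:

["name,a", "ge\njohn,", "27\n"]
|> CSV.to_line_stream()
|> CSV.parse_stream()
|> Enum.to_list()
#=> [["john", "27"]]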

Functions

define(module, options \\ [])

@spec define(module(), define_options()) :: :ok

Defines a new CSV parser/dumper module.

Options

Parsing Options

  • :separator - The field separator(s). Can be a single string (e.g., ",", "::") or a list of strings for multi-separator support (e.g., [",", ";"], [",", "::"]). Separators can be multi-byte. Defaults to ",".

    When multiple separators are specified:

    • Parsing: Any separator in the list is recognized as a field delimiter
    • Dumping: Only the first separator is used for output

    This is useful for parsing files with inconsistent delimiters or mixed comma/semicolon separators (common in European locales).

  • :escape - The escape/quote sequence. Can be multi-byte (e.g., "$$"). Defaults to "\"".

  • :newlines - List of recognized line endings for parsing. Defaults to ["\r\n", "\n"]. Both CRLF and LF are always recognized.

  • :trim_bom - When true, removes the BOM (byte order marker) from the beginning of strings before parsing. Defaults to false.

  • :encoding - Character encoding for input/output. Defaults to :utf8. Supported encodings:

    • :utf8 - UTF-8 (default, no conversion overhead)
    • :latin1 - ISO-8859-1 / Latin-1
    • {:utf16, :little} - UTF-16 Little Endian
    • {:utf16, :big} - UTF-16 Big Endian
    • {:utf32, :little} - UTF-32 Little Endian
    • {:utf32, :big} - UTF-32 Big Endian

    When encoding is not :utf8, input data is converted to UTF-8 for parsing, and output is converted back to the target encoding.

Dumping Options

  • :line_separator - The line separator for dumped output. Defaults to "\n".

  • :dump_bom - When true, includes the appropriate BOM at the start of dumped output. Defaults to false.

  • :reserved - Additional characters that should trigger field escaping when dumping. By default, fields containing the separator, escape character, or newlines are escaped.

  • :escape_formula - A map whose keys are characters that trigger formula-injection protection when dumping. Fields starting with these characters are prefixed with a tab inside quotes, so spreadsheet applications will not evaluate them. Defaults to nil.

    Example: %{"=" => true, "+" => true, "-" => true, "@" => true}
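
For example, a parser that neutralizes leading formula characters on dump (a sketch based on the map above):

RustyCSV.define(MySafeCSV,
  separator: ",",
  escape: "\"",
  escape_formula: %{"=" => true, "+" => true, "-" => true, "@" => true}
)

MySafeCSV.dump_to_iodata([["=SUM(A1:A9)"]])
# the leading "=" is prefixed inside quotes, so spreadsheets won't evaluate it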

Strategy Options

  • :strategy - The default parsing strategy. One of:
    • :simd - SIMD-accelerated via memchr (default, fastest)
    • :basic - Simple byte-by-byte parsing
    • :indexed - Two-phase index-then-extract
    • :parallel - Multi-threaded via rayon
    • :zero_copy - Sub-binary references (NimbleCSV-like memory, max speed)

Documentation

  • :moduledoc - The @moduledoc for the generated module. Set to false to disable documentation.

Examples

# Define a standard CSV parser
RustyCSV.define(MyApp.CSV,
  separator: ",",
  escape: "\"",
  line_separator: "\n"
)

# Use it
MyApp.CSV.parse_string("a,b\n1,2\n")
#=> [["1", "2"]]

# Define a UTF-16 spreadsheet parser
RustyCSV.define(MyApp.Spreadsheet,
  separator: "\t",
  encoding: {:utf16, :little},
  trim_bom: true,
  dump_bom: true
)

# Define a multi-separator parser (comma or semicolon)
RustyCSV.define(MyApp.FlexibleCSV,
  separator: [",", ";"],
  escape: "\""
)

# Parse files with mixed delimiters
MyApp.FlexibleCSV.parse_string("a,b;c\n1;2,3\n", skip_headers: false)
#=> [["a", "b", "c"], ["1", "2", "3"]]

# Dumping uses the first separator (comma)
MyApp.FlexibleCSV.dump_to_iodata([["x", "y"]]) |> IO.iodata_to_binary()
#=> "x,y\n"

# Get the configuration
MyApp.CSV.options()
#=> [separator: ",", escape: "\"", ...]