RustyCSV behaviour (RustyCSV v0.3.9)

RustyCSV is an ultra-fast CSV parsing and dumping library powered by purpose-built Rust NIFs.

It provides a drop-in replacement for NimbleCSV with the same API, while offering multiple parsing strategies optimized for different use cases.

Quick Start

Use the pre-defined RustyCSV.RFC4180 parser:

alias RustyCSV.RFC4180, as: CSV

CSV.parse_string("name,age\njohn,27\n")
#=> [["john", "27"]]

CSV.parse_string("name,age\njohn,27\n", skip_headers: false)
#=> [["name", "age"], ["john", "27"]]

Defining Custom Parsers

You can define custom CSV parsers with define/2:

RustyCSV.define(MyParser,
  separator: ",",
  escape: "\"",
  line_separator: "\n"
)

MyParser.parse_string("a,b\n1,2\n")
#=> [["1", "2"]]

Parsing Strategies

RustyCSV supports multiple parsing strategies via the :strategy option:

  • :simd - SIMD-accelerated scanning via memchr (default, fastest for most files)
  • :basic - Simple byte-by-byte parsing (good for debugging)
  • :indexed - Two-phase index-then-extract (good for re-extracting rows)
  • :parallel - Multi-threaded via rayon (best for very large files, 500 MB+, with complex quoting)
  • :zero_copy - Sub-binary references (NimbleCSV-like memory profile, max speed)

Example:

CSV.parse_string(large_csv, strategy: :parallel)

Scheduling

All parsing NIFs run on BEAM dirty CPU schedulers, so they never block the normal schedulers. Parallel parsing (the :parallel strategy) additionally runs on a dedicated rustycsv-* rayon thread pool to avoid contention with other rayon users in the same VM.
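
Because parsing happens off the normal schedulers, many independent parses can be fanned out concurrently without starving the VM. A minimal sketch (the data/*.csv paths and the fan-out pattern are illustrative, not part of the RustyCSV API):

files = Path.wildcard("data/*.csv")

results =
  files
  |> Task.async_stream(
    fn path -> path |> File.read!() |> CSV.parse_string() end,
    max_concurrency: System.schedulers_online(),
    timeout: :infinity
  )
  |> Enum.map(fn {:ok, rows} -> rows end)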

Streaming

For large files, use parse_stream/2 which uses a bounded-memory streaming parser:

"huge.csv"
|> File.stream!()
|> CSV.parse_stream()
|> Stream.each(&process_row/1)
|> Stream.run()

Streaming parsers are safe to share across processes — the underlying Rust resource is protected by a mutex. However, concurrent access is serialized, so for maximum throughput use one parser per process.
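
To keep one parser per process, give each process its own stream. A sketch (the file names are placeholders):

# Each task builds its own stream, so each gets its own parser resource
["a.csv", "b.csv"]
|> Task.async_stream(fn path ->
  path
  |> File.stream!()
  |> CSV.parse_stream()
  |> Enum.count()
end)
|> Enum.to_list()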

Encoding (Dumping)

Convert rows back to CSV format:

CSV.dump_to_iodata([["name", "age"], ["john", "27"]])
#=> "name,age\njohn,27\n"

Encoding uses a SIMD-accelerated Rust NIF that writes all CSV bytes into a single flat binary. The NIF handles four modes: plain UTF-8, UTF-8 with formula escaping, non-UTF-8 encoding, and both combined.

Difference from NimbleCSV: NimbleCSV's dump_to_iodata/1 returns an iodata list (a nested list of small binaries) that callers typically flatten back into a single binary via IO.iodata_to_binary/1 before writing to a file, sending as a download, or passing to an API. RustyCSV skips that roundtrip — it returns the final binary directly, ready for use with IO.binwrite/2, Conn.send_resp/3, :gen_tcp.send/2, File.write/2, etc. The output bytes are identical; there is nothing to traverse or flatten.

Code that pattern-matches on the return value expecting a list will need adjustment. This is a deliberate trade-off: building an iodata list across the NIF boundary requires allocating one Erlang term per field, separator, and newline, which is 18–63% slower and uses 3–6x more NIF memory than returning the bytes directly.
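
For instance, code that previously flattened NimbleCSV output can drop that step. A sketch (rows stands in for any list of rows):

rows = [["name", "age"], ["john", "27"]]
csv = CSV.dump_to_iodata(rows)

true = is_binary(csv)        # flat binary, not a nested iodata list
File.write!("out.csv", csv)  # no IO.iodata_to_binary/1 round trip needed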

Encoding Strategies

dump_to_iodata/2 accepts a :strategy option:

  • default (no option) — Single-threaded SIMD-accelerated encoder. Writes all CSV bytes into a single flat binary. Best for most workloads.

  • :parallel — Multi-threaded encoding via rayon. Copies all field data into Rust-owned memory, splits rows into chunks, and encodes each chunk on a separate thread. Returns a short list of large binaries. Best for quoting-heavy data (user-generated content with embedded commas/quotes/newlines).

Example:

# Default (recommended for most cases)
CSV.dump_to_iodata(rows)

# Parallel (opt in for quoting-heavy data)
CSV.dump_to_iodata(rows, strategy: :parallel)

High-Throughput Concurrent Exports

The encoding NIF runs on dirty CPU schedulers with per-thread mimalloc arenas, making it suitable for concurrent export workloads — e.g., thousands of users downloading CSV reports simultaneously:

# Phoenix controller — each request encodes independently
rows = MyApp.Reports.fetch_rows(user_id)
csv = MyCSV.dump_to_iodata(rows)
send_download(conn, {:binary, csv}, filename: "report.csv")

For very large exports, use chunked NIF encoding for bounded memory:

MyApp.Reports.stream_rows(user_id)
|> Stream.chunk_every(5_000)
|> Stream.map(&MyCSV.dump_to_iodata/1)
|> Enum.reduce(conn, fn chunk, conn ->
  # conn must already be prepared with send_chunked(conn, 200)
  {:ok, conn} = Plug.Conn.chunk(conn, chunk)
  conn
end)

NimbleCSV Compatibility

RustyCSV is designed as a drop-in replacement for NimbleCSV. The API is identical:

  • parse_string/2 - Parse CSV string to list of rows
  • parse_stream/2 - Lazily parse a stream
  • parse_enumerable/2 - Parse any enumerable
  • dump_to_iodata/2 - Convert rows to iodata (returns a flat binary, not an iodata list — see the "Encoding (Dumping)" section)
  • dump_to_stream/1 - Lazily convert rows to iodata stream
  • to_line_stream/1 - Convert arbitrary chunks to lines
  • options/0 - Return module configuration

RustyCSV extends NimbleCSV with additional options:

  • :strategy on parse_string/2 - Select the parsing approach (:simd, :basic, :indexed, :parallel, :zero_copy)
  • :strategy on dump_to_iodata/2 - Select the encoding approach (default or :parallel)
  • :headers - Return rows as maps instead of lists

Headers-to-Maps

Use the :headers option to get maps instead of lists:

CSV.parse_string("name,age\njohn,27\n", headers: true)
#=> [%{"name" => "john", "age" => "27"}]

CSV.parse_string("name,age\njohn,27\n", headers: [:name, :age])
#=> [%{name: "john", age: "27"}]

CSV.parse_string("name,age\njohn,27\n", headers: ["n", "a"])
#=> [%{"n" => "john", "a" => "27"}]

Streaming also supports headers:

"huge.csv"
|> File.stream!()
|> CSV.parse_stream(headers: true)
|> Stream.each(&process_map/1)
|> Stream.run()

How :headers interacts with :skip_headers

With headers: true, the first row is always consumed as keys — :skip_headers has no effect.

With headers: [keys], the :skip_headers option controls whether the first row is skipped (default: true). Most CSV files have a header row, so skipping it avoids mapping the header row itself into a map. If your file has no header row, pass skip_headers: false:

# File with header row (typical) — first row skipped by default
CSV.parse_string("name,age\njohn,27\n", headers: [:n, :a])
#=> [%{n: "john", a: "27"}]

# File without header row — include all rows
CSV.parse_string("john,27\njane,30\n", headers: [:n, :a], skip_headers: false)
#=> [%{n: "john", a: "27"}, %{n: "jane", a: "30"}]

Edge cases

  • Fewer columns than keys — missing values are nil
  • More columns than keys — extra columns are ignored
  • Duplicate headers — last column wins
  • Empty header field — key is ""
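
A sketch of the first two cases, following the semantics listed above:

# Fewer columns than keys: the missing value becomes nil
CSV.parse_string("john\n", headers: [:name, :age], skip_headers: false)
#=> [%{name: "john", age: nil}]

# More columns than keys: the extra column is ignored
CSV.parse_string("john,27,extra\n", headers: [:name, :age], skip_headers: false)
#=> [%{name: "john", age: "27"}]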

Multi-Separator Support

Like NimbleCSV, RustyCSV supports multiple separator characters. Separators can be single-byte or multi-byte:

RustyCSV.define(MyParser,
  separator: [",", ";"],
  escape: "\""
)

# Any separator in the list is recognized when parsing
MyParser.parse_string("a,b;c\\n1;2,3\\n", skip_headers: false)
#=> [["a", "b", "c"], ["1", "2", "3"]]

# Only the FIRST separator is used when dumping
MyParser.dump_to_iodata([["a", "b", "c"]]) |> IO.iodata_to_binary()
#=> "a,b,c\\n"

Multi-byte separators are supported:

RustyCSV.define(MyParser,
  separator: "::",
  escape: "\""
)

MyParser.parse_string("a::b::c\\n", skip_headers: false)
#=> [["a", "b", "c"]]

You can also mix single-byte and multi-byte separators:

RustyCSV.define(MyParser,
  separator: [",", "::"],
  escape: "\""
)
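
Parsing then recognizes either separator (a sketch, assuming the definition above):

MyParser.parse_string("a,b::c\n", skip_headers: false)
#=> [["a", "b", "c"]]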

Multi-Byte Escape Support

Escape sequences can also be multi-byte:

RustyCSV.define(MyParser,
  separator: ",",
  escape: "$$"
)

MyParser.parse_string("$$hello$$,world\\n", skip_headers: false)
#=> [["hello", "world"]]

Encoding Support

RustyCSV supports character encoding conversion via the :encoding option. This is useful when exporting CSVs with non-ASCII characters (accents, CJK, emoji) that need to open correctly in spreadsheet applications:

alias RustyCSV.Spreadsheet

# Export data with international characters for Excel/Google Sheets/Numbers
rows = [["名前", "年齢"], ["田中", "27"], ["Müller", "35"]]
csv = Spreadsheet.dump_to_iodata(rows) |> IO.iodata_to_binary()
File.write!("export.csv", csv)

The pre-defined RustyCSV.Spreadsheet module outputs UTF-16 LE with BOM, which spreadsheet applications auto-detect correctly. You can also define custom encodings:

RustyCSV.define(MySpreadsheet,
  separator: "\t",
  encoding: {:utf16, :little},
  trim_bom: true,
  dump_bom: true
)
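
A round trip through such a module might look like this (a sketch; per the define/2 docs, non-UTF-8 input is converted to UTF-8 for parsing and the BOM is trimmed):

data = MySpreadsheet.dump_to_iodata([["名前", "年齢"], ["田中", "27"]])
MySpreadsheet.parse_string(data, skip_headers: false)
#=> [["名前", "年齢"], ["田中", "27"]]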

Supported encodings:

  • :utf8 - UTF-8 (default, no conversion overhead)
  • :latin1 - ISO-8859-1 / Latin-1
  • {:utf16, :little} - UTF-16 Little Endian
  • {:utf16, :big} - UTF-16 Big Endian
  • {:utf32, :little} - UTF-32 Little Endian
  • {:utf32, :big} - UTF-32 Big Endian

Summary

Types

define_options() - Options for define/2.

dump_options() - Options for dump_to_iodata/2.

encoding() - Encoding for CSV data.

parse_options() - Options for parsing functions.

row() - A single row of CSV data, represented as a list of field binaries.

rows() - Multiple rows of CSV data.

strategy() - Parsing strategy to use.

Callbacks

dump_to_iodata(t) - Converts rows to iodata in CSV format.

dump_to_stream(t) - Lazily converts rows to a stream of iodata in CSV format.

options() - Returns the options used to define this CSV module.

parse_enumerable(t) - Eagerly parses an enumerable of CSV data into a list of rows.

parse_enumerable(t, parse_options) - Eagerly parses an enumerable of CSV data into a list of rows with options.

parse_stream(t) - Lazily parses a stream of CSV data into a stream of rows.

parse_stream(t, parse_options) - Lazily parses a stream of CSV data into a stream of rows with options.

parse_string(binary) - Parses a CSV string into a list of rows.

parse_string(binary, parse_options) - Parses a CSV string into a list of rows with options.

to_line_stream(t) - Converts a stream of arbitrary binary chunks into a line-oriented stream.

Functions

define(module, options \\ []) - Defines a new CSV parser/dumper module.

Types

define_options()

@type define_options() :: [
  separator: String.t() | [String.t()],
  escape: String.t(),
  newlines: [String.t()],
  line_separator: String.t(),
  trim_bom: boolean(),
  dump_bom: boolean(),
  reserved: [String.t()],
  escape_formula: map() | nil,
  encoding: encoding(),
  strategy: strategy(),
  moduledoc: String.t() | false | nil
]

Options for define/2.

Parsing Options

  • :separator - Field separator character(s). Can be a single string (e.g., ",") or a list of strings for multi-separator support (e.g., [",", ";"]). When parsing, any separator in the list is recognized as a field delimiter. When dumping, only the first separator is used for output. Defaults to ",".
  • :escape - Escape/quote character. Defaults to "\"".
  • :newlines - List of recognized line endings. Defaults to ["\r\n", "\n"].
  • :trim_bom - Remove BOM when parsing strings. Defaults to false.
  • :encoding - Character encoding. Defaults to :utf8. See encoding/0.

Dumping Options

  • :line_separator - Line separator for output. Defaults to "\n".
  • :dump_bom - Include BOM in output. Defaults to false.
  • :reserved - Additional characters requiring escaping.
  • :escape_formula - Map for formula injection prevention. Defaults to nil. When set, fields starting with trigger characters are prefixed with a replacement string inside quotes. Handled natively in the Rust NIF.

Other Options

  • :strategy - Default parsing strategy. Defaults to :simd.
  • :moduledoc - Documentation for the generated module.

dump_options()

@type dump_options() :: [{:strategy, :parallel}]

Options for dump_to_iodata/2.

Options

  • :strategy - Encoding strategy to use. Defaults to the single-threaded SIMD-accelerated encoder (no option needed). Pass :parallel for multi-threaded encoding via rayon, which is faster for quoting-heavy data.

encoding()

@type encoding() ::
  :utf8 | :latin1 | {:utf16, :little | :big} | {:utf32, :little | :big}

Encoding for CSV data.

Supported encodings:

  • :utf8 - UTF-8 (default, no conversion)
  • :latin1 - ISO-8859-1 / Latin-1
  • {:utf16, :little} - UTF-16 Little Endian
  • {:utf16, :big} - UTF-16 Big Endian
  • {:utf32, :little} - UTF-32 Little Endian
  • {:utf32, :big} - UTF-32 Big Endian

parse_options()

@type parse_options() :: [
  skip_headers: boolean(),
  strategy: strategy(),
  headers: boolean() | [atom() | String.t()],
  chunk_size: pos_integer(),
  batch_size: pos_integer(),
  max_buffer_size: pos_integer()
]

Options for parsing functions.

Common Options

  • :skip_headers - When true, skips the first row. Defaults to true.
  • :strategy - The parsing strategy to use. One of:
    • :simd - SIMD-accelerated (default)
    • :basic - Simple byte-by-byte
    • :indexed - Two-phase index-then-extract
    • :parallel - Multi-threaded via rayon
    • :zero_copy - Sub-binary references (keeps parent binary alive)
  • :headers - Controls header handling. Defaults to false.
    • false - Return rows as lists (default behavior)
    • true - Use first row as string keys, return list of maps. :skip_headers is ignored (first row is always consumed as keys).
    • list of atoms or strings - Use as explicit keys, return list of maps. The first row is skipped by default (:skip_headers applies). Pass skip_headers: false if the file has no header row.

Streaming Options

  • :chunk_size - Bytes per IO read for streaming. Defaults to 65536.
  • :batch_size - Rows per batch for streaming. Defaults to 1000.
  • :max_buffer_size - Maximum streaming buffer size in bytes. Defaults to 268_435_456 (256 MB). If the internal buffer exceeds this limit during streaming_feed/2, a :buffer_overflow exception is raised. Increase this if your data contains rows longer than 256 MB. Decrease it to fail faster on malformed input that lacks newlines.
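
For example, streaming limits can be tuned per call (the values below are illustrative only):

"huge.csv"
|> File.stream!()
|> CSV.parse_stream(batch_size: 2_000, max_buffer_size: 64 * 1024 * 1024)
|> Stream.run()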

row()

@type row() :: [binary()]

A single row of CSV data, represented as a list of field binaries.

rows()

@type rows() :: [row()]

Multiple rows of CSV data.

strategy()

@type strategy() :: :simd | :basic | :indexed | :parallel | :zero_copy

Parsing strategy to use.

These strategies apply to parse_string/2 and other parsing functions. For encoding strategies, see dump_options/0.

Available Strategies

  • :simd - SIMD-accelerated scanning via memchr (default, fastest for most files)
  • :basic - Simple byte-by-byte parsing (useful for debugging)
  • :indexed - Two-phase index-then-extract (good for re-extracting rows)
  • :parallel - Multi-threaded via rayon (best for very large files, 500 MB+, with complex quoting)
  • :zero_copy - Sub-binary references (maximum speed, keeps parent binary alive)

Memory Model Comparison

All strategies use boundary-based parsing: the NIF scans the input to find field boundaries, then returns sub-binary references for clean fields (zero copy) and only allocates new binaries for fields that require unescaping. The input binary is kept alive while any sub-binary references it.

Strategy      Best When
:simd         Default, fastest for most files
:basic        Debugging, baseline
:indexed      Row range extraction
:parallel     Large files (500 MB+), complex quoting
:zero_copy    Speed-critical, short-lived results

Examples

# Default SIMD strategy
CSV.parse_string(data)

# Parallel for large files
CSV.parse_string(large_data, strategy: :parallel)

# Zero-copy for maximum speed
CSV.parse_string(data, strategy: :zero_copy)

Callbacks

dump_to_iodata(t)

@callback dump_to_iodata(Enumerable.t()) :: iodata()

Converts rows to iodata in CSV format.

Returns a single flat binary (not an iodata list). A binary is valid iodata/0, so it works with IO.binwrite/2, IO.iodata_to_binary/1, etc. See "Encoding (Dumping)" in the module doc for details on how this differs from NimbleCSV.

Options

  • :strategy - Encoding strategy. Defaults to the single-threaded SIMD-accelerated encoder. Pass :parallel for multi-threaded encoding via rayon, which is faster for quoting-heavy data.

dump_to_iodata(t, dump_options)

@callback dump_to_iodata(Enumerable.t(), dump_options()) :: iodata()

dump_to_stream(t)

@callback dump_to_stream(Enumerable.t()) :: Enumerable.t()

Lazily converts rows to a stream of iodata in CSV format.

options()

@callback options() :: keyword()

Returns the options used to define this CSV module.

parse_enumerable(t)

@callback parse_enumerable(Enumerable.t()) :: rows()

Eagerly parses an enumerable of CSV data into a list of rows.

parse_enumerable(t, parse_options)

@callback parse_enumerable(Enumerable.t(), parse_options()) :: rows()

Eagerly parses an enumerable of CSV data into a list of rows with options.

parse_stream(t)

@callback parse_stream(Enumerable.t()) :: Enumerable.t()

Lazily parses a stream of CSV data into a stream of rows.

parse_stream(t, parse_options)

@callback parse_stream(Enumerable.t(), parse_options()) :: Enumerable.t()

Lazily parses a stream of CSV data into a stream of rows with options.

parse_string(binary)

@callback parse_string(binary()) :: rows()

Parses a CSV string into a list of rows.

parse_string(binary, parse_options)

@callback parse_string(binary(), parse_options()) :: rows()

Parses a CSV string into a list of rows with options.

to_line_stream(t)

@callback to_line_stream(Enumerable.t()) :: Enumerable.t()

Converts a stream of arbitrary binary chunks into a line-oriented stream.
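
A sketch of typical use, assuming parse_stream/2 consumes the resulting line-oriented stream:

["name,a", "ge\njohn,", "27\n"]
|> CSV.to_line_stream()
|> CSV.parse_stream()
|> Enum.to_list()
#=> [["john", "27"]]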

Functions

define(module, options \\ [])

@spec define(module(), define_options()) :: :ok

Defines a new CSV parser/dumper module.

Options

Parsing Options

  • :separator - The field separator(s). Can be a single string (e.g., ",", "::") or a list of strings for multi-separator support (e.g., [",", ";"], [",", "::"]). Separators can be multi-byte. Defaults to ",".

    When multiple separators are specified:

    • Parsing: Any separator in the list is recognized as a field delimiter
    • Dumping: Only the first separator is used for output

    This is useful for parsing files with inconsistent delimiters or mixed comma/semicolon separators (common in European locales).

  • :escape - The escape/quote sequence. Can be multi-byte (e.g., "$$"). Defaults to "\"".

  • :newlines - List of recognized line endings for parsing. Defaults to ["\r\n", "\n"]. Both CRLF and LF are always recognized.

  • :trim_bom - When true, removes the BOM (byte order marker) from the beginning of strings before parsing. Defaults to false.

  • :encoding - Character encoding for input/output. Defaults to :utf8. Supported encodings:

    • :utf8 - UTF-8 (default, no conversion overhead)
    • :latin1 - ISO-8859-1 / Latin-1
    • {:utf16, :little} - UTF-16 Little Endian
    • {:utf16, :big} - UTF-16 Big Endian
    • {:utf32, :little} - UTF-32 Little Endian
    • {:utf32, :big} - UTF-32 Big Endian

    When encoding is not :utf8, input data is converted to UTF-8 for parsing, and output is converted back to the target encoding.

Dumping Options

  • :line_separator - The line separator for dumped output. Defaults to "\n".

  • :dump_bom - When true, includes the appropriate BOM at the start of dumped output. Defaults to false.

  • :reserved - Additional characters that should trigger field escaping when dumping. By default, fields containing the separator, escape character, or newlines are escaped.

  • :escape_formula - A map whose keys are characters that trigger formula-injection protection when dumping. Fields starting with these characters are prefixed with a tab inside quotes, so spreadsheet applications will not evaluate them. Defaults to nil.

    Example: %{"=" => true, "+" => true, "-" => true, "@" => true}
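
For example, a parser that neutralizes leading formula characters on dump (a sketch based on the map above):

RustyCSV.define(MySafeCSV,
  separator: ",",
  escape: "\"",
  escape_formula: %{"=" => true, "+" => true, "-" => true, "@" => true}
)

MySafeCSV.dump_to_iodata([["=SUM(A1:A9)"]])
# the leading "=" is prefixed inside quotes, so spreadsheets won't evaluate it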

Strategy Options

  • :strategy - The default parsing strategy. One of:
    • :simd - SIMD-accelerated via memchr (default, fastest)
    • :basic - Simple byte-by-byte parsing
    • :indexed - Two-phase index-then-extract
    • :parallel - Multi-threaded via rayon
    • :zero_copy - Sub-binary references (NimbleCSV-like memory, max speed)

Documentation

  • :moduledoc - The @moduledoc for the generated module. Set to false to disable documentation.

Examples

# Define a standard CSV parser
RustyCSV.define(MyApp.CSV,
  separator: ",",
  escape: "\"",
  line_separator: "\n"
)

# Use it
MyApp.CSV.parse_string("a,b\n1,2\n")
#=> [["1", "2"]]

# Define a UTF-16 spreadsheet parser
RustyCSV.define(MyApp.Spreadsheet,
  separator: "\t",
  encoding: {:utf16, :little},
  trim_bom: true,
  dump_bom: true
)

# Define a multi-separator parser (comma or semicolon)
RustyCSV.define(MyApp.FlexibleCSV,
  separator: [",", ";"],
  escape: "\""
)

# Parse files with mixed delimiters
MyApp.FlexibleCSV.parse_string("a,b;c\n1;2,3\n", skip_headers: false)
#=> [["a", "b", "c"], ["1", "2", "3"]]

# Dumping uses the first separator (comma)
MyApp.FlexibleCSV.dump_to_iodata([["x", "y"]]) |> IO.iodata_to_binary()
#=> "x,y\n"

# Get the configuration
MyApp.CSV.options()
#=> [separator: ",", escape: "\"", ...]