Ultra-fast CSV parsing and encoding for Elixir. A purpose-built Rust NIF with SIMD acceleration, parallel parsing, and bounded-memory streaming. Drop-in replacement for NimbleCSV.

Why RustyCSV?

The problem: existing CSV parsing in Elixir has three limitations:

  1. Speed: Pure Elixir parsing, while well-optimized, can't match native code with SIMD acceleration for large files.

  2. Flexibility: Different workloads benefit from different strategies—parallel processing for huge files, streaming for unbounded data.

  3. Binary chunk streaming: Pure Elixir streaming is line-delimited; handling arbitrary binary chunks (network streams, decompressed data, etc.) requires extra buffering work.

Why not wrap an existing Rust CSV library? The excellent csv crate is designed for Rust workflows, not BEAM integration. Wrapping it would require serializing data between Rust and Erlang formats—adding overhead and losing the benefits of direct term construction.

RustyCSV's approach: a Rust NIF purpose-built for BEAM integration, with no wrapped CSV libraries, no unnecessary abstractions, and modular features you opt into. It focuses on:

  1. Bounded memory streaming - Process multi-GB files with ~64KB memory footprint
  2. Sub-binary field references - Near-zero BEAM allocation; fields reference the input binary directly
  3. Multiple strategies - Choose SIMD, parallel, or streaming based on your workload
  4. Reduced scheduler load - Parallel strategy runs on dirty CPU schedulers
  5. Full NimbleCSV compatibility - Same API, drop-in replacement

Feature Comparison

| Feature | RustyCSV | Pure Elixir (NimbleCSV) |
| --- | --- | --- |
| Parsing strategies | 3 (SIMD, parallel, streaming) | 1 |
| SIMD acceleration | ✅ via std::simd portable SIMD | - |
| Parallel parsing | ✅ via rayon | - |
| Binary chunk streaming | ✅ arbitrary chunks | ❌ line-delimited only |
| Multi-separator support | [",", ";"], "::" | - |
| Encoding support | ✅ UTF-8, UTF-16, Latin-1, UTF-32 | - |
| Memory model | Sub-binary references | Sub-binary references |
| NIF encoding | ✅ Returns flat binary (same bytes, ready to use — no flattening needed) | Returns iodata list (typically flattened by caller) |
| High-performance allocator | ✅ mimalloc | System |
| Headers-to-maps | headers: true or explicit keys | - |
| RFC 4180 compliant | ✅ 464 tests | - |
| Benchmark (7MB CSV) | ~20ms | ~215ms |

Purpose-Built for Elixir

RustyCSV isn't a wrapper around an existing Rust CSV library. It's custom-built from the ground up for optimal Elixir/BEAM integration:

  • Boundary-based sub-binary fields - SIMD scanner finds field boundaries, then creates BEAM sub-binary references directly (zero-copy for clean fields, copy only when unescaping """)
  • Dirty scheduler aware - long-running parallel parses run on dirty CPU schedulers, never blocking your BEAM schedulers
  • ResourceArc-based streaming - stateful parser properly integrated with BEAM's garbage collector
  • Direct term building - parsing results go straight to BEAM terms; encoding writes directly to a flat binary
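
You can observe the sub-binary behavior from Elixir with :binary.referenced_byte_size/1, which reports the size of the underlying binary a term points into. A rough sketch (sizes are illustrative and assume the input is large enough to be reference-counted):

csv = "name,age\n" <> String.duplicate("somebody,42\n", 1_000)
[[name | _] | _] = RustyCSV.RFC4180.parse_string(csv)

byte_size(name)                    #=> 8 (just the field)
:binary.referenced_byte_size(name) #=> 12_009 (the whole input binary)

The flip side, noted in the memory-model table below: the input binary stays alive until every field referencing it is garbage-collected.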

Parsing Strategies

Choose the right tool for the job:

| Strategy | Use Case | How It Works |
| --- | --- | --- |
| :simd | Default. Fastest for most files | Single-pass SIMD structural scanner via std::simd |
| :parallel | Files 500MB+ with complex quoting | Multi-threaded row parsing via rayon |
| :streaming | Unbounded/huge files | Bounded-memory chunk processing |

Memory Model:

All batch strategies use boundary-based sub-binaries — the SIMD scanner finds field boundaries, then creates BEAM sub-binary references that point into the original input binary. Only fields requiring quote unescaping (""") are copied.

| Strategy | Memory Model | Input Binary | Best When |
| --- | --- | --- | --- |
| :simd | Sub-binary | Kept alive until fields GC'd | Default — fast, low memory |
| :parallel | Sub-binary | Kept alive until fields GC'd | Large files, many cores |
| :streaming | Copy (chunked) | Freed per chunk | Unbounded files |

# Automatic strategy selection
CSV.parse_string(data)                           # Uses :simd (default)
CSV.parse_string(huge_data, strategy: :parallel) # 500MB+ files with complex quoting
File.stream!("huge.csv") |> CSV.parse_stream()   # Bounded memory

Installation

def deps do
  [{:rusty_csv, "~> 0.3.9"}]
end

Requires Rust nightly (for std::simd portable SIMD — see the SIMD and Rust Nightly section below). Automatically compiled via Rustler.

Quick Start

alias RustyCSV.RFC4180, as: CSV

# Parse CSV (skips headers by default, like NimbleCSV)
CSV.parse_string("name,age\njohn,27\njane,30\n")
#=> [["john", "27"], ["jane", "30"]]

# Include headers
CSV.parse_string(csv, skip_headers: false)
#=> [["name", "age"], ["john", "27"], ["jane", "30"]]

# Stream large files with bounded memory
"huge.csv"
|> File.stream!()
|> CSV.parse_stream()
|> Stream.each(&process_row/1)
|> Stream.run()

# Parse to maps with headers
CSV.parse_string("name,age\njohn,27\njane,30\n", headers: true)
#=> [%{"name" => "john", "age" => "27"}, %{"name" => "jane", "age" => "30"}]

# With atom keys
CSV.parse_string("name,age\njohn,27\n", headers: [:name, :age])
#=> [%{name: "john", age: "27"}]

# Dump back to CSV
CSV.dump_to_iodata([["name", "age"], ["john", "27"]])
#=> "name,age\r\njohn,27\r\n"

Drop-in NimbleCSV Replacement

# Before
alias NimbleCSV.RFC4180, as: CSV

# After
alias RustyCSV.RFC4180, as: CSV

# That's it. Same API, 3.5-9x faster on typical parsing workloads.

All NimbleCSV functions are supported:

| Function | Description |
| --- | --- |
| parse_string/2 | Parse CSV string to list of rows (or maps with headers:) |
| parse_stream/2 | Lazily parse a stream (or maps with headers:) |
| parse_enumerable/2 | Parse any enumerable |
| dump_to_iodata/2* | Convert rows to a flat binary (strategy: :parallel for quoting-heavy data) |
| dump_to_stream/1* | Lazily convert rows to a stream of binaries (one per row) |
| to_line_stream/1 | Convert arbitrary chunks to lines |
| options/0 | Return module configuration |

* NimbleCSV returns iodata lists; RustyCSV returns flat binaries (same bytes, no flattening needed).
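
to_line_stream/1 re-chunks arbitrary binary fragments (sockets, decompressed data) into lines. A sketch, assuming NimbleCSV-compatible semantics:

["name,a", "ge\njohn", ",27\n"]
|> CSV.to_line_stream()
|> CSV.parse_stream(skip_headers: false)
|> Enum.to_list()
#=> [["name", "age"], ["john", "27"]]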

Benchmarks

3.5x-9x faster than pure Elixir on synthetic benchmarks for typical data. Up to 18x faster on heavily quoted CSVs.

13-28% faster than pure Elixir on real-world TSV files (10K+ rows). Speedup varies by data complexity—quoted fields with escapes show the largest gains.

mix run bench/csv_bench.exs

See docs/BENCHMARK.md for detailed methodology and results.

When to Use RustyCSV

| Scenario | Recommendation |
| --- | --- |
| Large files (1-500MB) | ✅ Use :simd (default) - biggest wins |
| Very large files (500MB+) | ✅ Use :parallel with complex quoted data |
| Huge/unbounded files | ✅ Use parse_stream/2 - bounded memory |
| Maximum speed | ✅ Use :simd (default) - sub-binary refs, 5-14x less memory |
| High-throughput APIs | ✅ Reduced scheduler load |
| Small files (<100KB) | Either works - NIF overhead negligible |
| Need pure Elixir | Use NimbleCSV |

Custom Parsers

Define parsers with custom separators and options:

# TSV parser
RustyCSV.define(MyApp.TSV,
  separator: "\t",
  escape: "\"",
  line_separator: "\n"
)

# Pipe-separated
RustyCSV.define(MyApp.PSV,
  separator: "|",
  escape: "\"",
  line_separator: "\n"
)

MyApp.TSV.parse_string("a\tb\tc\n1\t2\t3\n")
#=> [["1", "2", "3"]]

Define Options

| Option | Description | Default |
| --- | --- | --- |
| :separator | Field separator(s) — string or list of strings (multi-byte OK) | "," |
| :escape | Quote/escape sequence (multi-byte OK) | "\"" |
| :line_separator | Line ending for dumps | "\r\n" |
| :newlines | Accepted line endings | ["\r\n", "\n"] |
| :encoding | Character encoding (see below) | :utf8 |
| :trim_bom | Remove BOM when parsing | false |
| :dump_bom | Add BOM when dumping | false |
| :escape_formula | Escape formula injection | nil |
| :strategy | Default parsing strategy | :simd |
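
For example, a definition combining several of these options (MyApp.WindowsCSV is an illustrative name; every option comes from the table above):

RustyCSV.define(MyApp.WindowsCSV,
  separator: ";",
  line_separator: "\r\n",
  newlines: ["\r\n", "\n"],
  trim_bom: true,
  dump_bom: true
)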

Multi-Separator Support

For files with inconsistent delimiters (common in European locales), specify multiple separators:

# Accept both comma and semicolon as delimiters
RustyCSV.define(MyApp.FlexibleCSV,
  separator: [",", ";"],
  escape: "\""
)

# Parse files with mixed separators
MyApp.FlexibleCSV.parse_string("a,b;c\n1;2,3\n", skip_headers: false)
#=> [["a", "b", "c"], ["1", "2", "3"]]

# Dumping uses only the FIRST separator
MyApp.FlexibleCSV.dump_to_iodata([["x", "y", "z"]]) |> IO.iodata_to_binary()
#=> "x,y,z\n"

Separators and escape sequences can be multi-byte:

# Double-colon separator
RustyCSV.define(MyApp.DoubleColon,
  separator: "::",
  escape: "\""
)

# Multi-byte escape
RustyCSV.define(MyApp.DollarEscape,
  separator: ",",
  escape: "$$"
)

# Mix single-byte and multi-byte separators
RustyCSV.define(MyApp.Mixed,
  separator: [",", "::"],
  escape: "\""
)

Headers-to-Maps

Return rows as maps instead of lists using the :headers option:

# First row becomes string keys
CSV.parse_string("name,age\njohn,27\njane,30\n", headers: true)
#=> [%{"name" => "john", "age" => "27"}, %{"name" => "jane", "age" => "30"}]

# Explicit atom keys (first row skipped by default)
CSV.parse_string("name,age\njohn,27\n", headers: [:name, :age])
#=> [%{name: "john", age: "27"}]

# Explicit string keys
CSV.parse_string("name,age\njohn,27\n", headers: ["n", "a"])
#=> [%{"n" => "john", "a" => "27"}]

# Works with streaming too
"huge.csv"
|> File.stream!()
|> CSV.parse_stream(headers: true)
|> Stream.each(&process_map/1)
|> Stream.run()

Edge cases: rows with fewer columns than headers are padded with nil, extra columns are ignored, duplicate headers keep the last value, and empty headers become "".
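
For instance, sketching the first two rules (outputs follow from the behavior described above):

CSV.parse_string("name,age\njohn\n", headers: true)
#=> [%{"name" => "john", "age" => nil}]

CSV.parse_string("name,age\njohn,27,extra\n", headers: true)
#=> [%{"name" => "john", "age" => "27"}]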

Key interning is done Rust-side for parse_string — header terms are allocated once and reused across all rows. Streaming uses Elixir-side Stream.transform for map conversion.

Encoding Support

RustyCSV supports character encoding conversion:

# UTF-16 Little Endian (Excel/Windows exports)
RustyCSV.define(MyApp.Spreadsheet,
  separator: "\t",
  encoding: {:utf16, :little},
  trim_bom: true,
  dump_bom: true
)

# Or use the pre-defined spreadsheet parser
alias RustyCSV.Spreadsheet
Spreadsheet.parse_string(utf16_data)

| Encoding | Description |
| --- | --- |
| :utf8 | UTF-8 (default, no conversion overhead) |
| :latin1 | ISO-8859-1 / Latin-1 |
| {:utf16, :little} | UTF-16 Little Endian |
| {:utf16, :big} | UTF-16 Big Endian |
| {:utf32, :little} | UTF-32 Little Endian |
| {:utf32, :big} | UTF-32 Big Endian |
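
A Latin-1 parser, for example (MyApp.Latin1CSV is an illustrative name; the :encoding value comes from the table above):

RustyCSV.define(MyApp.Latin1CSV,
  separator: ",",
  encoding: :latin1
)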

RFC 4180 Compliance

RustyCSV is fully RFC 4180 compliant and validated against industry-standard test suites:

| Test Suite | Tests | Status |
| --- | --- | --- |
| csv-spectrum | 17 | ✅ All pass |
| csv-test-data | 23 | ✅ All pass |
| Edge cases (PapaParse-inspired) | 53 | ✅ All pass |
| Core + NimbleCSV compat | 36 | ✅ All pass |
| Encoding (UTF-16, Latin-1, etc.) | 20 | ✅ All pass |
| Multi-separator support | 19 | ✅ All pass |
| Multi-byte separator | 13 | ✅ All pass |
| Multi-byte escape | 12 | ✅ All pass |
| Native API separator/escape | 40 | ✅ All pass |
| Headers-to-maps | 97 | ✅ All pass |
| Custom newlines | 18 | ✅ All pass |
| Streaming safety | 12 | ✅ All pass |
| Concurrent access | 7 | ✅ All pass |
| Total | 464 | |

See docs/COMPLIANCE.md for full compliance details.

How It Works

Why Not Wrap the Rust csv Crate?

The Rust ecosystem has excellent CSV libraries like csv and polars. But wrapping them for Elixir has overhead:

  1. Parse CSV → Rust data structures (allocation)
  2. Convert Rust structs → Erlang terms (allocation + serialization)
  3. Return to BEAM

RustyCSV eliminates the middle step by parsing directly into BEAM terms:

  1. Parse CSV → Erlang terms directly (single pass)
  2. Return to BEAM

Strategy Implementations

All batch strategies share a single-pass SIMD structural scanner that finds field boundaries; BEAM sub-binary references are then built directly from those boundaries.

| Strategy | Scanning | Term Building | Memory | Best For |
| --- | --- | --- | --- | --- |
| :simd | SIMD structural scanner via std::simd | Boundary → sub-binary | O(n) | Default, fastest for most files |
| :parallel | SIMD structural scanner | Boundary → sub-binary | O(n) | Large files with many cores |
| :streaming | Byte-by-byte | Copy (chunked) | O(chunk) | Unbounded/huge files |

Shared across batch strategies (:simd, :parallel):

  • Single-pass SIMD structural scanner (finds all unquoted separators and row endings in one sweep)
  • Boundary-based sub-binary field references (near-zero BEAM allocation)
  • Hybrid unescaping: sub-binaries for clean fields, copy only when """ unescaping needed
  • Direct Erlang term construction via Rustler (no serde)
  • mimalloc high-performance allocator

:parallel specifics:

  • Runs on dirty CPU schedulers to avoid blocking BEAM
  • Rayon workers compute boundary pairs (pure index arithmetic on the shared structural index) — no data copying
  • Main thread builds BEAM sub-binary terms from boundaries (Env is not thread-safe, so term construction is serial)

:streaming specifics:

  • ResourceArc integrates parser state with BEAM GC
  • Tracks quote state across chunk boundaries
  • Copies field data (since input chunks are temporary)

NIF-Accelerated Encoding

RustyCSV's dump_to_iodata returns a single flat binary rather than an iodata list. The output bytes are identical to NimbleCSV's, and the flat binary works directly with IO.binwrite/2, File.write/2, or Plug.Conn.send_resp/3 without any flattening step.

Note: NimbleCSV returns an iodata list (nested small binaries) that callers typically flatten back into a binary. RustyCSV skips that roundtrip. Code that pattern-matches on dump_to_iodata expecting a list will need adjustment — the return value is a binary, which is valid iodata/0.
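
For example:

csv = CSV.dump_to_iodata([["a", "b"], ["1", "2"]])
is_binary(csv)               #=> true (already flat)
File.write!("out.csv", csv)  # no IO.iodata_to_binary/1 round-trip needed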

See docs/BENCHMARK.md for encoding throughput and memory numbers.

Encoding Strategies

dump_to_iodata/2 accepts a :strategy option:

# Default: single-threaded flat binary encoder.
# SIMD scan for quoting, writes directly to output buffer.
# Best for most workloads.
CSV.dump_to_iodata(rows)

# Parallel: multi-threaded encoding via rayon.
# Copies field data into Rust-owned memory, encodes chunks on separate threads.
# Faster when fields frequently need quoting (commas, quotes, newlines in values).
CSV.dump_to_iodata(rows, strategy: :parallel)

| Encoding Strategy | Best For | Output |
| --- | --- | --- |
| default | Most data — clean fields, moderate quoting | Single flat binary |
| :parallel | Quoting-heavy data (user-generated content, free-text with embedded commas/quotes/newlines) | Short list of large binaries |

High-Throughput Concurrent Exports

RustyCSV's encoding NIF runs on BEAM dirty CPU schedulers with per-thread mimalloc arenas, making it well-suited for concurrent export workloads (e.g., thousands of users downloading CSV reports simultaneously in a Phoenix application):

# Phoenix controller — concurrent CSV download
def export(conn, %{"id" => id}) do
  rows = MyApp.Reports.fetch_rows(id)
  csv = MyCSV.dump_to_iodata(rows)

  conn
  |> put_resp_content_type("text/csv")
  |> put_resp_header("content-disposition", ~s(attachment; filename="report.csv"))
  |> send_resp(200, csv)
end

For very large exports where you want bounded memory, use chunked NIF encoding:

# Chunked encoding — bounded memory with NIF speed (assumes alias Plug.Conn)
def stream_export(conn, %{"id" => id}) do
  conn =
    conn
    |> put_resp_content_type("text/csv")
    |> put_resp_header("content-disposition", ~s(attachment; filename="report.csv"))
    |> send_chunked(200)

  MyApp.Reports.stream_rows(id)
  |> Stream.chunk_every(5_000)
  |> Stream.each(fn rows ->
    csv = MyCSV.dump_to_iodata(rows)
    {:ok, _conn} = Conn.chunk(conn, csv)
  end)
  |> Stream.run()

  conn
end

Key characteristics for concurrent workloads:

  • Each NIF call is independent — no shared mutable state between requests
  • Dirty CPU schedulers prevent encoding from blocking normal BEAM schedulers
  • mimalloc's per-thread arenas avoid allocator contention under concurrency
  • The real bottleneck is typically DB queries and connection pool sizing, not CSV encoding

Architecture

RustyCSV is built with a modular Rust architecture:

native/rustycsv/src/
├── lib.rs                 # NIF entry points, separator/escape decoding, dispatch
├── core/
│   ├── simd_scanner.rs    # Single-pass SIMD structural scanner (prefix-XOR quote detection)
│   ├── simd_index.rs      # StructuralIndex, RowIter, RowFieldIter, FieldIter
│   ├── scanner.rs         # Byte-level helpers (separator matching)
│   ├── field.rs           # Field extraction, quote handling
│   └── newlines.rs        # Custom newline support
├── strategy/
│   ├── direct.rs          # Basic + SIMD strategies (single-byte)
│   ├── two_phase.rs       # Indexed strategy (single-byte)
│   ├── streaming.rs       # Stateful streaming parser (single-byte)
│   ├── parallel.rs        # Rayon-based parallel parsing (single-byte)
│   ├── zero_copy.rs       # Sub-binary reference parsing (single-byte)
│   ├── general.rs         # Multi-byte separator/escape (all strategies)
│   ├── encode.rs          # SIMD field scanning, quoting helpers
│   └── encoding.rs        # UTF-8 → target encoding converters (UTF-16, Latin-1, etc.)
├── term.rs                # BEAM term building (sub-binary + copy fallback)
└── resource.rs            # ResourceArc for streaming state

See docs/ARCHITECTURE.md for detailed implementation notes.

Memory Efficiency

The streaming parser uses bounded memory regardless of file size:

# Process a 10GB file with ~64KB memory
File.stream!("huge.csv", [], 65_536)
|> CSV.parse_stream()
|> Stream.each(&process/1)
|> Stream.run()

Streaming Buffer Limit

The streaming parser enforces a maximum internal buffer size (default 256 MB) to prevent unbounded memory growth when parsing data without newlines or with very long rows. If a feed exceeds this limit, a :buffer_overflow exception is raised.

To adjust the limit, pass :max_buffer_size (in bytes):

# Increase for files with very long rows
CSV.parse_stream(stream, max_buffer_size: 512 * 1024 * 1024)

# Decrease to fail fast on malformed input
CSV.parse_stream(stream, max_buffer_size: 10 * 1024 * 1024)

# Also works on direct streaming APIs
RustyCSV.Streaming.stream_file("data.csv", max_buffer_size: 1_073_741_824)

High-Performance Allocator

RustyCSV uses mimalloc as the default allocator, providing:

  • 10-20% faster allocation for many small objects
  • Reduced memory fragmentation
  • Zero tracking overhead in default configuration

To disable mimalloc (for exotic build targets):

# In mix.exs
def project do
  [
    # Force local build without mimalloc
    compilers: [:rustler] ++ Mix.compilers(),
    rustler_crates: [rustycsv: [features: []]]
  ]
end

Optional Memory Tracking (Benchmarking Only)

For profiling Rust-side memory during development and benchmarking. Not intended for production — it wraps every allocation with atomic counter updates, adding overhead. This is also the only source of unsafe in the codebase (required by the GlobalAlloc trait). Enable the memory_tracking feature:

# In native/rustycsv/Cargo.toml
[features]
default = ["mimalloc", "memory_tracking"]

Then use:

RustyCSV.Native.reset_rust_memory_stats()
result = CSV.parse_string(large_csv)
peak = RustyCSV.Native.get_rust_memory_peak()
IO.puts("Peak Rust memory: #{peak / 1_000_000} MB")

When disabled (default), these functions return 0 with zero overhead.

SIMD and Rust Nightly

RustyCSV uses Rust's std::simd portable SIMD, which currently requires nightly via #![feature(portable_simd)]. However, RustyCSV only uses the stabilization-safe subset of the API:

  • Simd::from_slice, splat, simd_eq, bitwise ops (&, |, !)
  • Mask::to_bitmask() for extracting bit positions

We deliberately avoid the APIs that are blocking stabilization: swizzle, scatter/gather, and lane-count generics (LaneCount<N>: SupportedLaneCount). The items blocking the portable_simd tracking issue — mask semantics, supported vector size limits, and swizzle design — are unrelated to the operations we use. When std::simd stabilizes, RustyCSV will work on stable Rust with no code changes.

The prefix-XOR quote detection uses a portable shift-and-xor cascade rather than architecture-specific intrinsics, keeping the entire scanner free of unsafe code. Benchmarks show no measurable difference versus intrinsics for the 16/32-bit masks used in CSV scanning.
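
To make the prefix-XOR idea concrete, here is a minimal Elixir sketch of the same cascade on a 16-bit mask (illustrative only; the real scanner does this in Rust over SIMD lanes):

import Bitwise

# Input: a bitmask with a 1 at every quote position.
# Output: a bitmask marking positions inside a quoted region.
# Bit i of the result is the XOR of input bits 0..i, so the
# "inside quotes" state toggles at each quote character.
prefix_xor = fn m ->
  m = bxor(m, m <<< 1)
  m = bxor(m, m <<< 2)
  m = bxor(m, m <<< 4)
  m = bxor(m, m <<< 8)
  band(m, 0xFFFF)
end

# Quotes at bit positions 1 and 5: bits 1..4 come out "inside".
prefix_xor.(0b0000_0000_0010_0010)
#=> 30 (0b0000_0000_0001_1110)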

Development

# Install dependencies
mix deps.get

# Compile (includes Rust NIF)
mix compile

# Run tests (464 tests)
mix test

# Run benchmarks
mix run bench/csv_bench.exs

# Code quality
mix credo --strict
mix dialyzer

License

MIT License - see LICENSE file for details.


RustyCSV - Purpose-built Rust NIF for ultra-fast CSV parsing in Elixir.