RustyCSV Compliance & Validation


RustyCSV takes correctness seriously. With 464 tests across multiple test suites, including industry-standard validation suites used by CSV parsers across multiple languages, RustyCSV is one of the most thoroughly tested CSV libraries available for Elixir.

This document describes RFC 4180 compliance and the validation methodology.

RFC 4180 Compliance

RustyCSV.RFC4180 is fully compliant with RFC 4180 (Common Format and MIME Type for Comma-Separated Values).

RFC 4180 Requirements

| Section | Requirement | Status |
|---------|-------------|--------|
| 2.1 | Records separated by line breaks (CRLF) | ✅ Accepts CRLF and LF; outputs CRLF |
| 2.2 | Last record may or may not have trailing line break | |
| 2.3 | Optional header line | ✅ Via skip_headers and headers: options |
| 2.4 | Each record should have same number of fields | ✅ Parses variable-width rows |
| 2.5 | Spaces are part of the field | ✅ Preserved exactly |
| 2.6 | Fields may be enclosed in double quotes | |
| 2.6 | Fields containing CRLF must be quoted | |
| 2.6 | Fields containing double quotes must be quoted | |
| 2.6 | Fields containing commas must be quoted | |
| 2.7 | Double quotes escaped by doubling ("") | |

Line Ending Behavior

Parsing:

  • Accepts both CRLF (\r\n) and LF (\n) as record separators
  • Preserves embedded CRLF/LF inside quoted fields exactly as-is

Dumping:

  • Uses CRLF (\r\n) as the record separator (RFC 4180 compliant)
  • Matches NimbleCSV.RFC4180 output exactly
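
The parsing and dumping rules above can be exercised with a short script. This is a minimal sketch, not taken from the test suite: it assumes the library is installable from Hex under the package name `rusty_csv`, and that `parse_string/2` and `dump_to_iodata/1` behave like their NimbleCSV counterparts (including the `skip_headers:` option).

```elixir
# Assumption: the Hex package is named :rusty_csv.
Mix.install([:rusty_csv])

alias RustyCSV.RFC4180, as: CSV

# The same two records, once with CRLF and once with LF separators.
crlf_rows = CSV.parse_string("name,note\r\nada,hi\r\n", skip_headers: false)
lf_rows = CSV.parse_string("name,note\nada,hi\n", skip_headers: false)
IO.inspect(crlf_rows == lf_rows, label: "CRLF and LF parse identically")

# A line break inside a quoted field is field data, not a record separator.
[[_name, note]] = CSV.parse_string("ada,\"line1\nline2\"\n", skip_headers: false)
IO.inspect(note, label: "embedded LF")

# Dumping writes CRLF after every record.
dumped =
  [["name", "note"], ["ada", "hi"]]
  |> CSV.dump_to_iodata()
  |> IO.iodata_to_binary()

IO.inspect(dumped, label: "dumped")
```

If the behavior matches the description above, the first check prints `true`, the embedded LF comes back unchanged, and the dumped binary ends each record with `\r\n`.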

Differences from Strict RFC 4180

RustyCSV makes one practical concession shared by most CSV implementations:

  1. Accepts LF line endings - RFC 4180 specifies CRLF, but LF-only files are common on Unix systems. RustyCSV parses both.

Bare Carriage Return (\r)

A bare \r not followed by \n is treated as field data, not a line ending. This matches:

  • The RFC 4180 ABNF grammar (bare \r is not in TEXTDATA, only valid inside quoted fields)
  • NimbleCSV (only \r\n and \n are line endings)
  • Go encoding/csv, Ruby CSV, PostgreSQL COPY

Python's csv module differs — it treats bare \r as a line ending via universal newline handling.
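
To see the bare `\r` rule in practice, the sketch below parses an input whose first field contains a carriage return that is not followed by `\n`. It assumes the Hex package name `rusty_csv` and the NimbleCSV-style `parse_string/2` signature; the input string is made up for illustration.

```elixir
# Assumption: the Hex package is named :rusty_csv.
Mix.install([:rusty_csv])

# "a\rb" holds a bare carriage return; only "\n" separates the two records here.
input = "a\rb,c\nd,e\n"

rows = RustyCSV.RFC4180.parse_string(input, skip_headers: false)
IO.inspect(rows, label: "rows")

# Count records and check whether the \r survived inside the first field.
IO.inspect(length(rows), label: "record count")
IO.inspect(String.contains?(hd(hd(rows)), "\r"), label: "first field keeps \\r")
```

Under the rule described above, the input should yield two records, with the `\r` kept inside the first field rather than splitting it.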


Industry Test Suites

RustyCSV validates correctness against two industry-standard CSV test suites.

csv-spectrum (Acid Test)

Source: https://github.com/max-mapper/csv-spectrum

The csv-spectrum suite is a widely-used "acid test" for CSV parsers, providing CSV files with JSON expected outputs for verification.

Note: The csv-spectrum repository's raw files have LF line endings due to git normalization. Our test fixtures match the actual content served by GitHub, and we verify that both RustyCSV and NimbleCSV produce identical output for these files.

| Test File | Edge Case | Status |
|-----------|-----------|--------|
| simple.csv | Basic parsing | |
| simple_crlf.csv | CRLF line endings | |
| comma_in_quotes.csv | Commas inside quoted fields | |
| escaped_quotes.csv | Doubled quotes (""") | |
| newlines.csv | LF inside quoted fields | |
| newlines_crlf.csv | CRLF inside quoted fields | |
| quotes_and_newlines.csv | Combined edge cases | |
| empty.csv | Headers only (LF) | |
| empty_crlf.csv | Headers only (CRLF) | |
| utf8.csv | Unicode content | |
| json.csv | JSON-like content in fields | |
| location_coordinates.csv | Numeric/coordinate data | |

Test file: test/csv_spectrum_test.exs

csv-test-data (RFC 4180 Focused)

Source: https://github.com/sineemore/csv-test-data

A comprehensive RFC 4180-focused test suite with both valid and invalid CSV cases.

Valid Cases

| Test File | Edge Case | Status |
|-----------|-----------|--------|
| simple-lf.csv | Basic with LF endings | |
| simple-crlf.csv | Basic with CRLF endings | |
| quotes-with-comma.csv | Commas in quoted fields | |
| quotes-with-escaped-quote.csv | Escaped quotes | |
| quotes-with-newline.csv | Newlines in quoted fields | |
| quotes-with-space.csv | Spaces in quoted fields | |
| quotes-empty.csv | Empty quoted fields | |
| empty-field.csv | Empty unquoted fields | |
| one-column.csv | Single column | |
| empty-one-column.csv | Single empty column | |
| leading-space.csv | Leading spaces preserved | |
| trailing-space.csv | Trailing spaces preserved | |
| trailing-newline.csv | File ends with newline | |
| utf8.csv | UTF-8 encoded content | |
| header-simple.csv | Basic with header row | |
| header-no-rows.csv | Headers only, no data | |
| all-empty.csv | All empty fields | |

Test file: test/rfc4180_test_data_test.exs


Edge Case Tests (PapaParse-inspired)

Source: https://github.com/mholt/PapaParse/blob/master/tests/test-cases.js

A comprehensive edge case test suite inspired by PapaParse, covering malformed input, unusual delimiters, and stress testing.

| Category | Test Cases |
|----------|------------|
| Basic parsing | Empty input, single field, delimiter-only |
| Whitespace | Edges, tabs, quoted whitespace |
| Quoted fields | Delimiters, newlines, escaped quotes |
| Empty fields | Leading, trailing, consecutive |
| Line endings | LF, CRLF, mixed, no trailing |
| Field counts | Ragged rows, single/many columns |
| Unicode | UTF-8, emoji, mixed scripts, BOM |
| Special chars | Null bytes, control chars, backslash |
| Large data | 100K char fields, 1000 rows, 500 columns |
| Strategy consistency | All strategies produce identical output |

Test file: test/edge_cases_test.exs


Cross-Strategy Validation

All parsing strategies must produce identical output for the same input. This is verified by running every test file through all six strategies:

| Strategy | Description | Validates Against |
|----------|-------------|-------------------|
| :basic | SIMD scan + basic field extraction | All test suites |
| :simd | SIMD structural scanner (default) | All test suites |
| :indexed | SIMD scan + two-phase index-then-extract | All test suites |
| :parallel | SIMD scan + multi-threaded via rayon | All test suites |
| :zero_copy | SIMD scan + sub-binary references | All test suites |
| :streaming | Stateful chunked parser | All test suites |

# From test/csv_spectrum_test.exs
for strategy <- [:basic, :simd, :indexed, :parallel] do
  test "all tests pass with #{strategy} strategy" do
    for name <- test_files do
      result = CSV.parse_string(csv, strategy: strategy)
      assert result == expected
    end
  end
end
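
Outside the test suite, the same consistency check can be sketched as a standalone script. This is an illustration, not the project's own test code: it assumes the Hex package name `rusty_csv`, uses only the four strategies named in the excerpt above, and runs on a made-up input.

```elixir
# Assumption: the Hex package is named :rusty_csv.
Mix.install([:rusty_csv])

alias RustyCSV.RFC4180, as: CSV

# Header row plus two records with a quoted comma and doubled quotes.
csv = "id,text\r\n1,\"a,b\"\r\n2,\"say \"\"hi\"\"\"\r\n"

strategies = [:basic, :simd, :indexed, :parallel]

# Parse the same input once per strategy.
results = Map.new(strategies, fn strategy ->
  {strategy, CSV.parse_string(csv, strategy: strategy)}
end)

# Use the default strategy as the reference and list any that disagree.
reference = Map.fetch!(results, :simd)
mismatches = for {strategy, rows} <- results, rows != reference, do: strategy

IO.inspect(reference, label: "reference rows")
IO.inspect(mismatches, label: "strategies that disagree with :simd")
```

An empty mismatch list means every strategy returned the same rows for this input.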

NimbleCSV Compatibility

RustyCSV is designed as a drop-in replacement for NimbleCSV. Compatibility is verified by:

  1. API compatibility tests - All NimbleCSV API functions work identically
  2. Output matching - dump_to_iodata/1 produces identical output to NimbleCSV
  3. Round-trip tests - Parse → dump → parse produces identical data
  4. Full-file validation - 100K-row CSV parsed through both libraries produces identical row-by-row output

Test file: test/nimble_csv_compat_test.exs

# Verify dump output matches NimbleCSV exactly
test "dump output matches NimbleCSV" do
  data = [["a", "b"], ["1", "2"]]
  assert RustyCSV.RFC4180.dump_to_iodata(data) ==
         NimbleCSV.RFC4180.dump_to_iodata(data)
end
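
The round-trip property (item 3 above) can be checked the same way. This is a minimal sketch on a made-up input, assuming the Hex package name `rusty_csv`; it is not copied from test/nimble_csv_compat_test.exs.

```elixir
# Assumption: the Hex package is named :rusty_csv.
Mix.install([:rusty_csv])

alias RustyCSV.RFC4180, as: CSV

# Input with a quoted comma and a doubled quote, using LF separators.
csv = "id,text\n1,\"comma, inside\"\n2,\"quote \"\" inside\"\n"

# Parse, dump back to CSV, then parse the dumped output again.
first = CSV.parse_string(csv, skip_headers: false)
dumped = first |> CSV.dump_to_iodata() |> IO.iodata_to_binary()
second = CSV.parse_string(dumped, skip_headers: false)

IO.inspect(first == second, label: "parse -> dump -> parse is stable")
```

The dumped text need not be byte-identical to the input (separators become CRLF), so the comparison is made on the parsed rows.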

Known Behavioral Differences

Any input NimbleCSV accepts, RustyCSV also accepts with identical output. Two edge cases differ:

1. Unquoted fields containing double-quote characters

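Input: " \"a\",b\n" written as an Elixir string (a space precedes the opening quote, so the field does not start with the escape character)

| Parser | Behavior |
|--------|----------|
| NimbleCSV | Raises ParseError |
| RustyCSV | Parses as [" \"a\"", "b"] |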

Both behaviors are defensible for malformed input. NimbleCSV treats any " as significant. RustyCSV treats a field as quoted only if it starts with the escape character, matching Python's csv module and Go's encoding/csv (with LazyQuotes).
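
The difference can be reproduced side by side. This sketch assumes the Hex package names `rusty_csv` and `nimble_csv`; the `try`/`rescue` wrapper is only there so the script prints a result whether or not NimbleCSV raises.

```elixir
# Assumption: the Hex packages are named :rusty_csv and :nimble_csv.
Mix.install([:rusty_csv, :nimble_csv])

# A space precedes the opening quote, so the first field is not a quoted field.
input = " \"a\",b\n"

rusty = RustyCSV.RFC4180.parse_string(input, skip_headers: false)
IO.inspect(rusty, label: "RustyCSV")

# Capture NimbleCSV's outcome as a tagged tuple instead of crashing the script.
nimble =
  try do
    {:ok, NimbleCSV.RFC4180.parse_string(input, skip_headers: false)}
  rescue
    e in NimbleCSV.ParseError -> {:error, Exception.message(e)}
  end

IO.inspect(nimble, label: "NimbleCSV")
```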

2. parse_stream/2 with non-line-delimited chunks

The two libraries use different streaming architectures. NimbleCSV's parse_stream expects each element of the input enumerable to be a complete line. RustyCSV's streaming parser accepts arbitrary chunk boundaries because the Rust NIF maintains parse state across feed() calls.

This difference is invisible for the standard use case (File.stream! |> parse_stream), where both produce identical output. It only surfaces when manually constructing a stream of non-line-delimited binary chunks.
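
A hand-built stream with chunk boundaries in the middle of a field illustrates the case. This is a sketch on made-up chunks, assuming the Hex package name `rusty_csv` and a NimbleCSV-style `parse_stream/2`.

```elixir
# Assumption: the Hex package is named :rusty_csv.
Mix.install([:rusty_csv])

# Chunks split mid-field and mid-quote; none of them is a complete line.
chunks = ["id,na", "me\r\n1,\"a", ",b\"\r\n2,c\r\n"]

rows =
  chunks
  |> RustyCSV.RFC4180.parse_stream(skip_headers: false)
  |> Enum.to_list()

IO.inspect(rows, label: "rows from arbitrary chunks")
```

If parse state is carried across chunks as described, the result is the same three rows that parsing the concatenated input in one call would return.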


Running Compliance Tests

# Run all tests including compliance suites
mix test

# Run only compliance tests
mix test test/csv_spectrum_test.exs test/rfc4180_test_data_test.exs

# Run with specific strategy
mix test --only strategy:parallel

Test Fixtures

Test fixtures are stored in test/fixtures/:

test/fixtures/
├── csv-spectrum/          # csv-spectrum acid test suite
│   ├── *.csv              # CSV test files
│   └── *.json             # Expected JSON outputs
└── csv-test-data/         # RFC 4180 test suite
    ├── *.csv              # Valid/invalid CSV files
    └── *.json             # Expected outputs

Test Summary

| Suite | Tests | Purpose |
|-------|-------|---------|
| Core tests | 36 | Basic functionality and NimbleCSV compatibility |
| csv-spectrum | 17 | Industry acid test |
| csv-test-data | 23 | RFC 4180 compliance |
| Edge cases | 53 | Stress testing and malformed input |
| Encoding | 20 | UTF-16, UTF-32, Latin-1 conversion |
| Multi-separator | 19 | Multiple single-byte separator support |
| Multi-byte separator | 13 | Multi-byte separator support (::, \|\|, mixed) |
| Multi-byte escape | 12 | Multi-byte escape support ($$) |
| Native API | 40 | NIF-level separator/escape encoding |
| Headers-to-maps | 97 | headers: option, cross-strategy consistency, stream parity |
| Custom newlines | 18 | Custom newline parsing and streaming |
| Streaming safety | 12 | Buffer overflow, mutex poisoning, concurrent access |
| Concurrent access | 7 | Multi-process streaming safety |
| Total | 464 | |


Additional Test Resources

The following resources provide additional CSV test cases that may be valuable for future validation:

W3C CSVW Test Suite

The W3C CSV on the Web (CSVW) test suite contains 550+ tests for CSV validation and conversion to JSON/RDF. While focused on metadata and semantic representation, the parsing tests are valuable.

csv-fuzz (Fuzzing)

Fuzzing-based testing using Jazzer to find crashes, exceptions, and memory issues.


References