hnc-csv - CSV Decoder/Encoder
Decoding
Whole CSV binary documents can be decoded with decode/1,2.
decode/1 assumes default RFC4180-style options, that is:
- Fields are separated by commas.
- Fields are optionally enclosed in double quotes.
- Double quotes in enclosed fields are quoted by another double quote.
decode/2 allows using custom options:
#{separator => Separator, % any byte except $\r or $\n (default $,)
  enclosure => Enclosure, % 'undefined' or any byte except $\r or $\n (default $")
  quote => Quote}         % 'undefined', 'enclosure', or any byte except $\r or $\n (default 'enclosure')
Restrictions for option combinations:
- If Enclosure is undefined (i.e., no enclosing), Quote must be enclosure or undefined.
- If Enclosure is not undefined, Quote must also not be undefined.
- If Enclosure is not undefined, it must not be the same as Separator.
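The restrictions above can be expressed as a small check. The following is only a sketch: the module decode_opts_check, the function check/1 and the error atoms are hypothetical and not part of hnc_csv, and the check does not verify that the values are bytes other than $\r or $\n.

```erlang
-module(decode_opts_check).
-export([check/1]).

%% Hypothetical helper, not part of hnc_csv: checks a decode option map
%% against the documented restrictions for option combinations.
-spec check(map()) -> ok | {error, atom()}.
check(Opts) ->
    %% Missing keys fall back to the documented defaults.
    Sep = maps:get(separator, Opts, $,),
    Enc = maps:get(enclosure, Opts, $"),
    Quote = maps:get(quote, Opts, enclosure),
    check(Sep, Enc, Quote).

%% No enclosing: Quote may only be 'enclosure' or 'undefined'.
check(_Sep, undefined, Quote) when Quote =/= enclosure, Quote =/= undefined ->
    {error, quote_without_enclosure};
%% Enclosing: Quote must not be 'undefined'.
check(_Sep, Enc, undefined) when Enc =/= undefined ->
    {error, enclosure_without_quote};
%% Enclosing: Enclosure must differ from Separator.
check(Sep, Enc, _Quote) when Enc =/= undefined, Enc =:= Sep ->
    {error, enclosure_equals_separator};
check(_Sep, _Enc, _Quote) ->
    ok.
```

With this sketch, an empty map (all defaults) passes, while a map such as #{separator => $;, enclosure => $;} is rejected because the enclosure equals the separator.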
Lines are separated by \r, \n or \r\n. Empty lines are ignored by the decoder.
The result of decoding is a list of CSV lines, which are in turn lists of CSV fields, which are in turn binaries representing the field values.
Example
Assume the following CSV data:
a,b,c
"d,d","e""e","f
f"
In an Erlang binary, this will look like:
1> CsvBinary = <<"a,b,c\r\n\"d,d\",\"e\"\"e\",\"f\r\nf\"\r\n">>.
<<"a,b,c\r\n\"d,d\",\"e\"\"e\",\"f\r\nf\"\r\n">>
Decoded with decode/1, this will become:
2> hnc_csv:decode(CsvBinary).
[[<<"a">>,<<"b">>,<<"c">>],
[<<"d,d">>,<<"e\"e">>,<<"f\r\nf">>]]
Higher Order Functions for Decoding
hnc_csv provides the functions decode_fold/3,4, decode_filter/2,3, decode_map/2,3, decode_filtermap/2,3 and decode_foreach/2,3, which allow decoding and processing decoded lines in one operation, much like the lists functions foldl/3, filter/2, map/2, filtermap/2 and foreach/2.
In fact, decode/1,2 is implemented via decode_fold/3,4.
Providers
Those functions take a provider as their first parameter. A provider here means a 0-arity function which, when called, returns either a tuple where the first element is a chunk of binary data and the second is a new provider function for the next chunk of data, or the atom end_of_data to indicate that the provider has delivered all data.
hnc_csv comes with two convenience functions, get_binary_provider/1,2 and get_file_provider/1,2, which return providers for binaries or files, respectively.
Example
The following is an implementation of a provider which delivers data taken from a given list of binaries:
-module(example_provider).
-export([get_list_provider/1]).

get_list_provider(L) ->
    fun() -> list_provider(L) end.

list_provider([]) ->
    end_of_data;
list_provider([Bin|More]) when is_binary(Bin) ->
    {Bin, fun() -> list_provider(More) end}.
get_list_provider/1 creates the initial provider, which is a call to list_provider/1 wrapped in a 0-arity function.
list_provider/1 is the actual implementation of the provider, which returns either end_of_data when the list given as argument is exhausted, or otherwise a tuple with the head element of the list as first and a call to itself with the tail of the list wrapped in a 0-arity function as second element.
This provider can then be used as follows, for example to count the lines and fields in the CSV data which the provider delivers:
1> Provider = example_provider:get_list_provider([<<"a,b">>, <<",c\r">>,
<<"\nd,">>, <<"e,f">>,
<<"\r\n">>]).
#Fun<example_provider.0.64990923>
2> hnc_csv:decode_fold(Provider,
fun(Line, {LCnt, FCnt}) -> {LCnt+1, FCnt+length(Line)} end,
{0, 0}).
{2,6}
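As a further illustration of the provider contract, the sketch below shows a provider that delivers a single binary in fixed-size chunks, plus a helper that calls a provider until it returns end_of_data. The module provider_sketch and the functions chunk_provider/2 and drain/1 are hypothetical names chosen for this example. They are not part of hnc_csv and do not claim to mirror how get_binary_provider/1,2 works internally.

```erlang
-module(provider_sketch).
-export([chunk_provider/2, drain/1]).

%% Hypothetical provider: delivers Bin in chunks of at most Size bytes.
chunk_provider(Bin, Size) when is_binary(Bin), is_integer(Size), Size > 0 ->
    fun() -> next_chunk(Bin, Size) end.

%% An empty binary means there is nothing (more) to deliver.
next_chunk(<<>>, _Size) ->
    end_of_data;
%% Last chunk: deliver it together with a provider that ends the data.
next_chunk(Bin, Size) when byte_size(Bin) =< Size ->
    {Bin, fun() -> end_of_data end};
%% Split off the next chunk and wrap the remainder in a new provider.
next_chunk(Bin, Size) ->
    <<Chunk:Size/binary, Rest/binary>> = Bin,
    {Chunk, fun() -> next_chunk(Rest, Size) end}.

%% Calls a provider repeatedly and collects the chunks in delivery order.
drain(Provider) when is_function(Provider, 0) ->
    drain(Provider(), []).

drain(end_of_data, Acc) ->
    lists:reverse(Acc);
drain({Chunk, Next}, Acc) when is_binary(Chunk), is_function(Next, 0) ->
    drain(Next(), [Chunk | Acc]).
```

For example, provider_sketch:drain(provider_sketch:chunk_provider(<<"a,b\r\nc,d">>, 4)) returns [<<"a,b\r">>,<<"\nc,d">>]. Note that the chunk boundary falls in the middle of the \r\n line separator, which is the kind of split a provider is free to produce.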
Advanced Usage
For more complex scenarios than what the built-in functions provide for, the functions decode_init/0,1,2, decode_add_data/2, decode_next_line/1 and decode_flush/1 can be used together to decode and process CSV documents.
decode_init/0,1,2 creates a decoder state to be used in the other functions listed above.
decode_add_data/2 adds another chunk of unprocessed data to the state and returns an updated state.
decode_next_line/1 decodes and returns the next line, together with an updated state. If the data in the state is exhausted, the atom end_of_data is returned instead of a line.
decode_flush/1 returns any as yet unfinished line in the given state, together with any yet unprocessed data. If there is no unfinished line in the state, the atom undefined is returned instead of a line.
In fact, decode_fold/4 is implemented using those functions.
Encoding
CSV documents can be encoded with encode/1,2.
encode/1 assumes default RFC4180-style options, that is:
- Fields are separated by commas
- Fields are optionally enclosed in double quotes
- Double quotes in enclosed fields are quoted by another double quote
- Lines are separated by \r\n
encode/2 allows using custom options:
#{separator => Separator,   % any byte except $\r or $\n (default $,)
  enclosure => Enclosure,   % 'undefined' or any byte except $\r or $\n (default $")
  quote => Quote,           % 'undefined', 'enclosure', or any byte except $\r or $\n (default 'enclosure')
  enclose => Enclose,       % 'optional' (default), 'never' or 'always'
  end_of_line => EndOfLine} % <<"\r\n">> (default), <<"\n">> or <<"\r">>
Restrictions for option combinations:
- If Enclose is never (i.e., no enclosing), Enclosure must be undefined and Quote must be undefined or enclosure.
- If Enclose is optional or always, Enclosure and Quote must not be undefined.
- If Enclosure is not undefined, it must not be the same as Separator.
The input for encoding is a list of CSV lines, which are in turn lists of CSV fields, which are in turn binaries representing the field values.
The result is a CSV binary document.
Example
Assume the following CSV structure:
1> Csv = [[<<"a">>,<<"b">>,<<"c">>],[<<"d,d">>,<<"e\"e">>,<<"f\r\nf">>]].
Encoded with encode/1, this will become:
2> hnc_csv:encode(Csv).
<<"a,b,c\r\n\"d,d\",\"e\"\"e\",\"f\r\nf\"\r\n">>
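To see how the individual fields in this output come about, the following sketch encodes a single field under the default options (separator $,, enclosure $", quote set to enclosure, enclose set to optional). It is an illustration only: the module field_sketch and the function encode_field/1 are hypothetical, and the code is not taken from hnc_csv. In particular, the rule it uses for deciding when to enclose a field (the field contains a comma, a double quote, \r or \n) is an assumption that is consistent with the example above.

```erlang
-module(field_sketch).
-export([encode_field/1]).

%% Illustration only, not hnc_csv's implementation: encodes one field
%% with the default options and optional enclosing.
-spec encode_field(binary()) -> binary().
encode_field(Field) when is_binary(Field) ->
    case binary:match(Field, [<<",">>, <<"\"">>, <<"\r">>, <<"\n">>]) of
        nomatch ->
            %% Nothing special in the field, emit it unchanged.
            Field;
        _ ->
            %% Double every double quote, then wrap the field in quotes.
            Escaped = binary:replace(Field, <<"\"">>, <<"\"\"">>, [global]),
            <<$", Escaped/binary, $">>
    end.
```

Applied to the fields of the second line, this yields <<"\"d,d\"">>, <<"\"e\"\"e\"">> and <<"\"f\r\nf\"">>, which joined by commas and terminated by \r\n give the second line of the encoded document.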
Authors
- Maria Scott (Maria-12648430)
- Jan Uhlig (juhlig)