Csv Schema
Csv schema is a library that helps you build Ecto.Schema-like modules backed by a csv file.
The idea behind this library is to let you create, at compile-time, a self-contained module exposing functions that retrieve data from a CSV.
Installation
The package can be installed by adding :csv_schema to your list of dependencies in mix.exs:
def deps do
  [
    {:csv_schema, "~> 0.2.8"}
  ]
end
Usage
Supposing you have a CSV file looking like this:
id | first_name | last_name | email | gender | ip_address | date_of_birth |
:----|:-----------|:-----------|:------------------------------|:-------|:----------------|:--------------|
1 | Ivory | Overstreet | ioverstreet0@businessweek.com | Female | 30.138.91.62 | 10/22/2018 |
2 | Ulick | Vasnev | uvasnev1@vkontakte.ru | Male | 35.15.164.70 | 01/19/2018 |
3 | Chloe | Freemantle | cfreemantle2@parallels.com | Female | 133.133.113.255 | 08/13/2018 |
... | ... | ... | ... | ... | ... | ... |
It is possible to create an Ecto.Schema-like repository using the Csv.Schema macro:
defmodule Person do
  use Csv.Schema
  alias Csv.Schema.Parser

  schema path: "path/to/person.csv" do
    field :id, "id"
    field :first_name, "first_name", filter_by: true
    field :last_name, "last_name", sort: :asc
    field :identifier, ["first_name", "last_name"], key: true, join: " "
    field :email, "email", unique: true
    field :gender, "gender", filter_by: true, sort: :desc
    field :ip_address, "ip_address"
    field :date_of_birth, "date_of_birth", parser: &Parser.date!(&1, "{0M}/{0D}/{0YYYY}")
  end
end
It is also possible to define the schema with the data: param in order to generate the content directly from a string:
@data """
id,first_name,last_name,email,gender,ip_address,date_of_birth
1,Ivory,Overstreet,ioverstreet0@businessweek.com,Female,30.138.91.62,10/22/2018
2,Ulick,Vasnev,uvasnev1@vkontakte.ru,Male,35.15.164.70,01/19/2018
3,Chloe,Freemantle,cfreemantle2@parallels.com,Female,133.133.113.255,08/13/2018
"""
schema data: @data do
  ...
end
Note that you are not required to map every column, but every mapped field must have a matching column in the csv file. For example, the following field configuration results in a compilation error:
field :id, "non_existing_id", ...
The schema can be configured with a custom separator (the default is ?,):
use Csv.Schema, separator: ?,
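The ?, notation can look odd at first: it is plain Elixir syntax for the integer code point of a character, not something specific to this library. The small sketch below only demonstrates that syntax; by analogy, a semicolon-separated file would presumably be configured with separator: ?; (an assumption based on the option shown above, not a tested configuration).

```elixir
# `?x` evaluates to the integer code point of the character `x`.
comma = ?,
semicolon = ?;
tab = ?\t

IO.inspect({comma, semicolon, tab})
# prints {44, 59, 9}

# Turning a code point back into a one-character string.
IO.inspect(<<semicolon::utf8>>)
# prints ";"
```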
It is also possible to configure whether the csv file has a header. The field configuration changes depending on the value of the header param:
# Default header value is `true`
use Csv.Schema
# Csv with header
schema path: "path/to/person.csv" do
  field :id, "id", key: true
  ...
end
# Csv without header. Note that field 1 is bound to the first csv column.
use Csv.Schema, header: false
# Index goes from 1 to N
schema path: "path/to/person.csv" do
  field :id, 1, key: true
  ...
end
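The two addressing modes above (column name with a header, 1-based index without one) can be pictured with a small stand-alone sketch. ColumnLookupSketch and fetch/3 are hypothetical names invented for this illustration and are not part of Csv.Schema; the sketch also raises at runtime, whereas the library reports an unmapped column at compile time.

```elixir
defmodule ColumnLookupSketch do
  # Without a header, the column is addressed by a 1-based index.
  def fetch(row, _headers, index) when is_integer(index) do
    Enum.at(row, index - 1)
  end

  # With a header, the column is addressed by its name.
  def fetch(row, headers, name) when is_binary(name) do
    case Enum.find_index(headers, &(&1 == name)) do
      nil -> raise ArgumentError, "column #{inspect(name)} not found"
      position -> Enum.at(row, position)
    end
  end
end

headers = ["id", "first_name"]
row = ["1", "Ivory"]

ColumnLookupSketch.fetch(row, headers, "first_name") # "Ivory"
ColumnLookupSketch.fetch(row, nil, 1)                # "1"
```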
The Person module now defines a struct, like this:
defmodule Person do
  defstruct id: nil,
            first_name: nil,
            last_name: nil,
            email: nil,
            gender: nil,
            ip_address: nil,
            date_of_birth: nil
end
The macro also creates the following functions inside the Person module:
def by_id(integer_key), do: ...
def filter_by_first_name(string_value), do: ...
def by_email(string_value), do: ...
def filter_by_gender(string_value), do: ...
def get_all, do: ...
Where:
- by_id returns a %Person{}, or nil if the key is not mapped in the csv
- filter_by_first_name returns a [%Person{}, %Person{}, ...], or [] if the input predicate does not match any person
- by_email returns a %Person{}, or nil if no person has the provided email in the csv
- filter_by_gender returns a [%Person{}, %Person{}, ...], or [] if the input predicate does not match any person's gender
- get_all returns all csv rows as a Stream
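To make those return shapes concrete, here is a hand-written stand-in that mimics them with three rows from the sample CSV. PersonSketch is a hypothetical module written for this illustration; it is not the code Csv.Schema generates, and it searches a list at runtime instead of building anything at compile time.

```elixir
defmodule PersonSketch do
  defstruct first_name: nil, email: nil, gender: nil

  # Three rows from the sample CSV, reduced to three columns.
  @rows [
    %{first_name: "Ivory", email: "ioverstreet0@businessweek.com", gender: "Female"},
    %{first_name: "Ulick", email: "uvasnev1@vkontakte.ru", gender: "Male"},
    %{first_name: "Chloe", email: "cfreemantle2@parallels.com", gender: "Female"}
  ]

  # Unique lookup: one struct, or nil when nothing matches.
  def by_email(email), do: Enum.find(all(), &(&1.email == email))

  # Filter: a list of structs, possibly empty.
  def filter_by_gender(gender), do: Enum.filter(all(), &(&1.gender == gender))

  # All rows as a lazy Stream.
  def get_all, do: Stream.map(@rows, &struct(__MODULE__, &1))

  defp all, do: Enum.to_list(get_all())
end

PersonSketch.by_email("uvasnev1@vkontakte.ru").first_name # "Ulick"
PersonSketch.by_email("nobody@example.com")               # nil
PersonSketch.filter_by_gender("Female") |> length()       # 2
PersonSketch.filter_by_gender("Other")                    # []
PersonSketch.get_all() |> Enum.count()                    # 3
```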
Field configuration
Every field should be formed like this:
field {struct_field}, {csv_header}, {opts}
where:
- {struct_field} will be the struct field name. It can be configured as a string or as an atom
- {csv_header} is the csv column name to read values from. It must be configured using a string only
- {opts} is a keyword list containing special configurations
opts:
- :key : boolean. At most one key can be set. If set to true, creates the by_{name} function for you.
- :unique : boolean. If set to true, creates the by_{name} function for you. All csv values must be unique or an exception is raised
- :filter_by : boolean. If set to true, creates the filter_by_{name} function
- :parser : function. An arity 1 function used to map values from string to a custom type
- :sort : :asc or :desc. It sorts according to Erlang's term ordering, with the exception that nil comes last (number < atom < reference < fun < port < pid < tuple < list < bit-string < nil)
- :join : string. If present, it joins the given fields into a binary using the separator
Note that every configuration is optional
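Three of these options describe value transformations that are easy to picture in isolation. The sketch below is a minimal stand-alone illustration of what :parser, :join and :sort (with :asc) are documented to do; FieldOptsSketch and its function names are hypothetical and do not come from the library, and how :desc treats nil is not covered here.

```elixir
defmodule FieldOptsSketch do
  # :parser, an arity-1 function from the raw csv string to another type.
  def parse_integer(value), do: value |> String.trim() |> String.to_integer()

  # :join, several column values merged into one binary with a separator.
  def join(values, separator), do: Enum.join(values, separator)

  # :sort with :asc, Erlang term ordering except that nil goes last.
  def sort_asc(values) do
    Enum.sort(values, fn
      nil, nil -> true
      nil, _other -> false
      _other, nil -> true
      a, b -> a <= b
    end)
  end
end

FieldOptsSketch.parse_integer("42")                # 42
FieldOptsSketch.join(["Ivory", "Overstreet"], " ") # "Ivory Overstreet"
FieldOptsSketch.sort_asc(["b", nil, 2, :a])        # [2, :a, "b", nil]
```

With plain Enum.sort/1 the nil in the last call would land between :a and "b", because nil is an atom in Erlang's term ordering.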
Keep in mind
Compilation time grows faster than linearly when the csv contains many lines and you configure multiple fields as candidates for function generation (flags key, unique and/or filter_by set to true).
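One way to see why rows and flagged fields both weigh on compilation is to generate one function clause per row at compile time. The sketch below does this with unquote fragments; it is only an assumption about the general technique, not a description of how Csv.Schema is actually implemented, and CompileTimeSketch is a hypothetical module.

```elixir
defmodule CompileTimeSketch do
  # Stand-in for rows read from a csv while the module is being compiled.
  @rows [{"1", "Ivory"}, {"2", "Ulick"}, {"3", "Chloe"}]

  # One clause per row: the compiler has to process rows x flagged fields clauses.
  for {id, first_name} <- @rows do
    def first_name_by_id(unquote(id)), do: unquote(first_name)
  end

  # Fallback clause for keys that are not in the data.
  def first_name_by_id(_unknown), do: nil
end

CompileTimeSketch.first_name_by_id("2") # "Ulick"
CompileTimeSketch.first_name_by_id("9") # nil
```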
Because "without data you're just another person with an opinion", here is some data:
Compilation time
csv rows | key | unique | filter_by | compile time |
---|---|---|---|---|
1_000 | false | 0 | 0 | 301_727 µs |
1_000 | false | 2 | 0 | 352_522 µs |
1_000 | false | 0 | 4 | 318_225 µs |
1_000 | true | 0 | 0 | 334_240 µs |
1_000 | true | 1 | 1 | 348_697 µs |
1_000 | true | 2 | 0 | 406_367 µs |
1_000 | true | 0 | 4 | 385_850 µs |
1_000 | true | 2 | 2 | 414_617 µs |
1_000 | true | 2 | 4 | 446_155 µs |
5_000 | false | 0 | 0 | 2_734_565 µs |
5_000 | false | 2 | 0 | 3_450_438 µs |
5_000 | false | 0 | 4 | 3_464_593 µs |
5_000 | true | 0 | 0 | 3_084_923 µs |
5_000 | true | 1 | 1 | 3_795_718 µs |
5_000 | true | 2 | 0 | 3_752_112 µs |
5_000 | true | 0 | 4 | 3_387_067 µs |
5_000 | true | 2 | 2 | 3_839_068 µs |
5_000 | true | 2 | 4 | 4_113_228 µs |
10_000 | false | 0 | 0 | 6_889_505 µs |
10_000 | false | 2 | 0 | 8_667_683 µs |
10_000 | false | 0 | 4 | 8_606_961 µs |
10_000 | true | 0 | 0 | 7_892_421 µs |
10_000 | true | 1 | 1 | 8_449_838 µs |
10_000 | true | 2 | 0 | 9_507_693 µs |
10_000 | true | 0 | 4 | 10_339_080 µs |
10_000 | true | 2 | 2 | 10_518_744 µs |
10_000 | true | 2 | 4 | 10_480_884 µs |
Execution time
csv rows | key | unique | filter_by | by avg | by tot | filter_by avg | filter_by tot |
---|---|---|---|---|---|---|---|
1_000 | true | 1 | 1 | 0.74 µs/op | 74_412 µs | 0.89 µs/op | 89_275 µs |
5_000 | true | 1 | 1 | 0.79 µs/op | 79_776 µs | 1.18 µs/op | 118_786 µs |
10_000 | true | 1 | 1 | 0.78 µs/op | 78_908 µs | 1.83 µs/op | 183_642 µs |
Execution details
Executed on my machine:
Lenovo Thinkpad T480
CPU: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
RAM: 32GB
Try it yourself
If you would like to run the compilation benchmarks yourself:
iex -S mix
c "benchmark/timings.exs"
Copyright and License
Copyright (c) 2019 PrimaIt
This work is free. You can redistribute it and/or modify it under the terms of the MIT License. See the LICENSE.md file for more details.