Csv Schema

Build Status Module Version Hex Docs Total Download License Last Updated

Csv schema is a library helping you to build Ecto.Schema-like modules having a csv file as source.

The idea behind this library is give the possibility to create, at compile-time, a self-contained module exposing functions to retrieve data starting from a CSV.

Installation

The package can be installed by adding :csv_schema to your list of dependencies in mix.exs:

def deps do
  [
    {:csv_schema, "~> 0.2.8"}
  ]
end

Usage

Supposing you have a CSV file looking like this:

id | first_name | last_name | email | gender | ip_address | date_of_birth | :----|:-----------|:-----------|:------------------------------|:-------|:----------------|:--------------| 1 | Ivory | Overstreet | ioverstreet0@businessweek.com | Female | 30.138.91.62 | 10/22/2018 | 2 | Ulick | Vasnev | uvasnev1@vkontakte.ru | Male | 35.15.164.70 | 01/19/2018 | 3 | Chloe | Freemantle | cfreemantle2@parallels.com | Female | 133.133.113.255 | 08/13/2018 | ... | ... | ... | ... | ... | ... | ... |

It is possible to create an Ecto.Schema-like repository using Csv.Schema macro:

defmodule Person do
  use Csv.Schema
  alias Csv.Schema.Parser

  schema path: "path/to/person.csv" do
    field :id, "id"
    field :first_name, "first_name", filter_by: true
    field :last_name, "last_name", sort: :asc
    field :identifier, ["first_name", "last_name"], key: true, join: " "
    field :email, "email", unique: true
    field :gender, "gender", filter_by: true, sort: :desc
    field :ip_address, "ip_address"
    field :date_of_birth, "date_of_birth", parser: &Parser.date!(&1, "{0M}/{0D}/{0YYYY}")
  end
end

It is possible to define the schema with string: param in order to directly use a string to generate content

@data """
id,first_name,last_name,email,gender,ip_address,date_of_birth
1,Ivory,Overstreet,ioverstreet0@businessweek.com,Female,30.138.91.62,10/22/2018
2,Ulick,Vasnev,uvasnev1@vkontakte.ru,Male,35.15.164.70,01/19/2018
3,Chloe,Freemantle,cfreemantle2@parallels.com,Female,133.133.113.255,08/13/2018
"""

schema data: @data do
...
end

Note that it's not a requirement to map all fields, but every field mapped must have a column in csv file. For example the following field configuration will result in a compilation error:

field :id, "non_existing_id", ...

Schema could be configured using a custom separator (default is ?,)

use Csv.Schema, separator: ?,

Moreover it's possible to configure if csv file has or has not an header. Depending on header param value field config changes:

# Default header value is `true`
use Csv.Schema
# Csv with header
schema path: "path/to/person.csv" do
  field :id, "id", key: true
  ...
end

# Csv without header. Note that field 1 is binded with the first csv column.
use Csv.Schema, header: false
# Index goes from 1 to N
schema path: "path/to/person.csv" do
  field :id, 1, key: true
  ...
end

Now Person module is a struct, defined like this:

defmodule Person do
  defstruct id: nil,
            first_name: nil,
            last_name: nil,
            email: nil,
            gender: nil,
            ip_address: nil,
            date_of_birth: nil
end

This macro creates for you inside Person module those functions:

def by_id(integer_key), do: ...

def filter_by_first_name(string_value), do: ...

def by_email(string_value), do: ...

def filter_by_gender(string_value), do: ...

def get_all, do: ...

Where:

  • by_id returns a %Person{} or nil if key is not mapped in csv
  • filter_by_first_name returns a [%Person{}, %Person{}, ...] or [] if input predicate does not match any person
  • by_email returns a %Person{} or nil if no person have provided email in csv
  • filter_by_gender returns a [%Person{}, %Person{}, ...] or [] if input predicate does not match any person gender
  • get_all return all csv rows as a Stream

Field configuration

Every field should be formed like this:

field {struct_field}, {csv_header}, {opts}

where:

  • {struct_field} will be the struct field name. Could be configured as string or as atom
  • {csv_header} is the csv column name from where get values. Must be configured using string only
  • {opts} is a keyword list containing special configurations

opts:

  • :key : boolean. At most one key could be set. If set to true creates the by_{name} function for you.
  • :unique : boolean. If set to true creates the by_{name} function for you. All csv values must be unique or an exception is raised
  • :filter_by : boolean. If set to true creates the filter_by_{name} function
  • :parser : function. An arity 1 function used to map values from string to a custom type
  • :sort : :asc or :desc. It sorts according to Erlang's term ordering with nil exception (number < atom < reference < fun < port < pid < tuple < list < bit-string < nil)
  • :join : string. If present it joins the given fields into a binary using the separator

Note that every configuration is optional

Keep in mind

Compilation time increase in an exponential manner if csv contains lots of lines and you configure multiple fields candidate for method creation (flags key, unique and/or filter_by set to true).

Because "without data you're just another person with an opinion" here some data:

Compilation time

csv rowskeyuniquefilter_bycompile time
1_000false00301_727 µs
1_000false20352_522 µs
1_000false04318_225 µs
1_000true00334_240 µs
1_000true11348_697 µs
1_000true20406_367 µs
1_000true04385_850 µs
1_000true22414_617 µs
1_000true24446_155 µs
5_000false002_734_565 µs
5_000false203_450_438 µs
5_000false043_464_593 µs
5_000true003_084_923 µs
5_000true113_795_718 µs
5_000true203_752_112 µs
5_000true043_387_067 µs
5_000true223_839_068 µs
5_000true244_113_228 µs
10_000false006_889_505 µs
10_000false208_667_683 µs
10_000false048_606_961 µs
10_000true007_892_421 µs
10_000true118_449_838 µs
10_000true209_507_693 µs
10_000true0410_339_080 µs
10_000true2210_518_744 µs
10_000true2410_480_884 µs

Execution time

csv rowskeyuniquefilter_byby avgby totfilter_by avgfilter_by tot
1_000true110.74 µs/op74_412 µs0.89 µs/op89_275 µs
5_000true110.79 µs/op79_776 µs1.18 µs/op118_786 µs
10_000true110.78 µs/op78_908 µs1.83 µs/op183_642 µs

Execution details

Executed on my machine:

Lenovo Thinkpad T480
CPU: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
RAM: 32GB

Try yourself

If you like to run compilation benchmarks yourself:

iex -S mix
c "benchmark/timings.exs"

Copyright (c) 2019 PrimaIt

This work is free. You can redistribute it and/or modify it under the terms of the MIT License. See the LICENSE.md file for more details.