DataQuacker v0.1.1
DataQuacker is a library for validating, transforming, and parsing non-sandboxed data.
The most common example for such data, and the original idea behind this project, is CSV files.
The scope of this library is not, however, in any way limited to CSV files.
This library ships with two adapters by default: DataQuacker.Adapters.CSV for CSV files, and DataQuacker.Adapters.Identity for "in-memory data". Any other data source may be used with the help of a third-party adapter; see: DataQuacker.Adapter.
This library is comprised of three main components:
- DataQuacker, which provides the parse/4 function to parse data using a schema
- DataQuacker.Schema, which provides a DSL for declaratively defining schemas which describe the mapping between the source data and the desired output
- DataQuacker.Adapters.CSV and DataQuacker.Adapters.Identity, which extract data from sources into the format required by the parse/4 function
Note: If you find anything missing from or unclear in the documentation, please do not hesitate to open an issue on the project's GitHub repository.
Testing
Tests for parsing external or non-sandboxed data are often difficult to implement well, since that data may need to change over time. For example, editing the CSV files used in tests whenever the requirements change can be tedious.
For this reason, it is recommended to use a different adapter in tests, one which takes Elixir data as its input.
In the integration tests for this library, the DataQuacker.Adapters.Identity adapter is used.
The easiest way to switch out adapters in tests is to put the desired adapter in the test.exs config. You can find out how to do this under the "Options" section in the documentation for the parse/4 function.
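For illustration, a minimal test.exs sketch using the same :data_quacker config keys documented under "Options" could look like this:

```elixir
# config/test.exs
# A sketch: swap the CSV adapter for the Identity adapter in tests,
# so test inputs can be plain Elixir data instead of CSV files.
use Mix.Config

config :data_quacker,
  adapter: DataQuacker.Adapters.Identity,
  adapter_opts: []
```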
Examples
Note: Most of the "juice", like transforming, validating, nesting, skipping, etc., is in the DataQuacker.Schema module, so the more complex and interesting examples also live there. Please take a look at its documentation for more in-depth examples.
Note: A fully working implementation of these examples can be found in the tests inside the "examples" directory.
Given the following table of ducks in a pond, in the form of a CSV file:
| Type | Colour | Age |
| --- | --- | --- |
| Mallard | green | 3 |
| Domestic | white | 2 |
| Mandarin | multi-coloured | 4 |
we want to have a list of maps with :type, :colour, and :age as the keys.
This can be achieved by creating the following schema and parser modules:
Schema
defmodule PondSchema do
  use DataQuacker.Schema

  schema :pond do
    field :type do
      source("type")
    end

    field :colour do
      # make the "u" optional
      # in case we get an American data source :)
      source(~r/colou?r/i)
    end

    field :age do
      source("age")
    end
  end
end
Parser
defmodule PondParser do
  def parse(file_path) do
    DataQuacker.parse(
      file_path,
      PondSchema.schema_structure(:pond),
      nil
    )
  end
end
iex> PondParser.parse("path/to/file.csv")
{:ok, [
  {:ok, %{type: "Mandarin", colour: "multi-coloured", age: "4"}},
  {:ok, %{type: "Domestic", colour: "white", age: "2"}},
  {:ok, %{type: "Mallard", colour: "green", age: "3"}}
]}
Using this schema and parser we get a tuple of :ok or :error and a list of rows, each of which is also a tuple of :ok or :error, but with a map as the second element. The topmost :ok or :error indicates whether all rows are valid, and those for individual rows indicate whether that particular row is valid.
Note: The rows in the result are in the reverse order compared to the source rows. This is because for large lists reversing may be an expensive operation, which is often redundant, for example if the result is supposed to be inserted in a database.
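Since the reversal is left to the caller, restoring the source order when it matters is a single Enum.reverse/1 pass over the result list, for example:

```elixir
# Rows as returned by the parser (last source row first)
rows = [
  {:ok, %{type: "Mandarin"}},
  {:ok, %{type: "Domestic"}},
  {:ok, %{type: "Mallard"}}
]

# One O(n) pass restores the original source order
Enum.reverse(rows)
# => [{:ok, %{type: "Mallard"}}, {:ok, %{type: "Domestic"}}, {:ok, %{type: "Mandarin"}}]
```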
Now suppose we also want to validate that the type is one in a list of types we know, and get the age in the form of an integer. We need to make some changes to our schema:
defmodule PondSchema do
  use DataQuacker.Schema

  schema :pond do
    field :type do
      validate(fn type -> type in ["Mallard", "Domestic", "Mandarin"] end)

      source("type")
    end

    field :colour do
      # make the "u" optional
      # in case we get an American data source :)
      source(~r/colou?r/i)
    end

    field :age do
      transform(fn age_str ->
        case Integer.parse(age_str) do
          {age_int, _} -> {:ok, age_int}
          :error -> :error
        end
      end)

      source("age")
    end
  end
end
Using the same input file the output is now:
iex> PondParser.parse("path/to/file.csv")
{:ok, [
  {:ok, %{type: "Mandarin", colour: "multi-coloured", age: 4}},
  {:ok, %{type: "Domestic", colour: "white", age: 2}},
  {:ok, %{type: "Mallard", colour: "green", age: 3}}
]}
(the difference is in the type of :age, which is now an integer)
If we add some invalid fields to the file, however, the result will be quite different:
| Type | Colour | Age |
| --- | --- | --- |
| Mallard | green | 3 |
| Domestic | white | 2 |
| Mandarin | multi-coloured | 4 |
| Mystery | golden | 100 |
| Black | black | Infinity |
iex> PondParser.parse("path/to/file.csv")
{:error, [
  :error,
  :error,
  {:ok, %{type: "Mandarin", colour: "multi-coloured", age: 4}},
  {:ok, %{type: "Domestic", colour: "white", age: 2}},
  {:ok, %{type: "Mallard", colour: "green", age: 3}}
]}
Since the last two rows of the input are invalid, the first two rows in the output are errors.
Note: The errors can be made more descriptive by returning {:error, any()} tuples from the validators and parsers. You can see this in action in the examples for the DataQuacker.Schema module.
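As a sketch of this, the :type validator from the PondSchema above could return a tagged tuple on failure; this relies only on the {:error, any()} convention described in the note, and the exact message text is an illustrative assumption:

```elixir
field :type do
  # A validator returning a boolean on success and a tagged tuple on
  # failure; `false or x` evaluates to `x` in Elixir, so invalid types
  # produce a descriptive {:error, reason} instead of a bare :error.
  validate(fn type ->
    type in ["Mallard", "Domestic", "Mandarin"] or
      {:error, "unknown duck type: #{type}"}
  end)

  source("type")
end
```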
Summary
Functions
Takes in a source, a schema, support data, and a keyword list of options.
Returns a tuple with :ok or :error (indicating whether all rows are valid) as the first element, and a list of {:ok, map()} | {:error, any()} | :error as the second. In case of {:ok, map()} for a given row, the map is the output defined in the schema.
Functions
Takes in a source, a schema, support data, and a keyword list of options.
Returns a tuple with :ok or :error (indicating whether all rows are valid) as the first element, and a list of {:ok, map()} | {:error, any()} | :error as the second. In case of {:ok, map()} for a given row, the map is the output defined in the schema.
Source
Any data which will be given to the adapter so that it can retrieve the source data. In case of DataQuacker.Adapters.CSV, this can be a file path or a file URL.
Schema
A schema formed with the DSL from DataQuacker.Schema.
Support data
Any data which is supposed to be accessible inside various schema elements when parsing a source.
Options
The options can also be specified in the config, for example:
use Mix.Config
# ...
config :data_quacker,
adapter: DataQuacker.Adapters.Identity,
adapter_opts: []
# ...
- :adapter - the adapter module to be used to retrieve the source data; defaults to DataQuacker.Adapters.CSV
- :adapter_opts - a keyword list of opts to be passed to the adapter; defaults to [separator: ?,, local?: true]; for a list of available adapter options see the documentation for the particular adapter
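The same options can also be passed directly as the fourth argument to parse/4. A sketch reusing the PondSchema from the examples above (the semicolon separator is an illustrative assumption for a semicolon-delimited file):

```elixir
# Parse a semicolon-separated local CSV file, overriding the
# default adapter options inline rather than via config
DataQuacker.parse(
  "path/to/file.csv",
  PondSchema.schema_structure(:pond),
  nil,
  adapter: DataQuacker.Adapters.CSV,
  adapter_opts: [separator: ?;, local?: true]
)
```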