DataQuacker v0.1.1

DataQuacker is a library which aims to help with validating, transforming and parsing non-sandboxed data.

The most common example of such data, and the original idea behind this project, is CSV files. The scope of this library is not, however, in any way limited to CSV files. This library ships with two adapters by default: DataQuacker.Adapters.CSV for CSV files, and DataQuacker.Adapters.Identity for "in-memory data". Any other data source may be used with the help of a third-party adapter; see DataQuacker.Adapter.

This library consists of three main components:

  • DataQuacker.Schema — a DSL for declaring how the source data should be validated and transformed
  • DataQuacker.Adapter — the behaviour implemented by adapters which retrieve data from a source
  • DataQuacker.parse/4 — the entry point which applies a schema to a source

Note: If you find anything missing from or unclear in the documentation, please do not hesitate to open an issue on the project's GitHub repository.

Testing

Tests for parsing external or non-sandboxed data are often difficult to implement well, since that data may need to change over time. For example, editing the CSV files used in tests whenever the requirements change can be tedious.

For this reason, it is recommended to use a different adapter in tests, one which takes Elixir data as its input. The integration tests for this library use the DataQuacker.Adapters.Identity adapter.

The easiest way to switch out adapters in tests is to put the desired adapter in the test.exs config. You can find out how to do this under the "Options" section in the documentation for the parse/4 function.
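For example, an entry along these lines in test.exs (a sketch based on the config keys shown in the "Options" section of parse/4) would make every test run use the in-memory adapter:

```elixir
# config/test.exs
use Mix.Config

config :data_quacker,
  adapter: DataQuacker.Adapters.Identity
```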

Examples

Note: Most of the "juice", like transforming, validating, nesting, skipping, etc., is in the DataQuacker.Schema module, so the more complex and interesting examples also live there. Please take a look at its documentation for more in-depth examples.

Note: A fully working implementation of these examples can be found in the tests inside the "examples" directory.

Given the following table of ducks in a pond, in the form of a CSV file:

| Type     | Colour         | Age |
| -------- | -------------- | --- |
| Mallard  | green          | 3   |
| Domestic | white          | 2   |
| Mandarin | multi-coloured | 4   |

we want to have a list of maps with :type, :colour and :age as the keys.

This can be achieved by creating the following schema and parser modules:

Schema

defmodule PondSchema do
  use DataQuacker.Schema

  schema :pond do
    field :type do
      source("type")
    end

    field :colour do
      # make the "u" optional
      # in case we get an American data source :)

      source(~r/colou?r/i)
    end

    field :age do
      source("age")
    end
  end
end

Parser

defmodule PondParser do
  def parse(file_path) do
    DataQuacker.parse(
      file_path,
      PondSchema.schema_structure(:pond),
      nil
    )
  end
end
iex> PondParser.parse("path/to/file.csv")
{:ok,
 [
   {:ok, %{type: "Mandarin", colour: "multi-coloured", age: "4"}},
   {:ok, %{type: "Domestic", colour: "white", age: "2"}},
   {:ok, %{type: "Mallard", colour: "green", age: "3"}}
 ]}

Using this schema and parser we get a tuple of :ok or :error and a list of rows, each of which is also a tuple of :ok or :error, with a map as the second element. The topmost :ok or :error indicates whether all rows are valid; those for individual rows indicate whether that particular row is valid.

Note: The rows in the result are in the reverse order compared to the source rows. This is because for large lists reversing may be an expensive operation, which is often redundant, for example if the result is supposed to be inserted in a database.
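If the source order does matter for your use case, the caller can reverse the list itself. A minimal sketch of consuming the result:

```elixir
case PondParser.parse("path/to/file.csv") do
  {:ok, rows} ->
    # restore the source order and unwrap the per-row tuples
    rows
    |> Enum.reverse()
    |> Enum.map(fn {:ok, row} -> row end)

  {:error, _rows} ->
    # at least one row failed validation or transformation
    :error
end
```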

Now suppose we also want to validate that the type is one in a list of types we know, and get the age in the form of an integer. We need to make some changes to our schema:

defmodule PondSchema do
  use DataQuacker.Schema

  schema :pond do
    field :type do
      validate(fn type -> type in ["Mallard", "Domestic", "Mandarin"] end)

      source("type")
    end

    field :colour do
      # make the "u" optional
      # in case we get an American data source :)

      source(~r/colou?r/i)
    end

    field :age do
      transform(fn age_str ->
        case Integer.parse(age_str) do
          {age_int, _} -> {:ok, age_int}
          :error -> :error
        end
      end)

      source("age")
    end
  end
end

Using the same input file the output is now:

iex> PondParser.parse("path/to/file.csv")
{:ok,
 [
   {:ok, %{type: "Mandarin", colour: "multi-coloured", age: 4}},
   {:ok, %{type: "Domestic", colour: "white", age: 2}},
   {:ok, %{type: "Mallard", colour: "green", age: 3}}
 ]}

(the difference is the type of the :age values)

If we add some invalid fields to the file, however, the result will be quite different:

| Type     | Colour         | Age      |
| -------- | -------------- | -------- |
| Mallard  | green          | 3        |
| Domestic | white          | 2        |
| Mandarin | multi-coloured | 4        |
| Mystery  | golden         | 100      |
| Black    | black          | Infinity |
iex> PondParser.parse("path/to/file.csv")
{:error,
 [
   :error,
   :error,
   {:ok, %{type: "Mandarin", colour: "multi-coloured", age: 4}},
   {:ok, %{type: "Domestic", colour: "white", age: 2}},
   {:ok, %{type: "Mallard", colour: "green", age: 3}}
 ]}

Since the last two rows of the input are invalid, and the output is in reverse order, the first two entries in the output are errors.

Note: The errors can be made more descriptive by returning {:error, any()} tuples from validators and transformers. You can see this in action in the examples for the DataQuacker.Schema module.
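For instance, the :type validator from the schema above could return such a tuple instead of a plain boolean (a sketch; the exact validator contract is described in DataQuacker.Schema):

```elixir
field :type do
  validate(fn type ->
    if type in ["Mallard", "Domestic", "Mandarin"] do
      true
    else
      # this message replaces the bare :error in the result list
      {:error, "unknown duck type: #{type}"}
    end
  end)

  source("type")
end
```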

Functions

parse(source, schema, support_data, opts \\ [])

@spec parse(any(), map(), any(), Keyword.t()) ::
  {:ok, [{:ok, map()} | {:error, any()} | :error]}
  | {:error, [{:ok, map()} | {:error, any()} | :error]}

Takes in a source, a schema, support data, and a keyword list of options. Returns a tuple with :ok or :error (indicating whether all rows are valid) as the first element, and a list of {:ok, map()} | {:error, any()} | :error entries as the second. In case of {:ok, map()} for a given row, the map is the output defined in the schema.

Source

Any data which will be given to the adapter so that it can retrieve the source data. In case of DataQuacker.Adapters.CSV this can be a file path or a file URL.

Schema

A schema formed with the DSL from DataQuacker.Schema.

Support data

Any data which should be accessible inside the various schema elements when a source is parsed.
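For example, the list of known duck types from the earlier schema could be supplied at the call site rather than hard-coded (a sketch; how the schema consumes support data is described in DataQuacker.Schema):

```elixir
DataQuacker.parse(
  "path/to/file.csv",
  PondSchema.schema_structure(:pond),
  # available to validators/transformers while parsing
  %{known_types: ["Mallard", "Domestic", "Mandarin"]}
)
```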

Options

The options can also be specified in the config, for example:

use Mix.Config

# ...

config :data_quacker,
  adapter: DataQuacker.Adapters.Identity,
  adapter_opts: []

# ...
  • :adapter - the adapter module to be used to retrieve the source data; defaults to DataQuacker.Adapters.CSV
  • :adapter_opts - a keyword list of opts to be passed to the adapter; defaults to [separator: ?,, local?: true]; for a list of available adapter options see the documentation for the particular adapter
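Putting the options together, a test could bypass the filesystem entirely by selecting the Identity adapter per call (a sketch; see DataQuacker.Adapters.Identity for the exact shape the in-memory source data must have):

```elixir
DataQuacker.parse(
  in_memory_source,
  PondSchema.schema_structure(:pond),
  nil,
  # per-call options override the config defaults
  adapter: DataQuacker.Adapters.Identity,
  adapter_opts: []
)
```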