Module csv

Copyright © 2021 Serge Aleynikov

Authors: Serge Aleynikov (saleyn@gmail.com).

Description

CSV file parsing functions

Data Types

load_options()

load_options() = 
    [{load_type, recreate | replace | ignore_dups | update_dups} |
     {col_types,
      #{binary() =>
            ColType ::
                atom() | {ColType :: atom(), ColLen :: integer()}}} |
     {batch_size, integer()} |
     {blob_size, integer()} |
     {create_table, boolean()} |
     {save_create_sql_to_file, string()} |
     {transforms, #{binary() => fun((term()) -> term())}} |
     {guess_types, boolean()} |
     {guess_limit_rows, integer()} |
     {max_nulls_pcnt, float()} |
     {primary_key, PKColumns :: binary() | [binary() | list()]} |
     {drop_temp_table, boolean()} |
     {encoding, string() | atom()} |
     {verbose, boolean() | integer()}]

Options for loading data to a database.

{load_type, Type}
Type of loading to perform. recreate will replace the table by atomically dropping the old one, creating/loading the new one, and replacing the table. replace will do an insert by using REPLACE INTO statement. ignore_dups will use INSERT IGNORE INTO statement to ignore records with duplicate keys. update_dups will do an INSERT INTO and ON DUPLICATE KEY UPDATE, so that the old records are updated and the new ones are inserted.
{create_table, Allow}
Allow to create a table if it doesn't exist
{col_types, Map}
Types of data for all or some columens. The Map is in the format: ColName::binary() => ColInfo, where ColInfo is ColType | {ColType, ColLen::integer()}, and ColType is: date | datetime | integer | float | blob | number.
{transforms, Map}
Value transformation function for columns. The Map is in the format: ColName::binary() => fun((Value::term()) -> term()).
{batch_size, Size}
Number of records per SQL insert/update/replace call
{blob_size, Size}
Threshold in number of bytes at which a VARCHAR field is defined as BLOB
{save_create_sql_to_file, Filename::string()}
Save CREATE TABLE sql statement to a file
guess_types
When specified, the function will try to guess the type of data in columns instead of treating all data as string fields. The possible data typed guessed: integer, float, date, datetime, number, string
{guess_limit_rows, Limit}
Limit the number of rows for guessing the column data types
{max_nulls_pcnt, Percent}
A percentage threshold of permissible NULLs in a column (0-100), above which the column data type is forced to be treated as string
{primary_key, Fields}
Names of primary key fields in the created table
{drop_temp_table, boolean()}
When true (default), temp table is dropped.
{encoding, Encoding}
The name of the encoding to use for storing data. For the list of permissible values [see this link](https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-sets.html)
verbose
Print additional details to stdout

parse_options()

parse_options() = 
    [fix_lengths | binary | list |
     {open, list()} |
     {columns, [binary() | string()]} |
     {converters,
      [{binary() | string() | all,
        fun((binary(), binary()) -> binary()) |
        {rex, binary(), binary()}}]}]

Function Index

guess_data_type/1Guess the type of data by its value.
guess_data_types/2Guess data types of fields in the given CSV list of rows obtained by parsing a CSV file with parse(File,[fix_lengths]).
load_to_mysql/4Load CSV data from a File to a MySQL database.
max_field_lengths/2Get max field lengths for a list obtained by parsing a CSV file with parse_csv_file(File,[fix_lengths]).
parse/1Parse a CSV file using default options.
parse/2Parse a given CSV file.
parse_line/1Parse a CSV line.

Function Details

guess_data_type/1

guess_data_type(S :: binary()) ->
                   {null | date | datetime | integer | float |
                    string,
                    term(),
                    string()}

Guess the type of data by its value

guess_data_types/2

guess_data_types(HasHeaderRow :: boolean(),
                 Rows :: [Fields :: [binary()]]) ->
                    {Type ::
                         string | integer | number | float |
                         date | datetime,
                     MaxFieldLen :: integer(),
                     NumOfNulls :: integer()}

Guess data types of fields in the given CSV list of rows obtained by parsing a CSV file with parse(File,[fix_lengths]). The function returns a list of tuples {Type, MaxFieldLen, NumOfNulls}, where the Type is a field type, MaxFieldLen is the max length of data in this column, and NumOfNulls is the number of rows with empty values in this column.

load_to_mysql/4

load_to_mysql(File :: string(),
              Tab :: string(),
              MySqlPid :: pid(),
              Opts :: load_options()) ->
                 {Columns :: list(),
                  AffectedCount :: integer(),
                  RecCount :: integer()}

Load CSV data from a File to a MySQL database. Tab is the name of a table where to load data. MySqlPid is the pid of a MySQL database connection returned by mysql:start_link/1. The data in the table is replaced according to {import_type, Type}:

NOTE: this function requires https://github.com/mysql-otp/mysql-otp.git

max_field_lengths/2

max_field_lengths(HasHeaderRow :: boolean(),
                  Rows :: [Fields :: list()]) ->
                     [Len :: integer()]

Get max field lengths for a list obtained by parsing a CSV file with parse_csv_file(File,[fix_lengths]).

parse/1

parse(File :: string()) -> [[binary()]]

Parse a CSV file using default options.

parse/2

parse(File :: binary() | string(), Opts :: parse_options()) ->
         [[string()]]

Parse a given CSV file.

parse_line/1

parse_line(Line :: binary()) -> list()

Parse a CSV line