View Source Explorer.DataFrame (Explorer v0.4.0)
The DataFrame struct and API.
Dataframes are two-dimensional tabular data structures similar to a spreadsheet. For example, the Iris dataset:
iex> Explorer.Datasets.iris()
#Explorer.DataFrame<
Polars[150 x 5]
sepal_length float [5.1, 4.9, 4.7, 4.6, 5.0, ...]
sepal_width float [3.5, 3.0, 3.2, 3.1, 3.6, ...]
petal_length float [1.4, 1.4, 1.3, 1.5, 1.4, ...]
petal_width float [0.2, 0.2, 0.2, 0.2, 0.2, ...]
species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", ...]
>
This dataframe has 150 rows and five columns. Each column is an Explorer.Series
of the same size (150):
iex> df = Explorer.Datasets.iris()
iex> df["sepal_length"]
#Explorer.Series<
Polars[150]
float [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.8, 4.8, 4.3, 5.8, 5.7, 5.4, 5.1, 5.7, 5.1, 5.4, 5.1, 4.6, 5.1, 4.8, 5.0, 5.0, 5.2, 5.2, 4.7, 4.8, 5.4, 5.2, 5.5, 4.9, 5.0, 5.5, 4.9, 4.4, 5.1, 5.0, 4.5, 4.4, 5.0, 5.1, 4.8, 5.1, 4.6, 5.3, 5.0, ...]
>
creating-dataframes
Creating dataframes
Dataframes can be created from normal Elixir objects. The main way you might do this is
with the new/1
function. For example:
iex> Explorer.DataFrame.new(a: ["a", "b"], b: [1, 2])
#Explorer.DataFrame<
Polars[2 x 2]
a string ["a", "b"]
b integer [1, 2]
>
Or with a list of maps:
iex> Explorer.DataFrame.new([%{"col1" => "a", "col2" => 1}, %{"col1" => "b", "col2" => 2}])
#Explorer.DataFrame<
Polars[2 x 2]
col1 string ["a", "b"]
col2 integer [1, 2]
>
verbs
Verbs
Explorer uses the idea of a consistent set of SQL-like verbs
like
dplyr
which can help solve common
data manipulation challenges. These are split into single table verbs,
multiple table verbs, and row-based verbs:
single-table-verbs
Single table verbs
Single table verbs are (unsurprisingly) used for manipulating a single dataframe. Those operations typically driven by column names. These are:
select/2
for picking columns anddiscard/2
to discard themfilter/2
for picking rows based on predicatesmutate/2
for adding or replacing columns that are functions of existing columnsarrange/2
for changing the ordering of rowsdistinct/2
for picking unique rowssummarise/2
for reducing multiple rows down to a single summarypivot_longer/3
andpivot_wider/4
for massaging dataframes into longer or wider forms, respectively
Each of these combine with Explorer.DataFrame.group_by/2
for operating by group.
multiple-table-verbs
Multiple table verbs
Multiple table verbs are used for combining tables. These are:
join/3
for performing SQL-like joinsconcat_columns/1
for horizontally "stacking" dataframesconcat_rows/1
for vertically "stacking" dataframes
row-based-verbs
Row-based verbs
Those operations are driven by the row index. These are:
head/2
for picking the first rowstail/2
for picking the last rowsslice/2
for slicing the dataframe by row indexes or a rangeslice/3
for slicing a section by an offsetsample/2
for sampling the data-frame by row
io-operations
IO operations
Explorer supports reading and writing of:
- delimited files (such as CSV)
- Parquet
- Arrow IPC
- Arrow Streaming IPC
- Newline Delimited JSON
The convention Explorer uses is to have from_*
and to_*
functions to read and write
to files in the formats above. load_*
and dump_*
versions are also available to read
and write those formats directly in memory.
access
Access
In addition to this "grammar" of data manipulation, you'll find useful functions for
slicing and dicing dataframes such as pull/2
, head/2
, sample/3
, slice/2
, and
slice/3
.
Explorer.DataFrame
also implements the Access
behaviour (also known as the brackets
syntax). This should be familiar for users coming from other language with dataframes
such as R or Python. For example:
iex> df = Explorer.Datasets.wine()
iex> df["class"]
#Explorer.Series<
Polars[178]
integer [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]
>
Link to this section Summary
Types
Represents a column name or its index.
Represents a column name as atom or string.
Represents multiple column names as atoms or strings.
Represents a column pair where the value is a column name or
a column index, and the value is of type value
.
Represents multiple columns.
Represents a dataframe.
Functions: Conversion
This collects the lazy data frame into an eager one, computing the query.
Creates a new dataframe.
Converts a dataframe to a list of columns with lists as values.
Converts the dataframe to the lazy version of the current backend.
Converts a dataframe to a list of maps (rows).
Converts a dataframe to a list of columns with series as values.
Functions: Single-table
Arranges/sorts rows by columns using Explorer.Query
.
Arranges/sorts rows by columns using a callback function.
Discards a subset of columns by name.
Takes distinct rows by a selection of columns.
Drop nil values.
Turns a set of columns to dummy variables.
Picks rows based on Explorer.Query
.
Picks rows based on a callback function.
Group the dataframe by one or more variables.
Picks rows based on a list or series of values.
Creates or modifies columns based on Explorer.Query
.
Creates or modifies columns using a callback function.
Pivot data from wide to long.
Pivot data from long to wide.
Extracts a single column as a series.
Creates or modifies a single column.
Renames columns.
Renames columns with a function.
Selects a subset of columns by name.
Summarise each group to a single row using Explorer.Query
.
Summarise each group to a single row using a callback function.
Removes grouping variables.
Functions: Multi-table
Combine two or more dataframes column-wise.
Combine two dataframes column-wise.
Combine two or more dataframes row-wise (stack).
Combine two dataframes row-wise.
Join two tables.
Functions: Row-based
Returns the first n rows of the dataframe.
Sample rows from a dataframe.
Slices rows at the given indices as a new dataframe.
Subset a continuous set of rows.
Returns the last n rows of the dataframe.
Functions: Introspection
Gets the dtypes of the dataframe columns.
Returns the groups of a dataframe.
Returns the number of columns in the dataframe.
Returns the number of rows in the dataframe.
Gets the names of the dataframe columns.
Gets the shape of the dataframe as a {height, width}
tuple.
Display the DataFrame in a tabular fashion.
Functions: IO
Writes a dataframe to a binary representation of a delimited file.
Similar to dump_csv/2
, but raises in case of error.
Writes a dataframe to a binary representation of an IPC file.
Similar to dump_ipc/2
, but raises in case of error.
Writes a dataframe to a binary representation of an IPC Stream file.
Similar to dump_ipc_stream/2
, but raises in case of error.
Writes a dataframe to a binary representation of a NDJSON file.
Similar to dump_ndjson!/2
, but raises in case of error.
Writes a dataframe to a binary representation of a Parquet file.
Similar to dump_parquet/2
, but raises in case of error.
Reads a delimited file into a dataframe.
Similar to from_csv/2
but raises if there is a problem reading the CSV.
Reads an IPC file into a dataframe.
Similar to from_ipc/2
but raises if there is a problem reading the IPC file.
Reads an IPC Streaming file into a dataframe.
Similar to from_ipc_stream/2
but raises if there is a problem reading the IPC Stream file.
Read a file of JSON objects or lists separated by new lines
Similar to from_ndjson/2
, but raises in case of error.
Reads a parquet file into a dataframe.
Similar to from_parquet/2
but raises if there is a problem reading the Parquet file.
Reads a representation of a CSV file into a dataframe.
Similar to load_csv/2
but raises if there is a problem reading the CSV.
Reads a binary representing an IPC file into a dataframe.
Similar to load_ipc/2
but raises if there is a problem reading the IPC file.
Reads a binary representing an IPC Stream file into a dataframe.
Similar to load_ipc_stream/2
but raises if there is a problem.
Reads a representation of a NDJSON file into a dataframe.
Similar to load_ndjson/2
, but raises in case of error.
Reads a binary representation of a parquet file into a dataframe.
Similar to load_parquet/2
but raises if there is a problem reading the Parquet file.
Writes a dataframe to a delimited file.
Similar to to_csv/3
but raises if there is a problem reading the CSV.
Writes a dataframe to an IPC file.
Similar to to_ipc/3
, but raises in case of error.
Writes a dataframe to an IPC Stream file.
Writes a dataframe to a ndjson file.
Writes a dataframe to a parquet file.
Similar to to_parquet/3
, but raises in case of error.
Link to this section Types
@type column() :: column_name() | non_neg_integer()
Represents a column name or its index.
Represents a column name as atom or string.
@type column_names() :: [column_name()]
Represents multiple column names as atoms or strings.
Represents a column pair where the value is a column name or
a column index, and the value is of type value
.
@type columns() :: [column()] | Range.t() | (String.t() -> boolean()) | (String.t(), Explorer.Series.dtype() -> boolean())
Represents multiple columns.
The columns may be specified as one of:
a list of columns indexes or names as atoms and strings
a range
a one-arity function that receives column names and returns true for column names to keep
a two-arity function that receives column names and types and returns true for column names to keep
@type t() :: %Explorer.DataFrame{ data: Explorer.Backend.DataFrame.t(), dtypes: %{required(String.t()) => Explorer.Series.dtype()}, groups: [String.t()], names: [String.t()] }
Represents a dataframe.
Link to this section Functions: Conversion
This collects the lazy data frame into an eager one, computing the query.
If already eager, this is a noop.
Collecting a grouped dataframe should return a grouped dataframe.
@spec new( Table.Reader.t() | series_pairs, opts :: Keyword.t() ) :: t() when series_pairs: %{required(column_name()) => Explorer.Series.t()} | [{column_name(), Explorer.Series.t()}]
Creates a new dataframe.
Accepts any tabular data adhering to the Table.Reader
protocol, as well as a map or a keyword list with series.
options
Options
:backend
- The Explorer backend to use. Defaults to the value returned byExplorer.Backend.get/0
.
examples
Examples
From series:
iex> Explorer.DataFrame.new(%{floats: Explorer.Series.from_list([1.0, 2.0]), ints: Explorer.Series.from_list([1, nil])})
#Explorer.DataFrame<
Polars[2 x 2]
floats float [1.0, 2.0]
ints integer [1, nil]
>
From columnar data:
iex> Explorer.DataFrame.new(%{floats: [1.0, 2.0], ints: [1, nil]})
#Explorer.DataFrame<
Polars[2 x 2]
floats float [1.0, 2.0]
ints integer [1, nil]
>
iex> Explorer.DataFrame.new(floats: [1.0, 2.0], ints: [1, nil])
#Explorer.DataFrame<
Polars[2 x 2]
floats float [1.0, 2.0]
ints integer [1, nil]
>
iex> Explorer.DataFrame.new(%{floats: [1.0, 2.0], ints: [1, "wrong"]})
** (ArgumentError) cannot create series "ints": the value "wrong" does not match the inferred series dtype :integer
From row data:
iex> rows = [%{id: 1, name: "José"}, %{id: 2, name: "Christopher"}, %{id: 3, name: "Cristine"}]
iex> Explorer.DataFrame.new(rows)
#Explorer.DataFrame<
Polars[3 x 2]
id integer [1, 2, 3]
name string ["José", "Christopher", "Cristine"]
>
iex> rows = [[id: 1, name: "José"], [id: 2, name: "Christopher"], [id: 3, name: "Cristine"]]
iex> Explorer.DataFrame.new(rows)
#Explorer.DataFrame<
Polars[3 x 2]
id integer [1, 2, 3]
name string ["José", "Christopher", "Cristine"]
>
Converts a dataframe to a list of columns with lists as values.
See to_series/2
if you want a list of columns with series as values.
Note that this function does not take into account groups.
options
Options
:atom_keys
- Configure if the resultant map should have atom keys. (default:false
)
examples
Examples
iex> df = Explorer.DataFrame.new(ints: [1, nil], floats: [1.0, 2.0])
iex> Explorer.DataFrame.to_columns(df)
%{"floats" => [1.0, 2.0], "ints" => [1, nil]}
iex> df = Explorer.DataFrame.new(floats: [1.0, 2.0], ints: [1, nil])
iex> Explorer.DataFrame.to_columns(df, atom_keys: true)
%{floats: [1.0, 2.0], ints: [1, nil]}
Converts the dataframe to the lazy version of the current backend.
If already lazy, this is a noop.
Converting a grouped dataframe should return a lazy dataframe with groups.
Converts a dataframe to a list of maps (rows).
Warning
This may be an expensive operation because
polars
stores data in columnar format.
options
Options
:atom_keys
- Configure if the resultant maps should have atom keys. (default:false
)
examples
Examples
iex> df = Explorer.DataFrame.new(floats: [1.0, 2.0], ints: [1, nil])
iex> Explorer.DataFrame.to_rows(df)
[%{"floats" => 1.0, "ints" => 1}, %{"floats" => 2.0 ,"ints" => nil}]
iex> df = Explorer.DataFrame.new(floats: [1.0, 2.0], ints: [1, nil])
iex> Explorer.DataFrame.to_rows(df, atom_keys: true)
[%{floats: 1.0, ints: 1}, %{floats: 2.0, ints: nil}]
Converts a dataframe to a list of columns with series as values.
See to_columns/2
if you want a list of columns with lists as values.
Note that this function does not take into account groups.
options
Options
:atom_keys
- Configure if the resultant map should have atom keys. (default:false
)
examples
Examples
iex> df = Explorer.DataFrame.new(ints: [1, nil], floats: [1.0, 2.0])
iex> map = Explorer.DataFrame.to_series(df)
iex> Explorer.Series.to_list(map["floats"])
[1.0, 2.0]
iex> Explorer.Series.to_list(map["ints"])
[1, nil]
Link to this section Functions: Single-table
Arranges/sorts rows by columns using Explorer.Query
.
Notice
This is macro, therefore you must
require Explorer.DataFrame
before using it.
See arrange_with/2
for a callback version of this function without
Explorer.Query
.
examples
Examples
A single column name will sort ascending by that column:
iex> df = Explorer.DataFrame.new(a: ["b", "c", "a"], b: [1, 2, 3])
iex> Explorer.DataFrame.arrange(df, a)
#Explorer.DataFrame<
Polars[3 x 2]
a string ["a", "b", "c"]
b integer [3, 1, 2]
>
You can also sort descending:
iex> df = Explorer.DataFrame.new(a: ["b", "c", "a"], b: [1, 2, 3])
iex> Explorer.DataFrame.arrange(df, desc: a)
#Explorer.DataFrame<
Polars[3 x 2]
a string ["c", "b", "a"]
b integer [2, 1, 3]
>
Sorting by more than one column sorts them in the order they are entered:
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.arrange(df, asc: total, desc: country)
#Explorer.DataFrame<
Polars[1094 x 10]
year integer [2010, 2010, 2011, 2011, 2012, ...]
country string ["NIUE", "TUVALU", "TUVALU", "NIUE", "NIUE", ...]
total integer [1, 2, 2, 2, 2, ...]
solid_fuel integer [0, 0, 0, 0, 0, ...]
liquid_fuel integer [1, 2, 2, 2, 2, ...]
gas_fuel integer [0, 0, 0, 0, 0, ...]
cement integer [0, 0, 0, 0, 0, ...]
gas_flaring integer [0, 0, 0, 0, 0, ...]
per_capita float [0.52, 0.0, 0.0, 1.04, 1.04, ...]
bunker_fuels integer [0, 0, 0, 0, 0, ...]
>
grouped-examples
Grouped examples
When used in a grouped dataframe, arrange is going to sort each group individually and then return the entire dataframe with the existing groups. If one of the arrange columns is also a group, the sorting for that column is not going to work. It is necessary to first summarise the desired column and then arrange it.
Here is an example using the Iris dataset. We group by species and then we try to sort the dataframe by species and petal length, but only "petal length" is taken into account because "species" is a group.
iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.arrange(grouped, desc: species, asc: sepal_width)
#Explorer.DataFrame<
Polars[150 x 5]
Groups: ["species"]
sepal_length float [4.5, 4.4, 4.9, 4.8, 4.3, ...]
sepal_width float [2.3, 2.9, 3.0, 3.0, 3.0, ...]
petal_length float [1.3, 1.4, 1.4, 1.4, 1.1, ...]
petal_width float [0.3, 0.2, 0.2, 0.1, 0.1, ...]
species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", ...]
>
@spec arrange_with( df :: t(), (Explorer.Backend.LazyFrame.t() -> Explorer.Series.lazy_t() | [Explorer.Series.lazy_t()] | [{:asc | :desc, Explorer.Series.lazy_t()}]) ) :: t()
Arranges/sorts rows by columns using a callback function.
This is a callback version of arrange/2
.
examples
Examples
A single column name will sort ascending by that column:
iex> df = Explorer.DataFrame.new(a: ["b", "c", "a"], b: [1, 2, 3])
iex> Explorer.DataFrame.arrange_with(df, &(&1["a"]))
#Explorer.DataFrame<
Polars[3 x 2]
a string ["a", "b", "c"]
b integer [3, 1, 2]
>
You can also sort descending:
iex> df = Explorer.DataFrame.new(a: ["b", "c", "a"], b: [1, 2, 3])
iex> Explorer.DataFrame.arrange_with(df, &[desc: &1["a"]])
#Explorer.DataFrame<
Polars[3 x 2]
a string ["c", "b", "a"]
b integer [2, 1, 3]
>
Sorting by more than one column sorts them in the order they are entered:
iex> df = Explorer.DataFrame.new(a: [3, 1, 3], b: [2, 1, 3])
iex> Explorer.DataFrame.arrange_with(df, &[desc: &1["a"], asc: &1["b"]])
#Explorer.DataFrame<
Polars[3 x 2]
a integer [3, 3, 1]
b integer [2, 3, 1]
>
grouped-examples
Grouped examples
iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.arrange_with(grouped, &[desc: &1["species"], asc: &1["sepal_width"]])
#Explorer.DataFrame<
Polars[150 x 5]
Groups: ["species"]
sepal_length float [4.5, 4.4, 4.9, 4.8, 4.3, ...]
sepal_width float [2.3, 2.9, 3.0, 3.0, 3.0, ...]
petal_length float [1.3, 1.4, 1.4, 1.4, 1.1, ...]
petal_width float [0.3, 0.2, 0.2, 0.1, 0.1, ...]
species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", ...]
>
Discards a subset of columns by name.
It's important to notice that groups are kept: you can't discard grouping columns.
examples
Examples
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.discard(df, ["b"])
#Explorer.DataFrame<
Polars[3 x 1]
a string ["a", "b", "c"]
>
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3], c: [4, 5, 6])
iex> Explorer.DataFrame.discard(df, ["a", "b"])
#Explorer.DataFrame<
Polars[3 x 1]
c integer [4, 5, 6]
>
Ranges and functions are also accepted in column names, as in select/2
.
grouped-examples
Grouped examples
You cannot discard grouped columns. You need to ungroup before removing them:
iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.discard(grouped, ["species"])
#Explorer.DataFrame<
Polars[150 x 5]
Groups: ["species"]
sepal_length float [5.1, 4.9, 4.7, 4.6, 5.0, ...]
sepal_width float [3.5, 3.0, 3.2, 3.1, 3.6, ...]
petal_length float [1.4, 1.4, 1.3, 1.5, 1.4, ...]
petal_width float [0.2, 0.2, 0.2, 0.2, 0.2, ...]
species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", ...]
>
Takes distinct rows by a selection of columns.
Distinct is not affected by groups, although groups are kept in the
columns selection if keep_all
option is false (the default).
options
Options
:keep_all
- If set totrue
, keep all columns. Default isfalse
.
examples
Examples
By default will return unique values of the requested columns:
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.distinct(df, ["year", "country"])
#Explorer.DataFrame<
Polars[1094 x 2]
year integer [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
>
If keep_all
is set to true
, then the first value of each column not in the requested
columns will be returned:
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.distinct(df, ["year", "country"], keep_all: true)
#Explorer.DataFrame<
Polars[1094 x 10]
year integer [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total integer [2308, 1254, 32500, 141, 7924, ...]
solid_fuel integer [627, 117, 332, 0, 0, ...]
liquid_fuel integer [1601, 953, 12381, 141, 3649, ...]
gas_fuel integer [74, 7, 14565, 0, 374, ...]
cement integer [5, 177, 2598, 0, 204, ...]
gas_flaring integer [0, 0, 2623, 0, 3697, ...]
per_capita float [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels integer [9, 7, 663, 0, 321, ...]
>
A callback on the dataframe's names can be passed instead of a list (like select/2
):
iex> df = Explorer.DataFrame.new(x1: [1, 3, 3], x2: ["a", "c", "c"], y1: [1, 2, 3])
iex> Explorer.DataFrame.distinct(df, &String.starts_with?(&1, "x"))
#Explorer.DataFrame<
Polars[2 x 2]
x1 integer [1, 3]
x2 string ["a", "c"]
>
If the dataframe has groups, then the columns of each group will be added to the distinct columns:
iex> df = Explorer.DataFrame.new(x1: [1, 3, 3], x2: ["a", "c", "c"], y1: [1, 2, 3])
iex> df = Explorer.DataFrame.group_by(df, "x1")
iex> Explorer.DataFrame.distinct(df, ["x2"])
#Explorer.DataFrame<
Polars[2 x 2]
Groups: ["x1"]
x1 integer [1, 3]
x2 string ["a", "c"]
>
Drop nil values.
Optionally accepts a subset of columns.
examples
Examples
To drop nils on all columns:
iex> df = Explorer.DataFrame.new(a: [1, 2, nil], b: [1, nil, 3])
iex> Explorer.DataFrame.drop_nil(df)
#Explorer.DataFrame<
Polars[1 x 2]
a integer [1]
b integer [1]
>
To select some columns:
iex> df = Explorer.DataFrame.new(a: [1, 2, nil], b: [1, nil, 3], c: [nil, 5, 6])
iex> Explorer.DataFrame.drop_nil(df, [:a, :c])
#Explorer.DataFrame<
Polars[1 x 3]
a integer [2]
b integer [nil]
c integer [5]
>
To select some columns by range:
iex> df = Explorer.DataFrame.new(a: [1, 2, nil], b: [1, nil, 3], c: [nil, 5, 6])
iex> Explorer.DataFrame.drop_nil(df, 0..1)
#Explorer.DataFrame<
Polars[1 x 3]
a integer [1]
b integer [1]
c integer [nil]
>
Or to select columns by a callback on the names:
iex> df = Explorer.DataFrame.new(a: [1, 2, nil], b: [1, nil, 3], c: [nil, 5, 6])
iex> Explorer.DataFrame.drop_nil(df, fn name -> name == "a" or name == "b" end)
#Explorer.DataFrame<
Polars[1 x 3]
a integer [1]
b integer [1]
c integer [nil]
>
Or to select columns by a callback on the names and types:
iex> df = Explorer.DataFrame.new(a: [1, 2, nil], b: [1, nil, 3], c: [nil, 5.0, 6.0])
iex> Explorer.DataFrame.drop_nil(df, fn _name, type -> type == :float end)
#Explorer.DataFrame<
Polars[2 x 3]
a integer [2, nil]
b integer [nil, 3]
c float [5.0, 6.0]
>
Turns a set of columns to dummy variables.
In case the dataframe is using groups, all groups will be removed.
examples
Examples
To mark a single column as dummy:
iex> df = Explorer.DataFrame.new(col_x: ["a", "b", "a", "c"], col_y: ["b", "a", "b", "d"])
iex> Explorer.DataFrame.dummies(df, "col_x")
#Explorer.DataFrame<
Polars[4 x 3]
col_x_a integer [1, 0, 1, 0]
col_x_b integer [0, 1, 0, 0]
col_x_c integer [0, 0, 0, 1]
>
Or multiple columns:
iex> df = Explorer.DataFrame.new(col_x: ["a", "b", "a", "c"], col_y: ["b", "a", "b", "d"])
iex> Explorer.DataFrame.dummies(df, ["col_x", "col_y"])
#Explorer.DataFrame<
Polars[4 x 6]
col_x_a integer [1, 0, 1, 0]
col_x_b integer [0, 1, 0, 0]
col_x_c integer [0, 0, 0, 1]
col_y_b integer [1, 0, 1, 0]
col_y_a integer [0, 1, 0, 0]
col_y_d integer [0, 0, 0, 1]
>
Or all string columns:
iex> df = Explorer.DataFrame.new(num: [1, 2, 3, 4], col_y: ["b", "a", "b", "d"])
iex> Explorer.DataFrame.dummies(df, fn _name, type -> type == :string end)
#Explorer.DataFrame<
Polars[4 x 3]
col_y_b integer [1, 0, 1, 0]
col_y_a integer [0, 1, 0, 0]
col_y_d integer [0, 0, 0, 1]
>
Picks rows based on Explorer.Query
.
The query is compiled and runs efficiently against the dataframe. The query must return a boolean expression.
Notice
This is macro, therefore you must
require Explorer.DataFrame
before using it.
Besides element-wise series operations, you can also use window functions and aggregations inside comparisons. In such cases, grouped dataframes may have different results than ungrouped ones, because the filtering is computed withing groups. See examples below.
See filter_with/2
for a callback version of this function without
Explorer.Query
.
examples
Examples
iex> df = Explorer.DataFrame.new(col1: ["a", "b", "c"], col2: [1, 2, 3])
iex> Explorer.DataFrame.filter(df, col2 > 2)
#Explorer.DataFrame<
Polars[1 x 2]
col1 string ["c"]
col2 integer [3]
>
iex> df = Explorer.DataFrame.new(col1: ["a", "b", "c"], col2: [1, 2, 3])
iex> Explorer.DataFrame.filter(df, col1 == "b")
#Explorer.DataFrame<
Polars[1 x 2]
col1 string ["b"]
col2 integer [2]
>
Returning a non-boolean expression errors:
iex> df = Explorer.DataFrame.new(col1: ["a", "b", "c"], col2: [1, 2, 3])
iex> Explorer.DataFrame.filter(df, cumulative_max(col2))
** (ArgumentError) expecting the function to return a boolean LazySeries, but instead it returned a LazySeries of type :integer
Which can be addressed by converting it to boolean:
iex> df = Explorer.DataFrame.new(col1: ["a", "b", "c"], col2: [1, 2, 3])
iex> Explorer.DataFrame.filter(df, cumulative_max(col2) == 1)
#Explorer.DataFrame<
Polars[1 x 2]
col1 string ["a"]
col2 integer [1]
>
grouped-examples
Grouped examples
In a grouped dataframe, the aggregation is calculated within each group.
In the following example we select the flowers of the Iris dataset that have the "petal length" above the average of each species group.
iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.filter(grouped, petal_length > mean(petal_length))
#Explorer.DataFrame<
Polars[79 x 5]
Groups: ["species"]
sepal_length float [4.6, 5.4, 5.0, 4.9, 5.4, ...]
sepal_width float [3.1, 3.9, 3.4, 3.1, 3.7, ...]
petal_length float [1.5, 1.7, 1.5, 1.5, 1.5, ...]
petal_width float [0.2, 0.4, 0.2, 0.1, 0.2, ...]
species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", ...]
>
@spec filter_with( df :: t(), callback :: (Explorer.Backend.LazyFrame.t() -> Explorer.Series.lazy_t()) ) :: t()
Picks rows based on a callback function.
This is a callback version of filter/2
.
examples
Examples
iex> df = Explorer.DataFrame.new(col1: ["a", "b", "c"], col2: [1, 2, 3])
iex> Explorer.DataFrame.filter_with(df, &Explorer.Series.greater(&1["col2"], 2))
#Explorer.DataFrame<
Polars[1 x 2]
col1 string ["c"]
col2 integer [3]
>
iex> df = Explorer.DataFrame.new(col1: ["a", "b", "c"], col2: [1, 2, 3])
iex> Explorer.DataFrame.filter_with(df, fn df -> Explorer.Series.equal(df["col1"], "b") end)
#Explorer.DataFrame<
Polars[1 x 2]
col1 string ["b"]
col2 integer [2]
>
grouped-examples
Grouped examples
In a grouped dataframe, the aggregation is calculated within each group.
In the following example we select the flowers of the Iris dataset that have the "petal length" above the average of each species group.
iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.filter_with(grouped, &Explorer.Series.greater(&1["petal_length"], Explorer.Series.mean(&1["petal_length"])))
#Explorer.DataFrame<
Polars[79 x 5]
Groups: ["species"]
sepal_length float [4.6, 5.4, 5.0, 4.9, 5.4, ...]
sepal_width float [3.1, 3.9, 3.4, 3.1, 3.7, ...]
petal_length float [1.5, 1.7, 1.5, 1.5, 1.5, ...]
petal_width float [0.2, 0.4, 0.2, 0.1, 0.2, ...]
species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", ...]
>
@spec group_by(df :: t(), groups_or_group :: column_names() | column_name()) :: t()
Group the dataframe by one or more variables.
When the dataframe has grouping variables, operations are performed per group.
Explorer.DataFrame.ungroup/2
removes grouping.
examples
Examples
You can group by a single variable:
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.group_by(df, "country")
#Explorer.DataFrame<
Polars[1094 x 10]
Groups: ["country"]
year integer [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total integer [2308, 1254, 32500, 141, 7924, ...]
solid_fuel integer [627, 117, 332, 0, 0, ...]
liquid_fuel integer [1601, 953, 12381, 141, 3649, ...]
gas_fuel integer [74, 7, 14565, 0, 374, ...]
cement integer [5, 177, 2598, 0, 204, ...]
gas_flaring integer [0, 0, 2623, 0, 3697, ...]
per_capita float [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels integer [9, 7, 663, 0, 321, ...]
>
Or you can group by multiple:
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.group_by(df, ["country", "year"])
#Explorer.DataFrame<
Polars[1094 x 10]
Groups: ["country", "year"]
year integer [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total integer [2308, 1254, 32500, 141, 7924, ...]
solid_fuel integer [627, 117, 332, 0, 0, ...]
liquid_fuel integer [1601, 953, 12381, 141, 3649, ...]
gas_fuel integer [74, 7, 14565, 0, 374, ...]
cement integer [5, 177, 2598, 0, 204, ...]
gas_flaring integer [0, 0, 2623, 0, 3697, ...]
per_capita float [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels integer [9, 7, 663, 0, 321, ...]
>
@spec mask(df :: t(), mask :: Explorer.Series.t() | [boolean()]) :: t()
Picks rows based on a list or series of values.
examples
Examples
This function must only be used when you need to select rows based on external values that are not available to the dataframe. For example, you can pass a list:
iex> df = Explorer.DataFrame.new(col1: ["a", "b", "c"], col2: [1, 2, 3])
iex> Explorer.DataFrame.mask(df, [false, true, false])
#Explorer.DataFrame<
Polars[1 x 2]
col1 string ["b"]
col2 integer [2]
>
You must avoid using masks when the masks themselves are computed from other columns. For example, DO NOT do this:
iex> df = Explorer.DataFrame.new(col1: ["a", "b", "c"], col2: [1, 2, 3])
iex> Explorer.DataFrame.mask(df, Explorer.Series.greater(df["col2"], 1))
#Explorer.DataFrame<
Polars[2 x 2]
col1 string ["b", "c"]
col2 integer [2, 3]
>
Instead, do this:
iex> df = Explorer.DataFrame.new(col1: ["a", "b", "c"], col2: [1, 2, 3])
iex> Explorer.DataFrame.filter_with(df, fn df -> Explorer.Series.greater(df["col2"], 1) end)
#Explorer.DataFrame<
Polars[2 x 2]
col1 string ["b", "c"]
col2 integer [2, 3]
>
The filter_with/2
version is much more efficient because it doesn't need
to create intermediate series representations to apply the mask.
Creates or modifies columns based on Explorer.Query
.
The query is compiled and runs efficiently against the dataframe. New variables overwrite existing variables of the same name. Column names are coerced from atoms to strings.
Notice
This is macro, therefore you must
require Explorer.DataFrame
before using it.
Besides element-wise series operations, you can also use window functions and aggregations inside mutations. In such cases, grouped dataframes may have different results than ungrouped ones, because the mutation is computed withing groups. See examples below.
See mutate_with/2
for a callback version of this function without
Explorer.Query
.
examples
Examples
Mutations are useful to add or modify columns in your dataframe:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.mutate(df, c: b + 1)
#Explorer.DataFrame<
Polars[3 x 3]
a string ["a", "b", "c"]
b integer [1, 2, 3]
c integer [2, 3, 4]
>
It's also possible to overwrite existing columns:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.mutate(df, a: b * 2)
#Explorer.DataFrame<
Polars[3 x 2]
a integer [2, 4, 6]
b integer [1, 2, 3]
>
Scalar values are repeated to fill the series:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.mutate(df, a: 4)
#Explorer.DataFrame<
Polars[3 x 2]
a integer [4, 4, 4]
b integer [1, 2, 3]
>
It's also possible to use functions from the Series module, like Explorer.Series.window_sum/3
:
iex> df = Explorer.DataFrame.new(a: [1, 2, 3])
iex> Explorer.DataFrame.mutate(df, b: window_sum(a, 2))
#Explorer.DataFrame<
Polars[3 x 2]
a integer [1, 2, 3]
b integer [1, 3, 5]
>
Alternatively, all of the above works with a map instead of a keyword list:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.mutate(df, %{"c" => cast(b, :float)})
#Explorer.DataFrame<
Polars[3 x 3]
a string ["a", "b", "c"]
b integer [1, 2, 3]
c float [1.0, 2.0, 3.0]
>
grouped-examples
Grouped examples
Mutations in grouped dataframes takes the context of the group.
This enables some aggregations to be made considering each group. It's almost like summarise/2
,
but repeating the results for each member in the group.
For example, if we want to count how many elements of a given group, we can add a new
column with that aggregation:
iex> df = Explorer.DataFrame.new(id: ["a", "a", "b"], b: [1, 2, 3])
iex> grouped = Explorer.DataFrame.group_by(df, :id)
iex> Explorer.DataFrame.mutate(grouped, count: count(b))
#Explorer.DataFrame<
Polars[3 x 3]
Groups: ["id"]
id string ["a", "a", "b"]
b integer [1, 2, 3]
count integer [2, 2, 1]
>
In case we want to get the average size of the petal length from the Iris dataset, we can:
iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.mutate(grouped, petal_length_avg: mean(petal_length))
#Explorer.DataFrame<
Polars[150 x 6]
Groups: ["species"]
sepal_length float [5.1, 4.9, 4.7, 4.6, 5.0, ...]
sepal_width float [3.5, 3.0, 3.2, 3.1, 3.6, ...]
petal_length float [1.4, 1.4, 1.3, 1.5, 1.4, ...]
petal_width float [0.2, 0.2, 0.2, 0.2, 0.2, ...]
species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", ...]
petal_length_avg float [1.4640000000000004, 1.4640000000000004, 1.4640000000000004, 1.4640000000000004, 1.4640000000000004, ...]
>
@spec mutate_with( df :: t(), callback :: (Explorer.Backend.LazyFrame.t() -> column_pairs(Explorer.Series.lazy_t())) ) :: t()
Creates or modifies columns using a callback function.
This is a callback version of mutate/2
.
examples
Examples
Here is an example of a new column that sums the value of two other columns:
iex> df = Explorer.DataFrame.new(a: [4, 5, 6], b: [1, 2, 3])
iex> Explorer.DataFrame.mutate_with(df, &[c: Explorer.Series.add(&1["a"], &1["b"])])
#Explorer.DataFrame<
Polars[3 x 3]
a integer [4, 5, 6]
b integer [1, 2, 3]
c integer [5, 7, 9]
>
You can overwrite existing columns as well:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.mutate_with(df, &[b: Explorer.Series.pow(&1["b"], 2)])
#Explorer.DataFrame<
Polars[3 x 2]
a string ["a", "b", "c"]
b float [1.0, 4.0, 9.0]
>
It's possible to "reuse" a variable for different computations:
iex> df = Explorer.DataFrame.new(a: [4, 5, 6], b: [1, 2, 3])
iex> Explorer.DataFrame.mutate_with(df, fn ldf ->
iex> c = Explorer.Series.add(ldf["a"], ldf["b"])
iex> [c: c, d: Explorer.Series.window_sum(c, 2)]
iex> end)
#Explorer.DataFrame<
Polars[3 x 4]
a integer [4, 5, 6]
b integer [1, 2, 3]
c integer [5, 7, 9]
d integer [5, 12, 16]
>
grouped-examples
Grouped examples
Mutations in grouped dataframes takes the context of the group. For example, if we want to count how many elements of a given group, we can add a new column with that aggregation:
iex> df = Explorer.DataFrame.new(id: ["a", "a", "b"], b: [1, 2, 3])
iex> grouped = Explorer.DataFrame.group_by(df, :id)
iex> Explorer.DataFrame.mutate_with(grouped, &[count: Explorer.Series.count(&1["b"])])
#Explorer.DataFrame<
Polars[3 x 3]
Groups: ["id"]
id string ["a", "a", "b"]
b integer [1, 2, 3]
count integer [2, 2, 1]
>
Pivot data from wide to long.
pivot_longer/3
"lengthens" data, increasing the number of rows and
decreasing the number of columns. The inverse transformation is pivot_wider/4
.
The second argument, columns_to_pivot
, can be either list of column names to pivot
or a filter callback on the dataframe's names. These columns must always have the same
data type.
In case the dataframe is using groups, the groups that are also in the list of columns to pivot will be removed from the resultant dataframe. See the examples below.
options
Options
:select
- Columns that are not in the list of pivot and should be kept in the dataframe. May be a filter callback on the dataframe's column names. Defaults to all columns except the ones to pivot.:discard
- Columns that are not in the list of pivot and should be dropped from the dataframe. May be a filter callback on the dataframe's column names. This list of columns is going to be subtracted from the list ofselect
. Defaults to an empty list.:names_to
- A string specifying the name of the column to create from the data stored in the column names of the dataframe. Defaults to"variable"
.:values_to
- A string specifying the name of the column to create from the data stored in series element values. Defaults to"value"
.
examples
Examples
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.pivot_longer(df, &String.ends_with?(&1, "fuel"))
#Explorer.DataFrame<
Polars[3282 x 9]
year integer [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total integer [2308, 1254, 32500, 141, 7924, ...]
cement integer [5, 177, 2598, 0, 204, ...]
gas_flaring integer [0, 0, 2623, 0, 3697, ...]
per_capita float [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels integer [9, 7, 663, 0, 321, ...]
variable string ["solid_fuel", "solid_fuel", "solid_fuel", "solid_fuel", "solid_fuel", ...]
value integer [627, 117, 332, 0, 0, ...]
>
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.pivot_longer(df, &String.ends_with?(&1, "fuel"), select: ["year", "country"])
#Explorer.DataFrame<
Polars[3282 x 4]
year integer [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
variable string ["solid_fuel", "solid_fuel", "solid_fuel", "solid_fuel", "solid_fuel", ...]
value integer [627, 117, 332, 0, 0, ...]
>
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.pivot_longer(df, ["total"], select: ["year", "country"], discard: ["country"])
#Explorer.DataFrame<
Polars[1094 x 3]
year integer [2010, 2010, 2010, 2010, 2010, ...]
variable string ["total", "total", "total", "total", "total", ...]
value integer [2308, 1254, 32500, 141, 7924, ...]
>
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.pivot_longer(df, ["total"], select: [], names_to: "my_var", values_to: "my_value")
#Explorer.DataFrame<
Polars[1094 x 2]
my_var string ["total", "total", "total", "total", "total", ...]
my_value integer [2308, 1254, 32500, 141, 7924, ...]
>
grouped-examples
Grouped examples
In the following example we want to take the Iris dataset and increase the number of rows by pivoting the "sepal_length" column. This dataset is grouped by "species", so the resultant dataframe is going to keep the "species" group:
iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.pivot_longer(grouped, ["sepal_length"])
#Explorer.DataFrame<
Polars[150 x 6]
Groups: ["species"]
sepal_width float [3.5, 3.0, 3.2, 3.1, 3.6, ...]
petal_length float [1.4, 1.4, 1.3, 1.5, 1.4, ...]
petal_width float [0.2, 0.2, 0.2, 0.2, 0.2, ...]
species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", ...]
variable string ["sepal_length", "sepal_length", "sepal_length", "sepal_length", "sepal_length", ...]
value float [5.1, 4.9, 4.7, 4.6, 5.0, ...]
>
Now we want to do something different: we want to pivot the "species" column that is also a group. This is going to remove the group in the resultant dataframe:
iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.pivot_longer(grouped, ["species"])
#Explorer.DataFrame<
Polars[150 x 6]
sepal_length float [5.1, 4.9, 4.7, 4.6, 5.0, ...]
sepal_width float [3.5, 3.0, 3.2, 3.1, 3.6, ...]
petal_length float [1.4, 1.4, 1.3, 1.5, 1.4, ...]
petal_width float [0.2, 0.2, 0.2, 0.2, 0.2, ...]
variable string ["species", "species", "species", "species", "species", ...]
value string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", ...]
>
@spec pivot_wider( df :: t(), names_from :: column(), values_from :: column(), opts :: Keyword.t() ) :: t()
Pivot data from long to wide.
pivot_wider/4
"widens" data, increasing the number of columns and decreasing the number of rows.
The inverse transformation is pivot_longer/3
.
Due to a restriction upstream, values_from
must be a numeric type.
In case the dataframe is using groups, the groups that are also in the list of columns to pivot will be removed from the resultant dataframe. See the examples below.
options
Options
:id_columns
- A set of columns that uniquely identifies each observation. Defaults to all columns in data except for the columns specified innames_from
andvalues_from
. Typically used when you have redundant variables, i.e. variables whose values are perfectly correlated with existing variables. May accept a filter callback, a list or a range of column names. Default value is0..-1
. If an empty list is passed, or a range that results in a empty list of column names, it raises an error.ID columns cannot be of the float type and attempting so will raise an error. If you need to use float columns as IDs, you must carefully consider rounding or truncating the column and converting it to integer, as long as doing so preserves the properties of the column.
:names_prefix
- String added to the start of every variable name. This is particularly useful ifnames_from
is a numeric vector and you want to create syntactic variable names.
examples
Examples
Suppose we have a basketball court and multiple teams that want to train in that court. They need to share a schedule with the hours each team is going to use it. Here is a dataframe representing that schedule:
iex> Explorer.DataFrame.new(
iex> weekday: ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday"],
iex> team: ["A", "B", "C", "A", "B", "C", "A", "B", "C", "A"],
iex> hour: [10, 9, 10, 10, 11, 15, 14, 16, 14, 16]
iex> )
This dataframe is going to look like this - using table/2
:
+----------------------------------------------+
| Explorer DataFrame: [rows: 10, columns: 3] |
+---------------+--------------+---------------+
| weekday | team | hour |
| <string> | <string> | <integer> |
+===============+==============+===============+
| Monday | A | 10 |
+---------------+--------------+---------------+
| Tuesday | B | 9 |
+---------------+--------------+---------------+
| Wednesday | C | 10 |
+---------------+--------------+---------------+
| Thursday | A | 10 |
+---------------+--------------+---------------+
| Friday | B | 11 |
+---------------+--------------+---------------+
| Monday | C | 15 |
+---------------+--------------+---------------+
| Tuesday | A | 14 |
+---------------+--------------+---------------+
| Wednesday | B | 16 |
+---------------+--------------+---------------+
| Thursday | C | 14 |
+---------------+--------------+---------------+
| Friday | A | 16 |
+---------------+--------------+---------------+
You can see that the "weekday" repeats, and it's not clear how free the agenda is. We can solve that by pivoting the "weekday" column in multiple columns, making each weekday a new column in the resultant dataframe.
iex> df = Explorer.DataFrame.new(
iex> weekday: ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday"],
iex> team: ["A", "B", "C", "A", "B", "C", "A", "B", "C", "A"],
iex> hour: [10, 9, 10, 10, 11, 15, 14, 16, 14, 16]
iex> )
iex> Explorer.DataFrame.pivot_wider(df, "weekday", "hour")
#Explorer.DataFrame<
Polars[3 x 6]
team string ["A", "B", "C"]
Monday integer [10, nil, 15]
Tuesday integer [14, 9, nil]
Wednesday integer [nil, 16, 10]
Thursday integer [10, nil, 14]
Friday integer [16, 11, nil]
>
Now if we print that same dataframe with table/2
, we get a better picture of the schedule:
+----------------------------------------------------------------------+
| Explorer DataFrame: [rows: 3, columns: 6] |
+----------+-----------+-----------+-----------+-----------+-----------+
| team | Monday | Tuesday | Wednesday | Thursday | Friday |
| <string> | <integer> | <integer> | <integer> | <integer> | <integer> |
+==========+===========+===========+===========+===========+===========+
| A | 10 | 14 | | 10 | 16 |
+----------+-----------+-----------+-----------+-----------+-----------+
| B | | 9 | 16 | | 11 |
+----------+-----------+-----------+-----------+-----------+-----------+
| C | 15 | | 10 | 14 | |
+----------+-----------+-----------+-----------+-----------+-----------+
grouped-examples
Grouped examples
Now using the same idea, we can see that there is not much difference for grouped dataframes. The only detail is that groups related to the pivoting columns are going to be removed.
iex> df = Explorer.DataFrame.new(
iex> weekday: ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday"],
iex> team: ["A", "B", "C", "A", "B", "C", "A", "B", "C", "A"],
iex> hour: [10, 9, 10, 10, 11, 15, 14, 16, 14, 16]
iex> )
iex> grouped = Explorer.DataFrame.group_by(df, "team")
iex> Explorer.DataFrame.pivot_wider(grouped, "weekday", "hour")
#Explorer.DataFrame<
Polars[3 x 6]
Groups: ["team"]
team string ["A", "B", "C"]
Monday integer [10, nil, 15]
Tuesday integer [14, 9, nil]
Wednesday integer [nil, 16, 10]
Thursday integer [10, nil, 14]
Friday integer [16, 11, nil]
>
In the following example the group "weekday" is going to be removed, because the column is going to be pivoted in multiple columns:
iex> df = Explorer.DataFrame.new(
iex> weekday: ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday"],
iex> team: ["A", "B", "C", "A", "B", "C", "A", "B", "C", "A"],
iex> hour: [10, 9, 10, 10, 11, 15, 14, 16, 14, 16]
iex> )
iex> grouped = Explorer.DataFrame.group_by(df, "weekday")
iex> Explorer.DataFrame.pivot_wider(grouped, "weekday", "hour")
#Explorer.DataFrame<
Polars[3 x 6]
team string ["A", "B", "C"]
Monday integer [10, nil, 15]
Tuesday integer [14, 9, nil]
Wednesday integer [nil, 16, 10]
Thursday integer [10, nil, 14]
Friday integer [16, 11, nil]
>
@spec pull(df :: t(), column :: column()) :: Explorer.Series.t()
Extracts a single column as a series.
This function is not going to consider groups when pulling series.
examples
Examples
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.pull(df, "total")
#Explorer.Series<
Polars[1094]
integer [2308, 1254, 32500, 141, 7924, 41, 143, 51246, 1150, 684, 106589, 18408, 8366, 451, 7981, 16345, 403, 17192, 30222, 147, 1388, 166, 133, 5802, 1278, 114468, 47, 2237, 12030, 535, 58, 1367, 145806, 152, 152, 72, 141, 19703, 2393248, 20773, 44, 540, 19, 2064, 1900, 5501, 10465, 2102, 30428, 18122, ...]
>
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.pull(df, 2)
#Explorer.Series<
Polars[1094]
integer [2308, 1254, 32500, 141, 7924, 41, 143, 51246, 1150, 684, 106589, 18408, 8366, 451, 7981, 16345, 403, 17192, 30222, 147, 1388, 166, 133, 5802, 1278, 114468, 47, 2237, 12030, 535, 58, 1367, 145806, 152, 152, 72, 141, 19703, 2393248, 20773, 44, 540, 19, 2064, 1900, 5501, 10465, 2102, 30428, 18122, ...]
>
Creates or modifies a single column.
This is a simplified way to add or modify one column, accepting a series or a tensor.
If you are computing a series, it is preferrable to use
mutate/2
or mutate_with/2
to compute the series and
modify it in a single step, as it is more powerful and
it handles both expressions and scalar values accordingly.
If you are passing tensors, the tensor will be automatically
converted to a series using Explorer.Series.from_tensor/2
.
The :dtype
option can be given to customize the resulting
type of the series, otherwise it will fallback to the dtype
of a column with the same name (if it exists) or it will be
inflected from the tensor.
examples
Examples
iex> df = Explorer.DataFrame.new(a: [1, 2, 3])
iex> Explorer.DataFrame.put(df, :b, Explorer.Series.transform(df[:a], fn n -> n * 2 end))
#Explorer.DataFrame<
Polars[3 x 2]
a integer [1, 2, 3]
b integer [2, 4, 6]
>
iex> df = Explorer.DataFrame.new(a: [1, 2, 3])
iex> Explorer.DataFrame.put(df, :b, Explorer.Series.from_list([4, 5, 6]))
#Explorer.DataFrame<
Polars[3 x 2]
a integer [1, 2, 3]
b integer [4, 5, 6]
>
grouped-examples
Grouped examples
If the dataframe is grouped, put/3
is going to ignore the groups.
So the series must be of the same size of the entire dataframe.
iex> df = Explorer.DataFrame.new(a: [1, 2, 3])
iex> grouped = Explorer.DataFrame.group_by(df, "a")
iex> series = Explorer.Series.from_list([9, 8, 7])
iex> Explorer.DataFrame.put(grouped, :b, series)
#Explorer.DataFrame<
Polars[3 x 2]
Groups: ["a"]
a integer [1, 2, 3]
b integer [9, 8, 7]
>
tensor-examples
Tensor examples
You can also put tensors into the dataframe:
iex> df = DF.new([])
iex> DF.put(df, :a, Nx.tensor([1, 2, 3]))
#Explorer.DataFrame<
Polars[3 x 1]
a integer [1, 2, 3]
>
You can specify which dtype the tensor represents. For example, a tensor of s64 represents integers by default, but it may also represent timestamps in microseconds from the Unix epoch:
iex> df = DF.new([])
iex> DF.put(df, :a, Nx.tensor([1, 2, 3]), dtype: :datetime)
#Explorer.DataFrame<
Polars[3 x 1]
a datetime [1970-01-01 00:00:00.000001, 1970-01-01 00:00:00.000002, 1970-01-01 00:00:00.000003]
>
Tensors are automatically broadcast too:
iex> df = DF.new(a: [1, 2, 3])
iex> DF.put(df, :b, Nx.tensor(0))
#Explorer.DataFrame<
Polars[3 x 2]
a integer [1, 2, 3]
b integer [0, 0, 0]
>
If there is already a column where we want to place the tensor, the column dtype will be automatically used, this means that updating dataframes in place while preserving their types is straight-forward:
iex> df = DF.new(a: [~N[1970-01-01 00:00:00]])
iex> DF.put(df, :a, Nx.tensor(529550625987654))
#Explorer.DataFrame<
Polars[1 x 1]
a datetime [1986-10-13 01:23:45.987654]
>
On the other hand, if you try to put a floating tensor on
an integer column, an error will be raised unless a dtype
or dtype: :from_tensor
is given:
iex> df = DF.new(a: [1, 2, 3])
iex> DF.put(df, :a, Nx.tensor(1.0, type: :f64))
** (ArgumentError) dtype integer expects a tensor of type {:s, 64} but got type {:f, 64}
iex> df = DF.new(a: [1, 2, 3])
iex> DF.put(df, :a, Nx.tensor(1.0, type: :f64), dtype: :float)
#Explorer.DataFrame<
Polars[3 x 1]
a float [1.0, 1.0, 1.0]
>
iex> df = DF.new(a: [1, 2, 3])
iex> DF.put(df, :a, Nx.tensor(1.0, type: :f64), dtype: :from_tensor)
#Explorer.DataFrame<
Polars[3 x 1]
a float [1.0, 1.0, 1.0]
>
@spec rename( df :: t(), names :: column_names() | column_pairs(column_name()) ) :: t()
Renames columns.
Renaming a column that is also a group is going to rename the group as well.
To apply a function to a subset of columns, see rename_with/3
.
examples
Examples
You can pass in a list of new names:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "a"], b: [1, 3, 1])
iex> Explorer.DataFrame.rename(df, ["c", "d"])
#Explorer.DataFrame<
Polars[3 x 2]
c string ["a", "b", "a"]
d integer [1, 3, 1]
>
Or you can rename individual columns using keyword args:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "a"], b: [1, 3, 1])
iex> Explorer.DataFrame.rename(df, a: "first")
#Explorer.DataFrame<
Polars[3 x 2]
first string ["a", "b", "a"]
b integer [1, 3, 1]
>
Or you can rename individual columns using a map:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "a"], b: [1, 3, 1])
iex> Explorer.DataFrame.rename(df, %{"a" => "first"})
#Explorer.DataFrame<
Polars[3 x 2]
first string ["a", "b", "a"]
b integer [1, 3, 1]
>
Renames columns with a function.
Renaming a column that is also a group is going to rename the group as well.
examples
Examples
If no columns are specified, it will apply the function to all column names:
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.rename_with(df, &String.upcase/1)
#Explorer.DataFrame<
Polars[1094 x 10]
YEAR integer [2010, 2010, 2010, 2010, 2010, ...]
COUNTRY string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
TOTAL integer [2308, 1254, 32500, 141, 7924, ...]
SOLID_FUEL integer [627, 117, 332, 0, 0, ...]
LIQUID_FUEL integer [1601, 953, 12381, 141, 3649, ...]
GAS_FUEL integer [74, 7, 14565, 0, 374, ...]
CEMENT integer [5, 177, 2598, 0, 204, ...]
GAS_FLARING integer [0, 0, 2623, 0, 3697, ...]
PER_CAPITA float [0.08, 0.43, 0.9, 1.68, 0.37, ...]
BUNKER_FUELS integer [9, 7, 663, 0, 321, ...]
>
A callback can be used to filter the column names that will be renamed, similarly to select/2
:
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.rename_with(df, &String.ends_with?(&1, "_fuel"), &String.trim_trailing(&1, "_fuel"))
#Explorer.DataFrame<
Polars[1094 x 10]
year integer [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total integer [2308, 1254, 32500, 141, 7924, ...]
solid integer [627, 117, 332, 0, 0, ...]
liquid integer [1601, 953, 12381, 141, 3649, ...]
gas integer [74, 7, 14565, 0, 374, ...]
cement integer [5, 177, 2598, 0, 204, ...]
gas_flaring integer [0, 0, 2623, 0, 3697, ...]
per_capita float [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels integer [9, 7, 663, 0, 321, ...]
>
Or you can just pass in the list of column names you'd like to apply the function to:
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.rename_with(df, ["total", "cement"], &String.upcase/1)
#Explorer.DataFrame<
Polars[1094 x 10]
year integer [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
TOTAL integer [2308, 1254, 32500, 141, 7924, ...]
solid_fuel integer [627, 117, 332, 0, 0, ...]
liquid_fuel integer [1601, 953, 12381, 141, 3649, ...]
gas_fuel integer [74, 7, 14565, 0, 374, ...]
CEMENT integer [5, 177, 2598, 0, 204, ...]
gas_flaring integer [0, 0, 2623, 0, 3697, ...]
per_capita float [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels integer [9, 7, 663, 0, 321, ...]
>
Selects a subset of columns by name.
It's important to notice that groups are kept: you can't select off grouping columns.
examples
Examples
You can select a single column:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.select(df, "a")
#Explorer.DataFrame<
Polars[3 x 1]
a string ["a", "b", "c"]
>
Or a list of names:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.select(df, ["a"])
#Explorer.DataFrame<
Polars[3 x 1]
a string ["a", "b", "c"]
>
You can also use a range or a list of integers:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3], c: [4, 5, 6])
iex> Explorer.DataFrame.select(df, [0, 1])
#Explorer.DataFrame<
Polars[3 x 2]
a string ["a", "b", "c"]
b integer [1, 2, 3]
>
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3], c: [4, 5, 6])
iex> Explorer.DataFrame.select(df, 0..1)
#Explorer.DataFrame<
Polars[3 x 2]
a string ["a", "b", "c"]
b integer [1, 2, 3]
>
Or you can use a callback function that takes the dataframe's names as its first argument:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.select(df, &String.starts_with?(&1, "b"))
#Explorer.DataFrame<
Polars[3 x 1]
b integer [1, 2, 3]
>
Or a callback function that takes names and types:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.select(df, fn _name, type -> type == :integer end)
#Explorer.DataFrame<
Polars[3 x 1]
b integer [1, 2, 3]
>
grouped-examples
Grouped examples
Columns that are also groups cannot be removed, you need to ungroup before removing these columns.
iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.select(grouped, ["sepal_width"])
#Explorer.DataFrame<
Polars[150 x 2]
Groups: ["species"]
sepal_width float [3.5, 3.0, 3.2, 3.1, 3.6, ...]
species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", ...]
>
Summarise each group to a single row using Explorer.Query
.
To summarize, you must perform aggregation, defined in Explorer.Series
,
on the desired columns. The query is compiled and runs efficiently
against the dataframe. This function performs aggregations based on groups,
so at least one group is expected, and the query must contain at least one
aggregation. It implicitly ungroups the resultant dataframe.
Notice
This is macro, therefore you must
require Explorer.DataFrame
before using it.
See summarise_with/2
for a callback version of this function without
Explorer.Query
.
examples
Examples
iex> df = Explorer.Datasets.fossil_fuels()
iex> grouped_df = Explorer.DataFrame.group_by(df, "year")
iex> Explorer.DataFrame.summarise(grouped_df, total_max: max(total), total_min: min(total))
#Explorer.DataFrame<
Polars[5 x 3]
year integer [2010, 2011, 2012, 2013, 2014]
total_max integer [2393248, 2654360, 2734817, 2797384, 2806634]
total_min integer [1, 2, 2, 2, 3]
>
Suppose you want to get the mean petal length of each Iris species. You could do something like this:
iex> df = Explorer.Datasets.iris()
iex> grouped_df = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.summarise(grouped_df, mean_petal_length: mean(petal_length))
#Explorer.DataFrame<
Polars[3 x 2]
species string ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
mean_petal_length float [1.464, 4.26, 5.552]
>
@spec summarise_with( df :: t(), callback :: (Explorer.Backend.LazyFrame.t() -> column_pairs(Explorer.Series.lazy_t())) ) :: t()
Summarise each group to a single row using a callback function.
This is a callback version of summarise/2
.
examples
Examples
iex> alias Explorer.{DataFrame, Series}
iex> df = Explorer.Datasets.fossil_fuels() |> DataFrame.group_by("year")
iex> DataFrame.summarise_with(df, &[total_max: Series.max(&1["total"]), countries: Series.n_distinct(&1["country"])])
#Explorer.DataFrame<
Polars[5 x 3]
year integer [2010, 2011, 2012, 2013, 2014]
total_max integer [2393248, 2654360, 2734817, 2797384, 2806634]
countries integer [217, 217, 220, 220, 220]
>
@spec ungroup(df :: t(), groups_or_group :: column_names() | column_name() | :all) :: t()
Removes grouping variables.
Accepts a list of group names. If groups is not specified, then all groups are removed.
examples
Examples
iex> df = Explorer.Datasets.fossil_fuels()
iex> df = Explorer.DataFrame.group_by(df, ["country", "year"])
iex> Explorer.DataFrame.ungroup(df, ["country"])
#Explorer.DataFrame<
Polars[1094 x 10]
Groups: ["year"]
year integer [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total integer [2308, 1254, 32500, 141, 7924, ...]
solid_fuel integer [627, 117, 332, 0, 0, ...]
liquid_fuel integer [1601, 953, 12381, 141, 3649, ...]
gas_fuel integer [74, 7, 14565, 0, 374, ...]
cement integer [5, 177, 2598, 0, 204, ...]
gas_flaring integer [0, 0, 2623, 0, 3697, ...]
per_capita float [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels integer [9, 7, 663, 0, 321, ...]
>
iex> df = Explorer.Datasets.fossil_fuels()
iex> df = Explorer.DataFrame.group_by(df, ["country", "year"])
iex> Explorer.DataFrame.ungroup(df)
#Explorer.DataFrame<
Polars[1094 x 10]
year integer [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total integer [2308, 1254, 32500, 141, 7924, ...]
solid_fuel integer [627, 117, 332, 0, 0, ...]
liquid_fuel integer [1601, 953, 12381, 141, 3649, ...]
gas_fuel integer [74, 7, 14565, 0, 374, ...]
cement integer [5, 177, 2598, 0, 204, ...]
gas_flaring integer [0, 0, 2623, 0, 3697, ...]
per_capita float [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels integer [9, 7, 663, 0, 321, ...]
>
Link to this section Functions: Multi-table
Combine two or more dataframes column-wise.
Dataframes must have the same number of rows.
When working with grouped dataframes, be aware that only groups from the first dataframe are kept in the resultant dataframe.
examples
Examples
iex> df1 = Explorer.DataFrame.new(x: [1, 2, 3], y: ["a", "b", "c"])
iex> df2 = Explorer.DataFrame.new(z: [4, 5, 6], a: ["d", "e", "f"])
iex> Explorer.DataFrame.concat_columns([df1, df2])
#Explorer.DataFrame<
Polars[3 x 4]
x integer [1, 2, 3]
y string ["a", "b", "c"]
z integer [4, 5, 6]
a string ["d", "e", "f"]
>
Conflicting names are suffixed with the index of the dataframe in the array:
iex> df1 = Explorer.DataFrame.new(x: [1, 2, 3], y: ["a", "b", "c"])
iex> df2 = Explorer.DataFrame.new(x: [4, 5, 6], a: ["d", "e", "f"])
iex> Explorer.DataFrame.concat_columns([df1, df2])
#Explorer.DataFrame<
Polars[3 x 4]
x integer [1, 2, 3]
y string ["a", "b", "c"]
x_1 integer [4, 5, 6]
a string ["d", "e", "f"]
>
Combine two dataframes column-wise.
When working with grouped dataframes, be aware that only groups from the left-hand side dataframe are kept in the resultant dataframe.
concat_columns(df1, df2)
is equivalent to concat_columns([df1, df2])
.
Combine two or more dataframes row-wise (stack).
Column names and dtypes must match. The only exception is for numeric columns that can be mixed together, and casted automatically to float columns.
When working with grouped dataframes, be aware that only groups from the first dataframe are kept in the resultant dataframe.
examples
Examples
iex> df1 = Explorer.DataFrame.new(x: [1, 2, 3], y: ["a", "b", "c"])
iex> df2 = Explorer.DataFrame.new(x: [4, 5, 6], y: ["d", "e", "f"])
iex> Explorer.DataFrame.concat_rows([df1, df2])
#Explorer.DataFrame<
Polars[6 x 2]
x integer [1, 2, 3, 4, 5, ...]
y string ["a", "b", "c", "d", "e", ...]
>
iex> df1 = Explorer.DataFrame.new(x: [1, 2, 3], y: ["a", "b", "c"])
iex> df2 = Explorer.DataFrame.new(x: [4.2, 5.3, 6.4], y: ["d", "e", "f"])
iex> Explorer.DataFrame.concat_rows([df1, df2])
#Explorer.DataFrame<
Polars[6 x 2]
x float [1.0, 2.0, 3.0, 4.2, 5.3, ...]
y string ["a", "b", "c", "d", "e", ...]
>
Combine two dataframes row-wise.
concat_rows(df1, df2)
is equivalent to concat_rows([df1, df2])
.
When working with grouped dataframes, be aware that only groups from the left-hand side dataframe are kept in the resultant dataframe.
Join two tables.
join-types
Join types
:inner
- Returns all rows fromleft
where there are matching values inright
, and all columns fromleft
andright
.:left
- Returns all rows fromleft
and all columns fromleft
andright
. Rows inleft
with no match inright
will havenil
values in the new columns.:right
- Returns all rows fromright
and all columns fromleft
andright
. Rows inright
with no match inleft
will havenil
values in the new columns.:outer
- Returns all rows and all columns from bothleft
andright
. Where there are not matching values, returnsnil
for the one missing.:cross
- Also known as a cartesian join. Returns all combinations ofleft
andright
. Can be very computationally expensive.
options
Options
:on
- The columns to join on. Defaults to overlapping columns. Does not apply to cross join.:how
- One of the join types (as an atom) described above. Defaults to:inner
.
examples
Examples
Inner join:
iex> left = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> right = Explorer.DataFrame.new(a: [1, 2, 2], c: ["d", "e", "f"])
iex> Explorer.DataFrame.join(left, right)
#Explorer.DataFrame<
Polars[3 x 3]
a integer [1, 2, 2]
b string ["a", "b", "b"]
c string ["d", "e", "f"]
>
Left join:
iex> left = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> right = Explorer.DataFrame.new(a: [1, 2, 2], c: ["d", "e", "f"])
iex> Explorer.DataFrame.join(left, right, how: :left)
#Explorer.DataFrame<
Polars[4 x 3]
a integer [1, 2, 2, 3]
b string ["a", "b", "b", "c"]
c string ["d", "e", "f", nil]
>
Right join:
iex> left = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> right = Explorer.DataFrame.new(a: [1, 2, 4], c: ["d", "e", "f"])
iex> Explorer.DataFrame.join(left, right, how: :right)
#Explorer.DataFrame<
Polars[3 x 3]
a integer [1, 2, 4]
c string ["d", "e", "f"]
b string ["a", "b", nil]
>
Outer join:
iex> left = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> right = Explorer.DataFrame.new(a: [1, 2, 4], c: ["d", "e", "f"])
iex> Explorer.DataFrame.join(left, right, how: :outer)
#Explorer.DataFrame<
Polars[4 x 3]
a integer [1, 2, 4, 3]
b string ["a", "b", nil, "c"]
c string ["d", "e", "f", nil]
>
Cross join:
iex> left = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> right = Explorer.DataFrame.new(a: [1, 2, 4], c: ["d", "e", "f"])
iex> Explorer.DataFrame.join(left, right, how: :cross)
#Explorer.DataFrame<
Polars[9 x 4]
a integer [1, 1, 1, 2, 2, ...]
b string ["a", "a", "a", "b", "b", ...]
a_right integer [1, 2, 4, 1, 2, ...]
c string ["d", "e", "f", "d", "e", ...]
>
Inner join with different names:
iex> left = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> right = Explorer.DataFrame.new(d: [1, 2, 2], c: ["d", "e", "f"])
iex> Explorer.DataFrame.join(left, right, on: [{"a", "d"}])
#Explorer.DataFrame<
Polars[3 x 3]
a integer [1, 2, 2]
b string ["a", "b", "b"]
c string ["d", "e", "f"]
>
grouped-examples
Grouped examples
When doing a join operation with grouped dataframes, the joined dataframe may keep the groups from only one side.
An inner join operation will keep the groups from the left-hand side dataframe:
iex> left = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> right = Explorer.DataFrame.new(a: [1, 2, 2], c: ["d", "e", "f"])
iex> grouped_left = Explorer.DataFrame.group_by(left, "b")
iex> grouped_right = Explorer.DataFrame.group_by(right, "c")
iex> Explorer.DataFrame.join(grouped_left, grouped_right)
#Explorer.DataFrame<
Polars[3 x 3]
Groups: ["b"]
a integer [1, 2, 2]
b string ["a", "b", "b"]
c string ["d", "e", "f"]
>
A left join operation will keep the groups from the left-hand side dataframe:
iex> left = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> right = Explorer.DataFrame.new(a: [1, 2, 2], c: ["d", "e", "f"])
iex> grouped_left = Explorer.DataFrame.group_by(left, "b")
iex> grouped_right = Explorer.DataFrame.group_by(right, "c")
iex> Explorer.DataFrame.join(grouped_left, grouped_right, how: :left)
#Explorer.DataFrame<
Polars[4 x 3]
Groups: ["b"]
a integer [1, 2, 2, 3]
b string ["a", "b", "b", "c"]
c string ["d", "e", "f", nil]
>
A right join operation will keep the groups from the right-hand side dataframe:
iex> left = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> right = Explorer.DataFrame.new(a: [1, 2, 4], c: ["d", "e", "f"])
iex> grouped_left = Explorer.DataFrame.group_by(left, "b")
iex> grouped_right = Explorer.DataFrame.group_by(right, "c")
iex> Explorer.DataFrame.join(grouped_left, grouped_right, how: :right)
#Explorer.DataFrame<
Polars[3 x 3]
Groups: ["c"]
a integer [1, 2, 4]
c string ["d", "e", "f"]
b string ["a", "b", nil]
>
An outer join operation is going to keep the groups from the left-hand side dataframe:
iex> left = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> right = Explorer.DataFrame.new(a: [1, 2, 4], c: ["d", "e", "f"])
iex> grouped_left = Explorer.DataFrame.group_by(left, "b")
iex> grouped_right = Explorer.DataFrame.group_by(right, "c")
iex> Explorer.DataFrame.join(grouped_left, grouped_right, how: :outer)
#Explorer.DataFrame<
Polars[4 x 3]
Groups: ["b"]
a integer [1, 2, 4, 3]
b string ["a", "b", nil, "c"]
c string ["d", "e", "f", nil]
>
A cross join operation is going to keep the groups from the left-hand side dataframe:
iex> left = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> right = Explorer.DataFrame.new(a: [1, 2, 4], c: ["d", "e", "f"])
iex> grouped_left = Explorer.DataFrame.group_by(left, "b")
iex> grouped_right = Explorer.DataFrame.group_by(right, "c")
iex> Explorer.DataFrame.join(grouped_left, grouped_right, how: :cross)
#Explorer.DataFrame<
Polars[9 x 4]
Groups: ["b"]
a integer [1, 1, 1, 2, 2, ...]
b string ["a", "a", "a", "b", "b", ...]
a_right integer [1, 2, 4, 1, 2, ...]
c string ["d", "e", "f", "d", "e", ...]
>
Link to this section Functions: Row-based
Returns the first n rows of the dataframe.
By default it returns the first 5 rows.
If the dataframe is using groups, then the first n rows of each group is returned.
examples
Examples
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.head(df)
#Explorer.DataFrame<
Polars[5 x 10]
year integer [2010, 2010, 2010, 2010, 2010]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA"]
total integer [2308, 1254, 32500, 141, 7924]
solid_fuel integer [627, 117, 332, 0, 0]
liquid_fuel integer [1601, 953, 12381, 141, 3649]
gas_fuel integer [74, 7, 14565, 0, 374]
cement integer [5, 177, 2598, 0, 204]
gas_flaring integer [0, 0, 2623, 0, 3697]
per_capita float [0.08, 0.43, 0.9, 1.68, 0.37]
bunker_fuels integer [9, 7, 663, 0, 321]
>
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.head(df, 2)
#Explorer.DataFrame<
Polars[2 x 10]
year integer [2010, 2010]
country string ["AFGHANISTAN", "ALBANIA"]
total integer [2308, 1254]
solid_fuel integer [627, 117]
liquid_fuel integer [1601, 953]
gas_fuel integer [74, 7]
cement integer [5, 177]
gas_flaring integer [0, 0]
per_capita float [0.08, 0.43]
bunker_fuels integer [9, 7]
>
grouped-examples
Grouped examples
Using grouped dataframes makes head/2
return n rows from each group.
Here is an example using the Iris dataset, and returning two rows from each group:
iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.head(grouped, 2)
#Explorer.DataFrame<
Polars[6 x 5]
Groups: ["species"]
sepal_length float [5.1, 4.9, 7.0, 6.4, 6.3, ...]
sepal_width float [3.5, 3.0, 3.2, 3.2, 3.3, ...]
petal_length float [1.4, 1.4, 4.7, 4.5, 6.0, ...]
petal_width float [0.2, 0.2, 1.4, 1.5, 2.5, ...]
species string ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-versicolor", "Iris-virginica", ...]
>
Sample rows from a dataframe.
If given an integer as the second argument, it will return N samples. If given a float, it will return that proportion of the series.
Can sample with or without replacement.
For grouped dataframes, sample will take into account the rows of each group, meaning that if you try to get N samples and you have G groups, you will get N * G rows. See the examples below.
options
Options
:replacement
- If set totrue
, each sample will be independent and therefore values may repeat. Required to betrue
forn
greater then the number of rows in the dataframe orfrac
> 1.0. (default:false
):seed
- An integer to be used as a random seed. If nil, a random value between 1 and 1e12 will be used. (default: nil)
examples
Examples
You can sample N rows:
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.sample(df, 3, seed: 100)
#Explorer.DataFrame<
Polars[3 x 10]
year integer [2011, 2012, 2011]
country string ["SERBIA", "FALKLAND ISLANDS (MALVINAS)", "SWAZILAND"]
total integer [13422, 15, 286]
solid_fuel integer [9355, 3, 102]
liquid_fuel integer [2537, 12, 184]
gas_fuel integer [1188, 0, 0]
cement integer [342, 0, 0]
gas_flaring integer [0, 0, 0]
per_capita float [1.49, 5.21, 0.24]
bunker_fuels integer [39, 0, 1]
>
Or you can sample a proportion of rows:
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.sample(df, 0.03, seed: 100)
#Explorer.DataFrame<
Polars[32 x 10]
year integer [2011, 2012, 2012, 2013, 2010, ...]
country string ["URUGUAY", "FRENCH POLYNESIA", "ICELAND", "PERU", "TUNISIA", ...]
total integer [2117, 222, 491, 15586, 7543, ...]
solid_fuel integer [1, 0, 96, 784, 15, ...]
liquid_fuel integer [1943, 222, 395, 7097, 3138, ...]
gas_fuel integer [40, 0, 0, 3238, 3176, ...]
cement integer [132, 0, 0, 1432, 1098, ...]
gas_flaring integer [0, 0, 0, 3036, 116, ...]
per_capita float [0.63, 0.81, 1.52, 0.51, 0.71, ...]
bunker_fuels integer [401, 45, 170, 617, 219, ...]
>
grouped-examples
Grouped examples
In the following example we have the Iris dataset grouped by species, and we want to take a sample of two plants from each group. Since we have three species, the resultant dataframe is going to have six rows (2 * 3).
iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.sample(grouped, 2, seed: 100)
#Explorer.DataFrame<
Polars[6 x 5]
Groups: ["species"]
sepal_length float [5.3, 5.1, 5.1, 5.6, 6.2, ...]
sepal_width float [3.7, 3.8, 2.5, 2.7, 3.4, ...]
petal_length float [1.5, 1.9, 3.0, 4.2, 5.4, ...]
petal_width float [0.2, 0.4, 1.1, 1.3, 2.3, ...]
species string ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-versicolor", "Iris-virginica", ...]
>
The behaviour is similar when you want to take a fraction of the rows from each group. The main difference is that each group can have more or less rows, depending on its size.
iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.sample(grouped, 0.1, seed: 100)
#Explorer.DataFrame<
Polars[15 x 5]
Groups: ["species"]
sepal_length float [5.3, 5.1, 4.7, 5.7, 5.1, ...]
sepal_width float [3.7, 3.8, 3.2, 3.8, 3.5, ...]
petal_length float [1.5, 1.9, 1.3, 1.7, 1.4, ...]
petal_width float [0.2, 0.4, 0.2, 0.3, 0.3, ...]
species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", ...]
>
Slices rows at the given indices as a new dataframe.
The indices may be either a list of indices or a range. A list of indices does not support negative numbers. Ranges may be negative on either end, which are then normalized. Note ranges in Elixir are inclusive.
Slice works differently when a dataframe is grouped. It is going to consider the indices of each group instead of the entire dataframe. See the examples below.
examples
Examples
iex> df = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> Explorer.DataFrame.slice(df, [0, 2])
#Explorer.DataFrame<
Polars[2 x 2]
a integer [1, 3]
b string ["a", "c"]
>
With a range:
iex> df = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> Explorer.DataFrame.slice(df, 1..2)
#Explorer.DataFrame<
Polars[2 x 2]
a integer [2, 3]
b string ["b", "c"]
>
With a range with negative first and last:
iex> df = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> Explorer.DataFrame.slice(df, -2..-1)
#Explorer.DataFrame<
Polars[2 x 2]
a integer [2, 3]
b string ["b", "c"]
>
grouped-examples
Grouped examples
We are going to once again use the Iris dataset. In this example we want to take elements at indexes 0 and 2:
iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.slice(grouped, [0, 2])
#Explorer.DataFrame<
Polars[6 x 5]
Groups: ["species"]
sepal_length float [5.1, 4.7, 7.0, 6.9, 6.3, ...]
sepal_width float [3.5, 3.2, 3.2, 3.1, 3.3, ...]
petal_length float [1.4, 1.3, 4.7, 4.9, 6.0, ...]
petal_width float [0.2, 0.2, 1.4, 1.5, 2.5, ...]
species string ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-versicolor", "Iris-virginica", ...]
>
Now we want to take the first 3 rows of each group.
This is going to work with the range 0..2
:
iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.slice(grouped, 0..2)
#Explorer.DataFrame<
Polars[9 x 5]
Groups: ["species"]
sepal_length float [5.1, 4.9, 4.7, 7.0, 6.4, ...]
sepal_width float [3.5, 3.0, 3.2, 3.2, 3.2, ...]
petal_length float [1.4, 1.4, 1.3, 4.7, 4.5, ...]
petal_width float [0.2, 0.2, 0.2, 1.4, 1.5, ...]
species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-versicolor", ...]
>
Subset a continuous set of rows.
examples
Examples
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.slice(df, 1, 2)
#Explorer.DataFrame<
Polars[2 x 10]
year integer [2010, 2010]
country string ["ALBANIA", "ALGERIA"]
total integer [1254, 32500]
solid_fuel integer [117, 332]
liquid_fuel integer [953, 12381]
gas_fuel integer [7, 14565]
cement integer [177, 2598]
gas_flaring integer [0, 2623]
per_capita float [0.43, 0.9]
bunker_fuels integer [7, 663]
>
Negative offsets count from the end of the series:
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.slice(df, -10, 2)
#Explorer.DataFrame<
Polars[2 x 10]
year integer [2014, 2014]
country string ["UNITED STATES OF AMERICA", "URUGUAY"]
total integer [1432855, 1840]
solid_fuel integer [450047, 2]
liquid_fuel integer [576531, 1700]
gas_fuel integer [390719, 25]
cement integer [11314, 112]
gas_flaring integer [4244, 0]
per_capita float [4.43, 0.54]
bunker_fuels integer [30722, 251]
>
If the length would run past the end of the dataframe, the result may be shorter than the length:
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.slice(df, -10, 20)
#Explorer.DataFrame<
Polars[10 x 10]
year integer [2014, 2014, 2014, 2014, 2014, ...]
country string ["UNITED STATES OF AMERICA", "URUGUAY", "UZBEKISTAN", "VANUATU", "VENEZUELA", ...]
total integer [1432855, 1840, 28692, 42, 50510, ...]
solid_fuel integer [450047, 2, 1677, 0, 204, ...]
liquid_fuel integer [576531, 1700, 2086, 42, 28445, ...]
gas_fuel integer [390719, 25, 23929, 0, 12731, ...]
cement integer [11314, 112, 1000, 0, 1088, ...]
gas_flaring integer [4244, 0, 0, 0, 8042, ...]
per_capita float [4.43, 0.54, 0.97, 0.16, 1.65, ...]
bunker_fuels integer [30722, 251, 0, 10, 1256, ...]
>
grouped-examples
Grouped examples
We want to take the first 3 rows of each group. We need the offset 0 and the length 3:
iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.slice(grouped, 0, 3)
#Explorer.DataFrame<
Polars[9 x 5]
Groups: ["species"]
sepal_length float [5.1, 4.9, 4.7, 7.0, 6.4, ...]
sepal_width float [3.5, 3.0, 3.2, 3.2, 3.2, ...]
petal_length float [1.4, 1.4, 1.3, 4.7, 4.5, ...]
petal_width float [0.2, 0.2, 0.2, 1.4, 1.5, ...]
species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-versicolor", ...]
>
We can also pass a negative offset:
iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.slice(grouped, -6, 3)
#Explorer.DataFrame<
Polars[9 x 5]
Groups: ["species"]
sepal_length float [5.1, 4.8, 5.1, 5.6, 5.7, ...]
sepal_width float [3.8, 3.0, 3.8, 2.7, 3.0, ...]
petal_length float [1.9, 1.4, 1.6, 4.2, 4.2, ...]
petal_width float [0.4, 0.3, 0.2, 1.3, 1.2, ...]
species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-versicolor", ...]
>
Returns the last n rows of the dataframe.
By default it returns the last 5 rows.
If the dataframe is using groups, then the last n rows of each group is returned.
examples
Examples
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.tail(df)
#Explorer.DataFrame<
Polars[5 x 10]
year integer [2014, 2014, 2014, 2014, 2014]
country string ["VIET NAM", "WALLIS AND FUTUNA ISLANDS", "YEMEN", "ZAMBIA", "ZIMBABWE"]
total integer [45517, 6, 6190, 1228, 3278]
solid_fuel integer [19246, 0, 137, 132, 2097]
liquid_fuel integer [12694, 6, 5090, 797, 1005]
gas_fuel integer [5349, 0, 581, 0, 0]
cement integer [8229, 0, 381, 299, 177]
gas_flaring integer [0, 0, 0, 0, 0]
per_capita float [0.49, 0.44, 0.24, 0.08, 0.22]
bunker_fuels integer [761, 1, 153, 33, 9]
>
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.tail(df, 2)
#Explorer.DataFrame<
Polars[2 x 10]
year integer [2014, 2014]
country string ["ZAMBIA", "ZIMBABWE"]
total integer [1228, 3278]
solid_fuel integer [132, 2097]
liquid_fuel integer [797, 1005]
gas_fuel integer [0, 0]
cement integer [299, 177]
gas_flaring integer [0, 0]
per_capita float [0.08, 0.22]
bunker_fuels integer [33, 9]
>
grouped-examples
Grouped examples
Using grouped dataframes makes tail/2
return n rows from each group.
Here is an example using the Iris dataset, and returning two rows from each group:
iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.tail(grouped, 2)
#Explorer.DataFrame<
Polars[6 x 5]
Groups: ["species"]
sepal_length float [5.3, 5.0, 5.1, 5.7, 6.2, ...]
sepal_width float [3.7, 3.3, 2.5, 2.8, 3.4, ...]
petal_length float [1.5, 1.4, 3.0, 4.1, 5.4, ...]
petal_width float [0.2, 0.2, 1.1, 1.3, 2.3, ...]
species string ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-versicolor", "Iris-virginica", ...]
>
Link to this section Functions: Introspection
Gets the dtypes of the dataframe columns.
examples
Examples
iex> df = Explorer.DataFrame.new(floats: [1.0, 2.0], ints: [1, 2])
iex> Explorer.DataFrame.dtypes(df)
%{"floats" => :float, "ints" => :integer}
Returns the groups of a dataframe.
examples
Examples
iex> df = Explorer.Datasets.fossil_fuels()
iex> df = Explorer.DataFrame.group_by(df, "country")
iex> Explorer.DataFrame.groups(df)
["country"]
iex> df = Explorer.Datasets.iris()
iex> Explorer.DataFrame.groups(df)
[]
Returns the number of columns in the dataframe.
This function works the same way for grouped dataframes.
examples
Examples
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.n_columns(df)
10
Returns the number of rows in the dataframe.
This function works the same way for grouped dataframes, considering the entire dataframe in the counting of rows.
examples
Examples
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.n_rows(df)
1094
Gets the names of the dataframe columns.
examples
Examples
iex> df = Explorer.DataFrame.new(floats: [1.0, 2.0], ints: [1, 2])
iex> Explorer.DataFrame.names(df)
["floats", "ints"]
Gets the shape of the dataframe as a {height, width}
tuple.
This function works the same way for grouped dataframes, considering the entire dataframe in the counting of rows.
examples
Examples
iex> df = Explorer.DataFrame.new(floats: [1.0, 2.0, 3.0], ints: [1, 2, 3])
iex> Explorer.DataFrame.shape(df)
{3, 2}
Display the DataFrame in a tabular fashion.
examples
Examples
df = Explorer.Datasets.iris() Explorer.DataFrame.table(df) Explorer.DataFrame.table(df, limit: 1) Explorer.DataFrame.table(df, limit: :infinity)
Link to this section Functions: IO
Writes a dataframe to a binary representation of a delimited file.
options
Options
:header
- Should the column names be written as the first line of the file? (default:true
):delimiter
- A single character used to separate fields within a record. (default:","
)
examples
Examples
iex> df = Explorer.Datasets.fossil_fuels() |> Explorer.DataFrame.head(2)
iex> Explorer.DataFrame.dump_csv(df)
{:ok, "year,country,total,solid_fuel,liquid_fuel,gas_fuel,cement,gas_flaring,per_capita,bunker_fuels\n2010,AFGHANISTAN,2308,627,1601,74,5,0,0.08,9\n2010,ALBANIA,1254,117,953,7,177,0,0.43,7\n"}
Similar to dump_csv/2
, but raises in case of error.
Writes a dataframe to a binary representation of an IPC file.
Groups are ignored if the dataframe is using any.
options
Options
:compression
- The compression algorithm to use when writing files. Supported options are:nil
(uncompressed, default):zstd
:lz4
.
Similar to dump_ipc/2
, but raises in case of error.
Writes a dataframe to a binary representation of an IPC Stream file.
Groups are ignored if the dataframe is using any.
options
Options
:compression
- The compression algorithm to use when writing files. Supported options are:nil
(uncompressed, default):zstd
:lz4
.
Similar to dump_ipc_stream/2
, but raises in case of error.
Writes a dataframe to a binary representation of a NDJSON file.
Groups are ignored if the dataframe is using any.
examples
Examples
iex> df = Explorer.DataFrame.new(col_a: [1, 2], col_b: [5.1, 5.2])
iex> Explorer.DataFrame.dump_ndjson(df)
{:ok, ~s({"col_a":1,"col_b":5.1}\n{"col_a":2,"col_b":5.2}\n)}
Similar to dump_ndjson!/2
, but raises in case of error.
Writes a dataframe to a binary representation of a Parquet file.
Groups are ignored if the dataframe is using any.
options
Options
:compression
- The compression algorithm to use when writing files. Where a compression level is available, this can be passed as a tuple, such as{:zstd, 3}
. Supported options are:nil
(uncompressed, default):snappy
:gzip
(with levels 1-9):brotli
(with levels 1-11):zstd
(with levels -7-22):lz4raw
.
Similar to dump_parquet/2
, but raises in case of error.
Reads a delimited file into a dataframe.
If the CSV is compressed, it is automatically decompressed.
options
Options
:delimiter
- A single character used to separate fields within a record. (default:","
):dtypes
- A list/map of{"column_name", dtype}
tuples. Any non-specified column has its type imputed from the first 1000 rows. (default:[]
):header
- Does the file have a header of column names as the first row or not? (default:true
):max_rows
- Maximum number of lines to read. (default:nil
):null_character
- The string that should be interpreted as a nil value. (default:"NA"
):skip_rows
- The number of lines to skip at the beginning of the file. (default:0
):columns
- A list of column names or indexes to keep. If present, only these columns are read into the dataframe. (default:nil
):infer_schema_length
Maximum number of rows read for schema inference. Setting this to nil will do a full table scan and will be slow (default:1000
).:parse_dates
- Automatically try to parse dates/ datetimes and time. If parsing fails, columns remain of dtypestring
Similar to from_csv/2
but raises if there is a problem reading the CSV.
Reads an IPC file into a dataframe.
options
Options
:columns
- List with the name or index of columns to be selected. Defaults to all columns.
Similar to from_ipc/2
but raises if there is a problem reading the IPC file.
Reads an IPC Streaming file into a dataframe.
options
Options
:columns
- List with the name or index of columns to be selected. Defaults to all columns.
Similar to from_ipc_stream/2
but raises if there is a problem reading the IPC Stream file.
Read a file of JSON objects or lists separated by new lines
options
Options
:batch_size
- Sets the batch size for reading rows. This value may have significant impact in performance, so adjust it for your needs (default:1000
).:infer_schema_length
- Maximum number of rows read for schema inference. Setting this to nil will do a full table scan and will be slow (default:1000
).
Similar to from_ndjson/2
, but raises in case of error.
Reads a parquet file into a dataframe.
Similar to from_parquet/2
but raises if there is a problem reading the Parquet file.
Reads a representation of a CSV file into a dataframe.
If the CSV is compressed, it is automatically decompressed.
options
Options
:delimiter
- A single character used to separate fields within a record. (default:","
):dtypes
- A list/map of{"column_name", dtype}
tuples. Any non-specified column has its type imputed from the first 1000 rows. (default:[]
):header
- Does the file have a header of column names as the first row or not? (default:true
):max_rows
- Maximum number of lines to read. (default:nil
):null_character
- The string that should be interpreted as a nil value. (default:"NA"
):skip_rows
- The number of lines to skip at the beginning of the file. (default:0
):columns
- A list of column names or indexes to keep. If present, only these columns are read into the dataframe. (default:nil
):infer_schema_length
Maximum number of rows read for schema inference. Setting this to nil will do a full table scan and will be slow (default:1000
).:parse_dates
- Automatically try to parse dates/ datetimes and time. If parsing fails, columns remain of dtypestring
Similar to load_csv/2
but raises if there is a problem reading the CSV.
Reads a binary representing an IPC file into a dataframe.
options
Options
:columns
- List with the name or index of columns to be selected. Defaults to all columns.
Similar to load_ipc/2
but raises if there is a problem reading the IPC file.
Reads a binary representing an IPC Stream file into a dataframe.
options
Options
:columns
- List with the name or index of columns to be selected. Defaults to all columns.
Similar to load_ipc_stream/2
but raises if there is a problem.
Reads a representation of a NDJSON file into a dataframe.
options
Options
:batch_size
- Sets the batch size for reading rows. This value may have significant impact in performance, so adjust it for your needs (default:1000
).:infer_schema_length
- Maximum number of rows read for schema inference. Setting this to nil will do a full table scan and will be slow (default:1000
).
Similar to load_ndjson/2
, but raises in case of error.
examples
Examples
iex> contents = ~s({"col_a":1,"col_b":5.1}\n{"col_a":2,"col_b":5.2}\n)
iex> Explorer.DataFrame.load_ndjson!(contents)
#Explorer.DataFrame<
Polars[2 x 2]
col_a integer [1, 2]
col_b float [5.1, 5.2]
>
Reads a binary representation of a parquet file into a dataframe.
Similar to load_parquet/2
but raises if there is a problem reading the Parquet file.
Writes a dataframe to a delimited file.
Groups are ignored if the dataframe is using any.
options
Options
:header
- Should the column names be written as the first line of the file? (default:true
):delimiter
- A single character used to separate fields within a record. (default:","
)
Similar to to_csv/3
but raises if there is a problem reading the CSV.
Writes a dataframe to an IPC file.
Apache IPC is a language-agnostic columnar data structure that can be used to store dataframes. It excels as a format for quickly exchange data between different programming languages.
Groups are ignored if the dataframe is using any.
options
Options
:compression
- The compression algorithm to use when writing files. Supported options are:nil
(uncompressed, default):zstd
:lz4
.
Similar to to_ipc/3
, but raises in case of error.
Writes a dataframe to an IPC Stream file.
Arrow IPC Streams provide a streaming protocol or “format" for sending an arbitrary length sequence of record batches. The format must be processed from start to end, and does not support random access.
options
Options
:compression
- The compression algorithm to use when writing files. Supported options are:nil
(uncompressed, default):zstd
:lz4
.
Writes a dataframe to a ndjson file.
Groups are ignored if the dataframe is using any.
NDJSON are files that contains JSON files separated by new lines. They are often used as structured logs.
Writes a dataframe to a parquet file.
Groups are ignored if the dataframe is using any.
options
Options
:compression
- The compression algorithm to use when writing files. Where a compression level is available, this can be passed as a tuple, such as{:zstd, 3}
. Supported options are:nil
(uncompressed, default):snappy
:gzip
(with levels 1-9):brotli
(with levels 1-11):zstd
(with levels -7-22):lz4raw
.
Similar to to_parquet/3
, but raises in case of error.