Explorer.DataFrame (Explorer v0.9.2)
The DataFrame struct and API.
Dataframes are two-dimensional tabular data structures similar to a spreadsheet. For example, the Iris dataset:
iex> Explorer.Datasets.iris()
#Explorer.DataFrame<
Polars[150 x 5]
sepal_length f64 [5.1, 4.9, 4.7, 4.6, 5.0, ...]
sepal_width f64 [3.5, 3.0, 3.2, 3.1, 3.6, ...]
petal_length f64 [1.4, 1.4, 1.3, 1.5, 1.4, ...]
petal_width f64 [0.2, 0.2, 0.2, 0.2, 0.2, ...]
species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", ...]
>
This dataframe has 150 rows and five columns. Each column is an Explorer.Series
of the same size (150):
iex> df = Explorer.Datasets.iris()
iex> df["sepal_length"]
#Explorer.Series<
Polars[150]
f64 [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.8, 4.8, 4.3, 5.8, 5.7, 5.4, 5.1, 5.7, 5.1, 5.4, 5.1, 4.6, 5.1, 4.8, 5.0, 5.0, 5.2, 5.2, 4.7, 4.8, 5.4, 5.2, 5.5, 4.9, 5.0, 5.5, 4.9, 4.4, 5.1, 5.0, 4.5, 4.4, 5.0, 5.1, 4.8, 5.1, 4.6, 5.3, 5.0, ...]
>
Creating dataframes
Dataframes can be created from normal Elixir terms. The main way you might do this is
with the new/1
function. For example:
iex> Explorer.DataFrame.new(a: ["a", "b"], b: [1, 2])
#Explorer.DataFrame<
Polars[2 x 2]
a string ["a", "b"]
b s64 [1, 2]
>
Or with a list of maps:
iex> Explorer.DataFrame.new([%{"col1" => "a", "col2" => 1}, %{"col1" => "b", "col2" => 2}])
#Explorer.DataFrame<
Polars[2 x 2]
col1 string ["a", "b"]
col2 s64 [1, 2]
>
Verbs
Explorer uses the idea of a consistent set of SQL-like verbs
like
dplyr
which can help solve common
data manipulation challenges. These are split into single table verbs,
multiple table verbs, and row-based verbs:
Single table verbs
Single table verbs are (unsurprisingly) used for manipulating a single dataframe. These operations are typically driven by column names. They are:
- select/2 for picking columns and discard/2 to discard them
- filter/2 for picking rows based on predicates
- mutate/2 for adding or replacing columns that are functions of existing columns
- sort_by/2 for changing the ordering of rows
- distinct/2 for picking unique rows
- summarise/2 for reducing multiple rows down to a single summary
- pivot_longer/3 and pivot_wider/4 for massaging dataframes into longer or wider forms, respectively
Each of these combines with Explorer.DataFrame.group_by/2 for operating by group.
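As an illustrative sketch (assuming the bundled Iris dataset), these verbs compose naturally with the pipe operator. Note that filter/2 and summarise/2 are macros, so the module must be required first:

```elixir
# Hypothetical pipeline combining several single-table verbs.
# `require` is needed because filter/2 and summarise/2 are macros.
require Explorer.DataFrame, as: DF

Explorer.Datasets.iris()
|> DF.select(["species", "petal_length"])
|> DF.filter(petal_length > 1.0)
|> DF.group_by("species")
|> DF.summarise(mean_petal_length: mean(petal_length))
```

Because the dataframe is grouped before summarise/2, the result has one summary row per species.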
Multiple table verbs
Multiple table verbs are used for combining tables. These are:
- join/3 for performing SQL-like joins
- concat_columns/1 for horizontally "stacking" dataframes
- concat_rows/1 for vertically "stacking" dataframes
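A minimal sketch of the multiple table verbs; the column names here are made up for illustration:

```elixir
alias Explorer.DataFrame, as: DF

left = DF.new(id: [1, 2, 3], name: ["a", "b", "c"])
right = DF.new(id: [1, 2, 4], score: [10, 20, 40])

# SQL-like inner join on the shared "id" column (:inner is the default)
DF.join(left, right, how: :inner)

# Vertical stacking: the frames must share column names and dtypes
DF.concat_rows([left, left])
```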
Row-based verbs
These operations are driven by the row index. They are:
- head/2 for picking the first rows
- tail/2 for picking the last rows
- slice/2 for slicing the dataframe by row indices or a range
- slice/3 for slicing a section by an offset
- sample/2 for sampling the dataframe by row
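A quick sketch of the row-based verbs on a small made-up dataframe:

```elixir
alias Explorer.DataFrame, as: DF

df = DF.new(x: Enum.to_list(1..10))

DF.head(df, 3)             # first 3 rows
DF.tail(df, 3)             # last 3 rows
DF.slice(df, 2, 4)         # 4 rows starting at offset 2
DF.slice(df, [0, 9])       # rows at indices 0 and 9
DF.sample(df, 5, seed: 42) # 5 random rows, reproducible via :seed
```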
IO operations
Explorer supports reading and writing of:
- delimited files (such as CSV or TSV)
- Parquet
- Arrow IPC
- Arrow Streaming IPC
- Newline Delimited JSON
- Databases via Adbc in from_query/3
The convention Explorer uses is to have from_* and to_* functions to read and write files in the formats above. load_* and dump_* versions are also available to read and write those formats directly in memory.
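For instance, a dump/load round trip keeps everything in memory, whereas the to_*/from_* counterparts would touch the file system (a sketch; error handling elided):

```elixir
alias Explorer.DataFrame, as: DF

df = DF.new(a: [1, 2], b: ["x", "y"])

# dump_* returns an in-memory binary instead of writing a file
{:ok, csv} = DF.dump_csv(df)

# load_* reads that binary back into a dataframe
{:ok, df2} = DF.load_csv(csv)
```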
Files can be fetched from a local or remote file system, such as S3, using the following formats:
# path to a file on disk
Explorer.DataFrame.from_parquet("/path/to/file.parquet")
# path with a URL scheme (with optional configuration)
Explorer.DataFrame.from_parquet("s3://bucket/file.parquet", config: FSS.S3.config_from_system_env())
# it's possible to configure using keyword lists
Explorer.DataFrame.from_parquet("s3://bucket/file.parquet", config: [access_key_id: "my-key", secret_access_key: "my-secret"])
# a FSS entry (it already includes its config)
Explorer.DataFrame.from_parquet(FSS.S3.parse("s3://bucket/file.parquet"))
The :config option of from_* functions is only required if the filename is a path to a remote resource. If it is an FSS entry, the config must be passed inside the entry struct. For more details about the options, see the FSS docs.
Selecting columns and access
Several functions in this module, such as select/2, discard/2, drop_nil/2, and so forth, accept one or multiple columns as arguments. The columns can be specified in a variety of formats, which we describe below.
Explorer.DataFrame also implements the Access behaviour (also known as the brackets syntax). This should be familiar for users coming from other languages with dataframes, such as R or Python. For example:
iex> df = Explorer.Datasets.wine()
iex> df["class"]
#Explorer.Series<
Polars[178]
s64 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]
>
Accessing the dataframe with a column name, either as a string or an atom, returns the column. You can also pass an integer representing the column order:
iex> df = Explorer.Datasets.wine()
iex> df[0]
#Explorer.Series<
Polars[178]
s64 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]
>
You can also pass a list, a range, or a regex, which returns a dataframe with the matching columns. For example, by passing a list:
iex> df = Explorer.Datasets.wine()
iex> df[["class", "hue"]]
#Explorer.DataFrame<
Polars[178 x 2]
class s64 [1, 1, 1, 1, 1, ...]
hue f64 [1.04, 1.05, 1.03, 0.86, 1.04, ...]
>
Or a range for the given positions:
iex> df = Explorer.Datasets.wine()
iex> df[0..2]
#Explorer.DataFrame<
Polars[178 x 3]
class s64 [1, 1, 1, 1, 1, ...]
alcohol f64 [14.23, 13.2, 13.16, 14.37, 13.24, ...]
malic_acid f64 [1.71, 1.78, 2.36, 1.95, 2.59, ...]
>
Or a regex to keep only columns matching a given pattern:
iex> df = Explorer.Datasets.wine()
iex> df[~r/(class|hue)/]
#Explorer.DataFrame<
Polars[178 x 2]
class s64 [1, 1, 1, 1, 1, ...]
hue f64 [1.04, 1.05, 1.03, 0.86, 1.04, ...]
>
Since you can also access a series by index, you can chain accesses to select a column and a row at the same time:
iex> df = Explorer.Datasets.wine()
iex> df["class"][3]
1
Summary
Functions: Conversion
Computes and collects a data frame to the current node.
Computes the lazy data frame into an eager one, executing the query.
Converts the dataframe to the lazy version of the current backend.
Returns true if the data frame is a lazy one.
Creates a new dataframe.
Converts a dataframe to a list of columns with lists as values.
Converts a dataframe to a list of maps (rows).
Converts a dataframe to a stream of maps (rows).
Converts a dataframe to a list of columns with series as values.
Functions: Single-table
Calculates the pairwise correlation of numeric columns.
Calculates the pairwise covariance of numeric columns.
Describe numeric columns of a DataFrame.
Discards a subset of columns by name.
Takes distinct rows by a selection of columns.
Drop nil values.
Turns a set of columns to dummy variables.
Explodes one or multiple list column into multiple rows.
Picks rows based on Explorer.Query.
Picks rows based on a callback function.
Creates a new dataframe with unique rows and the frequencies of each.
Group the dataframe by one or more variables.
Picks rows based on a list or series of values.
Creates or modifies columns based on Explorer.Query.
Creates or modifies columns using a callback function and an optional keep argument.
Counts the number of nil elements in each column.
Pivot data from wide to long.
Pivot data from long to wide.
Extracts a single column as a series.
Creates or modifies a single column.
Relocates columns.
Renames columns.
Renames columns with a function.
Selects a subset of columns by name.
Sorts rows by columns using Explorer.Query.
Sorts rows by columns using a callback function.
Create a dataframe from the result of a SQL query on an existing dataframe.
Summarise each group to a single row using Explorer.Query.
Summarise each group to a single row using a callback function.
Make new columns by applying a native Elixir function to each row.
Transpose a DataFrame.
Removes grouping variables.
Unnests one or multiple struct columns into individual columns.
Functions: Multi-table
Combine two or more dataframes column-wise.
Combine two dataframes column-wise.
Combine two or more dataframes row-wise (stack).
Combine two dataframes row-wise.
Join two tables.
Functions: Row-based
Returns the first n rows of the dataframe.
Sample rows from a dataframe.
Change the order of the rows of a dataframe randomly.
Slices rows at the given indices as a new dataframe.
Subset a continuous set of rows.
Returns the last n rows of the dataframe.
Functions: Introspection
Gets the dtypes of the dataframe columns.
Returns the groups of a dataframe.
Returns the number of columns in the dataframe.
Returns the number of rows in the dataframe.
Gets the names of the dataframe columns.
Prints the DataFrame in a tabular fashion.
Gets the shape of the dataframe as a {height, width} tuple.
Functions: IO
Writes a dataframe to a binary representation of a delimited file.
Similar to dump_csv/2, but raises in case of error.
Writes a dataframe to a binary representation of an IPC file.
Similar to dump_ipc/2, but raises in case of error.
Writes a dataframe to a binary representation of an IPC Stream file.
Similar to dump_ipc_stream/2, but raises in case of error.
Writes a dataframe to a binary representation of an NDJSON file.
Similar to dump_ndjson/2, but raises in case of error.
Writes a dataframe to a binary representation of a Parquet file.
Similar to dump_parquet/2, but raises in case of error.
Reads a delimited file into a dataframe.
Similar to from_csv/2, but raises if there is a problem reading the CSV.
Reads an IPC file into a dataframe.
Similar to from_ipc/2, but raises if there is a problem reading the IPC file.
Reads an IPC Streaming file into a dataframe.
Similar to from_ipc_stream/2, but raises if there is a problem reading the IPC Stream file.
Reads a file of JSON objects or lists separated by new lines.
Similar to from_ndjson/2, but raises in case of error.
Reads a Parquet file into a dataframe.
Similar to from_parquet/2, but raises if there is a problem reading the Parquet file.
Reads data from a query.
Similar to from_query/4, but raises if there is an error.
Reads a binary representation of a CSV file into a dataframe.
Similar to load_csv/2, but raises if there is a problem reading the CSV.
Reads a binary representing an IPC file into a dataframe.
Similar to load_ipc/2, but raises if there is a problem reading the IPC file.
Reads a binary representing an IPC Stream file into a dataframe.
Similar to load_ipc_stream/2, but raises if there is a problem.
Reads a binary representation of an NDJSON file into a dataframe.
Similar to load_ndjson/2, but raises in case of error.
Reads a binary representation of a Parquet file into a dataframe.
Similar to load_parquet/2, but raises if there is a problem reading the Parquet file.
Writes a dataframe to a delimited file.
Similar to to_csv/3, but raises if there is a problem writing the CSV.
Writes a dataframe to an IPC file.
Similar to to_ipc/3, but raises in case of error.
Writes a dataframe to an IPC Stream file.
Similar to to_ipc_stream/3, but raises in case of error.
Writes a dataframe to an NDJSON file.
Similar to to_ndjson/3, but raises in case of error.
Writes a dataframe to a Parquet file.
Similar to to_parquet/3, but raises in case of error.
Types
Represents a column name or its index.
Represents a column name as atom or string.
Represents multiple column names as atoms or strings.
Represents a column pair where the key is a column name or a column index, and the value is of type value.
Represents multiple columns.
Represents a filesystem entry that can be local, S3, or a URL.
Represents a dataframe.
Functions: Conversion
Computes and collects a data frame to the current node.
If the data frame is already in the current node, it is computed but no transfer happens.
Examples
series = Explorer.Series.from_list([1, 2, 3], node: :some@node)
Explorer.Series.collect(series)
Computes the lazy data frame into an eager one, executing the query.
It is the opposite of lazy/1. If it is not a lazy data frame, this is a noop.
Collecting a grouped dataframe should return a grouped dataframe.
Converts the dataframe to the lazy version of the current backend.
Operations on a lazy dataframe are not executed immediately; instead they are batched together and performed when collect/1 is invoked. This allows for query optimizations and parallelism, leading to better performance.
If already lazy, this is a noop. Converting a grouped dataframe should return a lazy dataframe with groups.
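As a sketch (assuming the bundled Iris dataset), a typical lazy pipeline defers all work until collect/1:

```elixir
require Explorer.DataFrame, as: DF

# Nothing below executes immediately; the operations are only recorded...
lazy =
  Explorer.Datasets.iris()
  |> DF.lazy()
  |> DF.filter(sepal_length > 5.0)
  |> DF.select(["species", "sepal_length"])

# ...until collect/1 runs the batched query and returns an eager frame
eager = DF.collect(lazy)
```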
Returns true if the data frame is a lazy one.
@spec new(Table.Reader.t() | series_pairs, opts :: Keyword.t()) :: t()
      when series_pairs:
             %{required(column_name()) => Explorer.Series.t()}
             | [{column_name(), Explorer.Series.t()}]
Creates a new dataframe.
It accepts any of:
- a map or keyword list of string/atom keys and series as values
- a map or keyword list of string/atom keys and tensors as values
- any data structure adhering to the Table.Reader protocol
Options
- :backend - The Explorer backend to use. Defaults to the value returned by Explorer.Backend.get/0.
- :dtypes - A list/map of {column_name, dtype} pairs. (default: [])
- :lazy - force the results into the lazy version of the current backend.
Examples
From series
Series can be given either as keyword lists or maps where the keys are the names and the values are series:
iex> Explorer.DataFrame.new(%{
...> floats: Explorer.Series.from_list([1.0, 2.0]),
...> ints: Explorer.Series.from_list([1, nil])
...> })
#Explorer.DataFrame<
Polars[2 x 2]
floats f64 [1.0, 2.0]
ints s64 [1, nil]
>
From tensors
To create a dataframe from tensors, you can pass a matrix as an argument. Each matrix column becomes a dataframe column with names x1, x2, x3, etc.:
iex> Explorer.DataFrame.new(Nx.tensor([
...> [1, 2, 3],
...> [4, 5, 6]
...> ]))
#Explorer.DataFrame<
Polars[2 x 3]
x1 s64 [1, 4]
x2 s64 [2, 5]
x3 s64 [3, 6]
>
Explorer expects tensors to have certain types, so you may need to cast the data accordingly. See Explorer.Series.from_tensor/2 for more info.
You can also pass keyword lists or maps of vectors (rank 1 tensors):
iex> Explorer.DataFrame.new(%{
...> floats: Nx.tensor([1.0, 2.0], type: :f64),
...> ints: Nx.tensor([3, 4])
...> })
#Explorer.DataFrame<
Polars[2 x 2]
floats f64 [1.0, 2.0]
ints s64 [3, 4]
>
Use dtypes to force a particular representation:
iex> Explorer.DataFrame.new([
...> floats: Nx.tensor([1.0, 2.0], type: :f64),
...> times: Nx.tensor([3_000, 4_000])
...> ], dtypes: [times: :time])
#Explorer.DataFrame<
Polars[2 x 2]
floats f64 [1.0, 2.0]
times time [00:00:00.000003, 00:00:00.000004]
>
From tabular
Tabular data can be either columnar or row-based. Let's start with column data:
iex> Explorer.DataFrame.new(%{floats: [1.0, 2.0], ints: [1, nil]})
#Explorer.DataFrame<
Polars[2 x 2]
floats f64 [1.0, 2.0]
ints s64 [1, nil]
>
iex> Explorer.DataFrame.new(floats: [1.0, 2.0], ints: [1, nil])
#Explorer.DataFrame<
Polars[2 x 2]
floats f64 [1.0, 2.0]
ints s64 [1, nil]
>
iex> Explorer.DataFrame.new([floats: [1.0, 2.0], ints: [1, nil], binaries: [<<239, 191, 19>>, nil]], dtypes: [{:binaries, :binary}])
#Explorer.DataFrame<
Polars[2 x 3]
floats f64 [1.0, 2.0]
ints s64 [1, nil]
binaries binary [<<239, 191, 19>>, nil]
>
iex> Explorer.DataFrame.new(%{floats: [1.0, 2.0], ints: [1, "wrong"]})
** (ArgumentError) cannot create series "ints": the value "wrong" does not match the inferred dtype {:s, 64}
From row data:
iex> rows = [%{id: 1, name: "José"}, %{id: 2, name: "Christopher"}, %{id: 3, name: "Cristine"}]
iex> Explorer.DataFrame.new(rows)
#Explorer.DataFrame<
Polars[3 x 2]
id s64 [1, 2, 3]
name string ["José", "Christopher", "Cristine"]
>
iex> rows = [[id: 1, name: "José"], [id: 2, name: "Christopher"], [id: 3, name: "Cristine"]]
iex> Explorer.DataFrame.new(rows)
#Explorer.DataFrame<
Polars[3 x 2]
id s64 [1, 2, 3]
name string ["José", "Christopher", "Cristine"]
>
Converts a dataframe to a list of columns with lists as values.
See to_series/2 if you want a list of columns with series as values.
Note that this function does not take into account groups.
Warning
This is an expensive operation since it converts series to lists, and doing so will copy the whole dataframe. Prefer to use the operations in this module and in Explorer.Series rather than the ones in Enum whenever possible, as Explorer is optimized for large series.
Options
- :atom_keys - Configure if the resultant map should have atom keys. (default: false)
Examples
iex> df = Explorer.DataFrame.new(ints: [1, nil], floats: [1.0, 2.0])
iex> Explorer.DataFrame.to_columns(df)
%{"floats" => [1.0, 2.0], "ints" => [1, nil]}
iex> df = Explorer.DataFrame.new(floats: [1.0, 2.0], ints: [1, nil])
iex> Explorer.DataFrame.to_columns(df, atom_keys: true)
%{floats: [1.0, 2.0], ints: [1, nil]}
Converts a dataframe to a list of maps (rows).
Warning
This is an expensive operation since data is stored in a columnar format. You must avoid converting a dataframe to rows, as that will transform and copy the whole dataframe in memory. Prefer to use the operations in this module rather than the ones in Enum whenever possible, as this module is optimized for large series.
Options
- :atom_keys - Configure if the resultant maps should have atom keys. (default: false)
Examples
iex> df = Explorer.DataFrame.new(floats: [1.0, 2.0], ints: [1, nil])
iex> Explorer.DataFrame.to_rows(df)
[%{"floats" => 1.0, "ints" => 1}, %{"floats" => 2.0, "ints" => nil}]
iex> df = Explorer.DataFrame.new(floats: [1.0, 2.0], ints: [1, nil])
iex> Explorer.DataFrame.to_rows(df, atom_keys: true)
[%{floats: 1.0, ints: 1}, %{floats: 2.0, ints: nil}]
@spec to_rows_stream(df :: t(), Keyword.t()) :: Enumerable.t()
Converts a dataframe to a stream of maps (rows).
Warning
This is an expensive operation since data is stored in a columnar format. Prefer to use the operations in this module rather than the ones in Enum whenever possible, as this module is optimized for large series.
Options
- :atom_keys - Configure if the resultant maps should have atom keys. (default: false)
- :chunk_size - Number of rows passed to to_rows/2 while streaming over the data. (default: 1000)
Examples
iex> df = Explorer.DataFrame.new(floats: [1.0, 2.0], ints: [1, nil])
iex> Explorer.DataFrame.to_rows_stream(df) |> Enum.map(& &1)
[%{"floats" => 1.0, "ints" => 1}, %{"floats" => 2.0, "ints" => nil}]
iex> df = Explorer.DataFrame.new(floats: [1.0, 2.0], ints: [1, nil])
iex> Explorer.DataFrame.to_rows_stream(df, atom_keys: true) |> Enum.map(& &1)
[%{floats: 1.0, ints: 1}, %{floats: 2.0, ints: nil}]
Converts a dataframe to a list of columns with series as values.
See to_columns/2 if you want a list of columns with lists as values.
Note that this function does not take into account groups.
Options
- :atom_keys - Configure if the resultant map should have atom keys. (default: false)
Examples
iex> df = Explorer.DataFrame.new(ints: [1, nil], floats: [1.0, 2.0])
iex> map = Explorer.DataFrame.to_series(df)
iex> Explorer.Series.to_list(map["floats"])
[1.0, 2.0]
iex> Explorer.Series.to_list(map["ints"])
[1, nil]
Functions: Single-table
Calculates the pairwise correlation of numeric columns.
The returned dataframe is the correlation matrix.
Supported dtypes
Only columns with the following dtypes are taken into account.
- floats: {:f, 32} and {:f, 64}
- integers: {:s, 8}, {:s, 16}, {:s, 32}, {:s, 64}, {:u, 8}, {:u, 16}, {:u, 32} and {:u, 64}
The resultant columns are always {:f, 64}.
Options
- :columns - the selection of columns to calculate. Defaults to all numeric columns.
- :column_name - the name of the column with column names. Defaults to "names".
- :ddof - the 'delta degrees of freedom', i.e. the divisor used in the correlation calculation. Defaults to 1.
- :method - the correlation method. The following methods are available:
  - :pearson - standard correlation coefficient. (default)
  - :spearman - Spearman rank correlation.
Examples
iex> df = Explorer.DataFrame.new(dogs: [1, 8, 3], cats: [4, 5, 2])
iex> Explorer.DataFrame.correlation(df)
#Explorer.DataFrame<
Polars[2 x 3]
names string ["dogs", "cats"]
dogs f64 [1.0000000000000002, 0.5447047794019219]
cats f64 [0.5447047794019219, 1.0]
>
Calculates the pairwise covariance of numeric columns.
The returned dataframe is the covariance matrix.
Supported dtypes
Only columns with the following dtypes are taken into account.
- floats: {:f, 32} and {:f, 64}
- integers: {:s, 8}, {:s, 16}, {:s, 32}, {:s, 64}, {:u, 8}, {:u, 16}, {:u, 32} and {:u, 64}
The resultant columns are always {:f, 64}.
Options
- :columns - the selection of columns to calculate. Defaults to all numeric columns.
- :column_name - the name of the column with column names. Defaults to "names".
- :ddof - the 'delta degrees of freedom', i.e. the divisor used in the covariance calculation. Defaults to 1.
Examples
iex> df = Explorer.DataFrame.new(dogs: [1, 0, 2, 1], cats: [2, 3, 0, 1])
iex> Explorer.DataFrame.covariance(df)
#Explorer.DataFrame<
Polars[2 x 3]
names string ["dogs", "cats"]
dogs f64 [0.6666666666666666, -1.0]
cats f64 [-1.0, 1.6666666666666667]
>
Describe numeric columns of a DataFrame.
Groups are ignored if the dataframe is using any.
Options
- :percentiles - Floating point list with the percentiles to be calculated. (default: [0.25, 0.5, 0.75])
Examples
iex> df = Explorer.DataFrame.new(a: ["d", nil, "f"], b: [1, 2, 3], c: ["a", "b", "c"])
iex> Explorer.DataFrame.describe(df)
#Explorer.DataFrame<
Polars[9 x 4]
describe string ["count", "nil_count", "mean", "std", "min", ...]
a string ["2", "1", nil, nil, nil, ...]
b f64 [3.0, 0.0, 2.0, 1.0, 1.0, ...]
c string ["3", "0", nil, nil, nil, ...]
>
iex> df = Explorer.DataFrame.new(a: ["d", nil, "f"], b: [1, 2, 3], c: ["a", "b", "c"])
iex> Explorer.DataFrame.describe(df, percentiles: [0.3, 0.5, 0.8])
#Explorer.DataFrame<
Polars[9 x 4]
describe string ["count", "nil_count", "mean", "std", "min", ...]
a string ["2", "1", nil, nil, nil, ...]
b f64 [3.0, 0.0, 2.0, 1.0, 1.0, ...]
c string ["3", "0", nil, nil, nil, ...]
>
Discards a subset of columns by name.
It's important to notice that groups are kept: you can't discard grouping columns.
Examples
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.discard(df, ["b"])
#Explorer.DataFrame<
Polars[3 x 1]
a string ["a", "b", "c"]
>
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3], c: [4, 5, 6])
iex> Explorer.DataFrame.discard(df, ["a", "b"])
#Explorer.DataFrame<
Polars[3 x 1]
c s64 [4, 5, 6]
>
Ranges, regexes, and functions are also accepted in column names, as in select/2.
Grouped examples
You cannot discard grouped columns. You need to ungroup before removing them:
iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.discard(grouped, ["species"])
#Explorer.DataFrame<
Polars[150 x 5]
Groups: ["species"]
sepal_length f64 [5.1, 4.9, 4.7, 4.6, 5.0, ...]
sepal_width f64 [3.5, 3.0, 3.2, 3.1, 3.6, ...]
petal_length f64 [1.4, 1.4, 1.3, 1.5, 1.4, ...]
petal_width f64 [0.2, 0.2, 0.2, 0.2, 0.2, ...]
species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", ...]
>
Takes distinct rows by a selection of columns.
Distinct is not affected by groups, although groups are kept in the columns selection if the keep_all option is false (the default).
Options
- :keep_all - If set to true, keep all columns. Default is false.
Examples
By default it returns the unique values of the requested columns:
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.distinct(df, ["year", "country"])
#Explorer.DataFrame<
Polars[1094 x 2]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
>
If keep_all is set to true, then the first value of each column not in the requested columns will be returned:
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.distinct(df, ["year", "country"], keep_all: true)
#Explorer.DataFrame<
Polars[1094 x 10]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total s64 [2308, 1254, 32500, 141, 7924, ...]
solid_fuel s64 [627, 117, 332, 0, 0, ...]
liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
gas_fuel s64 [74, 7, 14565, 0, 374, ...]
cement s64 [5, 177, 2598, 0, 204, ...]
gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>
A callback on the dataframe's names can be passed instead of a list (like select/2):
iex> df = Explorer.DataFrame.new(x1: [1, 3, 3], x2: ["a", "c", "c"], y1: [1, 2, 3])
iex> Explorer.DataFrame.distinct(df, &String.starts_with?(&1, "x"))
#Explorer.DataFrame<
Polars[2 x 2]
x1 s64 [1, 3]
x2 string ["a", "c"]
>
If the dataframe has groups, then the columns of each group will be added to the distinct columns:
iex> df = Explorer.DataFrame.new(x1: [1, 3, 3], x2: ["a", "c", "c"], y1: [1, 2, 3])
iex> df = Explorer.DataFrame.group_by(df, "x1")
iex> Explorer.DataFrame.distinct(df, ["x2"])
#Explorer.DataFrame<
Polars[2 x 2]
Groups: ["x1"]
x1 s64 [1, 3]
x2 string ["a", "c"]
>
Drop nil values.
Optionally accepts a subset of columns.
Examples
To drop nils on all columns:
iex> df = Explorer.DataFrame.new(a: [1, 2, nil], b: [1, nil, 3])
iex> Explorer.DataFrame.drop_nil(df)
#Explorer.DataFrame<
Polars[1 x 2]
a s64 [1]
b s64 [1]
>
To drop nils on a single column:
iex> df = Explorer.DataFrame.new(a: [1, 2, nil], b: [1, nil, 3])
iex> Explorer.DataFrame.drop_nil(df, :a)
#Explorer.DataFrame<
Polars[2 x 2]
a s64 [1, 2]
b s64 [1, nil]
>
To drop some columns:
iex> df = Explorer.DataFrame.new(a: [1, 2, nil], b: [1, nil, 3], c: [nil, 5, 6])
iex> Explorer.DataFrame.drop_nil(df, [:a, :c])
#Explorer.DataFrame<
Polars[1 x 3]
a s64 [2]
b s64 [nil]
c s64 [5]
>
Ranges, regexes, and functions are also accepted in column names, as in select/2.
Turns a set of columns into dummy variables.
In case the dataframe is using groups, all groups will be removed.
Examples
To mark a single column as dummy:
iex> df = Explorer.DataFrame.new(col_x: ["a", "b", "a", "c"], col_y: ["b", "a", "b", "d"])
iex> Explorer.DataFrame.dummies(df, "col_x")
#Explorer.DataFrame<
Polars[4 x 3]
col_x_a u8 [1, 0, 1, 0]
col_x_b u8 [0, 1, 0, 0]
col_x_c u8 [0, 0, 0, 1]
>
Or multiple columns:
iex> df = Explorer.DataFrame.new(col_x: ["a", "b", "a", "c"], col_y: ["b", "a", "b", "d"])
iex> Explorer.DataFrame.dummies(df, ["col_x", "col_y"])
#Explorer.DataFrame<
Polars[4 x 6]
col_x_a u8 [1, 0, 1, 0]
col_x_b u8 [0, 1, 0, 0]
col_x_c u8 [0, 0, 0, 1]
col_y_b u8 [1, 0, 1, 0]
col_y_a u8 [0, 1, 0, 0]
col_y_d u8 [0, 0, 0, 1]
>
Or all string columns:
iex> df = Explorer.DataFrame.new(num: [1, 2, 3, 4], col_y: ["b", "a", "b", "d"])
iex> Explorer.DataFrame.dummies(df, fn _name, type -> type == :string end)
#Explorer.DataFrame<
Polars[4 x 3]
col_y_b u8 [1, 0, 1, 0]
col_y_a u8 [0, 1, 0, 0]
col_y_d u8 [0, 0, 0, 1]
>
Ranges, regexes, and functions are also accepted in column names, as in select/2.
@spec explode(df :: t(), column :: column_name() | [column_name()]) :: t()
Explodes one or multiple list columns into multiple rows.
When exploding multiple columns, the number of list elements in each row for the exploded columns must be the same.
Examples
iex> df = Explorer.DataFrame.new(a: [[1, 2], [3, 4]], b: [[5, 6], [7, 8]], c: ["a", "b"])
iex> Explorer.DataFrame.explode(df, :a)
#Explorer.DataFrame<
Polars[4 x 3]
a s64 [1, 2, 3, 4]
b list[s64] [[5, 6], [5, 6], [7, 8], [7, ...]]
c string ["a", "a", "b", "b"]
>
iex> Explorer.DataFrame.explode(df, [:a, :b])
#Explorer.DataFrame<
Polars[4 x 3]
a s64 [1, 2, 3, 4]
b s64 [5, 6, 7, 8]
c string ["a", "a", "b", "b"]
>
You can think of exploding multiple list columns as the inverse of aggregating the elements of the exploded columns into lists:
iex> df = Explorer.DataFrame.new(a: [1, 2, 3, 4], b: [5, 6, 7, 8], c: ["a", "a", "b", "b"])
iex> df = df |> Explorer.DataFrame.group_by(:c) |> Explorer.DataFrame.summarise(a: a, b: b)
#Explorer.DataFrame<
Polars[2 x 3]
c string ["a", "b"]
a list[s64] [[1, 2], [3, 4]]
b list[s64] [[5, 6], [7, 8]]
>
iex> Explorer.DataFrame.explode(df, [:a, :b]) # we are back where we started
#Explorer.DataFrame<
Polars[4 x 3]
c string ["a", "a", "b", "b"]
a s64 [1, 2, 3, 4]
b s64 [5, 6, 7, 8]
>
If you want to perform the cartesian product of two list columns, you must call explode/2 once for each column:
iex> df = Explorer.DataFrame.new(a: [[1, 2], [3, 4]], b: [[5, 6], [7, 8]], c: ["a", "b"])
iex> df |> Explorer.DataFrame.explode(:a) |> Explorer.DataFrame.explode(:b)
#Explorer.DataFrame<
Polars[8 x 3]
a s64 [1, 1, 2, 2, 3, ...]
b s64 [5, 6, 5, 6, 7, ...]
c string ["a", "a", "a", "a", "b", ...]
>
Picks rows based on Explorer.Query.
The query is compiled and runs efficiently against the dataframe.
The query must return a boolean expression or a list of boolean expressions.
When a list is returned, they are joined as and expressions.
Notice
This is a macro. You must require Explorer.DataFrame before using it.
Besides element-wise series operations, you can also use window functions and aggregations inside comparisons. In such cases, grouped dataframes may have different results than ungrouped ones, because the filtering is computed within groups. See the examples below.
See filter_with/2 for a callback version of this function without Explorer.Query.
Examples
iex> df = Explorer.DataFrame.new(col1: ["a", "b", "c"], col2: [1, 2, 3])
iex> Explorer.DataFrame.filter(df, col2 > 2)
#Explorer.DataFrame<
Polars[1 x 2]
col1 string ["c"]
col2 s64 [3]
>
iex> df = Explorer.DataFrame.new(col1: ["a", "b", "c"], col2: [1, 2, 3])
iex> Explorer.DataFrame.filter(df, col1 == "b")
#Explorer.DataFrame<
Polars[1 x 2]
col1 string ["b"]
col2 s64 [2]
>
iex> df = Explorer.DataFrame.new(col1: [5, 4, 3], col2: [1, 2, 3])
iex> Explorer.DataFrame.filter(df, [col1 > 3, col2 < 3])
#Explorer.DataFrame<
Polars[2 x 2]
col1 s64 [5, 4]
col2 s64 [1, 2]
>
Returning a non-boolean expression errors:
iex> df = Explorer.DataFrame.new(col1: ["a", "b", "c"], col2: [1, 2, 3])
iex> Explorer.DataFrame.filter(df, cumulative_max(col2))
** (ArgumentError) expecting the function to return a boolean LazySeries, but instead it returned a LazySeries of type {:s, 64}
Which can be addressed by converting it to boolean:
iex> df = Explorer.DataFrame.new(col1: ["a", "b", "c"], col2: [1, 2, 3])
iex> Explorer.DataFrame.filter(df, cumulative_max(col2) == 1)
#Explorer.DataFrame<
Polars[1 x 2]
col1 string ["a"]
col2 s64 [1]
>
Grouped examples
In a grouped dataframe, the aggregation is calculated within each group.
In the following example we select the flowers of the Iris dataset that have the "petal length" above the average of each species group.
iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.filter(grouped, petal_length > mean(petal_length))
#Explorer.DataFrame<
Polars[79 x 5]
Groups: ["species"]
sepal_length f64 [4.6, 5.4, 5.0, 4.9, 5.4, ...]
sepal_width f64 [3.1, 3.9, 3.4, 3.1, 3.7, ...]
petal_length f64 [1.5, 1.7, 1.5, 1.5, 1.5, ...]
petal_width f64 [0.2, 0.4, 0.2, 0.1, 0.2, ...]
species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", ...]
>
@spec filter_with( df :: t(), callback :: (Explorer.Backend.LazyFrame.t() -> Explorer.Series.lazy_t()) ) :: t()
Picks rows based on a callback function.
The callback receives a lazy dataframe. A lazy dataframe does not hold any values; instead it stores all operations so that the filtering can be executed efficiently.
This is a callback version of filter/2.
Examples
iex> df = Explorer.DataFrame.new(col1: ["a", "b", "c"], col2: [1, 2, 3])
iex> Explorer.DataFrame.filter_with(df, &Explorer.Series.greater(&1["col2"], 2))
#Explorer.DataFrame<
Polars[1 x 2]
col1 string ["c"]
col2 s64 [3]
>
iex> df = Explorer.DataFrame.new(col1: ["a", "b", "c"], col2: [1, 2, 3])
iex> Explorer.DataFrame.filter_with(df, fn df -> Explorer.Series.equal(df["col1"], "b") end)
#Explorer.DataFrame<
Polars[1 x 2]
col1 string ["b"]
col2 s64 [2]
>
Grouped examples
In a grouped dataframe, the aggregation is calculated within each group.
In the following example, we select the flowers in the Iris dataset whose "petal length" is above the average for their species group.
iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.filter_with(grouped, &Explorer.Series.greater(&1["petal_length"], Explorer.Series.mean(&1["petal_length"])))
#Explorer.DataFrame<
Polars[79 x 5]
Groups: ["species"]
sepal_length f64 [4.6, 5.4, 5.0, 4.9, 5.4, ...]
sepal_width f64 [3.1, 3.9, 3.4, 3.1, 3.7, ...]
petal_length f64 [1.5, 1.7, 1.5, 1.5, 1.5, ...]
petal_width f64 [0.2, 0.4, 0.2, 0.1, 0.2, ...]
species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", ...]
>
@spec frequencies(df :: t(), columns :: column_names()) :: t()
Creates a new dataframe with unique rows and the frequencies of each.
Examples
iex> df = Explorer.DataFrame.new(a: ["a", "a", "b"], b: [1, 1, nil])
iex> Explorer.DataFrame.frequencies(df, [:a, :b])
#Explorer.DataFrame<
Polars[2 x 3]
a string ["a", "b"]
b s64 [1, nil]
counts u32 [2, 1]
>
@spec group_by(df :: t(), groups_or_group :: column_names() | column_name()) :: t()
Groups the dataframe by one or more variables.
When the dataframe has grouping variables, operations are performed per group.
Explorer.DataFrame.ungroup/2
removes grouping.
Examples
You can group by a single variable:
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.group_by(df, "country")
#Explorer.DataFrame<
Polars[1094 x 10]
Groups: ["country"]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total s64 [2308, 1254, 32500, 141, 7924, ...]
solid_fuel s64 [627, 117, 332, 0, 0, ...]
liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
gas_fuel s64 [74, 7, 14565, 0, 374, ...]
cement s64 [5, 177, 2598, 0, 204, ...]
gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>
Or you can group by multiple columns in a given list:
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.group_by(df, ["country", "year"])
#Explorer.DataFrame<
Polars[1094 x 10]
Groups: ["country", "year"]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total s64 [2308, 1254, 32500, 141, 7924, ...]
solid_fuel s64 [627, 117, 332, 0, 0, ...]
liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
gas_fuel s64 [74, 7, 14565, 0, 374, ...]
cement s64 [5, 177, 2598, 0, 204, ...]
gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>
Or by a range:
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.group_by(df, 0..1)
#Explorer.DataFrame<
Polars[1094 x 10]
Groups: ["year", "country"]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total s64 [2308, 1254, 32500, 141, 7924, ...]
solid_fuel s64 [627, 117, 332, 0, 0, ...]
liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
gas_fuel s64 [74, 7, 14565, 0, 374, ...]
cement s64 [5, 177, 2598, 0, 204, ...]
gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>
Regexes and functions are also accepted for selecting columns, as in select/2.
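Grouping is removed with Explorer.DataFrame.ungroup/2. A minimal sketch (it reads the groups field of the DataFrame struct, and assumes the same fossil fuels dataset as above):

```elixir
# A sketch of ungroup/2: drop only the "year" grouping,
# leaving "country" as the remaining group.
df = Explorer.Datasets.fossil_fuels()
grouped = Explorer.DataFrame.group_by(df, ["country", "year"])

partially = Explorer.DataFrame.ungroup(grouped, ["year"])
# partially.groups == ["country"]

ungrouped = Explorer.DataFrame.ungroup(grouped)
# ungrouped.groups == []
```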
@spec mask(df :: t(), mask :: Explorer.Series.t() | [boolean()]) :: t()
Picks rows based on a list or series of values.
Examples
This function must only be used when you need to select rows based on external values that are not available to the dataframe. For example, you can pass a list:
iex> df = Explorer.DataFrame.new(col1: ["a", "b", "c"], col2: [1, 2, 3])
iex> Explorer.DataFrame.mask(df, [false, true, false])
#Explorer.DataFrame<
Polars[1 x 2]
col1 string ["b"]
col2 s64 [2]
>
You must avoid using masks when the masks themselves are computed from other columns. For example, DO NOT do this:
iex> df = Explorer.DataFrame.new(col1: ["a", "b", "c"], col2: [1, 2, 3])
iex> Explorer.DataFrame.mask(df, Explorer.Series.greater(df["col2"], 1))
#Explorer.DataFrame<
Polars[2 x 2]
col1 string ["b", "c"]
col2 s64 [2, 3]
>
Instead, do this:
iex> df = Explorer.DataFrame.new(col1: ["a", "b", "c"], col2: [1, 2, 3])
iex> Explorer.DataFrame.filter_with(df, fn df -> Explorer.Series.greater(df["col2"], 1) end)
#Explorer.DataFrame<
Polars[2 x 2]
col1 string ["b", "c"]
col2 s64 [2, 3]
>
The filter_with/2
version is much more efficient because it doesn't need
to create intermediate series representations to apply the mask.
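To make the intended use concrete, here is a sketch where the mask genuinely comes from outside the dataframe: the booleans are computed from row positions, not from any column.

```elixir
df = Explorer.DataFrame.new(col1: ["a", "b", "c"], col2: [1, 2, 3])

# Keep every other row, based on external row positions.
keep = Enum.map(0..2, fn i -> rem(i, 2) == 0 end)
# keep == [true, false, true]

masked = Explorer.DataFrame.mask(df, keep)
# masked has col1 ["a", "c"] and col2 [1, 3]
```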
Creates or modifies columns based on Explorer.Query.
The query is compiled and runs efficiently against the dataframe. New variables overwrite existing variables of the same name. Column names are coerced from atoms to strings.
Notice
This is a macro. You must
require Explorer.DataFrame
before using it.
Besides element-wise series operations, you can also use window functions and aggregations inside mutations. In such cases, grouped dataframes may have different results than ungrouped ones, because the mutation is computed within groups. See examples below.
See mutate_with/3 for a callback version of this function without Explorer.Query. If your mutation cannot be expressed with queries, you may compute the values using the Explorer.Series API directly and then add them to the dataframe using put/3.
Options
keep - Controls which columns should be kept in the resulting dataframe. Supported options for keep are:
:all - Retains all columns from the input dataframe (default).
:none - Retains only the columns created by the mutation (and grouping columns if operating on a grouped DataFrame).
Notice
You will need to wrap your mutations in [] in order to use the :keep option.
Examples
Mutations are useful to add or modify columns in your dataframe:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.mutate(df, c: b + 1)
#Explorer.DataFrame<
Polars[3 x 3]
a string ["a", "b", "c"]
b s64 [1, 2, 3]
c s64 [2, 3, 4]
>
It's also possible to overwrite existing columns:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.mutate(df, a: b * 2)
#Explorer.DataFrame<
Polars[3 x 2]
a s64 [2, 4, 6]
b s64 [1, 2, 3]
>
Scalar values are repeated to fill the series:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.mutate(df, a: 4)
#Explorer.DataFrame<
Polars[3 x 2]
a s64 [4, 4, 4]
b s64 [1, 2, 3]
>
It's also possible to use functions from the Series module, like Explorer.Series.window_sum/3
:
iex> df = Explorer.DataFrame.new(a: [1, 2, 3])
iex> Explorer.DataFrame.mutate(df, b: window_sum(a, 2))
#Explorer.DataFrame<
Polars[3 x 2]
a s64 [1, 2, 3]
b s64 [1, 3, 5]
>
Alternatively, all of the above works with a map instead of a keyword list:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.mutate(df, %{"c" => cast(b, {:f, 64})})
#Explorer.DataFrame<
Polars[3 x 3]
a string ["a", "b", "c"]
b s64 [1, 2, 3]
c f64 [1.0, 2.0, 3.0]
>
Using the :keep
option to retain only new columns:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.mutate(df, [c: b + 1], keep: :none)
#Explorer.DataFrame<
Polars[3 x 1]
c s64 [2, 3, 4]
>
Grouped examples
Mutations in grouped dataframes take the context of the group, which enables aggregations to be computed per group. It's almost like summarise/2, but the results are repeated for each member of the group.
For example, if we want to count how many elements are in each group, we can add a new column with that aggregation:
iex> df = Explorer.DataFrame.new(id: ["a", "a", "b"], b: [1, 2, 3])
iex> grouped = Explorer.DataFrame.group_by(df, :id)
iex> Explorer.DataFrame.mutate(grouped, count: count(b))
#Explorer.DataFrame<
Polars[3 x 3]
Groups: ["id"]
id string ["a", "a", "b"]
b s64 [1, 2, 3]
count u32 [2, 2, 1]
>
In case we want to get the average size of the petal length from the Iris dataset, we can:
iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.mutate(grouped, petal_length_avg: mean(petal_length))
#Explorer.DataFrame<
Polars[150 x 6]
Groups: ["species"]
sepal_length f64 [5.1, 4.9, 4.7, 4.6, 5.0, ...]
sepal_width f64 [3.5, 3.0, 3.2, 3.1, 3.6, ...]
petal_length f64 [1.4, 1.4, 1.3, 1.5, 1.4, ...]
petal_width f64 [0.2, 0.2, 0.2, 0.2, 0.2, ...]
species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", ...]
petal_length_avg f64 [1.464, 1.464, 1.464, 1.464, 1.464, ...]
>
We can also use the keep
option on grouped dataframes:
iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.mutate(grouped, [petal_length_avg: mean(petal_length)], keep: :none)
#Explorer.DataFrame<
Polars[150 x 2]
Groups: ["species"]
species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", ...]
petal_length_avg f64 [1.464, 1.464, 1.464, 1.464, 1.464, ...]
>
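Aggregations can also be combined with element-wise operations in the same expression. A sketch that centers a column within each group (the petal_centered column name is our own choice):

```elixir
require Explorer.DataFrame

df = Explorer.Datasets.iris()
grouped = Explorer.DataFrame.group_by(df, "species")

# mean/1 is computed per group and broadcast back to each row,
# so each flower is compared against its own species' average.
centered =
  Explorer.DataFrame.mutate(grouped, petal_centered: petal_length - mean(petal_length))
```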
@spec mutate_with( df :: t(), callback :: (Explorer.Backend.LazyFrame.t() -> column_pairs(Explorer.Series.lazy_t())), opts :: keyword() ) :: t()
Creates or modifies columns using a callback function and an
optional keep
argument.
The callback receives a lazy dataframe. A lazy dataframe doesn't hold any values; instead, it stores all operations so that the mutations can be executed efficiently.
This is a callback version of mutate/3. If your mutation cannot be expressed with lazy dataframes, you may compute the values using the Explorer.Series API directly and then add them to the dataframe using put/3.
Options
keep - Controls which columns should be kept in the resulting dataframe. Supported options for keep are:
:all - Retains all columns from the input dataframe (default).
:none - Retains only the columns created by the mutation (and grouping columns if operating on a grouped DataFrame).
Examples
Here is an example of a new column that sums the value of two other columns:
iex> df = Explorer.DataFrame.new(a: [4, 5, 6], b: [1, 2, 3])
iex> Explorer.DataFrame.mutate_with(df, &[c: Explorer.Series.add(&1["a"], &1["b"])])
#Explorer.DataFrame<
Polars[3 x 3]
a s64 [4, 5, 6]
b s64 [1, 2, 3]
c s64 [5, 7, 9]
>
You can overwrite existing columns as well:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.mutate_with(df, &[b: Explorer.Series.pow(&1["b"], 2)])
#Explorer.DataFrame<
Polars[3 x 2]
a string ["a", "b", "c"]
b s64 [1, 4, 9]
>
It's possible to "reuse" a variable for different computations:
iex> df = Explorer.DataFrame.new(a: [4, 5, 6], b: [1, 2, 3])
iex> Explorer.DataFrame.mutate_with(df, fn ldf ->
iex> c = Explorer.Series.add(ldf["a"], ldf["b"])
iex> [c: c, d: Explorer.Series.window_sum(c, 2)]
iex> end)
#Explorer.DataFrame<
Polars[3 x 4]
a s64 [4, 5, 6]
b s64 [1, 2, 3]
c s64 [5, 7, 9]
d s64 [5, 12, 16]
>
Grouped examples
Mutations in grouped dataframes take the context of the group. For example, if we want to count how many elements are in each group, we can add a new column with that aggregation:
iex> df = Explorer.DataFrame.new(id: ["a", "a", "b"], b: [1, 2, 3])
iex> grouped = Explorer.DataFrame.group_by(df, :id)
iex> Explorer.DataFrame.mutate_with(grouped, &[count: Explorer.Series.count(&1["b"])])
#Explorer.DataFrame<
Polars[3 x 3]
Groups: ["id"]
id string ["a", "a", "b"]
b s64 [1, 2, 3]
count u32 [2, 2, 1]
>
Examples with the keep
option
Here is an example of a new column that sums the value of two other columns:
iex> df = Explorer.DataFrame.new(a: [4, 5, 6], b: [1, 2, 3])
iex> Explorer.DataFrame.mutate_with(df, &[c: Explorer.Series.add(&1["a"], &1["b"])], keep: :all)
#Explorer.DataFrame<
Polars[3 x 3]
a s64 [4, 5, 6]
b s64 [1, 2, 3]
c s64 [5, 7, 9]
>
Note that keep: :all
is the default behavior.
Using the :keep
option to retain only new columns:
iex> df = Explorer.DataFrame.new(a: [4, 5, 6], b: [1, 2, 3])
iex> Explorer.DataFrame.mutate_with(df, &[c: Explorer.Series.add(&1["a"], &1["b"])], keep: :none)
#Explorer.DataFrame<
Polars[3 x 1]
c s64 [5, 7, 9]
>
When using keep: :none
on a grouped DataFrame, the grouping
columns are retained along with the newly created columns:
iex> df = Explorer.DataFrame.new(name: ["a", "a", "b", "b"], a: [1, 2, 3, 4], b: [5, 6, 7, 8])
iex> grouped = Explorer.DataFrame.group_by(df, :name)
iex> Explorer.DataFrame.mutate_with(grouped, &[ab_sum: Explorer.Series.sum(Explorer.Series.add(&1["a"], &1["b"]))], keep: :none)
#Explorer.DataFrame<
Polars[4 x 2]
Groups: ["name"]
name string ["a", "a", "b", "b"]
ab_sum s64 [14, 14, 22, 22]
>
Note that the name
column is retained because it's the
grouping column, even though keep: :none
is specified.
The a
and b
columns are dropped, and only the new ab_sum
column is added.
Counts the number of nil
elements in each column.
Examples
iex> df = Explorer.DataFrame.new(a: ["d", nil, "f"], b: [nil, 2, nil], c: ["a", "b", "c"])
iex> Explorer.DataFrame.nil_count(df)
#Explorer.DataFrame<
Polars[1 x 3]
a u32 [1]
b u32 [2]
c u32 [0]
>
Pivot data from wide to long.
pivot_longer/3 "lengthens" data, increasing the number of rows and decreasing the number of columns. The inverse transformation is pivot_wider/4.
The second argument, columns_to_pivot, is the list of column names to pivot. Ranges, regexes, and functions are also accepted, as in select/2.
The selected columns must always have the same data type.
If the dataframe has groups, the groups that are also in the list of columns to pivot will be removed from the resultant dataframe. See the examples below.
Options
:select
- Columns that are not in the list of pivot and should be kept in the dataframe. Ranges, regexes, and functions are also accepted in column names, as inselect/2
. Defaults to all columns except the ones to pivot.:discard
- Columns that are not in the list of pivot and should be dropped from the dataframe. Ranges, regexes, and functions are also accepted in column names, as inselect/2
. This list of columns is going to be subtracted from the list ofselect
. Defaults to an empty list.:names_to
- A string specifying the name of the column to create from the data stored in the column names of the dataframe. Defaults to"variable"
.:values_to
- A string specifying the name of the column to create from the data stored in series element values. Defaults to"value"
.
Examples
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.pivot_longer(df, &String.ends_with?(&1, "fuel"))
#Explorer.DataFrame<
Polars[3282 x 9]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total s64 [2308, 1254, 32500, 141, 7924, ...]
cement s64 [5, 177, 2598, 0, 204, ...]
gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels s64 [9, 7, 663, 0, 321, ...]
variable string ["solid_fuel", "solid_fuel", "solid_fuel", "solid_fuel", "solid_fuel", ...]
value s64 [627, 117, 332, 0, 0, ...]
>
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.pivot_longer(df, &String.ends_with?(&1, "fuel"), select: ["year", "country"])
#Explorer.DataFrame<
Polars[3282 x 4]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
variable string ["solid_fuel", "solid_fuel", "solid_fuel", "solid_fuel", "solid_fuel", ...]
value s64 [627, 117, 332, 0, 0, ...]
>
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.pivot_longer(df, ["total"], select: ["year", "country"], discard: ["country"])
#Explorer.DataFrame<
Polars[1094 x 3]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
variable string ["total", "total", "total", "total", "total", ...]
value s64 [2308, 1254, 32500, 141, 7924, ...]
>
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.pivot_longer(df, ["total"], select: [], names_to: "my_var", values_to: "my_value")
#Explorer.DataFrame<
Polars[1094 x 2]
my_var string ["total", "total", "total", "total", "total", ...]
my_value s64 [2308, 1254, 32500, 141, 7924, ...]
>
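Because regexes are accepted as column selectors, the first example above can also be written with a regex. A sketch, assuming ~r/fuel$/ matches the same solid_fuel, liquid_fuel, and gas_fuel columns as the function version:

```elixir
df = Explorer.Datasets.fossil_fuels()

# ~r/fuel$/ selects the columns ending in "fuel", equivalent to
# &String.ends_with?(&1, "fuel") in the first example above.
longer = Explorer.DataFrame.pivot_longer(df, ~r/fuel$/, select: ["year", "country"])
```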
Grouped examples
In the following example we want to take the Iris dataset and increase the number of rows by pivoting the "sepal_length" column. This dataset is grouped by "species", so the resultant dataframe is going to keep the "species" group:
iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.pivot_longer(grouped, ["sepal_length"])
#Explorer.DataFrame<
Polars[150 x 6]
Groups: ["species"]
sepal_width f64 [3.5, 3.0, 3.2, 3.1, 3.6, ...]
petal_length f64 [1.4, 1.4, 1.3, 1.5, 1.4, ...]
petal_width f64 [0.2, 0.2, 0.2, 0.2, 0.2, ...]
species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", ...]
variable string ["sepal_length", "sepal_length", "sepal_length", "sepal_length", "sepal_length", ...]
value f64 [5.1, 4.9, 4.7, 4.6, 5.0, ...]
>
Now we want to do something different: we want to pivot the "species" column, which is also a group. This is going to remove the group from the resultant dataframe:
iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.pivot_longer(grouped, ["species"])
#Explorer.DataFrame<
Polars[150 x 6]
sepal_length f64 [5.1, 4.9, 4.7, 4.6, 5.0, ...]
sepal_width f64 [3.5, 3.0, 3.2, 3.1, 3.6, ...]
petal_length f64 [1.4, 1.4, 1.3, 1.5, 1.4, ...]
petal_width f64 [0.2, 0.2, 0.2, 0.2, 0.2, ...]
variable string ["species", "species", "species", "species", "species", ...]
value string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", ...]
>
@spec pivot_wider( df :: t(), names_from :: column(), values_from :: column() | columns(), opts :: Keyword.t() ) :: t()
Pivot data from long to wide.
pivot_wider/4 "widens" data, increasing the number of columns and decreasing the number of rows. The inverse transformation is pivot_longer/3.
If the dataframe has groups, the groups that are also in the list of columns to pivot will be removed from the resultant dataframe. See the examples below.
Options
:id_columns
- A set of columns that uniquely identifies each observation. Defaults to all columns in data except for the columns specified in
names_from
andvalues_from
, and columns that are of the{:f, 64}
dtype.Typically used when you have redundant variables, i.e. variables whose values are perfectly correlated with existing variables. May accept a filter callback, a list or a range of column names. If an empty list is passed, or a range that results in a empty list of column names, it raises an error.
ID columns cannot be of the float type, and any columns of this dtype are discarded. If you need to use float columns as IDs, you must carefully consider rounding or truncating the column and converting it to integer, as long as doing so preserves the properties of the column.
:names_prefix
- String added to the start of every variable name. This is particularly useful ifnames_from
is a numeric vector and you want to create syntactic variable names.
Examples
Suppose we have a basketball court and multiple teams that want to train on that court. They need to share a schedule with the hours each team is going to use it. Here is a dataframe representing that schedule:
iex> Explorer.DataFrame.new(
iex> weekday: ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday"],
iex> team: ["A", "B", "C", "A", "B", "C", "A", "B", "C", "A"],
iex> hour: [10, 9, 10, 10, 11, 15, 14, 16, 14, 16]
iex> )
This dataframe is going to look like this - using print/2
:
+----------------------------------------------+
| Explorer DataFrame: [rows: 10, columns: 3] |
+---------------+--------------+---------------+
| weekday | team | hour |
| <string> | <string> | <integer> |
+===============+==============+===============+
| Monday | A | 10 |
+---------------+--------------+---------------+
| Tuesday | B | 9 |
+---------------+--------------+---------------+
| Wednesday | C | 10 |
+---------------+--------------+---------------+
| Thursday | A | 10 |
+---------------+--------------+---------------+
| Friday | B | 11 |
+---------------+--------------+---------------+
| Monday | C | 15 |
+---------------+--------------+---------------+
| Tuesday | A | 14 |
+---------------+--------------+---------------+
| Wednesday | B | 16 |
+---------------+--------------+---------------+
| Thursday | C | 14 |
+---------------+--------------+---------------+
| Friday | A | 16 |
+---------------+--------------+---------------+
You can see that the "weekday" repeats, and it's not clear how free the agenda is. We can solve that by pivoting the "weekday" column into multiple columns, making each weekday a new column in the resultant dataframe.
iex> df = Explorer.DataFrame.new(
iex> weekday: ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday"],
iex> team: ["A", "B", "C", "A", "B", "C", "A", "B", "C", "A"],
iex> hour: [10, 9, 10, 10, 11, 15, 14, 16, 14, 16]
iex> )
iex> Explorer.DataFrame.pivot_wider(df, "weekday", "hour")
#Explorer.DataFrame<
Polars[3 x 6]
team string ["A", "B", "C"]
Monday s64 [10, nil, 15]
Tuesday s64 [14, 9, nil]
Wednesday s64 [nil, 16, 10]
Thursday s64 [10, nil, 14]
Friday s64 [16, 11, nil]
>
Now if we print that same dataframe with print/2
, we get a better picture of the schedule:
+----------------------------------------------------------------------+
| Explorer DataFrame: [rows: 3, columns: 6] |
+----------+-----------+-----------+-----------+-----------+-----------+
| team | Monday | Tuesday | Wednesday | Thursday | Friday |
| <string> | <integer> | <integer> | <integer> | <integer> | <integer> |
+==========+===========+===========+===========+===========+===========+
| A | 10 | 14 | | 10 | 16 |
+----------+-----------+-----------+-----------+-----------+-----------+
| B | | 9 | 16 | | 11 |
+----------+-----------+-----------+-----------+-----------+-----------+
| C | 15 | | 10 | 14 | |
+----------+-----------+-----------+-----------+-----------+-----------+
Pivot wider can create unpredictable column names, and sometimes they can conflict with ID columns. In that scenario, we add a number as a suffix to duplicated column names. Here is an example:
iex> df = Explorer.DataFrame.new(
iex> product_id: [1, 1, 1, 1, 2, 2, 2, 2],
iex> property: ["product_id", "width_cm", "height_cm", "length_cm", "product_id", "width_cm", "height_cm", "length_cm"],
iex> property_value: [1, 42, 40, 64, 2, 35, 20, 40]
iex> )
iex> Explorer.DataFrame.pivot_wider(df, "property", "property_value")
#Explorer.DataFrame<
Polars[2 x 5]
product_id s64 [1, 2]
product_id_1 s64 [1, 2]
width_cm s64 [42, 35]
height_cm s64 [40, 20]
length_cm s64 [64, 40]
>
But if the option :names_prefix
is used, that suffix is not added:
iex> df = Explorer.DataFrame.new(
iex> product_id: [1, 1, 1, 1, 2, 2, 2, 2],
iex> property: ["product_id", "width_cm", "height_cm", "length_cm", "product_id", "width_cm", "height_cm", "length_cm"],
iex> property_value: [1, 42, 40, 64, 2, 35, 20, 40]
iex> )
iex> Explorer.DataFrame.pivot_wider(df, "property", "property_value", names_prefix: "col_")
#Explorer.DataFrame<
Polars[2 x 5]
product_id s64 [1, 2]
col_product_id s64 [1, 2]
col_width_cm s64 [42, 35]
col_height_cm s64 [40, 20]
col_length_cm s64 [64, 40]
>
Multiple columns are accepted for the values_from
parameter, but the behaviour is slightly
different for the naming of new columns in the resultant dataframe. The new columns are going
to be prefixed by the name of the original value column, followed by the name of the variable.
iex> df = Explorer.DataFrame.new(
iex> product_id: [1, 1, 1, 1, 2, 2, 2, 2],
iex> property: ["product_id", "width_cm", "height_cm", "length_cm", "product_id", "width_cm", "height_cm", "length_cm"],
iex> property_value: [1, 42, 40, 64, 2, 35, 20, 40],
iex> another_value: [1, 43, 41, 65, 2, 36, 21, 42]
iex> )
iex> Explorer.DataFrame.pivot_wider(df, "property", ["property_value", "another_value"])
#Explorer.DataFrame<
Polars[2 x 9]
product_id s64 [1, 2]
property_value_product_id s64 [1, 2]
property_value_width_cm s64 [42, 35]
property_value_height_cm s64 [40, 20]
property_value_length_cm s64 [64, 40]
another_value_product_id s64 [1, 2]
another_value_width_cm s64 [43, 36]
another_value_height_cm s64 [41, 21]
another_value_length_cm s64 [65, 42]
>
Grouped examples
Now using the same idea, we can see that there is not much difference for grouped dataframes. The only detail is that groups that are not ID columns are discarded.
iex> df = Explorer.DataFrame.new(
iex> weekday: ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday"],
iex> team: ["A", "B", "C", "A", "B", "C", "A", "B", "C", "A"],
iex> hour: [10, 9, 10, 10, 11, 15, 14, 16, 14, 16]
iex> )
iex> grouped = Explorer.DataFrame.group_by(df, "team")
iex> Explorer.DataFrame.pivot_wider(grouped, "weekday", "hour")
#Explorer.DataFrame<
Polars[3 x 6]
Groups: ["team"]
team string ["A", "B", "C"]
Monday s64 [10, nil, 15]
Tuesday s64 [14, 9, nil]
Wednesday s64 [nil, 16, 10]
Thursday s64 [10, nil, 14]
Friday s64 [16, 11, nil]
>
In the following example the group "weekday" is going to be removed, because the column is going to be pivoted into multiple columns:
iex> df = Explorer.DataFrame.new(
iex> weekday: ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday"],
iex> team: ["A", "B", "C", "A", "B", "C", "A", "B", "C", "A"],
iex> hour: [10, 9, 10, 10, 11, 15, 14, 16, 14, 16]
iex> )
iex> grouped = Explorer.DataFrame.group_by(df, "weekday")
iex> Explorer.DataFrame.pivot_wider(grouped, "weekday", "hour")
#Explorer.DataFrame<
Polars[3 x 6]
team string ["A", "B", "C"]
Monday s64 [10, nil, 15]
Tuesday s64 [14, 9, nil]
Wednesday s64 [nil, 16, 10]
Thursday s64 [10, nil, 14]
Friday s64 [16, 11, nil]
>
@spec pull(df :: t(), column :: column()) :: Explorer.Series.t()
Extracts a single column as a series.
This is equivalent to df[field]
for retrieving a single field.
The returned series will have its :name
field set to the column name.
Examples
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.pull(df, "total")
#Explorer.Series<
Polars[1094]
s64 [2308, 1254, 32500, 141, 7924, 41, 143, 51246, 1150, 684, 106589, 18408, 8366, 451, 7981, 16345, 403, 17192, 30222, 147, 1388, 166, 133, 5802, 1278, 114468, 47, 2237, 12030, 535, 58, 1367, 145806, 152, 152, 72, 141, 19703, 2393248, 20773, 44, 540, 19, 2064, 1900, 5501, 10465, 2102, 30428, 18122, ...]
>
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.pull(df, 2)
#Explorer.Series<
Polars[1094]
s64 [2308, 1254, 32500, 141, 7924, 41, 143, 51246, 1150, 684, 106589, 18408, 8366, 451, 7981, 16345, 403, 17192, 30222, 147, 1388, 166, 133, 5802, 1278, 114468, 47, 2237, 12030, 535, 58, 1367, 145806, 152, 152, 72, 141, 19703, 2393248, 20773, 44, 540, 19, 2064, 1900, 5501, 10465, 2102, 30428, 18122, ...]
>
Creates or modifies a single column.
This is a simplified way to add or modify one column, accepting a series, tensor, or a list.
If you are computing a series, it is preferable to use
mutate/2
or mutate_with/2
to compute the series and
modify it in a single step, as it is more powerful and
it handles both expressions and scalar values accordingly.
If you are passing tensors or lists, they will be automatically
converted to a series. By default, the new series will have the
same dtype as the existing series, unless the :dtype
option
is given. If there is no existing series, one is inferred from
the tensor/list.
Examples
iex> df = Explorer.DataFrame.new(a: [1, 2, 3])
iex> Explorer.DataFrame.put(df, :b, Explorer.Series.transform(df[:a], fn n -> n * 2 end))
#Explorer.DataFrame<
Polars[3 x 2]
a s64 [1, 2, 3]
b s64 [2, 4, 6]
>
iex> df = Explorer.DataFrame.new(a: [1, 2, 3])
iex> Explorer.DataFrame.put(df, :b, Explorer.Series.from_list([4, 5, 6]))
#Explorer.DataFrame<
Polars[3 x 2]
a s64 [1, 2, 3]
b s64 [4, 5, 6]
>
Grouped examples
If the dataframe is grouped, put/3 is going to ignore the groups.
So the series must be the same size as the entire dataframe.
iex> df = Explorer.DataFrame.new(a: [1, 2, 3])
iex> grouped = Explorer.DataFrame.group_by(df, "a")
iex> series = Explorer.Series.from_list([9, 8, 7])
iex> Explorer.DataFrame.put(grouped, :b, series)
#Explorer.DataFrame<
Polars[3 x 2]
Groups: ["a"]
a s64 [1, 2, 3]
b s64 [9, 8, 7]
>
Tensor examples
You can also put tensors into the dataframe:
iex> df = Explorer.DataFrame.new([])
iex> Explorer.DataFrame.put(df, :a, Nx.tensor([1, 2, 3]))
#Explorer.DataFrame<
Polars[3 x 1]
a s64 [1, 2, 3]
>
You can specify which dtype the tensor represents. For example, a tensor of s64 represents integers by default, but it may also represent timestamps in microseconds from the Unix epoch:
iex> df = Explorer.DataFrame.new([])
iex> Explorer.DataFrame.put(df, :a, Nx.tensor([1, 2, 3]), dtype: {:naive_datetime, :microsecond})
#Explorer.DataFrame<
Polars[3 x 1]
a naive_datetime[μs] [1970-01-01 00:00:00.000001, 1970-01-01 00:00:00.000002, 1970-01-01 00:00:00.000003]
>
If there is already a column where we want to place the tensor, the column dtype will be used automatically. This means that updating dataframes in place while preserving their types is straightforward:
iex> df = Explorer.DataFrame.new(a: [~N[1970-01-01 00:00:00]])
iex> Explorer.DataFrame.put(df, :a, Nx.tensor(529550625987654))
#Explorer.DataFrame<
Polars[1 x 1]
a naive_datetime[μs] [1986-10-13 01:23:45.987654]
>
This is particularly useful for categorical columns:
iex> cat = Explorer.Series.from_list(["foo", "bar", "baz"], dtype: :category)
iex> df = Explorer.DataFrame.new(a: cat)
iex> Explorer.DataFrame.put(df, :a, Nx.tensor([2, 1, 0]))
#Explorer.DataFrame<
Polars[3 x 1]
a category ["baz", "bar", "foo"]
>
On the other hand, if you try to put a floating tensor on
an integer column, an error will be raised unless a dtype
or dtype: :infer
is given:
iex> df = Explorer.DataFrame.new(a: [1, 2, 3])
iex> Explorer.DataFrame.put(df, :a, Nx.tensor(1.0, type: :f64))
** (ArgumentError) dtype {:s, 64} expects a tensor of type {:s, 64} but got type {:f, 64}
iex> df = Explorer.DataFrame.new(a: [1, 2, 3])
iex> Explorer.DataFrame.put(df, :a, Nx.tensor(1.0, type: :f64), dtype: {:f, 64})
#Explorer.DataFrame<
Polars[3 x 1]
a f64 [1.0, 1.0, 1.0]
>
iex> df = Explorer.DataFrame.new(a: [1, 2, 3])
iex> Explorer.DataFrame.put(df, :a, Nx.tensor(1.0, type: :f64), dtype: :infer)
#Explorer.DataFrame<
Polars[3 x 1]
a f64 [1.0, 1.0, 1.0]
>
List examples
Similar to tensors, we can also put lists in the dataframe:
iex> df = Explorer.DataFrame.new([])
iex> Explorer.DataFrame.put(df, :a, [1, 2, 3])
#Explorer.DataFrame<
Polars[3 x 1]
a s64 [1, 2, 3]
>
The same considerations as above apply.
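For instance, the :dtype option works for lists as well. A minimal sketch, assuming the same option semantics as in the tensor examples above:

```elixir
df = Explorer.DataFrame.new(a: [1, 2, 3])

# dtype: :infer asks Explorer to infer the dtype from the list
# instead of reusing the existing column's s64 dtype.
replaced = Explorer.DataFrame.put(df, :a, [1.0, 2.0, 3.0], dtype: :infer)
```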
Relocates columns.
Changes the column order within a DataFrame. The :before and :after options are mutually exclusive.
Providing no options will relocate the columns to the beginning of the DataFrame.
Options
:before
- Specifies to relocate before the given column. It can be an index or a column name.:after
- Specifies to relocate after the given column. It can be an index or a column name.
Examples
Relocate a single column
iex> df = Explorer.DataFrame.new(a: ["a", "b", "a"], b: [1, 3, 1], c: [nil, 5, 6])
iex> Explorer.DataFrame.relocate(df, "a", after: "c")
#Explorer.DataFrame<
Polars[3 x 3]
b s64 [1, 3, 1]
c s64 [nil, 5, 6]
a string ["a", "b", "a"]
>
Relocate (and reorder) multiple columns to the beginning
iex> df = Explorer.DataFrame.new(a: [1, 2], b: [5.1, 5.2], c: [4, 5], d: ["yes", "no"])
iex> Explorer.DataFrame.relocate(df, ["d", 1], before: 0)
#Explorer.DataFrame<
Polars[2 x 4]
d string ["yes", "no"]
b f64 [5.1, 5.2]
a s64 [1, 2]
c s64 [4, 5]
>
Relocate before another column
iex> df = Explorer.DataFrame.new(a: [1, 2], b: [5.1, 5.2], c: [4, 5], d: ["yes", "no"])
iex> Explorer.DataFrame.relocate(df, ["a", "c"], before: "b")
#Explorer.DataFrame<
Polars[2 x 4]
a s64 [1, 2]
c s64 [4, 5]
b f64 [5.1, 5.2]
d string ["yes", "no"]
>
@spec rename( df :: t(), names :: column_names() | column_pairs(column_name()) ) :: t()
Renames columns.
Renaming a column that is also a group will rename the group as well.
To apply a function to a subset of columns, see rename_with/3.
Examples
You can pass in a list of new names:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "a"], b: [1, 3, 1])
iex> Explorer.DataFrame.rename(df, ["c", "d"])
#Explorer.DataFrame<
Polars[3 x 2]
c string ["a", "b", "a"]
d s64 [1, 3, 1]
>
Or you can rename individual columns using keyword args:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "a"], b: [1, 3, 1])
iex> Explorer.DataFrame.rename(df, a: "first")
#Explorer.DataFrame<
Polars[3 x 2]
first string ["a", "b", "a"]
b s64 [1, 3, 1]
>
Or you can rename individual columns using a map:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "a"], b: [1, 3, 1])
iex> Explorer.DataFrame.rename(df, %{"a" => "first"})
#Explorer.DataFrame<
Polars[3 x 2]
first string ["a", "b", "a"]
b s64 [1, 3, 1]
>
Renames columns with a function.
Renaming a column that is also a group will rename the group as well.
Examples
If no columns are specified, it will apply the function to all column names:
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.rename_with(df, &String.upcase/1)
#Explorer.DataFrame<
Polars[1094 x 10]
YEAR s64 [2010, 2010, 2010, 2010, 2010, ...]
COUNTRY string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
TOTAL s64 [2308, 1254, 32500, 141, 7924, ...]
SOLID_FUEL s64 [627, 117, 332, 0, 0, ...]
LIQUID_FUEL s64 [1601, 953, 12381, 141, 3649, ...]
GAS_FUEL s64 [74, 7, 14565, 0, 374, ...]
CEMENT s64 [5, 177, 2598, 0, 204, ...]
GAS_FLARING s64 [0, 0, 2623, 0, 3697, ...]
PER_CAPITA f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
BUNKER_FUELS s64 [9, 7, 663, 0, 321, ...]
>
A callback can be used to filter the column names that will be renamed, similarly to select/2:
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.rename_with(df, &String.ends_with?(&1, "_fuel"), &String.trim_trailing(&1, "_fuel"))
#Explorer.DataFrame<
Polars[1094 x 10]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total s64 [2308, 1254, 32500, 141, 7924, ...]
solid s64 [627, 117, 332, 0, 0, ...]
liquid s64 [1601, 953, 12381, 141, 3649, ...]
gas s64 [74, 7, 14565, 0, 374, ...]
cement s64 [5, 177, 2598, 0, 204, ...]
gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>
Or you can just pass in the list of column names you'd like to apply the function to:
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.rename_with(df, ["total", "cement"], &String.upcase/1)
#Explorer.DataFrame<
Polars[1094 x 10]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
TOTAL s64 [2308, 1254, 32500, 141, 7924, ...]
solid_fuel s64 [627, 117, 332, 0, 0, ...]
liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
gas_fuel s64 [74, 7, 14565, 0, 374, ...]
CEMENT s64 [5, 177, 2598, 0, 204, ...]
gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>
Ranges, regexes, and functions are also accepted in column names, as in select/2.
Selects a subset of columns by name.
It's important to note that groups are kept: you can't deselect grouping columns.
Examples
You can select a single column:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.select(df, "a")
#Explorer.DataFrame<
Polars[3 x 1]
a string ["a", "b", "c"]
>
Or a list of names:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.select(df, ["a"])
#Explorer.DataFrame<
Polars[3 x 1]
a string ["a", "b", "c"]
>
You can also use a range or a list of integers:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3], c: [4, 5, 6])
iex> Explorer.DataFrame.select(df, [0, 1])
#Explorer.DataFrame<
Polars[3 x 2]
a string ["a", "b", "c"]
b s64 [1, 2, 3]
>
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3], c: [4, 5, 6])
iex> Explorer.DataFrame.select(df, 0..1)
#Explorer.DataFrame<
Polars[3 x 2]
a string ["a", "b", "c"]
b s64 [1, 2, 3]
>
Or you can use a callback function that takes the dataframe's names as its first argument:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.select(df, &String.starts_with?(&1, "b"))
#Explorer.DataFrame<
Polars[3 x 1]
b s64 [1, 2, 3]
>
Or, if you prefer, a regex:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.select(df, ~r/^b$/)
#Explorer.DataFrame<
Polars[3 x 1]
b s64 [1, 2, 3]
>
Or a callback function that takes names and types:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.select(df, fn _name, type -> type == {:s, 64} end)
#Explorer.DataFrame<
Polars[3 x 1]
b s64 [1, 2, 3]
>
Grouped examples
Columns that are also groups cannot be removed; you need to ungroup before removing these columns.
iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.select(grouped, ["sepal_width"])
#Explorer.DataFrame<
Polars[150 x 2]
Groups: ["species"]
sepal_width f64 [3.5, 3.0, 3.2, 3.1, 3.6, ...]
species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", ...]
>
Sorts rows by columns using Explorer.Query
.
Notice
This is a macro. You must require Explorer.DataFrame before using it.
See sort_with/2 for a callback version of this function without Explorer.Query.
Options
:nils - :first or :last. By default it is :last if direction is :asc, and :first otherwise.
:parallel - boolean. Whether to parallelize the sorting. By default it is true. Parallel sort isn't available for certain lazy operations; in those situations this option is ignored.
:stable - boolean. Determines if the sorting is stable (ties are guaranteed to maintain their order) or not. Unstable sorting may be more performant. By default it is false.
Examples
A single column name will sort ascending by that column:
iex> df = Explorer.DataFrame.new(a: ["b", "c", "a"], b: [1, 2, 3])
iex> Explorer.DataFrame.sort_by(df, a)
#Explorer.DataFrame<
Polars[3 x 2]
a string ["a", "b", "c"]
b s64 [3, 1, 2]
>
You can also sort descending:
iex> df = Explorer.DataFrame.new(a: ["b", "c", "a"], b: [1, 2, 3])
iex> Explorer.DataFrame.sort_by(df, desc: a)
#Explorer.DataFrame<
Polars[3 x 2]
a string ["c", "b", "a"]
b s64 [2, 1, 3]
>
You can specify how nils are sorted:
iex> df = Explorer.DataFrame.new(a: ["b", "c", nil, "a"])
iex> Explorer.DataFrame.sort_by(df, [desc: a], nils: :first)
#Explorer.DataFrame<
Polars[4 x 1]
a string [nil, "c", "b", "a"]
>
Sorting by more than one column sorts them in the order they are entered:
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.sort_by(df, asc: total, desc: country)
#Explorer.DataFrame<
Polars[1094 x 10]
year s64 [2010, 2010, 2011, 2011, 2012, ...]
country string ["NIUE", "TUVALU", "TUVALU", "NIUE", "NIUE", ...]
total s64 [1, 2, 2, 2, 2, ...]
solid_fuel s64 [0, 0, 0, 0, 0, ...]
liquid_fuel s64 [1, 2, 2, 2, 2, ...]
gas_fuel s64 [0, 0, 0, 0, 0, ...]
cement s64 [0, 0, 0, 0, 0, ...]
gas_flaring s64 [0, 0, 0, 0, 0, ...]
per_capita f64 [0.52, 0.0, 0.0, 1.04, 1.04, ...]
bunker_fuels s64 [0, 0, 0, 0, 0, ...]
>
Grouped examples
When used on a grouped dataframe, sort_by/2 sorts each group individually and then returns the entire dataframe with the existing groups. If one of the sort columns is also a group, sorting by that column has no effect; you need to first summarise the desired column and then sort by it.
Here is an example using the Iris dataset. We group by species and then try to sort the dataframe by species and sepal width, but only "sepal width" is taken into account because "species" is a group.
iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.sort_by(grouped, desc: species, asc: sepal_width)
#Explorer.DataFrame<
Polars[150 x 5]
Groups: ["species"]
sepal_length f64 [4.5, 4.4, 4.9, 4.8, 4.3, ...]
sepal_width f64 [2.3, 2.9, 3.0, 3.0, 3.0, ...]
petal_length f64 [1.3, 1.4, 1.4, 1.4, 1.1, ...]
petal_width f64 [0.3, 0.2, 0.2, 0.1, 0.1, ...]
species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", ...]
>
@spec sort_with( df :: t(), (Explorer.Backend.LazyFrame.t() -> Explorer.Series.lazy_t() | [Explorer.Series.lazy_t()] | [{:asc | :desc, Explorer.Series.lazy_t()}]), opts :: [nils: :first | :last, stable: boolean()] ) :: t()
Sorts rows by columns using a callback function.
The callback receives a lazy dataframe which stores operations instead of values for efficient sorting.
This is a callback version of sort_by/2.
Options
:nils - :first or :last. By default it is :last if direction is :asc, and :first otherwise.
:parallel - boolean. Whether to parallelize the sorting. By default it is true. Parallel sort isn't available for certain lazy operations; in those situations this option is ignored.
:stable - boolean. Determines if the sorting is stable (ties are guaranteed to maintain their order) or not. Unstable sorting may be more performant. By default it is false.
Examples
A single column name will sort ascending by that column:
iex> df = Explorer.DataFrame.new(a: ["b", "c", "a"], b: [1, 2, 3])
iex> Explorer.DataFrame.sort_with(df, &(&1["a"]))
#Explorer.DataFrame<
Polars[3 x 2]
a string ["a", "b", "c"]
b s64 [3, 1, 2]
>
You can also sort descending:
iex> df = Explorer.DataFrame.new(a: ["b", "c", "a"], b: [1, 2, 3])
iex> Explorer.DataFrame.sort_with(df, &[desc: &1["a"]])
#Explorer.DataFrame<
Polars[3 x 2]
a string ["c", "b", "a"]
b s64 [2, 1, 3]
>
You can specify how nils are sorted:
iex> df = Explorer.DataFrame.new(a: ["b", "c", nil, "a"])
iex> Explorer.DataFrame.sort_with(df, &[desc: &1["a"]], nils: :first)
#Explorer.DataFrame<
Polars[4 x 1]
a string [nil, "c", "b", "a"]
>
Sorting by more than one column sorts them in the order they are entered:
iex> df = Explorer.DataFrame.new(a: [3, 1, 3], b: [2, 1, 3])
iex> Explorer.DataFrame.sort_with(df, &[desc: &1["a"], asc: &1["b"]])
#Explorer.DataFrame<
Polars[3 x 2]
a s64 [3, 3, 1]
b s64 [2, 3, 1]
>
Grouped examples
iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.sort_with(grouped, &[desc: &1["species"], asc: &1["sepal_width"]])
#Explorer.DataFrame<
Polars[150 x 5]
Groups: ["species"]
sepal_length f64 [4.5, 4.4, 4.9, 4.8, 4.3, ...]
sepal_width f64 [2.3, 2.9, 3.0, 3.0, 3.0, ...]
petal_length f64 [1.3, 1.4, 1.4, 1.4, 1.1, ...]
petal_width f64 [0.3, 0.2, 0.2, 0.1, 0.1, ...]
species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", ...]
>
Create a dataframe from the result of a SQL query on an existing dataframe.
SQL Query is Unvalidated
Explorer does not validate the SQL query string provided. Instead it passes it directly to the backend. As such, the SQL dialect will be backend dependent and any errors will come directly from the backend.
from Clause
The from clause of the SQL query should reference the chosen table name. The default name is "df". See the examples for a custom table name.
Examples
Basic example:
iex> df = Explorer.DataFrame.new(a: [1, 2, 3], b: ["x", "y", "y"])
iex> Explorer.DataFrame.sql(df, "select ARRAY_AGG(a), b from df group by b order by b")
#Explorer.DataFrame<
Polars[2 x 2]
a list[s64] [[1], [2, 3]]
b string ["x", "y"]
>
Custom table name:
iex> df = Explorer.DataFrame.new(a: [1, 2, 3])
iex> Explorer.DataFrame.sql(df, "select a + 1 from my_table", table_name: "my_table")
#Explorer.DataFrame<
Polars[3 x 1]
a s64 [2, 3, 4]
>
Summarise each group to a single row using Explorer.Query.
To summarise, you must either perform an aggregation (defined in Explorer.Series) on the desired columns, or select plain columns to retrieve all elements of a group.
The query is compiled and runs efficiently against the dataframe. This function performs aggregations based on groups and implicitly ungroups the resultant dataframe.
Notice
This is a macro. You must require Explorer.DataFrame before using it.
See summarise_with/2 for a callback version of this function without Explorer.Query.
Examples
iex> df = Explorer.DataFrame.new(letters: ~w(a b c d e), is_vowel: [true, false, false, false, true])
iex> grouped_df = Explorer.DataFrame.group_by(df, :is_vowel)
iex> Explorer.DataFrame.summarise(grouped_df, letters: letters)
#Explorer.DataFrame<
Polars[2 x 2]
is_vowel boolean [true, false]
letters list[string] [["a", "e"], ["b", "c", "d"]]
>
iex> df = Explorer.Datasets.fossil_fuels()
iex> grouped_df = Explorer.DataFrame.group_by(df, "year")
iex> Explorer.DataFrame.summarise(grouped_df, total_max: max(total), total_min: min(total))
#Explorer.DataFrame<
Polars[5 x 3]
year s64 [2010, 2011, 2012, 2013, 2014]
total_max s64 [2393248, 2654360, 2734817, 2797384, 2806634]
total_min s64 [1, 2, 2, 2, 3]
>
Suppose you want to get the mean sepal width of each Iris species. You could do something like this:
iex> df = Explorer.Datasets.iris()
iex> grouped_df = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.summarise(grouped_df, mean_sepal_width: round(mean(sepal_width), 3))
#Explorer.DataFrame<
Polars[3 x 2]
species string ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
mean_sepal_width f64 [3.418, 2.77, 2.974]
>
If aggregations across the whole dataframe are what you want, you can use an ungrouped dataframe:
iex> df = Explorer.Datasets.iris()
iex> Explorer.DataFrame.summarise(df, mean_petal_length: round(mean(petal_length), 2))
#Explorer.DataFrame<
Polars[1 x 1]
mean_petal_length f64 [3.76]
>
@spec summarise_with( df :: t(), callback :: (Explorer.Backend.LazyFrame.t() -> column_pairs(Explorer.Series.lazy_t())) ) :: t()
Summarise each group to a single row using a callback function.
If no group is set, the entire dataframe is considered. The callback receives a lazy dataframe. A lazy dataframe does not hold any values; instead it stores all operations so that summarisations can be executed efficiently.
This is a callback version of summarise/2.
Examples
iex> alias Explorer.{DataFrame, Series}
iex> df = Explorer.Datasets.fossil_fuels() |> DataFrame.group_by("year")
iex> DataFrame.summarise_with(df, &[total_max: Series.max(&1["total"]), countries: Series.n_distinct(&1["country"])])
#Explorer.DataFrame<
Polars[5 x 3]
year s64 [2010, 2011, 2012, 2013, 2014]
total_max s64 [2393248, 2654360, 2734817, 2797384, 2806634]
countries u32 [217, 217, 220, 220, 220]
>
iex> alias Explorer.{DataFrame, Series}
iex> df = Explorer.Datasets.fossil_fuels()
iex> DataFrame.summarise_with(df, &[total_max: Series.max(&1["total"]), countries: Series.n_distinct(&1["country"])])
#Explorer.DataFrame<
Polars[1 x 2]
total_max s64 [2806634]
countries u32 [222]
>
Make new columns by applying a native Elixir function to each row.
The native Elixir function should take a map (an element from to_rows_stream/2) as input and return a map where the keys are the desired new columns and the values are the values for that row.
See also the version of this function for a single series.
Warning
This is an expensive operation since data is stored in a columnar format. You should avoid converting a dataframe to rows where possible as it will incur significant overhead. Prefer instead to use the operations in this module rather than native Elixir ones. This function should only be used as a fallback when no equivalent DataFrame operation is available.
Options
:names - Column names to select for serializing. By default all columns will be returned.
Also, all options which are valid for to_rows_stream/2 are valid here.
The :names option is recommended
It's recommended that you provide the :names option, setting it to only those column names needed by your function. Otherwise you'll be serializing unnecessary information.
Examples
Pre-selecting with :names:
iex> alias Explorer.DataFrame, as: DF
iex> df = DF.new(
...> numbers: [1, 2],
...> datetime_local: [~N[2024-01-01 00:00:00], ~N[2024-01-01 00:00:00]],
...> timezone: ["Etc/UTC", "America/New_York"]
...> )
iex> DF.transform(df, [names: ["datetime_local", "timezone"]], fn row ->
...> datetime_utc =
...> row["datetime_local"]
...> |> DateTime.from_naive!(row["timezone"])
...> |> DateTime.shift_zone!("Etc/UTC")
...>
...> %{datetime_utc: datetime_utc}
...> end)
#Explorer.DataFrame<
Polars[2 x 4]
numbers s64 [1, 2]
datetime_local naive_datetime[μs] [2024-01-01 00:00:00.000000, 2024-01-01 00:00:00.000000]
timezone string ["Etc/UTC", "America/New_York"]
datetime_utc datetime[μs, Etc/UTC] [2024-01-01 00:00:00.000000Z, 2024-01-01 05:00:00.000000Z]
>
Converting to atoms with :atom_keys:
iex> alias Explorer.DataFrame, as: DF
iex> df = DF.new(
...> major: [1, 2, 3],
...> minor: [11, 12, 13],
...> patch: [0, 0, 0]
...> )
iex> DF.transform(df, [atom_keys: true], fn row ->
...> version = Version.parse!("#{row.major}.#{row.minor}.#{row.patch}")
...> %{meets_requirements: Version.match?(version, "> 2.0.0")}
...> end)
#Explorer.DataFrame<
Polars[3 x 4]
major s64 [1, 2, 3]
minor s64 [11, 12, 13]
patch s64 [0, 0, 0]
meets_requirements boolean [false, true, true]
>
Transpose a DataFrame.
Warning
This is an expensive operation since data is stored in a columnar format.
Options
:header - When a string is passed, the header column is named with that value. When header is true, the name is set to "column". (default: false)
:columns - Names for the non-header columns. The length should match the row count of the dataframe. (default: nil)
Examples
iex> df = Explorer.DataFrame.new(a: ["d", nil], b: [1, 2], c: ["a", "b"])
iex> Explorer.DataFrame.transpose(df)
#Explorer.DataFrame<
Polars[3 x 2]
column_0 string ["d", "1", "a"]
column_1 string [nil, "2", "b"]
>
iex> df = Explorer.DataFrame.new(a: ["d", nil], b: [1, 2], c: ["a", "b"])
iex> Explorer.DataFrame.transpose(df, columns: ["x", "y"])
#Explorer.DataFrame<
Polars[3 x 2]
x string ["d", "1", "a"]
y string [nil, "2", "b"]
>
iex> df = Explorer.DataFrame.new(a: ["d", nil], b: [1, 2], c: ["a", "b"])
iex> Explorer.DataFrame.transpose(df, header: "name")
#Explorer.DataFrame<
Polars[3 x 3]
name string ["a", "b", "c"]
column_0 string ["d", "1", "a"]
column_1 string [nil, "2", "b"]
>
@spec ungroup(df :: t(), groups_or_group :: column_names() | column_name() | :all) :: t()
Removes grouping variables.
Accepts a list of group names. If no groups are specified, all groups are removed.
Examples
Ungroups all by default:
iex> df = Explorer.Datasets.fossil_fuels()
iex> df = Explorer.DataFrame.group_by(df, ["country", "year"])
iex> Explorer.DataFrame.ungroup(df)
#Explorer.DataFrame<
Polars[1094 x 10]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total s64 [2308, 1254, 32500, 141, 7924, ...]
solid_fuel s64 [627, 117, 332, 0, 0, ...]
liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
gas_fuel s64 [74, 7, 14565, 0, 374, ...]
cement s64 [5, 177, 2598, 0, 204, ...]
gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>
Ungrouping a single column:
iex> df = Explorer.Datasets.fossil_fuels()
iex> df = Explorer.DataFrame.group_by(df, ["country", "year"])
iex> Explorer.DataFrame.ungroup(df, "country")
#Explorer.DataFrame<
Polars[1094 x 10]
Groups: ["year"]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total s64 [2308, 1254, 32500, 141, 7924, ...]
solid_fuel s64 [627, 117, 332, 0, 0, ...]
liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
gas_fuel s64 [74, 7, 14565, 0, 374, ...]
cement s64 [5, 177, 2598, 0, 204, ...]
gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>
Lists, ranges, regexes, and functions are also accepted in column names, as in select/2.
@spec unnest(df :: t(), column_or_columns :: column_name() | [column_name()]) :: t()
Unnests one or multiple struct columns into individual columns.
The field names in an unnested column must not exist in the dataframe or in any other unnested columns.
Examples
iex> df = Explorer.DataFrame.new(before: [1, 2], struct: [%{x: 1, y: 2}, %{x: 3, y: 4}], after: [3, 4])
iex> Explorer.DataFrame.unnest(df, :struct)
#Explorer.DataFrame<
Polars[2 x 4]
before s64 [1, 2]
x s64 [1, 3]
y s64 [2, 4]
after s64 [3, 4]
>
iex> df = Explorer.DataFrame.new(struct1: [%{x: 1, y: 2}, %{x: 3, y: 4}], struct2: [%{z: 5}, %{z: 6}])
iex> Explorer.DataFrame.unnest(df, [:struct1, :struct2])
#Explorer.DataFrame<
Polars[2 x 3]
x s64 [1, 3]
y s64 [2, 4]
z s64 [5, 6]
>
Functions: Multi-table
Combine two or more dataframes column-wise.
This function expects the dataframes to have the same number of rows; otherwise rows may be silently discarded. Eager backends may detect this and raise instead of failing silently, but this may not be possible for lazy dataframes, as the number of rows is not known upfront.
When working with grouped dataframes, be aware that only groups from the first dataframe are kept in the resultant dataframe.
Examples
iex> df1 = Explorer.DataFrame.new(x: [1, 2, 3], y: ["a", "b", "c"])
iex> df2 = Explorer.DataFrame.new(z: [4, 5, 6], a: ["d", "e", "f"])
iex> Explorer.DataFrame.concat_columns([df1, df2])
#Explorer.DataFrame<
Polars[3 x 4]
x s64 [1, 2, 3]
y string ["a", "b", "c"]
z s64 [4, 5, 6]
a string ["d", "e", "f"]
>
Conflicting names are suffixed with the index of the dataframe in the list:
iex> df1 = Explorer.DataFrame.new(x: [1, 2, 3], y: ["a", "b", "c"])
iex> df2 = Explorer.DataFrame.new(x: [4, 5, 6], a: ["d", "e", "f"])
iex> Explorer.DataFrame.concat_columns([df1, df2])
#Explorer.DataFrame<
Polars[3 x 4]
x s64 [1, 2, 3]
y string ["a", "b", "c"]
x_1 s64 [4, 5, 6]
a string ["d", "e", "f"]
>
Combine two dataframes column-wise.
When working with grouped dataframes, be aware that only groups from the left-hand side dataframe are kept in the resultant dataframe.
concat_columns(df1, df2) is equivalent to concat_columns([df1, df2]).
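For illustration, here is a minimal sketch of the pair form, mirroring the list-based examples above (same API, different arity):

```elixir
iex> df1 = Explorer.DataFrame.new(x: [1, 2, 3])
iex> df2 = Explorer.DataFrame.new(y: ["a", "b", "c"])
iex> Explorer.DataFrame.concat_columns(df1, df2)
#Explorer.DataFrame<
Polars[3 x 2]
x s64 [1, 2, 3]
y string ["a", "b", "c"]
>
```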
Combine two or more dataframes row-wise (stack).
Column names and dtypes must match. The only exception is numeric columns, which can be mixed together and are cast automatically to the "highest" dtype.
For example, if a column is represented by {:u, 16} in one dataframe and {:u, 32} in another, it will be cast to the highest unsigned integer dtype between the two, which is {:u, 32}. When signed and unsigned integers are mixed, they are cast to the highest signed integer dtype possible. And when floats and integers are mixed, Explorer casts them to the float dtype {:f, 64}.
When working with grouped dataframes, be aware that only groups from the first dataframe are kept in the resultant dataframe.
Examples
iex> df1 = Explorer.DataFrame.new(x: [1, 2, 3], y: ["a", "b", "c"])
iex> df2 = Explorer.DataFrame.new(x: [4, 5, 6], y: ["d", "e", "f"])
iex> Explorer.DataFrame.concat_rows([df1, df2])
#Explorer.DataFrame<
Polars[6 x 2]
x s64 [1, 2, 3, 4, 5, ...]
y string ["a", "b", "c", "d", "e", ...]
>
iex> df1 = Explorer.DataFrame.new(x: [1, 2, 3], y: ["a", "b", "c"])
iex> df2 = Explorer.DataFrame.new(x: [4.2, 5.3, 6.4], y: ["d", "e", "f"])
iex> Explorer.DataFrame.concat_rows([df1, df2])
#Explorer.DataFrame<
Polars[6 x 2]
x f64 [1.0, 2.0, 3.0, 4.2, 5.3, ...]
y string ["a", "b", "c", "d", "e", ...]
>
Combine two dataframes row-wise.
concat_rows(df1, df2) is equivalent to concat_rows([df1, df2]).
When working with grouped dataframes, be aware that only groups from the left-hand side dataframe are kept in the resultant dataframe.
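For illustration, a minimal sketch of the pair form, mirroring the list-based concat_rows examples above:

```elixir
iex> df1 = Explorer.DataFrame.new(x: [1, 2], y: ["a", "b"])
iex> df2 = Explorer.DataFrame.new(x: [3, 4], y: ["c", "d"])
iex> Explorer.DataFrame.concat_rows(df1, df2)
#Explorer.DataFrame<
Polars[4 x 2]
x s64 [1, 2, 3, 4]
y string ["a", "b", "c", "d"]
>
```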
Join two tables.
Join types
:inner - Returns all rows from left where there are matching values in right, and all columns from left and right.
:left - Returns all rows from left and all columns from left and right. Rows in left with no match in right will have nil values in the new columns.
:right - Returns all rows from right and all columns from left and right. Rows in right with no match in left will have nil values in the new columns.
:outer - Returns all rows and all columns from both left and right. Where there are no matching values, returns nil for the missing side.
:cross - Also known as a cartesian join. Returns all combinations of left and right. Can be very computationally expensive.
Options
:on - The column(s) to join on. Defaults to overlapping columns. Does not apply to cross join.
:how - One of the join types (as an atom) described above. Defaults to :inner.
Examples
Inner join:
iex> left = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> right = Explorer.DataFrame.new(a: [1, 2, 2], c: ["d", "e", "f"])
iex> Explorer.DataFrame.join(left, right)
#Explorer.DataFrame<
Polars[3 x 3]
a s64 [1, 2, 2]
b string ["a", "b", "b"]
c string ["d", "e", "f"]
>
Left join:
iex> left = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> right = Explorer.DataFrame.new(a: [1, 2, 2], c: ["d", "e", "f"])
iex> Explorer.DataFrame.join(left, right, how: :left)
#Explorer.DataFrame<
Polars[4 x 3]
a s64 [1, 2, 2, 3]
b string ["a", "b", "b", "c"]
c string ["d", "e", "f", nil]
>
Right join:
iex> left = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> right = Explorer.DataFrame.new(a: [1, 2, 4], c: ["d", "e", "f"])
iex> Explorer.DataFrame.join(left, right, how: :right)
#Explorer.DataFrame<
Polars[3 x 3]
a s64 [1, 2, 4]
c string ["d", "e", "f"]
b string ["a", "b", nil]
>
Outer join:
iex> left = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> right = Explorer.DataFrame.new(a: [1, 2, 4], c: ["d", "e", "f"])
iex> Explorer.DataFrame.join(left, right, how: :outer)
#Explorer.DataFrame<
Polars[4 x 4]
a s64 [1, 2, nil, 3]
b string ["a", "b", nil, "c"]
a_right s64 [1, 2, 4, nil]
c string ["d", "e", "f", nil]
>
Cross join:
iex> left = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> right = Explorer.DataFrame.new(a: [1, 2, 4], c: ["d", "e", "f"])
iex> Explorer.DataFrame.join(left, right, how: :cross)
#Explorer.DataFrame<
Polars[9 x 4]
a s64 [1, 1, 1, 2, 2, ...]
b string ["a", "a", "a", "b", "b", ...]
a_right s64 [1, 2, 4, 1, 2, ...]
c string ["d", "e", "f", "d", "e", ...]
>
Inner join with different names:
iex> left = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> right = Explorer.DataFrame.new(d: [1, 2, 2], c: ["d", "e", "f"])
iex> Explorer.DataFrame.join(left, right, on: [a: :d])
#Explorer.DataFrame<
Polars[3 x 3]
a s64 [1, 2, 2]
b string ["a", "b", "b"]
c string ["d", "e", "f"]
>
Grouped examples
When doing a join operation with grouped dataframes, the joined dataframe may keep the groups from only one side.
An inner join operation will keep the groups from the left-hand side dataframe:
iex> left = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> right = Explorer.DataFrame.new(a: [1, 2, 2], c: ["d", "e", "f"])
iex> grouped_left = Explorer.DataFrame.group_by(left, "b")
iex> grouped_right = Explorer.DataFrame.group_by(right, "c")
iex> Explorer.DataFrame.join(grouped_left, grouped_right)
#Explorer.DataFrame<
Polars[3 x 3]
Groups: ["b"]
a s64 [1, 2, 2]
b string ["a", "b", "b"]
c string ["d", "e", "f"]
>
A left join operation will keep the groups from the left-hand side dataframe:
iex> left = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> right = Explorer.DataFrame.new(a: [1, 2, 2], c: ["d", "e", "f"])
iex> grouped_left = Explorer.DataFrame.group_by(left, "b")
iex> grouped_right = Explorer.DataFrame.group_by(right, "c")
iex> Explorer.DataFrame.join(grouped_left, grouped_right, how: :left)
#Explorer.DataFrame<
Polars[4 x 3]
Groups: ["b"]
a s64 [1, 2, 2, 3]
b string ["a", "b", "b", "c"]
c string ["d", "e", "f", nil]
>
A right join operation will keep the groups from the right-hand side dataframe:
iex> left = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> right = Explorer.DataFrame.new(a: [1, 2, 4], c: ["d", "e", "f"])
iex> grouped_left = Explorer.DataFrame.group_by(left, "b")
iex> grouped_right = Explorer.DataFrame.group_by(right, "c")
iex> Explorer.DataFrame.join(grouped_left, grouped_right, how: :right)
#Explorer.DataFrame<
Polars[3 x 3]
Groups: ["c"]
a s64 [1, 2, 4]
c string ["d", "e", "f"]
b string ["a", "b", nil]
>
An outer join operation is going to keep the groups from the left-hand side dataframe:
iex> left = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> right = Explorer.DataFrame.new(a: [1, 2, 4], c: ["d", "e", "f"])
iex> grouped_left = Explorer.DataFrame.group_by(left, "b")
iex> grouped_right = Explorer.DataFrame.group_by(right, "c")
iex> Explorer.DataFrame.join(grouped_left, grouped_right, how: :outer)
#Explorer.DataFrame<
Polars[4 x 4]
Groups: ["b"]
a s64 [1, 2, nil, 3]
b string ["a", "b", nil, "c"]
a_right s64 [1, 2, 4, nil]
c string ["d", "e", "f", nil]
>
A cross join operation is going to keep the groups from the left-hand side dataframe:
iex> left = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> right = Explorer.DataFrame.new(a: [1, 2, 4], c: ["d", "e", "f"])
iex> grouped_left = Explorer.DataFrame.group_by(left, "b")
iex> grouped_right = Explorer.DataFrame.group_by(right, "c")
iex> Explorer.DataFrame.join(grouped_left, grouped_right, how: :cross)
#Explorer.DataFrame<
Polars[9 x 4]
Groups: ["b"]
a s64 [1, 1, 1, 2, 2, ...]
b string ["a", "a", "a", "b", "b", ...]
a_right s64 [1, 2, 4, 1, 2, ...]
c string ["d", "e", "f", "d", "e", ...]
>
Functions: Row-based
Returns the first n rows of the dataframe.
By default it returns the first 5 rows.
If the dataframe is using groups, then the first n rows of each group are returned.
Examples
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.head(df)
#Explorer.DataFrame<
Polars[5 x 10]
year s64 [2010, 2010, 2010, 2010, 2010]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA"]
total s64 [2308, 1254, 32500, 141, 7924]
solid_fuel s64 [627, 117, 332, 0, 0]
liquid_fuel s64 [1601, 953, 12381, 141, 3649]
gas_fuel s64 [74, 7, 14565, 0, 374]
cement s64 [5, 177, 2598, 0, 204]
gas_flaring s64 [0, 0, 2623, 0, 3697]
per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37]
bunker_fuels s64 [9, 7, 663, 0, 321]
>
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.head(df, 2)
#Explorer.DataFrame<
Polars[2 x 10]
year s64 [2010, 2010]
country string ["AFGHANISTAN", "ALBANIA"]
total s64 [2308, 1254]
solid_fuel s64 [627, 117]
liquid_fuel s64 [1601, 953]
gas_fuel s64 [74, 7]
cement s64 [5, 177]
gas_flaring s64 [0, 0]
per_capita f64 [0.08, 0.43]
bunker_fuels s64 [9, 7]
>
Grouped examples
Using grouped dataframes makes head/2 return n rows from each group.
Here is an example using the Iris dataset, and returning two rows from each group:
iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.head(grouped, 2)
#Explorer.DataFrame<
Polars[6 x 5]
Groups: ["species"]
sepal_length f64 [5.1, 4.9, 7.0, 6.4, 6.3, ...]
sepal_width f64 [3.5, 3.0, 3.2, 3.2, 3.3, ...]
petal_length f64 [1.4, 1.4, 4.7, 4.5, 6.0, ...]
petal_width f64 [0.2, 0.2, 1.4, 1.5, 2.5, ...]
species string ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-versicolor", "Iris-virginica", ...]
>
Sample rows from a dataframe.
If given an integer as the second argument, it will return N samples. If given a float, it will return that proportion of the dataframe's rows.
Can sample with or without replacement.
For grouped dataframes, sample will take into account the rows of each group, meaning that if you try to get N samples and you have G groups, you will get N * G rows. See the examples below.
Options
:replace
- If set totrue
, each sample will be independent and therefore values may repeat. Required to betrue
forn
greater then the number of rows in the dataframe orfrac
> 1.0. (default:false
):seed
- An integer to be used as a random seed. If nil, a random value between 0 and 2^64 - 1 will be used. (default:nil
):shuffle
- If set totrue
, the resultant dataframe is going to be shuffle if the sample is equal to the size of the dataframe. (default:false
)
Examples
You can sample N rows:
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.sample(df, 3, seed: 100)
#Explorer.DataFrame<
Polars[3 x 10]
year s64 [2012, 2013, 2014]
country string ["SYRIAN ARAB REPUBLIC", "EGYPT", "AFGHANISTAN"]
total s64 [12198, 58198, 2675]
solid_fuel s64 [1, 224, 1194]
liquid_fuel s64 [7909, 26501, 1393]
gas_fuel s64 [3265, 24672, 74]
cement s64 [816, 6800, 14]
gas_flaring s64 [208, 0, 0]
per_capita f64 [0.61, 0.66, 0.08]
bunker_fuels s64 [437, 694, 9]
>
Or you can sample a proportion of rows:
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.sample(df, 0.03, seed: 100)
#Explorer.DataFrame<
Polars[32 x 10]
year s64 [2013, 2012, 2014, 2011, 2011, ...]
country string ["BRITISH VIRGIN ISLANDS", "TAJIKISTAN", "AFGHANISTAN", "ICELAND", "SINGAPORE", ...]
total s64 [48, 800, 2675, 513, 12332, ...]
solid_fuel s64 [0, 192, 1194, 94, 7, ...]
liquid_fuel s64 [48, 501, 1393, 400, 7774, ...]
gas_fuel s64 [0, 74, 74, 0, 4551, ...]
cement s64 [0, 34, 14, 19, 0, ...]
gas_flaring s64 [0, 0, 0, 0, 0, ...]
per_capita f64 [1.64, 0.1, 0.08, 1.6, 2.38, ...]
bunker_fuels s64 [0, 28, 9, 168, 41786, ...]
>
Grouped examples
In the following example we have the Iris dataset grouped by species, and we want to take a sample of two plants from each group. Since we have three species, the resultant dataframe is going to have six rows (2 * 3).
iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.sample(grouped, 2, seed: 100)
#Explorer.DataFrame<
Polars[6 x 5]
Groups: ["species"]
sepal_length f64 [4.8, 5.0, 5.5, 6.5, 7.4, ...]
sepal_width f64 [3.1, 3.6, 2.4, 2.8, 2.8, ...]
petal_length f64 [1.6, 1.4, 3.8, 4.6, 6.1, ...]
petal_width f64 [0.2, 0.2, 1.1, 1.5, 1.9, ...]
species string ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-versicolor", "Iris-virginica", ...]
>
The behaviour is similar when taking a fraction of the rows from each group. The main difference is that each group can have more or fewer rows, depending on its size.
iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.sample(grouped, 0.1, seed: 100)
#Explorer.DataFrame<
Polars[15 x 5]
Groups: ["species"]
sepal_length f64 [5.2, 5.0, 5.2, 5.0, 5.0, ...]
sepal_width f64 [3.4, 3.6, 3.5, 3.0, 3.4, ...]
petal_length f64 [1.4, 1.4, 1.5, 1.6, 1.6, ...]
petal_width f64 [0.2, 0.2, 0.2, 0.2, 0.4, ...]
species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", ...]
>
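The :replace option can be seen in action with a small sketch (not taken from the original examples): sampling more rows than the dataframe contains requires replace: true, otherwise sample/3 raises.

```elixir
# Sampling 5 rows from a 3-row dataframe only works with replacement.
df = Explorer.DataFrame.new(a: [1, 2, 3])
larger = Explorer.DataFrame.sample(df, 5, replace: true, seed: 100)
Explorer.DataFrame.n_rows(larger)
# => 5
```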
Randomly changes the order of the rows of a dataframe.
This function ignores groups.
Options
:seed - An integer to be used as a random seed. If nil, a random value between 0 and 2^64 - 1 is used. (default: nil)
Examples
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.shuffle(df, seed: 100)
#Explorer.DataFrame<
Polars[1094 x 10]
year s64 [2014, 2014, 2014, 2012, 2010, ...]
country string ["ISRAEL", "ARGENTINA", "NETHERLANDS", "YEMEN", "GRENADA", ...]
total s64 [17617, 55638, 45624, 5091, 71, ...]
solid_fuel s64 [6775, 1588, 9070, 129, 0, ...]
liquid_fuel s64 [6013, 25685, 18272, 4173, 71, ...]
gas_fuel s64 [3930, 26368, 18010, 414, 0, ...]
cement s64 [898, 1551, 272, 375, 0, ...]
gas_flaring s64 [0, 446, 0, 0, 0, ...]
per_capita f64 [2.22, 1.29, 2.7, 0.2, 0.68, ...]
bunker_fuels s64 [1011, 2079, 14210, 111, 4, ...]
>
Slices rows at the given indices as a new dataframe.
The indices may be a list or series of indices, or a range. A list of indices does not support negative numbers. Ranges may be negative on either end; negative bounds are normalized against the dataframe length. Note that ranges in Elixir are inclusive.
Slice works differently when a dataframe is grouped: it considers the indices within each group instead of the entire dataframe. See the examples below.
If your intention is to grab a portion of each group, prefer to use sample/3 instead.
Examples
iex> df = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> Explorer.DataFrame.slice(df, [0, 2])
#Explorer.DataFrame<
Polars[2 x 2]
a s64 [1, 3]
b string ["a", "c"]
>
With a series
iex> df = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> Explorer.DataFrame.slice(df, Explorer.Series.from_list([0, 2]))
#Explorer.DataFrame<
Polars[2 x 2]
a s64 [1, 3]
b string ["a", "c"]
>
With a range:
iex> df = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> Explorer.DataFrame.slice(df, 1..2)
#Explorer.DataFrame<
Polars[2 x 2]
a s64 [2, 3]
b string ["b", "c"]
>
With a range with negative first and last:
iex> df = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> Explorer.DataFrame.slice(df, -2..-1)
#Explorer.DataFrame<
Polars[2 x 2]
a s64 [2, 3]
b string ["b", "c"]
>
Grouped examples
We are going to once again use the Iris dataset. In this example we want to take the elements at indices 0 and 2 of each group:
iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.slice(grouped, [0, 2])
#Explorer.DataFrame<
Polars[6 x 5]
Groups: ["species"]
sepal_length f64 [5.1, 4.7, 7.0, 6.9, 6.3, ...]
sepal_width f64 [3.5, 3.2, 3.2, 3.1, 3.3, ...]
petal_length f64 [1.4, 1.3, 4.7, 4.9, 6.0, ...]
petal_width f64 [0.2, 0.2, 1.4, 1.5, 2.5, ...]
species string ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-versicolor", "Iris-virginica", ...]
>
Now we want to take the first 3 rows of each group. This works with the range 0..2:
iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.slice(grouped, 0..2)
#Explorer.DataFrame<
Polars[9 x 5]
Groups: ["species"]
sepal_length f64 [5.1, 4.9, 4.7, 7.0, 6.4, ...]
sepal_width f64 [3.5, 3.0, 3.2, 3.2, 3.2, ...]
petal_length f64 [1.4, 1.4, 1.3, 4.7, 4.5, ...]
petal_width f64 [0.2, 0.2, 0.2, 1.4, 1.5, ...]
species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-versicolor", ...]
>
Subsets a contiguous set of rows, given an offset and a length.
Examples
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.slice(df, 1, 2)
#Explorer.DataFrame<
Polars[2 x 10]
year s64 [2010, 2010]
country string ["ALBANIA", "ALGERIA"]
total s64 [1254, 32500]
solid_fuel s64 [117, 332]
liquid_fuel s64 [953, 12381]
gas_fuel s64 [7, 14565]
cement s64 [177, 2598]
gas_flaring s64 [0, 2623]
per_capita f64 [0.43, 0.9]
bunker_fuels s64 [7, 663]
>
Negative offsets count from the end of the dataframe:
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.slice(df, -10, 2)
#Explorer.DataFrame<
Polars[2 x 10]
year s64 [2014, 2014]
country string ["UNITED STATES OF AMERICA", "URUGUAY"]
total s64 [1432855, 1840]
solid_fuel s64 [450047, 2]
liquid_fuel s64 [576531, 1700]
gas_fuel s64 [390719, 25]
cement s64 [11314, 112]
gas_flaring s64 [4244, 0]
per_capita f64 [4.43, 0.54]
bunker_fuels s64 [30722, 251]
>
If the length would run past the end of the dataframe, the result may be shorter than the length:
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.slice(df, -10, 20)
#Explorer.DataFrame<
Polars[10 x 10]
year s64 [2014, 2014, 2014, 2014, 2014, ...]
country string ["UNITED STATES OF AMERICA", "URUGUAY", "UZBEKISTAN", "VANUATU", "VENEZUELA", ...]
total s64 [1432855, 1840, 28692, 42, 50510, ...]
solid_fuel s64 [450047, 2, 1677, 0, 204, ...]
liquid_fuel s64 [576531, 1700, 2086, 42, 28445, ...]
gas_fuel s64 [390719, 25, 23929, 0, 12731, ...]
cement s64 [11314, 112, 1000, 0, 1088, ...]
gas_flaring s64 [4244, 0, 0, 0, 8042, ...]
per_capita f64 [4.43, 0.54, 0.97, 0.16, 1.65, ...]
bunker_fuels s64 [30722, 251, 0, 10, 1256, ...]
>
Grouped examples
We want to take the first 3 rows of each group. We need the offset 0 and the length 3:
iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.slice(grouped, 0, 3)
#Explorer.DataFrame<
Polars[9 x 5]
Groups: ["species"]
sepal_length f64 [5.1, 4.9, 4.7, 7.0, 6.4, ...]
sepal_width f64 [3.5, 3.0, 3.2, 3.2, 3.2, ...]
petal_length f64 [1.4, 1.4, 1.3, 4.7, 4.5, ...]
petal_width f64 [0.2, 0.2, 0.2, 1.4, 1.5, ...]
species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-versicolor", ...]
>
We can also pass a negative offset:
iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.slice(grouped, -6, 3)
#Explorer.DataFrame<
Polars[9 x 5]
Groups: ["species"]
sepal_length f64 [5.1, 4.8, 5.1, 5.6, 5.7, ...]
sepal_width f64 [3.8, 3.0, 3.8, 2.7, 3.0, ...]
petal_length f64 [1.9, 1.4, 1.6, 4.2, 4.2, ...]
petal_width f64 [0.4, 0.3, 0.2, 1.3, 1.2, ...]
species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-versicolor", ...]
>
Returns the last n rows of the dataframe.
By default it returns the last 5 rows.
If the dataframe is grouped, then the last n rows of each group are returned.
Examples
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.tail(df)
#Explorer.DataFrame<
Polars[5 x 10]
year s64 [2014, 2014, 2014, 2014, 2014]
country string ["VIET NAM", "WALLIS AND FUTUNA ISLANDS", "YEMEN", "ZAMBIA", "ZIMBABWE"]
total s64 [45517, 6, 6190, 1228, 3278]
solid_fuel s64 [19246, 0, 137, 132, 2097]
liquid_fuel s64 [12694, 6, 5090, 797, 1005]
gas_fuel s64 [5349, 0, 581, 0, 0]
cement s64 [8229, 0, 381, 299, 177]
gas_flaring s64 [0, 0, 0, 0, 0]
per_capita f64 [0.49, 0.44, 0.24, 0.08, 0.22]
bunker_fuels s64 [761, 1, 153, 33, 9]
>
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.tail(df, 2)
#Explorer.DataFrame<
Polars[2 x 10]
year s64 [2014, 2014]
country string ["ZAMBIA", "ZIMBABWE"]
total s64 [1228, 3278]
solid_fuel s64 [132, 2097]
liquid_fuel s64 [797, 1005]
gas_fuel s64 [0, 0]
cement s64 [299, 177]
gas_flaring s64 [0, 0]
per_capita f64 [0.08, 0.22]
bunker_fuels s64 [33, 9]
>
Grouped examples
Using grouped dataframes makes tail/2 return n rows from each group.
Here is an example using the Iris dataset, returning two rows from each group:
iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.tail(grouped, 2)
#Explorer.DataFrame<
Polars[6 x 5]
Groups: ["species"]
sepal_length f64 [5.3, 5.0, 5.1, 5.7, 6.2, ...]
sepal_width f64 [3.7, 3.3, 2.5, 2.8, 3.4, ...]
petal_length f64 [1.5, 1.4, 3.0, 4.1, 5.4, ...]
petal_width f64 [0.2, 0.2, 1.1, 1.3, 2.3, ...]
species string ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-versicolor", "Iris-virginica", ...]
>
Functions: Introspection
Gets the dtypes of the dataframe columns.
Examples
iex> df = Explorer.DataFrame.new(floats: [1.0, 2.0], ints: [1, 2])
iex> Explorer.DataFrame.dtypes(df)
%{"floats" => {:f, 64}, "ints" => {:s, 64}}
Returns the groups of a dataframe.
Examples
iex> df = Explorer.Datasets.fossil_fuels()
iex> df = Explorer.DataFrame.group_by(df, "country")
iex> Explorer.DataFrame.groups(df)
["country"]
iex> df = Explorer.Datasets.iris()
iex> Explorer.DataFrame.groups(df)
[]
Returns the number of columns in the dataframe.
This function works the same way for grouped dataframes.
Examples
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.n_columns(df)
10
Returns the number of rows in the dataframe.
This function works the same way for grouped dataframes, considering the entire dataframe in the counting of rows.
Examples
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.n_rows(df)
1094
Gets the names of the dataframe columns.
Examples
iex> df = Explorer.DataFrame.new(floats: [1.0, 2.0], ints: [1, 2])
iex> Explorer.DataFrame.names(df)
["floats", "ints"]
Prints the DataFrame in a tabular fashion.
Examples
df = Explorer.Datasets.iris()
Explorer.DataFrame.print(df)
Explorer.DataFrame.print(df, limit: 1)
Explorer.DataFrame.print(df, limit: :infinity)
Gets the shape of the dataframe as a {height, width}
tuple.
This function works the same way for grouped dataframes, considering the entire dataframe in the counting of rows.
Examples
iex> df = Explorer.DataFrame.new(floats: [1.0, 2.0, 3.0], ints: [1, 2, 3])
iex> Explorer.DataFrame.shape(df)
{3, 2}
Functions: IO
@spec dump_csv(df :: t(), opts :: Keyword.t()) :: {:ok, String.t()} | {:error, Exception.t()}
Writes a dataframe to a binary representation of a delimited file.
Options
:header - Should the column names be written as the first line of the file? (default: true)
:delimiter - A single character used to separate fields within a record. (default: ",")
Examples
iex> df = Explorer.Datasets.fossil_fuels() |> Explorer.DataFrame.head(2)
iex> Explorer.DataFrame.dump_csv(df)
{:ok, "year,country,total,solid_fuel,liquid_fuel,gas_fuel,cement,gas_flaring,per_capita,bunker_fuels\n2010,AFGHANISTAN,2308,627,1601,74,5,0,0.08,9\n2010,ALBANIA,1254,117,953,7,177,0,0.43,7\n"}
Similar to dump_csv/2, but raises in case of error.
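As a sketch not taken from the original docs, the raising variant returns the CSV binary directly rather than an {:ok, _} tuple, which is convenient in pipelines:

```elixir
# dump_csv!/2 returns the CSV string directly; here with a custom delimiter.
df = Explorer.DataFrame.new(x: [1, 2], y: ["a", "b"])
csv = Explorer.DataFrame.dump_csv!(df, delimiter: ";")
# csv is a binary whose header line is "x;y"
```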
@spec dump_ipc(df :: t(), opts :: Keyword.t()) :: {:ok, binary()} | {:error, Exception.t()}
Writes a dataframe to a binary representation of an IPC file.
Groups are ignored if the dataframe is using any.
Options
:compression - The compression algorithm to use when writing files. Supported options are nil (uncompressed, default), :zstd, and :lz4.
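A minimal round-trip sketch (not from the original docs), assuming the load_ipc/2 counterpart documented elsewhere in Explorer.DataFrame:

```elixir
# Dump a dataframe to an IPC binary with zstd compression, then read it back.
df = Explorer.DataFrame.new(a: [1, 2, 3])
{:ok, binary} = Explorer.DataFrame.dump_ipc(df, compression: :zstd)
{:ok, df2} = Explorer.DataFrame.load_ipc(binary)
Explorer.DataFrame.n_rows(df2)
# => 3
```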
Similar to dump_ipc/2, but raises in case of error.
@spec dump_ipc_stream(df :: t(), opts :: Keyword.t()) :: {:ok, binary()} | {:error, Exception.t()}
Writes a dataframe to a binary representation of an IPC Stream file.
Groups are ignored if the dataframe is using any.
Options
:compression - The compression algorithm to use when writing files. Supported options are nil (uncompressed, default), :zstd, and :lz4.
Similar to dump_ipc_stream/2, but raises in case of error.
@spec dump_ndjson(df :: t()) :: {:ok, binary()} | {:error, Exception.t()}
Writes a dataframe to a binary representation of a NDJSON file.
Groups are ignored if the dataframe is using any.
Examples
iex> df = Explorer.DataFrame.new(col_a: [1, 2], col_b: [5.1, 5.2])
iex> Explorer.DataFrame.dump_ndjson(df)
{:ok, ~s({"col_a":1,"col_b":5.1}\n{"col_a":2,"col_b":5.2}\n)}
Similar to dump_ndjson/1, but raises in case of error.
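A round-trip sketch (not from the original docs), assuming the load_ndjson/2 counterpart documented elsewhere in Explorer.DataFrame:

```elixir
# dump_ndjson!/1 returns the NDJSON binary directly; load_ndjson/2 reads it back.
df = Explorer.DataFrame.new(col_a: [1, 2], col_b: [5.1, 5.2])
ndjson = Explorer.DataFrame.dump_ndjson!(df)
{:ok, df2} = Explorer.DataFrame.load_ndjson(ndjson)
Explorer.DataFrame.n_rows(df2)
# => 2
```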