View Source Explorer.DataFrame (Explorer v0.8.1)
The DataFrame struct and API.
Dataframes are two-dimensional tabular data structures similar to a spreadsheet. For example, the Iris dataset:
iex> Explorer.Datasets.iris()
#Explorer.DataFrame<
Polars[150 x 5]
sepal_length f64 [5.1, 4.9, 4.7, 4.6, 5.0, ...]
sepal_width f64 [3.5, 3.0, 3.2, 3.1, 3.6, ...]
petal_length f64 [1.4, 1.4, 1.3, 1.5, 1.4, ...]
petal_width f64 [0.2, 0.2, 0.2, 0.2, 0.2, ...]
species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", ...]
>
This dataframe has 150 rows and five columns. Each column is an Explorer.Series
of the same size (150):
iex> df = Explorer.Datasets.iris()
iex> df["sepal_length"]
#Explorer.Series<
Polars[150]
f64 [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.8, 4.8, 4.3, 5.8, 5.7, 5.4, 5.1, 5.7, 5.1, 5.4, 5.1, 4.6, 5.1, 4.8, 5.0, 5.0, 5.2, 5.2, 4.7, 4.8, 5.4, 5.2, 5.5, 4.9, 5.0, 5.5, 4.9, 4.4, 5.1, 5.0, 4.5, 4.4, 5.0, 5.1, 4.8, 5.1, 4.6, 5.3, 5.0, ...]
>
Creating dataframes
Dataframes can be created from normal Elixir terms. The main way you might do this is
with the new/1
function. For example:
iex> Explorer.DataFrame.new(a: ["a", "b"], b: [1, 2])
#Explorer.DataFrame<
Polars[2 x 2]
a string ["a", "b"]
b s64 [1, 2]
>
Or with a list of maps:
iex> Explorer.DataFrame.new([%{"col1" => "a", "col2" => 1}, %{"col1" => "b", "col2" => 2}])
#Explorer.DataFrame<
Polars[2 x 2]
col1 string ["a", "b"]
col2 s64 [1, 2]
>
Verbs
Explorer uses the idea of a consistent set of SQL-like verbs
like
dplyr
which can help solve common
data manipulation challenges. These are split into single table verbs,
multiple table verbs, and row-based verbs:
Single table verbs
Single table verbs are (unsurprisingly) used for manipulating a single dataframe. Those operations typically driven by column names. These are:
select/2
for picking columns anddiscard/2
to discard themfilter/2
for picking rows based on predicatesmutate/2
for adding or replacing columns that are functions of existing columnssort_by/2
for changing the ordering of rowsdistinct/2
for picking unique rowssummarise/2
for reducing multiple rows down to a single summarypivot_longer/3
andpivot_wider/4
for massaging dataframes into longer or wider forms, respectively
Each of these combine with Explorer.DataFrame.group_by/2
for operating by group.
Multiple table verbs
Multiple table verbs are used for combining tables. These are:
join/3
for performing SQL-like joinsconcat_columns/1
for horizontally "stacking" dataframesconcat_rows/1
for vertically "stacking" dataframes
Row-based verbs
Those operations are driven by the row index. These are:
head/2
for picking the first rowstail/2
for picking the last rowsslice/2
for slicing the dataframe by row indexes or a rangeslice/3
for slicing a section by an offsetsample/2
for sampling the data-frame by row
IO operations
Explorer supports reading and writing of:
- delimited files (such as CSV or TSV)
- Parquet
- Arrow IPC
- Arrow Streaming IPC
- Newline Delimited JSON
- Databases via
Adbc
infrom_query/3
The convention Explorer uses is to have from_*
and to_*
functions to read and write
to files in the formats above. load_*
and dump_*
versions are also available to read
and write those formats directly in memory.
Files can be fetched from local or remote file system, such as S3, using the following formats:
# path to a file in disk
Explorer.DataFrame.from_parquet("/path/to/file.parquet")
# path to a URL schema (with optional configuration)
Explorer.DataFrame.from_parquet("s3://bucket/file.parquet", config: FSS.S3.config_from_system_env())
# it's possible to configure using keyword lists
Explorer.DataFrame.from_parquet("s3://bucket/file.parquet", config: [access_key_id: "my-key", secret_access_key: "my-secret"])
# a FSS entry (it already includes its config)
Explorer.DataFrame.from_parquet(FSS.S3.parse("s3://bucket/file.parquet"))
The :config
option of from_*
functions is only required if the filename is a path
to a remote resource. In case it's a FSS entry, the requirement is that the config is passed
inside the entry struct.
For more details about the options, see the FSS docs.
Selecting columns and access
Several functions in this module, such as select/2
, discard/2
, drop_nil/2
, and so
forth accept a single or multiple columns as arguments. The columns can be specified in
a variety of formats, which we describe below.
Explorer.DataFrame
also implements the Access
behaviour (also known as the brackets
syntax). This should be familiar for users coming from other language with dataframes
such as R or Python. For example:
iex> df = Explorer.Datasets.wine()
iex> df["class"]
#Explorer.Series<
Polars[178]
s64 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]
>
Accessing the dataframe with a column name either as a string or an atom, will return the column. You can also pass an integer representing the column order:
iex> df = Explorer.Datasets.wine()
iex> df[0]
#Explorer.Series<
Polars[178]
s64 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]
>
You can also pass a list, a range, or a regex to return a dataframe matching the given data type. For example, by passing a list:
iex> df = Explorer.Datasets.wine()
iex> df[["class", "hue"]]
#Explorer.DataFrame<
Polars[178 x 2]
class s64 [1, 1, 1, 1, 1, ...]
hue f64 [1.04, 1.05, 1.03, 0.86, 1.04, ...]
>
Or a range for the given positions:
iex> df = Explorer.Datasets.wine()
iex> df[0..2]
#Explorer.DataFrame<
Polars[178 x 3]
class s64 [1, 1, 1, 1, 1, ...]
alcohol f64 [14.23, 13.2, 13.16, 14.37, 13.24, ...]
malic_acid f64 [1.71, 1.78, 2.36, 1.95, 2.59, ...]
>
Or a regex to keep only columns matching a given pattern:
iex> df = Explorer.Datasets.wine()
iex> df[~r/(class|hue)/]
#Explorer.DataFrame<
Polars[178 x 2]
class s64 [1, 1, 1, 1, 1, ...]
hue f64 [1.04, 1.05, 1.03, 0.86, 1.04, ...]
>
Given you can also access a series using its index, you can use multiple accesses to select a column and row at the same time:
iex> df = Explorer.Datasets.wine()
iex> df["class"][3]
1
Summary
Functions: Conversion
Collects the lazy data frame into an eager one, computing the query.
Converts the dataframe to the lazy version of the current backend.
Returns true if the data frame is a lazy one.
Creates a new dataframe.
Converts a dataframe to a list of columns with lists as values.
Converts a dataframe to a list of maps (rows).
Converts a dataframe to a stream of maps (rows).
Converts a dataframe to a list of columns with series as values.
Functions: Single-table
Calculates the pairwise correlation of numeric columns.
Calculates the pairwise covariance of numeric columns.
Describe numeric columns of a DataFrame.
Discards a subset of columns by name.
Takes distinct rows by a selection of columns.
Drop nil values.
Turns a set of columns to dummy variables.
Explodes one or multiple list column into multiple rows.
Picks rows based on Explorer.Query
.
Picks rows based on a callback function.
Creates a new dataframe with unique rows and the frequencies of each.
Group the dataframe by one or more variables.
Picks rows based on a list or series of values.
Creates or modifies columns based on Explorer.Query
.
Creates or modifies columns using a callback function.
Counts the number of nil
elements in each column.
Pivot data from wide to long.
Pivot data from long to wide.
Extracts a single column as a series.
Creates or modifies a single column.
Relocates columns.
Renames columns.
Renames columns with a function.
Selects a subset of columns by name.
Sorts rows by columns using Explorer.Query
.
Sorts rows by columns using a callback function.
Summarise each group to a single row using Explorer.Query
.
Summarise each group to a single row using a callback function.
Transpose a DataFrame.
Removes grouping variables.
Unnests one or multiple struct columns into individual columns.
Functions: Multi-table
Combine two or more dataframes column-wise.
Combine two dataframes column-wise.
Combine two or more dataframes row-wise (stack).
Combine two dataframes row-wise.
Join two tables.
Functions: Row-based
Returns the first n rows of the dataframe.
Sample rows from a dataframe.
Change the order of the rows of a dataframe randomly.
Slices rows at the given indices as a new dataframe.
Subset a continuous set of rows.
Returns the last n rows of the dataframe.
Functions: Introspection
Gets the dtypes of the dataframe columns.
Returns the groups of a dataframe.
Returns the number of columns in the dataframe.
Returns the number of rows in the dataframe.
Gets the names of the dataframe columns.
Prints the DataFrame in a tabular fashion.
Gets the shape of the dataframe as a {height, width}
tuple.
Functions: IO
Writes a dataframe to a binary representation of a delimited file.
Similar to dump_csv/2
, but raises in case of error.
Writes a dataframe to a binary representation of an IPC file.
Similar to dump_ipc/2
, but raises in case of error.
Writes a dataframe to a binary representation of an IPC Stream file.
Similar to dump_ipc_stream/2
, but raises in case of error.
Writes a dataframe to a binary representation of a NDJSON file.
Similar to dump_ndjson!/2
, but raises in case of error.
Writes a dataframe to a binary representation of a Parquet file.
Similar to dump_parquet/2
, but raises in case of error.
Reads a delimited file into a dataframe.
Similar to from_csv/2
but raises if there is a problem reading the CSV.
Reads an IPC file into a dataframe.
Similar to from_ipc/2
but raises if there is a problem reading the IPC file.
Reads an IPC Streaming file into a dataframe.
Similar to from_ipc_stream/2
but raises if there is a problem reading the IPC Stream file.
Read a file of JSON objects or lists separated by new lines
Similar to from_ndjson/2
, but raises in case of error.
Reads a parquet file into a dataframe.
Similar to from_parquet/2
but raises if there is a problem reading the Parquet file.
Reads data from a query.
Similar to from_query/4
but raises if there is an error.
Reads a representation of a CSV file into a dataframe.
Similar to load_csv/2
but raises if there is a problem reading the CSV.
Reads a binary representing an IPC file into a dataframe.
Similar to load_ipc/2
but raises if there is a problem reading the IPC file.
Reads a binary representing an IPC Stream file into a dataframe.
Similar to load_ipc_stream/2
but raises if there is a problem.
Reads a representation of a NDJSON file into a dataframe.
Similar to load_ndjson/2
, but raises in case of error.
Reads a binary representation of a parquet file into a dataframe.
Similar to load_parquet/2
but raises if there is a problem reading the Parquet file.
Writes a dataframe to a delimited file.
Similar to to_csv/3
but raises if there is a problem reading the CSV.
Writes a dataframe to an IPC file.
Similar to to_ipc/3
, but raises in case of error.
Writes a dataframe to an IPC Stream file.
Similar to to_ipc_stream/3
, but raises in case of error.
Writes a dataframe to a ndjson file.
Similar to to_ndjson/3
, but raises in case of error.
Writes a dataframe to a parquet file.
Similar to to_parquet/3
, but raises in case of error.
Types
Represents a column name or its index.
Represents a column name as atom or string.
Represents multiple column names as atoms or strings.
Represents a column pair where the value is a column name or
a column index, and the value is of type value
.
Represents multiple columns.
Represents a filesystem entry, that can be local, S3 or URL.
Represents a dataframe.
Functions: Conversion
Collects the lazy data frame into an eager one, computing the query.
It is the opposite of lazy/1
. If already eager, this is a noop.
Collecting a grouped dataframe should return a grouped dataframe.
Converts the dataframe to the lazy version of the current backend.
Operations on a lazy dataframe are not executed immediately, instead
they are batched together, and performed when collect/1
is invoked.
This allows for query optimizations, parallelism, leading to better
performance.
If already lazy, this is a noop. Converting a grouped dataframe should return a lazy dataframe with groups.
Returns true if the data frame is a lazy one.
@spec new( Table.Reader.t() | series_pairs, opts :: Keyword.t() ) :: t() when series_pairs: %{required(column_name()) => Explorer.Series.t()} | [{column_name(), Explorer.Series.t()}]
Creates a new dataframe.
It accepts any of:
- a map or keyword list of string/atom keys and series as values
- a map or keyword list of string/atom keys and tensors as values
- any data structure adhering to the
Table.Reader
protocol
Options
:backend
- The Explorer backend to use. Defaults to the value returned byExplorer.Backend.get/0
.:dtypes
- A list/map of{column_name, dtype}
pairs. (default:[]
):lazy
- force the results into the lazy version of the current backend.
Examples
From series
Series can be given either as keyword lists or maps where the keys are the name and the values are series:
iex> Explorer.DataFrame.new(%{
...> floats: Explorer.Series.from_list([1.0, 2.0]),
...> ints: Explorer.Series.from_list([1, nil])
...> })
#Explorer.DataFrame<
Polars[2 x 2]
floats f64 [1.0, 2.0]
ints s64 [1, nil]
>
From tensors
To create dataframe from tensors, you can pass a matrix as argument. Each matrix column becomes a dataframe column with names x1, x2, x3, etc:
iex> Explorer.DataFrame.new(Nx.tensor([
...> [1, 2, 3],
...> [4, 5, 6]
...> ]))
#Explorer.DataFrame<
Polars[2 x 3]
x1 s64 [1, 4]
x2 s64 [2, 5]
x3 s64 [3, 6]
>
Explorer expects tensors to have certain types, so you may need to cast
the data accordingly. See Explorer.Series.from_tensor/2
for more info.
You can also pass a keyword list or maps of vectors (rank 1 tensors):
iex> Explorer.DataFrame.new(%{
...> floats: Nx.tensor([1.0, 2.0], type: :f64),
...> ints: Nx.tensor([3, 4])
...> })
#Explorer.DataFrame<
Polars[2 x 2]
floats f64 [1.0, 2.0]
ints s64 [3, 4]
>
Use dtypes to force a particular representation:
iex> Explorer.DataFrame.new([
...> floats: Nx.tensor([1.0, 2.0], type: :f64),
...> times: Nx.tensor([3_000, 4_000])
...> ], dtypes: [times: :time])
#Explorer.DataFrame<
Polars[2 x 2]
floats f64 [1.0, 2.0]
times time [00:00:00.000003, 00:00:00.000004]
>
From tabular
Tabular data can be either columnar or row-based. Let's start with column data:
iex> Explorer.DataFrame.new(%{floats: [1.0, 2.0], ints: [1, nil]})
#Explorer.DataFrame<
Polars[2 x 2]
floats f64 [1.0, 2.0]
ints s64 [1, nil]
>
iex> Explorer.DataFrame.new(floats: [1.0, 2.0], ints: [1, nil])
#Explorer.DataFrame<
Polars[2 x 2]
floats f64 [1.0, 2.0]
ints s64 [1, nil]
>
iex> Explorer.DataFrame.new([floats: [1.0, 2.0], ints: [1, nil], binaries: [<<239, 191, 19>>, nil]], dtypes: [{:binaries, :binary}])
#Explorer.DataFrame<
Polars[2 x 3]
floats f64 [1.0, 2.0]
ints s64 [1, nil]
binaries binary [<<239, 191, 19>>, nil]
>
iex> Explorer.DataFrame.new(%{floats: [1.0, 2.0], ints: [1, "wrong"]})
** (ArgumentError) cannot create series "ints": the value "wrong" does not match the inferred dtype {:s, 64}
From row data:
iex> rows = [%{id: 1, name: "José"}, %{id: 2, name: "Christopher"}, %{id: 3, name: "Cristine"}]
iex> Explorer.DataFrame.new(rows)
#Explorer.DataFrame<
Polars[3 x 2]
id s64 [1, 2, 3]
name string ["José", "Christopher", "Cristine"]
>
iex> rows = [[id: 1, name: "José"], [id: 2, name: "Christopher"], [id: 3, name: "Cristine"]]
iex> Explorer.DataFrame.new(rows)
#Explorer.DataFrame<
Polars[3 x 2]
id s64 [1, 2, 3]
name string ["José", "Christopher", "Cristine"]
>
Converts a dataframe to a list of columns with lists as values.
See to_series/2
if you want a list of columns with series as values.
Note that this function does not take into account groups.
Warning
This is an expensive operation since it converts series to lists and doing so will copy the whole dataframe. Prefer to use the operations in this and the
Explorer.Series
module rather than the ones inEnum
whenever possible, as Explorer is optimized for large series.
Options
:atom_keys
- Configure if the resultant map should have atom keys. (default:false
)
Examples
iex> df = Explorer.DataFrame.new(ints: [1, nil], floats: [1.0, 2.0])
iex> Explorer.DataFrame.to_columns(df)
%{"floats" => [1.0, 2.0], "ints" => [1, nil]}
iex> df = Explorer.DataFrame.new(floats: [1.0, 2.0], ints: [1, nil])
iex> Explorer.DataFrame.to_columns(df, atom_keys: true)
%{floats: [1.0, 2.0], ints: [1, nil]}
Converts a dataframe to a list of maps (rows).
Warning
This is an expensive operation since data is stored in a columnar format. You must avoid converting a dataframe to rows, as that will transform and copy the whole dataframe in memory. Prefer to use the operations in this module rather than the ones in
Enum
whenever possible, as this module is optimized for large series.
Options
:atom_keys
- Configure if the resultant maps should have atom keys. (default:false
)
Examples
iex> df = Explorer.DataFrame.new(floats: [1.0, 2.0], ints: [1, nil])
iex> Explorer.DataFrame.to_rows(df)
[%{"floats" => 1.0, "ints" => 1}, %{"floats" => 2.0 ,"ints" => nil}]
iex> df = Explorer.DataFrame.new(floats: [1.0, 2.0], ints: [1, nil])
iex> Explorer.DataFrame.to_rows(df, atom_keys: true)
[%{floats: 1.0, ints: 1}, %{floats: 2.0, ints: nil}]
@spec to_rows_stream(df :: t(), Keyword.t()) :: Enumerable.t()
Converts a dataframe to a stream of maps (rows).
Warning
This is an expensive operation since data is stored in a columnar format. Prefer to use the operations in this module rather than the ones in
Enum
whenever possible, as this module is optimized for large series.
Options
:atom_keys
- Configure if the resultant maps should have atom keys. (default:false
):chunk_size
- Number of rows passed toto_rows/2
while streaming over the data. (default:1000
)
Examples
iex> df = Explorer.DataFrame.new(floats: [1.0, 2.0], ints: [1, nil])
iex> Explorer.DataFrame.to_rows_stream(df) |> Enum.map(& &1)
[%{"floats" => 1.0, "ints" => 1}, %{"floats" => 2.0 ,"ints" => nil}]
iex> df = Explorer.DataFrame.new(floats: [1.0, 2.0], ints: [1, nil])
iex> Explorer.DataFrame.to_rows_stream(df, atom_keys: true) |> Enum.map(& &1)
[%{floats: 1.0, ints: 1}, %{floats: 2.0, ints: nil}]
Converts a dataframe to a list of columns with series as values.
See to_columns/2
if you want a list of columns with lists as values.
Note that this function does not take into account groups.
Options
:atom_keys
- Configure if the resultant map should have atom keys. (default:false
)
Examples
iex> df = Explorer.DataFrame.new(ints: [1, nil], floats: [1.0, 2.0])
iex> map = Explorer.DataFrame.to_series(df)
iex> Explorer.Series.to_list(map["floats"])
[1.0, 2.0]
iex> Explorer.Series.to_list(map["ints"])
[1, nil]
Functions: Single-table
Calculates the pairwise correlation of numeric columns.
The returned dataframe is the correlation matrix.
Supported dtypes
Only columns with the following dtypes are taken into account.
- floats:
{:f, 32}
and{:f, 64}
- integers:
{:s, 8}
,{:s, 16}
,{:s, 32}
,{:s, 64}
,{:u, 8}
,{:u, 16}
,{:u, 32}
and{:u, 64}
The resultant columns are always {:f, 64}
.
Options
:columns
- the selection of columns to calculate. Defaults to all numeric columns.:column_name
- the name of the column with column names. Defaults to "names".:ddof
- the 'delta degrees of freedom' - the divisor used in the correlation calculation. Defaults to 1.:method
refers to the correlation method. The following methods are available::pearson
: Standard correlation coefficient. (default):spearman
: Spearman rank correlation.
Examples
iex> df = Explorer.DataFrame.new(dogs: [1, 8, 3], cats: [4, 5, 2])
iex> Explorer.DataFrame.correlation(df)
#Explorer.DataFrame<
Polars[2 x 3]
names string ["dogs", "cats"]
dogs f64 [1.0000000000000002, 0.5447047794019219]
cats f64 [0.5447047794019219, 1.0]
>
Calculates the pairwise covariance of numeric columns.
The returned dataframe is the covariance matrix.
Supported dtypes
Only columns with the following dtypes are taken into account.
- floats:
{:f, 32}
and{:f, 64}
- integers:
{:s, 8}
,{:s, 16}
,{:s, 32}
,{:s, 64}
,{:u, 8}
,{:u, 16}
,{:u, 32}
and{:u, 64}
The resultant columns are always {:f, 64}
.
Options
:columns
- the selection of columns to calculate. Defaults to all numeric columns.:column_name
- the name of the column with column names. Defaults to "names".:ddof
- the 'delta degrees of freedom' - the divisor used in the covariance calculation. Defaults to 1.
Examples
iex> df = Explorer.DataFrame.new(dogs: [1, 0, 2, 1], cats: [2, 3, 0, 1])
iex> Explorer.DataFrame.covariance(df)
#Explorer.DataFrame<
Polars[2 x 3]
names string ["dogs", "cats"]
dogs f64 [0.6666666666666666, -1.0]
cats f64 [-1.0, 1.6666666666666667]
>
Describe numeric columns of a DataFrame.
Groups are ignored if the dataframe is using any.
Options
:percentiles
- Floating point list with the percentiles to be calculated. (default:[0.25, 0.5, 0.75]
)
Examples
iex> df = Explorer.DataFrame.new(a: ["d", nil, "f"], b: [1, 2, 3], c: ["a", "b", "c"])
iex> Explorer.DataFrame.describe(df)
#Explorer.DataFrame<
Polars[9 x 4]
describe string ["count", "nil_count", "mean", "std", "min", ...]
a string ["2", "1", nil, nil, nil, ...]
b f64 [3.0, 0.0, 2.0, 1.0, 1.0, ...]
c string ["3", "0", nil, nil, nil, ...]
>
iex> df = Explorer.DataFrame.new(a: ["d", nil, "f"], b: [1, 2, 3], c: ["a", "b", "c"])
iex> Explorer.DataFrame.describe(df, percentiles: [0.3, 0.5, 0.8])
#Explorer.DataFrame<
Polars[9 x 4]
describe string ["count", "nil_count", "mean", "std", "min", ...]
a string ["2", "1", nil, nil, nil, ...]
b f64 [3.0, 0.0, 2.0, 1.0, 1.0, ...]
c string ["3", "0", nil, nil, nil, ...]
>
Discards a subset of columns by name.
It's important to notice that groups are kept: you can't discard grouping columns.
Examples
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.discard(df, ["b"])
#Explorer.DataFrame<
Polars[3 x 1]
a string ["a", "b", "c"]
>
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3], c: [4, 5, 6])
iex> Explorer.DataFrame.discard(df, ["a", "b"])
#Explorer.DataFrame<
Polars[3 x 1]
c s64 [4, 5, 6]
>
Ranges, regexes, and functions are also accepted in column names, as in select/2
.
Grouped examples
You cannot discard grouped columns. You need to ungroup before removing them:
iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.discard(grouped, ["species"])
#Explorer.DataFrame<
Polars[150 x 5]
Groups: ["species"]
sepal_length f64 [5.1, 4.9, 4.7, 4.6, 5.0, ...]
sepal_width f64 [3.5, 3.0, 3.2, 3.1, 3.6, ...]
petal_length f64 [1.4, 1.4, 1.3, 1.5, 1.4, ...]
petal_width f64 [0.2, 0.2, 0.2, 0.2, 0.2, ...]
species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", ...]
>
Takes distinct rows by a selection of columns.
Distinct is not affected by groups, although groups are kept in the
columns selection if keep_all
option is false (the default).
Options
:keep_all
- If set totrue
, keep all columns. Default isfalse
.
Examples
By default will return unique values of the requested columns:
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.distinct(df, ["year", "country"])
#Explorer.DataFrame<
Polars[1094 x 2]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
>
If keep_all
is set to true
, then the first value of each column not in the requested
columns will be returned:
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.distinct(df, ["year", "country"], keep_all: true)
#Explorer.DataFrame<
Polars[1094 x 10]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total s64 [2308, 1254, 32500, 141, 7924, ...]
solid_fuel s64 [627, 117, 332, 0, 0, ...]
liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
gas_fuel s64 [74, 7, 14565, 0, 374, ...]
cement s64 [5, 177, 2598, 0, 204, ...]
gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>
A callback on the dataframe's names can be passed instead of a list (like select/2
):
iex> df = Explorer.DataFrame.new(x1: [1, 3, 3], x2: ["a", "c", "c"], y1: [1, 2, 3])
iex> Explorer.DataFrame.distinct(df, &String.starts_with?(&1, "x"))
#Explorer.DataFrame<
Polars[2 x 2]
x1 s64 [1, 3]
x2 string ["a", "c"]
>
If the dataframe has groups, then the columns of each group will be added to the distinct columns:
iex> df = Explorer.DataFrame.new(x1: [1, 3, 3], x2: ["a", "c", "c"], y1: [1, 2, 3])
iex> df = Explorer.DataFrame.group_by(df, "x1")
iex> Explorer.DataFrame.distinct(df, ["x2"])
#Explorer.DataFrame<
Polars[2 x 2]
Groups: ["x1"]
x1 s64 [1, 3]
x2 string ["a", "c"]
>
Drop nil values.
Optionally accepts a subset of columns.
Examples
To drop nils on all columns:
iex> df = Explorer.DataFrame.new(a: [1, 2, nil], b: [1, nil, 3])
iex> Explorer.DataFrame.drop_nil(df)
#Explorer.DataFrame<
Polars[1 x 2]
a s64 [1]
b s64 [1]
>
To drop nils on a single column:
iex> df = Explorer.DataFrame.new(a: [1, 2, nil], b: [1, nil, 3])
iex> Explorer.DataFrame.drop_nil(df, :a)
#Explorer.DataFrame<
Polars[2 x 2]
a s64 [1, 2]
b s64 [1, nil]
>
To drop some columns:
iex> df = Explorer.DataFrame.new(a: [1, 2, nil], b: [1, nil, 3], c: [nil, 5, 6])
iex> Explorer.DataFrame.drop_nil(df, [:a, :c])
#Explorer.DataFrame<
Polars[1 x 3]
a s64 [2]
b s64 [nil]
c s64 [5]
>
Ranges, regexes, and functions are also accepted in column names, as in select/2
.
Turns a set of columns to dummy variables.
In case the dataframe is using groups, all groups will be removed.
Examples
To mark a single column as dummy:
iex> df = Explorer.DataFrame.new(col_x: ["a", "b", "a", "c"], col_y: ["b", "a", "b", "d"])
iex> Explorer.DataFrame.dummies(df, "col_x")
#Explorer.DataFrame<
Polars[4 x 3]
col_x_a u8 [1, 0, 1, 0]
col_x_b u8 [0, 1, 0, 0]
col_x_c u8 [0, 0, 0, 1]
>
Or multiple columns:
iex> df = Explorer.DataFrame.new(col_x: ["a", "b", "a", "c"], col_y: ["b", "a", "b", "d"])
iex> Explorer.DataFrame.dummies(df, ["col_x", "col_y"])
#Explorer.DataFrame<
Polars[4 x 6]
col_x_a u8 [1, 0, 1, 0]
col_x_b u8 [0, 1, 0, 0]
col_x_c u8 [0, 0, 0, 1]
col_y_b u8 [1, 0, 1, 0]
col_y_a u8 [0, 1, 0, 0]
col_y_d u8 [0, 0, 0, 1]
>
Or all string columns:
iex> df = Explorer.DataFrame.new(num: [1, 2, 3, 4], col_y: ["b", "a", "b", "d"])
iex> Explorer.DataFrame.dummies(df, fn _name, type -> type == :string end)
#Explorer.DataFrame<
Polars[4 x 3]
col_y_b u8 [1, 0, 1, 0]
col_y_a u8 [0, 1, 0, 0]
col_y_d u8 [0, 0, 0, 1]
>
Ranges, regexes, and functions are also accepted in column names, as in select/2
.
@spec explode(df :: t(), column :: column_name() | [column_name()]) :: t()
Explodes one or multiple list column into multiple rows.
When exploding multiple columns, the number of list elements in each row for the exploded columns must be the same.
Examples
iex> df = Explorer.DataFrame.new(a: [[1, 2], [3, 4]], b: [[5, 6], [7, 8]], c: ["a", "b"])
iex> Explorer.DataFrame.explode(df, :a)
#Explorer.DataFrame<
Polars[4 x 3]
a s64 [1, 2, 3, 4]
b list[s64] [[5, 6], [5, 6], [7, 8], [7, ...]]
c string ["a", "a", "b", "b"]
>
iex> Explorer.DataFrame.explode(df, [:a, :b])
#Explorer.DataFrame<
Polars[4 x 3]
a s64 [1, 2, 3, 4]
b s64 [5, 6, 7, 8]
c string ["a", "a", "b", "b"]
>
You can think of exploding multiple list columns as being the inverse of aggregating the elements the exploded columns into lists:
iex> df = Explorer.DataFrame.new(a: [1, 2, 3, 4], b: [5, 6, 7, 8], c: ["a", "a", "b", "b"])
iex> df = df |> Explorer.DataFrame.group_by(:c) |> Explorer.DataFrame.summarise(a: a, b: b)
#Explorer.DataFrame<
Polars[2 x 3]
c string ["a", "b"]
a list[s64] [[1, 2], [3, 4]]
b list[s64] [[5, 6], [7, 8]]
>
iex> Explorer.DataFrame.explode(df, [:a, :b]) # we are back where we started
#Explorer.DataFrame<
Polars[4 x 3]
c string ["a", "a", "b", "b"]
a s64 [1, 2, 3, 4]
b s64 [5, 6, 7, 8]
>
If you want to perform the cartesian product of two list columns, you must
call explode/2
once for each column:
iex> df = Explorer.DataFrame.new(a: [[1, 2], [3, 4]], b: [[5, 6], [7, 8]], c: ["a", "b"])
iex> df |> Explorer.DataFrame.explode(:a) |> Explorer.DataFrame.explode(:b)
#Explorer.DataFrame<
Polars[8 x 3]
a s64 [1, 1, 2, 2, 3, ...]
b s64 [5, 6, 5, 6, 7, ...]
c string ["a", "a", "a", "a", "b", ...]
>
Picks rows based on Explorer.Query
.
The query is compiled and runs efficiently against the dataframe.
The query must return a boolean expression or a list of boolean expressions.
When a list is returned, they are joined as and
expressions.
Notice
This is a macro. You must
require Explorer.DataFrame
before using it.
Besides element-wise series operations, you can also use window functions and aggregations inside comparisons. In such cases, grouped dataframes may have different results than ungrouped ones, because the filtering is computed withing groups. See examples below.
See filter_with/2
for a callback version of this function without
Explorer.Query
.
Examples
iex> df = Explorer.DataFrame.new(col1: ["a", "b", "c"], col2: [1, 2, 3])
iex> Explorer.DataFrame.filter(df, col2 > 2)
#Explorer.DataFrame<
Polars[1 x 2]
col1 string ["c"]
col2 s64 [3]
>
iex> df = Explorer.DataFrame.new(col1: ["a", "b", "c"], col2: [1, 2, 3])
iex> Explorer.DataFrame.filter(df, col1 == "b")
#Explorer.DataFrame<
Polars[1 x 2]
col1 string ["b"]
col2 s64 [2]
>
iex> df = Explorer.DataFrame.new(col1: [5, 4, 3], col2: [1, 2, 3])
iex> Explorer.DataFrame.filter(df, [col1 > 3, col2 < 3])
#Explorer.DataFrame<
Polars[2 x 2]
col1 s64 [5, 4]
col2 s64 [1, 2]
>
Returning a non-boolean expression errors:
iex> df = Explorer.DataFrame.new(col1: ["a", "b", "c"], col2: [1, 2, 3])
iex> Explorer.DataFrame.filter(df, cumulative_max(col2))
** (ArgumentError) expecting the function to return a boolean LazySeries, but instead it returned a LazySeries of type {:s, 64}
Which can be addressed by converting it to boolean:
iex> df = Explorer.DataFrame.new(col1: ["a", "b", "c"], col2: [1, 2, 3])
iex> Explorer.DataFrame.filter(df, cumulative_max(col2) == 1)
#Explorer.DataFrame<
Polars[1 x 2]
col1 string ["a"]
col2 s64 [1]
>
Grouped examples
In a grouped dataframe, the aggregation is calculated within each group.
In the following example we select the flowers of the Iris dataset that have the "petal length" above the average of each species group.
iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.filter(grouped, petal_length > mean(petal_length))
#Explorer.DataFrame<
Polars[79 x 5]
Groups: ["species"]
sepal_length f64 [4.6, 5.4, 5.0, 4.9, 5.4, ...]
sepal_width f64 [3.1, 3.9, 3.4, 3.1, 3.7, ...]
petal_length f64 [1.5, 1.7, 1.5, 1.5, 1.5, ...]
petal_width f64 [0.2, 0.4, 0.2, 0.1, 0.2, ...]
species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", ...]
>
@spec filter_with( df :: t(), callback :: (Explorer.Backend.LazyFrame.t() -> Explorer.Series.lazy_t()) ) :: t()
Picks rows based on a callback function.
The callback receives a lazy dataframe. A lazy dataframe does not hold any values, instead it stores all operations in order to execute all filtering performantly.
This is a callback version of filter/2
.
Examples
iex> df = Explorer.DataFrame.new(col1: ["a", "b", "c"], col2: [1, 2, 3])
iex> Explorer.DataFrame.filter_with(df, &Explorer.Series.greater(&1["col2"], 2))
#Explorer.DataFrame<
Polars[1 x 2]
col1 string ["c"]
col2 s64 [3]
>
iex> df = Explorer.DataFrame.new(col1: ["a", "b", "c"], col2: [1, 2, 3])
iex> Explorer.DataFrame.filter_with(df, fn df -> Explorer.Series.equal(df["col1"], "b") end)
#Explorer.DataFrame<
Polars[1 x 2]
col1 string ["b"]
col2 s64 [2]
>
Grouped examples
In a grouped dataframe, the aggregation is calculated within each group.
In the following example we select the flowers of the Iris dataset that have the "petal length" above the average of each species group.
iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.filter_with(grouped, &Explorer.Series.greater(&1["petal_length"], Explorer.Series.mean(&1["petal_length"])))
#Explorer.DataFrame<
Polars[79 x 5]
Groups: ["species"]
sepal_length f64 [4.6, 5.4, 5.0, 4.9, 5.4, ...]
sepal_width f64 [3.1, 3.9, 3.4, 3.1, 3.7, ...]
petal_length f64 [1.5, 1.7, 1.5, 1.5, 1.5, ...]
petal_width f64 [0.2, 0.4, 0.2, 0.1, 0.2, ...]
species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", ...]
>
@spec frequencies(df :: t(), columns :: column_names()) :: t()
Creates a new dataframe with unique rows and the frequencies of each.
Examples
iex> df = Explorer.DataFrame.new(a: ["a", "a", "b"], b: [1, 1, nil])
iex> Explorer.DataFrame.frequencies(df, [:a, :b])
#Explorer.DataFrame<
Polars[2 x 3]
a string ["a", "b"]
b s64 [1, nil]
counts u32 [2, 1]
>
@spec group_by(df :: t(), groups_or_group :: column_names() | column_name()) :: t()
Group the dataframe by one or more variables.
When the dataframe has grouping variables, operations are performed per group.
Explorer.DataFrame.ungroup/2
removes grouping.
Examples
You can group by a single variable:
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.group_by(df, "country")
#Explorer.DataFrame<
Polars[1094 x 10]
Groups: ["country"]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total s64 [2308, 1254, 32500, 141, 7924, ...]
solid_fuel s64 [627, 117, 332, 0, 0, ...]
liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
gas_fuel s64 [74, 7, 14565, 0, 374, ...]
cement s64 [5, 177, 2598, 0, 204, ...]
gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>
Or you can group by multiple columns in a given list:
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.group_by(df, ["country", "year"])
#Explorer.DataFrame<
Polars[1094 x 10]
Groups: ["country", "year"]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total s64 [2308, 1254, 32500, 141, 7924, ...]
solid_fuel s64 [627, 117, 332, 0, 0, ...]
liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
gas_fuel s64 [74, 7, 14565, 0, 374, ...]
cement s64 [5, 177, 2598, 0, 204, ...]
gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>
Or by a range:
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.group_by(df, 0..1)
#Explorer.DataFrame<
Polars[1094 x 10]
Groups: ["year", "country"]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total s64 [2308, 1254, 32500, 141, 7924, ...]
solid_fuel s64 [627, 117, 332, 0, 0, ...]
liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
gas_fuel s64 [74, 7, 14565, 0, 374, ...]
cement s64 [5, 177, 2598, 0, 204, ...]
gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>
Regexes and functions are also accepted in column names, as in select/2
.
@spec mask(df :: t(), mask :: Explorer.Series.t() | [boolean()]) :: t()
Picks rows based on a list or series of values.
Examples
This function must only be used when you need to select rows based on external values that are not available to the dataframe. For example, you can pass a list:
iex> df = Explorer.DataFrame.new(col1: ["a", "b", "c"], col2: [1, 2, 3])
iex> Explorer.DataFrame.mask(df, [false, true, false])
#Explorer.DataFrame<
Polars[1 x 2]
col1 string ["b"]
col2 s64 [2]
>
You must avoid using masks when the masks themselves are computed from other columns. For example, DO NOT do this:
iex> df = Explorer.DataFrame.new(col1: ["a", "b", "c"], col2: [1, 2, 3])
iex> Explorer.DataFrame.mask(df, Explorer.Series.greater(df["col2"], 1))
#Explorer.DataFrame<
Polars[2 x 2]
col1 string ["b", "c"]
col2 s64 [2, 3]
>
Instead, do this:
iex> df = Explorer.DataFrame.new(col1: ["a", "b", "c"], col2: [1, 2, 3])
iex> Explorer.DataFrame.filter_with(df, fn df -> Explorer.Series.greater(df["col2"], 1) end)
#Explorer.DataFrame<
Polars[2 x 2]
col1 string ["b", "c"]
col2 s64 [2, 3]
>
The filter_with/2
version is much more efficient because it doesn't need
to create intermediate series representations to apply the mask.
Creates or modifies columns based on Explorer.Query
.
The query is compiled and runs efficiently against the dataframe. New variables overwrite existing variables of the same name. Column names are coerced from atoms to strings.
Notice
This is a macro. You must
require Explorer.DataFrame
before using it.
Besides element-wise series operations, you can also use window functions and aggregations inside mutations. In such cases, grouped dataframes may have different results than ungrouped ones, because the mutation is computed withing groups. See examples below.
See mutate_with/2
for a callback version of this function without
Explorer.Query
. If your mutation cannot be expressed with queries,
you may compute the values using the Explorer.Series
API directly
and then add it to the dataframe using put/3
.
Examples
Mutations are useful to add or modify columns in your dataframe:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.mutate(df, c: b + 1)
#Explorer.DataFrame<
Polars[3 x 3]
a string ["a", "b", "c"]
b s64 [1, 2, 3]
c s64 [2, 3, 4]
>
It's also possible to overwrite existing columns:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.mutate(df, a: b * 2)
#Explorer.DataFrame<
Polars[3 x 2]
a s64 [2, 4, 6]
b s64 [1, 2, 3]
>
Scalar values are repeated to fill the series:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.mutate(df, a: 4)
#Explorer.DataFrame<
Polars[3 x 2]
a s64 [4, 4, 4]
b s64 [1, 2, 3]
>
It's also possible to use functions from the Series module, like Explorer.Series.window_sum/3
:
iex> df = Explorer.DataFrame.new(a: [1, 2, 3])
iex> Explorer.DataFrame.mutate(df, b: window_sum(a, 2))
#Explorer.DataFrame<
Polars[3 x 2]
a s64 [1, 2, 3]
b s64 [1, 3, 5]
>
Alternatively, all of the above works with a map instead of a keyword list:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.mutate(df, %{"c" => cast(b, {:f, 64})})
#Explorer.DataFrame<
Polars[3 x 3]
a string ["a", "b", "c"]
b s64 [1, 2, 3]
c f64 [1.0, 2.0, 3.0]
>
Grouped examples
Mutations in grouped dataframes takes the context of the group.
This enables some aggregations to be made considering each group. It's almost like summarise/2
,
but repeating the results for each member in the group.
For example, if we want to count how many elements of a given group, we can add a new
column with that aggregation:
iex> df = Explorer.DataFrame.new(id: ["a", "a", "b"], b: [1, 2, 3])
iex> grouped = Explorer.DataFrame.group_by(df, :id)
iex> Explorer.DataFrame.mutate(grouped, count: count(b))
#Explorer.DataFrame<
Polars[3 x 3]
Groups: ["id"]
id string ["a", "a", "b"]
b s64 [1, 2, 3]
count u32 [2, 2, 1]
>
In case we want to get the average size of the petal length from the Iris dataset, we can:
iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.mutate(grouped, petal_length_avg: mean(petal_length))
#Explorer.DataFrame<
Polars[150 x 6]
Groups: ["species"]
sepal_length f64 [5.1, 4.9, 4.7, 4.6, 5.0, ...]
sepal_width f64 [3.5, 3.0, 3.2, 3.1, 3.6, ...]
petal_length f64 [1.4, 1.4, 1.3, 1.5, 1.4, ...]
petal_width f64 [0.2, 0.2, 0.2, 0.2, 0.2, ...]
species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", ...]
petal_length_avg f64 [1.464, 1.464, 1.464, 1.464, 1.464, ...]
>
@spec mutate_with( df :: t(), callback :: (Explorer.Backend.LazyFrame.t() -> column_pairs(Explorer.Series.lazy_t())) ) :: t()
Creates or modifies columns using a callback function.
The callback receives a lazy dataframe. A lazy dataframe doesn't hold any values, instead it stores all operations in order to execute all mutations performantly.
This is a callback version of mutate/2
. If your mutation
cannot be expressed with lazy dataframes, you may compute the
values using the Explorer.Series
API directly and then add
it to the dataframe using put/3
.
Examples
Here is an example of a new column that sums the value of two other columns:
iex> df = Explorer.DataFrame.new(a: [4, 5, 6], b: [1, 2, 3])
iex> Explorer.DataFrame.mutate_with(df, &[c: Explorer.Series.add(&1["a"], &1["b"])])
#Explorer.DataFrame<
Polars[3 x 3]
a s64 [4, 5, 6]
b s64 [1, 2, 3]
c s64 [5, 7, 9]
>
You can overwrite existing columns as well:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.mutate_with(df, &[b: Explorer.Series.pow(&1["b"], 2)])
#Explorer.DataFrame<
Polars[3 x 2]
a string ["a", "b", "c"]
b f64 [1.0, 4.0, 9.0]
>
It's possible to "reuse" a variable for different computations:
iex> df = Explorer.DataFrame.new(a: [4, 5, 6], b: [1, 2, 3])
iex> Explorer.DataFrame.mutate_with(df, fn ldf ->
iex> c = Explorer.Series.add(ldf["a"], ldf["b"])
iex> [c: c, d: Explorer.Series.window_sum(c, 2)]
iex> end)
#Explorer.DataFrame<
Polars[3 x 4]
a s64 [4, 5, 6]
b s64 [1, 2, 3]
c s64 [5, 7, 9]
d s64 [5, 12, 16]
>
Grouped examples
Mutations in grouped dataframes takes the context of the group. For example, if we want to count how many elements of a given group, we can add a new column with that aggregation:
iex> df = Explorer.DataFrame.new(id: ["a", "a", "b"], b: [1, 2, 3])
iex> grouped = Explorer.DataFrame.group_by(df, :id)
iex> Explorer.DataFrame.mutate_with(grouped, &[count: Explorer.Series.count(&1["b"])])
#Explorer.DataFrame<
Polars[3 x 3]
Groups: ["id"]
id string ["a", "a", "b"]
b s64 [1, 2, 3]
count u32 [2, 2, 1]
>
Counts the number of nil
elements in each column.
Examples
iex> df = Explorer.DataFrame.new(a: ["d", nil, "f"], b: [nil, 2, nil], c: ["a", "b", "c"])
iex> Explorer.DataFrame.nil_count(df)
#Explorer.DataFrame<
Polars[1 x 3]
a u32 [1]
b u32 [2]
c u32 [0]
>
Pivot data from wide to long.
pivot_longer/3
"lengthens" data, increasing the number of rows and
decreasing the number of columns. The inverse transformation is pivot_wider/4
.
The second argument, columns_to_pivot
, can be either list of column names to pivot.
Ranges, regexes, and functions are also accepted in column names, as in select/2
.
The selected columns must always have the same data type.
In case the dataframe is using groups, the groups that are also in the list of columns to pivot will be removed from the resultant dataframe. See the examples below.
Options
:select
- Columns that are not in the list of pivot and should be kept in the dataframe. Ranges, regexes, and functions are also accepted in column names, as inselect/2
. Defaults to all columns except the ones to pivot.:discard
- Columns that are not in the list of pivot and should be dropped from the dataframe. Ranges, regexes, and functions are also accepted in column names, as inselect/2
. This list of columns is going to be subtracted from the list ofselect
. Defaults to an empty list.:names_to
- A string specifying the name of the column to create from the data stored in the column names of the dataframe. Defaults to"variable"
.:values_to
- A string specifying the name of the column to create from the data stored in series element values. Defaults to"value"
.
Examples
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.pivot_longer(df, &String.ends_with?(&1, "fuel"))
#Explorer.DataFrame<
Polars[3282 x 9]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total s64 [2308, 1254, 32500, 141, 7924, ...]
cement s64 [5, 177, 2598, 0, 204, ...]
gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels s64 [9, 7, 663, 0, 321, ...]
variable string ["solid_fuel", "solid_fuel", "solid_fuel", "solid_fuel", "solid_fuel", ...]
value s64 [627, 117, 332, 0, 0, ...]
>
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.pivot_longer(df, &String.ends_with?(&1, "fuel"), select: ["year", "country"])
#Explorer.DataFrame<
Polars[3282 x 4]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
variable string ["solid_fuel", "solid_fuel", "solid_fuel", "solid_fuel", "solid_fuel", ...]
value s64 [627, 117, 332, 0, 0, ...]
>
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.pivot_longer(df, ["total"], select: ["year", "country"], discard: ["country"])
#Explorer.DataFrame<
Polars[1094 x 3]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
variable string ["total", "total", "total", "total", "total", ...]
value s64 [2308, 1254, 32500, 141, 7924, ...]
>
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.pivot_longer(df, ["total"], select: [], names_to: "my_var", values_to: "my_value")
#Explorer.DataFrame<
Polars[1094 x 2]
my_var string ["total", "total", "total", "total", "total", ...]
my_value s64 [2308, 1254, 32500, 141, 7924, ...]
>
Grouped examples
In the following example we want to take the Iris dataset and increase the number of rows by pivoting the "sepal_length" column. This dataset is grouped by "species", so the resultant dataframe is going to keep the "species" group:
iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.pivot_longer(grouped, ["sepal_length"])
#Explorer.DataFrame<
Polars[150 x 6]
Groups: ["species"]
sepal_width f64 [3.5, 3.0, 3.2, 3.1, 3.6, ...]
petal_length f64 [1.4, 1.4, 1.3, 1.5, 1.4, ...]
petal_width f64 [0.2, 0.2, 0.2, 0.2, 0.2, ...]
species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", ...]
variable string ["sepal_length", "sepal_length", "sepal_length", "sepal_length", "sepal_length", ...]
value f64 [5.1, 4.9, 4.7, 4.6, 5.0, ...]
>
Now we want to do something different: we want to pivot the "species" column that is also a group. This is going to remove the group in the resultant dataframe:
iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.pivot_longer(grouped, ["species"])
#Explorer.DataFrame<
Polars[150 x 6]
sepal_length f64 [5.1, 4.9, 4.7, 4.6, 5.0, ...]
sepal_width f64 [3.5, 3.0, 3.2, 3.1, 3.6, ...]
petal_length f64 [1.4, 1.4, 1.3, 1.5, 1.4, ...]
petal_width f64 [0.2, 0.2, 0.2, 0.2, 0.2, ...]
variable string ["species", "species", "species", "species", "species", ...]
value string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", ...]
>
@spec pivot_wider( df :: t(), names_from :: column(), values_from :: column() | columns(), opts :: Keyword.t() ) :: t()
Pivot data from long to wide.
pivot_wider/4
"widens" data, increasing the number of columns and decreasing the number of rows.
The inverse transformation is pivot_longer/3
.
In case the dataframe is using groups, the groups that are also in the list of columns to pivot will be removed from the resultant dataframe. See the examples below.
Options
:id_columns
- A set of columns that uniquely identifies each observation.Defaults to all columns in data except for the columns specified in
names_from
andvalues_from
, and columns that are of the{:f, 64}
dtype.Typically used when you have redundant variables, i.e. variables whose values are perfectly correlated with existing variables. May accept a filter callback, a list or a range of column names. If an empty list is passed, or a range that results in a empty list of column names, it raises an error.
ID columns cannot be of the float type and any columns of this dtype is discarded. If you need to use float columns as IDs, you must carefully consider rounding or truncating the column and converting it to integer, as long as doing so preserves the properties of the column.
:names_prefix
- String added to the start of every variable name. This is particularly useful ifnames_from
is a numeric vector and you want to create syntactic variable names.
Examples
Suppose we have a basketball court and multiple teams that want to train in that court. They need to share a schedule with the hours each team is going to use it. Here is a dataframe representing that schedule:
iex> Explorer.DataFrame.new(
iex> weekday: ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday"],
iex> team: ["A", "B", "C", "A", "B", "C", "A", "B", "C", "A"],
iex> hour: [10, 9, 10, 10, 11, 15, 14, 16, 14, 16]
iex> )
This dataframe is going to look like this - using print/2
:
+----------------------------------------------+
| Explorer DataFrame: [rows: 10, columns: 3] |
+---------------+--------------+---------------+
| weekday | team | hour |
| <string> | <string> | <integer> |
+===============+==============+===============+
| Monday | A | 10 |
+---------------+--------------+---------------+
| Tuesday | B | 9 |
+---------------+--------------+---------------+
| Wednesday | C | 10 |
+---------------+--------------+---------------+
| Thursday | A | 10 |
+---------------+--------------+---------------+
| Friday | B | 11 |
+---------------+--------------+---------------+
| Monday | C | 15 |
+---------------+--------------+---------------+
| Tuesday | A | 14 |
+---------------+--------------+---------------+
| Wednesday | B | 16 |
+---------------+--------------+---------------+
| Thursday | C | 14 |
+---------------+--------------+---------------+
| Friday | A | 16 |
+---------------+--------------+---------------+
You can see that the "weekday" repeats, and it's not clear how free the agenda is. We can solve that by pivoting the "weekday" column in multiple columns, making each weekday a new column in the resultant dataframe.
iex> df = Explorer.DataFrame.new(
iex> weekday: ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday"],
iex> team: ["A", "B", "C", "A", "B", "C", "A", "B", "C", "A"],
iex> hour: [10, 9, 10, 10, 11, 15, 14, 16, 14, 16]
iex> )
iex> Explorer.DataFrame.pivot_wider(df, "weekday", "hour")
#Explorer.DataFrame<
Polars[3 x 6]
team string ["A", "B", "C"]
Monday s64 [10, nil, 15]
Tuesday s64 [14, 9, nil]
Wednesday s64 [nil, 16, 10]
Thursday s64 [10, nil, 14]
Friday s64 [16, 11, nil]
>
Now if we print that same dataframe with print/2
, we get a better picture of the schedule:
+----------------------------------------------------------------------+
| Explorer DataFrame: [rows: 3, columns: 6] |
+----------+-----------+-----------+-----------+-----------+-----------+
| team | Monday | Tuesday | Wednesday | Thursday | Friday |
| <string> | <integer> | <integer> | <integer> | <integer> | <integer> |
+==========+===========+===========+===========+===========+===========+
| A | 10 | 14 | | 10 | 16 |
+----------+-----------+-----------+-----------+-----------+-----------+
| B | | 9 | 16 | | 11 |
+----------+-----------+-----------+-----------+-----------+-----------+
| C | 15 | | 10 | 14 | |
+----------+-----------+-----------+-----------+-----------+-----------+
Pivot wider can create unpredictable column names, and sometimes they can conflict with ID columns. In that scenario, we add a number as suffix to duplicated column names. Here is an example:
iex> df = Explorer.DataFrame.new(
iex> product_id: [1, 1, 1, 1, 2, 2, 2, 2],
iex> property: ["product_id", "width_cm", "height_cm", "length_cm", "product_id", "width_cm", "height_cm", "length_cm"],
iex> property_value: [1, 42, 40, 64, 2, 35, 20, 40]
iex> )
iex> Explorer.DataFrame.pivot_wider(df, "property", "property_value")
#Explorer.DataFrame<
Polars[2 x 5]
product_id s64 [1, 2]
product_id_1 s64 [1, 2]
width_cm s64 [42, 35]
height_cm s64 [40, 20]
length_cm s64 [64, 40]
>
But if the option :names_prefix
is used, that suffix is not added:
iex> df = Explorer.DataFrame.new(
iex> product_id: [1, 1, 1, 1, 2, 2, 2, 2],
iex> property: ["product_id", "width_cm", "height_cm", "length_cm", "product_id", "width_cm", "height_cm", "length_cm"],
iex> property_value: [1, 42, 40, 64, 2, 35, 20, 40]
iex> )
iex> Explorer.DataFrame.pivot_wider(df, "property", "property_value", names_prefix: "col_")
#Explorer.DataFrame<
Polars[2 x 5]
product_id s64 [1, 2]
col_product_id s64 [1, 2]
col_width_cm s64 [42, 35]
col_height_cm s64 [40, 20]
col_length_cm s64 [64, 40]
>
Multiple columns are accepted for the values_from
parameter, but the behaviour is slightly
different for the naming of new columns in the resultant dataframe. The new columns are going
to be prefixed by the name of the original value column, followed by an underscore and the
original column name, followed by the name of the variable.
iex> df = Explorer.DataFrame.new(
iex> product_id: [1, 1, 1, 1, 2, 2, 2, 2],
iex> property: ["product_id", "width_cm", "height_cm", "length_cm", "product_id", "width_cm", "height_cm", "length_cm"],
iex> property_value: [1, 42, 40, 64, 2, 35, 20, 40],
iex> another_value: [1, 43, 41, 65, 2, 36, 21, 42]
iex> )
iex> Explorer.DataFrame.pivot_wider(df, "property", ["property_value", "another_value"])
#Explorer.DataFrame<
Polars[2 x 9]
product_id s64 [1, 2]
property_value_property_product_id s64 [1, 2]
property_value_property_width_cm s64 [42, 35]
property_value_property_height_cm s64 [40, 20]
property_value_property_length_cm s64 [64, 40]
another_value_property_product_id s64 [1, 2]
another_value_property_width_cm s64 [43, 36]
another_value_property_height_cm s64 [41, 21]
another_value_property_length_cm s64 [65, 42]
>
Grouped examples
Now using the same idea, we can see that there is not much difference for grouped dataframes. The only detail is that groups that are not ID columns are discarded.
iex> df = Explorer.DataFrame.new(
iex> weekday: ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday"],
iex> team: ["A", "B", "C", "A", "B", "C", "A", "B", "C", "A"],
iex> hour: [10, 9, 10, 10, 11, 15, 14, 16, 14, 16]
iex> )
iex> grouped = Explorer.DataFrame.group_by(df, "team")
iex> Explorer.DataFrame.pivot_wider(grouped, "weekday", "hour")
#Explorer.DataFrame<
Polars[3 x 6]
Groups: ["team"]
team string ["A", "B", "C"]
Monday s64 [10, nil, 15]
Tuesday s64 [14, 9, nil]
Wednesday s64 [nil, 16, 10]
Thursday s64 [10, nil, 14]
Friday s64 [16, 11, nil]
>
In the following example the group "weekday" is going to be removed, because the column is going to be pivoted in multiple columns:
iex> df = Explorer.DataFrame.new(
iex> weekday: ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday"],
iex> team: ["A", "B", "C", "A", "B", "C", "A", "B", "C", "A"],
iex> hour: [10, 9, 10, 10, 11, 15, 14, 16, 14, 16]
iex> )
iex> grouped = Explorer.DataFrame.group_by(df, "weekday")
iex> Explorer.DataFrame.pivot_wider(grouped, "weekday", "hour")
#Explorer.DataFrame<
Polars[3 x 6]
team string ["A", "B", "C"]
Monday s64 [10, nil, 15]
Tuesday s64 [14, 9, nil]
Wednesday s64 [nil, 16, 10]
Thursday s64 [10, nil, 14]
Friday s64 [16, 11, nil]
>
@spec pull(df :: t(), column :: column()) :: Explorer.Series.t()
Extracts a single column as a series.
This is equivalent to df[field]
for retrieving a single field.
The returned series will have its :name
field set to the column name.
Examples
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.pull(df, "total")
#Explorer.Series<
Polars[1094]
s64 [2308, 1254, 32500, 141, 7924, 41, 143, 51246, 1150, 684, 106589, 18408, 8366, 451, 7981, 16345, 403, 17192, 30222, 147, 1388, 166, 133, 5802, 1278, 114468, 47, 2237, 12030, 535, 58, 1367, 145806, 152, 152, 72, 141, 19703, 2393248, 20773, 44, 540, 19, 2064, 1900, 5501, 10465, 2102, 30428, 18122, ...]
>
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.pull(df, 2)
#Explorer.Series<
Polars[1094]
s64 [2308, 1254, 32500, 141, 7924, 41, 143, 51246, 1150, 684, 106589, 18408, 8366, 451, 7981, 16345, 403, 17192, 30222, 147, 1388, 166, 133, 5802, 1278, 114468, 47, 2237, 12030, 535, 58, 1367, 145806, 152, 152, 72, 141, 19703, 2393248, 20773, 44, 540, 19, 2064, 1900, 5501, 10465, 2102, 30428, 18122, ...]
>
Creates or modifies a single column.
This is a simplified way to add or modify one column, accepting a series, tensor, or a list.
If you are computing a series, it is preferrable to use
mutate/2
or mutate_with/2
to compute the series and
modify it in a single step, as it is more powerful and
it handles both expressions and scalar values accordingly.
If you are passing tensors or lists, they will be automatically
converted to a series. By default, the new series will have the
same dtype as the existing series, unless the :dtype
option
is given. If there is no existing series, one is inferred from
the tensor/list.
Examples
iex> df = Explorer.DataFrame.new(a: [1, 2, 3])
iex> Explorer.DataFrame.put(df, :b, Explorer.Series.transform(df[:a], fn n -> n * 2 end))
#Explorer.DataFrame<
Polars[3 x 2]
a s64 [1, 2, 3]
b s64 [2, 4, 6]
>
iex> df = Explorer.DataFrame.new(a: [1, 2, 3])
iex> Explorer.DataFrame.put(df, :b, Explorer.Series.from_list([4, 5, 6]))
#Explorer.DataFrame<
Polars[3 x 2]
a s64 [1, 2, 3]
b s64 [4, 5, 6]
>
Grouped examples
If the dataframe is grouped, put/3
is going to ignore the groups.
So the series must be of the same size of the entire dataframe.
iex> df = Explorer.DataFrame.new(a: [1, 2, 3])
iex> grouped = Explorer.DataFrame.group_by(df, "a")
iex> series = Explorer.Series.from_list([9, 8, 7])
iex> Explorer.DataFrame.put(grouped, :b, series)
#Explorer.DataFrame<
Polars[3 x 2]
Groups: ["a"]
a s64 [1, 2, 3]
b s64 [9, 8, 7]
>
Tensor examples
You can also put tensors into the dataframe:
iex> df = Explorer.DataFrame.new([])
iex> Explorer.DataFrame.put(df, :a, Nx.tensor([1, 2, 3]))
#Explorer.DataFrame<
Polars[3 x 1]
a s64 [1, 2, 3]
>
You can specify which dtype the tensor represents. For example, a tensor of s64 represents integers by default, but it may also represent timestamps in microseconds from the Unix epoch:
iex> df = Explorer.DataFrame.new([])
iex> Explorer.DataFrame.put(df, :a, Nx.tensor([1, 2, 3]), dtype: {:datetime, :microsecond})
#Explorer.DataFrame<
Polars[3 x 1]
a datetime[μs] [1970-01-01 00:00:00.000001, 1970-01-01 00:00:00.000002, 1970-01-01 00:00:00.000003]
>
If there is already a column where we want to place the tensor, the column dtype will be automatically used, this means that updating dataframes in place while preserving their types is straight-forward:
iex> df = Explorer.DataFrame.new(a: [~N[1970-01-01 00:00:00]])
iex> Explorer.DataFrame.put(df, :a, Nx.tensor(529550625987654))
#Explorer.DataFrame<
Polars[1 x 1]
a datetime[μs] [1986-10-13 01:23:45.987654]
>
This is particularly useful for categorical columns:
iex> cat = Explorer.Series.from_list(["foo", "bar", "baz"], dtype: :category)
iex> df = Explorer.DataFrame.new(a: cat)
iex> Explorer.DataFrame.put(df, :a, Nx.tensor([2, 1, 0]))
#Explorer.DataFrame<
Polars[3 x 1]
a category ["baz", "bar", "foo"]
>
On the other hand, if you try to put a floating tensor on
an integer column, an error will be raised unless a dtype
or dtype: :infer
is given:
iex> df = Explorer.DataFrame.new(a: [1, 2, 3])
iex> Explorer.DataFrame.put(df, :a, Nx.tensor(1.0, type: :f64))
** (ArgumentError) dtype {:s, 64} expects a tensor of type {:s, 64} but got type {:f, 64}
iex> df = Explorer.DataFrame.new(a: [1, 2, 3])
iex> Explorer.DataFrame.put(df, :a, Nx.tensor(1.0, type: :f64), dtype: {:f, 64})
#Explorer.DataFrame<
Polars[3 x 1]
a f64 [1.0, 1.0, 1.0]
>
iex> df = Explorer.DataFrame.new(a: [1, 2, 3])
iex> Explorer.DataFrame.put(df, :a, Nx.tensor(1.0, type: :f64), dtype: :infer)
#Explorer.DataFrame<
Polars[3 x 1]
a f64 [1.0, 1.0, 1.0]
>
List examples
Similar to tensors, we can also put lists in the dataframe:
iex> df = Explorer.DataFrame.new([])
iex> Explorer.DataFrame.put(df, :a, [1, 2, 3])
#Explorer.DataFrame<
Polars[3 x 1]
a s64 [1, 2, 3]
>
The same considerations as above apply.
Relocates columns.
Change column order within a DataFrame. The before
and after
options are mutually exclusive.
Providing no options will relocate the columns to beginning of the DataFrame.
Options
:before
- Specifies to relocate before the given column. It can be an index or a column name.:after
- Specifies to relocate after the given column. It can be an index or a column name.
Examples
Relocate a single column
iex> df = Explorer.DataFrame.new(a: ["a", "b", "a"], b: [1, 3, 1], c: [nil, 5, 6])
iex> Explorer.DataFrame.relocate(df, "a", after: "c")
#Explorer.DataFrame<
Polars[3 x 3]
b s64 [1, 3, 1]
c s64 [nil, 5, 6]
a string ["a", "b", "a"]
>
Relocate (and reorder) multiple columns to the beginning
iex> df = Explorer.DataFrame.new(a: [1, 2], b: [5.1, 5.2], c: [4, 5], d: ["yes", "no"])
iex> Explorer.DataFrame.relocate(df, ["d", 1], before: 0)
#Explorer.DataFrame<
Polars[2 x 4]
d string ["yes", "no"]
b f64 [5.1, 5.2]
a s64 [1, 2]
c s64 [4, 5]
>
Relocate before another column
iex> df = Explorer.DataFrame.new(a: [1, 2], b: [5.1, 5.2], c: [4, 5], d: ["yes", "no"])
iex> Explorer.DataFrame.relocate(df, ["a", "c"], before: "b")
#Explorer.DataFrame<
Polars[2 x 4]
a s64 [1, 2]
c s64 [4, 5]
b f64 [5.1, 5.2]
d string ["yes", "no"]
>
@spec rename( df :: t(), names :: column_names() | column_pairs(column_name()) ) :: t()
Renames columns.
Renaming a column that is also a group is going to rename the group as well.
To apply a function to a subset of columns, see rename_with/3
.
Examples
You can pass in a list of new names:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "a"], b: [1, 3, 1])
iex> Explorer.DataFrame.rename(df, ["c", "d"])
#Explorer.DataFrame<
Polars[3 x 2]
c string ["a", "b", "a"]
d s64 [1, 3, 1]
>
Or you can rename individual columns using keyword args:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "a"], b: [1, 3, 1])
iex> Explorer.DataFrame.rename(df, a: "first")
#Explorer.DataFrame<
Polars[3 x 2]
first string ["a", "b", "a"]
b s64 [1, 3, 1]
>
Or you can rename individual columns using a map:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "a"], b: [1, 3, 1])
iex> Explorer.DataFrame.rename(df, %{"a" => "first"})
#Explorer.DataFrame<
Polars[3 x 2]
first string ["a", "b", "a"]
b s64 [1, 3, 1]
>
Renames columns with a function.
Renaming a column that is also a group is going to rename the group as well.
Examples
If no columns are specified, it will apply the function to all column names:
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.rename_with(df, &String.upcase/1)
#Explorer.DataFrame<
Polars[1094 x 10]
YEAR s64 [2010, 2010, 2010, 2010, 2010, ...]
COUNTRY string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
TOTAL s64 [2308, 1254, 32500, 141, 7924, ...]
SOLID_FUEL s64 [627, 117, 332, 0, 0, ...]
LIQUID_FUEL s64 [1601, 953, 12381, 141, 3649, ...]
GAS_FUEL s64 [74, 7, 14565, 0, 374, ...]
CEMENT s64 [5, 177, 2598, 0, 204, ...]
GAS_FLARING s64 [0, 0, 2623, 0, 3697, ...]
PER_CAPITA f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
BUNKER_FUELS s64 [9, 7, 663, 0, 321, ...]
>
A callback can be used to filter the column names that will be renamed, similarly to select/2
:
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.rename_with(df, &String.ends_with?(&1, "_fuel"), &String.trim_trailing(&1, "_fuel"))
#Explorer.DataFrame<
Polars[1094 x 10]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total s64 [2308, 1254, 32500, 141, 7924, ...]
solid s64 [627, 117, 332, 0, 0, ...]
liquid s64 [1601, 953, 12381, 141, 3649, ...]
gas s64 [74, 7, 14565, 0, 374, ...]
cement s64 [5, 177, 2598, 0, 204, ...]
gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>
Or you can just pass in the list of column names you'd like to apply the function to:
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.rename_with(df, ["total", "cement"], &String.upcase/1)
#Explorer.DataFrame<
Polars[1094 x 10]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
TOTAL s64 [2308, 1254, 32500, 141, 7924, ...]
solid_fuel s64 [627, 117, 332, 0, 0, ...]
liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
gas_fuel s64 [74, 7, 14565, 0, 374, ...]
CEMENT s64 [5, 177, 2598, 0, 204, ...]
gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>
Ranges, regexes, and functions are also accepted in column names, as in select/2
.
Selects a subset of columns by name.
It's important to notice that groups are kept: you can't select off grouping columns.
Examples
You can select a single column:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.select(df, "a")
#Explorer.DataFrame<
Polars[3 x 1]
a string ["a", "b", "c"]
>
Or a list of names:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.select(df, ["a"])
#Explorer.DataFrame<
Polars[3 x 1]
a string ["a", "b", "c"]
>
You can also use a range or a list of integers:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3], c: [4, 5, 6])
iex> Explorer.DataFrame.select(df, [0, 1])
#Explorer.DataFrame<
Polars[3 x 2]
a string ["a", "b", "c"]
b s64 [1, 2, 3]
>
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3], c: [4, 5, 6])
iex> Explorer.DataFrame.select(df, 0..1)
#Explorer.DataFrame<
Polars[3 x 2]
a string ["a", "b", "c"]
b s64 [1, 2, 3]
>
Or you can use a callback function that takes the dataframe's names as its first argument:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.select(df, &String.starts_with?(&1, "b"))
#Explorer.DataFrame<
Polars[3 x 1]
b s64 [1, 2, 3]
>
Or, if you prefer, a regex:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.select(df, ~r/^b$/)
#Explorer.DataFrame<
Polars[3 x 1]
b s64 [1, 2, 3]
>
Or a callback function that takes names and types:
iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.select(df, fn _name, type -> type == {:s, 64} end)
#Explorer.DataFrame<
Polars[3 x 1]
b s64 [1, 2, 3]
>
Grouped examples
Columns that are also groups cannot be removed, you need to ungroup before removing these columns.
iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.select(grouped, ["sepal_width"])
#Explorer.DataFrame<
Polars[150 x 2]
Groups: ["species"]
sepal_width f64 [3.5, 3.0, 3.2, 3.1, 3.6, ...]
species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", ...]
>
Sorts rows by columns using Explorer.Query
.
Notice
This is a macro. You must
require Explorer.DataFrame
before using it.
See sort_with/2
for a callback version of this function without
Explorer.Query
.
Options
:nils
-:first
or:last
. By default it is:last
if direction is:asc
, and:first
otherwise.:parallel
- boolean. Whether to parallelize the sorting. By default it istrue
.Parallel sort isn't available on certain lazy operations. In those situations this option is ignored.
:stable
- boolean. Determines if the sorting is stable (ties are guaranteed to maintain their order) or not. Unstable sorting may be more performant. By default it isfalse
.
Examples
A single column name will sort ascending by that column:
iex> df = Explorer.DataFrame.new(a: ["b", "c", "a"], b: [1, 2, 3])
iex> Explorer.DataFrame.sort_by(df, a)
#Explorer.DataFrame<
Polars[3 x 2]
a string ["a", "b", "c"]
b s64 [3, 1, 2]
>
You can also sort descending:
iex> df = Explorer.DataFrame.new(a: ["b", "c", "a"], b: [1, 2, 3])
iex> Explorer.DataFrame.sort_by(df, desc: a)
#Explorer.DataFrame<
Polars[3 x 2]
a string ["c", "b", "a"]
b s64 [2, 1, 3]
>
You can specify how nil
s are sorted:
iex> df = Explorer.DataFrame.new(a: ["b", "c", nil, "a"])
iex> Explorer.DataFrame.sort_by(df, [desc: a], nils: :first)
#Explorer.DataFrame<
Polars[4 x 1]
a string [nil, "c", "b", "a"]
>
Sorting by more than one column sorts them in the order they are entered:
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.sort_by(df, asc: total, desc: country)
#Explorer.DataFrame<
Polars[1094 x 10]
year s64 [2010, 2010, 2011, 2011, 2012, ...]
country string ["NIUE", "TUVALU", "TUVALU", "NIUE", "NIUE", ...]
total s64 [1, 2, 2, 2, 2, ...]
solid_fuel s64 [0, 0, 0, 0, 0, ...]
liquid_fuel s64 [1, 2, 2, 2, 2, ...]
gas_fuel s64 [0, 0, 0, 0, 0, ...]
cement s64 [0, 0, 0, 0, 0, ...]
gas_flaring s64 [0, 0, 0, 0, 0, ...]
per_capita f64 [0.52, 0.0, 0.0, 1.04, 1.04, ...]
bunker_fuels s64 [0, 0, 0, 0, 0, ...]
>
Grouped examples
When used in a grouped dataframe, sort_by is going to sort each group individually and then return the entire dataframe with the existing groups. If one of the sort_by columns is also a group, the sorting for that column is not going to work. It is necessary to first summarise the desired column and then sort_by it.
Here is an example using the Iris dataset. We group by species and then we try to sort the dataframe by species and petal length, but only "petal length" is taken into account because "species" is a group.
iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.sort_by(grouped, desc: species, asc: sepal_width)
#Explorer.DataFrame<
Polars[150 x 5]
Groups: ["species"]
sepal_length f64 [4.5, 4.4, 4.9, 4.8, 4.3, ...]
sepal_width f64 [2.3, 2.9, 3.0, 3.0, 3.0, ...]
petal_length f64 [1.3, 1.4, 1.4, 1.4, 1.1, ...]
petal_width f64 [0.3, 0.2, 0.2, 0.1, 0.1, ...]
species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", ...]
>
@spec sort_with( df :: t(), (Explorer.Backend.LazyFrame.t() -> Explorer.Series.lazy_t() | [Explorer.Series.lazy_t()] | [{:asc | :desc, Explorer.Series.lazy_t()}]), opts :: [nils: :first | :last, stable: boolean()] ) :: t()
Sorts rows by columns using a callback function.
The callback receives a lazy dataframe which stores operations instead of values for efficient sorting.
This is a callback version of sort_by/2
.
Options
:nils
-:first
or:last
. By default it is:last
if direction is:asc
, and:first
otherwise.:parallel
- boolean. Whether to parallelize the sorting. By default it istrue
.Parallel sort isn't available on certain lazy operations. In those situations this option is ignored.
:stable
- boolean. Determines if the sorting is stable (ties are guaranteed to maintain their order) or not. Unstable sorting may be more performant. By default it isfalse
.
Examples
A single column name will sort ascending by that column:
iex> df = Explorer.DataFrame.new(a: ["b", "c", "a"], b: [1, 2, 3])
iex> Explorer.DataFrame.sort_with(df, &(&1["a"]))
#Explorer.DataFrame<
Polars[3 x 2]
a string ["a", "b", "c"]
b s64 [3, 1, 2]
>
You can also sort descending:
iex> df = Explorer.DataFrame.new(a: ["b", "c", "a"], b: [1, 2, 3])
iex> Explorer.DataFrame.sort_with(df, &[desc: &1["a"]])
#Explorer.DataFrame<
Polars[3 x 2]
a string ["c", "b", "a"]
b s64 [2, 1, 3]
>
You can specify how nil
s are sorted:
iex> df = Explorer.DataFrame.new(a: ["b", "c", nil, "a"])
iex> Explorer.DataFrame.sort_with(df, &[desc: &1["a"]], nils: :first)
#Explorer.DataFrame<
Polars[4 x 1]
a string [nil, "c", "b", "a"]
>
Sorting by more than one column sorts them in the order they are entered:
iex> df = Explorer.DataFrame.new(a: [3, 1, 3], b: [2, 1, 3])
iex> Explorer.DataFrame.sort_with(df, &[desc: &1["a"], asc: &1["b"]])
#Explorer.DataFrame<
Polars[3 x 2]
a s64 [3, 3, 1]
b s64 [2, 3, 1]
>
Grouped examples
iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.sort_with(grouped, &[desc: &1["species"], asc: &1["sepal_width"]])
#Explorer.DataFrame<
Polars[150 x 5]
Groups: ["species"]
sepal_length f64 [4.5, 4.4, 4.9, 4.8, 4.3, ...]
sepal_width f64 [2.3, 2.9, 3.0, 3.0, 3.0, ...]
petal_length f64 [1.3, 1.4, 1.4, 1.4, 1.1, ...]
petal_width f64 [0.3, 0.2, 0.2, 0.1, 0.1, ...]
species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", ...]
>
Summarise each group to a single row using Explorer.Query
.
To summarise, you must either perform an aggregation (defined in Explorer.Series
)
on the desired columns, or select plain columns to retrieve all elements of a group.
The query is compiled and runs efficiently against the dataframe. This function performs aggregations based on groups and it implicitly ungroups the resultant dataframe.
Notice
This is a macro. You must
require Explorer.DataFrame
before using it.
See summarise_with/2
for a callback version of this function without
Explorer.Query
.
Examples
iex> df = Explorer.DataFrame.new(letters: ~w(a b c d e), is_vowel: [true, false, false, false, true])
iex> grouped_df = Explorer.DataFrame.group_by(df, :is_vowel)
iex> Explorer.DataFrame.summarise(grouped_df, letters: letters)
#Explorer.DataFrame<
Polars[2 x 2]
is_vowel boolean [true, false]
letters list[string] [["a", "e"], ["b", "c", "d"]]
>
iex> df = Explorer.Datasets.fossil_fuels()
iex> grouped_df = Explorer.DataFrame.group_by(df, "year")
iex> Explorer.DataFrame.summarise(grouped_df, total_max: max(total), total_min: min(total))
#Explorer.DataFrame<
Polars[5 x 3]
year s64 [2010, 2011, 2012, 2013, 2014]
total_max s64 [2393248, 2654360, 2734817, 2797384, 2806634]
total_min s64 [1, 2, 2, 2, 3]
>
Suppose you want to get the mean petal length of each Iris species. You could do something like this:
iex> df = Explorer.Datasets.iris()
iex> grouped_df = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.summarise(grouped_df, mean_petal_length: mean(petal_length))
#Explorer.DataFrame<
Polars[3 x 2]
species string ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
mean_petal_length f64 [1.464, 4.26, 5.552]
>
In case aggregations for all the dataframe is what you want, you can use ungrouped dataframes:
iex> df = Explorer.Datasets.iris()
iex> Explorer.DataFrame.summarise(df, mean_petal_length: mean(petal_length))
#Explorer.DataFrame<
Polars[1 x 1]
mean_petal_length f64 [3.758666666666666]
>
@spec summarise_with( df :: t(), callback :: (Explorer.Backend.LazyFrame.t() -> column_pairs(Explorer.Series.lazy_t())) ) :: t()
Summarise each group to a single row using a callback function.
In case no group is set, the entire dataframe will be considered. The callback receives a lazy dataframe. A lazy dataframe does not hold any values, instead it stores all operations in order to execute all summarizations performantly.
This is a callback version of summarise/2
.
Examples
iex> alias Explorer.{DataFrame, Series}
iex> df = Explorer.Datasets.fossil_fuels() |> DataFrame.group_by("year")
iex> DataFrame.summarise_with(df, &[total_max: Series.max(&1["total"]), countries: Series.n_distinct(&1["country"])])
#Explorer.DataFrame<
Polars[5 x 3]
year s64 [2010, 2011, 2012, 2013, 2014]
total_max s64 [2393248, 2654360, 2734817, 2797384, 2806634]
countries u32 [217, 217, 220, 220, 220]
>
iex> alias Explorer.{DataFrame, Series}
iex> df = Explorer.Datasets.fossil_fuels()
iex> DataFrame.summarise_with(df, &[total_max: Series.max(&1["total"]), countries: Series.n_distinct(&1["country"])])
#Explorer.DataFrame<
Polars[1 x 2]
total_max s64 [2806634]
countries u32 [222]
>
Transpose a DataFrame.
Warning
This is an expensive operation since data is stored in a columnar format.
Options
:header
- When a string is passed, name of the header column is set to the value. Whenheader
istrue
, name is set tocolumn
(default:false
):columns
- names for non header columns. Length of column_names should match the row count of data frame. (default:nil
)
Examples
iex> df = Explorer.DataFrame.new(a: ["d", nil], b: [1, 2], c: ["a", "b"])
iex> Explorer.DataFrame.transpose(df)
#Explorer.DataFrame<
Polars[3 x 2]
column_0 string ["d", "1", "a"]
column_1 string [nil, "2", "b"]
>
iex> df = Explorer.DataFrame.new(a: ["d", nil], b: [1, 2], c: ["a", "b"])
iex> Explorer.DataFrame.transpose(df, columns: ["x", "y"])
#Explorer.DataFrame<
Polars[3 x 2]
x string ["d", "1", "a"]
y string [nil, "2", "b"]
>
iex> df = Explorer.DataFrame.new(a: ["d", nil], b: [1, 2], c: ["a", "b"])
iex> Explorer.DataFrame.transpose(df, header: "name")
#Explorer.DataFrame<
Polars[3 x 3]
name string ["a", "b", "c"]
column_0 string ["d", "1", "a"]
column_1 string [nil, "2", "b"]
>
@spec ungroup(df :: t(), groups_or_group :: column_names() | column_name() | :all) :: t()
Removes grouping variables.
Accepts a list of group names. If groups is not specified, then all groups are removed.
Examples
Ungroups all by default:
iex> df = Explorer.Datasets.fossil_fuels()
iex> df = Explorer.DataFrame.group_by(df, ["country", "year"])
iex> Explorer.DataFrame.ungroup(df)
#Explorer.DataFrame<
Polars[1094 x 10]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total s64 [2308, 1254, 32500, 141, 7924, ...]
solid_fuel s64 [627, 117, 332, 0, 0, ...]
liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
gas_fuel s64 [74, 7, 14565, 0, 374, ...]
cement s64 [5, 177, 2598, 0, 204, ...]
gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>
Ungrouping a single column:
iex> df = Explorer.Datasets.fossil_fuels()
iex> df = Explorer.DataFrame.group_by(df, ["country", "year"])
iex> Explorer.DataFrame.ungroup(df, "country")
#Explorer.DataFrame<
Polars[1094 x 10]
Groups: ["year"]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total s64 [2308, 1254, 32500, 141, 7924, ...]
solid_fuel s64 [627, 117, 332, 0, 0, ...]
liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
gas_fuel s64 [74, 7, 14565, 0, 374, ...]
cement s64 [5, 177, 2598, 0, 204, ...]
gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>
Lists, ranges, regexes, and functions are also accepted in column names, as in select/2
.
@spec unnest(df :: t(), column_or_columns :: column_name() | [column_name()]) :: t()
Unnests one or multiple struct columns into individual columns.
The field names in a unnested column must not exist in the dataframe or in any other unnested columns.
Examples
iex> df = Explorer.DataFrame.new(before: [1, 2], struct: [%{x: 1, y: 2}, %{x: 3, y: 4}], after: [3, 4])
iex> Explorer.DataFrame.unnest(df, :struct)
#Explorer.DataFrame<
Polars[2 x 4]
before s64 [1, 2]
x s64 [1, 3]
y s64 [2, 4]
after s64 [3, 4]
>
iex> df = Explorer.DataFrame.new(struct1: [%{x: 1, y: 2}, %{x: 3, y: 4}], struct2: [%{z: 5}, %{z: 6}])
iex> Explorer.DataFrame.unnest(df, [:struct1, :struct2])
#Explorer.DataFrame<
Polars[2 x 3]
x s64 [1, 3]
y s64 [2, 4]
z s64 [5, 6]
>
Functions: Multi-table
Combine two or more dataframes column-wise.
This function expects the dataframes to have the same number of rows, otherwise rows may be silently discarded. Eager backends may check whenever this happens and raise instead of silently fail. But this may not be possible for lazy dataframes as the number of rows is not known upfront.
When working with grouped dataframes, be aware that only groups from the first dataframe are kept in the resultant dataframe.
Examples
iex> df1 = Explorer.DataFrame.new(x: [1, 2, 3], y: ["a", "b", "c"])
iex> df2 = Explorer.DataFrame.new(z: [4, 5, 6], a: ["d", "e", "f"])
iex> Explorer.DataFrame.concat_columns([df1, df2])
#Explorer.DataFrame<
Polars[3 x 4]
x s64 [1, 2, 3]
y string ["a", "b", "c"]
z s64 [4, 5, 6]
a string ["d", "e", "f"]
>
Conflicting names are suffixed with the index of the dataframe in the array:
iex> df1 = Explorer.DataFrame.new(x: [1, 2, 3], y: ["a", "b", "c"])
iex> df2 = Explorer.DataFrame.new(x: [4, 5, 6], a: ["d", "e", "f"])
iex> Explorer.DataFrame.concat_columns([df1, df2])
#Explorer.DataFrame<
Polars[3 x 4]
x s64 [1, 2, 3]
y string ["a", "b", "c"]
x_1 s64 [4, 5, 6]
a string ["d", "e", "f"]
>
Combine two dataframes column-wise.
When working with grouped dataframes, be aware that only groups from the left-hand side dataframe are kept in the resultant dataframe.
concat_columns(df1, df2)
is equivalent to concat_columns([df1, df2])
.
Combine two or more dataframes row-wise (stack).
Column names and dtypes must match. The only exception is for numeric columns that can be mixed together, and casted automatically to the "highest" dtype.
For example, if a column is represented by {:u, 16}
and {:u, 32}
,
it will be casted to the highest unsigned integer dtype between the two,
which is {:u, 32}
.
If it is mixing signed and unsigned integers, it will be casted to
the highest signed integer possible.
And if floats and integers are mixed together, Explorer will cast
them to the float dtype {:f, 64}
.
When working with grouped dataframes, be aware that only groups from the first dataframe are kept in the resultant dataframe.
Examples
iex> df1 = Explorer.DataFrame.new(x: [1, 2, 3], y: ["a", "b", "c"])
iex> df2 = Explorer.DataFrame.new(x: [4, 5, 6], y: ["d", "e", "f"])
iex> Explorer.DataFrame.concat_rows([df1, df2])
#Explorer.DataFrame<
Polars[6 x 2]
x s64 [1, 2, 3, 4, 5, ...]
y string ["a", "b", "c", "d", "e", ...]
>
iex> df1 = Explorer.DataFrame.new(x: [1, 2, 3], y: ["a", "b", "c"])
iex> df2 = Explorer.DataFrame.new(x: [4.2, 5.3, 6.4], y: ["d", "e", "f"])
iex> Explorer.DataFrame.concat_rows([df1, df2])
#Explorer.DataFrame<
Polars[6 x 2]
x f64 [1.0, 2.0, 3.0, 4.2, 5.3, ...]
y string ["a", "b", "c", "d", "e", ...]
>
Combine two dataframes row-wise.
concat_rows(df1, df2)
is equivalent to concat_rows([df1, df2])
.
When working with grouped dataframes, be aware that only groups from the left-hand side dataframe are kept in the resultant dataframe.
Join two tables.
Join types
:inner
- Returns all rows fromleft
where there are matching values inright
, and all columns fromleft
andright
.:left
- Returns all rows fromleft
and all columns fromleft
andright
. Rows inleft
with no match inright
will havenil
values in the new columns.:right
- Returns all rows fromright
and all columns fromleft
andright
. Rows inright
with no match inleft
will havenil
values in the new columns.:outer
- Returns all rows and all columns from bothleft
andright
. Where there are not matching values, returnsnil
for the one missing.:cross
- Also known as a cartesian join. Returns all combinations ofleft
andright
. Can be very computationally expensive.
Options
:on
- The column(s) to join on. Defaults to overlapping columns. Does not apply to cross join.:how
- One of the join types (as an atom) described above. Defaults to:inner
.
Examples
Inner join:
iex> left = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> right = Explorer.DataFrame.new(a: [1, 2, 2], c: ["d", "e", "f"])
iex> Explorer.DataFrame.join(left, right)
#Explorer.DataFrame<
Polars[3 x 3]
a s64 [1, 2, 2]
b string ["a", "b", "b"]
c string ["d", "e", "f"]
>
Left join:
iex> left = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> right = Explorer.DataFrame.new(a: [1, 2, 2], c: ["d", "e", "f"])
iex> Explorer.DataFrame.join(left, right, how: :left)
#Explorer.DataFrame<
Polars[4 x 3]
a s64 [1, 2, 2, 3]
b string ["a", "b", "b", "c"]
c string ["d", "e", "f", nil]
>
Right join:
iex> left = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> right = Explorer.DataFrame.new(a: [1, 2, 4], c: ["d", "e", "f"])
iex> Explorer.DataFrame.join(left, right, how: :right)
#Explorer.DataFrame<
Polars[3 x 3]
a s64 [1, 2, 4]
c string ["d", "e", "f"]
b string ["a", "b", nil]
>
Outer join:
iex> left = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> right = Explorer.DataFrame.new(a: [1, 2, 4], c: ["d", "e", "f"])
iex> Explorer.DataFrame.join(left, right, how: :outer)
#Explorer.DataFrame<
Polars[4 x 4]
a s64 [1, 2, nil, 3]
b string ["a", "b", nil, "c"]
a_right s64 [1, 2, 4, nil]
c string ["d", "e", "f", nil]
>
Cross join:
iex> left = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> right = Explorer.DataFrame.new(a: [1, 2, 4], c: ["d", "e", "f"])
iex> Explorer.DataFrame.join(left, right, how: :cross)
#Explorer.DataFrame<
Polars[9 x 4]
a s64 [1, 1, 1, 2, 2, ...]
b string ["a", "a", "a", "b", "b", ...]
a_right s64 [1, 2, 4, 1, 2, ...]
c string ["d", "e", "f", "d", "e", ...]
>
Inner join with different names:
iex> left = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> right = Explorer.DataFrame.new(d: [1, 2, 2], c: ["d", "e", "f"])
iex> Explorer.DataFrame.join(left, right, on: [{"a", "d"}])
#Explorer.DataFrame<
Polars[3 x 3]
a s64 [1, 2, 2]
b string ["a", "b", "b"]
c string ["d", "e", "f"]
>
Grouped examples
When doing a join operation with grouped dataframes, the joined dataframe may keep the groups from only one side.
An inner join operation will keep the groups from the left-hand side dataframe:
iex> left = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> right = Explorer.DataFrame.new(a: [1, 2, 2], c: ["d", "e", "f"])
iex> grouped_left = Explorer.DataFrame.group_by(left, "b")
iex> grouped_right = Explorer.DataFrame.group_by(right, "c")
iex> Explorer.DataFrame.join(grouped_left, grouped_right)
#Explorer.DataFrame<
Polars[3 x 3]
Groups: ["b"]
a s64 [1, 2, 2]
b string ["a", "b", "b"]
c string ["d", "e", "f"]
>
A left join operation will keep the groups from the left-hand side dataframe:
iex> left = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> right = Explorer.DataFrame.new(a: [1, 2, 2], c: ["d", "e", "f"])
iex> grouped_left = Explorer.DataFrame.group_by(left, "b")
iex> grouped_right = Explorer.DataFrame.group_by(right, "c")
iex> Explorer.DataFrame.join(grouped_left, grouped_right, how: :left)
#Explorer.DataFrame<
Polars[4 x 3]
Groups: ["b"]
a s64 [1, 2, 2, 3]
b string ["a", "b", "b", "c"]
c string ["d", "e", "f", nil]
>
A right join operation will keep the groups from the right-hand side dataframe:
iex> left = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> right = Explorer.DataFrame.new(a: [1, 2, 4], c: ["d", "e", "f"])
iex> grouped_left = Explorer.DataFrame.group_by(left, "b")
iex> grouped_right = Explorer.DataFrame.group_by(right, "c")
iex> Explorer.DataFrame.join(grouped_left, grouped_right, how: :right)
#Explorer.DataFrame<
Polars[3 x 3]
Groups: ["c"]
a s64 [1, 2, 4]
c string ["d", "e", "f"]
b string ["a", "b", nil]
>
An outer join operation is going to keep the groups from the left-hand side dataframe:
iex> left = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> right = Explorer.DataFrame.new(a: [1, 2, 4], c: ["d", "e", "f"])
iex> grouped_left = Explorer.DataFrame.group_by(left, "b")
iex> grouped_right = Explorer.DataFrame.group_by(right, "c")
iex> Explorer.DataFrame.join(grouped_left, grouped_right, how: :outer)
#Explorer.DataFrame<
Polars[4 x 4]
Groups: ["b"]
a s64 [1, 2, nil, 3]
b string ["a", "b", nil, "c"]
a_right s64 [1, 2, 4, nil]
c string ["d", "e", "f", nil]
>
A cross join operation is going to keep the groups from the left-hand side dataframe:
iex> left = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> right = Explorer.DataFrame.new(a: [1, 2, 4], c: ["d", "e", "f"])
iex> grouped_left = Explorer.DataFrame.group_by(left, "b")
iex> grouped_right = Explorer.DataFrame.group_by(right, "c")
iex> Explorer.DataFrame.join(grouped_left, grouped_right, how: :cross)
#Explorer.DataFrame<
Polars[9 x 4]
Groups: ["b"]
a s64 [1, 1, 1, 2, 2, ...]
b string ["a", "a", "a", "b", "b", ...]
a_right s64 [1, 2, 4, 1, 2, ...]
c string ["d", "e", "f", "d", "e", ...]
>
Functions: Row-based
Returns the first n rows of the dataframe.
By default it returns the first 5 rows.
If the dataframe is using groups, then the first n rows of each group is returned.
Examples
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.head(df)
#Explorer.DataFrame<
Polars[5 x 10]
year s64 [2010, 2010, 2010, 2010, 2010]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA"]
total s64 [2308, 1254, 32500, 141, 7924]
solid_fuel s64 [627, 117, 332, 0, 0]
liquid_fuel s64 [1601, 953, 12381, 141, 3649]
gas_fuel s64 [74, 7, 14565, 0, 374]
cement s64 [5, 177, 2598, 0, 204]
gas_flaring s64 [0, 0, 2623, 0, 3697]
per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37]
bunker_fuels s64 [9, 7, 663, 0, 321]
>
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.head(df, 2)
#Explorer.DataFrame<
Polars[2 x 10]
year s64 [2010, 2010]
country string ["AFGHANISTAN", "ALBANIA"]
total s64 [2308, 1254]
solid_fuel s64 [627, 117]
liquid_fuel s64 [1601, 953]
gas_fuel s64 [74, 7]
cement s64 [5, 177]
gas_flaring s64 [0, 0]
per_capita f64 [0.08, 0.43]
bunker_fuels s64 [9, 7]
>
Grouped examples
Using grouped dataframes makes head/2
return n rows from each group.
Here is an example using the Iris dataset, and returning two rows from each group:
iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.head(grouped, 2)
#Explorer.DataFrame<
Polars[6 x 5]
Groups: ["species"]
sepal_length f64 [5.1, 4.9, 7.0, 6.4, 6.3, ...]
sepal_width f64 [3.5, 3.0, 3.2, 3.2, 3.3, ...]
petal_length f64 [1.4, 1.4, 4.7, 4.5, 6.0, ...]
petal_width f64 [0.2, 0.2, 1.4, 1.5, 2.5, ...]
species string ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-versicolor", "Iris-virginica", ...]
>
Sample rows from a dataframe.
If given an integer as the second argument, it will return N samples. If given a float, it will return that proportion of the series.
Can sample with or without replacement.
For grouped dataframes, sample will take into account the rows of each group, meaning that if you try to get N samples and you have G groups, you will get N * G rows. See the examples below.
Options
:replace
- If set totrue
, each sample will be independent and therefore values may repeat. Required to betrue
forn
greater then the number of rows in the dataframe orfrac
> 1.0. (default:false
):seed
- An integer to be used as a random seed. If nil, a random value between 0 and 2^64 - 1 will be used. (default:nil
):shuffle
- If set totrue
, the resultant dataframe is going to be shuffle if the sample is equal to the size of the dataframe. (default:false
)
Examples
You can sample N rows:
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.sample(df, 3, seed: 100)
#Explorer.DataFrame<
Polars[3 x 10]
year s64 [2012, 2013, 2014]
country string ["SYRIAN ARAB REPUBLIC", "EGYPT", "AFGHANISTAN"]
total s64 [12198, 58198, 2675]
solid_fuel s64 [1, 224, 1194]
liquid_fuel s64 [7909, 26501, 1393]
gas_fuel s64 [3265, 24672, 74]
cement s64 [816, 6800, 14]
gas_flaring s64 [208, 0, 0]
per_capita f64 [0.61, 0.66, 0.08]
bunker_fuels s64 [437, 694, 9]
>
Or you can sample a proportion of rows:
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.sample(df, 0.03, seed: 100)
#Explorer.DataFrame<
Polars[32 x 10]
year s64 [2013, 2012, 2014, 2011, 2011, ...]
country string ["BRITISH VIRGIN ISLANDS", "TAJIKISTAN", "AFGHANISTAN", "ICELAND", "SINGAPORE", ...]
total s64 [48, 800, 2675, 513, 12332, ...]
solid_fuel s64 [0, 192, 1194, 94, 7, ...]
liquid_fuel s64 [48, 501, 1393, 400, 7774, ...]
gas_fuel s64 [0, 74, 74, 0, 4551, ...]
cement s64 [0, 34, 14, 19, 0, ...]
gas_flaring s64 [0, 0, 0, 0, 0, ...]
per_capita f64 [1.64, 0.1, 0.08, 1.6, 2.38, ...]
bunker_fuels s64 [0, 28, 9, 168, 41786, ...]
>
Grouped examples
In the following example we have the Iris dataset grouped by species, and we want to take a sample of two plants from each group. Since we have three species, the resultant dataframe is going to have six rows (2 * 3).
iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.sample(grouped, 2, seed: 100)
#Explorer.DataFrame<
Polars[6 x 5]
Groups: ["species"]
sepal_length f64 [4.8, 5.0, 5.5, 6.5, 7.4, ...]
sepal_width f64 [3.1, 3.6, 2.4, 2.8, 2.8, ...]
petal_length f64 [1.6, 1.4, 3.8, 4.6, 6.1, ...]
petal_width f64 [0.2, 0.2, 1.1, 1.5, 1.9, ...]
species string ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-versicolor", "Iris-virginica", ...]
>
The behaviour is similar when you want to take a fraction of the rows from each group. The main difference is that each group can have more or less rows, depending on its size.
iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.sample(grouped, 0.1, seed: 100)
#Explorer.DataFrame<
Polars[15 x 5]
Groups: ["species"]
sepal_length f64 [5.2, 5.0, 5.2, 5.0, 5.0, ...]
sepal_width f64 [3.4, 3.6, 3.5, 3.0, 3.4, ...]
petal_length f64 [1.4, 1.4, 1.5, 1.6, 1.6, ...]
petal_width f64 [0.2, 0.2, 0.2, 0.2, 0.4, ...]
species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", ...]
>
Change the order of the rows of a dataframe randomly.
This function is going to ignore groups.
Options
:seed
- An integer to be used as a random seed. If nil, a random value between 0 and 2^64 - 1 will be used. (default:nil
)
Examples
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.shuffle(df, seed: 100)
#Explorer.DataFrame<
Polars[1094 x 10]
year s64 [2014, 2014, 2014, 2012, 2010, ...]
country string ["ISRAEL", "ARGENTINA", "NETHERLANDS", "YEMEN", "GRENADA", ...]
total s64 [17617, 55638, 45624, 5091, 71, ...]
solid_fuel s64 [6775, 1588, 9070, 129, 0, ...]
liquid_fuel s64 [6013, 25685, 18272, 4173, 71, ...]
gas_fuel s64 [3930, 26368, 18010, 414, 0, ...]
cement s64 [898, 1551, 272, 375, 0, ...]
gas_flaring s64 [0, 446, 0, 0, 0, ...]
per_capita f64 [2.22, 1.29, 2.7, 0.2, 0.68, ...]
bunker_fuels s64 [1011, 2079, 14210, 111, 4, ...]
>
Slices rows at the given indices as a new dataframe.
The indices may be a list or series of indices, or a range. A list of indices does not support negative numbers. Ranges may be negative on either end, which are then normalized. Note ranges in Elixir are inclusive.
Slice works differently when a dataframe is grouped. It is going to consider the indices of each group instead of the entire dataframe. See the examples below.
If your intention is to grab a portion of each group,
prefer to use sample/3
instead.
Examples
iex> df = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> Explorer.DataFrame.slice(df, [0, 2])
#Explorer.DataFrame<
Polars[2 x 2]
a s64 [1, 3]
b string ["a", "c"]
>
With a series
iex> df = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> Explorer.DataFrame.slice(df, Explorer.Series.from_list([0, 2]))
#Explorer.DataFrame<
Polars[2 x 2]
a s64 [1, 3]
b string ["a", "c"]
>
With a range:
iex> df = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> Explorer.DataFrame.slice(df, 1..2)
#Explorer.DataFrame<
Polars[2 x 2]
a s64 [2, 3]
b string ["b", "c"]
>
With a range with negative first and last:
iex> df = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> Explorer.DataFrame.slice(df, -2..-1)
#Explorer.DataFrame<
Polars[2 x 2]
a s64 [2, 3]
b string ["b", "c"]
>
Grouped examples
We are going to once again use the Iris dataset. In this example we want to take elements at indexes 0 and 2:
iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.slice(grouped, [0, 2])
#Explorer.DataFrame<
Polars[6 x 5]
Groups: ["species"]
sepal_length f64 [5.1, 4.7, 7.0, 6.9, 6.3, ...]
sepal_width f64 [3.5, 3.2, 3.2, 3.1, 3.3, ...]
petal_length f64 [1.4, 1.3, 4.7, 4.9, 6.0, ...]
petal_width f64 [0.2, 0.2, 1.4, 1.5, 2.5, ...]
species string ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-versicolor", "Iris-virginica", ...]
>
Now we want to take the first 3 rows of each group.
This is going to work with the range 0..2
:
iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.slice(grouped, 0..2)
#Explorer.DataFrame<
Polars[9 x 5]
Groups: ["species"]
sepal_length f64 [5.1, 4.9, 4.7, 7.0, 6.4, ...]
sepal_width f64 [3.5, 3.0, 3.2, 3.2, 3.2, ...]
petal_length f64 [1.4, 1.4, 1.3, 4.7, 4.5, ...]
petal_width f64 [0.2, 0.2, 0.2, 1.4, 1.5, ...]
species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-versicolor", ...]
>
Subset a continuous set of rows.
Examples
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.slice(df, 1, 2)
#Explorer.DataFrame<
Polars[2 x 10]
year s64 [2010, 2010]
country string ["ALBANIA", "ALGERIA"]
total s64 [1254, 32500]
solid_fuel s64 [117, 332]
liquid_fuel s64 [953, 12381]
gas_fuel s64 [7, 14565]
cement s64 [177, 2598]
gas_flaring s64 [0, 2623]
per_capita f64 [0.43, 0.9]
bunker_fuels s64 [7, 663]
>
Negative offsets count from the end of the series:
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.slice(df, -10, 2)
#Explorer.DataFrame<
Polars[2 x 10]
year s64 [2014, 2014]
country string ["UNITED STATES OF AMERICA", "URUGUAY"]
total s64 [1432855, 1840]
solid_fuel s64 [450047, 2]
liquid_fuel s64 [576531, 1700]
gas_fuel s64 [390719, 25]
cement s64 [11314, 112]
gas_flaring s64 [4244, 0]
per_capita f64 [4.43, 0.54]
bunker_fuels s64 [30722, 251]
>
If the length would run past the end of the dataframe, the result may be shorter than the length:
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.slice(df, -10, 20)
#Explorer.DataFrame<
Polars[10 x 10]
year s64 [2014, 2014, 2014, 2014, 2014, ...]
country string ["UNITED STATES OF AMERICA", "URUGUAY", "UZBEKISTAN", "VANUATU", "VENEZUELA", ...]
total s64 [1432855, 1840, 28692, 42, 50510, ...]
solid_fuel s64 [450047, 2, 1677, 0, 204, ...]
liquid_fuel s64 [576531, 1700, 2086, 42, 28445, ...]
gas_fuel s64 [390719, 25, 23929, 0, 12731, ...]
cement s64 [11314, 112, 1000, 0, 1088, ...]
gas_flaring s64 [4244, 0, 0, 0, 8042, ...]
per_capita f64 [4.43, 0.54, 0.97, 0.16, 1.65, ...]
bunker_fuels s64 [30722, 251, 0, 10, 1256, ...]
>
Grouped examples
We want to take the first 3 rows of each group. We need the offset 0 and the length 3:
iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.slice(grouped, 0, 3)
#Explorer.DataFrame<
Polars[9 x 5]
Groups: ["species"]
sepal_length f64 [5.1, 4.9, 4.7, 7.0, 6.4, ...]
sepal_width f64 [3.5, 3.0, 3.2, 3.2, 3.2, ...]
petal_length f64 [1.4, 1.4, 1.3, 4.7, 4.5, ...]
petal_width f64 [0.2, 0.2, 0.2, 1.4, 1.5, ...]
species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-versicolor", ...]
>
We can also pass a negative offset:
iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.slice(grouped, -6, 3)
#Explorer.DataFrame<
Polars[9 x 5]
Groups: ["species"]
sepal_length f64 [5.1, 4.8, 5.1, 5.6, 5.7, ...]
sepal_width f64 [3.8, 3.0, 3.8, 2.7, 3.0, ...]
petal_length f64 [1.9, 1.4, 1.6, 4.2, 4.2, ...]
petal_width f64 [0.4, 0.3, 0.2, 1.3, 1.2, ...]
species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-versicolor", ...]
>
Returns the last n rows of the dataframe.
By default it returns the last 5 rows.
If the dataframe is using groups, then the last n rows of each group is returned.
Examples
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.tail(df)
#Explorer.DataFrame<
Polars[5 x 10]
year s64 [2014, 2014, 2014, 2014, 2014]
country string ["VIET NAM", "WALLIS AND FUTUNA ISLANDS", "YEMEN", "ZAMBIA", "ZIMBABWE"]
total s64 [45517, 6, 6190, 1228, 3278]
solid_fuel s64 [19246, 0, 137, 132, 2097]
liquid_fuel s64 [12694, 6, 5090, 797, 1005]
gas_fuel s64 [5349, 0, 581, 0, 0]
cement s64 [8229, 0, 381, 299, 177]
gas_flaring s64 [0, 0, 0, 0, 0]
per_capita f64 [0.49, 0.44, 0.24, 0.08, 0.22]
bunker_fuels s64 [761, 1, 153, 33, 9]
>
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.tail(df, 2)
#Explorer.DataFrame<
Polars[2 x 10]
year s64 [2014, 2014]
country string ["ZAMBIA", "ZIMBABWE"]
total s64 [1228, 3278]
solid_fuel s64 [132, 2097]
liquid_fuel s64 [797, 1005]
gas_fuel s64 [0, 0]
cement s64 [299, 177]
gas_flaring s64 [0, 0]
per_capita f64 [0.08, 0.22]
bunker_fuels s64 [33, 9]
>
Grouped examples
Using grouped dataframes makes tail/2
return n rows from each group.
Here is an example using the Iris dataset, and returning two rows from each group:
iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.tail(grouped, 2)
#Explorer.DataFrame<
Polars[6 x 5]
Groups: ["species"]
sepal_length f64 [5.3, 5.0, 5.1, 5.7, 6.2, ...]
sepal_width f64 [3.7, 3.3, 2.5, 2.8, 3.4, ...]
petal_length f64 [1.5, 1.4, 3.0, 4.1, 5.4, ...]
petal_width f64 [0.2, 0.2, 1.1, 1.3, 2.3, ...]
species string ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-versicolor", "Iris-virginica", ...]
>
Functions: Introspection
Gets the dtypes of the dataframe columns.
Examples
iex> df = Explorer.DataFrame.new(floats: [1.0, 2.0], ints: [1, 2])
iex> Explorer.DataFrame.dtypes(df)
%{"floats" => {:f, 64}, "ints" => {:s, 64}}
Returns the groups of a dataframe.
Examples
iex> df = Explorer.Datasets.fossil_fuels()
iex> df = Explorer.DataFrame.group_by(df, "country")
iex> Explorer.DataFrame.groups(df)
["country"]
iex> df = Explorer.Datasets.iris()
iex> Explorer.DataFrame.groups(df)
[]
Returns the number of columns in the dataframe.
This function works the same way for grouped dataframes.
Examples
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.n_columns(df)
10
Returns the number of rows in the dataframe.
This function works the same way for grouped dataframes, considering the entire dataframe in the counting of rows.
Examples
iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.n_rows(df)
1094
Gets the names of the dataframe columns.
Examples
iex> df = Explorer.DataFrame.new(floats: [1.0, 2.0], ints: [1, 2])
iex> Explorer.DataFrame.names(df)
["floats", "ints"]
Prints the DataFrame in a tabular fashion.
Examples
df = Explorer.Datasets.iris()
Explorer.DataFrame.print(df)
Explorer.DataFrame.print(df, limit: 1)
Explorer.DataFrame.print(df, limit: :infinity)
Gets the shape of the dataframe as a {height, width}
tuple.
This function works the same way for grouped dataframes, considering the entire dataframe in the counting of rows.
Examples
iex> df = Explorer.DataFrame.new(floats: [1.0, 2.0, 3.0], ints: [1, 2, 3])
iex> Explorer.DataFrame.shape(df)
{3, 2}
Functions: IO
@spec dump_csv(df :: t(), opts :: Keyword.t()) :: {:ok, String.t()} | {:error, Exception.t()}
Writes a dataframe to a binary representation of a delimited file.
Options
:header
- Should the column names be written as the first line of the file? (default:true
):delimiter
- A single character used to separate fields within a record. (default:","
)
Examples
iex> df = Explorer.Datasets.fossil_fuels() |> Explorer.DataFrame.head(2)
iex> Explorer.DataFrame.dump_csv(df)
{:ok, "year,country,total,solid_fuel,liquid_fuel,gas_fuel,cement,gas_flaring,per_capita,bunker_fuels\n2010,AFGHANISTAN,2308,627,1601,74,5,0,0.08,9\n2010,ALBANIA,1254,117,953,7,177,0,0.43,7\n"}
Similar to dump_csv/2
, but raises in case of error.
@spec dump_ipc(df :: t(), opts :: Keyword.t()) :: {:ok, binary()} | {:error, Exception.t()}
Writes a dataframe to a binary representation of an IPC file.
Groups are ignored if the dataframe is using any.
Options
:compression
- The compression algorithm to use when writing files. Supported options are:nil
(uncompressed, default):zstd
:lz4
.
Similar to dump_ipc/2
, but raises in case of error.
@spec dump_ipc_stream(df :: t(), opts :: Keyword.t()) :: {:ok, binary()} | {:error, Exception.t()}
Writes a dataframe to a binary representation of an IPC Stream file.
Groups are ignored if the dataframe is using any.
Options
:compression
- The compression algorithm to use when writing files. Supported options are:nil
(uncompressed, default):zstd
:lz4
.
Similar to dump_ipc_stream/2
, but raises in case of error.
@spec dump_ndjson(df :: t()) :: {:ok, binary()} | {:error, Exception.t()}
Writes a dataframe to a binary representation of a NDJSON file.
Groups are ignored if the dataframe is using any.
Examples
iex> df = Explorer.DataFrame.new(col_a: [1, 2], col_b: [5.1, 5.2])
iex> Explorer.DataFrame.dump_ndjson(df)
{:ok, ~s({"col_a":1,"col_b":5.1}\n{"col_a":2,"col_b":5.2}\n)}
Similar to dump_ndjson!/2
, but raises in case of error.
@spec dump_parquet(df :: t(), opts :: Keyword.t()) :: {:ok, binary()} | {:error, Exception.t()}
Writes a dataframe to a binary representation of a Parquet file.
Groups are ignored if the dataframe is using any.
Options
:compression
- The compression algorithm to use when writing files. Where a compression level is available, this can be passed as a tuple, such as{:zstd, 3}
. Supported options are:nil
(uncompressed, default):snappy
:gzip
(with levels 1-9):brotli
(with levels 1-11):zstd
(with levels -7-22):lz4raw
.
Similar to dump_parquet/2
, but raises in case of error.
@spec from_csv(filename :: String.t() | fs_entry(), opts :: Keyword.t()) :: {:ok, t()} | {:error, Exception.t()}
Reads a delimited file into a dataframe.
It accepts a filename that can be a local file, a "s3://" schema, or
a FSS
entry like FSS.S3.Entry
.
If the CSV is compressed, it is automatically decompressed.
Options
:delimiter
- A single character used to separate fields within a record. (default:","
):dtypes
- A list/map of{"column_name", dtype}
tuples. Any non-specified column has its type imputed from the first 1000 rows. (default:[]
):header
- Does the file have a header of column names as the first row or not? (default:true
):max_rows
- Maximum number of lines to read. (default:nil
):nil_values
- A list of strings that should be interpreted as a nil values. (default:[]
):skip_rows
- The number of lines to skip at the beginning of the file. (default:0
):skip_rows_after_header
- The number of lines to skip after the header row. (default:0
):columns
- A list of column names or indexes to keep. If present, only these columns are read into the dataframe. (default:nil
):infer_schema_length
Maximum number of rows read for schema inference. Setting this to nil will do a full table scan and will be slow (default:1000
).:parse_dates
- Automatically try to parse dates/ datetimes and time. If parsing fails, columns remain of dtypestring
:eol_delimiter
- A single character used to represent new lines. (default:" "
):config
- An optional struct, keyword list or map, normally associated with remote file systems. See IO section for more details. (default:nil
):backend
- The Explorer backend to use. Defaults to the value returned byExplorer.Backend.get/0
.:lazy
- force the results into the lazy version of the current backend.:encoding
- Encoding to use when reading the file. For now, the only possible values areutf8
andutf8-lossy
. The utf8-lossy option means that invalid utf8 values are replaced with � characters. (default:"utf8"
)
Similar to from_csv/2
but raises if there is a problem reading the CSV.
@spec from_ipc(filename :: String.t() | fs_entry(), opts :: Keyword.t()) :: {:ok, t()} | {:error, Exception.t()}
Reads an IPC file into a dataframe.
It accepts a filename that can be a local file, a "s3://" schema, or
a FSS
entry like FSS.S3.Entry
.
Options
:columns
- List with the name or index of columns to be selected. Defaults to all columns.:config
- An optional struct, keyword list or map, normally associated with remote file systems. See IO section for more details. (default:nil
):backend
- The Explorer backend to use. Defaults to the value returned byExplorer.Backend.get/0
.:lazy
- force the results into the lazy version of the current backend.
Similar to from_ipc/2
but raises if there is a problem reading the IPC file.
Reads an IPC Streaming file into a dataframe.
It's possible to read from an IPC Streaming file using the lazy Polars backend, but the implementation is not truly lazy. We are going to read it first using the eager backend, and then convert the dataframe to lazy.
Options
:columns
- List with the name or index of columns to be selected. Defaults to all columns.:backend
- The Explorer backend to use. Defaults to the value returned byExplorer.Backend.get/0
.:lazy
- force the results into the lazy version of the current backend.:config
- An optional struct, keyword list or map, normally associated with remote file systems. See IO section for more details. (default:nil
)
Similar to from_ipc_stream/2
but raises if there is a problem reading the IPC Stream file.
@spec from_ndjson(filename :: String.t() | fs_entry(), opts :: Keyword.t()) :: {:ok, t()} | {:error, Exception.t()}
Read a file of JSON objects or lists separated by new lines
Options
:batch_size
- Sets the batch size for reading rows. This value may have significant impact in performance, so adjust it for your needs (default:1000
).:infer_schema_length
- Maximum number of rows read for schema inference. Setting this to nil will do a full table scan and will be slow (default:1000
).:backend
- The Explorer backend to use. Defaults to the value returned byExplorer.Backend.get/0
.:lazy
- force the results into the lazy version of the current backend.:config
- An optional struct, keyword list or map, normally associated with remote file systems. See IO section for more details. (default:nil
)
Similar to from_ndjson/2
, but raises in case of error.
@spec from_parquet(filename :: String.t() | fs_entry(), opts :: Keyword.t()) :: {:ok, t()} | {:error, Exception.t()}
Reads a parquet file into a dataframe.
It accepts a filename that can be a local file, a "s3://" schema, or
a FSS
entry like FSS.S3.Entry
.
Options
:max_rows
- Maximum number of lines to read. (default:nil
):columns
- A list of column names or indexes to keep. If present, only these columns are read into the dataframe. (default:nil
):config
- An optional struct, keyword list or map, normally associated with remote file systems. See IO section for more details. (default:nil
):backend
- The Explorer backend to use. Defaults to the value returned byExplorer.Backend.get/0
.:lazy
- force the results into the lazy version of the current backend.
Similar to from_parquet/2
but raises if there is a problem reading the Parquet file.
@spec from_query( Adbc.Connection.t(), query :: String.t(), params :: [term()], opts :: Keyword.t() ) :: {:ok, t()} | {:error, Exception.t()}
Reads data from a query.
conn
must be a Adbc.Connection
process. sql
is a string representing
the sql query and params
is the list of query parameters. See Adbc
for more information.
Example
In order to read data from a database, you must list :adbc
as a dependency,
download the relevant driver, and start both database and connection processes
in your supervision tree.
First, add :adbc
as a dependency in your mix.exs
:
{:adbc, "~> 0.1"}
Now, in your config/config.exs, configure the drivers you are going to use
(see Adbc
module docs for more information on supported drivers):
config :adbc, :drivers, [:sqlite]
If you are using a notebook or scripting, you can also use Adbc.download_driver!/1
to dynamically download one.
Then start the database and the relevant connection processes in your supervision tree:
children = [
{Adbc.Database,
driver: :sqlite,
process_options: [name: MyApp.DB]},
{Adbc.Connection,
database: MyApp.DB,
process_options: [name: MyApp.Conn]}
]
Supervisor.start_link(children, strategy: :one_for_one)
In a notebook, the above would look like this:
db = Kino.start_child!({Adbc.Database, driver: :sqlite})
conn = Kino.start_child!({Adbc.Connection, database: db})
And now you can make queries with:
# For named connections
{:ok, _} = Explorer.DataFrame.from_query(MyApp.Conn, "SELECT 123", [])
# When using the conn PID directly
{:ok, _} = Explorer.DataFrame.from_query(conn, "SELECT 123", [])
Options
:backend
- The Explorer backend to use. Defaults to the value returned byExplorer.Backend.get/0
.:lazy
- force the results into the lazy version of the current backend.
@spec from_query!( Adbc.Connection.t(), query :: String.t(), params :: [term()], opts :: Keyword.t() ) :: t()
Similar to from_query/4
but raises if there is an error.
@spec load_csv(contents :: String.t(), opts :: Keyword.t()) :: {:ok, t()} | {:error, Exception.t()}
Reads a representation of a CSV file into a dataframe.
If the CSV is compressed, it is automatically decompressed.
Options
:delimiter
- A single character used to separate fields within a record. (default:","
):dtypes
- A list/map of{"column_name", dtype}
tuples. Any non-specified column has its type imputed from the first 1000 rows. (default:[]
):header
- Does the file have a header of column names as the first row or not? (default:true
):max_rows
- Maximum number of lines to read. (default:nil
):nil_values
- A list of strings that should be interpreted as a nil values. (default:[]
):skip_rows
- The number of lines to skip at the beginning of the file. (default:0
):skip_rows_after_header
- The number of lines to skip after the heqader row. (default:0
):columns
- A list of column names or indexes to keep. If present, only these columns are read into the dataframe. (default:nil
):infer_schema_length
Maximum number of rows read for schema inference. Setting this to nil will do a full table scan and will be slow (default:1000
).:parse_dates
- Automatically try to parse dates/ datetimes and time. If parsing fails, columns remain of dtypestring
:eol_delimiter
- A single character used to represent new lines. (default:" "
):backend
- The Explorer backend to use. Defaults to the value returned byExplorer.Backend.get/0
.:lazy
- force the results into the lazy version of the current backend.:encoding
- Encoding to use when reading the file. For now, the only possible values areutf8
andutf8-lossy
. The utf8-lossy option means that invalid utf8 values are replaced with � characters. (default:"utf8"
)
Similar to load_csv/2
but raises if there is a problem reading the CSV.
@spec load_ipc(contents :: binary(), opts :: Keyword.t()) :: {:ok, t()} | {:error, Exception.t()}
Reads a binary representing an IPC file into a dataframe.
Options
:columns
- List with the name or index of columns to be selected. Defaults to all columns.:backend
- The Explorer backend to use. Defaults to the value returned byExplorer.Backend.get/0
.:lazy
- force the results into the lazy version of the current backend.
Similar to load_ipc/2
but raises if there is a problem reading the IPC file.
@spec load_ipc_stream(contents :: binary(), opts :: Keyword.t()) :: {:ok, t()} | {:error, Exception.t()}
Reads a binary representing an IPC Stream file into a dataframe.
Options
:columns
- List with the name or index of columns to be selected. Defaults to all columns.:backend
- The Explorer backend to use. Defaults to the value returned byExplorer.Backend.get/0
.:lazy
- force the results into the lazy version of the current backend.
Similar to load_ipc_stream/2
but raises if there is a problem.
@spec load_ndjson(contents :: String.t(), opts :: Keyword.t()) :: {:ok, t()} | {:error, Exception.t()}
Reads a representation of a NDJSON file into a dataframe.
Options
:batch_size
- Sets the batch size for reading rows. This value may have significant impact in performance, so adjust it for your needs (default:1000
).:infer_schema_length
- Maximum number of rows read for schema inference. Setting this to nil will do a full table scan and will be slow (default:1000
).:backend
- The Explorer backend to use. Defaults to the value returned byExplorer.Backend.get/0
.:lazy
- force the results into the lazy version of the current backend.
Similar to load_ndjson/2
, but raises in case of error.
Examples
iex> contents = ~s({"col_a":1,"col_b":5.1}\n{"col_a":2,"col_b":5.2}\n)
iex> Explorer.DataFrame.load_ndjson!(contents)
#Explorer.DataFrame<
Polars[2 x 2]
col_a s64 [1, 2]
col_b f64 [5.1, 5.2]
>
@spec load_parquet(contents :: binary(), opts :: Keyword.t()) :: {:ok, t()} | {:error, Exception.t()}
Reads a binary representation of a parquet file into a dataframe.
Similar to load_parquet/2
but raises if there is a problem reading the Parquet file.
@spec to_csv(df :: t(), filename :: fs_entry() | String.t(), opts :: Keyword.t()) :: :ok | {:error, Exception.t()}
Writes a dataframe to a delimited file.
Groups are ignored if the dataframe is using any.
Options
:header
- Should the column names be written as the first line of the file? (default:true
):delimiter
- A single character used to separate fields within a record. (default:","
):config
- An optional struct, keyword list or map, normally associated with remote file systems. See IO section for more details. (default:nil
)
Similar to to_csv/3
but raises if there is a problem reading the CSV.
@spec to_ipc(df :: t(), filename :: String.t() | fs_entry(), opts :: Keyword.t()) :: :ok | {:error, Exception.t()}
Writes a dataframe to an IPC file.
Apache IPC is a language-agnostic columnar data structure that can be used to store dataframes. It excels as a format for quickly exchange data between different programming languages.
Groups are ignored if the dataframe is using any.
Options
:compression
- The compression algorithm to use when writing files. Supported options are:nil
(uncompressed, default):zstd
:lz4
.
:streaming
- Tells the backend if it should use streaming, which means that the dataframe is not loaded to the memory at once, and instead it is written in chunks from a lazy dataframe.This option has no effect on eager - the default - dataframes. It defaults to
true
.:config
- An optional struct, keyword list or map, normally associated with remote file systems. See IO section for more details. (default:nil
)
Similar to to_ipc/3
, but raises in case of error.
Writes a dataframe to an IPC Stream file.
Arrow IPC Streams provide a streaming protocol or “format" for sending an arbitrary length sequence of record batches. The format must be processed from start to end, and does not support random access.
Options
:compression
- The compression algorithm to use when writing files. Supported options are:nil
(uncompressed, default):zstd
:lz4
.
:config
- An optional struct, keyword list or map, normally associated with remote file systems. See IO section for more details. (default:nil
)
Similar to to_ipc_stream/3
, but raises in case of error.
@spec to_ndjson(df :: t(), filename :: String.t() | fs_entry(), opts :: Keyword.t()) :: :ok | {:error, Exception.t()}
Writes a dataframe to a ndjson file.
Groups are ignored if the dataframe is using any.
NDJSON are files that contains JSON files separated by new lines. They are often used as structured logs.
Options
:config
- An optional struct, keyword list or map, normally associated with remote file systems. See IO section for more details. (default:nil
)
Similar to to_ndjson/3
, but raises in case of error.
@spec to_parquet(df :: t(), filename :: String.t() | fs_entry(), opts :: Keyword.t()) :: :ok | {:error, Exception.t()}
Writes a dataframe to a parquet file.
Groups are ignored if the dataframe is using any.
Options
:compression
- The compression algorithm to use when writing files. Where a compression level is available, this can be passed as a tuple, such as{:zstd, 3}
. Supported options are:nil
(uncompressed, default):snappy
:gzip
(with levels 1-9):brotli
(with levels 1-11):zstd
(with levels -7-22):lz4raw
.
:streaming
- Tells the backend if it should use streaming, which means that the dataframe is not loaded to the memory at once, and instead it is written in chunks from a lazy dataframe.This option has no effect on eager - the default - dataframes. It defaults to
true
.:config
- An optional struct, keyword list or map, normally associated with remote file systems. See IO section for more details. (default:nil
)
Similar to to_parquet/3
, but raises in case of error.
Types
@type column() :: column_name() | non_neg_integer()
Represents a column name or its index.
Represents a column name as atom or string.
@type column_names() :: [column_name()]
Represents multiple column names as atoms or strings.
Represents a column pair where the value is a column name or
a column index, and the value is of type value
.
@type columns() :: [column()] | Range.t() | (String.t() -> boolean()) | (String.t(), Explorer.Series.dtype() -> boolean())
Represents multiple columns.
The columns may be specified as one of:
a list of columns indexes or names as atoms and strings
a range
a regex that keeps only the names matching the regex
a one-arity function that receives column names and returns true for column names to keep
a two-arity function that receives column names and types and returns true for column names to keep
@type fs_entry() :: FSS.entry()
Represents a filesystem entry, that can be local, S3 or URL.
@type t() :: %Explorer.DataFrame{ data: Explorer.Backend.DataFrame.t(), dtypes: %{required(String.t()) => Explorer.Series.dtype()}, groups: [String.t()], names: [String.t()] }
Represents a dataframe.