View Source Explorer.DataFrame (Explorer v0.10.1)

The DataFrame struct and API.

Dataframes are two-dimensional tabular data structures similar to a spreadsheet. For example, the Iris dataset:

iex> Explorer.Datasets.iris()
#Explorer.DataFrame<
  Polars[150 x 5]
  sepal_length f64 [5.1, 4.9, 4.7, 4.6, 5.0, ...]
  sepal_width f64 [3.5, 3.0, 3.2, 3.1, 3.6, ...]
  petal_length f64 [1.4, 1.4, 1.3, 1.5, 1.4, ...]
  petal_width f64 [0.2, 0.2, 0.2, 0.2, 0.2, ...]
  species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", ...]
>

This dataframe has 150 rows and five columns. Each column is an Explorer.Series of the same size (150):

iex> df = Explorer.Datasets.iris()
iex> df["sepal_length"]
#Explorer.Series<
  Polars[150]
  f64 [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.8, 4.8, 4.3, 5.8, 5.7, 5.4, 5.1, 5.7, 5.1, 5.4, 5.1, 4.6, 5.1, 4.8, 5.0, 5.0, 5.2, 5.2, 4.7, 4.8, 5.4, 5.2, 5.5, 4.9, 5.0, 5.5, 4.9, 4.4, 5.1, 5.0, 4.5, 4.4, 5.0, 5.1, 4.8, 5.1, 4.6, 5.3, 5.0, ...]
>

Creating dataframes

Dataframes can be created from normal Elixir terms. The main way you might do this is with the new/1 function. For example:

iex> Explorer.DataFrame.new(a: ["a", "b"], b: [1, 2])
#Explorer.DataFrame<
  Polars[2 x 2]
  a string ["a", "b"]
  b s64 [1, 2]
>

Or with a list of maps:

iex> Explorer.DataFrame.new([%{"col1" => "a", "col2" => 1}, %{"col1" => "b", "col2" => 2}])
#Explorer.DataFrame<
  Polars[2 x 2]
  col1 string ["a", "b"]
  col2 s64 [1, 2]
>

Verbs

Explorer uses the idea of a consistent set of SQL-like verbs like dplyr which can help solve common data manipulation challenges. These are split into single table verbs, multiple table verbs, and row-based verbs:

Single table verbs

Single table verbs are (unsurprisingly) used for manipulating a single dataframe. Those operations typically driven by column names. These are:

select/2 for picking columns and discard/2 to discard them
filter/2 for picking rows based on predicates
mutate/2 for adding or replacing columns that are functions of existing columns
sort_by/2 for changing the ordering of rows
distinct/2 for picking unique rows
summarise/2 for reducing multiple rows down to a single summary
pivot_longer/3 and pivot_wider/4 for massaging dataframes into longer or wider forms, respectively

Each of these combine with Explorer.DataFrame.group_by/2 for operating by group.

Multiple table verbs

Multiple table verbs are used for combining tables. These are:

join/3 for performing SQL-like joins
concat_columns/1 for horizontally "stacking" dataframes
concat_rows/1 for vertically "stacking" dataframes

Row-based verbs

Those operations are driven by the row index. These are:

head/2 for picking the first rows
tail/2 for picking the last rows
slice/2 for slicing the dataframe by row indexes or a range
slice/3 for slicing a section by an offset
sample/2 for sampling the data-frame by row

IO operations

Explorer supports reading and writing of:

delimited files (such as CSV or TSV)
Parquet
Arrow IPC
Arrow Streaming IPC
Newline Delimited JSON
Databases via Adbc in from_query/3

The convention Explorer uses is to have from_* and to_* functions to read and write to files in the formats above. load_* and dump_* versions are also available to read and write those formats directly in memory.

Files can be fetched from local or remote file system, such as S3, using the following formats:

# path to a file in disk
Explorer.DataFrame.from_parquet("/path/to/file.parquet")

# path to a URL schema (with optional configuration)
Explorer.DataFrame.from_parquet("s3://bucket/file.parquet", config: FSS.S3.config_from_system_env())

# it's possible to configure using keyword lists
Explorer.DataFrame.from_parquet("s3://bucket/file.parquet", config: [access_key_id: "my-key", secret_access_key: "my-secret"])

# a FSS entry (it already includes its config)
Explorer.DataFrame.from_parquet(FSS.S3.parse("s3://bucket/file.parquet"))

The :config option of from_* functions is only required if the filename is a path to a remote resource. In case it's a FSS entry, the requirement is that the config is passed inside the entry struct.

For more details about the options, see the FSS docs.

Selecting columns and access

Several functions in this module, such as select/2, discard/2, drop_nil/2, and so forth accept a single or multiple columns as arguments. The columns can be specified in a variety of formats, which we describe below.

Explorer.DataFrame also implements the Access behaviour (also known as the brackets syntax). This should be familiar for users coming from other language with dataframes such as R or Python. For example:

iex> df = Explorer.Datasets.wine()
iex> df["class"]
#Explorer.Series<
  Polars[178]
  s64 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]
>

Accessing the dataframe with a column name either as a string or an atom, will return the column. You can also pass an integer representing the column order:

iex> df = Explorer.Datasets.wine()
iex> df[0]
#Explorer.Series<
  Polars[178]
  s64 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]
>

You can also pass a list, a range, or a regex to return a dataframe matching the given data type. For example, by passing a list:

iex> df = Explorer.Datasets.wine()
iex> df[["class", "hue"]]
#Explorer.DataFrame<
  Polars[178 x 2]
  class s64 [1, 1, 1, 1, 1, ...]
  hue f64 [1.04, 1.05, 1.03, 0.86, 1.04, ...]
>

Or a range for the given positions:

iex> df = Explorer.Datasets.wine()
iex> df[0..2]
#Explorer.DataFrame<
  Polars[178 x 3]
  class s64 [1, 1, 1, 1, 1, ...]
  alcohol f64 [14.23, 13.2, 13.16, 14.37, 13.24, ...]
  malic_acid f64 [1.71, 1.78, 2.36, 1.95, 2.59, ...]
>

Or a regex to keep only columns matching a given pattern:

iex> df = Explorer.Datasets.wine()
iex> df[~r/(class|hue)/]
#Explorer.DataFrame<
  Polars[178 x 2]
  class s64 [1, 1, 1, 1, 1, ...]
  hue f64 [1.04, 1.05, 1.03, 0.86, 1.04, ...]
>

Given you can also access a series using its index, you can use multiple accesses to select a column and row at the same time:

iex> df = Explorer.Datasets.wine()
iex> df["class"][3]
1

When trying to access a column in the dataframe that does not exist, an ArgumentError is raised.

iex> df =  Explorer.DataFrame.new(a: [1, 2, 3])
iex> df["b"]
** (ArgumentError) could not find column name "b". The available columns are: ["a"].
If you are attempting to interpolate a value, use ^b.

Summary

Functions: Conversion

collect(df)

Computes and collects a data frame to the current node.

compute(df)

Computes the lazy data frame into an eager one, executing the query.

lazy(df)

Converts the dataframe to the lazy version of the current backend.

lazy?(data_frame)

Returns true if the data frame is a lazy one.

new(data, opts \\ [])

Creates a new dataframe.

to_columns(df, opts \\ [])

Converts a dataframe to a list of columns with lists as values.

to_rows(df, opts \\ [])

Converts a dataframe to a list of maps (rows).

to_rows_stream(df, opts \\ [])

Converts a dataframe to a stream of maps (rows).

to_series(df, opts \\ [])

Converts a dataframe to a list of columns with series as values.

Functions: Single-table

correlation(df, opts \\ [])

Calculates the pairwise correlation of numeric columns.

covariance(df, opts \\ [])

Calculates the pairwise covariance of numeric columns.

describe(df, opts \\ [])

Describe numeric columns of a DataFrame.

discard(df, columns_or_column)

Discards a subset of columns by name.

distinct(df, columns \\ .., opts \\ [])

Takes distinct rows by a selection of columns.

drop_nil(df, columns_or_column \\ ..)

Drop nil values.

dummies(df, columns_or_column)

Turns a set of columns to dummy variables.

explode(df, column_or_columns)

Explodes one or multiple list column into multiple rows.

filter(df, query)

Picks rows based on Explorer.Query.

filter_with(df, fun)

Picks rows based on a callback function.

frequencies(df, columns)

Creates a new dataframe with unique rows and the frequencies of each.

group_by(df, group)

Group the dataframe by one or more variables.

mask(df, mask)

Picks rows based on a list or series of values.

mutate(df, mutations, opts \\ [])

Creates or modifies columns based on Explorer.Query.

mutate_with(df, query_or_fun, opts \\ [])

Creates or modifies columns using a callback function and an optional keep argument.

nil_count(df)

Counts the number of nil elements in each column.

pivot_longer(df, columns_to_pivot, opts \\ [])

Pivot data from wide to long.

pivot_wider(df, names_from, values_from, opts \\ [])

Pivot data from long to wide.

pull(df, column)

Extracts a single column as a series.

put(df, column_name, series_or_tensor_or_list, opts \\ [])

Creates or modifies a single column.

relocate(df, columns_or_column, opts)

Relocates columns.

rename(df, names)

Renames columns.

rename_with(df, columns \\ .., callback)

Renames columns with a function.

select(df, columns_or_column)

Selects a subset of columns by name.

sort_by(df, query, opts \\ [])

Sorts rows by columns using Explorer.Query.

sort_with(df, fun, opts \\ [])

Sorts rows by columns using a callback function.

sql(df, sql_string, opts \\ [])

Create a dataframe from the result of a SQL query on an existing dataframe.

summarise(df, query)

Summarise each group to a single row using Explorer.Query.

summarise_with(df, fun)

Summarise each group to a single row using a callback function.

transform(df, opts \\ [], row_transform)

Make new columns by applying a native Elixir function to each row.

transpose(df, opts \\ [])

Transpose a DataFrame.

ungroup(df, groups \\ ..)

Removes grouping variables.

unnest(df, column_or_columns)

Unnests one or multiple struct columns into individual columns.

Functions: Multi-table

concat_columns(dfs)

Combine two or more dataframes column-wise.

concat_columns(df1, df2)

Combine two dataframes column-wise.

concat_rows(dfs)

Combine two or more dataframes row-wise (stack).

concat_rows(df1, df2)

Combine two dataframes row-wise.

join(left, right, opts \\ [])

Join two tables.

Functions: Row-based

head(df, nrows \\ 5)

Returns the first n rows of the dataframe.

sample(df, n_or_frac, opts \\ [])

Sample rows from a dataframe.

shuffle(df, opts \\ [])

Change the order of the rows of a dataframe randomly.

slice(df, row_indices)

Slices rows at the given indices as a new dataframe.

slice(df, offset, length)

Subset a continuous set of rows.

tail(df, nrows \\ 5)

Returns the last n rows of the dataframe.

Functions: Introspection

dtypes(df)

Gets the dtypes of the dataframe columns.

groups(data_frame)

Returns the groups of a dataframe.

n_columns(df)

Returns the number of columns in the dataframe.

n_rows(df)

Returns the number of rows in the dataframe.

names(df)

Gets the names of the dataframe columns.

print(df, opts \\ [])

Prints the DataFrame in a tabular fashion.

shape(df)

Gets the shape of the dataframe as a {height, width} tuple.

Functions: IO

dump_csv(df, opts \\ [])

Writes a dataframe to a binary representation of a delimited file.

dump_csv!(df, opts \\ [])

Similar to dump_csv/2, but raises in case of error.

dump_ipc(df, opts \\ [])

Writes a dataframe to a binary representation of an IPC file.

dump_ipc!(df, opts \\ [])

Similar to dump_ipc/2, but raises in case of error.

dump_ipc_stream(df, opts \\ [])

Writes a dataframe to a binary representation of an IPC Stream file.

dump_ipc_stream!(df, opts \\ [])

Similar to dump_ipc_stream/2, but raises in case of error.

dump_ndjson(df)

Writes a dataframe to a binary representation of a NDJSON file.

dump_ndjson!(df)

Similar to dump_ndjson!/2, but raises in case of error.

dump_parquet(df, opts \\ [])

Writes a dataframe to a binary representation of a Parquet file.

dump_parquet!(df, opts \\ [])

Similar to dump_parquet/2, but raises in case of error.

from_csv(filename, opts \\ [])

Reads a delimited file into a dataframe.

from_csv!(filename, opts \\ [])

Similar to from_csv/2 but raises if there is a problem reading the CSV.

from_ipc(filename, opts \\ [])

Reads an IPC file into a dataframe.

from_ipc!(filename, opts \\ [])

Similar to from_ipc/2 but raises if there is a problem reading the IPC file.

from_ipc_stream(filename, opts \\ [])

Reads an IPC Streaming file into a dataframe.

from_ipc_stream!(filename, opts \\ [])

Similar to from_ipc_stream/2 but raises if there is a problem reading the IPC Stream file.

from_ndjson(filename, opts \\ [])

Read a file of JSON objects or lists separated by new lines

from_ndjson!(filename, opts \\ [])

Similar to from_ndjson/2, but raises in case of error.

from_parquet(filename, opts \\ [])

Reads a parquet file into a dataframe.

from_parquet!(filename, opts \\ [])

Similar to from_parquet/2 but raises if there is a problem reading the Parquet file.

from_query(conn, query, params, opts \\ [])

Reads data from a query.

from_query!(conn, query, params, opts \\ [])

Similar to from_query/4 but raises if there is an error.

load_csv(contents, opts \\ [])

Reads a representation of a CSV file into a dataframe.

load_csv!(contents, opts \\ [])

Similar to load_csv/2 but raises if there is a problem reading the CSV.

load_ipc(contents, opts \\ [])

Reads a binary representing an IPC file into a dataframe.

load_ipc!(contents, opts \\ [])

Similar to load_ipc/2 but raises if there is a problem reading the IPC file.

load_ipc_stream(contents, opts \\ [])

Reads a binary representing an IPC Stream file into a dataframe.

load_ipc_stream!(contents, opts \\ [])

Similar to load_ipc_stream/2 but raises if there is a problem.

load_ndjson(contents, opts \\ [])

Reads a representation of a NDJSON file into a dataframe.

load_ndjson!(contents, opts \\ [])

Similar to load_ndjson/2, but raises in case of error.

load_parquet(contents, opts \\ [])

Reads a binary representation of a parquet file into a dataframe.

load_parquet!(contents, opts \\ [])

Similar to load_parquet/2 but raises if there is a problem reading the Parquet file.

to_csv(df, filename, opts \\ [])

Writes a dataframe to a delimited file.

to_csv!(df, filename, opts \\ [])

Similar to to_csv/3 but raises if there is a problem reading the CSV.

to_ipc(df, filename, opts \\ [])

Writes a dataframe to an IPC file.

to_ipc!(df, filename, opts \\ [])

Similar to to_ipc/3, but raises in case of error.

to_ipc_stream(df, filename, opts \\ [])

Writes a dataframe to an IPC Stream file.

to_ipc_stream!(df, filename, opts \\ [])

Similar to to_ipc_stream/3, but raises in case of error.

to_ndjson(df, filename, opts \\ [])

Writes a dataframe to a ndjson file.

to_ndjson!(df, filename, opts \\ [])

Similar to to_ndjson/3, but raises in case of error.

to_parquet(df, filename, opts \\ [])

Writes a dataframe to a parquet file.

to_parquet!(df, filename, opts \\ [])

Similar to to_parquet/3, but raises in case of error.

Types

column()

Represents a column name or its index.

column_name()

Represents a column name as atom or string.

column_names()

Represents multiple column names as atoms or strings.

column_pairs(value)

Represents a column pair where the value is a column name or a column index, and the value is of type value.

columns()

Represents multiple columns.

fs_entry()

Represents a filesystem entry, that can be local, S3 or URL.

sort_callback_result()

t()

Represents a dataframe.

Functions: Conversion

collect(df)

@spec collect(df :: t()) :: t()

Computes and collects a data frame to the current node.

If the data frame is already in the current node, it is computed but no transfer happens.

Examples

series = Explorer.Series.from_list([1, 2, 3], node: :some@node)
Explorer.Series.collect(series)

compute(df)

@spec compute(df :: t()) :: t()

Computes the lazy data frame into an eager one, executing the query.

It is the opposite of lazy/1. If it is not a lazy data frame, this is a noop.

Collecting a grouped dataframe should return a grouped dataframe.

lazy(df)

@spec lazy(df :: t()) :: t()

Converts the dataframe to the lazy version of the current backend.

Operations on a lazy dataframe are not executed immediately, instead they are batched together, and performed when collect/1 is invoked. This allows for query optimizations, parallelism, leading to better performance.

If already lazy, this is a noop. Converting a grouped dataframe should return a lazy dataframe with groups.

lazy?(data_frame)

@spec lazy?(df :: t()) :: boolean()

Returns true if the data frame is a lazy one.

new(data, opts \\ [])

@spec new(
  Table.Reader.t() | series_pairs,
  opts :: Keyword.t()
) :: t()
when series_pairs:
       %{required(column_name()) => Explorer.Series.t()}
       | [{column_name(), Explorer.Series.t()}]

Creates a new dataframe.

It accepts any of:

a map or keyword list of string/atom keys and series as values
a map or keyword list of string/atom keys and tensors as values
any data structure adhering to the Table.Reader protocol

Options

:backend - The Explorer backend to use. Defaults to the value returned by Explorer.Backend.get/0.
:dtypes - A list/map of {column_name, dtype} pairs. (default: [])
:lazy - force the results into the lazy version of the current backend.

Examples

From series

Series can be given either as keyword lists or maps where the keys are the name and the values are series:

iex> Explorer.DataFrame.new(%{
...>   floats: Explorer.Series.from_list([1.0, 2.0]),
...>   ints: Explorer.Series.from_list([1, nil])
...> })
#Explorer.DataFrame<
  Polars[2 x 2]
  floats f64 [1.0, 2.0]
  ints s64 [1, nil]
>

From tensors

To create dataframe from tensors, you can pass a matrix as argument. Each matrix column becomes a dataframe column with names x1, x2, x3, etc:

iex> Explorer.DataFrame.new(Nx.tensor([
...>   [1, 2, 3],
...>   [4, 5, 6]
...> ]))
#Explorer.DataFrame<
  Polars[2 x 3]
  x1 s64 [1, 4]
  x2 s64 [2, 5]
  x3 s64 [3, 6]
>

Explorer expects tensors to have certain types, so you may need to cast the data accordingly. See Explorer.Series.from_tensor/2 for more info.

You can also pass a keyword list or maps of vectors (rank 1 tensors):

iex> Explorer.DataFrame.new(%{
...>   floats: Nx.tensor([1.0, 2.0], type: :f64),
...>   ints: Nx.tensor([3, 4])
...> })
#Explorer.DataFrame<
  Polars[2 x 2]
  floats f64 [1.0, 2.0]
  ints s64 [3, 4]
>

Use dtypes to force a particular representation:

iex> Explorer.DataFrame.new([
...>   floats: Nx.tensor([1.0, 2.0], type: :f64),
...>   times: Nx.tensor([3_000, 4_000])
...> ], dtypes: [times: :time])
#Explorer.DataFrame<
  Polars[2 x 2]
  floats f64 [1.0, 2.0]
  times time [00:00:00.000003, 00:00:00.000004]
>

From tabular

Tabular data can be either columnar or row-based. Let's start with column data:

iex> Explorer.DataFrame.new(%{floats: [1.0, 2.0], ints: [1, nil]})
#Explorer.DataFrame<
  Polars[2 x 2]
  floats f64 [1.0, 2.0]
  ints s64 [1, nil]
>

iex> Explorer.DataFrame.new(floats: [1.0, 2.0], ints: [1, nil])
#Explorer.DataFrame<
  Polars[2 x 2]
  floats f64 [1.0, 2.0]
  ints s64 [1, nil]
>

iex> Explorer.DataFrame.new([floats: [1.0, 2.0], ints: [1, nil], binaries: [<<239, 191, 19>>, nil]], dtypes: [{:binaries, :binary}])
#Explorer.DataFrame<
  Polars[2 x 3]
  floats f64 [1.0, 2.0]
  ints s64 [1, nil]
  binaries binary [<<239, 191, 19>>, nil]
>

iex> Explorer.DataFrame.new(%{floats: [1.0, 2.0], ints: [1, "wrong"]})
** (ArgumentError) cannot create series "ints": the value "wrong" does not match the inferred dtype {:s, 64}

From row data:

iex> rows = [%{id: 1, name: "José"}, %{id: 2, name: "Christopher"}, %{id: 3, name: "Cristine"}]
iex> Explorer.DataFrame.new(rows)
#Explorer.DataFrame<
  Polars[3 x 2]
  id s64 [1, 2, 3]
  name string ["José", "Christopher", "Cristine"]
>

iex> rows = [[id: 1, name: "José"], [id: 2, name: "Christopher"], [id: 3, name: "Cristine"]]
iex> Explorer.DataFrame.new(rows)
#Explorer.DataFrame<
  Polars[3 x 2]
  id s64 [1, 2, 3]
  name string ["José", "Christopher", "Cristine"]
>

to_columns(df, opts \\ [])

@spec to_columns(df :: t(), Keyword.t()) :: map()

Converts a dataframe to a list of columns with lists as values.

See to_series/2 if you want a list of columns with series as values. Note that this function does not take into account groups.

Warning

This is an expensive operation since it converts series to lists and doing so will copy the whole dataframe. Prefer to use the operations in this and the Explorer.Series module rather than the ones in Enum whenever possible, as Explorer is optimized for large series.

Options

:atom_keys - Configure if the resultant map should have atom keys. (default: false)

Examples

iex> df = Explorer.DataFrame.new(ints: [1, nil], floats: [1.0, 2.0])
iex> Explorer.DataFrame.to_columns(df)
%{"floats" => [1.0, 2.0], "ints" => [1, nil]}

iex> df = Explorer.DataFrame.new(floats: [1.0, 2.0], ints: [1, nil])
iex> Explorer.DataFrame.to_columns(df, atom_keys: true)
%{floats: [1.0, 2.0], ints: [1, nil]}

to_rows(df, opts \\ [])

@spec to_rows(df :: t(), Keyword.t()) :: [map()]

Converts a dataframe to a list of maps (rows).

Warning

This is an expensive operation since data is stored in a columnar format. You must avoid converting a dataframe to rows, as that will transform and copy the whole dataframe in memory. Prefer to use the operations in this module rather than the ones in Enum whenever possible, as this module is optimized for large series.

Options

:atom_keys - Configure if the resultant maps should have atom keys. (default: false)

Examples

iex> df = Explorer.DataFrame.new(floats: [1.0, 2.0], ints: [1, nil])
iex> Explorer.DataFrame.to_rows(df)
[%{"floats" => 1.0, "ints" => 1}, %{"floats" => 2.0 ,"ints" => nil}]

iex> df = Explorer.DataFrame.new(floats: [1.0, 2.0], ints: [1, nil])
iex> Explorer.DataFrame.to_rows(df, atom_keys: true)
[%{floats: 1.0, ints: 1}, %{floats: 2.0, ints: nil}]

to_rows_stream(df, opts \\ [])

@spec to_rows_stream(df :: t(), Keyword.t()) :: Enumerable.t()

Converts a dataframe to a stream of maps (rows).

Warning

This is an expensive operation since data is stored in a columnar format. Prefer to use the operations in this module rather than the ones in Enum whenever possible, as this module is optimized for large series.

Options

:atom_keys - Configure if the resultant maps should have atom keys. (default: false)
:chunk_size - Number of rows passed to to_rows/2 while streaming over the data. (default: 1000)

Examples

iex> df = Explorer.DataFrame.new(floats: [1.0, 2.0], ints: [1, nil])
iex> Explorer.DataFrame.to_rows_stream(df) |> Enum.map(& &1)
[%{"floats" => 1.0, "ints" => 1}, %{"floats" => 2.0 ,"ints" => nil}]

iex> df = Explorer.DataFrame.new(floats: [1.0, 2.0], ints: [1, nil])
iex> Explorer.DataFrame.to_rows_stream(df, atom_keys: true) |> Enum.map(& &1)
[%{floats: 1.0, ints: 1}, %{floats: 2.0, ints: nil}]

to_series(df, opts \\ [])

@spec to_series(df :: t(), Keyword.t()) :: map()

Converts a dataframe to a list of columns with series as values.

See to_columns/2 if you want a list of columns with lists as values. Note that this function does not take into account groups.

Options

:atom_keys - Configure if the resultant map should have atom keys. (default: false)

Examples

iex> df = Explorer.DataFrame.new(ints: [1, nil], floats: [1.0, 2.0])
iex> map = Explorer.DataFrame.to_series(df)
iex> Explorer.Series.to_list(map["floats"])
[1.0, 2.0]
iex> Explorer.Series.to_list(map["ints"])
[1, nil]

Functions: Single-table

correlation(df, opts \\ [])

@spec correlation(df :: t(), opts :: Keyword.t()) :: df :: t()

Calculates the pairwise correlation of numeric columns.

The returned dataframe is the correlation matrix.

Supported dtypes

Only columns with the following dtypes are taken into account.

floats: {:f, 32} and {:f, 64}
integers: {:s, 8}, {:s, 16}, {:s, 32}, {:s, 64}, {:u, 8}, {:u, 16}, {:u, 32} and {:u, 64}

The resultant columns are always {:f, 64}.

Options

:columns - the selection of columns to calculate. Defaults to all numeric columns.
:column_name - the name of the column with column names. Defaults to "names".
:ddof - the 'delta degrees of freedom' - the divisor used in the correlation calculation. Defaults to 1.
:method refers to the correlation method. The following methods are available:
- :pearson : Standard correlation coefficient. (default)
- :spearman : Spearman rank correlation.

Examples

iex> df = Explorer.DataFrame.new(dogs: [1, 8, 3], cats: [4, 5, 2])
iex> Explorer.DataFrame.correlation(df)
#Explorer.DataFrame<
  Polars[2 x 3]
  names string ["dogs", "cats"]
  dogs f64 [1.0000000000000002, 0.5447047794019219]
  cats f64 [0.5447047794019219, 1.0]
>

covariance(df, opts \\ [])

@spec covariance(df :: t(), opts :: Keyword.t()) :: df :: t()

Calculates the pairwise covariance of numeric columns.

The returned dataframe is the covariance matrix.

Supported dtypes

Only columns with the following dtypes are taken into account.

floats: {:f, 32} and {:f, 64}
integers: {:s, 8}, {:s, 16}, {:s, 32}, {:s, 64}, {:u, 8}, {:u, 16}, {:u, 32} and {:u, 64}

The resultant columns are always {:f, 64}.

Options

:columns - the selection of columns to calculate. Defaults to all numeric columns.
:column_name - the name of the column with column names. Defaults to "names".
:ddof - the 'delta degrees of freedom' - the divisor used in the covariance calculation. Defaults to 1.

Examples

iex> df = Explorer.DataFrame.new(dogs: [1, 0, 2, 1], cats: [2, 3, 0, 1])
iex> Explorer.DataFrame.covariance(df)
#Explorer.DataFrame<
  Polars[2 x 3]
  names string ["dogs", "cats"]
  dogs f64 [0.6666666666666666, -1.0]
  cats f64 [-1.0, 1.6666666666666667]
>

describe(df, opts \\ [])

@spec describe(df :: t(), Keyword.t()) :: t()

Describe numeric columns of a DataFrame.

Groups are ignored if the dataframe is using any.

Options

:percentiles - Floating point list with the percentiles to be calculated. (default: [0.25, 0.5, 0.75])

Examples

iex> df = Explorer.DataFrame.new(a: ["d", nil, "f"], b: [1, 2, 3], c: ["a", "b", "c"])
iex> Explorer.DataFrame.describe(df)
#Explorer.DataFrame<
  Polars[9 x 4]
  describe string ["count", "nil_count", "mean", "std", "min", ...]
  a string ["2", "1", nil, nil, nil, ...]
  b f64 [3.0, 0.0, 2.0, 1.0, 1.0, ...]
  c string ["3", "0", nil, nil, nil, ...]
>

iex> df = Explorer.DataFrame.new(a: ["d", nil, "f"], b: [1, 2, 3], c: ["a", "b", "c"])
iex> Explorer.DataFrame.describe(df, percentiles: [0.3, 0.5, 0.8])
#Explorer.DataFrame<
  Polars[9 x 4]
  describe string ["count", "nil_count", "mean", "std", "min", ...]
  a string ["2", "1", nil, nil, nil, ...]
  b f64 [3.0, 0.0, 2.0, 1.0, 1.0, ...]
  c string ["3", "0", nil, nil, nil, ...]
>

discard(df, columns_or_column)

@spec discard(df :: t(), column() | columns()) :: t()

Discards a subset of columns by name.

It's important to notice that groups are kept: you can't discard grouping columns.

Examples

iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.discard(df, ["b"])
#Explorer.DataFrame<
  Polars[3 x 1]
  a string ["a", "b", "c"]
>

iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3], c: [4, 5, 6])
iex> Explorer.DataFrame.discard(df, ["a", "b"])
#Explorer.DataFrame<
  Polars[3 x 1]
  c s64 [4, 5, 6]
>

Ranges, regexes, and functions are also accepted in column names, as in select/2.

Grouped examples

You cannot discard grouped columns. You need to ungroup before removing them:

iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.discard(grouped, ["species"])
#Explorer.DataFrame<
  Polars[150 x 5]
  Groups: ["species"]
  sepal_length f64 [5.1, 4.9, 4.7, 4.6, 5.0, ...]
  sepal_width f64 [3.5, 3.0, 3.2, 3.1, 3.6, ...]
  petal_length f64 [1.4, 1.4, 1.3, 1.5, 1.4, ...]
  petal_width f64 [0.2, 0.2, 0.2, 0.2, 0.2, ...]
  species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", ...]
>

distinct(df, columns \\ .., opts \\ [])

@spec distinct(df :: t(), columns :: columns(), opts :: Keyword.t()) :: t()

Takes distinct rows by a selection of columns.

Distinct is not affected by groups, although groups are kept in the columns selection if keep_all option is false (the default).

Options

:keep_all - If set to true, keep all columns. Default is false.

Examples

By default will return unique values of the requested columns:

iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.distinct(df, ["year", "country"])
#Explorer.DataFrame<
  Polars[1094 x 2]
  year s64 [2010, 2010, 2010, 2010, 2010, ...]
  country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
>

If keep_all is set to true, then the first value of each column not in the requested columns will be returned:

iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.distinct(df, ["year", "country"], keep_all: true)
#Explorer.DataFrame<
  Polars[1094 x 10]
  year s64 [2010, 2010, 2010, 2010, 2010, ...]
  country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
  total s64 [2308, 1254, 32500, 141, 7924, ...]
  solid_fuel s64 [627, 117, 332, 0, 0, ...]
  liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
  gas_fuel s64 [74, 7, 14565, 0, 374, ...]
  cement s64 [5, 177, 2598, 0, 204, ...]
  gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
  per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
  bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>

A callback on the dataframe's names can be passed instead of a list (like select/2):

iex> df = Explorer.DataFrame.new(x1: [1, 3, 3], x2: ["a", "c", "c"], y1: [1, 2, 3])
iex> Explorer.DataFrame.distinct(df, &String.starts_with?(&1, "x"))
#Explorer.DataFrame<
  Polars[2 x 2]
  x1 s64 [1, 3]
  x2 string ["a", "c"]
>

If the dataframe has groups, then the columns of each group will be added to the distinct columns:

iex> df = Explorer.DataFrame.new(x1: [1, 3, 3], x2: ["a", "c", "c"], y1: [1, 2, 3])
iex> df = Explorer.DataFrame.group_by(df, "x1")
iex> Explorer.DataFrame.distinct(df, ["x2"])
#Explorer.DataFrame<
  Polars[2 x 2]
  Groups: ["x1"]
  x1 s64 [1, 3]
  x2 string ["a", "c"]
>

drop_nil(df, columns_or_column \\ ..)

@spec drop_nil(df :: t(), column() | columns()) :: t()

Drop nil values.

Optionally accepts a subset of columns.

Examples

To drop nils on all columns:

iex> df = Explorer.DataFrame.new(a: [1, 2, nil], b: [1, nil, 3])
iex> Explorer.DataFrame.drop_nil(df)
#Explorer.DataFrame<
  Polars[1 x 2]
  a s64 [1]
  b s64 [1]
>

To drop nils on a single column:

iex> df = Explorer.DataFrame.new(a: [1, 2, nil], b: [1, nil, 3])
iex> Explorer.DataFrame.drop_nil(df, :a)
#Explorer.DataFrame<
  Polars[2 x 2]
  a s64 [1, 2]
  b s64 [1, nil]
>

To drop some columns:

iex> df = Explorer.DataFrame.new(a: [1, 2, nil], b: [1, nil, 3], c: [nil, 5, 6])
iex> Explorer.DataFrame.drop_nil(df, [:a, :c])
#Explorer.DataFrame<
  Polars[1 x 3]
  a s64 [2]
  b s64 [nil]
  c s64 [5]
>

Ranges, regexes, and functions are also accepted in column names, as in select/2.

dummies(df, columns_or_column)

@spec dummies(df :: t(), column() | columns()) :: t()

Turns a set of columns to dummy variables.

In case the dataframe is using groups, all groups will be removed.

Examples

To mark a single column as dummy:

iex> df = Explorer.DataFrame.new(col_x: ["a", "b", "a", "c"], col_y: ["b", "a", "b", "d"])
iex> Explorer.DataFrame.dummies(df, "col_x")
#Explorer.DataFrame<
  Polars[4 x 3]
  col_x_a u8 [1, 0, 1, 0]
  col_x_b u8 [0, 1, 0, 0]
  col_x_c u8 [0, 0, 0, 1]
>

Or multiple columns:

iex> df = Explorer.DataFrame.new(col_x: ["a", "b", "a", "c"], col_y: ["b", "a", "b", "d"])
iex> Explorer.DataFrame.dummies(df, ["col_x", "col_y"])
#Explorer.DataFrame<
  Polars[4 x 6]
  col_x_a u8 [1, 0, 1, 0]
  col_x_b u8 [0, 1, 0, 0]
  col_x_c u8 [0, 0, 0, 1]
  col_y_b u8 [1, 0, 1, 0]
  col_y_a u8 [0, 1, 0, 0]
  col_y_d u8 [0, 0, 0, 1]
>

Or all string columns:

iex> df = Explorer.DataFrame.new(num: [1, 2, 3, 4], col_y: ["b", "a", "b", "d"])
iex> Explorer.DataFrame.dummies(df, fn _name, type -> type == :string end)
#Explorer.DataFrame<
  Polars[4 x 3]
  col_y_b u8 [1, 0, 1, 0]
  col_y_a u8 [0, 1, 0, 0]
  col_y_d u8 [0, 0, 0, 1]
>

Ranges, regexes, and functions are also accepted in column names, as in select/2.

explode(df, column_or_columns)

@spec explode(df :: t(), column :: column_name() | [column_name()]) :: t()

Explodes one or multiple list column into multiple rows.

When exploding multiple columns, the number of list elements in each row for the exploded columns must be the same.

Examples

iex> df = Explorer.DataFrame.new(a: [[1, 2], [3, 4]], b: [[5, 6], [7, 8]], c: ["a", "b"])
iex> Explorer.DataFrame.explode(df, :a)
#Explorer.DataFrame<
  Polars[4 x 3]
  a s64 [1, 2, 3, 4]
  b list[s64] [[5, 6], [5, 6], [7, 8], [7, ...]]
  c string ["a", "a", "b", "b"]
>
iex> Explorer.DataFrame.explode(df, [:a, :b])
#Explorer.DataFrame<
  Polars[4 x 3]
  a s64 [1, 2, 3, 4]
  b s64 [5, 6, 7, 8]
  c string ["a", "a", "b", "b"]
>

You can think of exploding multiple list columns as being the inverse of aggregating the elements the exploded columns into lists:

iex> df = Explorer.DataFrame.new(a: [1, 2, 3, 4], b: [5, 6, 7, 8], c: ["a", "a", "b", "b"])
iex> df = df |> Explorer.DataFrame.group_by(:c) |> Explorer.DataFrame.summarise(a: a, b: b)
#Explorer.DataFrame<
  Polars[2 x 3]
  c string ["a", "b"]
  a list[s64] [[1, 2], [3, 4]]
  b list[s64] [[5, 6], [7, 8]]
>
iex> Explorer.DataFrame.explode(df, [:a, :b]) # we are back where we started
#Explorer.DataFrame<
  Polars[4 x 3]
  c string ["a", "a", "b", "b"]
  a s64 [1, 2, 3, 4]
  b s64 [5, 6, 7, 8]
>

If you want to perform the cartesian product of two list columns, you must call explode/2 once for each column:

iex> df = Explorer.DataFrame.new(a: [[1, 2], [3, 4]], b: [[5, 6], [7, 8]], c: ["a", "b"])
iex> df |> Explorer.DataFrame.explode(:a) |> Explorer.DataFrame.explode(:b)
#Explorer.DataFrame<
  Polars[8 x 3]
  a s64 [1, 1, 2, 2, 3, ...]
  b s64 [5, 6, 5, 6, 7, ...]
  c string ["a", "a", "a", "a", "b", ...]
>

filter(df, query)

(macro)

Picks rows based on Explorer.Query.

The query is compiled and runs efficiently against the dataframe. The query must return a boolean expression or a list of boolean expressions. When a list is returned, they are joined as and expressions.

Notice
This is a macro. You must require Explorer.DataFrame before using it.

Besides element-wise series operations, you can also use window functions and aggregations inside comparisons. In such cases, grouped dataframes may have different results than ungrouped ones, because the filtering is computed withing groups. See examples below.

See filter_with/2 for a callback version of this function without Explorer.Query.

Examples

iex> df = Explorer.DataFrame.new(col1: ["a", "b", "c"], col2: [1, 2, 3])
iex> Explorer.DataFrame.filter(df, col2 > 2)
#Explorer.DataFrame<
  Polars[1 x 2]
  col1 string ["c"]
  col2 s64 [3]
>

iex> df = Explorer.DataFrame.new(col1: ["a", "b", "c"], col2: [1, 2, 3])
iex> Explorer.DataFrame.filter(df, col1 == "b")
#Explorer.DataFrame<
  Polars[1 x 2]
  col1 string ["b"]
  col2 s64 [2]
>

iex> df = Explorer.DataFrame.new(col1: [5, 4, 3], col2: [1, 2, 3])
iex> Explorer.DataFrame.filter(df, [col1 > 3, col2 < 3])
#Explorer.DataFrame<
  Polars[2 x 2]
  col1 s64 [5, 4]
  col2 s64 [1, 2]
>

Returning a non-boolean expression errors:

iex> df = Explorer.DataFrame.new(col1: ["a", "b", "c"], col2: [1, 2, 3])
iex> Explorer.DataFrame.filter(df, cumulative_max(col2))
** (ArgumentError) expecting the function to return a boolean LazySeries, but instead it returned a LazySeries of type {:s, 64}

Which can be addressed by converting it to boolean:

iex> df = Explorer.DataFrame.new(col1: ["a", "b", "c"], col2: [1, 2, 3])
iex> Explorer.DataFrame.filter(df, cumulative_max(col2) == 1)
#Explorer.DataFrame<
  Polars[1 x 2]
  col1 string ["a"]
  col2 s64 [1]
>

Grouped examples

In a grouped dataframe, the aggregation is calculated within each group.

In the following example we select the flowers of the Iris dataset that have the "petal length" above the average of each species group.

iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.filter(grouped, petal_length > mean(petal_length))
#Explorer.DataFrame<
  Polars[79 x 5]
  Groups: ["species"]
  sepal_length f64 [4.6, 5.4, 5.0, 4.9, 5.4, ...]
  sepal_width f64 [3.1, 3.9, 3.4, 3.1, 3.7, ...]
  petal_length f64 [1.5, 1.7, 1.5, 1.5, 1.5, ...]
  petal_width f64 [0.2, 0.4, 0.2, 0.1, 0.2, ...]
  species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", ...]
>

filter_with(df, fun)

@spec filter_with(
  df :: t(),
  callback_or_lazy_series_or_list ::
    (Explorer.Backend.QueryFrame.t() ->
       Explorer.Series.lazy_t() | [Explorer.Series.lazy_t()])
    | Explorer.Series.lazy_t()
    | [Explorer.Series.lazy_t()]
) :: t()

Picks rows based on a callback function.

The callback receives a lazy dataframe. A lazy dataframe does not hold any values, instead it stores all operations in order to execute all filtering performantly.

This is a callback version of filter/2.

Examples

iex> df = Explorer.DataFrame.new(col1: ["a", "b", "c"], col2: [1, 2, 3])
iex> Explorer.DataFrame.filter_with(df, &Explorer.Series.greater(&1["col2"], 2))
#Explorer.DataFrame<
  Polars[1 x 2]
  col1 string ["c"]
  col2 s64 [3]
>

iex> df = Explorer.DataFrame.new(col1: ["a", "b", "c"], col2: [1, 2, 3])
iex> Explorer.DataFrame.filter_with(df, fn df -> Explorer.Series.equal(df["col1"], "b") end)
#Explorer.DataFrame<
  Polars[1 x 2]
  col1 string ["b"]
  col2 s64 [2]
>

Grouped examples

In a grouped dataframe, the aggregation is calculated within each group.

In the following example we select the flowers of the Iris dataset that have the "petal length" above the average of each species group.

iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.filter_with(grouped, &Explorer.Series.greater(&1["petal_length"], Explorer.Series.mean(&1["petal_length"])))
#Explorer.DataFrame<
  Polars[79 x 5]
  Groups: ["species"]
  sepal_length f64 [4.6, 5.4, 5.0, 4.9, 5.4, ...]
  sepal_width f64 [3.1, 3.9, 3.4, 3.1, 3.7, ...]
  petal_length f64 [1.5, 1.7, 1.5, 1.5, 1.5, ...]
  petal_width f64 [0.2, 0.4, 0.2, 0.1, 0.2, ...]
  species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", ...]
>

frequencies(df, columns)

@spec frequencies(df :: t(), columns :: column_names()) :: t()

Creates a new dataframe with unique rows and the frequencies of each.

Examples

iex> df = Explorer.DataFrame.new(a: ["a", "a", "b"], b: [1, 1, nil])
iex> Explorer.DataFrame.frequencies(df, [:a, :b])
#Explorer.DataFrame<
  Polars[2 x 3]
  a string ["a", "b"]
  b s64 [1, nil]
  counts u32 [2, 1]
>

group_by(df, group)

@spec group_by(df :: t(), groups_or_group :: column_names() | column_name()) :: t()

Group the dataframe by one or more variables.

When the dataframe has grouping variables, operations are performed per group. Explorer.DataFrame.ungroup/2 removes grouping.

Examples

You can group by a single variable:

iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.group_by(df, "country")
#Explorer.DataFrame<
  Polars[1094 x 10]
  Groups: ["country"]
  year s64 [2010, 2010, 2010, 2010, 2010, ...]
  country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
  total s64 [2308, 1254, 32500, 141, 7924, ...]
  solid_fuel s64 [627, 117, 332, 0, 0, ...]
  liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
  gas_fuel s64 [74, 7, 14565, 0, 374, ...]
  cement s64 [5, 177, 2598, 0, 204, ...]
  gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
  per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
  bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>

Or you can group by multiple columns in a given list:

iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.group_by(df, ["country", "year"])
#Explorer.DataFrame<
  Polars[1094 x 10]
  Groups: ["country", "year"]
  year s64 [2010, 2010, 2010, 2010, 2010, ...]
  country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
  total s64 [2308, 1254, 32500, 141, 7924, ...]
  solid_fuel s64 [627, 117, 332, 0, 0, ...]
  liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
  gas_fuel s64 [74, 7, 14565, 0, 374, ...]
  cement s64 [5, 177, 2598, 0, 204, ...]
  gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
  per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
  bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>

Or by a range:

iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.group_by(df, 0..1)
#Explorer.DataFrame<
  Polars[1094 x 10]
  Groups: ["year", "country"]
  year s64 [2010, 2010, 2010, 2010, 2010, ...]
  country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
  total s64 [2308, 1254, 32500, 141, 7924, ...]
  solid_fuel s64 [627, 117, 332, 0, 0, ...]
  liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
  gas_fuel s64 [74, 7, 14565, 0, 374, ...]
  cement s64 [5, 177, 2598, 0, 204, ...]
  gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
  per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
  bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>

Regexes and functions are also accepted in column names, as in select/2.

mask(df, mask)

@spec mask(df :: t(), mask :: Explorer.Series.t() | [boolean()]) :: t()

Picks rows based on a list or series of values.

Examples

This function must only be used when you need to select rows based on external values that are not available to the dataframe. For example, you can pass a list:

iex> df = Explorer.DataFrame.new(col1: ["a", "b", "c"], col2: [1, 2, 3])
iex> Explorer.DataFrame.mask(df, [false, true, false])
#Explorer.DataFrame<
  Polars[1 x 2]
  col1 string ["b"]
  col2 s64 [2]
>

You must avoid using masks when the masks themselves are computed from other columns. For example, DO NOT do this:

iex> df = Explorer.DataFrame.new(col1: ["a", "b", "c"], col2: [1, 2, 3])
iex> Explorer.DataFrame.mask(df, Explorer.Series.greater(df["col2"], 1))
#Explorer.DataFrame<
  Polars[2 x 2]
  col1 string ["b", "c"]
  col2 s64 [2, 3]
>

Instead, do this:

iex> df = Explorer.DataFrame.new(col1: ["a", "b", "c"], col2: [1, 2, 3])
iex> Explorer.DataFrame.filter_with(df, fn df -> Explorer.Series.greater(df["col2"], 1) end)
#Explorer.DataFrame<
  Polars[2 x 2]
  col1 string ["b", "c"]
  col2 s64 [2, 3]
>

The filter_with/2 version is much more efficient because it doesn't need to create intermediate series representations to apply the mask.

mutate(df, mutations, opts \\ [])

(macro)

Creates or modifies columns based on Explorer.Query.

The query is compiled and runs efficiently against the dataframe. New variables overwrite existing variables of the same name. Column names are coerced from atoms to strings.

Notice
This is a macro. You must require Explorer.DataFrame before using it.

Besides element-wise series operations, you can also use window functions and aggregations inside mutations. In such cases, grouped dataframes may have different results than ungrouped ones, because the mutation is computed withing groups. See examples below.

See mutate_with/3 for a callback version of this function without Explorer.Query. If your mutation cannot be expressed with queries, you may compute the values using the Explorer.Series API directly and then add it to the dataframe using put/3.

Options

keep - Controls which columns should be kept in the resulting dataframe. Supported options for keep are:
- :all - Retains all columns from the input dataframe (default).
- :none - Retains only the columns created by the mutation (and
```
      grouping columns if operating on a grouped DataFrame).
```

Notice
You will need to wrap your mutations in [] in order to use the :keep option.

Examples

Mutations are useful to add or modify columns in your dataframe:

iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.mutate(df, c: b + 1)
#Explorer.DataFrame<
  Polars[3 x 3]
  a string ["a", "b", "c"]
  b s64 [1, 2, 3]
  c s64 [2, 3, 4]
>

It's also possible to overwrite existing columns:

iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.mutate(df, a: b * 2)
#Explorer.DataFrame<
  Polars[3 x 2]
  a s64 [2, 4, 6]
  b s64 [1, 2, 3]
>

Scalar values are repeated to fill the series:

iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.mutate(df, a: 4)
#Explorer.DataFrame<
  Polars[3 x 2]
  a s64 [4, 4, 4]
  b s64 [1, 2, 3]
>

It's also possible to use functions from the Series module, like Explorer.Series.window_sum/3:

iex> df = Explorer.DataFrame.new(a: [1, 2, 3])
iex> Explorer.DataFrame.mutate(df, b: window_sum(a, 2))
#Explorer.DataFrame<
  Polars[3 x 2]
  a s64 [1, 2, 3]
  b s64 [1, 3, 5]
>

Alternatively, all of the above works with a map instead of a keyword list:

iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.mutate(df, %{"c" => cast(b, {:f, 64})})
#Explorer.DataFrame<
  Polars[3 x 3]
  a string ["a", "b", "c"]
  b s64 [1, 2, 3]
  c f64 [1.0, 2.0, 3.0]
>

Using the :keep option to retain only new columns:

iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.mutate(df, [c: b + 1], keep: :none)
#Explorer.DataFrame<
  Polars[3 x 1]
  c s64 [2, 3, 4]
>

Grouped examples

Mutations in grouped dataframes takes the context of the group. This enables some aggregations to be made considering each group. It's almost like summarise/2, but repeating the results for each member in the group. For example, if we want to count how many elements of a given group, we can add a new column with that aggregation:

iex> df = Explorer.DataFrame.new(id: ["a", "a", "b"], b: [1, 2, 3])
iex> grouped = Explorer.DataFrame.group_by(df, :id)
iex> Explorer.DataFrame.mutate(grouped, count: count(b))
#Explorer.DataFrame<
  Polars[3 x 3]
  Groups: ["id"]
  id string ["a", "a", "b"]
  b s64 [1, 2, 3]
  count u32 [2, 2, 1]
>

In case we want to get the average size of the petal length from the Iris dataset, we can:

iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.mutate(grouped, petal_length_avg: mean(petal_length))
#Explorer.DataFrame<
  Polars[150 x 6]
  Groups: ["species"]
  sepal_length f64 [5.1, 4.9, 4.7, 4.6, 5.0, ...]
  sepal_width f64 [3.5, 3.0, 3.2, 3.1, 3.6, ...]
  petal_length f64 [1.4, 1.4, 1.3, 1.5, 1.4, ...]
  petal_width f64 [0.2, 0.2, 0.2, 0.2, 0.2, ...]
  species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", ...]
  petal_length_avg f64 [1.464, 1.464, 1.464, 1.464, 1.464, ...]
>

We can also use the keep option on grouped dataframes:

iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.mutate(grouped, [petal_length_avg: mean(petal_length)], keep: :none)
#Explorer.DataFrame<
  Polars[150 x 2]
  Groups: ["species"]
  species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", ...]
  petal_length_avg f64 [1.464, 1.464, 1.464, 1.464, 1.464, ...]
>

mutate_with(df, query_or_fun, opts \\ [])

@spec mutate_with(
  df :: t(),
  callback_or_column_pairs ::
    (Explorer.Backend.QueryFrame.t() -> column_pairs(Explorer.Series.lazy_t()))
    | column_pairs(Explorer.Series.lazy_t()),
  opts :: keyword()
) :: t()

Creates or modifies columns using a callback function and an optional keep argument.

The callback receives a lazy dataframe. A lazy dataframe doesn't hold any values, instead it stores all operations in order to execute all mutations performantly.

This is a callback version of mutate/3. If your mutation cannot be expressed with lazy dataframes, you may compute the values using the Explorer.Series API directly and then add it to the dataframe using put/3.

Options

keep - Controls which columns should be kept in the resulting dataframe. Supported options for keep are:
- :all - Retains all columns from the input dataframe (default).
- :none - Retains only the columns created by the mutation (and
```
      grouping columns if operating on a grouped DataFrame).
```

Examples

Here is an example of a new column that sums the value of two other columns:

iex> df = Explorer.DataFrame.new(a: [4, 5, 6], b: [1, 2, 3])
iex> Explorer.DataFrame.mutate_with(df, &[c: Explorer.Series.add(&1["a"], &1["b"])])
#Explorer.DataFrame<
  Polars[3 x 3]
  a s64 [4, 5, 6]
  b s64 [1, 2, 3]
  c s64 [5, 7, 9]
>

You can overwrite existing columns as well:

iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.mutate_with(df, &[b: Explorer.Series.pow(&1["b"], 2)])
#Explorer.DataFrame<
  Polars[3 x 2]
  a string ["a", "b", "c"]
  b s64 [1, 4, 9]
>

It's possible to "reuse" a variable for different computations:

iex> df = Explorer.DataFrame.new(a: [4, 5, 6], b: [1, 2, 3])
iex> Explorer.DataFrame.mutate_with(df, fn ldf ->
iex>   c = Explorer.Series.add(ldf["a"], ldf["b"])
iex>   [c: c, d: Explorer.Series.window_sum(c, 2)]
iex> end)
#Explorer.DataFrame<
  Polars[3 x 4]
  a s64 [4, 5, 6]
  b s64 [1, 2, 3]
  c s64 [5, 7, 9]
  d s64 [5, 12, 16]
>

Grouped examples

Mutations in grouped dataframes takes the context of the group. For example, if we want to count how many elements of a given group, we can add a new column with that aggregation:

iex> df = Explorer.DataFrame.new(id: ["a", "a", "b"], b: [1, 2, 3])
iex> grouped = Explorer.DataFrame.group_by(df, :id)
iex> Explorer.DataFrame.mutate_with(grouped, &[count: Explorer.Series.count(&1["b"])])
#Explorer.DataFrame<
  Polars[3 x 3]
  Groups: ["id"]
  id string ["a", "a", "b"]
  b s64 [1, 2, 3]
  count u32 [2, 2, 1]
>

Examples with the `keep` option

Here is an example of a new column that sums the value of two other columns:

iex> df = Explorer.DataFrame.new(a: [4, 5, 6], b: [1, 2, 3])
iex> Explorer.DataFrame.mutate_with(df, &[c: Explorer.Series.add(&1["a"], &1["b"])], keep: :all)
#Explorer.DataFrame<
  Polars[3 x 3]
  a s64 [4, 5, 6]
  b s64 [1, 2, 3]
  c s64 [5, 7, 9]
>

Note that keep: :all is the default behavior.

Using the :keep option to retain only new columns:

iex> df = Explorer.DataFrame.new(a: [4, 5, 6], b: [1, 2, 3])
iex> Explorer.DataFrame.mutate_with(df, &[c: Explorer.Series.add(&1["a"], &1["b"])], keep: :none)
#Explorer.DataFrame<
  Polars[3 x 1]
  c s64 [5, 7, 9]
>

When using keep: :none on a grouped DataFrame, the grouping columns are retained along with the newly created columns:

iex> df = Explorer.DataFrame.new(name: ["a", "a", "b", "b"], a: [1, 2, 3, 4], b: [5, 6, 7, 8])
iex> grouped = Explorer.DataFrame.group_by(df, :name)
iex> Explorer.DataFrame.mutate_with(grouped, &[ab_sum: Explorer.Series.sum(Explorer.Series.add(&1["a"], &1["b"]))], keep: :none)
#Explorer.DataFrame<
  Polars[4 x 2]
  Groups: ["name"]
  name string ["a", "a", "b", "b"]
  ab_sum s64 [14, 14, 22, 22]
>

Note that the name column is retained because it's the grouping column, even though keep: :none is specified. The a and b columns are dropped, and only the new ab_sum column is added.

nil_count(df)

@spec nil_count(df :: t()) :: t()

Counts the number of nil elements in each column.

Examples

iex> df = Explorer.DataFrame.new(a: ["d", nil, "f"], b: [nil, 2, nil], c: ["a", "b", "c"])
iex> Explorer.DataFrame.nil_count(df)
#Explorer.DataFrame<
  Polars[1 x 3]
  a u32 [1]
  b u32 [2]
  c u32 [0]
>

pivot_longer(df, columns_to_pivot, opts \\ [])

@spec pivot_longer(
  df :: t(),
  columns_to_pivot :: columns(),
  opts :: Keyword.t()
) :: t()

Pivot data from wide to long.

pivot_longer/3 "lengthens" data, increasing the number of rows and decreasing the number of columns. The inverse transformation is pivot_wider/4.

The second argument, columns_to_pivot, can be either list of column names to pivot. Ranges, regexes, and functions are also accepted in column names, as in select/2. The selected columns must always have the same data type.

In case the dataframe is using groups, the groups that are also in the list of columns to pivot will be removed from the resultant dataframe. See the examples below.

Options

:select - Columns that are not in the list of pivot and should be kept in the dataframe. Ranges, regexes, and functions are also accepted in column names, as in select/2. Defaults to all columns except the ones to pivot.
:discard - Columns that are not in the list of pivot and should be dropped from the dataframe. Ranges, regexes, and functions are also accepted in column names, as in select/2. This list of columns is going to be subtracted from the list of select. Defaults to an empty list.
:names_to - A string specifying the name of the column to create from the data stored in the column names of the dataframe. Defaults to "variable".
:values_to - A string specifying the name of the column to create from the data stored in series element values. Defaults to "value".

Examples

iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.pivot_longer(df, &String.ends_with?(&1, "fuel"))
#Explorer.DataFrame<
  Polars[3282 x 9]
  year s64 [2010, 2010, 2010, 2010, 2010, ...]
  country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
  total s64 [2308, 1254, 32500, 141, 7924, ...]
  cement s64 [5, 177, 2598, 0, 204, ...]
  gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
  per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
  bunker_fuels s64 [9, 7, 663, 0, 321, ...]
  variable string ["solid_fuel", "solid_fuel", "solid_fuel", "solid_fuel", "solid_fuel", ...]
  value s64 [627, 117, 332, 0, 0, ...]
>

iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.pivot_longer(df, &String.ends_with?(&1, "fuel"), select: ["year", "country"])
#Explorer.DataFrame<
  Polars[3282 x 4]
  year s64 [2010, 2010, 2010, 2010, 2010, ...]
  country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
  variable string ["solid_fuel", "solid_fuel", "solid_fuel", "solid_fuel", "solid_fuel", ...]
  value s64 [627, 117, 332, 0, 0, ...]
>

iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.pivot_longer(df, ["total"], select: ["year", "country"], discard: ["country"])
#Explorer.DataFrame<
  Polars[1094 x 3]
  year s64 [2010, 2010, 2010, 2010, 2010, ...]
  variable string ["total", "total", "total", "total", "total", ...]
  value s64 [2308, 1254, 32500, 141, 7924, ...]
>

iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.pivot_longer(df, ["total"], select: [], names_to: "my_var", values_to: "my_value")
#Explorer.DataFrame<
  Polars[1094 x 2]
  my_var string ["total", "total", "total", "total", "total", ...]
  my_value s64 [2308, 1254, 32500, 141, 7924, ...]
>

Grouped examples

In the following example we want to take the Iris dataset and increase the number of rows by pivoting the "sepal_length" column. This dataset is grouped by "species", so the resultant dataframe is going to keep the "species" group:

iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.pivot_longer(grouped, ["sepal_length"])
#Explorer.DataFrame<
  Polars[150 x 6]
  Groups: ["species"]
  sepal_width f64 [3.5, 3.0, 3.2, 3.1, 3.6, ...]
  petal_length f64 [1.4, 1.4, 1.3, 1.5, 1.4, ...]
  petal_width f64 [0.2, 0.2, 0.2, 0.2, 0.2, ...]
  species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", ...]
  variable string ["sepal_length", "sepal_length", "sepal_length", "sepal_length", "sepal_length", ...]
  value f64 [5.1, 4.9, 4.7, 4.6, 5.0, ...]
>

Now we want to do something different: we want to pivot the "species" column that is also a group. This is going to remove the group in the resultant dataframe:

iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.pivot_longer(grouped, ["species"])
#Explorer.DataFrame<
  Polars[150 x 6]
  sepal_length f64 [5.1, 4.9, 4.7, 4.6, 5.0, ...]
  sepal_width f64 [3.5, 3.0, 3.2, 3.1, 3.6, ...]
  petal_length f64 [1.4, 1.4, 1.3, 1.5, 1.4, ...]
  petal_width f64 [0.2, 0.2, 0.2, 0.2, 0.2, ...]
  variable string ["species", "species", "species", "species", "species", ...]
  value string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", ...]
>

pivot_wider(df, names_from, values_from, opts \\ [])

@spec pivot_wider(
  df :: t(),
  names_from :: column(),
  values_from :: column() | columns(),
  opts :: Keyword.t()
) :: t()

Pivot data from long to wide.

pivot_wider/4 "widens" data, increasing the number of columns and decreasing the number of rows. The inverse transformation is pivot_longer/3.

In case the dataframe is using groups, the groups that are also in the list of columns to pivot will be removed from the resultant dataframe. See the examples below.

Options

:id_columns - A set of columns that uniquely identifies each observation.
Defaults to all columns in data except for the columns specified in names_from and values_from, and columns that are of the {:f, 64} dtype.
Typically used when you have redundant variables, i.e. variables whose values are perfectly correlated with existing variables. May accept a filter callback, a list or a range of column names. If an empty list is passed, or a range that results in a empty list of column names, it raises an error.
ID columns cannot be of the float type and any columns of this dtype is discarded. If you need to use float columns as IDs, you must carefully consider rounding or truncating the column and converting it to integer, as long as doing so preserves the properties of the column.
:names_prefix - String added to the start of every variable name. This is particularly useful if names_from is a numeric vector and you want to create syntactic variable names.

Examples

Suppose we have a basketball court and multiple teams that want to train in that court. They need to share a schedule with the hours each team is going to use it. Here is a dataframe representing that schedule:

iex> Explorer.DataFrame.new(
iex>   weekday: ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday"],
iex>   team: ["A", "B", "C", "A", "B", "C", "A", "B", "C", "A"],
iex>   hour: [10, 9, 10, 10, 11, 15, 14, 16, 14, 16]
iex> )

This dataframe is going to look like this - using print/2:

  +----------------------------------------------+
  |  Explorer DataFrame: [rows: 10, columns: 3]  |
  +---------------+--------------+---------------+
  |    weekday    |     team     |     hour      |
  |   <string>    |   <string>   |   <integer>   |
  +===============+==============+===============+
  | Monday        | A            | 10            |
  +---------------+--------------+---------------+
  | Tuesday       | B            | 9             |
  +---------------+--------------+---------------+
  | Wednesday     | C            | 10            |
  +---------------+--------------+---------------+
  | Thursday      | A            | 10            |
  +---------------+--------------+---------------+
  | Friday        | B            | 11            |
  +---------------+--------------+---------------+
  | Monday        | C            | 15            |
  +---------------+--------------+---------------+
  | Tuesday       | A            | 14            |
  +---------------+--------------+---------------+
  | Wednesday     | B            | 16            |
  +---------------+--------------+---------------+
  | Thursday      | C            | 14            |
  +---------------+--------------+---------------+
  | Friday        | A            | 16            |
  +---------------+--------------+---------------+

You can see that the "weekday" repeats, and it's not clear how free the agenda is. We can solve that by pivoting the "weekday" column in multiple columns, making each weekday a new column in the resultant dataframe.

iex> df = Explorer.DataFrame.new(
iex>   weekday: ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday"],
iex>   team: ["A", "B", "C", "A", "B", "C", "A", "B", "C", "A"],
iex>   hour: [10, 9, 10, 10, 11, 15, 14, 16, 14, 16]
iex> )
iex> Explorer.DataFrame.pivot_wider(df, "weekday", "hour")
#Explorer.DataFrame<
  Polars[3 x 6]
  team string ["A", "B", "C"]
  Monday s64 [10, nil, 15]
  Tuesday s64 [14, 9, nil]
  Wednesday s64 [nil, 16, 10]
  Thursday s64 [10, nil, 14]
  Friday s64 [16, 11, nil]
>

Now if we print that same dataframe with print/2, we get a better picture of the schedule:

  +----------------------------------------------------------------------+
  |              Explorer DataFrame: [rows: 3, columns: 6]               |
  +----------+-----------+-----------+-----------+-----------+-----------+
  |   team   |  Monday   |  Tuesday  | Wednesday | Thursday  |  Friday   |
  | <string> | <integer> | <integer> | <integer> | <integer> | <integer> |
  +==========+===========+===========+===========+===========+===========+
  | A        | 10        | 14        |           | 10        | 16        |
  +----------+-----------+-----------+-----------+-----------+-----------+
  | B        |           | 9         | 16        |           | 11        |
  +----------+-----------+-----------+-----------+-----------+-----------+
  | C        | 15        |           | 10        | 14        |           |
  +----------+-----------+-----------+-----------+-----------+-----------+

Pivot wider can create unpredictable column names, and sometimes they can conflict with ID columns. In that scenario, we add a number as suffix to duplicated column names. Here is an example:

iex> df = Explorer.DataFrame.new(
iex>   product_id: [1, 1, 1, 1, 2, 2, 2, 2],
iex>   property: ["product_id", "width_cm", "height_cm", "length_cm", "product_id", "width_cm", "height_cm", "length_cm"],
iex>   property_value: [1, 42, 40, 64, 2, 35, 20, 40]
iex> )
iex> Explorer.DataFrame.pivot_wider(df, "property", "property_value")
#Explorer.DataFrame<
  Polars[2 x 5]
  product_id s64 [1, 2]
  product_id_1 s64 [1, 2]
  width_cm s64 [42, 35]
  height_cm s64 [40, 20]
  length_cm s64 [64, 40]
>

But if the option :names_prefix is used, that suffix is not added:

iex> df = Explorer.DataFrame.new(
iex>   product_id: [1, 1, 1, 1, 2, 2, 2, 2],
iex>   property: ["product_id", "width_cm", "height_cm", "length_cm", "product_id", "width_cm", "height_cm", "length_cm"],
iex>   property_value: [1, 42, 40, 64, 2, 35, 20, 40]
iex> )
iex> Explorer.DataFrame.pivot_wider(df, "property", "property_value", names_prefix: "col_")
#Explorer.DataFrame<
  Polars[2 x 5]
  product_id s64 [1, 2]
  col_product_id s64 [1, 2]
  col_width_cm s64 [42, 35]
  col_height_cm s64 [40, 20]
  col_length_cm s64 [64, 40]
>

Multiple columns are accepted for the values_from parameter, but the behaviour is slightly different for the naming of new columns in the resultant dataframe. The new columns are going to be prefixed by the name of the original value column, followed by the name of the variable.

iex> df = Explorer.DataFrame.new(
iex>   product_id: [1, 1, 1, 1, 2, 2, 2, 2],
iex>   property: ["product_id", "width_cm", "height_cm", "length_cm", "product_id", "width_cm", "height_cm", "length_cm"],
iex>   property_value: [1, 42, 40, 64, 2, 35, 20, 40],
iex>   another_value: [1, 43, 41, 65, 2, 36, 21, 42]
iex> )
iex> Explorer.DataFrame.pivot_wider(df, "property", ["property_value", "another_value"])
#Explorer.DataFrame<
  Polars[2 x 9]
  product_id s64 [1, 2]
  property_value_product_id s64 [1, 2]
  property_value_width_cm s64 [42, 35]
  property_value_height_cm s64 [40, 20]
  property_value_length_cm s64 [64, 40]
  another_value_product_id s64 [1, 2]
  another_value_width_cm s64 [43, 36]
  another_value_height_cm s64 [41, 21]
  another_value_length_cm s64 [65, 42]
>

Grouped examples

Now using the same idea, we can see that there is not much difference for grouped dataframes. The only detail is that groups that are not ID columns are discarded.

iex> df = Explorer.DataFrame.new(
iex>   weekday: ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday"],
iex>   team: ["A", "B", "C", "A", "B", "C", "A", "B", "C", "A"],
iex>   hour: [10, 9, 10, 10, 11, 15, 14, 16, 14, 16]
iex> )
iex> grouped = Explorer.DataFrame.group_by(df, "team")
iex> Explorer.DataFrame.pivot_wider(grouped, "weekday", "hour")
#Explorer.DataFrame<
  Polars[3 x 6]
  Groups: ["team"]
  team string ["A", "B", "C"]
  Monday s64 [10, nil, 15]
  Tuesday s64 [14, 9, nil]
  Wednesday s64 [nil, 16, 10]
  Thursday s64 [10, nil, 14]
  Friday s64 [16, 11, nil]
>

In the following example the group "weekday" is going to be removed, because the column is going to be pivoted in multiple columns:

iex> df = Explorer.DataFrame.new(
iex>   weekday: ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday"],
iex>   team: ["A", "B", "C", "A", "B", "C", "A", "B", "C", "A"],
iex>   hour: [10, 9, 10, 10, 11, 15, 14, 16, 14, 16]
iex> )
iex> grouped = Explorer.DataFrame.group_by(df, "weekday")
iex> Explorer.DataFrame.pivot_wider(grouped, "weekday", "hour")
#Explorer.DataFrame<
  Polars[3 x 6]
  team string ["A", "B", "C"]
  Monday s64 [10, nil, 15]
  Tuesday s64 [14, 9, nil]
  Wednesday s64 [nil, 16, 10]
  Thursday s64 [10, nil, 14]
  Friday s64 [16, 11, nil]
>

pull(df, column)

@spec pull(df :: t(), column :: column()) :: Explorer.Series.t()

Extracts a single column as a series.

This is equivalent to df[field] for retrieving a single field. The returned series will have its :name field set to the column name.

Examples

iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.pull(df, "total")
#Explorer.Series<
  Polars[1094]
  s64 [2308, 1254, 32500, 141, 7924, 41, 143, 51246, 1150, 684, 106589, 18408, 8366, 451, 7981, 16345, 403, 17192, 30222, 147, 1388, 166, 133, 5802, 1278, 114468, 47, 2237, 12030, 535, 58, 1367, 145806, 152, 152, 72, 141, 19703, 2393248, 20773, 44, 540, 19, 2064, 1900, 5501, 10465, 2102, 30428, 18122, ...]
>

iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.pull(df, 2)
#Explorer.Series<
  Polars[1094]
  s64 [2308, 1254, 32500, 141, 7924, 41, 143, 51246, 1150, 684, 106589, 18408, 8366, 451, 7981, 16345, 403, 17192, 30222, 147, 1388, 166, 133, 5802, 1278, 114468, 47, 2237, 12030, 535, 58, 1367, 145806, 152, 152, 72, 141, 19703, 2393248, 20773, 44, 540, 19, 2064, 1900, 5501, 10465, 2102, 30428, 18122, ...]
>

put(df, column_name, series_or_tensor_or_list, opts \\ [])

Creates or modifies a single column.

This is a simplified way to add or modify one column, accepting a series, tensor, or a list.

If you are computing a series, it is preferrable to use mutate/2 or mutate_with/2 to compute the series and modify it in a single step, as it is more powerful and it handles both expressions and scalar values accordingly.

If you are passing tensors or lists, they will be automatically converted to a series. By default, the new series will have the same dtype as the existing series, unless the :dtype option is given. If there is no existing series, one is inferred from the tensor/list.

Examples

iex> df = Explorer.DataFrame.new(a: [1, 2, 3])
iex> Explorer.DataFrame.put(df, :b, Explorer.Series.transform(df[:a], fn n -> n * 2 end))
#Explorer.DataFrame<
  Polars[3 x 2]
  a s64 [1, 2, 3]
  b s64 [2, 4, 6]
>

iex> df = Explorer.DataFrame.new(a: [1, 2, 3])
iex> Explorer.DataFrame.put(df, :b, Explorer.Series.from_list([4, 5, 6]))
#Explorer.DataFrame<
  Polars[3 x 2]
  a s64 [1, 2, 3]
  b s64 [4, 5, 6]
>

Grouped examples

If the dataframe is grouped, put/3 is going to ignore the groups. So the series must be of the same size of the entire dataframe.

iex> df = Explorer.DataFrame.new(a: [1, 2, 3])
iex> grouped = Explorer.DataFrame.group_by(df, "a")
iex> series = Explorer.Series.from_list([9, 8, 7])
iex> Explorer.DataFrame.put(grouped, :b, series)
#Explorer.DataFrame<
  Polars[3 x 2]
  Groups: ["a"]
  a s64 [1, 2, 3]
  b s64 [9, 8, 7]
>

Tensor examples

You can also put tensors into the dataframe:

iex> df = Explorer.DataFrame.new([])
iex> Explorer.DataFrame.put(df, :a, Nx.tensor([1, 2, 3]))
#Explorer.DataFrame<
  Polars[3 x 1]
  a s64 [1, 2, 3]
>

You can specify which dtype the tensor represents. For example, a tensor of s64 represents integers by default, but it may also represent timestamps in microseconds from the Unix epoch:

iex> df = Explorer.DataFrame.new([])
iex> Explorer.DataFrame.put(df, :a, Nx.tensor([1, 2, 3]), dtype: {:naive_datetime, :microsecond})
#Explorer.DataFrame<
  Polars[3 x 1]
  a naive_datetime[μs] [1970-01-01 00:00:00.000001, 1970-01-01 00:00:00.000002, 1970-01-01 00:00:00.000003]
>

If there is already a column where we want to place the tensor, the column dtype will be automatically used, this means that updating dataframes in place while preserving their types is straight-forward:

iex> df = Explorer.DataFrame.new(a: [~N[1970-01-01 00:00:00]])
iex> Explorer.DataFrame.put(df, :a, Nx.tensor(529550625987654))
#Explorer.DataFrame<
  Polars[1 x 1]
  a naive_datetime[μs] [1986-10-13 01:23:45.987654]
>

This is particularly useful for categorical columns:

iex> cat = Explorer.Series.from_list(["foo", "bar", "baz"], dtype: :category)
iex> df = Explorer.DataFrame.new(a: cat)
iex> Explorer.DataFrame.put(df, :a, Nx.tensor([2, 1, 0]))
#Explorer.DataFrame<
  Polars[3 x 1]
  a category ["baz", "bar", "foo"]
>

On the other hand, if you try to put a floating tensor on an integer column, an error will be raised unless a dtype or dtype: :infer is given:

iex> df = Explorer.DataFrame.new(a: [1, 2, 3])
iex> Explorer.DataFrame.put(df, :a, Nx.tensor(1.0, type: :f64))
** (ArgumentError) dtype {:s, 64} expects a tensor of type {:s, 64} but got type {:f, 64}

iex> df = Explorer.DataFrame.new(a: [1, 2, 3])
iex> Explorer.DataFrame.put(df, :a, Nx.tensor(1.0, type: :f64), dtype: {:f, 64})
#Explorer.DataFrame<
  Polars[3 x 1]
  a f64 [1.0, 1.0, 1.0]
>

iex> df = Explorer.DataFrame.new(a: [1, 2, 3])
iex> Explorer.DataFrame.put(df, :a, Nx.tensor(1.0, type: :f64), dtype: :infer)
#Explorer.DataFrame<
  Polars[3 x 1]
  a f64 [1.0, 1.0, 1.0]
>

List examples

Similar to tensors, we can also put lists in the dataframe:

iex> df = Explorer.DataFrame.new([])
iex> Explorer.DataFrame.put(df, :a, [1, 2, 3])
#Explorer.DataFrame<
  Polars[3 x 1]
  a s64 [1, 2, 3]
>

The same considerations as above apply.

relocate(df, columns_or_column, opts)

@spec relocate(
  df :: t(),
  columns :: [column()] | column(),
  opts :: Keyword.t()
) :: t()

Relocates columns.

Change column order within a DataFrame. The before and after options are mutually exclusive. Providing no options will relocate the columns to beginning of the DataFrame.

Options

:before - Specifies to relocate before the given column. It can be an index or a column name.
:after - Specifies to relocate after the given column. It can be an index or a column name.

Examples

Relocate a single column

iex> df = Explorer.DataFrame.new(a: ["a", "b", "a"], b: [1, 3, 1], c: [nil, 5, 6])
iex> Explorer.DataFrame.relocate(df, "a", after: "c")
#Explorer.DataFrame<
  Polars[3 x 3]
  b s64 [1, 3, 1]
  c s64 [nil, 5, 6]
  a string ["a", "b", "a"]
>

Relocate (and reorder) multiple columns to the beginning

iex> df = Explorer.DataFrame.new(a: [1, 2], b: [5.1, 5.2], c: [4, 5], d: ["yes", "no"])
iex> Explorer.DataFrame.relocate(df, ["d", 1], before: 0)
#Explorer.DataFrame<
  Polars[2 x 4]
  d string ["yes", "no"]
  b f64 [5.1, 5.2]
  a s64 [1, 2]
  c s64 [4, 5]
>

Relocate before another column

iex> df = Explorer.DataFrame.new(a: [1, 2], b: [5.1, 5.2], c: [4, 5], d: ["yes", "no"])
iex> Explorer.DataFrame.relocate(df, ["a", "c"], before: "b")
#Explorer.DataFrame<
  Polars[2 x 4]
  a s64 [1, 2]
  c s64 [4, 5]
  b f64 [5.1, 5.2]
  d string ["yes", "no"]
>

rename(df, names)

@spec rename(
  df :: t(),
  names :: column_names() | column_pairs(column_name())
) :: t()

Renames columns.

Renaming a column that is also a group is going to rename the group as well. To apply a function to a subset of columns, see rename_with/3.

Examples

You can pass in a list of new names:

iex> df = Explorer.DataFrame.new(a: ["a", "b", "a"], b: [1, 3, 1])
iex> Explorer.DataFrame.rename(df, ["c", "d"])
#Explorer.DataFrame<
  Polars[3 x 2]
  c string ["a", "b", "a"]
  d s64 [1, 3, 1]
>

Or you can rename individual columns using keyword args:

iex> df = Explorer.DataFrame.new(a: ["a", "b", "a"], b: [1, 3, 1])
iex> Explorer.DataFrame.rename(df, a: "first")
#Explorer.DataFrame<
  Polars[3 x 2]
  first string ["a", "b", "a"]
  b s64 [1, 3, 1]
>

Or you can rename individual columns using a map:

iex> df = Explorer.DataFrame.new(a: ["a", "b", "a"], b: [1, 3, 1])
iex> Explorer.DataFrame.rename(df, %{"a" => "first"})
#Explorer.DataFrame<
  Polars[3 x 2]
  first string ["a", "b", "a"]
  b s64 [1, 3, 1]
>

rename_with(df, columns \\ .., callback)

@spec rename_with(df :: t(), columns :: columns(), callback :: function()) :: t()

Renames columns with a function.

Renaming a column that is also a group is going to rename the group as well.

Examples

If no columns are specified, it will apply the function to all column names:

iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.rename_with(df, &String.upcase/1)
#Explorer.DataFrame<
  Polars[1094 x 10]
  YEAR s64 [2010, 2010, 2010, 2010, 2010, ...]
  COUNTRY string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
  TOTAL s64 [2308, 1254, 32500, 141, 7924, ...]
  SOLID_FUEL s64 [627, 117, 332, 0, 0, ...]
  LIQUID_FUEL s64 [1601, 953, 12381, 141, 3649, ...]
  GAS_FUEL s64 [74, 7, 14565, 0, 374, ...]
  CEMENT s64 [5, 177, 2598, 0, 204, ...]
  GAS_FLARING s64 [0, 0, 2623, 0, 3697, ...]
  PER_CAPITA f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
  BUNKER_FUELS s64 [9, 7, 663, 0, 321, ...]
>

A callback can be used to filter the column names that will be renamed, similarly to select/2:

iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.rename_with(df, &String.ends_with?(&1, "_fuel"), &String.trim_trailing(&1, "_fuel"))
#Explorer.DataFrame<
  Polars[1094 x 10]
  year s64 [2010, 2010, 2010, 2010, 2010, ...]
  country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
  total s64 [2308, 1254, 32500, 141, 7924, ...]
  solid s64 [627, 117, 332, 0, 0, ...]
  liquid s64 [1601, 953, 12381, 141, 3649, ...]
  gas s64 [74, 7, 14565, 0, 374, ...]
  cement s64 [5, 177, 2598, 0, 204, ...]
  gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
  per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
  bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>

Or you can just pass in the list of column names you'd like to apply the function to:

iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.rename_with(df, ["total", "cement"], &String.upcase/1)
#Explorer.DataFrame<
  Polars[1094 x 10]
  year s64 [2010, 2010, 2010, 2010, 2010, ...]
  country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
  TOTAL s64 [2308, 1254, 32500, 141, 7924, ...]
  solid_fuel s64 [627, 117, 332, 0, 0, ...]
  liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
  gas_fuel s64 [74, 7, 14565, 0, 374, ...]
  CEMENT s64 [5, 177, 2598, 0, 204, ...]
  gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
  per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
  bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>

Ranges, regexes, and functions are also accepted in column names, as in select/2.

select(df, columns_or_column)

@spec select(df :: t(), column() | columns()) :: t()

Selects a subset of columns by name.

It's important to notice that groups are kept: you can't select off grouping columns.

Examples

You can select a single column:

iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.select(df, "a")
#Explorer.DataFrame<
  Polars[3 x 1]
  a string ["a", "b", "c"]
>

Or a list of names:

iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.select(df, ["a"])
#Explorer.DataFrame<
  Polars[3 x 1]
  a string ["a", "b", "c"]
>

You can also use a range or a list of integers:

iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3], c: [4, 5, 6])
iex> Explorer.DataFrame.select(df, [0, 1])
#Explorer.DataFrame<
  Polars[3 x 2]
  a string ["a", "b", "c"]
  b s64 [1, 2, 3]
>

iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3], c: [4, 5, 6])
iex> Explorer.DataFrame.select(df, 0..1)
#Explorer.DataFrame<
  Polars[3 x 2]
  a string ["a", "b", "c"]
  b s64 [1, 2, 3]
>

Or you can use a callback function that takes the dataframe's names as its first argument:

iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.select(df, &String.starts_with?(&1, "b"))
#Explorer.DataFrame<
  Polars[3 x 1]
  b s64 [1, 2, 3]
>

Or, if you prefer, a regex:

iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.select(df, ~r/^b$/)
#Explorer.DataFrame<
  Polars[3 x 1]
  b s64 [1, 2, 3]
>

Or a callback function that takes names and types:

iex> df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
iex> Explorer.DataFrame.select(df, fn _name, type -> type == {:s, 64} end)
#Explorer.DataFrame<
  Polars[3 x 1]
  b s64 [1, 2, 3]
>

Grouped examples

Columns that are also groups cannot be removed, you need to ungroup before removing these columns.

iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.select(grouped, ["sepal_width"])
#Explorer.DataFrame<
  Polars[150 x 2]
  Groups: ["species"]
  sepal_width f64 [3.5, 3.0, 3.2, 3.1, 3.6, ...]
  species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", ...]
>

sort_by(df, query, opts \\ [])

(macro)

Sorts rows by columns using Explorer.Query.

Notice
This is a macro. You must require Explorer.DataFrame before using it.

See sort_with/2 for a callback version of this function without Explorer.Query.

Options

:nils - :first or :last. By default it is :last if direction is :asc, and :first otherwise.
:parallel - boolean. Whether to parallelize the sorting. By default it is true.
Parallel sort isn't available on certain lazy operations. In those situations this option is ignored.
:stable - boolean. Determines if the sorting is stable (ties are guaranteed to maintain their order) or not. Unstable sorting may be more performant. By default it is false.

Examples

A single column name will sort ascending by that column:

iex> df = Explorer.DataFrame.new(a: ["b", "c", "a"], b: [1, 2, 3])
iex> Explorer.DataFrame.sort_by(df, a)
#Explorer.DataFrame<
  Polars[3 x 2]
  a string ["a", "b", "c"]
  b s64 [3, 1, 2]
>

You can also sort descending:

iex> df = Explorer.DataFrame.new(a: ["b", "c", "a"], b: [1, 2, 3])
iex> Explorer.DataFrame.sort_by(df, desc: a)
#Explorer.DataFrame<
  Polars[3 x 2]
  a string ["c", "b", "a"]
  b s64 [2, 1, 3]
>

You can specify how nils are sorted:

iex> df = Explorer.DataFrame.new(a: ["b", "c", nil, "a"])
iex> Explorer.DataFrame.sort_by(df, [desc: a], nils: :first)
#Explorer.DataFrame<
  Polars[4 x 1]
  a string [nil, "c", "b", "a"]
>

Sorting by more than one column sorts them in the order they are entered:

iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.sort_by(df, asc: total, desc: country)
#Explorer.DataFrame<
  Polars[1094 x 10]
  year s64 [2010, 2010, 2011, 2011, 2012, ...]
  country string ["NIUE", "TUVALU", "TUVALU", "NIUE", "NIUE", ...]
  total s64 [1, 2, 2, 2, 2, ...]
  solid_fuel s64 [0, 0, 0, 0, 0, ...]
  liquid_fuel s64 [1, 2, 2, 2, 2, ...]
  gas_fuel s64 [0, 0, 0, 0, 0, ...]
  cement s64 [0, 0, 0, 0, 0, ...]
  gas_flaring s64 [0, 0, 0, 0, 0, ...]
  per_capita f64 [0.52, 0.0, 0.0, 1.04, 1.04, ...]
  bunker_fuels s64 [0, 0, 0, 0, 0, ...]
>

Grouped examples

When used in a grouped dataframe, sort_by is going to sort each group individually and then return the entire dataframe with the existing groups. If one of the sort_by columns is also a group, the sorting for that column is not going to work. It is necessary to first summarise the desired column and then sort_by it.

Here is an example using the Iris dataset. We group by species and then we try to sort the dataframe by species and petal length, but only "petal length" is taken into account because "species" is a group.

iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.sort_by(grouped, desc: species, asc: sepal_width)
#Explorer.DataFrame<
  Polars[150 x 5]
  Groups: ["species"]
  sepal_length f64 [4.5, 4.4, 4.9, 4.8, 4.3, ...]
  sepal_width f64 [2.3, 2.9, 3.0, 3.0, 3.0, ...]
  petal_length f64 [1.3, 1.4, 1.4, 1.4, 1.1, ...]
  petal_width f64 [0.3, 0.2, 0.2, 0.1, 0.1, ...]
  species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", ...]
>

sort_with(df, fun, opts \\ [])

@spec sort_with(
  df :: t(),
  callback_or_result ::
    (Explorer.Backend.QueryFrame.t() -> sort_callback_result())
    | sort_callback_result(),
  opts :: [nils: :first | :last, stable: boolean()]
) :: t()

Sorts rows by columns using a callback function.

The callback receives a lazy dataframe which stores operations instead of values for efficient sorting.

This is a callback version of sort_by/2.

Options

:nils - :first or :last. By default it is :last if direction is :asc, and :first otherwise.
:parallel - boolean. Whether to parallelize the sorting. By default it is true.
Parallel sort isn't available on certain lazy operations. In those situations this option is ignored.
:stable - boolean. Determines if the sorting is stable (ties are guaranteed to maintain their order) or not. Unstable sorting may be more performant. By default it is false.

Examples

A single column name will sort ascending by that column:

iex> df = Explorer.DataFrame.new(a: ["b", "c", "a"], b: [1, 2, 3])
iex> Explorer.DataFrame.sort_with(df, &(&1["a"]))
#Explorer.DataFrame<
  Polars[3 x 2]
  a string ["a", "b", "c"]
  b s64 [3, 1, 2]
>

You can also sort descending:

iex> df = Explorer.DataFrame.new(a: ["b", "c", "a"], b: [1, 2, 3])
iex> Explorer.DataFrame.sort_with(df, &[desc: &1["a"]])
#Explorer.DataFrame<
  Polars[3 x 2]
  a string ["c", "b", "a"]
  b s64 [2, 1, 3]
>

You can specify how nils are sorted:

iex> df = Explorer.DataFrame.new(a: ["b", "c", nil, "a"])
iex> Explorer.DataFrame.sort_with(df, &[desc: &1["a"]], nils: :first)
#Explorer.DataFrame<
  Polars[4 x 1]
  a string [nil, "c", "b", "a"]
>

Sorting by more than one column sorts them in the order they are entered:

iex> df = Explorer.DataFrame.new(a: [3, 1, 3], b: [2, 1, 3])
iex> Explorer.DataFrame.sort_with(df, &[desc: &1["a"], asc: &1["b"]])
#Explorer.DataFrame<
  Polars[3 x 2]
  a s64 [3, 3, 1]
  b s64 [2, 3, 1]
>

Grouped examples

iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.sort_with(grouped, &[desc: &1["species"], asc: &1["sepal_width"]])
#Explorer.DataFrame<
  Polars[150 x 5]
  Groups: ["species"]
  sepal_length f64 [4.5, 4.4, 4.9, 4.8, 4.3, ...]
  sepal_width f64 [2.3, 2.9, 3.0, 3.0, 3.0, ...]
  petal_length f64 [1.3, 1.4, 1.4, 1.4, 1.1, ...]
  petal_width f64 [0.3, 0.2, 0.2, 0.1, 0.1, ...]
  species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", ...]
>

sql(df, sql_string, opts \\ [])

@spec sql(df :: t(), sql_string :: binary(), opts :: Keyword.t()) :: df :: t()

Create a dataframe from the result of a SQL query on an existing dataframe.

SQL Query is Unvalidated

Explorer does not validate the SQL query string provided. Instead it passes it directly to the backend. As such, the SQL dialect will be backend dependent and any errors will come directly from the backend.

`from` Clause

The from clause of the SQL query should reference a chosen table name. The default name is "df". See the examples for a custom table name.

Examples

Basic example:

iex> df = Explorer.DataFrame.new(a: [1, 2, 3], b: ["x", "y", "y"])
iex> Explorer.DataFrame.sql(df, "select ARRAY_AGG(a), b from df group by b order by b")
#Explorer.DataFrame<
  Polars[2 x 2]
  a list[s64] [[1], [2, 3]]
  b string ["x", "y"]
>

Custom table name:

iex> df = Explorer.DataFrame.new(a: [1, 2, 3])
iex> Explorer.DataFrame.sql(df, "select a + 1 from my_table", table_name: "my_table")
#Explorer.DataFrame<
  Polars[3 x 1]
  a s64 [2, 3, 4]
>

summarise(df, query)

(macro)

Summarise each group to a single row using Explorer.Query.

To summarise, you must either perform an aggregation (defined in Explorer.Series) on the desired columns, or select plain columns to retrieve all elements of a group.

The query is compiled and runs efficiently against the dataframe. This function performs aggregations based on groups and it implicitly ungroups the resultant dataframe.

Notice
This is a macro. You must require Explorer.DataFrame before using it.

See summarise_with/2 for a callback version of this function without Explorer.Query.

Examples

iex> df = Explorer.DataFrame.new(letters: ~w(a b c d e), is_vowel: [true, false, false, false, true])
iex> grouped_df = Explorer.DataFrame.group_by(df, :is_vowel)
iex> Explorer.DataFrame.summarise(grouped_df, letters: letters)
#Explorer.DataFrame<
  Polars[2 x 2]
  is_vowel boolean [true, false]
  letters list[string] [["a", "e"], ["b", "c", "d"]]
>

iex> df = Explorer.Datasets.fossil_fuels()
iex> grouped_df = Explorer.DataFrame.group_by(df, "year")
iex> Explorer.DataFrame.summarise(grouped_df, total_max: max(total), total_min: min(total))
#Explorer.DataFrame<
  Polars[5 x 3]
  year s64 [2010, 2011, 2012, 2013, 2014]
  total_max s64 [2393248, 2654360, 2734817, 2797384, 2806634]
  total_min s64 [1, 2, 2, 2, 3]
>

Suppose you want to get the mean sepal width of each Iris species. You could do something like this:

iex> df = Explorer.Datasets.iris()
iex> grouped_df = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.summarise(grouped_df, mean_sepal_width: round(mean(sepal_width), 3))
#Explorer.DataFrame<
  Polars[3 x 2]
  species string ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
  mean_sepal_width f64 [3.418, 2.77, 2.974]
>

In case aggregations for all the dataframe is what you want, you can use ungrouped dataframes:

iex> df = Explorer.Datasets.iris()
iex> Explorer.DataFrame.summarise(df, mean_petal_length: round(mean(petal_length), 2))
#Explorer.DataFrame<
  Polars[1 x 1]
  mean_petal_length f64 [3.76]
>

summarise_with(df, fun)

@spec summarise_with(
  df :: t(),
  callback_or_column_pairs ::
    (Explorer.Backend.QueryFrame.t() -> column_pairs(Explorer.Series.lazy_t()))
    | column_pairs(Explorer.Series.lazy_t())
) :: t()

Summarise each group to a single row using a callback function.

In case no group is set, the entire dataframe will be considered. The callback receives a lazy dataframe. A lazy dataframe does not hold any values, instead it stores all operations in order to execute all summarizations performantly.

This is a callback version of summarise/2.

Examples

iex> alias Explorer.{DataFrame, Series}
iex> df = Explorer.Datasets.fossil_fuels() |> DataFrame.group_by("year")
iex> DataFrame.summarise_with(df, &[total_max: Series.max(&1["total"]), countries: Series.n_distinct(&1["country"])])
#Explorer.DataFrame<
  Polars[5 x 3]
  year s64 [2010, 2011, 2012, 2013, 2014]
  total_max s64 [2393248, 2654360, 2734817, 2797384, 2806634]
  countries u32 [217, 217, 220, 220, 220]
>

iex> alias Explorer.{DataFrame, Series}
iex> df = Explorer.Datasets.fossil_fuels()
iex> DataFrame.summarise_with(df, &[total_max: Series.max(&1["total"]), countries: Series.n_distinct(&1["country"])])
#Explorer.DataFrame<
  Polars[1 x 2]
  total_max s64 [2806634]
  countries u32 [222]
>

transform(df, opts \\ [], row_transform)

@spec transform(t(), Keyword.t(), (... -> any())) :: t()

Make new columns by applying a native Elixir function to each row.

The native Elixir function should take a map (an element from to_rows_stream/2) as input and return a map where the keys are the desired new columns and the values are the values for that row.

See also the version of this function for a single series:

Explorer.Series.transform/2

Warning

This is an expensive operation since data is stored in a columnar format. You should avoid converting a dataframe to rows where possible as it will incur significant overhead. Prefer instead to use the operations in this module rather than native Elixir ones. This function should only be used as a fallback when no equivalent DataFrame operation is available.

Options

:names - Column names to select for serializing. By default all columns will be returned.

Also, all options which are valid for to_rows_stream are valid here.

The `:names` field is recommended

It's recommended that you provide the :names option and to set it to only those column names needed by your function. Otherwise, you'll be serializing unnecessary information.

Examples

Pre-selecting with :names:

iex> alias Explorer.DataFrame, as: DF
iex> df = DF.new(
...>   numbers: [1, 2],
...>   datetime_local: [~N[2024-01-01 00:00:00], ~N[2024-01-01 00:00:00]],
...>   timezone: ["Etc/UTC", "America/New_York"]
...> )
iex> DF.transform(df, [names: ["datetime_local", "timezone"]], fn row ->
...>   datetime_utc =
...>     row["datetime_local"]
...>     |> DateTime.from_naive!(row["timezone"])
...>     |> DateTime.shift_zone!("Etc/UTC")
...>
...>   %{datetime_utc: datetime_utc}
...> end)
#Explorer.DataFrame<
  Polars[2 x 4]
  numbers s64 [1, 2]
  datetime_local naive_datetime[μs] [2024-01-01 00:00:00.000000, 2024-01-01 00:00:00.000000]
  timezone string ["Etc/UTC", "America/New_York"]
  datetime_utc datetime[μs, Etc/UTC] [2024-01-01 00:00:00.000000Z, 2024-01-01 05:00:00.000000Z]
>

Converting to atoms with :atom_keys:

iex> alias Explorer.DataFrame, as: DF
iex> df = DF.new(
...>   major: [1, 2, 3],
...>   minor: [11, 12, 13],
...>   patch: [0, 0, 0]
...> )
iex> DF.transform(df, [atom_keys: true], fn row ->
...>   version = Version.parse!("#{row.major}.#{row.minor}.#{row.patch}")
...>   %{meets_requirements: Version.match?(version, "> 2.0.0")}
...> end)
#Explorer.DataFrame<
  Polars[3 x 4]
  major s64 [1, 2, 3]
  minor s64 [11, 12, 13]
  patch s64 [0, 0, 0]
  meets_requirements boolean [false, true, true]
>

transpose(df, opts \\ [])

@spec transpose(df :: t(), opts :: Keyword.t()) :: t()

Transpose a DataFrame.

Warning

This is an expensive operation since data is stored in a columnar format.

Options

:header - When a string is passed, name of the header column is set to the value. When header is true, name is set to column (default: false)
:columns - names for non header columns. Length of column_names should match the row count of data frame. (default: nil)

Examples

iex> df = Explorer.DataFrame.new(a: ["d", nil], b: [1, 2], c: ["a", "b"])
iex> Explorer.DataFrame.transpose(df)
#Explorer.DataFrame<
  Polars[3 x 2]
  column_0 string ["d", "1", "a"]
  column_1 string [nil, "2", "b"]
>

iex> df = Explorer.DataFrame.new(a: ["d", nil], b: [1, 2], c: ["a", "b"])
iex> Explorer.DataFrame.transpose(df, columns: ["x", "y"])
#Explorer.DataFrame<
  Polars[3 x 2]
  x string ["d", "1", "a"]
  y string [nil, "2", "b"]
>

iex> df = Explorer.DataFrame.new(a: ["d", nil], b: [1, 2], c: ["a", "b"])
iex> Explorer.DataFrame.transpose(df, header: "name")
#Explorer.DataFrame<
  Polars[3 x 3]
  name string ["a", "b", "c"]
  column_0 string ["d", "1", "a"]
  column_1 string [nil, "2", "b"]
>

ungroup(df, groups \\ ..)

@spec ungroup(df :: t(), groups_or_group :: column_names() | column_name() | :all) ::
  t()

Removes grouping variables.

Accepts a list of group names. If groups is not specified, then all groups are removed.

Examples

Ungroups all by default:

iex> df = Explorer.Datasets.fossil_fuels()
iex> df = Explorer.DataFrame.group_by(df, ["country", "year"])
iex> Explorer.DataFrame.ungroup(df)
#Explorer.DataFrame<
  Polars[1094 x 10]
  year s64 [2010, 2010, 2010, 2010, 2010, ...]
  country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
  total s64 [2308, 1254, 32500, 141, 7924, ...]
  solid_fuel s64 [627, 117, 332, 0, 0, ...]
  liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
  gas_fuel s64 [74, 7, 14565, 0, 374, ...]
  cement s64 [5, 177, 2598, 0, 204, ...]
  gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
  per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
  bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>

Ungrouping a single column:

iex> df = Explorer.Datasets.fossil_fuels()
iex> df = Explorer.DataFrame.group_by(df, ["country", "year"])
iex> Explorer.DataFrame.ungroup(df, "country")
#Explorer.DataFrame<
  Polars[1094 x 10]
  Groups: ["year"]
  year s64 [2010, 2010, 2010, 2010, 2010, ...]
  country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
  total s64 [2308, 1254, 32500, 141, 7924, ...]
  solid_fuel s64 [627, 117, 332, 0, 0, ...]
  liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
  gas_fuel s64 [74, 7, 14565, 0, 374, ...]
  cement s64 [5, 177, 2598, 0, 204, ...]
  gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
  per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
  bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>

Lists, ranges, regexes, and functions are also accepted in column names, as in select/2.

unnest(df, column_or_columns)

@spec unnest(df :: t(), column_or_columns :: column_name() | [column_name()]) :: t()

Unnests one or multiple struct columns into individual columns.

The field names in a unnested column must not exist in the dataframe or in any other unnested columns.

Examples

iex> df = Explorer.DataFrame.new(before: [1, 2], struct: [%{x: 1, y: 2}, %{x: 3, y: 4}], after: [3, 4])
iex> Explorer.DataFrame.unnest(df, :struct)
#Explorer.DataFrame<
  Polars[2 x 4]
  before s64 [1, 2]
  x s64 [1, 3]
  y s64 [2, 4]
  after s64 [3, 4]
>

iex> df = Explorer.DataFrame.new(struct1: [%{x: 1, y: 2}, %{x: 3, y: 4}], struct2: [%{z: 5}, %{z: 6}])
iex> Explorer.DataFrame.unnest(df, [:struct1, :struct2])
#Explorer.DataFrame<
  Polars[2 x 3]
  x s64 [1, 3]
  y s64 [2, 4]
  z s64 [5, 6]
>

Functions: Multi-table

concat_columns(dfs)

@spec concat_columns([t()]) :: t()

Combine two or more dataframes column-wise.

This function expects the dataframes to have the same number of rows, otherwise rows may be silently discarded. Eager backends may check whenever this happens and raise instead of silently fail. But this may not be possible for lazy dataframes as the number of rows is not known upfront.

When working with grouped dataframes, be aware that only groups from the first dataframe are kept in the resultant dataframe.

Examples

iex> df1 = Explorer.DataFrame.new(x: [1, 2, 3], y: ["a", "b", "c"])
iex> df2 = Explorer.DataFrame.new(z: [4, 5, 6], a: ["d", "e", "f"])
iex> Explorer.DataFrame.concat_columns([df1, df2])
#Explorer.DataFrame<
  Polars[3 x 4]
  x s64 [1, 2, 3]
  y string ["a", "b", "c"]
  z s64 [4, 5, 6]
  a string ["d", "e", "f"]
>

Conflicting names are suffixed with the index of the dataframe in the array:

iex> df1 = Explorer.DataFrame.new(x: [1, 2, 3], y: ["a", "b", "c"])
iex> df2 = Explorer.DataFrame.new(x: [4, 5, 6], a: ["d", "e", "f"])
iex> Explorer.DataFrame.concat_columns([df1, df2])
#Explorer.DataFrame<
  Polars[3 x 4]
  x s64 [1, 2, 3]
  y string ["a", "b", "c"]
  x_1 s64 [4, 5, 6]
  a string ["d", "e", "f"]
>

concat_columns(df1, df2)

@spec concat_columns(t(), t()) :: t()

Combine two dataframes column-wise.

When working with grouped dataframes, be aware that only groups from the left-hand side dataframe are kept in the resultant dataframe.

concat_columns(df1, df2) is equivalent to concat_columns([df1, df2]).

concat_rows(dfs)

@spec concat_rows([t()]) :: t()

Combine two or more dataframes row-wise (stack).

Column names and dtypes must match. The only exception is for numeric columns that can be mixed together, and casted automatically to the "highest" dtype.

For example, if a column is represented by {:u, 16} and {:u, 32}, it will be casted to the highest unsigned integer dtype between the two, which is {:u, 32}. If it is mixing signed and unsigned integers, it will be casted to the highest signed integer possible. And if floats and integers are mixed together, Explorer will cast them to the float dtype {:f, 64}.

When working with grouped dataframes, be aware that only groups from the first dataframe are kept in the resultant dataframe.

Examples

iex> df1 = Explorer.DataFrame.new(x: [1, 2, 3], y: ["a", "b", "c"])
iex> df2 = Explorer.DataFrame.new(x: [4, 5, 6], y: ["d", "e", "f"])
iex> Explorer.DataFrame.concat_rows([df1, df2])
#Explorer.DataFrame<
  Polars[6 x 2]
  x s64 [1, 2, 3, 4, 5, ...]
  y string ["a", "b", "c", "d", "e", ...]
>

iex> df1 = Explorer.DataFrame.new(x: [1, 2, 3], y: ["a", "b", "c"])
iex> df2 = Explorer.DataFrame.new(x: [4.2, 5.3, 6.4], y: ["d", "e", "f"])
iex> Explorer.DataFrame.concat_rows([df1, df2])
#Explorer.DataFrame<
  Polars[6 x 2]
  x f64 [1.0, 2.0, 3.0, 4.2, 5.3, ...]
  y string ["a", "b", "c", "d", "e", ...]
>

concat_rows(df1, df2)

@spec concat_rows(t(), t()) :: t()

Combine two dataframes row-wise.

concat_rows(df1, df2) is equivalent to concat_rows([df1, df2]).

When working with grouped dataframes, be aware that only groups from the left-hand side dataframe are kept in the resultant dataframe.

join(left, right, opts \\ [])

@spec join(left :: t(), right :: t(), opts :: Keyword.t()) :: t()

Join two tables.

Join types

:inner - Returns all rows from left where there are matching values in right, and all columns from left and right.
:left - Returns all rows from left and all columns from left and right. Rows in left with no match in right will have nil values in the new columns.
:right - Returns all rows from right and all columns from left and right. Rows in right with no match in left will have nil values in the new columns.
:outer - Returns all rows and all columns from both left and right. Where there are not matching values, returns nil for the one missing.
:cross - Also known as a cartesian join. Returns all combinations of left and right. Can be very computationally expensive.

Options

:on - The column(s) to join on. Defaults to overlapping columns. Does not apply to cross join.
:how - One of the join types (as an atom) described above. Defaults to :inner.

Examples

Inner join:

iex> left = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> right = Explorer.DataFrame.new(a: [1, 2, 2], c: ["d", "e", "f"])
iex> Explorer.DataFrame.join(left, right)
#Explorer.DataFrame<
  Polars[3 x 3]
  a s64 [1, 2, 2]
  b string ["a", "b", "b"]
  c string ["d", "e", "f"]
>

Left join:

iex> left = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> right = Explorer.DataFrame.new(a: [1, 2, 2], c: ["d", "e", "f"])
iex> Explorer.DataFrame.join(left, right, how: :left)
#Explorer.DataFrame<
  Polars[4 x 3]
  a s64 [1, 2, 2, 3]
  b string ["a", "b", "b", "c"]
  c string ["d", "e", "f", nil]
>

Right join:

iex> left = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> right = Explorer.DataFrame.new(a: [1, 2, 4], c: ["d", "e", "f"])
iex> Explorer.DataFrame.join(left, right, how: :right)
#Explorer.DataFrame<
  Polars[3 x 3]
  a s64 [1, 2, 4]
  c string ["d", "e", "f"]
  b string ["a", "b", nil]
>

Outer join:

iex> left = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> right = Explorer.DataFrame.new(a: [1, 2, 4], c: ["d", "e", "f"])
iex> Explorer.DataFrame.join(left, right, how: :outer)
#Explorer.DataFrame<
  Polars[4 x 4]
  a s64 [1, 2, nil, 3]
  b string ["a", "b", nil, "c"]
  a_right s64 [1, 2, 4, nil]
  c string ["d", "e", "f", nil]
>

Cross join:

iex> left = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> right = Explorer.DataFrame.new(a: [1, 2, 4], c: ["d", "e", "f"])
iex> Explorer.DataFrame.join(left, right, how: :cross)
#Explorer.DataFrame<
  Polars[9 x 4]
  a s64 [1, 1, 1, 2, 2, ...]
  b string ["a", "a", "a", "b", "b", ...]
  a_right s64 [1, 2, 4, 1, 2, ...]
  c string ["d", "e", "f", "d", "e", ...]
>

Inner join with different names:

iex> left = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> right = Explorer.DataFrame.new(d: [1, 2, 2], c: ["d", "e", "f"])
iex> Explorer.DataFrame.join(left, right, on: [a: :d])
#Explorer.DataFrame<
  Polars[3 x 3]
  a s64 [1, 2, 2]
  b string ["a", "b", "b"]
  c string ["d", "e", "f"]
>

Grouped examples

When doing a join operation with grouped dataframes, the joined dataframe may keep the groups from only one side.

An inner join operation will keep the groups from the left-hand side dataframe:

iex> left = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> right = Explorer.DataFrame.new(a: [1, 2, 2], c: ["d", "e", "f"])
iex> grouped_left = Explorer.DataFrame.group_by(left, "b")
iex> grouped_right = Explorer.DataFrame.group_by(right, "c")
iex> Explorer.DataFrame.join(grouped_left, grouped_right)
#Explorer.DataFrame<
  Polars[3 x 3]
  Groups: ["b"]
  a s64 [1, 2, 2]
  b string ["a", "b", "b"]
  c string ["d", "e", "f"]
>

A left join operation will keep the groups from the left-hand side dataframe:

iex> left = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> right = Explorer.DataFrame.new(a: [1, 2, 2], c: ["d", "e", "f"])
iex> grouped_left = Explorer.DataFrame.group_by(left, "b")
iex> grouped_right = Explorer.DataFrame.group_by(right, "c")
iex> Explorer.DataFrame.join(grouped_left, grouped_right, how: :left)
#Explorer.DataFrame<
  Polars[4 x 3]
  Groups: ["b"]
  a s64 [1, 2, 2, 3]
  b string ["a", "b", "b", "c"]
  c string ["d", "e", "f", nil]
>

A right join operation will keep the groups from the right-hand side dataframe:

iex> left = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> right = Explorer.DataFrame.new(a: [1, 2, 4], c: ["d", "e", "f"])
iex> grouped_left = Explorer.DataFrame.group_by(left, "b")
iex> grouped_right = Explorer.DataFrame.group_by(right, "c")
iex> Explorer.DataFrame.join(grouped_left, grouped_right, how: :right)
#Explorer.DataFrame<
  Polars[3 x 3]
  Groups: ["c"]
  a s64 [1, 2, 4]
  c string ["d", "e", "f"]
  b string ["a", "b", nil]
>

An outer join operation is going to keep the groups from the left-hand side dataframe:

iex> left = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> right = Explorer.DataFrame.new(a: [1, 2, 4], c: ["d", "e", "f"])
iex> grouped_left = Explorer.DataFrame.group_by(left, "b")
iex> grouped_right = Explorer.DataFrame.group_by(right, "c")
iex> Explorer.DataFrame.join(grouped_left, grouped_right, how: :outer)
#Explorer.DataFrame<
  Polars[4 x 4]
  Groups: ["b"]
  a s64 [1, 2, nil, 3]
  b string ["a", "b", nil, "c"]
  a_right s64 [1, 2, 4, nil]
  c string ["d", "e", "f", nil]
>

A cross join operation is going to keep the groups from the left-hand side dataframe:

iex> left = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> right = Explorer.DataFrame.new(a: [1, 2, 4], c: ["d", "e", "f"])
iex> grouped_left = Explorer.DataFrame.group_by(left, "b")
iex> grouped_right = Explorer.DataFrame.group_by(right, "c")
iex> Explorer.DataFrame.join(grouped_left, grouped_right, how: :cross)
#Explorer.DataFrame<
  Polars[9 x 4]
  Groups: ["b"]
  a s64 [1, 1, 1, 2, 2, ...]
  b string ["a", "a", "a", "b", "b", ...]
  a_right s64 [1, 2, 4, 1, 2, ...]
  c string ["d", "e", "f", "d", "e", ...]
>

Functions: Row-based

head(df, nrows \\ 5)

@spec head(df :: t(), nrows :: integer()) :: t()

Returns the first n rows of the dataframe.

By default it returns the first 5 rows.

If the dataframe is using groups, then the first n rows of each group is returned.

Examples

iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.head(df)
#Explorer.DataFrame<
  Polars[5 x 10]
  year s64 [2010, 2010, 2010, 2010, 2010]
  country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA"]
  total s64 [2308, 1254, 32500, 141, 7924]
  solid_fuel s64 [627, 117, 332, 0, 0]
  liquid_fuel s64 [1601, 953, 12381, 141, 3649]
  gas_fuel s64 [74, 7, 14565, 0, 374]
  cement s64 [5, 177, 2598, 0, 204]
  gas_flaring s64 [0, 0, 2623, 0, 3697]
  per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37]
  bunker_fuels s64 [9, 7, 663, 0, 321]
>

iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.head(df, 2)
#Explorer.DataFrame<
  Polars[2 x 10]
  year s64 [2010, 2010]
  country string ["AFGHANISTAN", "ALBANIA"]
  total s64 [2308, 1254]
  solid_fuel s64 [627, 117]
  liquid_fuel s64 [1601, 953]
  gas_fuel s64 [74, 7]
  cement s64 [5, 177]
  gas_flaring s64 [0, 0]
  per_capita f64 [0.08, 0.43]
  bunker_fuels s64 [9, 7]
>

Grouped examples

Using grouped dataframes makes head/2 return n rows from each group. Here is an example using the Iris dataset, and returning two rows from each group:

iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.head(grouped, 2)
#Explorer.DataFrame<
  Polars[6 x 5]
  Groups: ["species"]
  sepal_length f64 [5.1, 4.9, 7.0, 6.4, 6.3, ...]
  sepal_width f64 [3.5, 3.0, 3.2, 3.2, 3.3, ...]
  petal_length f64 [1.4, 1.4, 4.7, 4.5, 6.0, ...]
  petal_width f64 [0.2, 0.2, 1.4, 1.5, 2.5, ...]
  species string ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-versicolor", "Iris-virginica", ...]
>

sample(df, n_or_frac, opts \\ [])

@spec sample(df :: t(), n_or_frac :: number(), opts :: Keyword.t()) :: t()

Sample rows from a dataframe.

If given an integer as the second argument, it will return N samples. If given a float, it will return that proportion of the series.

Can sample with or without replacement.

For grouped dataframes, sample will take into account the rows of each group, meaning that if you try to get N samples and you have G groups, you will get N * G rows. See the examples below.

Options

:replace - If set to true, each sample will be independent and therefore values may repeat. Required to be true for n greater then the number of rows in the dataframe or frac > 1.0. (default: false)
:seed - An integer to be used as a random seed. If nil, a random value between 0 and 2^64 - 1 will be used. (default: nil)
:shuffle - If set to true, the resultant dataframe is going to be shuffle if the sample is equal to the size of the dataframe. (default: false)

Examples

You can sample N rows:

iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.sample(df, 3, seed: 100)
#Explorer.DataFrame<
  Polars[3 x 10]
  year s64 [2012, 2013, 2014]
  country string ["SYRIAN ARAB REPUBLIC", "EGYPT", "AFGHANISTAN"]
  total s64 [12198, 58198, 2675]
  solid_fuel s64 [1, 224, 1194]
  liquid_fuel s64 [7909, 26501, 1393]
  gas_fuel s64 [3265, 24672, 74]
  cement s64 [816, 6800, 14]
  gas_flaring s64 [208, 0, 0]
  per_capita f64 [0.61, 0.66, 0.08]
  bunker_fuels s64 [437, 694, 9]
>

Or you can sample a proportion of rows:

iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.sample(df, 0.03, seed: 100)
#Explorer.DataFrame<
  Polars[32 x 10]
  year s64 [2013, 2012, 2014, 2011, 2011, ...]
  country string ["BRITISH VIRGIN ISLANDS", "TAJIKISTAN", "AFGHANISTAN", "ICELAND", "SINGAPORE", ...]
  total s64 [48, 800, 2675, 513, 12332, ...]
  solid_fuel s64 [0, 192, 1194, 94, 7, ...]
  liquid_fuel s64 [48, 501, 1393, 400, 7774, ...]
  gas_fuel s64 [0, 74, 74, 0, 4551, ...]
  cement s64 [0, 34, 14, 19, 0, ...]
  gas_flaring s64 [0, 0, 0, 0, 0, ...]
  per_capita f64 [1.64, 0.1, 0.08, 1.6, 2.38, ...]
  bunker_fuels s64 [0, 28, 9, 168, 41786, ...]
>

Grouped examples

In the following example we have the Iris dataset grouped by species, and we want to take a sample of two plants from each group. Since we have three species, the resultant dataframe is going to have six rows (2 * 3).

iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.sample(grouped, 2, seed: 100)
#Explorer.DataFrame<
  Polars[6 x 5]
  Groups: ["species"]
  sepal_length f64 [4.8, 5.0, 5.5, 6.5, 7.4, ...]
  sepal_width f64 [3.1, 3.6, 2.4, 2.8, 2.8, ...]
  petal_length f64 [1.6, 1.4, 3.8, 4.6, 6.1, ...]
  petal_width f64 [0.2, 0.2, 1.1, 1.5, 1.9, ...]
  species string ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-versicolor", "Iris-virginica", ...]
>

The behaviour is similar when you want to take a fraction of the rows from each group. The main difference is that each group can have more or less rows, depending on its size.

iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.sample(grouped, 0.1, seed: 100)
#Explorer.DataFrame<
  Polars[15 x 5]
  Groups: ["species"]
  sepal_length f64 [5.2, 5.0, 5.2, 5.0, 5.0, ...]
  sepal_width f64 [3.4, 3.6, 3.5, 3.0, 3.4, ...]
  petal_length f64 [1.4, 1.4, 1.5, 1.6, 1.6, ...]
  petal_width f64 [0.2, 0.2, 0.2, 0.2, 0.4, ...]
  species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", ...]
>

shuffle(df, opts \\ [])

@spec shuffle(df :: t(), opts :: Keyword.t()) :: t()

Change the order of the rows of a dataframe randomly.

This function is going to ignore groups.

Options

:seed - An integer to be used as a random seed. If nil, a random value between 0 and 2^64 - 1 will be used. (default: nil)

Examples

iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.shuffle(df, seed: 100)
#Explorer.DataFrame<
  Polars[1094 x 10]
  year s64 [2014, 2014, 2014, 2012, 2010, ...]
  country string ["ISRAEL", "ARGENTINA", "NETHERLANDS", "YEMEN", "GRENADA", ...]
  total s64 [17617, 55638, 45624, 5091, 71, ...]
  solid_fuel s64 [6775, 1588, 9070, 129, 0, ...]
  liquid_fuel s64 [6013, 25685, 18272, 4173, 71, ...]
  gas_fuel s64 [3930, 26368, 18010, 414, 0, ...]
  cement s64 [898, 1551, 272, 375, 0, ...]
  gas_flaring s64 [0, 446, 0, 0, 0, ...]
  per_capita f64 [2.22, 1.29, 2.7, 0.2, 0.68, ...]
  bunker_fuels s64 [1011, 2079, 14210, 111, 4, ...]
>

slice(df, row_indices)

Slices rows at the given indices as a new dataframe.

The indices may be a list or series of indices, or a range. A list of indices does not support negative numbers. Ranges may be negative on either end, which are then normalized. Note ranges in Elixir are inclusive.

Slice works differently when a dataframe is grouped. It is going to consider the indices of each group instead of the entire dataframe. See the examples below.

If your intention is to grab a portion of each group, prefer to use sample/3 instead.

Examples

iex> df = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> Explorer.DataFrame.slice(df, [0, 2])
#Explorer.DataFrame<
  Polars[2 x 2]
  a s64 [1, 3]
  b string ["a", "c"]
>

With a series

iex> df = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> Explorer.DataFrame.slice(df, Explorer.Series.from_list([0, 2]))
#Explorer.DataFrame<
  Polars[2 x 2]
  a s64 [1, 3]
  b string ["a", "c"]
>

With a range:

iex> df = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> Explorer.DataFrame.slice(df, 1..2)
#Explorer.DataFrame<
  Polars[2 x 2]
  a s64 [2, 3]
  b string ["b", "c"]
>

With a range with negative first and last:

iex> df = Explorer.DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
iex> Explorer.DataFrame.slice(df, -2..-1)
#Explorer.DataFrame<
  Polars[2 x 2]
  a s64 [2, 3]
  b string ["b", "c"]
>

Grouped examples

We are going to once again use the Iris dataset. In this example we want to take elements at indexes 0 and 2:

iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.slice(grouped, [0, 2])
#Explorer.DataFrame<
  Polars[6 x 5]
  Groups: ["species"]
  sepal_length f64 [5.1, 4.7, 7.0, 6.9, 6.3, ...]
  sepal_width f64 [3.5, 3.2, 3.2, 3.1, 3.3, ...]
  petal_length f64 [1.4, 1.3, 4.7, 4.9, 6.0, ...]
  petal_width f64 [0.2, 0.2, 1.4, 1.5, 2.5, ...]
  species string ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-versicolor", "Iris-virginica", ...]
>

Now we want to take the first 3 rows of each group. This is going to work with the range 0..2:

iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.slice(grouped, 0..2)
#Explorer.DataFrame<
  Polars[9 x 5]
  Groups: ["species"]
  sepal_length f64 [5.1, 4.9, 4.7, 7.0, 6.4, ...]
  sepal_width f64 [3.5, 3.0, 3.2, 3.2, 3.2, ...]
  petal_length f64 [1.4, 1.4, 1.3, 4.7, 4.5, ...]
  petal_width f64 [0.2, 0.2, 0.2, 1.4, 1.5, ...]
  species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-versicolor", ...]
>

slice(df, offset, length)

Subset a continuous set of rows.

Examples

iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.slice(df, 1, 2)
#Explorer.DataFrame<
  Polars[2 x 10]
  year s64 [2010, 2010]
  country string ["ALBANIA", "ALGERIA"]
  total s64 [1254, 32500]
  solid_fuel s64 [117, 332]
  liquid_fuel s64 [953, 12381]
  gas_fuel s64 [7, 14565]
  cement s64 [177, 2598]
  gas_flaring s64 [0, 2623]
  per_capita f64 [0.43, 0.9]
  bunker_fuels s64 [7, 663]
>

Negative offsets count from the end of the series:

iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.slice(df, -10, 2)
#Explorer.DataFrame<
  Polars[2 x 10]
  year s64 [2014, 2014]
  country string ["UNITED STATES OF AMERICA", "URUGUAY"]
  total s64 [1432855, 1840]
  solid_fuel s64 [450047, 2]
  liquid_fuel s64 [576531, 1700]
  gas_fuel s64 [390719, 25]
  cement s64 [11314, 112]
  gas_flaring s64 [4244, 0]
  per_capita f64 [4.43, 0.54]
  bunker_fuels s64 [30722, 251]
>

If the length would run past the end of the dataframe, the result may be shorter than the length:

iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.slice(df, -10, 20)
#Explorer.DataFrame<
  Polars[10 x 10]
  year s64 [2014, 2014, 2014, 2014, 2014, ...]
  country string ["UNITED STATES OF AMERICA", "URUGUAY", "UZBEKISTAN", "VANUATU", "VENEZUELA", ...]
  total s64 [1432855, 1840, 28692, 42, 50510, ...]
  solid_fuel s64 [450047, 2, 1677, 0, 204, ...]
  liquid_fuel s64 [576531, 1700, 2086, 42, 28445, ...]
  gas_fuel s64 [390719, 25, 23929, 0, 12731, ...]
  cement s64 [11314, 112, 1000, 0, 1088, ...]
  gas_flaring s64 [4244, 0, 0, 0, 8042, ...]
  per_capita f64 [4.43, 0.54, 0.97, 0.16, 1.65, ...]
  bunker_fuels s64 [30722, 251, 0, 10, 1256, ...]
>

Grouped examples

We want to take the first 3 rows of each group. We need the offset 0 and the length 3:

iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.slice(grouped, 0, 3)
#Explorer.DataFrame<
  Polars[9 x 5]
  Groups: ["species"]
  sepal_length f64 [5.1, 4.9, 4.7, 7.0, 6.4, ...]
  sepal_width f64 [3.5, 3.0, 3.2, 3.2, 3.2, ...]
  petal_length f64 [1.4, 1.4, 1.3, 4.7, 4.5, ...]
  petal_width f64 [0.2, 0.2, 0.2, 1.4, 1.5, ...]
  species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-versicolor", ...]
>

We can also pass a negative offset:

iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.slice(grouped, -6, 3)
#Explorer.DataFrame<
  Polars[9 x 5]
  Groups: ["species"]
  sepal_length f64 [5.1, 4.8, 5.1, 5.6, 5.7, ...]
  sepal_width f64 [3.8, 3.0, 3.8, 2.7, 3.0, ...]
  petal_length f64 [1.9, 1.4, 1.6, 4.2, 4.2, ...]
  petal_width f64 [0.4, 0.3, 0.2, 1.3, 1.2, ...]
  species string ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-versicolor", ...]
>

tail(df, nrows \\ 5)

@spec tail(df :: t(), nrows :: integer()) :: t()

Returns the last n rows of the dataframe.

By default it returns the last 5 rows.

If the dataframe is using groups, then the last n rows of each group is returned.

Examples

iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.tail(df)
#Explorer.DataFrame<
  Polars[5 x 10]
  year s64 [2014, 2014, 2014, 2014, 2014]
  country string ["VIET NAM", "WALLIS AND FUTUNA ISLANDS", "YEMEN", "ZAMBIA", "ZIMBABWE"]
  total s64 [45517, 6, 6190, 1228, 3278]
  solid_fuel s64 [19246, 0, 137, 132, 2097]
  liquid_fuel s64 [12694, 6, 5090, 797, 1005]
  gas_fuel s64 [5349, 0, 581, 0, 0]
  cement s64 [8229, 0, 381, 299, 177]
  gas_flaring s64 [0, 0, 0, 0, 0]
  per_capita f64 [0.49, 0.44, 0.24, 0.08, 0.22]
  bunker_fuels s64 [761, 1, 153, 33, 9]
>

iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.tail(df, 2)
#Explorer.DataFrame<
  Polars[2 x 10]
  year s64 [2014, 2014]
  country string ["ZAMBIA", "ZIMBABWE"]
  total s64 [1228, 3278]
  solid_fuel s64 [132, 2097]
  liquid_fuel s64 [797, 1005]
  gas_fuel s64 [0, 0]
  cement s64 [299, 177]
  gas_flaring s64 [0, 0]
  per_capita f64 [0.08, 0.22]
  bunker_fuels s64 [33, 9]
>

Grouped examples

Using grouped dataframes makes tail/2 return n rows from each group. Here is an example using the Iris dataset, and returning two rows from each group:

iex> df = Explorer.Datasets.iris()
iex> grouped = Explorer.DataFrame.group_by(df, "species")
iex> Explorer.DataFrame.tail(grouped, 2)
#Explorer.DataFrame<
  Polars[6 x 5]
  Groups: ["species"]
  sepal_length f64 [5.3, 5.0, 5.1, 5.7, 6.2, ...]
  sepal_width f64 [3.7, 3.3, 2.5, 2.8, 3.4, ...]
  petal_length f64 [1.5, 1.4, 3.0, 4.1, 5.4, ...]
  petal_width f64 [0.2, 0.2, 1.1, 1.3, 2.3, ...]
  species string ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-versicolor", "Iris-virginica", ...]
>

Functions: Introspection

dtypes(df)

@spec dtypes(df :: t()) :: %{required(String.t()) => Explorer.Series.dtype()}

Gets the dtypes of the dataframe columns.

Examples

iex> df = Explorer.DataFrame.new(floats: [1.0, 2.0], ints: [1, 2])
iex> Explorer.DataFrame.dtypes(df)
%{"floats" => {:f, 64}, "ints" => {:s, 64}}

groups(data_frame)

@spec groups(df :: t()) :: [String.t()]

Returns the groups of a dataframe.

Examples

iex> df = Explorer.Datasets.fossil_fuels()
iex> df = Explorer.DataFrame.group_by(df, "country")
iex> Explorer.DataFrame.groups(df)
["country"]

iex> df = Explorer.Datasets.iris()
iex> Explorer.DataFrame.groups(df)
[]

n_columns(df)

@spec n_columns(df :: t()) :: integer()

Returns the number of columns in the dataframe.

This function works the same way for grouped dataframes.

Examples

iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.n_columns(df)
10

n_rows(df)

@spec n_rows(df :: t()) :: integer()

Returns the number of rows in the dataframe.

This function works the same way for grouped dataframes, considering the entire dataframe in the counting of rows.

Examples

iex> df = Explorer.Datasets.fossil_fuels()
iex> Explorer.DataFrame.n_rows(df)
1094

names(df)

@spec names(df :: t()) :: [String.t()]

Gets the names of the dataframe columns.

Examples

iex> df = Explorer.DataFrame.new(floats: [1.0, 2.0], ints: [1, 2])
iex> Explorer.DataFrame.names(df)
["floats", "ints"]

print(df, opts \\ [])

@spec print(df :: t(), opts :: Keyword.t()) :: :ok

Prints the DataFrame in a tabular fashion.

Options

:limit (non_neg_integer() | :infinity) - number of rows to print.

Examples

df = Explorer.Datasets.iris()
Explorer.DataFrame.print(df)
Explorer.DataFrame.print(df, limit: 1)
Explorer.DataFrame.print(df, limit: :infinity)

shape(df)

@spec shape(df :: t()) :: {integer(), integer()}

Gets the shape of the dataframe as a {height, width} tuple.

This function works the same way for grouped dataframes, considering the entire dataframe in the counting of rows.

Examples

iex> df = Explorer.DataFrame.new(floats: [1.0, 2.0, 3.0], ints: [1, 2, 3])
iex> Explorer.DataFrame.shape(df)
{3, 2}

Functions: IO

dump_csv(df, opts \\ [])

@spec dump_csv(df :: t(), opts :: Keyword.t()) ::
  {:ok, String.t()} | {:error, Exception.t()}

Writes a dataframe to a binary representation of a delimited file.

Options

:header - Should the column names be written as the first line of the file? (default: true)
:delimiter - A single character used to separate fields within a record. (default: ",")

Examples

iex> df = Explorer.Datasets.fossil_fuels() |> Explorer.DataFrame.head(2)
iex> Explorer.DataFrame.dump_csv(df)
{:ok, "year,country,total,solid_fuel,liquid_fuel,gas_fuel,cement,gas_flaring,per_capita,bunker_fuels\n2010,AFGHANISTAN,2308,627,1601,74,5,0,0.08,9\n2010,ALBANIA,1254,117,953,7,177,0,0.43,7\n"}

dump_csv!(df, opts \\ [])

@spec dump_csv!(df :: t(), opts :: Keyword.t()) :: String.t()

Similar to dump_csv/2, but raises in case of error.

dump_ipc(df, opts \\ [])

@spec dump_ipc(df :: t(), opts :: Keyword.t()) ::
  {:ok, binary()} | {:error, Exception.t()}

Writes a dataframe to a binary representation of an IPC file.

Groups are ignored if the dataframe is using any.

Options

:compression - The compression algorithm to use when writing files. Supported options are:
- nil (uncompressed, default)
- :zstd
- :lz4.

dump_ipc!(df, opts \\ [])

@spec dump_ipc!(df :: t(), opts :: Keyword.t()) :: binary()

Similar to dump_ipc/2, but raises in case of error.

dump_ipc_stream(df, opts \\ [])

@spec dump_ipc_stream(df :: t(), opts :: Keyword.t()) ::
  {:ok, binary()} | {:error, Exception.t()}

Writes a dataframe to a binary representation of an IPC Stream file.

Groups are ignored if the dataframe is using any.

Options

:compression - The compression algorithm to use when writing files. Supported options are:
- nil (uncompressed, default)
- :zstd
- :lz4.

dump_ipc_stream!(df, opts \\ [])

@spec dump_ipc_stream!(df :: t(), opts :: Keyword.t()) :: binary()

Similar to dump_ipc_stream/2, but raises in case of error.

dump_ndjson(df)

@spec dump_ndjson(df :: t()) :: {:ok, binary()} | {:error, Exception.t()}

Writes a dataframe to a binary representation of a NDJSON file.

Groups are ignored if the dataframe is using any.

Examples

iex> df = Explorer.DataFrame.new(col_a: [1, 2], col_b: [5.1, 5.2])
iex> Explorer.DataFrame.dump_ndjson(df)
{:ok, ~s({"col_a":1,"col_b":5.1}\n{"col_a":2,"col_b":5.2}\n)}

dump_ndjson!(df)

@spec dump_ndjson!(df :: t()) :: binary()

Similar to dump_ndjson!/2, but raises in case of error.

dump_parquet(df, opts \\ [])

@spec dump_parquet(df :: t(), opts :: Keyword.t()) ::
  {:ok, binary()} | {:error, Exception.t()}

Writes a dataframe to a binary representation of a Parquet file.

Groups are ignored if the dataframe is using any.

Options

:compression - The compression algorithm to use when writing files. Where a compression level is available, this can be passed as a tuple, such as {:zstd, 3}. Supported options are:
- nil (uncompressed, default)
- :snappy
- :gzip (with levels 1-9)
- :brotli (with levels 1-11)
- :zstd (with levels -7-22)
- :lz4raw.

dump_parquet!(df, opts \\ [])

@spec dump_parquet!(df :: t(), opts :: Keyword.t()) :: binary()

Similar to dump_parquet/2, but raises in case of error.

from_csv(filename, opts \\ [])

@spec from_csv(filename :: String.t() | fs_entry(), opts :: Keyword.t()) ::
  {:ok, t()} | {:error, Exception.t()}

Reads a delimited file into a dataframe.

It accepts a filename that can be a local file, a "s3://" schema, or a FSS entry like FSS.S3.Entry.

If the CSV is compressed, it is automatically decompressed.

Options

:delimiter - A single character used to separate fields within a record. (default: ",")
:dtypes - A list of shape [column_name: dtype]. The column names must match the ones in the CSV header or the ones given in the :columns option. Any non-specified column has its type imputed from the first 1000 rows. (default: [])
:header - Does the file have a header of column names as the first row or not? (default: true)
:max_rows - Maximum number of lines to read. (default: nil)
:nil_values - A list of strings that should be interpreted as a nil values. (default: [])
:skip_rows - The number of lines to skip at the beginning of the file. (default: 0)
:skip_rows_after_header - The number of lines to skip after the header row. (default: 0)
:columns - A list of column names or indexes to keep. If present, only these columns are read into the dataframe. (default: nil)
:infer_schema_length - Maximum number of rows read for schema inference. Set it to zero to disable inference and default all values to string. Setting this to nil will do a full table scan and will be slow (default: 1000).
:parse_dates - Automatically try to parse dates/ datetimes and time. If parsing fails, columns remain of dtype string
:eol_delimiter - A single character used to represent new lines. (default: " ")
:config - An optional struct, keyword list or map, normally associated with remote file systems. See IO section for more details. (default: nil)
:backend - The Explorer backend to use. Defaults to the value returned by Explorer.Backend.get/0.
:lazy - force the results into the lazy version of the current backend.
:node - The Erlang node to allocate the data frame on.
:encoding - Encoding to use when reading the file. For now, the only possible values are utf8 and utf8-lossy. The utf8-lossy option means that invalid utf8 values are replaced with � characters. (default: "utf8")

from_csv!(filename, opts \\ [])

@spec from_csv!(filename :: String.t() | fs_entry(), opts :: Keyword.t()) :: t()

Similar to from_csv/2 but raises if there is a problem reading the CSV.

from_ipc(filename, opts \\ [])

@spec from_ipc(filename :: String.t() | fs_entry(), opts :: Keyword.t()) ::
  {:ok, t()} | {:error, Exception.t()}

Reads an IPC file into a dataframe.

It accepts a filename that can be a local file, a "s3://" schema, or a FSS entry like FSS.S3.Entry.

Options

:columns - List with the name or index of columns to be selected. Defaults to all columns.
:config - An optional struct, keyword list or map, normally associated with remote file systems. See IO section for more details. (default: nil)
:backend - The Explorer backend to use. Defaults to the value returned by Explorer.Backend.get/0.
:lazy - force the results into the lazy version of the current backend.
:node - The Erlang node to allocate the data frame on.

from_ipc!(filename, opts \\ [])

@spec from_ipc!(filename :: String.t() | fs_entry(), opts :: Keyword.t()) :: t()

Similar to from_ipc/2 but raises if there is a problem reading the IPC file.

from_ipc_stream(filename, opts \\ [])

Reads an IPC Streaming file into a dataframe.

It's possible to read from an IPC Streaming file using the lazy Polars backend, but the implementation is not truly lazy. We are going to read it first using the eager backend, and then convert the dataframe to lazy.

Options

:columns - List with the name or index of columns to be selected. Defaults to all columns.
:backend - The Explorer backend to use. Defaults to the value returned by Explorer.Backend.get/0.
:lazy - force the results into the lazy version of the current backend.
:node - The Erlang node to allocate the data frame on.
:config - An optional struct, keyword list or map, normally associated with remote file systems. See IO section for more details. (default: nil)

from_ipc_stream!(filename, opts \\ [])

@spec from_ipc_stream!(filename :: String.t() | fs_entry(), opts :: Keyword.t()) ::
  t()

Similar to from_ipc_stream/2 but raises if there is a problem reading the IPC Stream file.

from_ndjson(filename, opts \\ [])

@spec from_ndjson(filename :: String.t() | fs_entry(), opts :: Keyword.t()) ::
  {:ok, t()} | {:error, Exception.t()}

Read a file of JSON objects or lists separated by new lines

Options

:batch_size - Sets the batch size for reading rows. This value may have significant impact in performance, so adjust it for your needs (default: 1000).
:infer_schema_length - Maximum number of rows read for schema inference. Setting this to nil will do a full table scan and will be slow (default: 1000).
:backend - The Explorer backend to use. Defaults to the value returned by Explorer.Backend.get/0.
:lazy - force the results into the lazy version of the current backend.
:node - The Erlang node to allocate the data frame on.
:config - An optional struct, keyword list or map, normally associated with remote file systems. See IO section for more details. (default: nil)

from_ndjson!(filename, opts \\ [])

@spec from_ndjson!(filename :: String.t() | fs_entry(), opts :: Keyword.t()) :: t()

Similar to from_ndjson/2, but raises in case of error.

from_parquet(filename, opts \\ [])

@spec from_parquet(filename :: String.t() | fs_entry(), opts :: Keyword.t()) ::
  {:ok, t()} | {:error, Exception.t()}

Reads a parquet file into a dataframe.

It accepts a filename that can be a local file, a "s3://" schema, or a FSS entry like FSS.S3.Entry.

Options

:max_rows - Maximum number of lines to read. (default: nil)
:columns - A list of column names or indexes to keep. If present, only these columns are read into the dataframe. (default: nil)
:rechunk - Make sure that all columns are contiguous in memory by aggregating the chunks into a single array. (default: false)
:config - An optional struct, keyword list or map, normally associated with remote file systems. See IO section for more details. (default: nil)
:backend - The Explorer backend to use. Defaults to the value returned by Explorer.Backend.get/0.
:lazy - force the results into the lazy version of the current backend.
:node - The Erlang node to allocate the data frame on.

from_parquet!(filename, opts \\ [])

@spec from_parquet!(filename :: String.t() | fs_entry(), opts :: Keyword.t()) :: t()

Similar to from_parquet/2 but raises if there is a problem reading the Parquet file.

from_query(conn, query, params, opts \\ [])

@spec from_query(
  Adbc.Connection.t(),
  query :: String.t(),
  params :: [term()],
  opts :: Keyword.t()
) :: {:ok, t()} | {:error, Exception.t()}

Reads data from a query.

conn must be a Adbc.Connection process. sql is a string representing the sql query and params is the list of query parameters. See Adbc for more information.

Example

In order to read data from a database, you must list :adbc as a dependency, download the relevant driver, and start both database and connection processes in your supervision tree.

First, add :adbc as a dependency in your mix.exs:

{:adbc, "~> 0.1"}

Now, in your config/config.exs, configure the drivers you are going to use (see Adbc module docs for more information on supported drivers):

config :adbc, :drivers, [:sqlite]

If you are using a notebook or scripting, you can also use Adbc.download_driver!/1 to dynamically download one.

Then start the database and the relevant connection processes in your supervision tree:

children = [
  {Adbc.Database,
   driver: :sqlite,
   process_options: [name: MyApp.DB]},
  {Adbc.Connection,
   database: MyApp.DB,
   process_options: [name: MyApp.Conn]}
]

Supervisor.start_link(children, strategy: :one_for_one)

In a notebook, the above would look like this:

db = Kino.start_child!({Adbc.Database, driver: :sqlite})
conn = Kino.start_child!({Adbc.Connection, database: db})

And now you can make queries with:

# For named connections
{:ok, _} = Explorer.DataFrame.from_query(MyApp.Conn, "SELECT 123", [])

# When using the conn PID directly
{:ok, _} = Explorer.DataFrame.from_query(conn, "SELECT 123", [])

Parameterized queries are possible by passing a list of parameters as the third argument. The query string will use a different placeholder depending on the driver. For example, ? for SQLite and $1, $2, etc. for PostgreSQL.

# For SQLite
{:ok, _} = Explorer.DataFrame.from_query(conn, "SELECT ?", [123])

# When using PostgreSQL
{:ok, _} = Explorer.DataFrame.from_query(conn, "SELECT $1", [123])

Options

:backend - The Explorer backend to use. Defaults to the value returned by Explorer.Backend.get/0.
:lazy - force the results into the lazy version of the current backend.

from_query!(conn, query, params, opts \\ [])

@spec from_query!(
  Adbc.Connection.t(),
  query :: String.t(),
  params :: [term()],
  opts :: Keyword.t()
) :: t()

Similar to from_query/4 but raises if there is an error.

load_csv(contents, opts \\ [])

@spec load_csv(contents :: String.t(), opts :: Keyword.t()) ::
  {:ok, t()} | {:error, Exception.t()}

Reads a representation of a CSV file into a dataframe.

If the CSV is compressed, it is automatically decompressed.

Accepts the same options as from_csv/2.

load_csv!(contents, opts \\ [])

@spec load_csv!(contents :: String.t(), opts :: Keyword.t()) :: t()

Similar to load_csv/2 but raises if there is a problem reading the CSV.

load_ipc(contents, opts \\ [])

@spec load_ipc(contents :: binary(), opts :: Keyword.t()) ::
  {:ok, t()} | {:error, Exception.t()}

Reads a binary representing an IPC file into a dataframe.

Accepts the same options as from_ipc/2.

load_ipc!(contents, opts \\ [])

@spec load_ipc!(contents :: binary(), opts :: Keyword.t()) :: t()

Similar to load_ipc/2 but raises if there is a problem reading the IPC file.

load_ipc_stream(contents, opts \\ [])

@spec load_ipc_stream(contents :: binary(), opts :: Keyword.t()) ::
  {:ok, t()} | {:error, Exception.t()}

Reads a binary representing an IPC Stream file into a dataframe.

Accepts the same options as from_ipc_stream/2.

load_ipc_stream!(contents, opts \\ [])

@spec load_ipc_stream!(contents :: binary(), opts :: Keyword.t()) :: t()

Similar to load_ipc_stream/2 but raises if there is a problem.

load_ndjson(contents, opts \\ [])

@spec load_ndjson(contents :: String.t(), opts :: Keyword.t()) ::
  {:ok, t()} | {:error, Exception.t()}

Reads a representation of a NDJSON file into a dataframe.

Accepts the same options as from_ndjson/2.

load_ndjson!(contents, opts \\ [])

@spec load_ndjson!(contents :: String.t(), opts :: Keyword.t()) :: t()

Similar to load_ndjson/2, but raises in case of error.

Examples

iex> contents = ~s({"col_a":1,"col_b":5.1}\n{"col_a":2,"col_b":5.2}\n)
iex> Explorer.DataFrame.load_ndjson!(contents)
#Explorer.DataFrame<
  Polars[2 x 2]
  col_a s64 [1, 2]
  col_b f64 [5.1, 5.2]
>

load_parquet(contents, opts \\ [])

@spec load_parquet(contents :: binary(), opts :: Keyword.t()) ::
  {:ok, t()} | {:error, Exception.t()}

Reads a binary representation of a parquet file into a dataframe.

Accepts the same options as from_parquet/2.

load_parquet!(contents, opts \\ [])

@spec load_parquet!(contents :: binary(), opts :: Keyword.t()) :: t()

Similar to load_parquet/2 but raises if there is a problem reading the Parquet file.

to_csv(df, filename, opts \\ [])

@spec to_csv(df :: t(), filename :: fs_entry() | String.t(), opts :: Keyword.t()) ::
  :ok | {:error, Exception.t()}

Writes a dataframe to a delimited file.

Note not all fields can be serialized to csv and it may require explicitly casting to other formats. In such cases, an error is returned.

Groups are ignored if the dataframe is using any.

Options

:header - Should the column names be written as the first line of the file? (default: true)
:delimiter - A single character used to separate fields within a record. (default: ",")
:config - An optional struct, keyword list or map, normally associated with remote file systems. See IO section for more details. (default: nil)
:streaming - Tells the backend if it should use streaming, which means that the dataframe is not loaded to the memory at once, and instead it is written in chunks from a lazy dataframe. Defaults to true for local filesystems, ignored on all others.

to_csv!(df, filename, opts \\ [])

@spec to_csv!(df :: t(), filename :: String.t() | fs_entry(), opts :: Keyword.t()) ::
  :ok

Similar to to_csv/3 but raises if there is a problem reading the CSV.

to_ipc(df, filename, opts \\ [])

@spec to_ipc(df :: t(), filename :: String.t() | fs_entry(), opts :: Keyword.t()) ::
  :ok | {:error, Exception.t()}

Writes a dataframe to an IPC file.

Apache IPC is a language-agnostic columnar data structure that can be used to store dataframes. It excels as a format for quickly exchange data between different programming languages.

Groups are ignored if the dataframe is using any.

Options

:compression - The compression algorithm to use when writing files. Supported options are:
- nil (uncompressed, default)
- :zstd
- :lz4.
:streaming - Tells the backend if it should use streaming, which means that the dataframe is not loaded to the memory at once, and instead it is written in chunks from a lazy dataframe.
This option has no effect on eager - the default - dataframes. It defaults to true.
:config - An optional struct, keyword list or map, normally associated with remote file systems. See IO section for more details. (default: nil)

to_ipc!(df, filename, opts \\ [])

@spec to_ipc!(df :: t(), filename :: String.t() | fs_entry(), opts :: Keyword.t()) ::
  :ok

Similar to to_ipc/3, but raises in case of error.

to_ipc_stream(df, filename, opts \\ [])

Writes a dataframe to an IPC Stream file.

Arrow IPC Streams provide a streaming protocol or “format" for sending an arbitrary length sequence of record batches. The format must be processed from start to end, and does not support random access.

Options

:compression - The compression algorithm to use when writing files. Supported options are:
- nil (uncompressed, default)
- :zstd
- :lz4.
:config - An optional struct, keyword list or map, normally associated with remote file systems. See IO section for more details. (default: nil)

to_ipc_stream!(df, filename, opts \\ [])

@spec to_ipc_stream!(
  df :: t(),
  filename :: String.t() | fs_entry(),
  opts :: Keyword.t()
) :: :ok

Similar to to_ipc_stream/3, but raises in case of error.

to_ndjson(df, filename, opts \\ [])

@spec to_ndjson(df :: t(), filename :: String.t() | fs_entry(), opts :: Keyword.t()) ::
  :ok | {:error, Exception.t()}

Writes a dataframe to a ndjson file.

Groups are ignored if the dataframe is using any.

NDJSON are files that contains JSON files separated by new lines. They are often used as structured logs.

Options

:config - An optional struct, keyword list or map, normally associated with remote file systems. See IO section for more details. (default: nil)

to_ndjson!(df, filename, opts \\ [])

@spec to_ndjson!(df :: t(), filename :: String.t() | fs_entry(), opts :: Keyword.t()) ::
  :ok

Similar to to_ndjson/3, but raises in case of error.

to_parquet(df, filename, opts \\ [])

@spec to_parquet(df :: t(), filename :: String.t() | fs_entry(), opts :: Keyword.t()) ::
  :ok | {:error, Exception.t()}

Writes a dataframe to a parquet file.

Groups are ignored if the dataframe is using any.

Options

:compression - The compression algorithm to use when writing files. Where a compression level is available, this can be passed as a tuple, such as {:zstd, 3}. Supported options are:
- nil (uncompressed, default)
- :snappy
- :gzip (with levels 1-9)
- :brotli (with levels 1-11)
- :zstd (with levels -7-22)
- :lz4raw.
:streaming - Tells the backend if it should use streaming, which means that the dataframe is not loaded to the memory at once, and instead it is written in chunks from a lazy dataframe.
This option has no effect on eager - the default - dataframes. It defaults to true.
:config - An optional struct, keyword list or map, normally associated with remote file systems. See IO section for more details. (default: nil)

to_parquet!(df, filename, opts \\ [])

Similar to to_parquet/3, but raises in case of error.

Types

column()

@type column() :: column_name() | non_neg_integer()

Represents a column name or its index.

column_name()

@type column_name() :: atom() | String.t()

Represents a column name as atom or string.

column_names()

@type column_names() :: [column_name()]

Represents multiple column names as atoms or strings.

column_pairs(value)

@type column_pairs(value) :: [{column(), value}] | %{required(column()) => value}

Represents a column pair where the value is a column name or a column index, and the value is of type value.

columns()

@type columns() ::
  [column()]
  | Range.t()
  | (String.t() -> boolean())
  | (String.t(), Explorer.Series.dtype() -> boolean())

Represents multiple columns.

The columns may be specified as one of:

a list of columns indexes or names as atoms and strings
a range
a regex that keeps only the names matching the regex
a one-arity function that receives column names and returns true for column names to keep
a two-arity function that receives column names and types and returns true for column names to keep

fs_entry()

@type fs_entry() :: FSS.entry()

Represents a filesystem entry, that can be local, S3 or URL.

sort_callback_result()

@type sort_callback_result() ::
  Explorer.Series.lazy_t()
  | [Explorer.Series.lazy_t()]
  | [{:asc | :desc, Explorer.Series.lazy_t()}]

t()

@type t() :: %Explorer.DataFrame{
  data: Explorer.Backend.DataFrame.t(),
  dtypes: %{required(String.t()) => Explorer.Series.dtype()},
  groups: [String.t()],
  names: [String.t()],
  remote: term()
}

Represents a dataframe.