View Source Ten Minutes to Explorer
Mix.install([
{:explorer, "~> 0.9.0"},
{:kino, "~> 0.13.0"}
])
Introduction
Explorer is a dataframe library for Elixir. A dataframe is a common data structure used in data analysis. It is a two-dimensional table composed of columns and rows similar to a SQL table or a spreadsheet.
Explorer's aim is to provide a simple and powerful API for manipulating dataframes. It takes influences mainly
from the tidyverse
, but if you've used other dataframe libraries like pandas
you shouldn't have too much trouble working with Explorer.
This document is meant to give you a crash course in using Explorer. More in-depth documentation can be found in the relevant sections of the docs.
We strongly recommend you run this livebook locally so you can see the outputs and play with the inputs!
Reading and writing data
Data can be read from delimited files (like CSV), NDJSON, Parquet, and the Arrow IPC (feather) format. You can also load in data from a map or keyword list of columns with Explorer.DataFrame.new/1
.
For CSV, your 'usual suspects' of options are available:
delimiter
- A single character used to separate fields within a record. (default:","
)dtypes
- A keyword list of[column_name: dtype]
. If a type is not specified for a column, it is imputed from the first 1000 rows. (default:[]
)header
- Does the file have a header of column names as the first row or not? (default:true
)max_rows
- Maximum number of lines to read. (default:nil
)nil_values
- A list of strings that should be interpreted as a nil values. (default:[]
)skip_rows
- The number of lines to skip at the beginning of the file. (default:0
)skip_rows_after_header
- The number of lines to skip at the after the header row. (default:0
)columns
- A list of column names to keep. If present, only these columns are read into the dataframe. (default:nil
)
Explorer
also has multiple example datasets built in, which you can load from the Explorer.Datasets
module like so:
df = Explorer.Datasets.fossil_fuels()
#Explorer.DataFrame<
Polars[1094 x 10]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total s64 [2308, 1254, 32500, 141, 7924, ...]
solid_fuel s64 [627, 117, 332, 0, 0, ...]
liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
gas_fuel s64 [74, 7, 14565, 0, 374, ...]
cement s64 [5, 177, 2598, 0, 204, ...]
gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>
You'll notice that the output looks slightly different than many dataframe libraries. Explorer
takes inspiration on this front from glimpse
in R. A benefit to this approach is that you will rarely need to elide columns.
If you'd like to see a table with your data, take a look at Kino Explorer, that provides a rich table with filtering and sorting.
Writing files is very similar to reading them. The options are a little more limited:
header
- Should the column names be written as the first line of the file? (default:true
)delimiter
- A single character used to separate fields within a record. (default:","
)
First, let's add some useful aliases:
alias Explorer.DataFrame
alias Explorer.Series
Explorer.Series
And then write to a file of your choosing:
input = Kino.Input.text("Filename")
filename = Kino.Input.read(input)
DataFrame.to_csv(df, filename)
:ok
Working with Series
Explorer
, like Polars
, works up from the concept of a Series
. In many ways, you can think of a dataframe as a row-aligned map of Series
. These are like vectors
in R or series
in Pandas.
Explorer
supports the following Series
dtypes
:
:null
- Null:binary
- Binaries (sequences of bytes):boolean
- Boolean:category
- Strings but represented internally as integers:date
- Date type that unwraps toElixir.Date
{:datetime, precision}
- DateTime type with millisecond/microsecond/nanosecond precision that unwraps toElixir.NaiveDateTime
{:duration, precision}
- Duration type with millisecond/microsecond/nanosecond precision that unwraps toExplorer.Duration
{:f, size}
- a 32-bit or 64-bit floating point number{:s, size}
- a 8-bit or 16-bit or 32-bit or 64-bit signed integer{:u, size}
- a 8-bit or 16-bit or 32-bit or 64-bit unsigned integer:string
- UTF-8 encoded binary:time
- Time type that unwraps toElixir.Time
:list
- a nested data type likeElixir.List
. It is a sequence of values having the same data type.:struct
- a nested data type like aElixir.Map
, but with the same keys and matching data types.
Series
can be constructed from Elixir basic types. For example:
s1 = Series.from_list([1, 2, 3])
#Explorer.Series<
Polars[3]
s64 [1, 2, 3]
>
s2 = Series.from_list(["a", "b", "c"])
#Explorer.Series<
Polars[3]
string ["a", "b", "c"]
>
s3 = Series.from_list([~D[2011-01-01], ~D[1965-01-21]])
#Explorer.Series<
Polars[2]
date [2011-01-01, 1965-01-21]
>
You'll notice that the dtype
and size of the Series
are at the top of the printed value. You can get those programmatically as well.
Series.dtype(s3)
:date
Series.size(s3)
2
And the printed values max out at 50:
1..100 |> Enum.to_list() |> Series.from_list()
#Explorer.Series<
Polars[100]
s64 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,
26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49,
50, ...]
>
Series are also nullable.
s = Series.from_list([1.0, 2.0, nil, nil, 5.0])
#Explorer.Series<
Polars[5]
f64 [1.0, 2.0, nil, nil, 5.0]
>
And you can fill in those missing values using one of the following strategies:
:forward
- replace nil with the previous value:backward
- replace nil with the next value:max
- replace nil with the series maximum:min
- replace nil with the series minimum:mean
- replace nil with the series mean
Series.fill_missing(s, :forward)
#Explorer.Series<
Polars[5]
f64 [1.0, 2.0, 2.0, 2.0, 5.0]
>
In the case of mixed numeric types (i.e. integers and floats), Series
will downcast to a float:
Series.from_list([1, 2.0])
#Explorer.Series<
Polars[2]
f64 [1.0, 2.0]
>
In all other cases, Series
must all be of the same dtype
or else you'll get an ArgumentError
.
Series.from_list([1, 2, 3, "a"])
One of the goals of Explorer
is useful error messages. If you look at the error above, you get:
the value "a" does not match the inferred dtype {:s, 64}
Hopefully this makes abundantly clear what's going on.
Series
also implements the Access
protocol. You can slice and dice in many ways:
s = 1..10 |> Enum.to_list() |> Series.from_list()
#Explorer.Series<
Polars[10]
s64 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>
s[1]
2
s[-1]
10
s[0..4]
#Explorer.Series<
Polars[5]
s64 [1, 2, 3, 4, 5]
>
s[[0, 4, 4]]
#Explorer.Series<
Polars[3]
s64 [1, 5, 5]
>
And of course, you can convert back to an Elixir list.
Series.to_list(s)
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Explorer
comparisons return boolean series. We will talk more
about boolean series later.
s = 1..11 |> Enum.to_list() |> Series.from_list()
#Explorer.Series<
Polars[11]
s64 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
>
s1 = 11..1//-1 |> Enum.to_list() |> Series.from_list()
#Explorer.Series<
Polars[11]
s64 [11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
>
Series.equal(s, s1)
#Explorer.Series<
Polars[11]
boolean [false, false, false, false, false, true, false, false, false, false, false]
>
Series.equal(s, 5)
#Explorer.Series<
Polars[11]
boolean [false, false, false, false, true, false, false, false, false, false, false]
>
Series.not_equal(s, 10)
#Explorer.Series<
Polars[11]
boolean [true, true, true, true, true, true, true, true, true, false, true]
>
Series.greater_equal(s, 4)
#Explorer.Series<
Polars[11]
boolean [false, false, false, true, true, true, true, true, true, true, true]
>
Explorer
supports arithmetic.
Series.add(s, s1)
#Explorer.Series<
Polars[11]
s64 [12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12]
>
Series.subtract(s, 4)
#Explorer.Series<
Polars[11]
s64 [-3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7]
>
Series.multiply(s, s1)
#Explorer.Series<
Polars[11]
s64 [11, 20, 27, 32, 35, 36, 35, 32, 27, 20, 11]
>
Remember those helpful errors? We've tried to add those throughout. So if you try to do arithmetic with mismatching dtypes:
s = Series.from_list([1, 2, 3])
s1 = Series.from_list([1.0, 2.0, 3.0])
Series.add(s, s1)
#Explorer.Series<
Polars[3]
f64 [2.0, 4.0, 6.0]
>
Just kidding! Integers and floats will downcast to floats. Let's try again:
s = Series.from_list([1, 2, 3])
s1 = Series.from_list(["a", "b", "c"])
Series.add(s, s1)
You can flip them around.
s = Series.from_list([1, 2, 3, 4])
Series.reverse(s)
#Explorer.Series<
Polars[4]
s64 [4, 3, 2, 1]
>
And sort.
1..100 |> Enum.to_list() |> Enum.shuffle() |> Series.from_list() |> Series.sort()
#Explorer.Series<
Polars[100]
s64 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,
26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49,
50, ...]
>
Or argsort.
s = 1..100 |> Enum.to_list() |> Enum.shuffle() |> Series.from_list()
ids = Series.argsort(s) |> Series.to_list()
[84, 8, 3, 96, 76, 91, 36, 87, 42, 94, 69, 27, 52, 80, 34, 13, 28, 38, 89, 61, 21, 56, 55, 82, 22,
31, 4, 45, 49, 12, 30, 41, 92, 70, 83, 77, 79, 1, 26, 50, 75, 97, 71, 64, 68, 47, 44, 95, 2, 51,
...]
Which you can pass to Explorer.Series.slice/2
if you want the sorted values.
Series.slice(s, ids)
#Explorer.Series<
Polars[100]
s64 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,
26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49,
50, ...]
>
You can calculate cumulative values.
s = 1..100 |> Enum.to_list() |> Series.from_list()
Series.cumulative_sum(s)
#Explorer.Series<
Polars[100]
s64 [1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 66, 78, 91, 105, 120, 136, 153, 171, 190, 210, 231, 253,
276, 300, 325, 351, 378, 406, 435, 465, 496, 528, 561, 595, 630, 666, 703, 741, 780, 820, 861,
903, 946, 990, 1035, 1081, 1128, 1176, 1225, 1275, ...]
>
Or rolling ones.
Series.window_sum(s, 4)
#Explorer.Series<
Polars[100]
s64 [1, 3, 6, 10, 14, 18, 22, 26, 30, 34, 38, 42, 46, 50, 54, 58, 62, 66, 70, 74, 78, 82, 86, 90,
94, 98, 102, 106, 110, 114, 118, 122, 126, 130, 134, 138, 142, 146, 150, 154, 158, 162, 166, 170,
174, 178, 182, 186, 190, 194, ...]
>
You can count and list unique values.
s = Series.from_list(["a", "b", "b", "c", "c", "c"])
Series.distinct(s)
#Explorer.Series<
Polars[3]
string ["a", "b", "c"]
>
Series.n_distinct(s)
3
And you can even get a dataframe showing the frequencies for each distinct value.
Series.frequencies(s)
#Explorer.DataFrame<
Polars[3 x 2]
values string ["c", "b", "a"]
counts u32 [3, 2, 1]
>
Back to those boolean series returned by comparison functions like equal
and not_equal
.
These boolean series can be combined with other functions to perform conditional operations.
s1 = Series.from_list(["It", "was", "the", "best", "of", "times"])
s1 |> Series.equal("best") |> Series.select("worst", s1)
#Explorer.Series<
Polars[6]
string ["It", "was", "the", "worst", "of", "times"]
>
Working with DataFrames
A DataFrame
is really just a collection of Series
of the same size. Which is why you can create a DataFrame
from a Keyword
list.
DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
#Explorer.DataFrame<
Polars[3 x 2]
a s64 [1, 2, 3]
b string ["a", "b", "c"]
>
Similarly to Series
, the Inspect
implementation prints some info at the top and to the left. At the top we see the shape of the dataframe (rows and columns) and then for each column we see the name, dtype, and first five values. We can see a bit more from that built-in dataset we loaded in earlier.
df
#Explorer.DataFrame<
Polars[1094 x 10]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total s64 [2308, 1254, 32500, 141, 7924, ...]
solid_fuel s64 [627, 117, 332, 0, 0, ...]
liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
gas_fuel s64 [74, 7, 14565, 0, 374, ...]
cement s64 [5, 177, 2598, 0, 204, ...]
gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>
You will also see grouping information there, but we'll get to that later. You can get the info yourself directly:
DataFrame.names(df)
["year", "country", "total", "solid_fuel", "liquid_fuel", "gas_fuel", "cement", "gas_flaring",
"per_capita", "bunker_fuels"]
DataFrame.dtypes(df)
%{
"bunker_fuels" => {:s, 64},
"cement" => {:s, 64},
"country" => :string,
"gas_flaring" => {:s, 64},
"gas_fuel" => {:s, 64},
"liquid_fuel" => {:s, 64},
"per_capita" => {:f, 64},
"solid_fuel" => {:s, 64},
"total" => {:s, 64},
"year" => {:s, 64}
}
DataFrame.shape(df)
{1094, 10}
{DataFrame.n_rows(df), DataFrame.n_columns(df)}
{1094, 10}
We can grab the head.
DataFrame.head(df)
#Explorer.DataFrame<
Polars[5 x 10]
year s64 [2010, 2010, 2010, 2010, 2010]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA"]
total s64 [2308, 1254, 32500, 141, 7924]
solid_fuel s64 [627, 117, 332, 0, 0]
liquid_fuel s64 [1601, 953, 12381, 141, 3649]
gas_fuel s64 [74, 7, 14565, 0, 374]
cement s64 [5, 177, 2598, 0, 204]
gas_flaring s64 [0, 0, 2623, 0, 3697]
per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37]
bunker_fuels s64 [9, 7, 663, 0, 321]
>
Or the tail. Let's get a few more values from the tail.
DataFrame.tail(df, 10)
#Explorer.DataFrame<
Polars[10 x 10]
year s64 [2014, 2014, 2014, 2014, 2014, ...]
country string ["UNITED STATES OF AMERICA", "URUGUAY", "UZBEKISTAN", "VANUATU", "VENEZUELA", ...]
total s64 [1432855, 1840, 28692, 42, 50510, ...]
solid_fuel s64 [450047, 2, 1677, 0, 204, ...]
liquid_fuel s64 [576531, 1700, 2086, 42, 28445, ...]
gas_fuel s64 [390719, 25, 23929, 0, 12731, ...]
cement s64 [11314, 112, 1000, 0, 1088, ...]
gas_flaring s64 [4244, 0, 0, 0, 8042, ...]
per_capita f64 [4.43, 0.54, 0.97, 0.16, 1.65, ...]
bunker_fuels s64 [30722, 251, 0, 10, 1256, ...]
>
Verbs and macros
In Explorer
, like in dplyr
, we have five main verbs to work with dataframes:
- select
- filter
- mutate
- sort
- summarise
We are going to explore then in this notebook, but first we need to "require"
the Explorer.DataFrame
module in order to load the macros needed for these verbs.
I want to take the opportunity to create a shorter alias for the DataFrame
module,
called DF
:
require DataFrame, as: DF
Explorer.DataFrame
From now on we are using the shorter version, DF
, to refer to the required Explorer.DataFrame
module.
Select
Let's jump right into it. We can select columns pretty simply.
DF.select(df, ["year", "country"])
#Explorer.DataFrame<
Polars[1094 x 2]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
>
But Elixir gives us some superpowers. In R there's tidy-select
. I don't think we need that in Elixir. Anywhere in Explorer
where you need to pass a list of column names, you can also execute a filtering callback on the column names. It's just an anonymous function passed to df |> DataFrame.names() |> Enum.filter(callback_here)
.
DF.select(df, &String.ends_with?(&1, "fuel"))
#Explorer.DataFrame<
Polars[1094 x 3]
solid_fuel s64 [627, 117, 332, 0, 0, ...]
liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
gas_fuel s64 [74, 7, 14565, 0, 374, ...]
>
Want all but some columns? discard/2
performs the opposite of select/2
.
DF.discard(df, &String.ends_with?(&1, "fuel"))
#Explorer.DataFrame<
Polars[1094 x 7]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total s64 [2308, 1254, 32500, 141, 7924, ...]
cement s64 [5, 177, 2598, 0, 204, ...]
gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>
Filter
The next verb we'll look at is filter
.
This is implemented using a macro, so it's possible to use expressions like you would if comparing variables in Elixir:
DF.filter(df, country == "BRAZIL")
#Explorer.DataFrame<
Polars[5 x 10]
year s64 [2010, 2011, 2012, 2013, 2014]
country string ["BRAZIL", "BRAZIL", "BRAZIL", "BRAZIL", "BRAZIL"]
total s64 [114468, 119829, 128178, 137354, 144480]
solid_fuel s64 [15965, 17498, 17165, 18773, 20089]
liquid_fuel s64 [74689, 78849, 84409, 88898, 92454]
gas_fuel s64 [14372, 13778, 16328, 19399, 21297]
cement s64 [8040, 8717, 9428, 9517, 9691]
gas_flaring s64 [1402, 987, 848, 767, 949]
per_capita f64 [0.58, 0.6, 0.63, 0.67, 0.7]
bunker_fuels s64 [5101, 5516, 5168, 4895, 4895]
>
Using complex filters is also possible:
DF.filter(df, country == "ALGERIA" and year > 2012)
#Explorer.DataFrame<
Polars[2 x 10]
year s64 [2013, 2014]
country string ["ALGERIA", "ALGERIA"]
total s64 [36669, 39651]
solid_fuel s64 [198, 149]
liquid_fuel s64 [14170, 14422]
gas_fuel s64 [17863, 20151]
cement s64 [2516, 2856]
gas_flaring s64 [1922, 2073]
per_capita f64 [0.96, 1.02]
bunker_fuels s64 [687, 581]
>
You can also write the same filter without the macro, by using the callback version function which is filter_with/2
:
DF.filter_with(df, fn ldf ->
ldf["country"]
|> Series.equal("ALGERIA")
|> Series.and(Series.greater(ldf["year"], 2012))
end)
#Explorer.DataFrame<
Polars[2 x 10]
year s64 [2013, 2014]
country string ["ALGERIA", "ALGERIA"]
total s64 [36669, 39651]
solid_fuel s64 [198, 149]
liquid_fuel s64 [14170, 14422]
gas_fuel s64 [17863, 20151]
cement s64 [2516, 2856]
gas_flaring s64 [1922, 2073]
per_capita f64 [0.96, 1.02]
bunker_fuels s64 [687, 581]
>
By the way, all the Explorer.DataFrame
macros have a correspondent function that accepts a callback.
In fact, our macros are implemented using those functions.
The filter_with/2
function is going to use a virtual representation of the dataframe
that we call a "lazy frame". With lazy frames you can´t access the
series contents, but every operation will be optimized and run only once.
Remember those helpful error messages?
DF.filter(df, cuontry == "BRAZIL")
Mutate
A common task in data analysis is to add columns or change existing ones. Mutate is a handy verb.
DF.mutate(df, new_column: solid_fuel + cement)
#Explorer.DataFrame<
Polars[1094 x 11]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total s64 [2308, 1254, 32500, 141, 7924, ...]
solid_fuel s64 [627, 117, 332, 0, 0, ...]
liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
gas_fuel s64 [74, 7, 14565, 0, 374, ...]
cement s64 [5, 177, 2598, 0, 204, ...]
gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels s64 [9, 7, 663, 0, 321, ...]
new_column s64 [632, 294, 2930, 0, 204, ...]
>
Did you catch that? You can pass in new columns as keyword arguments. It also works to transform existing columns.
DF.mutate(df,
gas_fuel: Series.cast(gas_fuel, :float),
gas_and_liquid_fuel: gas_fuel + liquid_fuel
)
#Explorer.DataFrame<
Polars[1094 x 11]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total s64 [2308, 1254, 32500, 141, 7924, ...]
solid_fuel s64 [627, 117, 332, 0, 0, ...]
liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
gas_fuel f64 [74.0, 7.0, 14565.0, 0.0, 374.0, ...]
cement s64 [5, 177, 2598, 0, 204, ...]
gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels s64 [9, 7, 663, 0, 321, ...]
gas_and_liquid_fuel s64 [1675, 960, 26946, 141, 4023, ...]
>
DataFrame.mutate/2
is flexible though. You may not always want to use keyword arguments. Given that column names are String.t()
, it may make more sense to use a map.
DF.mutate(df, %{"gas_fuel" => gas_fuel - 10})
#Explorer.DataFrame<
Polars[1094 x 10]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total s64 [2308, 1254, 32500, 141, 7924, ...]
solid_fuel s64 [627, 117, 332, 0, 0, ...]
liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
gas_fuel s64 [64, -3, 14555, -10, 364, ...]
cement s64 [5, 177, 2598, 0, 204, ...]
gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>
DF.transmute/2
, which is DF.mutate/2
that only retains the specified columns, is forthcoming.
Sort
Sorting the dataframe is pretty straightforward.
DF.sort_by(df, year)
#Explorer.DataFrame<
Polars[1094 x 10]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total s64 [2308, 1254, 32500, 141, 7924, ...]
solid_fuel s64 [627, 117, 332, 0, 0, ...]
liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
gas_fuel s64 [74, 7, 14565, 0, 374, ...]
cement s64 [5, 177, 2598, 0, 204, ...]
gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>
But it comes with some tricks up its sleeve.
DF.sort_by(df, asc: total, desc: year)
#Explorer.DataFrame<
Polars[1094 x 10]
year s64 [2010, 2013, 2012, 2011, 2011, ...]
country string ["NIUE", "NIUE", "NIUE", "NIUE", "TUVALU", ...]
total s64 [1, 2, 2, 2, 2, ...]
solid_fuel s64 [0, 0, 0, 0, 0, ...]
liquid_fuel s64 [1, 2, 2, 2, 2, ...]
gas_fuel s64 [0, 0, 0, 0, 0, ...]
cement s64 [0, 0, 0, 0, 0, ...]
gas_flaring s64 [0, 0, 0, 0, 0, ...]
per_capita f64 [0.52, 1.04, 1.04, 1.04, 0.0, ...]
bunker_fuels s64 [0, 0, 0, 0, 0, ...]
>
As the examples show, sort_by/2
is a macro, and therefore you can use some Series functions to sort your dataframe:
DF.sort_by(df, asc: window_sum(total, 2))
#Explorer.DataFrame<
Polars[1094 x 10]
year s64 [2010, 2011, 2012, 2010, 2011, ...]
country string ["FEDERATED STATES OF MICRONESIA", "FEDERATED STATES OF MICRONESIA",
"FEDERATED STATES OF MICRONESIA", "TUVALU", "TUVALU", ...]
total s64 [31, 33, 37, 2, 2, ...]
solid_fuel s64 [0, 0, 0, 0, 0, ...]
liquid_fuel s64 [31, 33, 37, 2, 2, ...]
gas_fuel s64 [0, 0, 0, 0, 0, ...]
cement s64 [0, 0, 0, 0, 0, ...]
gas_flaring s64 [0, 0, 0, 0, 0, ...]
per_capita f64 [0.3, 0.32, 0.36, 0.0, 0.0, ...]
bunker_fuels s64 [1, 1, 1, 0, 0, ...]
>
Sort operations happen left to right. And keyword list args permit specifying the direction.
Distinct
Okay, as expected here too. Very straightforward.
DF.distinct(df, ["year", "country"])
#Explorer.DataFrame<
Polars[1094 x 2]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
>
You can specify whether to keep the other columns as well, so the first row of each distinct value is kept:
DF.distinct(df, ["country"], keep_all: true)
#Explorer.DataFrame<
Polars[222 x 10]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total s64 [2308, 1254, 32500, 141, 7924, ...]
solid_fuel s64 [627, 117, 332, 0, 0, ...]
liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
gas_fuel s64 [74, 7, 14565, 0, 374, ...]
cement s64 [5, 177, 2598, 0, 204, ...]
gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>
Rename
Rename can take either a list of new names or a callback that is passed to Enum.map/2
against the names. You can also use a map or keyword args to rename specific columns.
DF.rename(df, year: "year_test")
#Explorer.DataFrame<
Polars[1094 x 10]
year_test s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total s64 [2308, 1254, 32500, 141, 7924, ...]
solid_fuel s64 [627, 117, 332, 0, 0, ...]
liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
gas_fuel s64 [74, 7, 14565, 0, 374, ...]
cement s64 [5, 177, 2598, 0, 204, ...]
gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>
DF.rename_with(df, &(&1 <> "_test"))
#Explorer.DataFrame<
Polars[1094 x 10]
year_test s64 [2010, 2010, 2010, 2010, 2010, ...]
country_test string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total_test s64 [2308, 1254, 32500, 141, 7924, ...]
solid_fuel_test s64 [627, 117, 332, 0, 0, ...]
liquid_fuel_test s64 [1601, 953, 12381, 141, 3649, ...]
gas_fuel_test s64 [74, 7, 14565, 0, 374, ...]
cement_test s64 [5, 177, 2598, 0, 204, ...]
gas_flaring_test s64 [0, 0, 2623, 0, 3697, ...]
per_capita_test f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels_test s64 [9, 7, 663, 0, 321, ...]
>
Dummies
This is fun! We can get dummy variables for unique values.
DF.dummies(df, ["year"])
#Explorer.DataFrame<
Polars[1094 x 5]
year_2010 u8 [1, 1, 1, 1, 1, ...]
year_2011 u8 [0, 0, 0, 0, 0, ...]
year_2012 u8 [0, 0, 0, 0, 0, ...]
year_2013 u8 [0, 0, 0, 0, 0, ...]
year_2014 u8 [0, 0, 0, 0, 0, ...]
>
DF.dummies(df, ["country"])
#Explorer.DataFrame<
Polars[1094 x 222]
country_AFGHANISTAN u8 [1, 0, 0, 0, 0, ...]
country_ALBANIA u8 [0, 1, 0, 0, 0, ...]
country_ALGERIA u8 [0, 0, 1, 0, 0, ...]
country_ANDORRA u8 [0, 0, 0, 1, 0, ...]
country_ANGOLA u8 [0, 0, 0, 0, 1, ...]
country_ANGUILLA u8 [0, 0, 0, 0, 0, ...]
country_ANTIGUA & BARBUDA u8 [0, 0, 0, 0, 0, ...]
country_ARGENTINA u8 [0, 0, 0, 0, 0, ...]
country_ARMENIA u8 [0, 0, 0, 0, 0, ...]
country_ARUBA u8 [0, 0, 0, 0, 0, ...]
country_AUSTRALIA u8 [0, 0, 0, 0, 0, ...]
country_AUSTRIA u8 [0, 0, 0, 0, 0, ...]
country_AZERBAIJAN u8 [0, 0, 0, 0, 0, ...]
country_BAHAMAS u8 [0, 0, 0, 0, 0, ...]
country_BAHRAIN u8 [0, 0, 0, 0, 0, ...]
country_BANGLADESH u8 [0, 0, 0, 0, 0, ...]
country_BARBADOS u8 [0, 0, 0, 0, 0, ...]
country_BELARUS u8 [0, 0, 0, 0, 0, ...]
country_BELGIUM u8 [0, 0, 0, 0, 0, ...]
country_BELIZE u8 [0, 0, 0, 0, 0, ...]
country_BENIN u8 [0, 0, 0, 0, 0, ...]
country_BERMUDA u8 [0, 0, 0, 0, 0, ...]
country_BHUTAN u8 [0, 0, 0, 0, 0, ...]
country_BOSNIA & HERZEGOVINA u8 [0, 0, 0, 0, 0, ...]
country_BOTSWANA u8 [0, 0, 0, 0, 0, ...]
country_BRAZIL u8 [0, 0, 0, 0, 0, ...]
country_BRITISH VIRGIN ISLANDS u8 [0, 0, 0, 0, 0, ...]
country_BRUNEI (DARUSSALAM) u8 [0, 0, 0, 0, 0, ...]
country_BULGARIA u8 [0, 0, 0, 0, 0, ...]
country_BURKINA FASO u8 [0, 0, 0, 0, 0, ...]
country_BURUNDI u8 [0, 0, 0, 0, 0, ...]
country_CAMBODIA u8 [0, 0, 0, 0, 0, ...]
country_CANADA u8 [0, 0, 0, 0, 0, ...]
country_CAPE VERDE u8 [0, 0, 0, 0, 0, ...]
country_CAYMAN ISLANDS u8 [0, 0, 0, 0, 0, ...]
country_CENTRAL AFRICAN REPUBLIC u8 [0, 0, 0, 0, 0, ...]
country_CHAD u8 [0, 0, 0, 0, 0, ...]
country_CHILE u8 [0, 0, 0, 0, 0, ...]
country_CHINA (MAINLAND) u8 [0, 0, 0, 0, 0, ...]
country_COLOMBIA u8 [0, 0, 0, 0, 0, ...]
country_COMOROS u8 [0, 0, 0, 0, 0, ...]
country_CONGO u8 [0, 0, 0, 0, 0, ...]
country_COOK ISLANDS u8 [0, 0, 0, 0, 0, ...]
country_COSTA RICA u8 [0, 0, 0, 0, 0, ...]
country_COTE D IVOIRE u8 [0, 0, 0, 0, 0, ...]
country_CROATIA u8 [0, 0, 0, 0, 0, ...]
country_CUBA u8 [0, 0, 0, 0, 0, ...]
country_CYPRUS u8 [0, 0, 0, 0, 0, ...]
country_CZECH REPUBLIC u8 [0, 0, 0, 0, 0, ...]
country_DEMOCRATIC PEOPLE S REPUBLIC OF KOREA u8 [0, 0, 0, 0, 0, ...]
country_DEMOCRATIC REPUBLIC OF THE CONGO (FORMERLY ZAIRE) u8 [0, 0, 0, 0, 0, ...]
country_DENMARK u8 [0, 0, 0, 0, 0, ...]
country_DJIBOUTI u8 [0, 0, 0, 0, 0, ...]
country_DOMINICA u8 [0, 0, 0, 0, 0, ...]
country_DOMINICAN REPUBLIC u8 [0, 0, 0, 0, 0, ...]
country_ECUADOR u8 [0, 0, 0, 0, 0, ...]
country_EGYPT u8 [0, 0, 0, 0, 0, ...]
country_EL SALVADOR u8 [0, 0, 0, 0, 0, ...]
country_EQUATORIAL GUINEA u8 [0, 0, 0, 0, 0, ...]
country_ERITREA u8 [0, 0, 0, 0, 0, ...]
country_ESTONIA u8 [0, 0, 0, 0, 0, ...]
country_ETHIOPIA u8 [0, 0, 0, 0, 0, ...]
country_FAEROE ISLANDS u8 [0, 0, 0, 0, 0, ...]
country_FALKLAND ISLANDS (MALVINAS) u8 [0, 0, 0, 0, 0, ...]
country_FEDERATED STATES OF MICRONESIA u8 [0, 0, 0, 0, 0, ...]
country_FIJI u8 [0, 0, 0, 0, 0, ...]
country_FINLAND u8 [0, 0, 0, 0, 0, ...]
country_FRANCE (INCLUDING MONACO) u8 [0, 0, 0, 0, 0, ...]
country_FRENCH GUIANA u8 [0, 0, 0, 0, 0, ...]
country_FRENCH POLYNESIA u8 [0, 0, 0, 0, 0, ...]
country_GABON u8 [0, 0, 0, 0, 0, ...]
country_GAMBIA u8 [0, 0, 0, 0, 0, ...]
country_GEORGIA u8 [0, 0, 0, 0, 0, ...]
country_GERMANY u8 [0, 0, 0, 0, 0, ...]
country_GHANA u8 [0, 0, 0, 0, 0, ...]
country_GIBRALTAR u8 [0, 0, 0, 0, 0, ...]
country_GREECE u8 [0, 0, 0, 0, 0, ...]
country_GREENLAND u8 [0, 0, 0, 0, 0, ...]
country_GRENADA u8 [0, 0, 0, 0, 0, ...]
country_GUADELOUPE u8 [0, 0, 0, 0, 0, ...]
country_GUATEMALA u8 [0, 0, 0, 0, 0, ...]
country_GUINEA u8 [0, 0, 0, 0, 0, ...]
country_GUINEA BISSAU u8 [0, 0, 0, 0, 0, ...]
country_GUYANA u8 [0, 0, 0, 0, 0, ...]
country_HAITI u8 [0, 0, 0, 0, 0, ...]
country_HONDURAS u8 [0, 0, 0, 0, 0, ...]
country_HONG KONG SPECIAL ADMINSTRATIVE REGION OF CHINA u8 [0, 0, 0, 0, 0, ...]
country_HUNGARY u8 [0, 0, 0, 0, 0, ...]
country_ICELAND u8 [0, 0, 0, 0, 0, ...]
country_INDIA u8 [0, 0, 0, 0, 0, ...]
country_INDONESIA u8 [0, 0, 0, 0, 0, ...]
country_IRAQ u8 [0, 0, 0, 0, 0, ...]
country_IRELAND u8 [0, 0, 0, 0, 0, ...]
country_ISLAMIC REPUBLIC OF IRAN u8 [0, 0, 0, 0, 0, ...]
country_ISRAEL u8 [0, 0, 0, 0, 0, ...]
country_ITALY (INCLUDING SAN MARINO) u8 [0, 0, 0, 0, 0, ...]
country_JAMAICA u8 [0, 0, 0, 0, 0, ...]
country_JAPAN u8 [0, 0, 0, 0, 0, ...]
country_JORDAN u8 [0, 0, 0, 0, 0, ...]
country_KAZAKHSTAN u8 [0, 0, 0, 0, 0, ...]
country_KENYA u8 [0, 0, 0, 0, 0, ...]
country_KIRIBATI u8 [0, 0, 0, 0, 0, ...]
country_KUWAIT u8 [0, 0, 0, 0, 0, ...]
country_KYRGYZSTAN u8 [0, 0, 0, 0, 0, ...]
country_LAO PEOPLE S DEMOCRATIC REPUBLIC u8 [0, 0, 0, 0, 0, ...]
country_LATVIA u8 [0, 0, 0, 0, 0, ...]
country_LEBANON u8 [0, 0, 0, 0, 0, ...]
country_LESOTHO u8 [0, 0, 0, 0, 0, ...]
country_LIBERIA u8 [0, 0, 0, 0, 0, ...]
country_LIBYAN ARAB JAMAHIRIYAH u8 [0, 0, 0, 0, 0, ...]
country_LIECHTENSTEIN u8 [0, 0, 0, 0, 0, ...]
country_LITHUANIA u8 [0, 0, 0, 0, 0, ...]
country_LUXEMBOURG u8 [0, 0, 0, 0, 0, ...]
country_MACAU SPECIAL ADMINSTRATIVE REGION OF CHINA u8 [0, 0, 0, 0, 0, ...]
country_MACEDONIA u8 [0, 0, 0, 0, 0, ...]
country_MADAGASCAR u8 [0, 0, 0, 0, 0, ...]
country_MALAWI u8 [0, 0, 0, 0, 0, ...]
country_MALAYSIA u8 [0, 0, 0, 0, 0, ...]
country_MALDIVES u8 [0, 0, 0, 0, 0, ...]
country_MALI u8 [0, 0, 0, 0, 0, ...]
country_MALTA u8 [0, 0, 0, 0, 0, ...]
country_MARSHALL ISLANDS u8 [0, 0, 0, 0, 0, ...]
country_MARTINIQUE u8 [0, 0, 0, 0, 0, ...]
country_MAURITANIA u8 [0, 0, 0, 0, 0, ...]
country_MAURITIUS u8 [0, 0, 0, 0, 0, ...]
country_MEXICO u8 [0, 0, 0, 0, 0, ...]
country_MONGOLIA u8 [0, 0, 0, 0, 0, ...]
country_MONTENEGRO u8 [0, 0, 0, 0, 0, ...]
country_MONTSERRAT u8 [0, 0, 0, 0, 0, ...]
country_MOROCCO u8 [0, 0, 0, 0, 0, ...]
country_MOZAMBIQUE u8 [0, 0, 0, 0, 0, ...]
country_MYANMAR (FORMERLY BURMA) u8 [0, 0, 0, 0, 0, ...]
country_NAMIBIA u8 [0, 0, 0, 0, 0, ...]
country_NAURU u8 [0, 0, 0, 0, 0, ...]
country_NEPAL u8 [0, 0, 0, 0, 0, ...]
country_NETHERLAND ANTILLES u8 [0, 0, 0, 0, 0, ...]
country_NETHERLANDS u8 [0, 0, 0, 0, 0, ...]
country_NEW CALEDONIA u8 [0, 0, 0, 0, 0, ...]
country_NEW ZEALAND u8 [0, 0, 0, 0, 0, ...]
country_NICARAGUA u8 [0, 0, 0, 0, 0, ...]
country_NIGER u8 [0, 0, 0, 0, 0, ...]
country_NIGERIA u8 [0, 0, 0, 0, 0, ...]
country_NIUE u8 [0, 0, 0, 0, 0, ...]
country_NORWAY u8 [0, 0, 0, 0, 0, ...]
country_OCCUPIED PALESTINIAN TERRITORY u8 [0, 0, 0, 0, 0, ...]
country_OMAN u8 [0, 0, 0, 0, 0, ...]
country_PAKISTAN u8 [0, 0, 0, 0, 0, ...]
country_PALAU u8 [0, 0, 0, 0, 0, ...]
country_PANAMA u8 [0, 0, 0, 0, 0, ...]
country_PAPUA NEW GUINEA u8 [0, 0, 0, 0, 0, ...]
country_PARAGUAY u8 [0, 0, 0, 0, 0, ...]
country_PERU u8 [0, 0, 0, 0, 0, ...]
country_PHILIPPINES u8 [0, 0, 0, 0, 0, ...]
country_PLURINATIONAL STATE OF BOLIVIA u8 [0, 0, 0, 0, 0, ...]
country_POLAND u8 [0, 0, 0, 0, 0, ...]
country_PORTUGAL u8 [0, 0, 0, 0, 0, ...]
country_QATAR u8 [0, 0, 0, 0, 0, ...]
country_REPUBLIC OF CAMEROON u8 [0, 0, 0, 0, 0, ...]
country_REPUBLIC OF KOREA u8 [0, 0, 0, 0, 0, ...]
country_REPUBLIC OF MOLDOVA u8 [0, 0, 0, 0, 0, ...]
country_REUNION u8 [0, 0, 0, 0, 0, ...]
country_ROMANIA u8 [0, 0, 0, 0, 0, ...]
country_RUSSIAN FEDERATION u8 [0, 0, 0, 0, 0, ...]
country_RWANDA u8 [0, 0, 0, 0, 0, ...]
country_SAINT HELENA u8 [0, 0, 0, 0, 0, ...]
country_SAINT LUCIA u8 [0, 0, 0, 0, 0, ...]
country_SAMOA u8 [0, 0, 0, 0, 0, ...]
country_SAO TOME & PRINCIPE u8 [0, 0, 0, 0, 0, ...]
country_SAUDI ARABIA u8 [0, 0, 0, 0, 0, ...]
country_SENEGAL u8 [0, 0, 0, 0, 0, ...]
country_SERBIA u8 [0, 0, 0, 0, 0, ...]
country_SEYCHELLES u8 [0, 0, 0, 0, 0, ...]
country_SIERRA LEONE u8 [0, 0, 0, 0, 0, ...]
country_SINGAPORE u8 [0, 0, 0, 0, 0, ...]
country_SLOVAKIA u8 [0, 0, 0, 0, 0, ...]
country_SLOVENIA u8 [0, 0, 0, 0, 0, ...]
country_SOLOMON ISLANDS u8 [0, 0, 0, 0, 0, ...]
country_SOMALIA u8 [0, 0, 0, 0, 0, ...]
country_SOUTH AFRICA u8 [0, 0, 0, 0, 0, ...]
country_SPAIN u8 [0, 0, 0, 0, 0, ...]
country_SRI LANKA u8 [0, 0, 0, 0, 0, ...]
country_ST. KITTS-NEVIS u8 [0, 0, 0, 0, 0, ...]
country_ST. PIERRE & MIQUELON u8 [0, 0, 0, 0, 0, ...]
country_ST. VINCENT & THE GRENADINES u8 [0, 0, 0, 0, 0, ...]
country_SUDAN u8 [0, 0, 0, 0, 0, ...]
country_SURINAME u8 [0, 0, 0, 0, 0, ...]
country_SWAZILAND u8 [0, 0, 0, 0, 0, ...]
country_SWEDEN u8 [0, 0, 0, 0, 0, ...]
country_SWITZERLAND u8 [0, 0, 0, 0, 0, ...]
country_SYRIAN ARAB REPUBLIC u8 [0, 0, 0, 0, 0, ...]
country_TAIWAN u8 [0, 0, 0, 0, 0, ...]
country_TAJIKISTAN u8 [0, 0, 0, 0, 0, ...]
country_THAILAND u8 [0, 0, 0, 0, 0, ...]
country_TIMOR-LESTE (FORMERLY EAST TIMOR) u8 [0, 0, 0, 0, 0, ...]
country_TOGO u8 [0, 0, 0, 0, 0, ...]
country_TONGA u8 [0, 0, 0, 0, 0, ...]
country_TRINIDAD AND TOBAGO u8 [0, 0, 0, 0, 0, ...]
country_TUNISIA u8 [0, 0, 0, 0, 0, ...]
country_TURKEY u8 [0, 0, 0, 0, 0, ...]
country_TURKMENISTAN u8 [0, 0, 0, 0, 0, ...]
country_TURKS AND CAICOS ISLANDS u8 [0, 0, 0, 0, 0, ...]
country_TUVALU u8 [0, 0, 0, 0, 0, ...]
country_UGANDA u8 [0, 0, 0, 0, 0, ...]
country_UKRAINE u8 [0, 0, 0, 0, 0, ...]
country_UNITED ARAB EMIRATES u8 [0, 0, 0, 0, 0, ...]
country_UNITED KINGDOM u8 [0, 0, 0, 0, 0, ...]
country_UNITED REPUBLIC OF TANZANIA u8 [0, 0, 0, 0, 0, ...]
country_UNITED STATES OF AMERICA u8 [0, 0, 0, 0, 0, ...]
country_URUGUAY u8 [0, 0, 0, 0, 0, ...]
country_UZBEKISTAN u8 [0, 0, 0, 0, 0, ...]
country_VANUATU u8 [0, 0, 0, 0, 0, ...]
country_VENEZUELA u8 [0, 0, 0, 0, 0, ...]
country_VIET NAM u8 [0, 0, 0, 0, 0, ...]
country_WALLIS AND FUTUNA ISLANDS u8 [0, 0, 0, 0, 0, ...]
country_YEMEN u8 [0, 0, 0, 0, 0, ...]
country_ZAMBIA u8 [0, 0, 0, 0, 0, ...]
country_ZIMBABWE u8 [0, 0, 0, 0, 0, ...]
country_BONAIRE, SAINT EUSTATIUS, AND SABA u8 [0, 0, 0, 0, 0, ...]
country_CURACAO u8 [0, 0, 0, 0, 0, ...]
country_REPUBLIC OF SOUTH SUDAN u8 [0, 0, 0, 0, 0, ...]
country_REPUBLIC OF SUDAN u8 [0, 0, 0, 0, 0, ...]
country_SAINT MARTIN (DUTCH PORTION) u8 [0, 0, 0, 0, 0, ...]
>
Sampling
Random samples can give us a percent or a specific number of samples, with or without replacement, and the function is seedable.
DF.sample(df, 10)
#Explorer.DataFrame<
Polars[10 x 10]
year s64 [2011, 2012, 2012, 2010, 2012, ...]
country string ["JAPAN", "FAEROE ISLANDS", "MALAWI", "LAO PEOPLE S DEMOCRATIC REPUBLIC",
"LIBYAN ARAB JAMAHIRIYAH", ...]
total s64 [324809, 161, 299, 447, 14367, ...]
solid_fuel s64 [111852, 0, 59, 181, 0, ...]
liquid_fuel s64 [143467, 161, 217, 104, 9113, ...]
gas_fuel s64 [62514, 0, 0, 0, 2923, ...]
cement s64 [6976, 0, 23, 163, 272, ...]
gas_flaring s64 [0, 0, 0, 0, 2059, ...]
per_capita f64 [2.55, 3.32, 0.02, 0.07, 2.29, ...]
bunker_fuels s64 [8406, 27, 6, 9, 484, ...]
>
DF.sample(df, 0.4)
#Explorer.DataFrame<
Polars[437 x 10]
year s64 [2011, 2013, 2012, 2012, 2011, ...]
country string ["GHANA", "GIBRALTAR", "KAZAKHSTAN", "LIBYAN ARAB JAMAHIRIYAH", "CONGO", ...]
total s64 [2681, 134, 66259, 14367, 616, ...]
solid_fuel s64 [0, 0, 39243, 0, 0, ...]
liquid_fuel s64 [1909, 134, 8834, 9113, 529, ...]
gas_fuel s64 [432, 0, 17223, 2923, 77, ...]
cement s64 [340, 0, 959, 272, 10, ...]
gas_flaring s64 [0, 0, 0, 2059, 0, ...]
per_capita f64 [0.11, 4.22, 3.94, 2.29, 0.15, ...]
bunker_fuels s64 [178, 3370, 168, 484, 42, ...]
>
Trying for those helpful error messages again.
DF.sample(df, 10000)
DF.sample(df, 10000, replace: true)
#Explorer.DataFrame<
Polars[10000 x 10]
year s64 [2012, 2011, 2010, 2011, 2012, ...]
country string ["PARAGUAY", "REUNION", "KUWAIT", "SOUTH AFRICA", "SEYCHELLES", ...]
total s64 [1441, 1165, 24441, 128329, 120, ...]
solid_fuel s64 [1, 495, 0, 109120, 0, ...]
liquid_fuel s64 [1330, 636, 16377, 15279, 120, ...]
gas_fuel s64 [0, 0, 7678, 2402, 0, ...]
cement s64 [109, 34, 272, 1528, 0, ...]
gas_flaring s64 [0, 0, 114, 0, 0, ...]
per_capita f64 [0.23, 1.39, 7.99, 2.46, 1.27, ...]
bunker_fuels s64 [23, 61, 1072, 3458, 162, ...]
>
Pull and slice
Slicing and dicing can be done with the Access
protocol or with explicit pull/slice/take functions.
df["year"]
#Explorer.Series<
Polars[1094]
s64 [2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010,
2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010,
2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010,
2010, 2010, 2010, ...]
>
DF.pull(df, "year")
#Explorer.Series<
Polars[1094]
s64 [2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010,
2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010,
2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010,
2010, 2010, 2010, ...]
>
df[["year", "country"]]
#Explorer.DataFrame<
Polars[1094 x 2]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
>
DF.slice(df, [1, 20, 50])
#Explorer.DataFrame<
Polars[3 x 10]
year s64 [2010, 2010, 2010]
country string ["ALBANIA", "BENIN", "DEMOCRATIC REPUBLIC OF THE CONGO (FORMERLY ZAIRE)"]
total s64 [1254, 1388, 551]
solid_fuel s64 [117, 0, 0]
liquid_fuel s64 [953, 1211, 471]
gas_fuel s64 [7, 0, 12]
cement s64 [177, 177, 67]
gas_flaring s64 [0, 0, 0]
per_capita f64 [0.43, 0.15, 0.01]
bunker_fuels s64 [7, 127, 126]
>
Negative offsets work for slice!
DF.slice(df, -10, 5)
#Explorer.DataFrame<
Polars[5 x 10]
year s64 [2014, 2014, 2014, 2014, 2014]
country string ["UNITED STATES OF AMERICA", "URUGUAY", "UZBEKISTAN", "VANUATU", "VENEZUELA"]
total s64 [1432855, 1840, 28692, 42, 50510]
solid_fuel s64 [450047, 2, 1677, 0, 204]
liquid_fuel s64 [576531, 1700, 2086, 42, 28445]
gas_fuel s64 [390719, 25, 23929, 0, 12731]
cement s64 [11314, 112, 1000, 0, 1088]
gas_flaring s64 [4244, 0, 0, 0, 8042]
per_capita f64 [4.43, 0.54, 0.97, 0.16, 1.65]
bunker_fuels s64 [30722, 251, 0, 10, 1256]
>
DF.slice(df, 10, 5)
#Explorer.DataFrame<
Polars[5 x 10]
year s64 [2010, 2010, 2010, 2010, 2010]
country string ["AUSTRALIA", "AUSTRIA", "AZERBAIJAN", "BAHAMAS", "BAHRAIN"]
total s64 [106589, 18408, 8366, 451, 7981]
solid_fuel s64 [56257, 3537, 6, 1, 0]
liquid_fuel s64 [31308, 9218, 2373, 450, 1123]
gas_fuel s64 [17763, 5073, 4904, 0, 6696]
cement s64 [1129, 579, 174, 0, 163]
gas_flaring s64 [132, 0, 909, 0, 0]
per_capita f64 [4.81, 2.19, 0.92, 1.25, 6.33]
bunker_fuels s64 [3307, 575, 398, 179, 545]
>
Slice also works with ranges:
DF.slice(df, 12..42)
#Explorer.DataFrame<
Polars[31 x 10]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AZERBAIJAN", "BAHAMAS", "BAHRAIN", "BANGLADESH", "BARBADOS", ...]
total s64 [8366, 451, 7981, 16345, 403, ...]
solid_fuel s64 [6, 1, 0, 839, 0, ...]
liquid_fuel s64 [2373, 450, 1123, 2881, 363, ...]
gas_fuel s64 [4904, 0, 6696, 10753, 8, ...]
cement s64 [174, 0, 163, 1873, 31, ...]
gas_flaring s64 [909, 0, 0, 0, 1, ...]
per_capita f64 [0.92, 1.25, 6.33, 0.11, 1.44, ...]
bunker_fuels s64 [398, 179, 545, 313, 108, ...]
>
Pivot
We can pivot_longer/3
and pivot_wider/4
. These are inspired by tidyr.
There are some shortcomings in pivot_wider/4
related to polars
. The select
option must select only columns of numeric type.
DF.pivot_longer(df, ["year", "country"], select: &String.ends_with?(&1, "fuel"))
DF.pivot_wider(df, "country", "total", id_columns: ["year"])
#Explorer.DataFrame<
Polars[5 x 223]
year s64 [2010, 2011, 2012, 2013, 2014]
AFGHANISTAN s64 [2308, 3338, 2933, 2731, 2675]
ALBANIA s64 [1254, 1429, 1339, 1381, 1559]
ALGERIA s64 [32500, 33048, 35448, 36669, 39651]
ANDORRA s64 [141, 134, 133, 130, 126]
ANGOLA s64 [7924, 8274, 9108, 8895, 9480]
ANGUILLA s64 [41, 39, 39, 37, 39]
ANTIGUA & BARBUDA s64 [143, 140, 143, 143, 145]
ARGENTINA s64 [51246, 52259, 52456, 51773, 55638]
ARMENIA s64 [1150, 1341, 1553, 1499, 1508]
ARUBA s64 [684, 682, 368, 235, 238]
AUSTRALIA s64 [106589, 106850, 105843, 101518, 98517]
AUSTRIA s64 [18408, 17731, 16982, 17040, 16011]
AZERBAIJAN s64 [8366, 9121, 9696, 9720, 10223]
BAHAMAS s64 [451, 509, 537, 764, 659]
BAHRAIN s64 [7981, 7813, 7274, 8539, 8546]
BANGLADESH s64 [16345, 17293, 18409, 19010, 19959]
BARBADOS s64 [403, 417, 401, 395, 347]
BELARUS s64 [17192, 17470, 17241, 17390, 17316]
BELGIUM s64 [30222, 27255, 25936, 26444, 25457]
BELIZE s64 [147, 164, 130, 140, 135]
BENIN s64 [1388, 1444, 1492, 1585, 1723]
BERMUDA s64 [166, 121, 130, 125, 157]
BHUTAN s64 [133, 200, 223, 251, 273]
BOSNIA & HERZEGOVINA s64 [5802, 6514, 6070, 5978, 6063]
BOTSWANA s64 [1278, 1139, 1154, 1426, 1918]
BRAZIL s64 [114468, 119829, 128178, 137354, 144480]
BRITISH VIRGIN ISLANDS s64 [47, 48, 48, 48, 49]
BRUNEI (DARUSSALAM) s64 [2237, 2644, 2636, 2128, 2484]
BULGARIA s64 [12030, 13457, 12192, 10799, 11567]
BURKINA FASO s64 [535, 603, 717, 834, 777]
BURUNDI s64 [58, 66, 77, 79, 120]
CAMBODIA s64 [1367, 1420, 1488, 1528, 1823]
CANADA s64 [145806, 146472, 141112, 141031, 146494]
CAPE VERDE s64 [152, 168, 138, 136, 134]
CAYMAN ISLANDS s64 [152, 160, 146, 146, 148]
CENTRAL AFRICAN REPUBLIC s64 [72, 76, 80, 81, 82]
CHAD s64 [141, 147, 167, 191, 199]
CHILE s64 [19703, 21610, 22082, 22696, 22515]
CHINA (MAINLAND) s64 [2393248, 2654360, 2734817, 2797384, 2806634]
COLOMBIA s64 [20773, 20870, 21803, 24441, 22932]
COMOROS s64 [44, 37, 39, 48, 42]
CONGO s64 [540, 616, 810, 842, 844]
COOK ISLANDS s64 [19, 19, 19, 19, 19]
COSTA RICA s64 [2064, 2111, 2118, 2072, 2116]
COTE D IVOIRE s64 [1900, 1977, 2535, 2914, 3012]
CROATIA s64 [5501, 5402, 4907, 4786, 4593]
CUBA s64 [10465, 9814, 9860, 9490, 9500]
CYPRUS s64 [2102, 2025, 1887, 1622, 1653]
CZECH REPUBLIC s64 [30428, 29154, 27551, 26909, 26309]
DEMOCRATIC PEOPLE S REPUBLIC OF KOREA s64 [18122, 13099, 13378, 9820, 11052]
DEMOCRATIC REPUBLIC OF THE CONGO (FORMERLY ZAIRE) s64 [551, 680, 655, 979, 1274]
DENMARK s64 [12719, 11084, 9934, 10508, 9135]
DJIBOUTI s64 [141, 129, 141, 166, 197]
DOMINICA s64 [38, 35, 37, 36, 37]
DOMINICAN REPUBLIC s64 [5733, 5789, 6033, 5847, 5874]
ECUADOR s64 [9943, 10529, 10401, 11180, 11977]
EGYPT s64 [55281, 59221, 59195, 58198, 55057]
EL SALVADOR s64 [1761, 1813, 1817, 1699, 1714]
EQUATORIAL GUINEA s64 [1276, 1671, 1395, 1408, 1458]
ERITREA s64 [140, 162, 180, 182, 190]
ESTONIA s64 [4938, 5074, 4806, 5425, 5323]
ETHIOPIA s64 [1796, 2107, 2335, 2900, 3163]
FAEROE ISLANDS s64 [172, 155, 161, 185, 163]
FALKLAND ISLANDS (MALVINAS) s64 [15, 15, 15, 15, 15]
FEDERATED STATES OF MICRONESIA s64 [31, 33, 37, 39, 41]
FIJI s64 [333, 297, 289, 314, 319]
FINLAND s64 [16930, 15494, 13399, 12877, 12899]
FRANCE (INCLUDING MONACO) s64 [96273, 90484, 90872, 91109, 82704]
FRENCH GUIANA s64 [174, 175, 161, 172, 200]
FRENCH POLYNESIA s64 [234, 227, 222, 224, 219]
GABON s64 [1312, 1356, 1392, 1440, 1416]
GAMBIA s64 [118, 122, 124, 118, 140]
GEORGIA s64 [1722, 2174, 2302, 2143, 2451]
GERMANY s64 [206943, 199754, 201762, 206521, 196314]
GHANA s64 [2715, 2681, 3239, 3987, 3945]
GIBRALTAR s64 [127, 123, 126, 134, 144]
GREECE s64 [22868, 21773, 21828, 18948, 18358]
GREENLAND s64 [181, 193, 155, 151, 138]
GRENADA s64 [71, 69, 74, 83, 66]
GUADELOUPE s64 [627, 663, 694, 697, 700]
GUATEMALA s64 [3181, 3228, 3265, 3718, 4998]
GUINEA s64 [710, 758, 704, 627, 668]
GUINEA BISSAU s64 [65, 67, 69, 70, 74]
GUYANA s64 [469, 486, 544, 528, 548]
HAITI s64 [580, 605, 631, 656, 780]
HONDURAS s64 [2175, 2442, 2450, 2472, 2583]
HONG KONG SPECIAL ADMINSTRATIVE REGION OF CHINA s64 [11093, 11943, 11842, 12273, 12605]
HUNGARY s64 [13696, 13047, 12158, 11492, 11477]
ICELAND s64 [535, 513, 491, 518, 541]
INDIA s64 [468964, 502257, 550451, 554882, 610411]
INDONESIA s64 [116924, 164621, 173733, 133686, 126582]
IRAQ s64 [30596, 36647, 41648, 45134, 45935]
IRELAND s64 [10923, 9717, 9706, 9505, 9290]
ISLAMIC REPUBLIC OF IRAN s64 [156267, 160637, 166828, 169015, 177115]
ISRAEL s64 [18784, 18852, 20597, 18290, 17617]
ITALY (INCLUDING SAN MARINO) s64 [110543, 108534, 100755, 94169, 87377]
JAMAICA s64 [1990, 2143, 2035, 2207, 2024]
JAPAN s64 [319505, 324809, 335470, 339928, 331074]
JORDAN s64 [5776, 5909, 6666, 6651, 7213]
KAZAKHSTAN s64 [67780, 70646, 66259, 71679, 67716]
KENYA s64 [3320, 3670, 3413, 3636, 3896]
KIRIBATI s64 [17, 17, 17, 17, 17]
KUWAIT s64 [24441, 24824, 27907, 26819, 26018]
KYRGYZSTAN s64 [1741, 2088, 2763, 2684, 2620]
LAO PEOPLE S DEMOCRATIC REPUBLIC s64 [447, 443, 463, 430, 533]
LATVIA s64 [2202, 1989, 1926, 1931, 1902]
LEBANON s64 [5467, 5575, 6172, 6158, 6564]
LESOTHO s64 [621, 636, 656, 664, 673]
LIBERIA s64 [216, 243, 280, 261, 255]
LIBYAN ARAB JAMAHIRIYAH s64 [16897, 10827, 14367, 15344, 15543]
LIECHTENSTEIN s64 [15, 13, 13, 14, 12]
LITHUANIA s64 [3673, 3760, 3772, 3447, 3501]
LUXEMBOURG s64 [2991, 2983, 2908, 2741, 2634]
MACAU SPECIAL ADMINSTRATIVE REGION OF CHINA s64 [384, 395, 358, 322, 350]
MACEDONIA s64 [2346, 2563, 2445, 2140, 2048]
MADAGASCAR s64 [534, 638, 738, 850, 839]
MALAWI s64 [312, 322, 299, 334, 348]
MALAYSIA s64 [59579, 60105, 59642, 64497, 66218]
MALDIVES s64 [255, 269, 303, 298, 364]
MALI s64 [263, 285, 271, 280, 385]
MALTA s64 [698, 693, 731, 638, 640]
MARSHALL ISLANDS s64 [28, 28, 28, 28, 28]
MARTINIQUE s64 [548, 605, 604, 603, 627]
MAURITANIA s64 [610, 653, 724, 728, 739]
MAURITIUS s64 [1068, 1069, 1082, 1110, 1153]
MEXICO s64 [126618, 132105, 135349, 133717, 130971]
MONGOLIA s64 [3769, 5863, 7152, 10568, 5683]
MONTENEGRO s64 [704, 701, 637, 613, 603]
MONTSERRAT s64 [18, 11, 12, 14, 13]
MOROCCO s64 [15260, 15731, 17107, 16112, 16325]
MOZAMBIQUE s64 [746, 879, 851, 1096, 2298]
MYANMAR (FORMERLY BURMA) s64 [3413, 3899, 3019, 3507, 5899]
NAMIBIA s64 [846, 772, 923, 717, 1024]
NAURU s64 [12, 11, 11, 12, 13]
NEPAL s64 [1379, 1509, 1597, 1810, 2190]
NETHERLAND ANTILLES s64 [1244, 1587, nil, nil, nil]
NETHERLANDS s64 [49919, 47496, 46444, 47247, 45624]
NEW CALEDONIA s64 [966, 995, 990, 1157, 1170]
NEW ZEALAND s64 [8667, 8591, 9313, 9124, 9453]
NICARAGUA s64 [1237, 1331, 1260, 1241, 1326]
NIGER s64 [320, 362, 509, 529, 580]
NIGERIA s64 [24957, 26096, 26862, 26762, 26256]
NIUE s64 [1, 2, 2, 2, 3]
NORWAY s64 [16391, 12325, 13605, 15861, 12988]
OCCUPIED PALESTINIAN TERRITORY s64 [555, 613, 600, 665, 774]
OMAN s64 [12931, 14734, 16133, 16738, 16681]
PAKISTAN s64 [44013, 44166, 44586, 44812, 45350]
PALAU s64 [69, 69, 69, 70, 71]
PANAMA s64 [2499, 2754, 2758, 2923, 2400]
PAPUA NEW GUINEA s64 [1299, 1453, 1385, 1687, 1723]
PARAGUAY s64 [1390, 1451, 1441, 1482, 1555]
PERU s64 [15706, 13535, 15018, 15586, 16838]
PHILIPPINES s64 [23144, 23315, 24872, 26760, 28812]
PLURINATIONAL STATE OF BOLIVIA s64 [4146, 4403, 5125, 5159, 5566]
POLAND s64 [86246, 86446, 81792, 82432, 77922]
PORTUGAL s64 [13127, 12987, 12548, 12388, 12286]
QATAR s64 [19773, 21935, 25668, 23186, 29412]
REPUBLIC OF CAMEROON s64 [1849, 1573, 1671, 1847, 1910]
REPUBLIC OF KOREA s64 [154545, 160731, 159249, 161576, 160119]
REPUBLIC OF MOLDOVA s64 [1345, 1374, 1343, 1363, 1345]
REUNION s64 [1137, 1165, 1159, 1118, 1138]
ROMANIA s64 [21656, 23147, 22286, 19347, 19090]
RUSSIAN FEDERATION s64 [455558, 480885, 499272, 485018, 465052]
RWANDA s64 [161, 181, 201, 219, 229]
SAINT HELENA s64 [3, 3, 3, 3, 3]
SAINT LUCIA s64 [110, 111, 111, 111, 111]
SAMOA s64 [51, 55, 54, 54, 54]
SAO TOME & PRINCIPE s64 [27, 28, 31, 31, 31]
SAUDI ARABIA s64 [141394, 136318, 154034, 147545, 163907]
SENEGAL s64 [2112, 2282, 2158, 2297, 2415]
SERBIA s64 [12532, 13422, 12016, 12240, 10272]
SEYCHELLES s64 [121, 93, 120, 110, 135]
SIERRA LEONE s64 [198, 245, 281, 325, 357]
SINGAPORE s64 [15174, 12332, 9919, 15183, 15373]
SLOVAKIA s64 [9883, 9415, 8935, 9024, 8366]
SLOVENIA s64 [4182, 4115, 4031, 3859, 3494]
SOLOMON ISLANDS s64 [54, 54, 54, 55, 55]
SOMALIA s64 [167, 165, 166, 166, 166]
SOUTH AFRICA s64 [129288, 128329, 127835, 127182, 133562]
SPAIN s64 [73878, 73779, 72206, 64640, 63806]
SRI LANKA s64 [3617, 4128, 4372, 4224, 5016]
ST. KITTS-NEVIS s64 [60, 63, 60, 61, 63]
ST. PIERRE & MIQUELON s64 [19, 19, 19, 20, 21]
ST. VINCENT & THE GRENADINES s64 [60, 54, 69, 57, 57]
SUDAN s64 [4347, 4270, nil, nil, nil]
SURINAME s64 [655, 537, 616, 523, 543]
SWAZILAND s64 [283, 286, 329, 297, 328]
SWEDEN s64 [14187, 14108, 12830, 12230, 11841]
SWITZERLAND s64 [10634, 10081, 10301, 10970, 9628]
SYRIAN ARAB REPUBLIC s64 [16800, 15519, 12198, 9937, 8373]
TAIWAN s64 [73629, 73406, 70393, 71022, 72013]
TAJIKISTAN s64 [694, 641, 800, 949, 1415]
THAILAND s64 [76882, 75898, 80883, 81835, 86232]
TIMOR-LESTE (FORMERLY EAST TIMOR) s64 [64, 67, 80, 120, 128]
TOGO s64 [720, 672, 678, 725, 715]
TONGA s64 [32, 28, 29, 31, 33]
TRINIDAD AND TOBAGO s64 [13072, 12799, 12386, 12692, 12619]
TUNISIA s64 [7543, 7096, 7364, 7545, 7862]
TURKEY s64 [81266, 87494, 89872, 88566, 94350]
TURKMENISTAN s64 [15623, 17035, 17691, 18199, 18659]
TURKS AND CAICOS ISLANDS s64 [52, 52, 54, 54, 56]
TUVALU s64 [2, 2, 3, 3, 3]
UGANDA s64 [1069, 1163, 1110, 1328, 1426]
UKRAINE s64 [83077, 78100, 80663, 74141, 61985]
UNITED ARAB EMIRATES s64 [43854, 45116, 48101, 46552, 57641]
UNITED KINGDOM s64 [134499, 122124, 127781, 124966, 114486]
UNITED REPUBLIC OF TANZANIA s64 [1938, 2207, 2603, 3048, 3153]
UNITED STATES OF AMERICA s64 [1471375, 1442509, 1396083, 1406916, 1432855]
URUGUAY s64 [1742, 2117, 2371, 2069, 1840]
UZBEKISTAN s64 [28407, 31002, 31583, 28185, 28692]
VANUATU s64 [33, 36, 31, 29, 42]
VENEZUELA s64 [51560, 48220, 54204, 50156, 50510]
VIET NAM s64 [38925, 41497, 38784, 40150, 45517]
WALLIS AND FUTUNA ISLANDS s64 [8, 7, 7, 6, 6]
YEMEN s64 [6390, 5363, 5091, 6953, 6190]
ZAMBIA s64 [734, 801, 1000, 1079, 1228]
ZIMBABWE s64 [2121, 2608, 2125, 3184, 3278]
BONAIRE, SAINT EUSTATIUS, AND SABA s64 [nil, nil, 85, 88, 88]
CURACAO s64 [nil, nil, 1636, 1422, 1604]
REPUBLIC OF SOUTH SUDAN s64 [nil, nil, 363, 395, 408]
REPUBLIC OF SUDAN s64 [nil, nil, 3993, 4220, 4190]
SAINT MARTIN (DUTCH PORTION) s64 [nil, nil, 190, 195, 200]
>
Let's make those names look nicer!
tidy_names = fn name ->
name
|> String.downcase()
|> String.replace(~r/\s/, " ")
|> String.replace(~r/[^A-Za-z\s]/, "")
|> String.replace(" ", "_")
end
df
|> DF.pivot_wider("country", "total", id_columns: ["year"])
|> DF.rename_with(tidy_names)
#Explorer.DataFrame<
Polars[5 x 223]
year s64 [2010, 2011, 2012, 2013, 2014]
afghanistan s64 [2308, 3338, 2933, 2731, 2675]
albania s64 [1254, 1429, 1339, 1381, 1559]
algeria s64 [32500, 33048, 35448, 36669, 39651]
andorra s64 [141, 134, 133, 130, 126]
angola s64 [7924, 8274, 9108, 8895, 9480]
anguilla s64 [41, 39, 39, 37, 39]
antigua__barbuda s64 [143, 140, 143, 143, 145]
argentina s64 [51246, 52259, 52456, 51773, 55638]
armenia s64 [1150, 1341, 1553, 1499, 1508]
aruba s64 [684, 682, 368, 235, 238]
australia s64 [106589, 106850, 105843, 101518, 98517]
austria s64 [18408, 17731, 16982, 17040, 16011]
azerbaijan s64 [8366, 9121, 9696, 9720, 10223]
bahamas s64 [451, 509, 537, 764, 659]
bahrain s64 [7981, 7813, 7274, 8539, 8546]
bangladesh s64 [16345, 17293, 18409, 19010, 19959]
barbados s64 [403, 417, 401, 395, 347]
belarus s64 [17192, 17470, 17241, 17390, 17316]
belgium s64 [30222, 27255, 25936, 26444, 25457]
belize s64 [147, 164, 130, 140, 135]
benin s64 [1388, 1444, 1492, 1585, 1723]
bermuda s64 [166, 121, 130, 125, 157]
bhutan s64 [133, 200, 223, 251, 273]
bosnia__herzegovina s64 [5802, 6514, 6070, 5978, 6063]
botswana s64 [1278, 1139, 1154, 1426, 1918]
brazil s64 [114468, 119829, 128178, 137354, 144480]
british_virgin_islands s64 [47, 48, 48, 48, 49]
brunei_darussalam s64 [2237, 2644, 2636, 2128, 2484]
bulgaria s64 [12030, 13457, 12192, 10799, 11567]
burkina_faso s64 [535, 603, 717, 834, 777]
burundi s64 [58, 66, 77, 79, 120]
cambodia s64 [1367, 1420, 1488, 1528, 1823]
canada s64 [145806, 146472, 141112, 141031, 146494]
cape_verde s64 [152, 168, 138, 136, 134]
cayman_islands s64 [152, 160, 146, 146, 148]
central_african_republic s64 [72, 76, 80, 81, 82]
chad s64 [141, 147, 167, 191, 199]
chile s64 [19703, 21610, 22082, 22696, 22515]
china_mainland s64 [2393248, 2654360, 2734817, 2797384, 2806634]
colombia s64 [20773, 20870, 21803, 24441, 22932]
comoros s64 [44, 37, 39, 48, 42]
congo s64 [540, 616, 810, 842, 844]
cook_islands s64 [19, 19, 19, 19, 19]
costa_rica s64 [2064, 2111, 2118, 2072, 2116]
cote_d_ivoire s64 [1900, 1977, 2535, 2914, 3012]
croatia s64 [5501, 5402, 4907, 4786, 4593]
cuba s64 [10465, 9814, 9860, 9490, 9500]
cyprus s64 [2102, 2025, 1887, 1622, 1653]
czech_republic s64 [30428, 29154, 27551, 26909, 26309]
democratic_people_s_republic_of_korea s64 [18122, 13099, 13378, 9820, 11052]
democratic_republic_of_the_congo_formerly_zaire s64 [551, 680, 655, 979, 1274]
denmark s64 [12719, 11084, 9934, 10508, 9135]
djibouti s64 [141, 129, 141, 166, 197]
dominica s64 [38, 35, 37, 36, 37]
dominican_republic s64 [5733, 5789, 6033, 5847, 5874]
ecuador s64 [9943, 10529, 10401, 11180, 11977]
egypt s64 [55281, 59221, 59195, 58198, 55057]
el_salvador s64 [1761, 1813, 1817, 1699, 1714]
equatorial_guinea s64 [1276, 1671, 1395, 1408, 1458]
eritrea s64 [140, 162, 180, 182, 190]
estonia s64 [4938, 5074, 4806, 5425, 5323]
ethiopia s64 [1796, 2107, 2335, 2900, 3163]
faeroe_islands s64 [172, 155, 161, 185, 163]
falkland_islands_malvinas s64 [15, 15, 15, 15, 15]
federated_states_of_micronesia s64 [31, 33, 37, 39, 41]
fiji s64 [333, 297, 289, 314, 319]
finland s64 [16930, 15494, 13399, 12877, 12899]
france_including_monaco s64 [96273, 90484, 90872, 91109, 82704]
french_guiana s64 [174, 175, 161, 172, 200]
french_polynesia s64 [234, 227, 222, 224, 219]
gabon s64 [1312, 1356, 1392, 1440, 1416]
gambia s64 [118, 122, 124, 118, 140]
georgia s64 [1722, 2174, 2302, 2143, 2451]
germany s64 [206943, 199754, 201762, 206521, 196314]
ghana s64 [2715, 2681, 3239, 3987, 3945]
gibraltar s64 [127, 123, 126, 134, 144]
greece s64 [22868, 21773, 21828, 18948, 18358]
greenland s64 [181, 193, 155, 151, 138]
grenada s64 [71, 69, 74, 83, 66]
guadeloupe s64 [627, 663, 694, 697, 700]
guatemala s64 [3181, 3228, 3265, 3718, 4998]
guinea s64 [710, 758, 704, 627, 668]
guinea_bissau s64 [65, 67, 69, 70, 74]
guyana s64 [469, 486, 544, 528, 548]
haiti s64 [580, 605, 631, 656, 780]
honduras s64 [2175, 2442, 2450, 2472, 2583]
hong_kong_special_adminstrative_region_of_china s64 [11093, 11943, 11842, 12273, 12605]
hungary s64 [13696, 13047, 12158, 11492, 11477]
iceland s64 [535, 513, 491, 518, 541]
india s64 [468964, 502257, 550451, 554882, 610411]
indonesia s64 [116924, 164621, 173733, 133686, 126582]
iraq s64 [30596, 36647, 41648, 45134, 45935]
ireland s64 [10923, 9717, 9706, 9505, 9290]
islamic_republic_of_iran s64 [156267, 160637, 166828, 169015, 177115]
israel s64 [18784, 18852, 20597, 18290, 17617]
italy_including_san_marino s64 [110543, 108534, 100755, 94169, 87377]
jamaica s64 [1990, 2143, 2035, 2207, 2024]
japan s64 [319505, 324809, 335470, 339928, 331074]
jordan s64 [5776, 5909, 6666, 6651, 7213]
kazakhstan s64 [67780, 70646, 66259, 71679, 67716]
kenya s64 [3320, 3670, 3413, 3636, 3896]
kiribati s64 [17, 17, 17, 17, 17]
kuwait s64 [24441, 24824, 27907, 26819, 26018]
kyrgyzstan s64 [1741, 2088, 2763, 2684, 2620]
lao_people_s_democratic_republic s64 [447, 443, 463, 430, 533]
latvia s64 [2202, 1989, 1926, 1931, 1902]
lebanon s64 [5467, 5575, 6172, 6158, 6564]
lesotho s64 [621, 636, 656, 664, 673]
liberia s64 [216, 243, 280, 261, 255]
libyan_arab_jamahiriyah s64 [16897, 10827, 14367, 15344, 15543]
liechtenstein s64 [15, 13, 13, 14, 12]
lithuania s64 [3673, 3760, 3772, 3447, 3501]
luxembourg s64 [2991, 2983, 2908, 2741, 2634]
macau_special_adminstrative_region_of_china s64 [384, 395, 358, 322, 350]
macedonia s64 [2346, 2563, 2445, 2140, 2048]
madagascar s64 [534, 638, 738, 850, 839]
malawi s64 [312, 322, 299, 334, 348]
malaysia s64 [59579, 60105, 59642, 64497, 66218]
maldives s64 [255, 269, 303, 298, 364]
mali s64 [263, 285, 271, 280, 385]
malta s64 [698, 693, 731, 638, 640]
marshall_islands s64 [28, 28, 28, 28, 28]
martinique s64 [548, 605, 604, 603, 627]
mauritania s64 [610, 653, 724, 728, 739]
mauritius s64 [1068, 1069, 1082, 1110, 1153]
mexico s64 [126618, 132105, 135349, 133717, 130971]
mongolia s64 [3769, 5863, 7152, 10568, 5683]
montenegro s64 [704, 701, 637, 613, 603]
montserrat s64 [18, 11, 12, 14, 13]
morocco s64 [15260, 15731, 17107, 16112, 16325]
mozambique s64 [746, 879, 851, 1096, 2298]
myanmar_formerly_burma s64 [3413, 3899, 3019, 3507, 5899]
namibia s64 [846, 772, 923, 717, 1024]
nauru s64 [12, 11, 11, 12, 13]
nepal s64 [1379, 1509, 1597, 1810, 2190]
netherland_antilles s64 [1244, 1587, nil, nil, nil]
netherlands s64 [49919, 47496, 46444, 47247, 45624]
new_caledonia s64 [966, 995, 990, 1157, 1170]
new_zealand s64 [8667, 8591, 9313, 9124, 9453]
nicaragua s64 [1237, 1331, 1260, 1241, 1326]
niger s64 [320, 362, 509, 529, 580]
nigeria s64 [24957, 26096, 26862, 26762, 26256]
niue s64 [1, 2, 2, 2, 3]
norway s64 [16391, 12325, 13605, 15861, 12988]
occupied_palestinian_territory s64 [555, 613, 600, 665, 774]
oman s64 [12931, 14734, 16133, 16738, 16681]
pakistan s64 [44013, 44166, 44586, 44812, 45350]
palau s64 [69, 69, 69, 70, 71]
panama s64 [2499, 2754, 2758, 2923, 2400]
papua_new_guinea s64 [1299, 1453, 1385, 1687, 1723]
paraguay s64 [1390, 1451, 1441, 1482, 1555]
peru s64 [15706, 13535, 15018, 15586, 16838]
philippines s64 [23144, 23315, 24872, 26760, 28812]
plurinational_state_of_bolivia s64 [4146, 4403, 5125, 5159, 5566]
poland s64 [86246, 86446, 81792, 82432, 77922]
portugal s64 [13127, 12987, 12548, 12388, 12286]
qatar s64 [19773, 21935, 25668, 23186, 29412]
republic_of_cameroon s64 [1849, 1573, 1671, 1847, 1910]
republic_of_korea s64 [154545, 160731, 159249, 161576, 160119]
republic_of_moldova s64 [1345, 1374, 1343, 1363, 1345]
reunion s64 [1137, 1165, 1159, 1118, 1138]
romania s64 [21656, 23147, 22286, 19347, 19090]
russian_federation s64 [455558, 480885, 499272, 485018, 465052]
rwanda s64 [161, 181, 201, 219, 229]
saint_helena s64 [3, 3, 3, 3, 3]
saint_lucia s64 [110, 111, 111, 111, 111]
samoa s64 [51, 55, 54, 54, 54]
sao_tome__principe s64 [27, 28, 31, 31, 31]
saudi_arabia s64 [141394, 136318, 154034, 147545, 163907]
senegal s64 [2112, 2282, 2158, 2297, 2415]
serbia s64 [12532, 13422, 12016, 12240, 10272]
seychelles s64 [121, 93, 120, 110, 135]
sierra_leone s64 [198, 245, 281, 325, 357]
singapore s64 [15174, 12332, 9919, 15183, 15373]
slovakia s64 [9883, 9415, 8935, 9024, 8366]
slovenia s64 [4182, 4115, 4031, 3859, 3494]
solomon_islands s64 [54, 54, 54, 55, 55]
somalia s64 [167, 165, 166, 166, 166]
south_africa s64 [129288, 128329, 127835, 127182, 133562]
spain s64 [73878, 73779, 72206, 64640, 63806]
sri_lanka s64 [3617, 4128, 4372, 4224, 5016]
st_kittsnevis s64 [60, 63, 60, 61, 63]
st_pierre__miquelon s64 [19, 19, 19, 20, 21]
st_vincent__the_grenadines s64 [60, 54, 69, 57, 57]
sudan s64 [4347, 4270, nil, nil, nil]
suriname s64 [655, 537, 616, 523, 543]
swaziland s64 [283, 286, 329, 297, 328]
sweden s64 [14187, 14108, 12830, 12230, 11841]
switzerland s64 [10634, 10081, 10301, 10970, 9628]
syrian_arab_republic s64 [16800, 15519, 12198, 9937, 8373]
taiwan s64 [73629, 73406, 70393, 71022, 72013]
tajikistan s64 [694, 641, 800, 949, 1415]
thailand s64 [76882, 75898, 80883, 81835, 86232]
timorleste_formerly_east_timor s64 [64, 67, 80, 120, 128]
togo s64 [720, 672, 678, 725, 715]
tonga s64 [32, 28, 29, 31, 33]
trinidad_and_tobago s64 [13072, 12799, 12386, 12692, 12619]
tunisia s64 [7543, 7096, 7364, 7545, 7862]
turkey s64 [81266, 87494, 89872, 88566, 94350]
turkmenistan s64 [15623, 17035, 17691, 18199, 18659]
turks_and_caicos_islands s64 [52, 52, 54, 54, 56]
tuvalu s64 [2, 2, 3, 3, 3]
uganda s64 [1069, 1163, 1110, 1328, 1426]
ukraine s64 [83077, 78100, 80663, 74141, 61985]
united_arab_emirates s64 [43854, 45116, 48101, 46552, 57641]
united_kingdom s64 [134499, 122124, 127781, 124966, 114486]
united_republic_of_tanzania s64 [1938, 2207, 2603, 3048, 3153]
united_states_of_america s64 [1471375, 1442509, 1396083, 1406916, 1432855]
uruguay s64 [1742, 2117, 2371, 2069, 1840]
uzbekistan s64 [28407, 31002, 31583, 28185, 28692]
vanuatu s64 [33, 36, 31, 29, 42]
venezuela s64 [51560, 48220, 54204, 50156, 50510]
viet_nam s64 [38925, 41497, 38784, 40150, 45517]
wallis_and_futuna_islands s64 [8, 7, 7, 6, 6]
yemen s64 [6390, 5363, 5091, 6953, 6190]
zambia s64 [734, 801, 1000, 1079, 1228]
zimbabwe s64 [2121, 2608, 2125, 3184, 3278]
bonaire_saint_eustatius_and_saba s64 [nil, nil, 85, 88, 88]
curacao s64 [nil, nil, 1636, 1422, 1604]
republic_of_south_sudan s64 [nil, nil, 363, 395, 408]
republic_of_sudan s64 [nil, nil, 3993, 4220, 4190]
saint_martin_dutch_portion s64 [nil, nil, 190, 195, 200]
>
Joins
Joining is fast and easy. You can specify the columns to join on and how to join. Polars even supports cartesian (cross) joins, so Explorer
does too.
df1 = DF.select(df, ["year", "country", "total"])
df2 = DF.select(df, ["year", "country", "cement"])
DF.join(df1, df2)
#Explorer.DataFrame<
Polars[1094 x 4]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total s64 [2308, 1254, 32500, 141, 7924, ...]
cement s64 [5, 177, 2598, 0, 204, ...]
>
df3 = df |> DF.select(["year", "cement"]) |> DF.slice(0, 500)
DF.join(df1, df3, how: :left)
#Explorer.DataFrame<
Polars[109138 x 4]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "AFGHANISTAN", "AFGHANISTAN", "AFGHANISTAN", "AFGHANISTAN", ...]
total s64 [2308, 2308, 2308, 2308, 2308, ...]
cement s64 [5, 177, 2598, 0, 204, ...]
>
Grouping
Explorer
supports groupby operations. They're limited based on what's possible in Polars, but they do most of what you need to do.
grouped = DF.group_by(df, ["country"])
#Explorer.DataFrame<
Polars[1094 x 10]
Groups: ["country"]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total s64 [2308, 1254, 32500, 141, 7924, ...]
solid_fuel s64 [627, 117, 332, 0, 0, ...]
liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
gas_fuel s64 [74, 7, 14565, 0, 374, ...]
cement s64 [5, 177, 2598, 0, 204, ...]
gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>
Notice that the Inspect
call now shows groups
as well as rows
and columns
. You can, of course, get them explicitly.
DF.groups(grouped)
["country"]
And you can ungroup explicitly.
DF.ungroup(grouped)
#Explorer.DataFrame<
Polars[1094 x 10]
year s64 [2010, 2010, 2010, 2010, 2010, ...]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total s64 [2308, 1254, 32500, 141, 7924, ...]
solid_fuel s64 [627, 117, 332, 0, 0, ...]
liquid_fuel s64 [1601, 953, 12381, 141, 3649, ...]
gas_fuel s64 [74, 7, 14565, 0, 374, ...]
cement s64 [5, 177, 2598, 0, 204, ...]
gas_flaring s64 [0, 0, 2623, 0, 3697, ...]
per_capita f64 [0.08, 0.43, 0.9, 1.68, 0.37, ...]
bunker_fuels s64 [9, 7, 663, 0, 321, ...]
>
But what we care about the most is aggregating! Let's see which country has the max per_capita
value.
grouped
|> DF.summarise(max_per_capita: max(per_capita))
|> DF.sort_by(desc: max_per_capita)
#Explorer.DataFrame<
Polars[222 x 2]
country string ["QATAR", "CURACAO", "TRINIDAD AND TOBAGO", "KUWAIT", "NETHERLAND ANTILLES", ...]
max_per_capita f64 [13.54, 10.72, 9.84, 8.16, 7.45, ...]
>
Qatar it is.
You may noticed that we are using max/1
inside the summarise
macro. This is possible because we expose all functions from the Series
module. You can use the following aggregations inside summarise:
min/1
- Take the minimum value within the group. SeeExplorer.Series.min/1
.max/1
- Take the maximum value within the group. SeeExplorer.Series.max/1
.sum/1
- Take the sum of the series within the group. SeeExplorer.Series.sum/1
.mean/1
- Take the mean of the series within the group. SeeExplorer.Series.mean/1
.median/1
- Take the median of the series within the group. SeeExplorer.Series.median/1
.first/1
- Take the first value within the group. SeeExplorer.Series.first/1
.last/1
- Take the last value within the group. SeeExplorer.Series.last/1
.count/1
- Count the number of rows per group.n_unique/1
- Count the number of unique rows per group.
The API is similar to mutate
: you can use keyword args or a map and specify aggregations to use.
DF.summarise(grouped, min_per_capita: min(per_capita), min_total: min(total))
#Explorer.DataFrame<
Polars[222 x 3]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
min_per_capita f64 [0.08, 0.43, 0.9, 1.63, 0.37, ...]
min_total s64 [2308, 1254, 32500, 126, 7924, ...]
>
Speaking of mutate
, it's 'group-aware'. As are sort_by
, distinct
, and n_rows
.
DF.mutate(grouped, total_window_sum: window_sum(total, 3), rows_in_group: count(country))
#Explorer.DataFrame<
Polars[1094 x 12]
Groups: ["country"]
year s64 [2010, 2011, 2012, 2013, 2014, ...]
country string ["AFGHANISTAN", "AFGHANISTAN", "AFGHANISTAN", "AFGHANISTAN", "AFGHANISTAN", ...]
total s64 [2308, 3338, 2933, 2731, 2675, ...]
solid_fuel s64 [627, 1174, 1000, 1075, 1194, ...]
liquid_fuel s64 [1601, 2075, 1844, 1568, 1393, ...]
gas_fuel s64 [74, 84, 84, 81, 74, ...]
cement s64 [5, 5, 5, 7, 14, ...]
gas_flaring s64 [0, 0, 0, 0, 0, ...]
per_capita f64 [0.08, 0.12, 0.1, 0.09, 0.08, ...]
bunker_fuels s64 [9, 9, 9, 9, 9, ...]
total_window_sum s64 [2308, 5646, 8579, 9002, 8339, ...]
rows_in_group u32 [5, 5, 5, 5, 5, ...]
>
It's also possible to use aggregations inside other functions:
grouped
|> DF.summarise(greater_than_9: greater(max(per_capita), 9.0), per_capita_max: max(per_capita))
|> DataFrame.sort_by(desc: per_capita_max)
#Explorer.DataFrame<
Polars[222 x 3]
country string ["QATAR", "CURACAO", "TRINIDAD AND TOBAGO", "KUWAIT", "NETHERLAND ANTILLES", ...]
greater_than_9 boolean [true, true, true, false, false, ...]
per_capita_max f64 [13.54, 10.72, 9.84, 8.16, 7.45, ...]
>
That's it!
And not. This is certainly not exhaustive, but I hope it gives you a good idea of what can be done and what the 'flavour' of the API is like. I'd love contributions and issues raised where you find them!