Mix.install([{:dux, "~> 0.2.0"}])

Meet the Penguins
Dux is a DuckDB-native dataframe library for Elixir. Let's explore it with a real dataset — 344 penguins from Palmer Station, Antarctica.
require Dux
penguins = Dux.Datasets.penguins()

Lazy by default
Nothing has hit DuckDB yet. penguins is a %Dux{} struct holding a
source and zero operations. Everything is lazy until you call compute/1,
to_rows/1, or to_columns/1.
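As a quick illustration — a sketch using the filter verb introduced in the next section — adding an operation returns a new struct without touching the database:

```elixir
# Adding an operation only extends the %Dux{} description of work;
# DuckDB executes nothing at this point.
gentoo = Dux.filter(penguins, species == "Gentoo")
# `gentoo` is still a lazy %Dux{} — no query has run yet.
```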
Your First Pipeline
Pipelines chain verbs — each returns a new %Dux{}. The require Dux above
unlocks expression macros where bare identifiers become column names.
penguins
|> Dux.filter(species == "Gentoo" and body_mass_g > 5000)
|> Dux.select([:species, :island, :body_mass_g, :sex])
|> Dux.sort_by(desc: :body_mass_g)
|> Dux.head(5)

Materialization
Pipelines are lazy — nothing executes until you materialise. compute/1
executes the pipeline and returns a %Dux{} holding the result. to_rows/1 and
to_columns/1 extract raw Elixir data structures — a list of maps and a map of
lists, respectively:
penguins
|> Dux.filter(species == "Adelie")
|> Dux.select([:island, :bill_length_mm])
|> Dux.head(3)
|> Dux.to_rows()

penguins
|> Dux.filter(species == "Chinstrap")
|> Dux.select([:species, :body_mass_g])
|> Dux.head(5)
|> Dux.to_columns()

Aggregation
Group rows and compute summaries. The expression syntax lets you write
sum(body_mass_g) directly:
penguins
|> Dux.drop_nil([:body_mass_g, :sex])
|> Dux.group_by([:species, :sex])
|> Dux.summarise(
count: count(body_mass_g),
avg_mass: avg(body_mass_g),
max_mass: max(body_mass_g)
)
|> Dux.sort_by([:species, :sex])
|> Dux.compute()

Raw SQL alternative
Every expression verb has a _with variant that accepts raw DuckDB SQL strings:
Dux.summarise_with(df, avg_mass: "AVG(body_mass_g)")

Use this for DuckDB functions the expression macro doesn't cover.
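The same pattern applies to the other verbs — for example, assuming the _with naming holds for filter as stated above, a filter_with/2 call would take a raw SQL predicate and can use any DuckDB function, such as regexp_matches:

```elixir
# The string is passed to DuckDB as-is, not macro-expanded —
# handy for functions the expression syntax doesn't cover.
Dux.filter_with(df, "regexp_matches(species, '^G')")
```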
Computed Columns
mutate/2 adds or replaces columns:
penguins
|> Dux.drop_nil([:bill_length_mm, :bill_depth_mm])
|> Dux.mutate(bill_ratio: bill_length_mm / bill_depth_mm)
|> Dux.select([:species, :bill_length_mm, :bill_depth_mm, :bill_ratio])
|> Dux.sort_by(desc: :bill_ratio)
|> Dux.head(5)
|> Dux.compute()

Interpolation
Use ^ to interpolate Elixir values into expressions as parameter bindings
(safe from SQL injection):
min_mass = 4500
penguins
|> Dux.filter(body_mass_g > ^min_mass)
|> Dux.group_by(:species)
|> Dux.summarise(n: count(species))
|> Dux.sort_by(desc: :n)
|> Dux.compute()

See the SQL
Every pipeline compiles to SQL CTEs. Use sql_preview/1 to inspect
what DuckDB will execute:
penguins
|> Dux.filter(species == "Gentoo")
|> Dux.mutate(mass_kg: body_mass_g / 1000.0)
|> Dux.group_by(:island)
|> Dux.summarise(avg_kg: avg(mass_kg))
|> Dux.sql_preview(pretty: true)
|> IO.puts()

Reading & Writing Files
Dux reads CSV, Parquet, and NDJSON via DuckDB — including S3 and HTTP URLs:
# Write the filtered penguins to Parquet
path = Path.join(System.tmp_dir!(), "gentoo.parquet")
penguins
|> Dux.filter(species == "Gentoo")
|> Dux.to_parquet(path)
# Read it back
Dux.from_parquet(path)
|> Dux.head(3)
|> Dux.compute()

# Cleanup
File.rm(path)

Joins
Combine datasets by matching on shared columns:
# Create a small lookup table
habitat = Dux.from_list([
%{island: "Biscoe", climate: "maritime"},
%{island: "Dream", climate: "maritime"},
%{island: "Torgersen", climate: "polar"}
])
penguins
|> Dux.join(habitat, on: :island)
|> Dux.group_by(:climate)
|> Dux.summarise(n: count(species))
|> Dux.compute()

What's Next
- Data IO — CSV, Parquet, NDJSON, S3, glob patterns
- Transformations — deep dive on filter, mutate, window functions
- Joins & Reshape — star schema joins, pivot_wider/longer
- Distributed Execution — scale across BEAM nodes with distribute/2
- Graph Analytics — PageRank, shortest paths, connected components