Mix.install([{:dux, "~> 0.2.0"}])

Meet the Penguins

Dux is a DuckDB-native dataframe library for Elixir. Let's explore it with a real dataset — 344 penguins from Palmer Station, Antarctica.

require Dux

penguins = Dux.Datasets.penguins()

Lazy by default

Nothing has hit DuckDB yet. penguins is a %Dux{} struct holding a source and zero operations. Everything is lazy until you call compute/1, to_rows/1, or to_columns/1.
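To make the laziness concrete, bind a pipeline step to a variable; it is still a plain %Dux{} value, and DuckDB does no work until a materialising call (a sketch using only the verbs introduced below):

```elixir
lazy = Dux.filter(penguins, species == "Gentoo")
# `lazy` is still a %Dux{} describing the query; no SQL has run yet.
# Only a materialising call such as compute/1 executes it:
Dux.compute(lazy)
```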

Your First Pipeline

Pipelines chain verbs — each returns a new %Dux{}. The require Dux above unlocks expression macros where bare identifiers become column names.

penguins
|> Dux.filter(species == "Gentoo" and body_mass_g > 5000)
|> Dux.select([:species, :island, :body_mass_g, :sex])
|> Dux.sort_by(desc: :body_mass_g)
|> Dux.head(5)

Materialisation

Pipelines are lazy; nothing executes until you materialise. compute/1 executes the query and returns a %Dux{} holding the data. to_rows/1 returns a list of maps (one map per row), while to_columns/1 returns a map of column names to lists of values:

penguins
|> Dux.filter(species == "Adelie")
|> Dux.select([:island, :bill_length_mm])
|> Dux.head(3)
|> Dux.to_rows()

penguins
|> Dux.filter(species == "Chinstrap")
|> Dux.select([:species, :body_mass_g])
|> Dux.head(5)
|> Dux.to_columns()

Aggregation

Group rows and compute summaries. The expression syntax lets you write sum(body_mass_g) directly:

penguins
|> Dux.drop_nil([:body_mass_g, :sex])
|> Dux.group_by([:species, :sex])
|> Dux.summarise(
  count: count(body_mass_g),
  avg_mass: avg(body_mass_g),
  max_mass: max(body_mass_g)
)
|> Dux.sort_by([:species, :sex])
|> Dux.compute()
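The sum/1 mentioned above works the same way. Omitting group_by/2 collapses the summary to a single row (a sketch, assuming Dux supports whole-table aggregation):

```elixir
penguins
|> Dux.drop_nil([:body_mass_g])
|> Dux.summarise(total_mass: sum(body_mass_g), n: count(body_mass_g))
|> Dux.compute()
```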

Raw SQL alternative

Every expression verb has a _with variant that accepts raw DuckDB SQL strings:

Dux.summarise_with(df, avg_mass: "AVG(body_mass_g)")

Use this for DuckDB functions the expression macro doesn't cover.
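For example, DuckDB's MEDIAN aggregate can be reached this way (a sketch, assuming the expression macro has no median/1 of its own):

```elixir
penguins
|> Dux.group_by(:species)
|> Dux.summarise_with(median_mass: "MEDIAN(body_mass_g)")
|> Dux.compute()
```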

Computed Columns

mutate/2 adds or replaces columns:

penguins
|> Dux.drop_nil([:bill_length_mm, :bill_depth_mm])
|> Dux.mutate(bill_ratio: bill_length_mm / bill_depth_mm)
|> Dux.select([:species, :bill_length_mm, :bill_depth_mm, :bill_ratio])
|> Dux.sort_by(desc: :bill_ratio)
|> Dux.head(5)
|> Dux.compute()

Interpolation

Use ^ to interpolate Elixir values into expressions as parameter bindings (safe from SQL injection):

min_mass = 4500

penguins
|> Dux.filter(body_mass_g > ^min_mass)
|> Dux.group_by(:species)
|> Dux.summarise(n: count(species))
|> Dux.sort_by(desc: :n)
|> Dux.compute()
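Multiple pins in one expression each become their own parameter binding; a sketch with two bounds:

```elixir
lo = 3000
hi = 4000

penguins
|> Dux.filter(body_mass_g >= ^lo and body_mass_g <= ^hi)
|> Dux.select([:species, :body_mass_g])
|> Dux.head(5)
|> Dux.compute()
```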

See the SQL

Every pipeline compiles to SQL CTEs. Use sql_preview/1 to inspect what DuckDB will execute:

penguins
|> Dux.filter(species == "Gentoo")
|> Dux.mutate(mass_kg: body_mass_g / 1000.0)
|> Dux.group_by(:island)
|> Dux.summarise(avg_kg: avg(mass_kg))
|> Dux.sql_preview(pretty: true)
|> IO.puts()

Reading & Writing Files

Dux reads CSV, Parquet, and NDJSON via DuckDB — including S3 and HTTP URLs:

# Write the filtered penguins to Parquet
path = Path.join(System.tmp_dir!(), "gentoo.parquet")

penguins
|> Dux.filter(species == "Gentoo")
|> Dux.to_parquet(path)

# Read it back
Dux.from_parquet(path)
|> Dux.head(3)
|> Dux.compute()

# Cleanup
File.rm(path)
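CSV reads work the same way, including over HTTP. The from_csv/1 name and the URL below are assumptions by analogy with from_parquet/1:

```elixir
Dux.from_csv("https://example.com/penguins.csv")
|> Dux.head(3)
|> Dux.compute()
```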

Joins

Combine datasets by matching on shared columns:

# Create a small lookup table
habitat = Dux.from_list([
  %{island: "Biscoe", climate: "maritime"},
  %{island: "Dream", climate: "maritime"},
  %{island: "Torgersen", climate: "polar"}
])

penguins
|> Dux.join(habitat, on: :island)
|> Dux.group_by(:climate)
|> Dux.summarise(n: count(species))
|> Dux.compute()

What's Next