ExArrow.Compute (ex_arrow v0.4.0)

Arrow compute kernels that filter rows, project (select) columns, and sort, all in native memory.

Column buffers are never copied into BEAM terms. Every function takes one or more ExArrow.RecordBatch handles and returns a new handle. The result can be passed directly to ExArrow.IPC.Writer, ExArrow.Flight.Client.do_put/4, or further compute operations without any intermediate serialisation.

Quick example

Given a batch from an ADBC query or IPC file:

# Select only the columns you need
{:ok, slim}  = ExArrow.Compute.project(batch, ["user_id", "score"])

# Sort by score descending
{:ok, sorted} = ExArrow.Compute.sort(slim, "score", ascending: false)
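
Because each call returns {:ok, batch} or {:error, message}, the two steps above can also be chained with `with` so that the first error short-circuits. This is a minimal sketch, not part of the library: the TopScores module and its compute argument are hypothetical, and compute defaults to ExArrow.Compute only so the control flow can be tried with a stand-in module.

```elixir
defmodule TopScores do
  # Hypothetical helper. `compute` is any module exposing project/2 and sort/3;
  # it defaults to ExArrow.Compute, the module documented on this page.
  def run(batch, compute \\ ExArrow.Compute) do
    # If project/2 returns {:error, message}, `with` returns that tuple as-is
    # and sort/3 is never called.
    with {:ok, slim} <- compute.project(batch, ["user_id", "score"]) do
      compute.sort(slim, "score", ascending: false)
    end
  end
end
```

A caller would then match once on the result of TopScores.run(batch) instead of after every step.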

Building a boolean predicate for filter/2

filter/2 takes a second record batch as the predicate and reads only its first column, which must be a boolean Arrow array. The most common source is a query result that already contains a boolean column:

# e.g. "SELECT id, score, is_active FROM users"
{:ok, stream}  = ExArrow.ADBC.Statement.execute(stmt)
batch          = ExArrow.Stream.next(stream)

# Project the boolean column into its own batch
{:ok, mask}     = ExArrow.Compute.project(batch, ["is_active"])
{:ok, filtered} = ExArrow.Compute.filter(batch, mask)

You can also write a Parquet/IPC file that contains a pre-computed boolean column and read it back as the predicate.

Summary

Functions

filter(batch, predicate_batch)
Filter rows from batch using the first (boolean) column of predicate_batch.

project(batch, column_names)
Project (select) a subset of columns from batch by name.

sort(batch, column_name, opts \\ [])
Sort batch by column_name.

Functions

filter(batch, predicate_batch)

@spec filter(ExArrow.RecordBatch.t(), ExArrow.RecordBatch.t()) ::
  {:ok, ExArrow.RecordBatch.t()} | {:error, String.t()}

Filter rows from batch using the first (boolean) column of predicate_batch.

predicate_batch must have at least one column and its first column must be a boolean Arrow array with the same row count as batch. Rows where the predicate is true are kept; rows where it is false or null are dropped.

Returns {:ok, filtered_batch} or {:error, message}.
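
The keep/drop rule above can be modelled on plain lists. This sketch does not call ExArrow: FilterSketch is a hypothetical module, nil stands in for an Arrow null, and the mismatch error text is invented for illustration (the real message is not documented here).

```elixir
defmodule FilterSketch do
  # rows: list of row values; mask: list of true | false | nil.
  # Both lists must have the same length, mirroring the row-count requirement.
  def keep(rows, mask) when length(rows) == length(mask) do
    # The pattern {row, true} matches only flags that are exactly true,
    # so rows paired with false or nil are dropped.
    kept = for {row, true} <- Enum.zip(rows, mask), do: row
    {:ok, kept}
  end

  # Hypothetical error text for mismatched lengths.
  def keep(_rows, _mask), do: {:error, "row count mismatch"}
end

FilterSketch.keep([:a, :b, :c, :d], [true, false, nil, true])
# => {:ok, [:a, :d]}
```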

Example

# Keep only rows where "is_active" is true.
# batch has columns [id, score, is_active]; extract the bool column first.
{:ok, mask}     = ExArrow.Compute.project(batch, ["is_active"])
{:ok, filtered} = ExArrow.Compute.filter(batch, mask)
# filtered has the same columns as batch but only the rows where is_active = true

project(batch, column_names)

@spec project(ExArrow.RecordBatch.t(), [String.t()]) ::
  {:ok, ExArrow.RecordBatch.t()} | {:error, String.t()}

Project (select) a subset of columns from batch by name.

Columns appear in the result in the order given by column_names. Requesting a name that does not exist returns {:error, "column 'x' not found"}.

Returns {:ok, projected_batch} or {:error, message}.
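
The ordering and missing-column rules can be illustrated with a plain-Elixir model. This is only a sketch of the documented behaviour, not how the native kernel is implemented: ProjectSketch is a hypothetical module and a batch is represented as a list of {name, values} tuples.

```elixir
defmodule ProjectSketch do
  # columns: list of {name, values} tuples in schema order.
  # names: requested column names, in the desired output order.
  def project(columns, names) do
    result =
      Enum.reduce_while(names, [], fn name, acc ->
        case List.keyfind(columns, name, 0) do
          # Stop at the first unknown name and report it.
          nil -> {:halt, {:error, "column '#{name}' not found"}}
          column -> {:cont, [column | acc]}
        end
      end)

    case result do
      {:error, _message} = error -> error
      # Columns were accumulated in reverse; restore the requested order.
      selected -> {:ok, Enum.reverse(selected)}
    end
  end
end

columns = [{"user_id", [1, 2]}, {"score", [9.5, 7.0]}, {"name", ["a", "b"]}]

ProjectSketch.project(columns, ["score", "user_id"])
# => {:ok, [{"score", [9.5, 7.0]}, {"user_id", [1, 2]}]}
```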

Examples

# Select two columns; result schema has only [user_id, score]
{:ok, slim} = ExArrow.Compute.project(batch, ["user_id", "score"])

# Reorder: result schema is [score, user_id]
{:ok, reordered} = ExArrow.Compute.project(batch, ["score", "user_id"])

# Unknown column
{:error, "column 'missing' not found"} =
  ExArrow.Compute.project(batch, ["missing"])

sort(batch, column_name, opts \\ [])

@spec sort(ExArrow.RecordBatch.t(), String.t(), keyword()) ::
  {:ok, ExArrow.RecordBatch.t()} | {:error, String.t()}

Sort batch by column_name.

All columns in the batch are reordered together — the sort is applied to the full batch, not just the key column. Nulls are always placed first regardless of sort direction.

Options

  • :ascending - true (default) for A→Z / small→large order; false for descending.

Returns {:ok, sorted_batch} or {:error, message}.
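
The nulls-first rule is easy to misread, so here is a plain-Elixir model of it. It is a sketch of the documented ordering only and does not call ExArrow: SortSketch is a hypothetical module, rows are maps, and nil stands in for an Arrow null.

```elixir
defmodule SortSketch do
  # rows: list of maps; key: the sort column.
  def sort(rows, key, opts \\ []) do
    direction = if Keyword.get(opts, :ascending, true), do: :asc, else: :desc

    # Separate null keys so the direction never affects where they land.
    {nulls, present} = Enum.split_with(rows, fn row -> is_nil(Map.get(row, key)) end)

    # Whole rows move together; nulls always come first.
    nulls ++ Enum.sort_by(present, &Map.get(&1, key), direction)
  end
end

rows = [
  %{name: "b", score: 2},
  %{name: "a", score: nil},
  %{name: "c", score: 9}
]

SortSketch.sort(rows, :score) |> Enum.map(& &1.name)
# => ["a", "b", "c"]

SortSketch.sort(rows, :score, ascending: false) |> Enum.map(& &1.name)
# => ["a", "c", "b"]
```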

Examples

# Sort by score, lowest first (default)
{:ok, sorted} = ExArrow.Compute.sort(batch, "score")

# Sort by score, highest first
{:ok, sorted} = ExArrow.Compute.sort(batch, "score", ascending: false)

# Sort by a string column alphabetically
{:ok, sorted} = ExArrow.Compute.sort(batch, "name")

# Unknown column
{:error, msg} = ExArrow.Compute.sort(batch, "nonexistent")