ExArrow.Explorer (ex_arrow v0.4.0)

Bridge between ExArrow and Explorer DataFrames.

Converts between ExArrow.Stream / ExArrow.RecordBatch and Explorer.DataFrame via an in-memory Arrow IPC round-trip. No CSV or row-by-row conversion is performed; the data stays in columnar binary form throughout.

Requires {:explorer, "~> 0.11"} in your mix.exs dependencies. When Explorer is absent, every function returns {:error, "Explorer is not available..."}.
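As a minimal sketch of the dependency setup, a mix.exs fragment might look like the following. The :explorer constraint is the one stated above; the :ex_arrow constraint is an assumption taken from the version shown in this page's title.

```elixir
# mix.exs (fragment)
defp deps do
  [
    # Assumed constraint, based on the v0.4.0 shown in this page's title.
    {:ex_arrow, "~> 0.4.0"},
    # Needed for the ExArrow.Explorer bridge; without it every bridge
    # function returns an {:error, message} tuple instead of a result.
    {:explorer, "~> 0.11"}
  ]
end
```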

Typical usage

ExArrow → Explorer (e.g. after a Flight or ADBC query):

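{:ok, stream} = ExArrow.Flight.Client.do_get(client, "sales_2024")
{:ok, df}     = ExArrow.Explorer.from_stream(stream)
# filter/2 is a macro, so Explorer.DataFrame must be required before use
require Explorer.DataFrame
Explorer.DataFrame.filter(df, score > 0.9)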

Explorer → ExArrow (e.g. to write to Parquet or send via Flight):

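df = Explorer.DataFrame.new(x: [1, 2, 3], y: ["a", "b", "c"])
# Derive the schema and the batches that do_put expects from the dataframe
{:ok, stream}  = ExArrow.Explorer.to_stream(df)
{:ok, schema}  = ExArrow.Stream.schema(stream)
{:ok, batches} = ExArrow.Explorer.to_record_batches(df)
:ok = ExArrow.Flight.Client.do_put(client, schema, batches,
        descriptor: {:cmd, "enriched"})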

C Data Interface (CDI) — future zero-copy path

The current implementation serialises through an in-memory IPC binary. ExArrow.CDI provides CDI export/import that bypasses serialisation entirely. When Explorer exposes a CDI import API, the bridge here will use it automatically, making from_record_batch/1 and from_stream/1 truly zero-copy. See ExArrow.CDI for the low-level interface.

Functions

from_record_batch(batch)

@spec from_record_batch(ExArrow.RecordBatch.t()) ::
  {:ok, Explorer.DataFrame.t()} | {:error, String.t()}

Convert a single ExArrow.RecordBatch to an Explorer.DataFrame.

Returns {:ok, dataframe} or {:error, message}.

Example

{:ok, stream} = ExArrow.IPC.Reader.from_file("/data/chunk.arrow")
batch = ExArrow.Stream.next(stream)
{:ok, df} = ExArrow.Explorer.from_record_batch(batch)
Explorer.DataFrame.names(df)
#=> ["id", "name", "score"]

from_stream(stream)

@spec from_stream(ExArrow.Stream.t()) ::
  {:ok, Explorer.DataFrame.t()} | {:error, String.t()}

Convert an ExArrow.Stream to an Explorer.DataFrame.

Collects all batches from stream, serialises them to Arrow IPC, then loads the binary with Explorer.DataFrame.load_ipc_stream!/1.

Returns {:ok, dataframe} or {:error, message}.

Example

{:ok, stream} = ExArrow.IPC.Reader.from_file("/data/events.arrow")
{:ok, df}     = ExArrow.Explorer.from_stream(stream)
Explorer.DataFrame.n_rows(df)
#=> 1_000_000
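
Every function in this module returns either {:ok, value} or {:error, message}, with the message typed as a string in the specs. The sketch below shows one way to unwrap such a result in a pipeline. BridgeResult and unwrap!/1 are hypothetical names introduced for illustration and are not part of ExArrow.

```elixir
defmodule BridgeResult do
  @moduledoc "Hypothetical helper for illustration; not part of ExArrow."

  # Returns the wrapped value on success.
  @spec unwrap!({:ok, term()} | {:error, String.t()}) :: term()
  def unwrap!({:ok, value}), do: value

  # Raises with the bridge's error message on failure, for example when
  # Explorer is not available.
  def unwrap!({:error, message}) when is_binary(message) do
    raise RuntimeError, "ExArrow.Explorer bridge failed: " <> message
  end
end
```

With a helper like this, a call could be written as `df = stream |> ExArrow.Explorer.from_stream() |> BridgeResult.unwrap!()`. Matching on the tuple with `case` is the better choice when the caller can recover from the error.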

to_record_batches(df)

@spec to_record_batches(Explorer.DataFrame.t()) ::
  {:ok, [ExArrow.RecordBatch.t()]} | {:error, String.t()}

Convert an Explorer.DataFrame to a list of ExArrow.RecordBatch handles.

Returns {:ok, [batch]} or {:error, message}.

Example

df = Explorer.DataFrame.new(a: [10, 20], b: [1.0, 2.0])
{:ok, batches} = ExArrow.Explorer.to_record_batches(df)
total_rows = Enum.sum(Enum.map(batches, &ExArrow.RecordBatch.num_rows/1))
#=> 2

to_stream(df)

@spec to_stream(Explorer.DataFrame.t()) ::
  {:ok, ExArrow.Stream.t()} | {:error, String.t()}

Convert an Explorer.DataFrame to an ExArrow.Stream.

Serialises the dataframe to Arrow IPC via Explorer.DataFrame.dump_ipc_stream!/1, then opens an ExArrow.Stream from the resulting binary.

Returns {:ok, stream} or {:error, message}.

Example

df = Explorer.DataFrame.new(x: [1, 2, 3], y: ["a", "b", "c"])
{:ok, stream} = ExArrow.Explorer.to_stream(df)
{:ok, schema} = ExArrow.Stream.schema(stream)
ExArrow.Schema.field_names(schema)
#=> ["x", "y"]
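
Both to_stream/1 and from_stream/1 are described as passing through an Arrow IPC stream binary on the Explorer side. The standalone script below sketches that Explorer half of the round trip, using only Explorer.DataFrame.dump_ipc_stream!/1 and Explorer.DataFrame.load_ipc_stream!/1. It does not call ExArrow, and it is an illustration of the binary format in use rather than the bridge's actual implementation.

```elixir
# Standalone script: fetches Explorer on first run (requires network access).
Mix.install([{:explorer, "~> 0.11"}])

alias Explorer.DataFrame, as: DF

df = DF.new(x: [1, 2, 3], y: ["a", "b", "c"])

# Serialise the dataframe to an in-memory Arrow IPC stream binary.
binary = DF.dump_ipc_stream!(df)

# Load the same binary back into a new dataframe.
round_tripped = DF.load_ipc_stream!(binary)

IO.inspect(DF.names(round_tripped))
IO.inspect(DF.n_rows(round_tripped))
```

The column names and row count of round_tripped should match those of df, since the data never leaves columnar form.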