ExArrow.Parquet.Reader (ex_arrow v0.4.0)

View Source

Parquet file reader: open a .parquet file or an in-memory binary and receive an ExArrow.Stream that yields record batches.

The stream interface is identical to ExArrow.IPC.Reader and ADBC streams — use ExArrow.Stream.schema/1, ExArrow.Stream.next/1, and ExArrow.Stream.to_list/1 to consume it.

How Parquet is read (lazy row-group streaming)

Parquet has a footer that is scanned once when the stream is opened, making the schema immediately available via ExArrow.Stream.schema/1. Row groups are then decoded on demand: each call to ExArrow.Stream.next/1 reads and decodes the next row group without touching the rest of the file.

This means:

  • Peak memory scales with the largest single row group, not the full file.
  • You can stop consuming after N batches and the remaining row groups are never decoded.
  • For file-backed streams (from_file/1) the underlying file handle stays open until the stream resource is garbage-collected.
  • For binary-backed streams (from_binary/1) the bytes are held in native memory and released when the resource is collected.

Examples

# Consume lazily — only the requested row groups are decoded
{:ok, stream}  = ExArrow.Parquet.Reader.from_file("/data/events.parquet")
{:ok, schema}  = ExArrow.Stream.schema(stream)
IO.inspect ExArrow.Schema.field_names(schema)
first_batch = ExArrow.Stream.next(stream)   # decodes row-group 0
next_batch  = ExArrow.Stream.next(stream)   # decodes row-group 1
nil         = ExArrow.Stream.next(stream)   # nil when exhausted

# Collect all row groups at once
{:ok, stream} = ExArrow.Parquet.Reader.from_file("/data/events.parquet")
batches = ExArrow.Stream.to_list(stream)

# Read from an in-memory binary (e.g. fetched from object storage)
parquet_bytes = File.read!("/data/events.parquet")
{:ok, stream} = ExArrow.Parquet.Reader.from_binary(parquet_bytes)
batch = ExArrow.Stream.next(stream)

# Pipe into Explorer
{:ok, stream} = ExArrow.Parquet.Reader.from_file("/data/report.parquet")
{:ok, df}     = ExArrow.Explorer.from_stream(stream)

Summary

Functions

Open a Parquet file from an in-memory binary.

Open a Parquet file at path for lazy row-group streaming.

Functions

from_binary(binary)

@spec from_binary(binary()) :: {:ok, ExArrow.Stream.t()} | {:error, String.t()}

Open a Parquet file from an in-memory binary.

Useful when the Parquet data has already been downloaded (e.g. from S3, an HTTP endpoint, or another process).

Returns {:ok, stream} or {:error, message}.

Example

parquet_bytes = File.read!("/data/events.parquet")
{:ok, stream} = ExArrow.Parquet.Reader.from_binary(parquet_bytes)
{:ok, schema} = ExArrow.Stream.schema(stream)
batch         = ExArrow.Stream.next(stream)
rows          = ExArrow.RecordBatch.num_rows(batch)

from_file(path)

@spec from_file(Path.t()) :: {:ok, ExArrow.Stream.t()} | {:error, String.t()}

Open a Parquet file at path for lazy row-group streaming.

Scans the Parquet footer to make the schema available, then returns a stream whose row groups are decoded on demand by ExArrow.Stream.next/1. The file handle remains open until the stream resource is garbage-collected.

Returns {:ok, stream} where stream is an ExArrow.Stream with :parquet backend, or {:error, message} if the file does not exist or is not valid Parquet.

Example

{:ok, stream} = ExArrow.Parquet.Reader.from_file("/data/events.parquet")
{:ok, schema} = ExArrow.Stream.schema(stream)
field_names   = ExArrow.Schema.field_names(schema)
first_batch   = ExArrow.Stream.next(stream)  # decodes row-group 0
batches       = ExArrow.Stream.to_list(stream)  # collects remaining