SparkEx.Reader (SparkEx v0.1.0)

Data source reader APIs for creating DataFrames from external sources.

Examples

df = SparkEx.Reader.table(session, "catalog.db.my_table")
df = SparkEx.Reader.parquet(session, "/data/events.parquet")
df = SparkEx.Reader.csv(session, "/data/users.csv", header: true, infer_schema: true)
df = SparkEx.Reader.json(session, "/data/logs.json")

Summary

Functions

avro(session, paths, opts \\ [])
Creates a DataFrame by reading Avro file(s).

binary_file(session, paths, opts \\ [])
Creates a DataFrame by reading binary files.

csv(session, paths, opts \\ [])
Creates a DataFrame by reading CSV file(s).

format(reader, source)
Sets the source format for a reader builder.

jdbc(session, url, table, opts \\ [])
Creates a DataFrame by reading from a database over JDBC.

json(session, paths, opts \\ [])
Creates a DataFrame by reading JSON file(s).

load(reader)
Loads data from the configured source using reader builder state.

new(session)
Creates a stateful reader builder.

option(reader, key, value)
Sets a single option on a reader builder.

options(reader, opts)
Merges options into a reader builder.

orc(session, paths, opts \\ [])
Creates a DataFrame by reading ORC file(s).

parquet(session, paths, opts \\ [])
Creates a DataFrame by reading Parquet file(s).

schema(reader, schema_ddl)
Sets the schema for a reader builder.

table(reader, table_name)
Reads a table from the catalog using reader builder options.

table(session, table_name, opts \\ [])
Creates a DataFrame from a named table (catalog table).

text(session, paths, opts \\ [])
Creates a DataFrame by reading text file(s).

xml(session, paths, opts \\ [])
Creates a DataFrame by reading XML file(s).

Types

t()

@type t() :: %SparkEx.Reader{
  format: String.t() | nil,
  options: %{required(String.t()) => String.t()},
  schema: String.t() | nil,
  session: GenServer.server()
}

Functions

avro(session, paths, opts \\ [])

Creates a DataFrame by reading Avro file(s).

Options

  • :schema — optional schema string
  • :options — map of Avro reader options
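
Examples

The paths and schema string below are illustrative:

df = SparkEx.Reader.avro(session, "/data/events.avro")
df = SparkEx.Reader.avro(session, "/data/events.avro", schema: "id LONG, name STRING")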

binary_file(session, paths, opts \\ [])

@spec binary_file(GenServer.server(), String.t() | [String.t()], keyword()) ::
  SparkEx.DataFrame.t()

Creates a DataFrame by reading binary files.

Options

  • :options — map of BinaryFile reader options
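
Examples

The directory path is illustrative; "pathGlobFilter" is a standard Spark file-source option, shown here only as an example:

df = SparkEx.Reader.binary_file(session, "/data/images/", options: %{"pathGlobFilter" => "*.png"})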

csv(session, paths, opts \\ [])

Creates a DataFrame by reading CSV file(s).

Options

  • :schema — optional schema string
  • :header — whether the CSV has a header row (maps to "header" option)
  • :infer_schema — whether to infer the schema (maps to "inferSchema" option)
  • :separator — field separator (maps to "sep" option)
  • :options — map of additional CSV reader options

Examples

df = SparkEx.Reader.csv(session, "/data/users.csv", header: true, infer_schema: true)

format(reader, source)

@spec format(t(), String.t()) :: t()

Sets the source format for a reader builder.

jdbc(session, opts)

@spec jdbc(GenServer.server(), keyword()) :: SparkEx.DataFrame.t()

jdbc(session, url, table, opts \\ [])

Creates a DataFrame by reading from a database over JDBC.

Options

  • :options — map of JDBC reader options (e.g. url, dbtable)
  • :predicates — list of SQL predicate strings for JDBC predicate pushdown
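
Examples

The URL, table name, and credentials below are placeholders; "user" and "password" are standard Spark JDBC options:

df =
  SparkEx.Reader.jdbc(session, "jdbc:postgresql://localhost:5432/shop", "public.orders",
    options: %{"user" => "reader", "password" => "secret"}
  )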

json(session, paths, opts \\ [])

Creates a DataFrame by reading JSON file(s).

Options

  • :schema — optional schema string
  • :options — map of JSON reader options

Examples

df = SparkEx.Reader.json(session, "/data/logs.json")

load(reader)

@spec load(t()) :: SparkEx.DataFrame.t()

Loads data from the configured source using reader builder state.
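
Examples

A sketch of an arity-1 load, assuming the source accepts a "path" option as Spark file sources conventionally do:

df =
  session
  |> SparkEx.Reader.new()
  |> SparkEx.Reader.format("parquet")
  # "path" is the conventional Spark option for a default load location
  |> SparkEx.Reader.option("path", "/data/events/")
  |> SparkEx.Reader.load()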

load(reader, paths_or_opts)

@spec load(t(), String.t() | [String.t()] | nil | keyword()) :: SparkEx.DataFrame.t()
@spec load(GenServer.server(), String.t()) :: SparkEx.DataFrame.t()

load(reader, paths_or_opts, opts)

@spec load(t(), String.t() | [String.t()] | nil | keyword(), keyword()) ::
  SparkEx.DataFrame.t()
@spec load(GenServer.server(), String.t(), String.t() | [String.t()] | keyword()) ::
  SparkEx.DataFrame.t()

load(session, format, paths_or_opts, opts)

Loads data in the given format from a session directly, without first building a reader.

new(session)

@spec new(GenServer.server()) :: t()

Creates a stateful reader builder.

Mirrors PySpark's spark.read builder style.
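
Examples

A typical pipeline (the format, option key, and path are illustrative):

df =
  session
  |> SparkEx.Reader.new()
  |> SparkEx.Reader.format("csv")
  |> SparkEx.Reader.option("header", "true")
  |> SparkEx.Reader.load("/data/users.csv")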

option(reader, key, value)

@spec option(t(), String.t(), term()) :: t()

Sets a single option on a reader builder.

options(reader, opts)

@spec options(t(), map() | keyword()) :: t()

Merges options into a reader builder.
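
Examples

Merging several string options in one call (the keys shown are standard Spark CSV options, used here as examples):

reader = SparkEx.Reader.options(reader, %{"header" => "true", "sep" => ";"})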

orc(session, paths, opts \\ [])

Creates a DataFrame by reading ORC file(s).

Options

  • :schema — optional schema string
  • :options — map of ORC reader options

Examples

df = SparkEx.Reader.orc(session, "/data/events.orc")

parquet(session, paths, opts \\ [])

@spec parquet(GenServer.server(), String.t() | [String.t()], keyword()) ::
  SparkEx.DataFrame.t()

Creates a DataFrame by reading Parquet file(s).

Options

  • :schema — optional schema string (e.g. "id INT, name STRING")
  • :options — map of Parquet reader options (default: %{})

Examples

df = SparkEx.Reader.parquet(session, "/data/events.parquet")
df = SparkEx.Reader.parquet(session, ["/data/part1.parquet", "/data/part2.parquet"])

schema(reader, schema_ddl)

Sets the schema for a reader builder.

Accepts a DDL string, a struct type built with SparkEx.Types, or a Spark Connect DataType protobuf struct.

Examples

reader |> Reader.schema("id LONG, name STRING")
reader |> Reader.schema(SparkEx.Types.struct_type([
  SparkEx.Types.struct_field("id", :long),
  SparkEx.Types.struct_field("name", :string)
]))

table(reader, table_name)

@spec table(t(), String.t()) :: SparkEx.DataFrame.t()

Reads a table from the catalog using reader builder options.
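
Examples

An illustrative builder read; "versionAsOf" is a Delta Lake time-travel option, used here only as an example of a table-read option:

df =
  session
  |> SparkEx.Reader.new()
  |> SparkEx.Reader.option("versionAsOf", "1")
  |> SparkEx.Reader.table("my_db.my_table")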

table(session, table_name, opts \\ [])

Creates a DataFrame from a named table (catalog table).

Options

  • :options — map of string options (default: %{})

Examples

df = SparkEx.Reader.table(session, "my_database.my_table")

text(session, paths, opts \\ [])

Creates a DataFrame by reading text file(s).

Each line of the file becomes a row with a single "value" column.

Options

  • :options — map of text reader options

Examples

df = SparkEx.Reader.text(session, "/data/lines.txt")

xml(session, paths, opts \\ [])

Creates a DataFrame by reading XML file(s).

Options

  • :schema — optional schema string
  • :options — map of XML reader options
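
Examples

The path is illustrative; "rowTag" is the standard Spark XML option naming the element that becomes a row:

df = SparkEx.Reader.xml(session, "/data/books.xml", options: %{"rowTag" => "book"})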