Dsxir.Evaluate
(dsxir v0.1.0)
Devset evaluation runner.
Rows are fanned out via Task.Supervisor.async_stream_nolink/4 under
Dsxir.TaskSupervisor. Settings are snapshotted once on the caller and replayed
in each worker via Dsxir.Settings.run/2, so settings-scoped state (lm, adapter,
metadata, cache) is preserved across workers.
Per-example errors are caught at the worker boundary, classified via
Dsxir.Errors.class_of/1, and counted in
EvaluationResult.errors.by_class. The runner does not abort on individual
row failures; run!/2 raises after the run completes when any row errored.
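The fan-out and error-counting behaviour described above can be sketched with the standard library alone. This is a minimal illustration, not the dsxir implementation: the anonymous supervisor stands in for Dsxir.TaskSupervisor, `score_row` is a hypothetical per-row function, the exception module is used as a stand-in for whatever Dsxir.Errors.class_of/1 returns, and settings replay is omitted.

```elixir
# Stand-in for Dsxir.TaskSupervisor (assumption: any Task.Supervisor works here).
{:ok, sup} = Task.Supervisor.start_link()

# Hypothetical per-row work: raises on negative input.
score_row = fn
  n when n < 0 -> raise ArgumentError, "negative input"
  n -> n * 1.0
end

results =
  sup
  |> Task.Supervisor.async_stream_nolink(
    [1, -2, 3],
    fn row ->
      # Catch at the worker boundary so one bad row cannot abort the run.
      try do
        {:ok, score_row.(row)}
      rescue
        e -> {:error, e.__struct__}
      end
    end,
    max_concurrency: 2,
    timeout: 5_000,
    on_timeout: :kill_task
  )
  |> Enum.map(fn
    {:ok, outcome} -> outcome
    # Crashes and timeouts that escape the try/rescue surface as exits.
    {:exit, reason} -> {:error, {:exit, reason}}
  end)

# Tally failures per class, similar in spirit to EvaluationResult.errors.by_class.
by_class =
  for {:error, class} <- results, reduce: %{} do
    acc -> Map.update(acc, class, 1, &(&1 + 1))
  end

IO.inspect(results, label: "results")
IO.inspect(by_class, label: "by_class")
```

Because the tasks are not linked to the caller, a crashing worker shows up as an `{:exit, reason}` element in the stream instead of taking the caller down with it.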
Telemetry:
[:dsxir, :evaluate, :item]: emitted once per row.
  Measurements: %{duration, metric_value} (metric_value: nil on error).
  Metadata: %{example, prediction, error_class} (prediction: nil on error, error_class: nil on success).
[:dsxir, :evaluate, :stop]: emitted once per run.
  Measurements: %{duration, score, total, error_count, save_as} (save_as: nil when not set).
  Metadata: %{evaluator, devset_size, max_errors}.
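A handler for the per-row event could look roughly like the sketch below. It uses the `:telemetry` library's `attach/4` and `execute/3`; the `EvalLogger` module, the handler id, and the `:hypothetical_error` class are illustrative names that do not come from dsxir. The event is emitted by hand here so the snippet runs without the evaluator.

```elixir
Mix.install([{:telemetry, "~> 1.0"}])

defmodule EvalLogger do
  # :telemetry handlers receive the event name, measurements, metadata, and config.
  # Here the config is the pid that should be told about each row.
  def handle([:dsxir, :evaluate, :item], measurements, metadata, pid) do
    send(pid, {:item, measurements.metric_value, metadata.error_class})
  end
end

:ok =
  :telemetry.attach(
    "eval-item-logger",
    [:dsxir, :evaluate, :item],
    &EvalLogger.handle/4,
    self()
  )

# Simulate what a failed row would emit: metric_value and prediction are nil on error.
:telemetry.execute(
  [:dsxir, :evaluate, :item],
  %{duration: 1_200, metric_value: nil},
  %{example: %{}, prediction: nil, error_class: :hypothetical_error}
)

# Handlers run synchronously in the emitting process, so the message is already queued.
received =
  receive do
    {:item, metric_value, error_class} -> {metric_value, error_class}
  after
    0 -> :nothing
  end

IO.inspect(received, label: "item event")
```

Handlers run inline in the process that emits the event, so anything slow is better forwarded to another process, as done here with `send/2`.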
When :save_as is set, the result rows are written to disk as JSON Lines
(one JSON value per line) before run/2 returns.
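For readers unfamiliar with the format, here is a small sketch of writing and reading JSON Lines. It assumes Elixir 1.18 or later for the built-in `JSON` module. The row keys (`"input"`, `"score"`, `"error_class"`) and the file name are made up for illustration; the actual row shape written by dsxir is not specified here.

```elixir
# Hypothetical rows; the real column names are not documented in this section.
rows = [
  %{"input" => "2+2", "score" => 1.0, "error_class" => nil},
  %{"input" => "bad", "score" => 0.0, "error_class" => "invalid"}
]

path = Path.join(System.tmp_dir!(), "eval_rows.jsonl")

# JSON Lines: one encoded JSON value per line, each terminated by a newline.
File.write!(path, Enum.map(rows, &[JSON.encode!(&1), "\n"]))

# Read the file back line by line, skipping any blank lines.
loaded =
  path
  |> File.stream!()
  |> Stream.map(&String.trim/1)
  |> Stream.reject(&(&1 == ""))
  |> Enum.map(&JSON.decode!/1)

IO.inspect(loaded, label: "loaded rows")
```

Since each line is an independent JSON value, a saved run can be streamed row by row without loading the whole file into memory.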
Types
@type t() :: %Dsxir.Evaluate{
  devset: [Dsxir.Example.t()],
  failure_score: float(),
  max_errors: non_neg_integer(),
  metric: Dsxir.Metric.t(),
  num_threads: pos_integer(),
  save_as: nil | Path.t(),
  timeout: pos_integer()
}
Functions
@spec run(t(), Dsxir.Program.t()) :: Dsxir.EvaluationResult.t()
Evaluate program over the configured devset. Per-row failures are caught
and reported in the returned EvaluationResult; the run never aborts on a
single error. When :save_as is set, the rows are persisted as JSON Lines
before returning.
@spec run!(t(), Dsxir.Program.t()) :: Dsxir.EvaluationResult.t()
Bang variant of run/2. Returns the result when zero rows errored and
otherwise raises Dsxir.Errors.Framework.PredictorError with the per-class
error counts.
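The difference between run/2 and run!/2 might be used as in the sketch below. It requires the dsxir package and will not compile without it. The `MyEval` module and its function names are hypothetical; `devset`, `metric`, and `program` are assumed to be built elsewhere; the field values are arbitrary; and treating `timeout` as milliseconds is an assumption. Only the struct fields, the two function names, `errors.by_class`, and the exception module come from the documentation above.

```elixir
defmodule MyEval do
  # Hypothetical helper: builds the evaluator struct from the documented fields.
  def evaluator(devset, metric) do
    %Dsxir.Evaluate{
      devset: devset,
      metric: metric,
      num_threads: 4,
      max_errors: 10,
      failure_score: 0.0,
      timeout: 30_000, # assumed to be milliseconds
      save_as: nil
    }
  end

  # run/2 always returns a result; inspect the per-class error counts afterwards.
  def lenient(devset, metric, program) do
    result = Dsxir.Evaluate.run(evaluator(devset, metric), program)
    {result, result.errors.by_class}
  end

  # run!/2 raises once the run has completed if any row errored.
  def strict(devset, metric, program) do
    {:ok, Dsxir.Evaluate.run!(evaluator(devset, metric), program)}
  rescue
    e in Dsxir.Errors.Framework.PredictorError ->
      {:error, Exception.message(e)}
  end
end
```

The lenient form suits exploratory runs where partial results are still useful. The strict form suits CI-style checks where any row failure should fail the job.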