Dsxir.Evaluate (dsxir v0.1.0)

Devset evaluation runner.

Rows are fanned out via Task.Supervisor.async_stream_nolink/4 under Dsxir.TaskSupervisor. Settings are snapshotted once on the caller and replayed in each worker via Dsxir.Settings.run/2, so settings-scoped state (lm, adapter, metadata, cache) is preserved across workers.
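
The snapshot-and-replay pattern described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not Dsxir's implementation: `FanOutSketch`, `with_settings/2`, `current_settings/0`, and `map_rows/4` are hypothetical names, and the process dictionary stands in for whatever storage `Dsxir.Settings` actually uses. Only `Task.Supervisor.async_stream_nolink/4` and its options are real API.

```elixir
defmodule FanOutSketch do
  # Hypothetical stand-in for settings-scoped state (not Dsxir.Settings).
  @key :sketch_settings

  # Install `settings` in the current process, run `fun`, then restore
  # whatever was there before.
  def with_settings(settings, fun) do
    previous = Process.get(@key)
    Process.put(@key, settings)

    try do
      fun.()
    after
      if previous == nil, do: Process.delete(@key), else: Process.put(@key, previous)
    end
  end

  def current_settings, do: Process.get(@key, %{})

  # Fan `rows` out to supervised, unlinked tasks. The settings are read once
  # in the caller and replayed inside every worker process.
  def map_rows(supervisor, rows, fun, opts \\ []) do
    snapshot = current_settings()

    supervisor
    |> Task.Supervisor.async_stream_nolink(
      rows,
      fn row -> with_settings(snapshot, fn -> fun.(row) end) end,
      max_concurrency: Keyword.get(opts, :max_concurrency, 4),
      timeout: Keyword.get(opts, :timeout, 5_000),
      on_timeout: :kill_task
    )
    |> Enum.to_list()
  end
end

{:ok, sup} = Task.Supervisor.start_link()

results =
  FanOutSketch.with_settings(%{lm: :fake_lm}, fn ->
    FanOutSketch.map_rows(sup, [1, 2, 3], fn row ->
      # Runs in a worker process, yet still sees the caller's settings.
      {row, FanOutSketch.current_settings().lm}
    end)
  end)
```

Without the replay step, each worker would start with an empty process dictionary, because process-local state is not inherited by spawned tasks.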

Per-example errors are caught at the worker boundary, classified via Dsxir.Errors.class_of/1, and counted in EvaluationResult.errors.by_class. The runner does not abort on individual row failures; run!/2 raises after the run completes when any row errored.
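
A minimal sketch of catching errors at the worker boundary and counting them per class, assuming nothing about Dsxir's internals. `RowErrorSketch` and its `class_of/1` are hypothetical: the real classes returned by `Dsxir.Errors.class_of/1` are not listed on this page, so the atoms below are placeholders, and the `count` field is an assumption added for illustration.

```elixir
defmodule RowErrorSketch do
  # Placeholder classifier; the class atoms are invented for this example.
  def class_of(%ArgumentError{}), do: :invalid
  def class_of(%RuntimeError{}), do: :framework
  def class_of(_other), do: :unknown

  # Worker boundary: run one row and turn a raised exception into a tagged
  # tuple. Only exceptions are handled here, not throws or exits.
  def safe_row(row, fun) do
    {:ok, fun.(row)}
  rescue
    exception -> {:error, class_of(exception)}
  end

  # Process every row, then tally the failures per class.
  def summarize(rows, fun) do
    outcomes = Enum.map(rows, &safe_row(&1, fun))
    by_class = Enum.frequencies(for {:error, class} <- outcomes, do: class)
    count = by_class |> Map.values() |> Enum.sum()

    %{outcomes: outcomes, errors: %{count: count, by_class: by_class}}
  end
end

summary =
  RowErrorSketch.summarize([1, 0, 2, :oops], fn
    0 -> raise ArgumentError, "zero is not allowed"
    n when is_integer(n) -> 10 / n
    _ -> raise "unexpected row"
  end)
```

Both failing rows are recorded and the remaining rows still complete, which mirrors the behavior described above: a single failure never stops the run.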

Telemetry:

  • [:dsxir, :evaluate, :item] — per row. Measurements: %{duration, metric_value} (metric_value: nil on error). Metadata: %{example, prediction, error_class} (prediction: nil on error, error_class: nil on success).
  • [:dsxir, :evaluate, :stop] — once. Measurements: %{duration, score, total, error_count, save_as} (save_as: nil when not set). Metadata: %{evaluator, devset_size, max_errors}.
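
The two events listed above could be consumed with a handler along these lines. This is a sketch that assumes the `:telemetry` package is available; `EvalTelemetrySketch`, `attach/1`, and `describe/3` are hypothetical names, and the handler relies only on the measurement and metadata keys documented on this page. The unit of `duration` is not stated here, so the sketch does not convert or print it.

```elixir
defmodule EvalTelemetrySketch do
  @events [[:dsxir, :evaluate, :item], [:dsxir, :evaluate, :stop]]

  # Registers the handler for both events. Needs the :telemetry package at
  # call time; nothing is attached when this module is merely loaded.
  def attach(handler_id \\ "eval-telemetry-sketch") do
    :telemetry.attach_many(handler_id, @events, &__MODULE__.handle_event/4, %{})
  end

  # Standard four-argument telemetry handler.
  def handle_event(event, measurements, metadata, _config) do
    IO.puts(describe(event, measurements, metadata))
  end

  # Successful row: error_class is nil and metric_value is set.
  def describe([:dsxir, :evaluate, :item], %{metric_value: value}, %{error_class: nil}) do
    "item ok metric=#{inspect(value)}"
  end

  # Failed row: error_class carries the classification.
  def describe([:dsxir, :evaluate, :item], _measurements, %{error_class: class}) do
    "item failed class=#{inspect(class)}"
  end

  # End-of-run summary.
  def describe([:dsxir, :evaluate, :stop], %{score: score, total: total, error_count: errors}, _meta) do
    "run finished score=#{inspect(score)} total=#{total} errors=#{errors}"
  end
end

# The formatting logic can be exercised without :telemetry installed.
line =
  EvalTelemetrySketch.describe(
    [:dsxir, :evaluate, :item],
    %{duration: 1, metric_value: 1.0},
    %{example: nil, prediction: nil, error_class: nil}
  )
```

Keeping the formatting in a pure function separates it from the side effect, so the handler can be tested without emitting real events.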

When :save_as is set, the result rows are written to disk as JSON Lines (one row per line) before run/2 returns.

Types

t()

@type t() :: %Dsxir.Evaluate{
  devset: [Dsxir.Example.t()],
  failure_score: float(),
  max_errors: non_neg_integer(),
  metric: Dsxir.Metric.t(),
  num_threads: pos_integer(),
  save_as: nil | Path.t(),
  timeout: pos_integer()
}

Functions

run(ev, program)

Evaluate program over the configured devset. Per-row failures are caught and reported in the returned EvaluationResult; the run never aborts on a single error. When :save_as is set, the rows are persisted as JSON Lines before returning.

run!(ev, program)

Bang variant of run/2. Returns the result when zero rows errored and otherwise raises Dsxir.Errors.Framework.PredictorError with the per-class error counts.
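
The relationship between the two functions follows the usual Elixir bang convention, which can be sketched in a self-contained form. Everything below is a stand-in rather than Dsxir code: `BangSketch` plays the role of the runner, `BangSketch.RowsFailed` plays the role of `Dsxir.Errors.Framework.PredictorError`, and the exception's struct name is used as the error class purely for illustration.

```elixir
defmodule BangSketch do
  defmodule RowsFailed do
    # Hypothetical exception carrying the per-class error counts.
    defexception [:by_class, :message]
  end

  # Non-bang variant: always returns a result, failures included.
  def run(rows, fun) do
    outcomes =
      Enum.map(rows, fn row ->
        try do
          {:ok, fun.(row)}
        rescue
          exception -> {:error, exception.__struct__}
        end
      end)

    by_class = Enum.frequencies(for {:error, class} <- outcomes, do: class)
    %{outcomes: outcomes, errors: %{by_class: by_class}}
  end

  # Bang variant: completes the whole run first, then raises if any row failed.
  def run!(rows, fun) do
    case run(rows, fun) do
      %{errors: %{by_class: by_class}} = result when map_size(by_class) == 0 ->
        result

      %{errors: %{by_class: by_class}} ->
        raise RowsFailed, by_class: by_class, message: "rows failed: #{inspect(by_class)}"
    end
  end
end

# All rows succeed, so the result is returned unchanged.
ok = BangSketch.run!([1, 2], fn n -> n * 2 end)

# One row fails, so run!/2 raises once every row has been processed.
raised =
  try do
    BangSketch.run!([1, :bad], fn n -> n * 2 end)
  rescue
    e in BangSketch.RowsFailed -> e.by_class
  end
```

Callers that want to inspect partial results would use the non-bang form, while the bang form suits scripts and CI jobs where any failed row should stop the pipeline.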