Dspy.GEPA (DSPex v0.11.0)


Experimental: This class may change or be removed in a future release without warning (introduced in v3.0.0).

GEPA is an evolutionary optimizer that uses reflection to evolve the text components of complex systems. It was proposed in the paper GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning. The GEPA optimization engine is provided by the gepa package, available from https://github.com/gepa-ai/gepa.

GEPA captures full traces of the DSPy module's execution, identifies the parts of the trace corresponding to a specific predictor, and reflects on the behaviour of the predictor to propose a new instruction for the predictor. GEPA allows users to provide textual feedback to the optimizer, which is used to guide the evolution of the predictor. The textual feedback can be provided at the granularity of individual predictors, or at the level of the entire system's execution.

To provide feedback to the GEPA optimizer, implement a metric as follows:

def metric(
    gold: Example,
    pred: Prediction,
    trace: Optional[DSPyTrace] = None,
    pred_name: Optional[str] = None,
    pred_trace: Optional[DSPyTrace] = None,
) -> float | ScoreWithFeedback:
    """
    This function is called with the following arguments:
    - gold: The gold example.
    - pred: The predicted output.
    - trace: Optional. The trace of the program's execution.
    - pred_name: Optional. The name of the target predictor currently being optimized by GEPA, for which
        the feedback is being requested.
    - pred_trace: Optional. The trace of the target predictor's execution GEPA is seeking feedback for.

    Note the `pred_name` and `pred_trace` arguments. During optimization, GEPA will call the metric to obtain
    feedback for individual predictors being optimized. GEPA provides the name of the predictor in `pred_name`
    and the sub-trace (of the trace) corresponding to the predictor in `pred_trace`.
    If feedback is available at the predictor level, the metric should return
    {'score': float, 'feedback': str} for that predictor.
    If predictor-level feedback is not available, the metric may instead return
    program-level text feedback (using just the gold, pred, and trace).
    If no feedback is returned, GEPA falls back to a simple text feedback
    consisting of just the score: f"This trajectory got a score of {score}."
    """
    ...
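As a hedged illustration of the signature above, here is a minimal metric for an exact-match QA task. The attribute name (`answer`) and the dict-shaped return value are assumptions for this sketch: in real use `gold` would be a dspy.Example and `pred` a dspy.Prediction, and the dict stands in for ScoreWithFeedback.

```python
# Hypothetical sketch of a GEPA feedback metric for an exact-match QA task.
# Assumptions (not part of the GEPA API): `gold` and `pred` expose an
# `.answer` attribute, and ScoreWithFeedback is modeled as a plain dict
# with 'score' and 'feedback' keys, matching the shape described above.

def exact_match_with_feedback(gold, pred, trace=None, pred_name=None, pred_trace=None):
    correct = gold.answer.strip().lower() == pred.answer.strip().lower()
    score = 1.0 if correct else 0.0
    if pred_name is None:
        # Plain evaluation call: a bare float score is sufficient.
        return score
    # GEPA is asking for feedback on a specific predictor: return the
    # score plus a textual explanation to guide reflection.
    feedback = (
        f"Correct: produced '{pred.answer}'."
        if correct
        else f"Incorrect: produced '{pred.answer}', expected '{gold.answer}'."
    )
    return {"score": score, "feedback": feedback}
```

A richer metric could also inspect `pred_trace` to comment on the specific inputs and outputs of the predictor being optimized.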

GEPA can also be used as a batch inference-time search strategy: pass valset=trainset, track_stats=True, and track_best_outputs=True, then read the detailed_results attribute of the optimized program returned by compile to obtain the Pareto frontier of the batch. optimized_program.detailed_results.best_outputs_valset will contain the best outputs for each task in the batch.

Parameters

  • metric - The metric function to use for feedback and evaluation.
  • auto - The auto budget to use for the run. Options: "light", "medium", "heavy".
  • max_full_evals - The maximum number of full evaluations to perform.
  • max_metric_calls - The maximum number of metric calls to perform.
  • reflection_minibatch_size - The number of examples to use for reflection in a single GEPA step. Default is 3.
  • candidate_selection_strategy - The strategy to use for candidate selection. Options: "pareto", "current_best". Default is "pareto", which stochastically selects candidates from the Pareto frontier of all validation scores.
  • reflection_lm - The language model to use for reflection. Required parameter. GEPA benefits from a strong reflection model. Consider using dspy.LM(model='gpt-5', temperature=1.0, max_tokens=32000) for optimal performance.
  • skip_perfect_score - Whether to skip examples with perfect scores during reflection. Default is True.
  • instruction_proposer - Optional custom instruction proposer implementing GEPA's ProposalFn protocol. Default: None (recommended for most users), which uses GEPA's proven instruction proposer from the GEPA library. This default proposer is highly capable and was validated across the diverse experiments reported in the GEPA paper and tutorials. Note: when both instruction_proposer and reflection_lm are set, the instruction_proposer is called in the reflection_lm context; however, reflection_lm is optional when using a custom instruction_proposer, since custom proposers can invoke their own LLMs if needed.
  • component_selector - Custom component selector implementing the ReflectionComponentSelector protocol, or a string naming a built-in selector strategy. Controls which components (predictors) are selected for optimization at each iteration. Default is 'round_robin', which cycles through components one at a time. Available string options: 'round_robin' (cycles through components sequentially) and 'all' (selects all components for simultaneous optimization). Custom selectors can implement LLM-driven selection logic based on the optimization state and trajectories. See the gepa component selectors for available built-in selectors and the ReflectionComponentSelector protocol for implementing custom selectors.
  • add_format_failure_as_feedback - Whether to add format failures as feedback. Default is False.
  • use_merge - Whether to use merge-based optimization. Default is True.
  • max_merge_invocations - The maximum number of merge invocations to perform. Default is 5.
  • num_threads - The number of threads to use for evaluation with Evaluate. Optional.
  • failure_score - The score to assign to failed examples. Default is 0.0.
  • perfect_score - The maximum score achievable by the metric. Default is 1.0. Used by GEPA to determine if all examples in a minibatch are perfect.
  • log_dir - The directory to save the logs. GEPA saves elaborate logs, along with all candidate programs, in this directory. Running GEPA with the same log_dir resumes the run from the last checkpoint.
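For intuition, the "pareto" candidate-selection strategy described above can be sketched in plain Python. This is an illustrative reimplementation, not the gepa library's actual code; candidate scores are modeled here as a dict mapping candidate ids to per-task validation scores:

```python
import random

def pareto_frontier(scores_by_candidate):
    # A candidate is on the Pareto frontier if it achieves the best
    # score on at least one validation task.
    num_tasks = len(next(iter(scores_by_candidate.values())))
    frontier = set()
    for task in range(num_tasks):
        best = max(scores[task] for scores in scores_by_candidate.values())
        for cand, scores in scores_by_candidate.items():
            if scores[task] == best:
                frontier.add(cand)
    return frontier

def select_candidate(scores_by_candidate, rng=random):
    # Stochastic choice among frontier members, as in strategy "pareto".
    # Strategy "current_best" would instead always pick the candidate
    # with the highest aggregate score.
    return rng.choice(sorted(pareto_frontier(scores_by_candidate)))
```

This is why valset scores are tracked per task: a candidate that is best on even one task survives on the frontier and can be selected for further evolution.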

Summary

Functions

GEPA uses the trainset to perform reflective updates to the prompt, but uses the valset for tracking Pareto scores.

Get the parameters of the teleprompter.

Initialize self. See help(type(self)) for accurate signature.

Types

t()

@opaque t()

Functions

auto_budget(ref, num_preds, num_candidates, valset_size, args, opts \\ [])

@spec auto_budget(SnakeBridge.Ref.t(), term(), term(), integer(), [term()], keyword()) ::
  {:ok, integer()} | {:error, Snakepit.Error.t()}

Python method GEPA.auto_budget.

Parameters

  • num_preds (term())
  • num_candidates (term())
  • valset_size (integer())
  • minibatch_size (integer() default: 35)
  • full_eval_steps (integer() default: 5)

Returns

  • integer()

compile(ref, student, opts \\ [])

@spec compile(SnakeBridge.Ref.t(), term(), keyword()) ::
  {:ok, term()} | {:error, Snakepit.Error.t()}

GEPA uses the trainset to perform reflective updates to the prompt, but uses the valset for tracking Pareto scores.

If no valset is provided, GEPA will use the trainset for both.

Parameters:

  • student: The student module to optimize.
  • trainset: The training set to use for reflective updates.
  • valset: The validation set to use for tracking Pareto scores. If not provided, GEPA will use the trainset for both.

Parameters

  • student (term())
  • trainset (list(term()) keyword-only, required)
  • teacher (term() keyword-only default: None)
  • valset (term() keyword-only default: None)

Returns

  • term()

get_params(ref, opts \\ [])

@spec get_params(
  SnakeBridge.Ref.t(),
  keyword()
) :: {:ok, %{optional(String.t()) => term()}} | {:error, Snakepit.Error.t()}

Get the parameters of the teleprompter.

Returns

  • %{optional(String.t()) => term()}

new(metric, opts \\ [])

@spec new(
  term(),
  keyword()
) :: {:ok, SnakeBridge.Ref.t()} | {:error, Snakepit.Error.t()}

Initialize self. See help(type(self)) for accurate signature.

Parameters

  • metric (term())
  • auto (term() | nil keyword-only default: None)
  • max_full_evals (term() keyword-only default: None)
  • max_metric_calls (term() keyword-only default: None)
  • reflection_minibatch_size (integer() keyword-only default: 3)
  • candidate_selection_strategy (term() keyword-only default: 'pareto')
  • reflection_lm (term() keyword-only default: None)
  • skip_perfect_score (boolean() keyword-only default: True)
  • add_format_failure_as_feedback (boolean() keyword-only default: False)
  • instruction_proposer (term() keyword-only default: None)
  • component_selector (term() keyword-only default: 'round_robin')
  • use_merge (boolean() keyword-only default: True)
  • max_merge_invocations (term() keyword-only default: 5)
  • num_threads (term() keyword-only default: None)
  • failure_score (float() keyword-only default: 0.0)
  • perfect_score (float() keyword-only default: 1.0)
  • log_dir (term() keyword-only default: None)
  • track_stats (boolean() keyword-only default: False)
  • use_wandb (boolean() keyword-only default: False)
  • wandb_api_key (term() keyword-only default: None)
  • wandb_init_kwargs (term() keyword-only default: None)
  • track_best_outputs (boolean() keyword-only default: False)
  • warn_on_score_mismatch (boolean() keyword-only default: True)
  • use_mlflow (boolean() keyword-only default: False)
  • seed (term() keyword-only default: 0)
  • gepa_kwargs (term() keyword-only default: None)