Benchmark Evaluation


Benchmark results for PTC-Lisp with guidance on model selection and improving reliability.

Results Summary (v0.4.1)

| Model | Pass Rate | Duration | Cost | Notes |
|-------|-----------|----------|------|-------|
| Claude Haiku 4.5 | 100% | 4.4m | $0.024 | Highest reliability |
| Gemini 2.5 Flash | 92.6% | 2.8m | $0.009 | Fastest, good value |
| DeepSeek v3 | 92.6% | 9.2m | $0.009 | Cost-effective |

Configuration: 19 tests, 5 runs per model, schema data mode (January 2026)

Interpreting Results

Take these numbers with a grain of salt. This is a single benchmark run against a specific test suite. Results vary between runs, and the gap between 100% and 92.6% amounts to just seven failed runs out of 95 (19 tests × 5 runs).

What We Can Say

  • All models handle PTC-Lisp syntax well. Basic to intermediate queries (filtering, aggregation, sorting, joins) pass consistently across all models; a representative sketch follows this list.
  • Complex multi-step analysis is harder. Tests involving temporal trends, budget optimization, and chained aggregations caused most failures.
  • Failure patterns are similar. DeepSeek and Gemini failed on the same tests, suggesting test difficulty rather than model-specific issues.
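
For a sense of scale, here is what a typical intermediate query looks like. This is an illustrative sketch, not one of the benchmark tests: it assumes Clojure-style filter, count, and, and = are available (the sample program below confirms the other constructs), and ctx/orders with its :total and :tier fields is hypothetical.

; Query: How many orders over $500 came from premium customers,
; and what are they worth in total?
(def matching
  (filter (fn [o] (and (> (:total o) 500)
                       (= (:tier o) "premium")))
          ctx/orders))

(return {:count (count matching)
         :total_value (reduce (fn [acc o] (+ acc (:total o))) 0 matching)})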

What We Can't Say

  • That 100% means "best model" for your use case
  • That 92.6% means a model is unreliable
  • That these results generalize to all domains

Example: Generated Program

Here's a program generated by Claude Haiku for a budget optimization query. It demonstrates what PTC-Lisp can express:

; Query: Select products to restock with $50,000 budget, maximizing expected revenue

(def products ctx/products)

; Add calculated fields: value_ratio and expected_revenue
(def enriched
  (map (fn [p]
         (assoc p
           :value_ratio (/ (:stock p) (:price p))
           :expected_revenue (* (:price p) (:stock p))))
       products))

; Sort by value_ratio descending (greedy: best bang for buck first)
(def sorted-products (sort-by :value_ratio > enriched))

; Greedy selection: pick products until budget exhausted
(def budget 50000)
(def result
  (reduce
    (fn [acc product]
      (let [current-cost (:total_cost acc)
            new-cost (+ current-cost (:price product))]
        (if (<= new-cost budget)
          (-> acc
              (update :product_ids conj (:id product))
              (update :total_cost + (:price product))
              (update :expected_revenue + (:expected_revenue product)))
          acc)))
    {:product_ids [] :total_cost 0 :expected_revenue 0}
    sorted-products))

(return result)

This shows data enrichment, sorting, the accumulator pattern with reduce, and thread-first macros — all generated from a natural language query.

Test Configuration

Test Categories

| Level | Tests | Turn Limit | Description |
|-------|-------|------------|-------------|
| Basic | 1-5 | 1 | count, filter, sum, avg |
| Intermediate | 6-10 | 1 | compound filters, sort, find extremes |
| Advanced | 11-15 | 1 | cross-dataset joins, grouped aggregation |
| Multi-turn | 16-19 | 1-4 | tool calls, temporal analysis, optimization |

Single-shot tests (turn limit 1) are unforgiving — no recovery from errors.

Hardest Tests

Three tests accounted for most of the failures:

| Test | Challenge | Failure Mode |
|------|-----------|--------------|
| #15: Employee with most rejected claims | Group → count → find max | Confused max-by with max-key |
| #18: Month with highest growth rate | Temporal grouping, sequential comparison | Missing partition function |
| #19: Budget optimization | Greedy algorithm with constraints | Heap limits on naive approaches |
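
For test #15, the intended shape is group → count → find max. One way to sidestep the max-by vs max-key confusion is to skip both and sort. The sketch below uses only constructs shown in the sample program, plus a Clojure-style group-by, first, and second, which are assumptions here; ctx/claims and its fields are illustrative.

; Count rejected claims per employee, then take the top count
(def rejected
  (filter (fn [c] (= (:status c) "rejected")) ctx/claims))

(def counts
  (map (fn [entry]
         {:employee_id (first entry)
          :rejected (count (second entry))})
       (group-by :employee_id rejected)))

; Sort descending and take the head instead of reaching for
; max-by / max-key
(return (first (sort-by :rejected > counts)))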

Improving Reliability

1. Increase Turn Limits

For complex analytical queries, allow more iterations:

SubAgent.run(agent, context, max_turns: 8)  # default is 5

This helps when the model needs to explore data or recover from errors.

2. Prompt Customization

The base prompt includes common mistakes to avoid. Domain-specific examples can further improve reliability. See SubAgent Advanced.

3. Language Improvements (Ongoing)

Some failures stem from LLMs expecting Clojure functions that PTC-Lisp doesn't yet support (e.g., partition, float). We're actively adding commonly-expected functions to reduce friction.
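
Until then, missing functions can often be worked around with reduce. The sketch below recreates the consecutive-pair comparison that partition would provide for test #18, carrying the previous month through the accumulator. It assumes Clojure-style nil handling and truthiness; monthly-totals stands in for a hypothetical list of {:month ... :total ...} maps already sorted by month.

; Month-over-month growth rate without partition: keep the
; previous month in the accumulator and emit a rate once it exists
(def growth
  (reduce
    (fn [acc m]
      (let [prev (:prev acc)]
        (if prev
          {:prev m
           :rates (conj (:rates acc)
                        {:month (:month m)
                         :rate (/ (- (:total m) (:total prev))
                                  (:total prev))})}
          {:prev m :rates (:rates acc)})))
    {:prev nil :rates []}
    monthly-totals))

; Month with the highest growth rate
(return (first (sort-by :rate > (:rates growth))))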

Practical Recommendations

| Priority | Recommendation |
|----------|----------------|
| Reliability first | Claude Haiku — highest pass rate in testing |
| Speed first | Gemini Flash — 2-3x faster than alternatives |
| Cost first | DeepSeek or Gemini — similar cost, different speed |

Key insight: All models achieve 90%+ with default settings. For production, consider retry logic for complex queries regardless of model choice.

Running Benchmarks

cd demo

# Run benchmark with reports
mix lisp --test --runs=5 --report

# Specific model
mix lisp --test --model=haiku --runs=3

# Verbose output to debug failures
mix lisp --test --model=gemini -v

Via GitHub Actions:

gh workflow run benchmark.yml -f runs=5 -f dsl=lisp

Reports are saved to demo/reports/.

Further Reading