Benchmark Evaluation


Benchmark results for PTC-Lisp with guidance on model selection and improving reliability.

Results Summary (v0.4.1)

| Model | Pass Rate | Duration | Cost | Notes |
|-------|-----------|----------|------|-------|
| Claude Haiku 4.5 | 100% | 4.4m | $0.024 | Highest reliability |
| Gemini 2.5 Flash | 92.6% | 2.8m | $0.009 | Fastest, good value |
| DeepSeek v3 | 92.6% | 9.2m | $0.009 | Cost-effective |

Configuration: 19 tests, 5 runs per model, schema data mode (January 2026)

Interpreting Results

Take these numbers with a grain of salt. This is a single benchmark run against a specific test suite. Results vary between runs, and the gap between 100% and 92.6% amounts to just seven failed runs out of 95 (19 tests × 5 runs).

What We Can Say

  • All models handle PTC-Lisp syntax well. Basic to intermediate queries (filtering, aggregation, sorting, joins) pass consistently across all models; a representative sketch follows this list.
  • Complex multi-step analysis is harder. Tests involving temporal trends, budget optimization, and chained aggregations caused most failures.
  • Failure patterns are similar. DeepSeek and Gemini failed on the same tests, suggesting test difficulty rather than model-specific issues.
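
For a sense of scale, here is what a typical intermediate query looks like. This is an illustrative sketch, not one of the benchmark tests: it assumes Clojure-style filter, count, and, and = are available (the sample program below confirms the other constructs), and ctx/orders with its :total and :tier fields is hypothetical.

; Query: How many orders over $500 came from premium customers,
; and what are they worth in total?
(def matching
  (filter (fn [o] (and (> (:total o) 500)
                       (= (:tier o) "premium")))
          ctx/orders))

(return {:count (count matching)
         :total_value (reduce (fn [acc o] (+ acc (:total o))) 0 matching)})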

What We Can't Say

  • That 100% means "best model" for your use case
  • That 92.6% means a model is unreliable
  • That these results generalize to all domains

Example: Generated Program

Here's a program generated by Claude Haiku for a budget optimization query. It demonstrates what PTC-Lisp can express:

; Query: Select products to restock with $50,000 budget, maximizing expected revenue

(def products ctx/products)

; Add calculated fields: value_ratio and expected_revenue
(def enriched
  (map (fn [p]
         (assoc p
           :value_ratio (/ (:stock p) (:price p))
           :expected_revenue (* (:price p) (:stock p))))
       products))

; Sort by value_ratio descending (greedy: best bang for buck first)
(def sorted-products (sort-by :value_ratio > enriched))

; Greedy selection: pick products until budget exhausted
(def budget 50000)
(def result
  (reduce
    (fn [acc product]
      (let [current-cost (:total_cost acc)
            new-cost (+ current-cost (:price product))]
        (if (<= new-cost budget)
          (-> acc
              (update :product_ids conj (:id product))
              (update :total_cost + (:price product))
              (update :expected_revenue + (:expected_revenue product)))
          acc)))
    {:product_ids [] :total_cost 0 :expected_revenue 0}
    sorted-products))

(return result)

This shows data enrichment, sorting, the accumulator pattern with reduce, and thread-first macros — all generated from a natural language query.

Test Configuration

Test Categories

| Level | Tests | Turn Limit | Description |
|-------|-------|------------|-------------|
| Basic | 1-5 | 1 | count, filter, sum, avg |
| Intermediate | 6-10 | 1 | compound filters, sort, find extremes |
| Advanced | 11-15 | 1 | cross-dataset joins, grouped aggregation |
| Multi-turn | 16-19 | 1-4 | tool calls, temporal analysis, optimization |

Single-shot tests (turn limit 1) are unforgiving — no recovery from errors.

Hardest Tests

Three tests accounted for most of the failures:

| Test | Challenge | Failure Mode |
|------|-----------|--------------|
| #15: Employee with most rejected claims | Group → count → find max | Confused max-by with max-key |
| #18: Month with highest growth rate | Temporal grouping, sequential comparison | Missing partition function |
| #19: Budget optimization | Greedy algorithm with constraints | Heap limits on naive approaches |
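
For test #15, the intended shape is group → count → find max. One way to sidestep the max-by vs max-key confusion is to skip both and sort. The sketch below uses only constructs shown in the sample program, plus a Clojure-style group-by, first, and second, which are assumptions here; ctx/claims and its fields are illustrative.

; Count rejected claims per employee, then take the top count
(def rejected
  (filter (fn [c] (= (:status c) "rejected")) ctx/claims))

(def counts
  (map (fn [entry]
         {:employee_id (first entry)
          :rejected (count (second entry))})
       (group-by :employee_id rejected)))

; Sort descending and take the head instead of reaching for
; max-by / max-key
(return (first (sort-by :rejected > counts)))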

Improving Reliability

1. Increase Turn Limits

For complex analytical queries, allow more iterations:

SubAgent.run(agent, context, max_turns: 8)  # default is 5

This helps when the model needs to explore data or recover from errors.

2. Prompt Customization

The base prompt includes common mistakes to avoid. Domain-specific examples can further improve reliability. See SubAgent Advanced.

3. Language Improvements (Ongoing)

Some failures stem from LLMs expecting Clojure functions that PTC-Lisp doesn't yet support (e.g., partition, float). We're actively adding commonly-expected functions to reduce friction.
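
Until then, missing functions can often be worked around with reduce. The sketch below recreates the consecutive-pair comparison that partition would provide for test #18, carrying the previous month through the accumulator. It assumes Clojure-style nil handling and truthiness; monthly-totals stands in for a hypothetical list of {:month ... :total ...} maps already sorted by month.

; Month-over-month growth rate without partition: keep the
; previous month in the accumulator and emit a rate once it exists
(def growth
  (reduce
    (fn [acc m]
      (let [prev (:prev acc)]
        (if prev
          {:prev m
           :rates (conj (:rates acc)
                        {:month (:month m)
                         :rate (/ (- (:total m) (:total prev))
                                  (:total prev))})}
          {:prev m :rates (:rates acc)})))
    {:prev nil :rates []}
    monthly-totals))

; Month with the highest growth rate
(return (first (sort-by :rate > (:rates growth))))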

Practical Recommendations

| Priority | Recommendation |
|----------|----------------|
| Reliability first | Claude Haiku — highest pass rate in testing |
| Speed first | Gemini Flash — 2-3x faster than alternatives |
| Cost first | DeepSeek or Gemini — similar cost, different speed |

Key insight: All models achieve 90%+ with default settings. For production, consider retry logic for complex queries regardless of model choice.

Running Benchmarks

cd demo

# Run benchmark with reports
mix lisp --test --runs=5 --report

# Specific model
mix lisp --test --model=haiku --runs=3

# Verbose output to debug failures
mix lisp --test --model=gemini -v

Via GitHub Actions:

gh workflow run benchmark.yml -f runs=5 -f dsl=lisp

Reports are saved to demo/reports/.

Further Reading