# Benchmark Evaluation
Benchmark results for PTC-Lisp with guidance on model selection and improving reliability.
## Results Summary (v0.4.1)
| Model | Pass Rate | Duration | Cost | Notes |
|---|---|---|---|---|
| Claude Haiku 4.5 | 100% | 4.4m | $0.024 | Highest reliability |
| Gemini 2.5 Flash | 92.6% | 2.8m | $0.009 | Fastest, good value |
| DeepSeek v3 | 92.6% | 9.2m | $0.009 | Cost-effective |
Configuration: 19 tests, 5 runs per model, schema data mode (January 2026)
## Interpreting Results
Take these numbers with a grain of salt. This is a single benchmark run with a specific test suite. Results vary between runs, and small percentage differences (100% vs 92.6%) represent only a few failed tests out of 95.
### What We Can Say
- All models handle PTC-Lisp syntax well. Basic to intermediate queries (filtering, aggregation, sorting, joins) pass consistently across models (see the sketch after this list).
- Complex multi-step analysis is harder. Tests involving temporal trends, budget optimization, and chained aggregations caused most failures.
- Failure patterns are similar. DeepSeek and Gemini failed on the same tests, suggesting test difficulty rather than model-specific issues.
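As a point of reference, here is a minimal sketch of the kind of basic single-shot query every model handled reliably. The `ctx/products` dataset and the `:stock` field are illustrative; `count` and `filter` are among the level-1 operations listed in the test table below.

```clojure
; Basic single-shot query: how many products are in stock?
; ctx/products and :stock are illustrative names, not part of the benchmark data.
(return (count (filter (fn [p] (> (:stock p) 0)) ctx/products)))
```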
### What We Can't Say
- That 100% means "best model" for your use case
- That 92.6% means a model is unreliable
- That these results generalize to all domains
## Example: Generated Program
Here's a program generated by Claude Haiku for a budget optimization query. It demonstrates what PTC-Lisp can express:
```clojure
; Query: Select products to restock with $50,000 budget, maximizing expected revenue
(def products ctx/products)

; Add calculated fields: value_ratio and expected_revenue
(def enriched
  (map (fn [p]
         (assoc p
                :value_ratio (/ (:stock p) (:price p))
                :expected_revenue (* (:price p) (:stock p))))
       products))

; Sort by value_ratio descending (greedy: best bang for buck first)
(def sorted-products (sort-by :value_ratio > enriched))

; Greedy selection: pick products until budget exhausted
(def budget 50000)
(def result
  (reduce
    (fn [acc product]
      (let [current-cost (:total_cost acc)
            new-cost (+ current-cost (:price product))]
        (if (<= new-cost budget)
          (-> acc
              (update :product_ids conj (:id product))
              (update :total_cost + (:price product))
              (update :expected_revenue + (:expected_revenue product)))
          acc)))
    {:product_ids [] :total_cost 0 :expected_revenue 0}
    sorted-products))

(return result)
```

This shows data enrichment, sorting, the accumulator pattern with reduce, and thread-first macros — all generated from a natural language query.
## Test Configuration

### Test Categories
| Level | Tests | Turn Limit | Description |
|---|---|---|---|
| Basic | 1-5 | 1 | count, filter, sum, avg |
| Intermediate | 6-10 | 1 | compound filters, sort, find extremes |
| Advanced | 11-15 | 1 | cross-dataset joins, grouped aggregation |
| Multi-turn | 16-19 | 1-4 | tool calls, temporal analysis, optimization |
Single-shot tests (turn limit 1) are unforgiving — no recovery from errors.
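For illustration, an intermediate-level query of the kind these single-shot tests exercise might look like the following sketch. The field names are illustrative, and `first` is assumed to behave as in Clojure.

```clojure
; Intermediate shape: compound filter, sort, take the extreme.
; first is assumed to exist with Clojure semantics.
(def in-stock (filter (fn [p] (> (:stock p) 0)) ctx/products))
(def priciest (first (sort-by :price > in-stock)))
(return priciest)
```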
### Hardest Tests

Three tests accounted for most of the failures:
| Test | Challenge | Failure Mode |
|---|---|---|
| #15: Employee with most rejected claims | Group → count → find max | Confused max-by with max-key |
| #18: Month with highest growth rate | Temporal grouping, sequential comparison | Missing partition function |
| #19: Budget optimization | Greedy algorithm with constraints | Heap limits on naive approaches |
## Improving Reliability

### 1. Increase Turn Limits
For complex analytical queries, allow more iterations:
```elixir
SubAgent.run(agent, context, max_turns: 8)  # default is 5
```

This helps when the model needs to explore data or recover from errors.
### 2. Prompt Customization
The base prompt includes common mistakes to avoid. Domain-specific examples can further improve reliability. See SubAgent Advanced.
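For example, a domain-specific example embedded in the prompt could pair a natural-language query with its expected program. This pairing is hypothetical; `avg` is one of the level-1 operations from the test table above.

```clojure
; Prompt example (hypothetical pairing):
; Q: "What is the average product price?"
(return (avg (map :price ctx/products)))
```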
### 3. Language Improvements (Ongoing)
Some failures stem from LLMs expecting Clojure functions that PTC-Lisp doesn't yet support (e.g., partition, float). We're actively adding commonly-expected functions to reduce friction.
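Until such functions land, missing builtins can often be emulated with `reduce`. Below is a hedged sketch that pairs each month with its predecessor without `partition`; `months` is a hypothetical, already-sorted collection, and only constructs that appear in the generated program above are used.

```clojure
; Emulate Clojure's (partition 2 1 months) with reduce: pair each month
; with the previous one. The first month has no predecessor (:prev is nil).
(def pairs
  (:pairs
    (reduce (fn [acc m]
              (-> acc
                  (update :pairs conj {:prev (:prev acc) :cur m})
                  (assoc :prev m)))
            {:pairs [] :prev nil}
            months)))
```

Each pair can then yield a growth rate, from which the maximum is selected.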
## Practical Recommendations
| Priority | Recommendation |
|---|---|
| Reliability first | Claude Haiku — highest pass rate in testing |
| Speed first | Gemini Flash — 2-3x faster than alternatives |
| Cost first | DeepSeek or Gemini — similar cost, different speed |
Key insight: All models achieve 90%+ with default settings. For production, consider retry logic for complex queries regardless of model choice.
## Running Benchmarks
```bash
cd demo

# Run benchmark with reports
mix lisp --test --runs=5 --report

# Specific model
mix lisp --test --model=haiku --runs=3

# Verbose output to debug failures
mix lisp --test --model=gemini -v
```
Via GitHub Actions:
```bash
gh workflow run benchmark.yml -f runs=5 -f dsl=lisp
```

Reports are saved to `demo/reports/`.
## Further Reading
- SubAgent Getting Started — Basic usage
- SubAgent Advanced — Turn limits, truncation, prompts
- PTC-Lisp Specification — Language reference