# `CMDCEval`
[🔗](https://github.com/tupleyun/cmdc_eval/blob/v0.1.0/lib/cmdc_eval.ex#L1)

Evaluation framework (benchmark harness) for CMDC Agent. It plugs into public benchmarks as well as custom suites.

## Core abstractions

| Concept | Module | Responsibility |
|---|---|---|
| Suite | `CMDCEval.Suite` behaviour | A collection of cases (e.g. BFCL v3, tau2-bench, internal) |
| Case | `CMDCEval.Case` struct | A single evaluation case (id / input / expected) |
| Run | `CMDCEval.Run` struct | The result of a single evaluation (pass / latency / tokens / cost) |
| Report | `CMDCEval.Report` | Writes the JSONL report (schema aligned with LangSmith / Langfuse) |
| Runner | `CMDCEval.Runner` | Runs cases concurrently, collects the Runs, and emits the Report |
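
To make the Suite / Case relationship concrete, here is a minimal sketch of what a custom suite could look like. The callbacks of `CMDCEval.Suite` are not listed on this page, so the `name/0` and `cases/0` callbacks below are assumptions, and the `SuiteSketch.*` modules are hypothetical stand-ins rather than the library's own definitions. Only the three documented Case fields (id / input / expected) are modelled.

```elixir
defmodule SuiteSketch.Case do
  # Stand-in for CMDCEval.Case: only the three documented fields.
  @enforce_keys [:id, :input, :expected]
  defstruct [:id, :input, :expected]
end

defmodule SuiteSketch.Suite do
  # Hypothetical behaviour; the real CMDCEval.Suite callbacks may differ.
  @callback name() :: String.t()
  @callback cases() :: [%SuiteSketch.Case{}]
end

defmodule SuiteSketch.Arithmetic do
  @behaviour SuiteSketch.Suite

  # Suite name, assumed to end up in the "suite" field of each report line.
  @impl true
  def name, do: "arithmetic"

  # The cases are plain data, so a suite can be inspected without calling a model.
  @impl true
  def cases do
    [
      %SuiteSketch.Case{id: "add_small", input: "What is 2 + 3?", expected: "5"},
      %SuiteSketch.Case{id: "mul_small", input: "What is 4 * 6?", expected: "24"}
    ]
  end
end
```

Keeping cases as plain structs means a suite stays cheap to list and diff; grading logic would live elsewhere (in the Runner, under this sketch's assumptions).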

## Built-in suites

- `CMDCEval.Suites.Internal`: validates cmdc's internal scenarios (built-in features such as
  DAG / Steering / HumanApproval / Checkpoint resume; complements the external benchmarks)
- `CMDCEval.Suites.BFCL`: Berkeley Function Calling Leaderboard v3
  fixtures (fetched from the upstream public repository; see `mix cmdc.eval.fetch_bfcl`)

## Quick Start

    # 1. Run the internal suite and write a JSONL report
    $ mix cmdc.eval --suite=internal --model="anthropic:claude-sonnet-4-5" --report=out.jsonl

    # 2. Run BFCL (fetch the fixtures first)
    $ mix cmdc.eval.fetch_bfcl
    $ mix cmdc.eval --suite=bfcl --model="openai:gpt-4o" --report=bfcl.jsonl

    # 3. Programmatic call
    {:ok, report} = CMDCEval.run(
      suite: CMDCEval.Suites.Internal,
      model: "anthropic:claude-sonnet-4-5",
      report_path: "out.jsonl"
    )

## Report JSONL fields (stable schema)

    {
      "suite": "internal",
      "case_id": "steering_basic",
      "model": "anthropic:claude-sonnet-4-5",
      "pass": true,
      "latency_ms": 1234,
      "tokens_in": 567,
      "tokens_out": 89,
      "cost_usd": 0.0034,
      "events_digest": "sha256:abc123...",
      "error": null,
      "timestamp": "2026-05-18T12:34:56Z"
    }

The same records can be consumed by LangSmith / Langfuse / Datadog, which makes it easier to compare results across benchmarks.
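
As an illustration of how the flat schema lends itself to comparison, the sketch below groups decoded report lines by `suite` and `model` and computes a pass rate for each pair. It assumes each JSONL line has already been decoded into a map with string keys (for example with a JSON library of your choice); `ReportRows` is a hypothetical helper, not part of cmdc_eval.

```elixir
defmodule ReportRows do
  # Computes a pass rate per {suite, model} pair from decoded report lines.
  # Each row is a map with string keys, as a JSON decoder would return it.
  def pass_rates(rows) do
    rows
    |> Enum.group_by(fn row -> {row["suite"], row["model"]} end)
    |> Map.new(fn {key, group} ->
      passed = Enum.count(group, fn row -> row["pass"] == true end)
      {key, passed / length(group)}
    end)
  end
end

# Hand-written rows using the documented field names; only the fields
# needed for grouping are included.
rows = [
  %{"suite" => "internal", "model" => "anthropic:claude-sonnet-4-5", "pass" => true},
  %{"suite" => "internal", "model" => "anthropic:claude-sonnet-4-5", "pass" => false},
  %{"suite" => "bfcl", "model" => "openai:gpt-4o", "pass" => true}
]

ReportRows.pass_rates(rows)
```

Because every line carries its own `suite` and `model`, reports from several runs can be concatenated before aggregation without extra bookkeeping.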

## v0.1 scope

- ✅ Suite behaviour + 4 structs (Case / Run / Report / Suite)
- ✅ `Mix.Tasks.Cmdc.Eval` CLI
- ✅ Internal suite (5+ scenarios: DAG / Steering / HumanApproval / Checkpoint / Compactor)
- ✅ BFCL fetch + placeholder suite (a skeleton of 10 cases that can be filled in from upstream fixtures)
- ✅ JSONL report schema (aligned with the 12G Telemetry fields)
- 🔁 Deferred to v0.2: tau2-bench airline / MemoryAgentBench subset / direct LangSmith sync

# `run_opts`

```elixir
@type run_opts() :: [
  suite: module(),
  model: String.t(),
  report_path: String.t() | nil,
  concurrency: pos_integer(),
  timeout_ms: pos_integer(),
  provider_opts: keyword()
]
```

Keyword list of options accepted when running evals.

# `run`

```elixir
@spec run(run_opts()) :: {:ok, CMDCEval.Report.t()} | {:error, term()}
```

Runs one Suite and returns `{:ok, %Report{}}` or `{:error, reason}`.

## Options

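- `:suite`: the Suite module (must implement the `CMDCEval.Suite` behaviour). Required.
- `:model`: the model string (e.g. `"anthropic:claude-sonnet-4-5"`). Required.
- `:report_path`: path of the JSONL report to write; when `nil`, only `%Report{}` is returned and no file is written.
- `:concurrency`: number of cases to run concurrently (default 4; can be set higher with the Mock provider).
- `:timeout_ms`: timeout for a single case (default 60_000).
- `:provider_opts`: options passed through to `CMDC.Provider.stream/4` (e.g. `api_key`).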
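
The defaults listed above could be applied along the following lines. This is a sketch under assumptions, not the library's implementation: `RunOptsSketch` is a hypothetical module, the empty-list default for `:provider_opts` is assumed (no default is documented), and it relies on `Keyword.validate!/2`, which is available from Elixir 1.13.

```elixir
defmodule RunOptsSketch do
  # Documented defaults. :suite and :model have none because they are required.
  # The [] default for :provider_opts is an assumption.
  @defaults [report_path: nil, concurrency: 4, timeout_ms: 60_000, provider_opts: []]
  @required [:suite, :model]

  def normalize(opts) do
    # Raises ArgumentError on unknown keys and fills in missing defaults.
    opts = Keyword.validate!(opts, @required ++ @defaults)

    # Required keys carry no default, so their presence is checked explicitly.
    Enum.each(@required, fn key ->
      if not Keyword.has_key?(opts, key) do
        raise ArgumentError, "missing required option #{inspect(key)}"
      end
    end)

    opts
  end
end
```

Rejecting unknown keys early turns a typo such as `:timeout` (instead of `:timeout_ms`) into an immediate error rather than a silently ignored option.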

## Example

    {:ok, report} = CMDCEval.run(
      suite: CMDCEval.Suites.Internal,
      model: "anthropic:claude-sonnet-4-5",
      report_path: "internal.jsonl",
      concurrency: 4
    )

    report.summary
    # => %{total: 5, pass: 5, fail: 0, total_latency_ms: 12345, ...}
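
To show how `:concurrency`, `:timeout_ms`, and the summary fields might fit together, here is a rough sketch built on `Task.async_stream/3`. It is not how `CMDCEval.Runner` is known to work. `RunnerSketch` is hypothetical, the grading function `fun` stands in for "send the case to the model and check the answer", and counting a timed-out case as a failure is an assumption.

```elixir
defmodule RunnerSketch do
  # Runs `fun` over `cases` concurrently and folds the outcomes into a summary
  # with the keys shown in the example above. `fun` must return true or false.
  def run(cases, fun, opts \\ []) do
    concurrency = Keyword.get(opts, :concurrency, 4)
    timeout_ms = Keyword.get(opts, :timeout_ms, 60_000)

    cases
    |> Task.async_stream(
      fn c ->
        # :timer.tc/1 returns {elapsed_microseconds, result}.
        {micros, pass?} = :timer.tc(fn -> fun.(c) end)
        %{pass: pass?, latency_ms: div(micros, 1000)}
      end,
      max_concurrency: concurrency,
      timeout: timeout_ms,
      on_timeout: :kill_task
    )
    |> Enum.reduce(%{total: 0, pass: 0, fail: 0, total_latency_ms: 0}, fn
      {:ok, %{pass: true, latency_ms: ms}}, acc ->
        %{acc | total: acc.total + 1, pass: acc.pass + 1,
                total_latency_ms: acc.total_latency_ms + ms}

      {:ok, %{latency_ms: ms}}, acc ->
        %{acc | total: acc.total + 1, fail: acc.fail + 1,
                total_latency_ms: acc.total_latency_ms + ms}

      # With on_timeout: :kill_task a timed-out case arrives as {:exit, :timeout};
      # it is counted as a failure and contributes no latency.
      {:exit, _reason}, acc ->
        %{acc | total: acc.total + 1, fail: acc.fail + 1}
    end)
  end
end

# Usage with a stand-in grader: even numbers "pass".
RunnerSketch.run([1, 2, 3, 4], fn n -> rem(n, 2) == 0 end, concurrency: 2)
```

A case that raises is not handled here and would take the caller down with it, so a fuller version would need to catch errors inside the task and record them in the `error` field.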

# `version`

```elixir
@spec version() :: String.t()
```

Returns the current version of cmdc_eval.

---

*Consult [api-reference.md](api-reference.md) for complete listing*