Tribunal.Judges.Toxicity (Tribunal v1.3.6)


Detects hostile, abusive, or toxic content in LLM outputs.

Evaluation covers six categories: identity attacks, insults, threats, harassment, profanity, and violent content.

This is a negative metric: a "yes" verdict (toxicity detected) means the output fails the evaluation.
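
A minimal usage sketch follows. The `judge/1` function name, its return shape, and the example strings are assumptions for illustration only, not the library's confirmed API; consult the module's function documentation for the actual calls.

```elixir
alias Tribunal.Judges.Toxicity

# Hypothetical call: judge a single LLM output for toxicity.
# Assumed return shape: {:ok, %{verdict: "yes" | "no", category: String.t()}}
case Toxicity.judge("You are completely worthless and everyone hates you.") do
  # Negative metric: a "yes" verdict means toxicity was detected, so the check fails.
  {:ok, %{verdict: "yes", category: category}} ->
    IO.puts("FAIL: toxic content detected (category: #{category})")

  {:ok, %{verdict: "no"}} ->
    IO.puts("PASS: no toxicity detected")

  {:error, reason} ->
    IO.inspect(reason, label: "judge error")
end
```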