Detects hostile, abusive, or toxic content in LLM outputs.
Evaluates across categories: identity attacks, insults, threats, harassment, profanity, and violent content.
This is a negative metric: "yes" (toxicity detected) = fail.
Detects hostile, abusive, or toxic content in LLM outputs.
Evaluates across categories: identity attacks, insults, threats, harassment, profanity, and violent content.
This is a negative metric: "yes" (toxicity detected) = fail.