Toxicity

SDK Usage

Learn how to evaluate this metric programmatically in the Trustwise SDK Documentation.

The Toxicity metric judges whether the input text is toxic, including obscenities, threats, and insults. A higher score indicates more toxic text, so a lower score is preferred.
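As a rough sketch of what a programmatic evaluation might look like, the snippet below assumes a hypothetical SDK client with an `evaluate_toxicity` method that returns per-label scores. These names and the response shape are illustrative assumptions, not the actual Trustwise SDK API; see the SDK Documentation linked above for the real interface.

```python
from typing import Dict


def score_toxicity(client, text: str) -> Dict[str, float]:
    """Evaluate `text` for toxicity and return one score per label.

    `client` stands in for a hypothetical SDK client object; the method
    name and response fields used here are illustrative assumptions,
    not the real Trustwise SDK API.
    """
    response = client.evaluate_toxicity(text=text)
    # Each label (TOXIC, OBSCENE, INSULT, IDENTITY HATE, THREAT) receives
    # its own score; higher values indicate more toxic text, so lower
    # values are preferred.
    return {label.name: label.score for label in response.labels}
```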

FAQs

What is the definition of each label?

  • TOXIC: Rude, disrespectful, or unreasonable comment; likely to make people leave a discussion.
  • OBSCENE: Swear words, curse words, or other obscene or profane language.
  • INSULT: Insulting, inflammatory, or negative comment towards a person or group.
  • IDENTITY HATE: Negative or hateful comments targeting someone because of their identity.
  • THREAT: Describes an intention to inflict pain, injury, or violence against an individual or group.

[1]

Is the TOXIC score a combination of the other scores?

No. The score for each toxic style is computed independently of the others.

How should I use the toxicity metric?

The detector for each toxic style is fine-tuned to identify instances of that style. The TOXIC detector was fine-tuned on a wide array of toxic styles, whereas each of the other detectors was fine-tuned to detect only its own style. As a result, the TOXIC score gives a broad overview of the text's toxicity, while the other scores provide a more fine-grained analysis, as in the sketch below.
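As a small, self-contained sketch of that two-level workflow, the scores below are made-up sample values and the 0.5 cutoff is an arbitrary illustrative choice, not a recommended threshold:

```python
# Hypothetical per-label scores for a single piece of text (sample values).
scores = {
    "TOXIC": 0.86,          # broad indicator across all toxic styles
    "OBSCENE": 0.12,
    "INSULT": 0.79,
    "IDENTITY HATE": 0.04,
    "THREAT": 0.02,
}

THRESHOLD = 0.5  # arbitrary illustrative cutoff

# Use the broad TOXIC score to decide whether the text needs attention,
# then use the style-specific scores to see which styles are driving it.
if scores["TOXIC"] >= THRESHOLD:
    dominant_styles = [
        label for label, score in scores.items()
        if label != "TOXIC" and score >= THRESHOLD
    ]
    print(f"Toxic text detected; dominant styles: {dominant_styles or 'unspecified'}")
else:
    print("Text appears non-toxic at this threshold.")
```

In practice, the threshold and any per-style handling should be chosen to fit your own application's requirements.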

Citations

[1] Gehman, Samuel, et al. "RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models." arXiv preprint arXiv:2009.11462 (2020).