
Toxicity

The Toxicity metric judges whether the input text is toxic, including obscenities, threats, and insults. A higher score indicates more toxic text, so a lower score is preferred.
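
This page does not specify which model produces the scores, but any multi-label toxicity classifier yields the same shape of output: one score per style, each between 0 and 1. The sketch below is purely illustrative, assuming the Hugging Face transformers library and the public unitary/toxic-bert checkpoint (whose label set roughly matches the styles listed under FAQs); it is not the metric's own implementation.

```python
# Illustrative only: score a text with an assumed multi-label toxicity model.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "unitary/toxic-bert"  # assumed checkpoint, not specified by this page
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "You are a wonderful person."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits[0]

# Multi-label setup: each style gets its own sigmoid score in [0, 1];
# higher means the text more likely exhibits that style of toxicity.
scores = {model.config.id2label[i]: float(torch.sigmoid(l)) for i, l in enumerate(logits)}
print(scores)
```

In this multi-label setup each style's score comes from its own sigmoid, which mirrors the FAQ below: the TOXIC score is not derived from the other scores.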

FAQs

What is the definition of each label?

  • TOXIC: Rude, disrespectful, or unreasonable comment; likely to make people leave a discussion.
  • OBSCENE: Swear words, curse words, or other obscene or profane language.
  • INSULT: Insulting, inflammatory, or negative comment towards a person or group.
  • IDENTITY HATE: Negative or hateful comments targeting someone because of their identity.
  • THREAT: Describes an intention to inflict pain, injury, or violence against an individual or group.

See [1].

Is the TOXIC score a combination of the other scores?

No, the score for each toxic style is computed separately.

How should I use the toxicity metric?

A separate model is fine-tuned to detect each style of toxicity. The TOXIC model was fine-tuned on a wide array of toxic styles, whereas each of the other models was fine-tuned to detect only its own style. This means the TOXIC score gives a broad overview of the text's toxicity, while the other scores provide a more fine-grained analysis.
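
As a rough sketch of that workflow, one option is to screen text with the broad TOXIC score first and inspect the fine-grained styles only for flagged text. The 0.5 threshold and the lower-case label keys below are illustrative assumptions, not part of the metric's definition.

```python
# Illustrative triage: broad TOXIC score as a screen, fine-grained styles as the explanation.
TOXIC_THRESHOLD = 0.5  # assumed cut-off; tune for your application

def triage(scores: dict[str, float]) -> list[str]:
    """Return the fine-grained styles that explain why a text was flagged."""
    if scores.get("toxic", 0.0) < TOXIC_THRESHOLD:
        return []  # the broad score says the text is fine; skip the fine-grained check
    fine_grained = ("obscene", "insult", "identity_hate", "threat")
    return [label for label in fine_grained if scores.get(label, 0.0) >= TOXIC_THRESHOLD]

# Example: a text flagged as broadly toxic, driven mainly by insults.
print(triage({"toxic": 0.91, "obscene": 0.12, "insult": 0.84,
              "identity_hate": 0.03, "threat": 0.02}))  # -> ['insult']
```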

Citations

[1] Gehman, Samuel, et al. "RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models." arXiv preprint arXiv:2009.11462 (2020).