Toxicity
The Toxicity metric judges whether the input text is toxic, including obscenities, threats, and insults. A higher score indicates more toxic text, so a lower score is preferred.
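For illustration, per-style toxicity scores like these can be reproduced with an open-source classifier. The sketch below uses the Detoxify package as a stand-in (an assumption; it is not necessarily the model behind this metric), and simply prints whatever styles the model returns, each as a probability in [0, 1] where higher means more toxic.

```python
# pip install detoxify
from detoxify import Detoxify

# Load a general-purpose toxicity classifier (stand-in model; the exact
# checkpoint and label names may differ from this metric's implementation).
model = Detoxify("original")

# Score a pair of inputs; each call returns one probability per toxic style.
for text in ["Have a great day!", "You are an idiot."]:
    scores = model.predict(text)  # dict: style name -> score in [0, 1]
    print(text)
    for style, score in scores.items():
        print(f"  {style:>20s}: {score:.3f}")  # higher score = more toxic
```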
FAQs
What is the definition of each label?
- TOXIC: Rude, disrespectful, or unreasonable comment; likely to make people leave a discussion.
- OBSCENE: Swear words, curse words, or other obscene or profane language.
- INSULT: Insulting, inflammatory, or negative comment towards a person or group.
- IDENTITY HATE: Negative or hateful comments targeting someone because of their identity.
- THREAT: Describes an intention to inflict pain, injury, or violence against an individual or group.
[1]
Is the TOXIC score a combination of the other scores?
No, the score for each toxic style is computed separately.
How should I use the toxicity metric?
A separate detector is fine-tuned for each toxic style. The TOXIC detector was fine-tuned on a wide array of toxic styles, whereas each of the other detectors was fine-tuned to detect only its own style of toxicity. As a result, the TOXIC score gives a broad overview of the toxicity of the text, while the other styles provide a more fine-grained analysis.
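A minimal sketch of that workflow, assuming the per-style scores arrive as a mapping from style name to a probability in [0, 1]; the key names and threshold below are illustrative, not part of the metric's API.

```python
from typing import Dict

# Illustrative output of the metric for one input; the key names follow the
# labels above but are not an official schema.
scores: Dict[str, float] = {
    "TOXIC": 0.91,          # broad detector, fine-tuned on many toxic styles
    "OBSCENE": 0.12,
    "INSULT": 0.88,
    "IDENTITY_HATE": 0.05,
    "THREAT": 0.02,
}

THRESHOLD = 0.5  # assumed decision threshold; tune for your application

# Use the broad TOXIC score to decide whether the text is toxic at all...
if scores["TOXIC"] >= THRESHOLD:
    # ...then use the fine-grained styles to explain how it is toxic.
    styles = [k for k, v in scores.items() if k != "TOXIC" and v >= THRESHOLD]
    print(f"Toxic (score={scores['TOXIC']:.2f}); styles flagged: {styles or ['none']}")
else:
    print("Not flagged as toxic.")
```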
Citations
[1] Gehman, Samuel, et al. "RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models." arXiv preprint arXiv:2009.11462 (2020).