
Toxicity

The Toxicity metric judges whether the input text is toxic, including obscenities, threats, and insults. A higher score indicates more toxic text, so a lower score is preferred.
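
This page does not specify which model produces the scores, but any multi-label toxicity classifier yields the same shape of output: one score per style, each between 0 and 1. The sketch below is purely illustrative, assuming the Hugging Face transformers library and the public unitary/toxic-bert checkpoint (whose label set roughly matches the styles listed under FAQs); it is not the metric's own implementation.

```python
# Illustrative only: score a text with an assumed multi-label toxicity model.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "unitary/toxic-bert"  # assumed checkpoint, not specified by this page
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "You are a wonderful person."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits[0]

# Multi-label setup: each style gets its own sigmoid score in [0, 1];
# higher means the text more likely exhibits that style of toxicity.
scores = {model.config.id2label[i]: float(torch.sigmoid(l)) for i, l in enumerate(logits)}
print(scores)
```

In this multi-label setup each style's score comes from its own sigmoid, which mirrors the FAQ below: the TOXIC score is not derived from the other scores.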

FAQs

What is the definition of each label?

  • TOXIC: Rude, disrespectful, or unreasonable comment; likely to make people leave a discussion.
  • OBSCENE: Swear words, curse words, or other obscene or profane language.
  • INSULT: Insulting, inflammatory, or negative comment towards a person or group.
  • IDENTITY HATE: Negative or hateful comments targeting someone because of their identity.
  • THREAT: Describes an intention to inflict pain, injury, or violence against an individual or group.

See [1].

Is the TOXIC score a combination of the other scores?

No, the score for each toxic style is computed separately.

How should I use the toxicity metric?

A separate model is fine-tuned to detect each style of toxicity. The TOXIC model was fine-tuned on a wide array of toxic styles, whereas each of the other models was fine-tuned to detect only its own style. This means the TOXIC score gives a broad overview of the text's toxicity, while the other scores provide a more fine-grained analysis.
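
As a rough sketch of that workflow, one option is to screen text with the broad TOXIC score first and inspect the fine-grained styles only for flagged text. The 0.5 threshold and the lower-case label keys below are illustrative assumptions, not part of the metric's definition.

```python
# Illustrative triage: broad TOXIC score as a screen, fine-grained styles as the explanation.
TOXIC_THRESHOLD = 0.5  # assumed cut-off; tune for your application

def triage(scores: dict[str, float]) -> list[str]:
    """Return the fine-grained styles that explain why a text was flagged."""
    if scores.get("toxic", 0.0) < TOXIC_THRESHOLD:
        return []  # the broad score says the text is fine; skip the fine-grained check
    fine_grained = ("obscene", "insult", "identity_hate", "threat")
    return [label for label in fine_grained if scores.get(label, 0.0) >= TOXIC_THRESHOLD]

# Example: a text flagged as broadly toxic, driven mainly by insults.
print(triage({"toxic": 0.91, "obscene": 0.12, "insult": 0.84,
              "identity_hate": 0.03, "threat": 0.02}))  # -> ['insult']
```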

Citations

[1] Gehman, Samuel, et al. "RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models." arXiv preprint arXiv:2009.11462 (2020).