Toxicity

SDK Usage

Learn how to evaluate this metric programmatically in the Trustwise SDK Documentation.

The Toxicity metric judges whether the input text is toxic, including obscenities, threats, and insults. A higher score indicates more toxic text, so a lower score is preferred.
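As a rough sketch of what a programmatic evaluation might look like, the snippet below assumes a hypothetical SDK client with an `evaluate_toxicity` method that returns per-label scores. These names and the response shape are illustrative assumptions, not the actual Trustwise SDK API; see the SDK Documentation linked above for the real interface.

```python
from typing import Dict


def score_toxicity(client, text: str) -> Dict[str, float]:
    """Evaluate `text` for toxicity and return one score per label.

    `client` stands in for a hypothetical SDK client object; the method
    name and response fields used here are illustrative assumptions,
    not the real Trustwise SDK API.
    """
    response = client.evaluate_toxicity(text=text)
    # Each label (TOXIC, OBSCENE, INSULT, IDENTITY HATE, THREAT) receives
    # its own score; higher values indicate more toxic text, so lower
    # values are preferred.
    return {label.name: label.score for label in response.labels}
```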

FAQs

What is the definition of each label?

  • TOXIC: Rude, disrespectful, or unreasonable comment; likely to make people leave a discussion.
  • OBSCENE: Swear words, curse words, or other obscene or profane language.
  • INSULT: Insulting, inflammatory, or negative comment towards a person or group.
  • IDENTITY HATE: Negative or hateful comments targeting someone because of their identity.
  • THREAT: Describes an intention to inflict pain, injury, or violence against an individual or group.

[1]

Is the TOXIC score a combination of the other scores?

No. The score for each toxic style is computed independently of the others.

How should I use the toxicity metric?

The detector for each toxic style is fine-tuned to identify instances of that style. The TOXIC detector was fine-tuned on a wide array of toxic styles, whereas each of the other detectors was fine-tuned to detect only its own style. As a result, the TOXIC score gives a broad overview of the text's toxicity, while the other scores provide a more fine-grained analysis, as in the sketch below.
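As a small, self-contained sketch of that two-level workflow, the scores below are made-up sample values and the 0.5 cutoff is an arbitrary illustrative choice, not a recommended threshold:

```python
# Hypothetical per-label scores for a single piece of text (sample values).
scores = {
    "TOXIC": 0.86,          # broad indicator across all toxic styles
    "OBSCENE": 0.12,
    "INSULT": 0.79,
    "IDENTITY HATE": 0.04,
    "THREAT": 0.02,
}

THRESHOLD = 0.5  # arbitrary illustrative cutoff

# Use the broad TOXIC score to decide whether the text needs attention,
# then use the style-specific scores to see which styles are driving it.
if scores["TOXIC"] >= THRESHOLD:
    dominant_styles = [
        label for label, score in scores.items()
        if label != "TOXIC" and score >= THRESHOLD
    ]
    print(f"Toxic text detected; dominant styles: {dominant_styles or 'unspecified'}")
else:
    print("Text appears non-toxic at this threshold.")
```

In practice, the threshold and any per-style handling should be chosen to fit your own application's requirements.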

Citations

[1] Gehman, Samuel, et al. "RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models." arXiv preprint arXiv:2009.11462 (2020).