V4 Metrics

Note

Version 4 is now the default for Trustwise metrics. It provides enhanced accuracy, better type safety, and improved response structures compared to v3.

All metrics are available in both synchronous and asynchronous forms. For async, use TrustwiseSDKAsync and await the evaluate() methods.

The SDK provides access to all v4 metrics directly through the metrics namespace. Each metric provides an evaluate() function with enhanced type safety and improved response structures. Example usage:

For more details on the metrics, please refer to the Trustwise Metrics Documentation.

from trustwise.sdk import TrustwiseSDK
from trustwise.sdk.config import TrustwiseConfig
from trustwise.sdk.metrics.v4.types import ContextChunk

config = TrustwiseConfig(api_key="your-api-key")
trustwise = TrustwiseSDK(config)

# v4 context format
context = [
    ContextChunk(chunk_text="Paris is the capital of France.", chunk_id="doc:idx:1")
]

# v4 metric calls (now default)
result = trustwise.metrics.faithfulness.evaluate(
    query="What is the capital of France?",
    response="The capital of France is Paris.",
    context=context
)

Async example:

import asyncio
from trustwise.sdk import TrustwiseSDKAsync
from trustwise.sdk.config import TrustwiseConfig
from trustwise.sdk.metrics.v4.types import ContextChunk

async def main():
    config = TrustwiseConfig(api_key="your-api-key")
    trustwise = TrustwiseSDKAsync(config)

    context = [
        ContextChunk(chunk_text="Paris is the capital of France.", chunk_id="doc:idx:1")
    ]

    result = await trustwise.metrics.faithfulness.evaluate(
        query="What is the capital of France?",
        response="The capital of France is Paris.",
        context=context
    )

asyncio.run(main())

Refer to the API Reference for details on each metric’s parameters.

Note

Custom types such as ContextChunk are defined in trustwise.sdk.metrics.v4.types.

Adherence

metrics.adherence.evaluate(policy: str, response: str) → AdherenceResponse

Evaluate how well the response follows a given policy or instruction.

Request:

AdherenceRequest

Returns:

AdherenceResponse

Example response:

{
    "score": 85.0
}

score: A float, 0-100, measuring how well the response follows the policy (higher is better adherence)

Answer Relevancy

metrics.answer_relevancy.evaluate(query: str, response: str) → AnswerRelevancyResponse

Evaluate the relevancy of a response to the query.

Request:

AnswerRelevancyRequest

Returns:

AnswerRelevancyResponse

Example response:

{
    "score": 92.0,
    "generated_question": "What is the capital city of France?"
}

score: A float, 0-100, measuring how relevant the response is to the query. Higher score indicates better relevancy.
generated_question: The generated question for which the response would be relevant

Clarity

metrics.clarity.evaluate(text: str) → ClarityResponse

The Trustwise Clarity metric measures how easy text is to read. It gives higher scores to writing that contains words which are easier to read, and uses concise, self-contained sentences. It does not measure how well you understand the ideas in the text.

Request:

ClarityRequest

Returns:

ClarityResponse

Example response:

{
    "score": 92.5
}

score: A float, 0-100, measuring how clear and understandable the response is. Higher score indicates better clarity.

Completion

metrics.completion.evaluate(query: str, response: str) → CompletionResponse

Evaluate how well the response completes or follows the query’s instruction.

Request:

CompletionRequest

Returns:

CompletionResponse

Example response:

{
    "score": 85.0
}
score: A float, 0-100, measuring how well the response completes the query. Higher score indicates better completion.

Context Relevancy

metrics.context_relevancy.evaluate(query: str, context: Context, severity: float = None, include_chunk_scores: bool = None, metadata: dict = None) → ContextRelevancyResponse

Evaluate the relevancy of the context to the query.

Request:

ContextRelevancyRequest

Returns:

ContextRelevancyResponse

Example response:

{
    "score": 88.5,
    "scores": [
        {"label": "Circumstances", "score": 0.33},
        {"label": "Claim", "score": 46.34},
        {"label": "Policy", "score": 0.11}
    ]
}

score: A float, 0-100, measuring how relevant the context is to the query. Higher score indicates better relevancy.
scores: List of ObjectStyleScore with detailed breakdown by relevancy aspect

Faithfulness

metrics.faithfulness.evaluate(query: str, response: str, context: Context) → FaithfulnessResponse

Evaluate the faithfulness of a response against its context.

Request:

FaithfulnessRequest

Returns:

FaithfulnessResponse

Example response:

{
    "score": 99.971924,
    "statements": [
        {
            "statement": "The capital of France is Paris.",
            "label": "Safe",
            "probability": 0.9997192,
            "sentence_span": [0, 30]
        }
    ]
}

score: A float, 0-100, measuring how faithful the response is to the context (100 is perfect faithfulness)
statements: List of extracted atomic statements with their verification status

Formality

metrics.formality.evaluate(text: str) → FormalityResponse

Evaluate the formality level of a response.

Request:

FormalityRequest

Returns:

FormalityResponse

Example response:

{
    "score": 75.0,
}

score: A float, 0-100, measuring the overall formality level (100 is very formal)

Helpfulness

metrics.helpfulness.evaluate(text: str) → HelpfulnessResponse

Evaluate the helpfulness of a response.

Request:

HelpfulnessRequest

Returns:

HelpfulnessResponse

Example response:

{
    "score": 88.0
}

score: A float, 0-100, measuring how helpful the response is (100 is very helpful)

PII Detection

metrics.pii.evaluate(text: str, allowlist: list[str] = None, blocklist: list[str] = None, categories: list[str] = None) → PIIResponse

Detect personally identifiable information in text.

Request:

PIIRequest

Returns:

PIIResponse

Example response:

{
    "pii": [
        {
            "interval": [0, 5],
            "string": "Hello",
            "category": "blocklist"
        }
    ]
}

pii: List of detected PII entities with their locations and categories

Prompt Manipulation

metrics.prompt_manipulation.evaluate(text: str, severity: int = None) → PromptManipulationResponse

Detect potential prompt manipulation attempts including jailbreak, prompt injection, and role play.

Request:

PromptManipulationRequest

Returns:

PromptManipulationResponse

Example response:

{
    "score": 0.85,
    "scores": [
        {"label": "jailbreak", "score": 0.90},
        {"label": "prompt_injection", "score": 0.80},
        {"label": "role_play", "score": 0.85}
    ]
}

score: A float, 0-100, measuring overall prompt manipulation likelihood (higher is more likely)
scores: List of ObjectStyleScore with detailed breakdown by manipulation type

Refusal

metrics.refusal.evaluate(query: str, response: str) → RefusalResponse

Evaluate the likelihood that a response is a refusal to answer or comply with the query.

Request:

RefusalRequest

Returns:

RefusalResponse

Example response:

{
    "score": 5.0
}

score: A float, 0-100, measuring the degree of refusal (higher indicates stronger refusal)

Sensitivity

metrics.sensitivity.evaluate(text: str, topics: list[str]) → SensitivityResponse

Evaluate the sensitivity of a response regarding specific topics.

Request:

SensitivityRequest

Returns:

SensitivityResponse

Example response:

{
    "scores": [
        {"label": "politics", "score": 70.0},
        {"label": "religion", "score": 60.0}
    ]
}

scores: List of ObjectStyleScore with sensitivity scores by topic (0-100, higher indicates stronger presence)

Stability

metrics.stability.evaluate(responses: list[str]) → StabilityResponse

Measures how similar the responses are when given the same or similar inputs multiple times. It gives higher scores when responses stay consistent, even if asked by different personas or worded differently. This helps identify if an agent changes its answers unexpectedly.

Request:

StabilityRequest

Returns:

StabilityResponse

Example response:

{
    "min": 75,
    "avg": 85
}

min: An integer, 0-100, measuring the minimum stability between any pair of responses (100 is high similarity)
avg: An integer, 0-100, measuring the average stability between all pairs of responses (100 is high similarity)

Simplicity

metrics.simplicity.evaluate(text: str) → SimplicityResponse

Measures how easy it is to understand the words in a text. It gives higher scores to writing that uses common, everyday words instead of special terms or complicated words. Simplicity looks at the words you choose, not how you put them together in sentences.

Request:

SimplicityRequest

Returns:

SimplicityResponse

Example response:

{
    "score": 82.0
}

score: A float, 0-100, measuring how simple the response is. Higher score indicates simpler text.

Tone

metrics.tone.evaluate(text: str, tones: list[str] = None) → ToneResponse

Evaluate the tone of a response.

Request:

ToneRequest

Returns:

ToneResponse

Example response:

{
    "scores": [
        {"label": "neutral", "score": 89.70},
        {"label": "happiness", "score": 6.68},
        {"label": "realization", "score": 2.99}
    ]
}

scores: List of ObjectStyleScore with tone confidence scores (0-100, higher indicates stronger presence)

Toxicity

metrics.toxicity.evaluate(text: str, severity: int = None) → ToxicityResponse

Evaluate the toxicity of a response.

Request:

ToxicityRequest

Returns:

ToxicityResponse

Example response:

{
    "score": 36.22,
    "scores": [
        {"label": "identity_hate", "score": 3.61},
        {"label": "insult", "score": 6.21},
        {"label": "threat", "score": 2.80},
        {"label": "obscene", "score": 10.55},
        {"label": "toxic", "score": 36.22}
    ]
}

score: A float, 0-100, measuring overall toxicity (higher is more toxic)
scores: List of ObjectStyleScore with detailed breakdown by toxicity category

Carbon

metrics.carbon.evaluate(provider: str, region: str, instance_type: str, latency: float | int) → CarbonResponse

Evaluate the carbon footprint of AI operations based on provider, instance type, region, and latency.

Request:

CarbonRequest

Returns:

CarbonResponse

Example response:

{
    "carbon": {
        "value": 0.0011949989480068127,
        "unit": "kg_co2e"
    },
    "components": [
        {
            "component": "operational_gpu",
            "carbon": {
                "value": 0.0,
                "unit": "kg_co2e"
            }
        },
        {
            "component": "operational_cpu",
            "carbon": {
                "value": 0.00021294669026962343,
                "unit": "kg_co2e"
            }
        },
        {
            "component": "embodied_cpu",
            "carbon": {
                "value": 0.0009820522577371892,
                "unit": "kg_co2e"
            }
        }
    ]
}

carbon: CarbonValue with total carbon footprint
components: List of CarbonComponent with breakdown by component