Metrics

Note

All metrics are available in both synchronous and asynchronous forms. For async, use TrustwiseSDKAsync and await the evaluate() methods.

The SDK provides access to all metrics through the unified metrics namespace. Each metric provides an evaluate() function. Example usage:

For more details on the metrics, please refer to the Trustwise Metrics Documentation.

result = trustwise.metrics.faithfulness.evaluate(query="...", response="...", context=[...])
clarity = trustwise.metrics.clarity.evaluate(query="...", response="...")
cost = trustwise.metrics.cost.evaluate(model_name="...", model_type="LLM", ...)

Async example:

import asyncio
from trustwise.sdk import TrustwiseSDKAsync
from trustwise.sdk.config import TrustwiseConfig

async def main():
    config = TrustwiseConfig()
    trustwise = TrustwiseSDKAsync(config)
    await trustwise.metrics.faithfulness.evaluate(
        query="What is the capital of France?",
        response="The capital of France is Paris.",
        context=[{"node_id": "1", "node_score": 1.0, "node_text": "Paris is the capital of France."}]

asyncio.run(main())

Refer to the API Reference for details on each metric’s parameters.

Note

Custom types such as Context are defined in trustwise.sdk.types.

Adherence

metrics.adherence.evaluate(policy: str, response: str) AdherenceResponse

Evaluate how well the response adheres to a given policy or instruction.

Request:

Returns:

Example response:

{
    "score": 95
}
  • score: An integer, 0-100, measuring how well the response follows the policy (100 is perfect adherence)

Answer Relevancy

metrics.answer_relevancy.evaluate(query: str, response: str) AnswerRelevancyResponse

Evaluate the relevancy of a response to the query.

Request:

Returns:

Example response:

{
    "score": 92.0,
    "generated_question": "What is the capital city of France?"
}

Carbon

metrics.carbon.evaluate(processor_name: str, provider_name: str, provider_region: str, instance_type: str, average_latency: int) CarbonResponse

Evaluates the carbon emissions based on hardware specifications and infrastructure details.

Request:

Returns:

Example response:

{
    "carbon_emitted": 0.00015,
    "sci_per_api_call": 0.00003,
    "sci_per_10k_calls": 0.3
}

Clarity

metrics.clarity.evaluate(response: str) ClarityResponse

Evaluate the clarity of a response.

Request:

Returns:

Example response:

{
    "score": 92.5
}

Completion

metrics.completion.evaluate(query: str, response: str) CompletionResponse

Evaluate how well the response completes or follows the query’s instruction.

Request:

Returns:

Example response:

{
    "score": 99
}
  • score: An integer, 0-100, measuring how well the response completes the query (100 is a perfect completion)

Context Relevancy

metrics.context_relevancy.evaluate(query: str, context: list[ContextNode]) ContextRelevancyResponse

Evaluate the relevancy of the context to the query.

Request:

Returns:

Example response:

{
    "score": 88.5,
    "topics": ["geography", "capitals", "France"],
    "scores": [0.92, 0.85, 0.88]
}

Cost

metrics.cost.evaluate(model_name: str, model_type: str, model_provider: str, number_of_queries: int, total_prompt_tokens: int | None = None, total_completion_tokens: int | None = None, total_tokens: int | None = None, instance_type: str | None = None, average_latency: float | None = None) CostResponse

Evaluates the cost of API usage based on token counts, model information, and infrastructure details.

Request:

Returns:

Example response:

{
    "cost_estimate_per_run": 0.0025,
    "total_project_cost_estimate": 0.0125
}

Faithfulness

metrics.faithfulness.evaluate(query: str, response: str, context: list[ContextNode]) FaithfulnessResponse

Evaluate the faithfulness of a response against its context.

Request:

Returns:

Example response:

{
    "score": 99.971924,
    "facts": [
            {
            "statement": "The capital of France is Paris.",
            "label": "Safe",
            "prob": 0.9997192,
            "sentence_span": [
                0,
                30
            ]
        }
    ]
}

Formality

metrics.formality.evaluate(response: str) FormalityResponse

Evaluate the formality level of a response.

Request:

Returns:

Example response:

{
    "score": 75.0,
    "sentences": [
        "The capital of France is Paris."
    ],
    "scores": [0.75]
}

Helpfulness

metrics.helpfulness.evaluate(response: str) HelpfulnessResponse

Evaluate the helpfulness of a response.

Request:

Returns:

Example response:

{
    "score": 88.0
}

PII

metrics.pii.evaluate(text: str, blocklist: list[str] | None = None, allowlist: list[str] | None = None) PIIResponse

Detect personally identifiable information in text.

Request:

Returns:

Example response:

{
    "identified_pii": [
        {
            "interval": [0, 5],
            "string": "Hello",
            "category": "blocklist"
        },
        {
            "interval": [94, 111],
            "string": "www.wikipedia.org",
            "category": "organization"
        }
    ]
}

Prompt Injection

metrics.prompt_injection.evaluate(query: str) PromptInjectionResponse

Detect potential prompt injection attempts.

Request:

Returns:

Example response:

{
    "score": 98.0
}

Refusal

metrics.refusal.evaluate(query: str, response: str) RefusalResponse

Evaluate the likelihood that a response is a refusal to answer or comply with the query.

Request:

Returns:

Example response:

{
    "score": 5
}
  • score: An integer, 0-100, measuring the degree (firmness) of refusal (100 is a strong refusal)

Sensitivity

metrics.sensitivity.evaluate(response: str, topics: list[str]) SensitivityResponse

Evaluate the sensitivity of a response regarding specific topics.

Request:

Returns:

Example response:

{
    "scores": {
        "politics": 0.70,
        "religion": 0.60
    }
}

Simplicity

metrics.simplicity.evaluate(response: str) SimplicityResponse

Evaluate the simplicity of a response.

Request:

Returns:

Example response:

{
    "score": 82.0
}

Stability

metrics.stability.evaluate(responses: list[str]) StabilityResponse

Evaluate the stability (consistency) of multiple responses to the same prompt.

Request:

Returns:

Example response:

{
    "min": 80,
    "avg": 87
}
  • min: An integer, 0-100, measuring the minimum stability between any two pairs of responses (100 is high similarity)

  • avg: An integer, 0-100, measuring the average stability between all two pairs of responses (100 is high similarity)

Summarization

metrics.summarization.evaluate(response: str, context: list[ContextNode]) SummarizationResponse

Evaluate the quality of a summary.

Request:

Returns:

Example response:

{
    "score": 90.0
}

Tone

metrics.tone.evaluate(response: str) ToneResponse

Evaluate the tone of a response.

Request:

Returns:

Example response:

{
    "labels": [
        "neutral",
        "happiness",
        "realization"
    ],
    "scores": [
        89.704185,
        6.6798472,
        2.9873204
    ]
}

Toxicity

metrics.toxicity.evaluate(response: str) ToxicityResponse

Evaluate the toxicity of a response.

Request:

Returns:

Example response:

{
    "labels": [
        "identity_hate",
        "insult",
        "threat",
        "obscene",
        "toxic"
    ],
    "scores": [
        0.036089644,
        0.06207772,
        0.027964465,
        0.105483316,
        0.3622106
    ]
}