V4 Metrics
Note
Version 4 is now the default for Trustwise metrics. It provides enhanced accuracy, better type safety, and improved response structures compared to v3.
All metrics are available in both synchronous and asynchronous forms. For async, use TrustwiseSDKAsync and await the evaluate() methods.
The SDK provides access to all v4 metrics directly through the metrics namespace. Each metric provides an evaluate() function with enhanced type safety and improved response structures. Example usage:
For more details on the metrics, please refer to the Trustwise Metrics Documentation.
from trustwise.sdk import TrustwiseSDK
from trustwise.sdk.config import TrustwiseConfig
from trustwise.sdk.metrics.v4.types import ContextChunk
config = TrustwiseConfig(api_key="your-api-key")
trustwise = TrustwiseSDK(config)
# v4 context format
context = [
ContextChunk(chunk_text="Paris is the capital of France.", chunk_id="doc:idx:1")
]
# v4 metric calls (now default)
result = trustwise.metrics.faithfulness.evaluate(
query="What is the capital of France?",
response="The capital of France is Paris.",
context=context
)
Async example:
import asyncio
from trustwise.sdk import TrustwiseSDKAsync
from trustwise.sdk.config import TrustwiseConfig
from trustwise.sdk.metrics.v4.types import ContextChunk
async def main():
config = TrustwiseConfig(api_key="your-api-key")
trustwise = TrustwiseSDKAsync(config)
context = [
ContextChunk(chunk_text="Paris is the capital of France.", chunk_id="doc:idx:1")
]
result = await trustwise.metrics.faithfulness.evaluate(
query="What is the capital of France?",
response="The capital of France is Paris.",
context=context
)
asyncio.run(main())
Refer to the API Reference for details on each metric’s parameters.
Note
Custom types such as ContextChunk are defined in trustwise.sdk.metrics.v4.types.
Adherence
- metrics.adherence.evaluate(policy: str, response: str) AdherenceResponse
Evaluate how well the response follows a given policy or instruction.
Request:
Returns:
Example response:
{ "score": 85.0 }
score: A float, 0-100, measuring how well the response follows the policy (higher is better adherence)
Answer Relevancy
- metrics.answer_relevancy.evaluate(query: str, response: str) AnswerRelevancyResponse
Evaluate the relevancy of a response to the query.
Request:
Returns:
Example response:
{ "score": 92.0, "generated_question": "What is the capital city of France?" }
score: A float, 0-100, measuring how relevant the response is to the query. Higher score indicates better relevancy.generated_question: The generated question for which the response would be relevant
Clarity
- metrics.clarity.evaluate(text: str) ClarityResponse
The Trustwise Clarity metric measures how easy text is to read. It gives higher scores to writing that contains words which are easier to read, and uses concise, self-contained sentences. It does not measure how well you understand the ideas in the text.
Request:
Returns:
Example response:
{ "score": 92.5 }
score: A float, 0-100, measuring how clear and understandable the response is. Higher score indicates better clarity.
Completion
- metrics.completion.evaluate(query: str, response: str) CompletionResponse
Evaluate how well the response completes or follows the query’s instruction.
Request:
Returns:
Example response:
{ "score": 85.0 }
score: A float, 0-100, measuring how well the response completes the query. Higher score indicates better completion.
Context Relevancy
- metrics.context_relevancy.evaluate(query: str, context: Context, severity: float = None, include_chunk_scores: bool = None, metadata: dict = None) ContextRelevancyResponse
Evaluate the relevancy of the context to the query.
Request:
Returns:
Example response:
{ "score": 88.5, "scores": [ {"label": "Circumstances", "score": 0.33}, {"label": "Claim", "score": 46.34}, {"label": "Policy", "score": 0.11} ] }
score: A float, 0-100, measuring how relevant the context is to the query. Higher score indicates better relevancy.scores: List ofObjectStyleScorewith detailed breakdown by relevancy aspect
Faithfulness
- metrics.faithfulness.evaluate(query: str, response: str, context: Context) FaithfulnessResponse
Evaluate the faithfulness of a response against its context.
Request:
Returns:
Example response:
{ "score": 99.971924, "statements": [ { "statement": "The capital of France is Paris.", "label": "Safe", "probability": 0.9997192, "sentence_span": [0, 30] } ] }
score: A float, 0-100, measuring how faithful the response is to the context (100 is perfect faithfulness)statements: List of extracted atomic statements with their verification status
Formality
- metrics.formality.evaluate(text: str) FormalityResponse
Evaluate the formality level of a response.
Request:
Returns:
Example response:
{ "score": 75.0, }
score: A float, 0-100, measuring the overall formality level (100 is very formal)
Helpfulness
- metrics.helpfulness.evaluate(text: str) HelpfulnessResponse
Evaluate the helpfulness of a response.
Request:
Returns:
Example response:
{ "score": 88.0 }
score: A float, 0-100, measuring how helpful the response is (100 is very helpful)
PII Detection
- metrics.pii.evaluate(text: str, allowlist: list[str] = None, blocklist: list[str] = None, categories: list[str] = None) PIIResponse
Detect personally identifiable information in text.
Request:
Returns:
Example response:
{ "pii": [ { "interval": [0, 5], "string": "Hello", "category": "blocklist" } ] }
pii: List of detected PII entities with their locations and categories
Prompt Manipulation
- metrics.prompt_manipulation.evaluate(text: str, severity: int = None) PromptManipulationResponse
Detect potential prompt manipulation attempts including jailbreak, prompt injection, and role play.
Request:
Returns:
Example response:
{ "score": 0.85, "scores": [ {"label": "jailbreak", "score": 0.90}, {"label": "prompt_injection", "score": 0.80}, {"label": "role_play", "score": 0.85} ] }
score: A float, 0-100, measuring overall prompt manipulation likelihood (higher is more likely)scores: List ofObjectStyleScorewith detailed breakdown by manipulation type
Refusal
- metrics.refusal.evaluate(query: str, response: str) RefusalResponse
Evaluate the likelihood that a response is a refusal to answer or comply with the query.
Request:
Returns:
Example response:
{ "score": 5.0 }
score: A float, 0-100, measuring the degree of refusal (higher indicates stronger refusal)
Sensitivity
- metrics.sensitivity.evaluate(text: str, topics: list[str]) SensitivityResponse
Evaluate the sensitivity of a response regarding specific topics.
Request:
Returns:
Example response:
{ "scores": [ {"label": "politics", "score": 70.0}, {"label": "religion", "score": 60.0} ] }
scores: List ofObjectStyleScorewith sensitivity scores by topic (0-100, higher indicates stronger presence)
Stability
- metrics.stability.evaluate(responses: list[str]) StabilityResponse
Measures how similar the responses are when given the same or similar inputs multiple times. It gives higher scores when responses stay consistent, even if asked by different personas or worded differently. This helps identify if an agent changes its answers unexpectedly.
Request:
Returns:
Example response:
{ "min": 75, "avg": 85 }
min: An integer, 0-100, measuring the minimum stability between any pair of responses (100 is high similarity)avg: An integer, 0-100, measuring the average stability between all pairs of responses (100 is high similarity)
Simplicity
- metrics.simplicity.evaluate(text: str) SimplicityResponse
Measures how easy it is to understand the words in a text. It gives higher scores to writing that uses common, everyday words instead of special terms or complicated words. Simplicity looks at the words you choose, not how you put them together in sentences.
Request:
Returns:
Example response:
{ "score": 82.0 }
score: A float, 0-100, measuring how simple the response is. Higher score indicates simpler text.
Tone
- metrics.tone.evaluate(text: str, tones: list[str] = None) ToneResponse
Evaluate the tone of a response.
Request:
Returns:
Example response:
{ "scores": [ {"label": "neutral", "score": 89.70}, {"label": "happiness", "score": 6.68}, {"label": "realization", "score": 2.99} ] }
scores: List ofObjectStyleScorewith tone confidence scores (0-100, higher indicates stronger presence)
Toxicity
- metrics.toxicity.evaluate(text: str, severity: int = None) ToxicityResponse
Evaluate the toxicity of a response.
Request:
Returns:
Example response:
{ "score": 36.22, "scores": [ {"label": "identity_hate", "score": 3.61}, {"label": "insult", "score": 6.21}, {"label": "threat", "score": 2.80}, {"label": "obscene", "score": 10.55}, {"label": "toxic", "score": 36.22} ] }
score: A float, 0-100, measuring overall toxicity (higher is more toxic)scores: List ofObjectStyleScorewith detailed breakdown by toxicity category
Carbon
- metrics.carbon.evaluate(provider: str, region: str, instance_type: str, latency: float | int) CarbonResponse
Evaluate the carbon footprint of AI operations based on provider, instance type, region, and latency.
Request:
Returns:
Example response:
{ "carbon": { "value": 0.0011949989480068127, "unit": "kg_co2e" }, "components": [ { "component": "operational_gpu", "carbon": { "value": 0.0, "unit": "kg_co2e" } }, { "component": "operational_cpu", "carbon": { "value": 0.00021294669026962343, "unit": "kg_co2e" } }, { "component": "embodied_cpu", "carbon": { "value": 0.0009820522577371892, "unit": "kg_co2e" } } ] }
carbon:CarbonValuewith total carbon footprintcomponents: List ofCarbonComponentwith breakdown by component