Metrics
Note
All metrics are available in both synchronous and asynchronous forms. For async, use TrustwiseSDKAsync and await the evaluate() methods.
The SDK provides access to all metrics through the unified metrics namespace. Each metric provides an evaluate() function. For more details on the metrics, please refer to the Trustwise Metrics Documentation.

Example usage:

result = trustwise.metrics.faithfulness.evaluate(query="...", response="...", context=[...])
clarity = trustwise.metrics.clarity.evaluate(response="...")
cost = trustwise.metrics.cost.evaluate(model_name="...", model_type="LLM", ...)
Async example:
import asyncio
from trustwise.sdk import TrustwiseSDKAsync
from trustwise.sdk.config import TrustwiseConfig

async def main():
    config = TrustwiseConfig()
    trustwise = TrustwiseSDKAsync(config)
    result = await trustwise.metrics.faithfulness.evaluate(
        query="What is the capital of France?",
        response="The capital of France is Paris.",
        context=[{"node_id": "1", "node_score": 1.0, "node_text": "Paris is the capital of France."}],
    )
    print(result)

asyncio.run(main())
Refer to the API Reference for details on each metric’s parameters.
Note
Custom types such as Context are defined in trustwise.sdk.types.
Adherence
- metrics.adherence.evaluate(policy: str, response: str) → AdherenceResponse
Evaluate how well the response adheres to a given policy or instruction.
Request:
Returns:
Example response:
{ "score": 95 }
score: An integer, 0-100, measuring how well the response follows the policy (100 is perfect adherence)
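For example, a minimal adherence check might look like the following sketch; it assumes a trustwise client constructed as in the example above, and the policy and response strings are purely illustrative.

result = trustwise.metrics.adherence.evaluate(
    policy="Only answer questions about European geography.",
    response="The capital of France is Paris.",
)
print(result)  # AdherenceResponse with a 0-100 score, e.g. {"score": 95}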
Answer Relevancy
- metrics.answer_relevancy.evaluate(query: str, response: str) → AnswerRelevancyResponse
Evaluate the relevancy of a response to the query.
Request:
Returns:
Example response:
{ "score": 92.0, "generated_question": "What is the capital city of France?" }
Carbon
- metrics.carbon.evaluate(processor_name: str, provider_name: str, provider_region: str, instance_type: str, average_latency: int) → CarbonResponse
Evaluate carbon emissions based on hardware specifications and infrastructure details.
Request:
Returns:
Example response:
{ "carbon_emitted": 0.00015, "sci_per_api_call": 0.00003, "sci_per_10k_calls": 0.3 }
Clarity
- metrics.clarity.evaluate(response: str) → ClarityResponse
Evaluate the clarity of a response.
Request:
Returns:
Example response:
{ "score": 92.5 }
Completion
- metrics.completion.evaluate(query: str, response: str) → CompletionResponse
Evaluate how well the response completes or follows the query’s instruction.
Request:
Returns:
Example response:
{ "score": 99 }
score: An integer, 0-100, measuring how well the response completes the query (100 is a perfect completion)
Context Relevancy
- metrics.context_relevancy.evaluate(query: str, context: list[ContextNode]) → ContextRelevancyResponse
Evaluate the relevancy of the context to the query.
Request:
Returns:
Example response:
{ "score": 88.5, "topics": ["geography", "capitals", "France"], "scores": [0.92, 0.85, 0.88] }
Cost
- metrics.cost.evaluate(model_name: str, model_type: str, model_provider: str, number_of_queries: int, total_prompt_tokens: int | None = None, total_completion_tokens: int | None = None, total_tokens: int | None = None, instance_type: str | None = None, average_latency: float | None = None) → CostResponse
Evaluate the cost of API usage based on token counts, model information, and infrastructure details.
Request:
Returns:
Example response:
{ "cost_estimate_per_run": 0.0025, "total_project_cost_estimate": 0.0125 }
Faithfulness
- metrics.faithfulness.evaluate(query: str, response: str, context: list[ContextNode]) → FaithfulnessResponse
Evaluate the faithfulness of a response against its context.
Request:
Returns:
Example response:
{ "score": 99.971924, "facts": [ { "statement": "The capital of France is Paris.", "label": "Safe", "prob": 0.9997192, "sentence_span": [ 0, 30 ] } ] }
Formality
- metrics.formality.evaluate(response: str) → FormalityResponse
Evaluate the formality level of a response.
Request:
Returns:
Example response:
{ "score": 75.0, "sentences": [ "The capital of France is Paris." ], "scores": [0.75] }
Helpfulness
- metrics.helpfulness.evaluate(response: str) → HelpfulnessResponse
Evaluate the helpfulness of a response.
Request:
Returns:
Example response:
{ "score": 88.0 }
PII
- metrics.pii.evaluate(text: str, blocklist: list[str] | None = None, allowlist: list[str] | None = None) → PIIResponse
Detect personally identifiable information in text.
Request:
Returns:
Example response:
{ "identified_pii": [ { "interval": [0, 5], "string": "Hello", "category": "blocklist" }, { "interval": [94, 111], "string": "www.wikipedia.org", "category": "organization" } ] }
Prompt Injection
- metrics.prompt_injection.evaluate(query: str) → PromptInjectionResponse
Detect potential prompt injection attempts.
Request:
Returns:
Example response:
{ "score": 98.0 }
Refusal
- metrics.refusal.evaluate(query: str, response: str) → RefusalResponse
Evaluate the likelihood that a response is a refusal to answer or comply with the query.
Request:
Returns:
Example response:
{ "score": 5 }
score: An integer, 0-100, measuring the degree (firmness) of refusal (100 is a strong refusal)
Sensitivity
- metrics.sensitivity.evaluate(response: str, topics: list[str]) → SensitivityResponse
Evaluate the sensitivity of a response regarding specific topics.
Request:
Returns:
Example response:
{ "scores": { "politics": 0.70, "religion": 0.60 } }
Simplicity
- metrics.simplicity.evaluate(response: str) → SimplicityResponse
Evaluate the simplicity of a response.
Request:
Returns:
Example response:
{ "score": 82.0 }
Stability
- metrics.stability.evaluate(responses: list[str]) → StabilityResponse
Evaluate the stability (consistency) of multiple responses to the same prompt.
Request:
Returns:
Example response:
{ "min": 80, "avg": 87 }
min: An integer, 0-100, measuring the minimum stability between any two of the responses (100 is high similarity)
avg: An integer, 0-100, measuring the average stability across all pairs of responses (100 is high similarity)
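A usage sketch, assuming a trustwise client as above; the responses are illustrative re-generations of the same prompt:

result = trustwise.metrics.stability.evaluate(
    responses=[
        "The capital of France is Paris.",
        "Paris is the capital of France.",
        "France's capital city is Paris.",
    ],
)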
Summarization
- metrics.summarization.evaluate(response: str, context: list[ContextNode]) → SummarizationResponse
Evaluate the quality of a summary.
Request:
Returns:
Example response:
{ "score": 90.0 }
Tone
- metrics.tone.evaluate(response: str) → ToneResponse
Evaluate the tone of a response.
Request:
Returns:
Example response:
{ "labels": [ "neutral", "happiness", "realization" ], "scores": [ 89.704185, 6.6798472, 2.9873204 ] }
Toxicity
- metrics.toxicity.evaluate(response: str) → ToxicityResponse
Evaluate the toxicity of a response.
Request:
Returns:
Example response:
{ "labels": [ "identity_hate", "insult", "threat", "obscene", "toxic" ], "scores": [ 0.036089644, 0.06207772, 0.027964465, 0.105483316, 0.3622106 ] }