
🧪 Harmony AI Evaluation Metrics

Harmony AI provides tools for simulating real-world and adversarial scenarios to evaluate the robustness and reliability of AI agents. It supports persona generation, prompt variations, hostile prompt injection, and metric-based evaluation.


📈 Evaluation Metrics

Use the following endpoints to score your AI agent's responses to generated prompts:

| Metric | Endpoint | Example Use Case |
| --- | --- | --- |
| 🟢 Stability | /metrics/v3/stability | Response consistency for similar prompts |
| ✅ Completion | /metrics/v3/completion | Accuracy and thoroughness of standard replies |
| ❌ Refusal | /metrics/v3/refusal | Rejection of inappropriate content |
| 📌 Adherence | /metrics/v3/adherence | Policy and rules compliance |
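
As a minimal sketch, the four endpoint paths listed above can be kept in a lookup table so that calling code refers to metrics by name. The base URL `https://api.example.com` is a placeholder assumption (the host is not stated here), and `metric_url` is a hypothetical helper, not part of any Harmony AI client.

```python
from urllib.parse import urljoin

# Placeholder host: the real base URL is not given in this document.
BASE_URL = "https://api.example.com"

# Endpoint paths as listed in the metrics overview.
METRIC_ENDPOINTS = {
    "stability": "/metrics/v3/stability",
    "completion": "/metrics/v3/completion",
    "refusal": "/metrics/v3/refusal",
    "adherence": "/metrics/v3/adherence",
}


def metric_url(metric: str, base_url: str = BASE_URL) -> str:
    """Return the full URL for a metric name; raise ValueError if unknown."""
    try:
        path = METRIC_ENDPOINTS[metric.lower()]
    except KeyError:
        raise ValueError(f"unknown metric: {metric!r}") from None
    return urljoin(base_url, path)
```

Failing early on an unknown metric name keeps typos from turning into requests against a path that does not exist.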

🔹 Stability Evaluation /metrics/v3/stability

Purpose: Ensure agents respond consistently to reworded prompts.

Inputs:

  • responses: List of responses (minimum 2)

Example Use Cases:

  • Testing responses to paraphrased or persona-based prompt variations.
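
A request body for this endpoint might be assembled as follows. This is a sketch under assumptions: the JSON key `responses` mirrors the input name above, each response is assumed to be a plain string, and `build_stability_payload` is a hypothetical helper. The minimum of two responses is checked locally before anything is sent.

```python
import json


def build_stability_payload(responses):
    """Serialize a stability request body, enforcing the 2-response minimum."""
    responses = list(responses)
    if len(responses) < 2:
        raise ValueError("stability evaluation needs at least 2 responses")
    # Assumption: responses are plain strings; the source does not say.
    if not all(isinstance(r, str) for r in responses):
        raise TypeError("each response must be a string")
    return json.dumps({"responses": responses})
```

For example, the replies an agent gave to two paraphrases of the same question would be passed together as one list.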

🔹 Completion Evaluation /metrics/v3/completion

Purpose: Evaluate whether the agent completes the task as requested.

Inputs:

  • query: User prompt
  • response: Agent reply

Example Use Cases:

  • Recipe assistant giving full ingredients and instructions.

🔹 Refusal Evaluation /metrics/v3/refusal

Purpose: Check how clearly and appropriately the agent refuses harmful prompts.

Inputs:

  • query: Prompt
  • response: Agent reply

Example Use Cases:

  • Agent refusing to answer questions about illegal activity.

🔹 Adherence Evaluation /metrics/v3/adherence

Purpose: Check if the agent strictly follows defined policies.

Inputs:

  • policy: Policy text
  • response: Agent reply

Example Use Cases:

  • Enforcing disclaimers in medical/legal responses.
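
The completion, refusal, and adherence endpoints each take two named inputs, so one small builder can cover all three. The sketch below assumes the JSON keys match the input names listed in each section and that no other fields are accepted; `build_payload` and `REQUIRED_FIELDS` are hypothetical names.

```python
import json

# Input names per metric, as listed in the Inputs sections.
REQUIRED_FIELDS = {
    "completion": ("query", "response"),
    "refusal": ("query", "response"),
    "adherence": ("policy", "response"),
}


def build_payload(metric, **fields):
    """Serialize a request body containing exactly the metric's inputs."""
    if metric not in REQUIRED_FIELDS:
        raise ValueError(f"unsupported metric: {metric!r}")
    required = REQUIRED_FIELDS[metric]
    missing = [name for name in required if name not in fields]
    extra = [name for name in fields if name not in required]
    if missing or extra:
        raise ValueError(f"missing={missing}, unexpected={extra}")
    return json.dumps({name: fields[name] for name in required})
```

An adherence check would then be built with `build_payload("adherence", policy=..., response=...)`, while passing `query` to that metric is rejected before any request is made.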

🔐 Authentication & API Access

All endpoints are secured via Bearer Token Authentication.

Headers Required:

Authorization: Bearer YOUR_API_KEY
Content-Type: application/json
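
One way to attach these headers with Python's standard library is sketched below. The HTTP method is assumed to be POST, which this document does not state, and `build_request` is a hypothetical helper. The function only constructs the request object and does not send it.

```python
import json
import urllib.request


def build_request(url, api_key, body):
    """Build (but do not send) an authenticated JSON request."""
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        method="POST",  # Assumption: the method is not stated in the source.
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
```

The resulting object could be passed to `urllib.request.urlopen` to perform the call. Reading the key from an environment variable rather than hard-coding it keeps it out of source control.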

Response Format: Standardized JSON responses with status, payload, and metadata.
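
A response in this format could be unpacked as sketched below. The top-level keys `status`, `payload`, and `metadata` are assumed to be spelled exactly as named above, and their contents are not documented here, so the helper treats them as opaque. `parse_envelope` is a hypothetical name.

```python
import json

ENVELOPE_KEYS = ("status", "payload", "metadata")


def parse_envelope(raw):
    """Split a JSON response into (status, payload, metadata)."""
    body = json.loads(raw)
    if not isinstance(body, dict):
        raise ValueError("expected a JSON object at the top level")
    missing = [key for key in ENVELOPE_KEYS if key not in body]
    if missing:
        raise ValueError(f"response is missing fields: {missing}")
    return body["status"], body["payload"], body["metadata"]
```

Checking for all three fields up front means a malformed or truncated response is reported once, instead of surfacing later as a `KeyError` deep in scoring code.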