Version: v3.0.0

Optimize:ai Scan Results

Optimize:ai is designed to refine the performance of Retrieval-Augmented Generation (RAG) pipelines by optimizing cost, latency, and carbon emissions while maintaining above-threshold functional performance. This document outlines the development and operational methodology of the Optimize:ai solution.

The Aim of Optimize:ai

The primary goal of Optimize:ai is to enhance the efficiency of users' RAG pipelines across three critical aspects:

  • Cost: Reducing operational expenses.
  • Latency: Decreasing the time taken to retrieve and generate responses.
  • Carbon Emissions: Minimizing the environmental impact of operations.

Functional performance is assessed through Safety and Alignment Scores, ensuring that the system not only operates efficiently but also safely and in alignment with ethical standards.

Solution Overview

The Optimize:ai solution simulates various RAG configurations to identify the most effective setup for the user’s specific requirements. During this simulation, the system evaluates:

  • SCI per call: Following the SCI Guidance, the system calculates the Software Carbon Intensity (SCI) for each API call.
  • Cost: Estimating the expected operational costs.
  • Safety and Alignment Metrics: Assessing the model's safety and alignment with ethical standards if these evaluations are enabled.

Methodology: Configuring the Best RAG Setup

The backend processes user requests by computing all possible iterations of the RAG pipeline, exploring unique combinations of configurations (a minimal sketch of this enumeration follows the list below). These configurations adjust operational parameters such as:

  • Chunking Strategies: Size (256, 512, 1024) and number of retrieved chunks (2, 4, 8).
  • Retrieval Strategies: Re-ranking of retrieved contexts and the use of different embedding models (bge-small and OpenAI's ada embedding model).
  • Quantization Strategies: Model quantization and pruning methods.
  • Energy Use Strategies: Optimal combination of providers and regions for reduced CO2 emissions, with carbon emissions calculated from the chosen provider, region, and processor type.
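As an illustration of the scale of this search, the sketch below enumerates combinations of the options listed above using plain Python. The parameter names, the quantization labels, and the overall structure are assumptions for illustration only and do not reflect the internal representation used by the Optimize:ai backend.

```python
# Illustrative enumeration of the RAG configuration search space.
# Parameter names and quantization labels are placeholder assumptions.
from itertools import product

chunk_sizes = [256, 512, 1024]
retrieved_chunks = [2, 4, 8]
embedding_models = ["bge-small", "text-embedding-ada-002"]
rerank_options = [True, False]
quantization_options = ["none", "int8", "pruned"]

configurations = [
    {
        "chunk_size": size,
        "top_k": k,
        "embedding_model": model,
        "rerank": rerank,
        "quantization": quant,
    }
    for size, k, model, rerank, quant in product(
        chunk_sizes, retrieved_chunks, embedding_models,
        rerank_options, quantization_options,
    )
]

print(f"{len(configurations)} candidate RAG configurations to simulate")
```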

Carbon Emissions and SCI Calculations

Carbon emissions are estimated by:

  • Determining the processor's thermal design power.
  • Assessing the provider's carbon impact value.
  • Using data from the MLCO2 database to ascertain the gCO2eq per kWh consumed.
  • Estimating the hours used from the average latency of the RAG iterations, multiplied by the total number of queries in a project and the number of users.

Carbon Emissions = Power Consumption of processor * Hours used * Impact of the provider and region
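The following worked example applies the formula above to placeholder values; the TDP, latency, query volume, and grid-intensity figures are assumptions, not Optimize:ai defaults.

```python
# Worked example of the formula above; every number is a placeholder.
tdp_watts = 300.0            # processor thermal design power (W)
avg_latency_s = 1.2          # average latency per RAG call (seconds)
total_queries = 10_000       # queries in the project
num_users = 50
grid_intensity = 0.475       # provider/region impact (kg CO2 eq. per kWh)

# Hours used = average latency * total queries * number of users, in hours
hours_used = avg_latency_s * total_queries * num_users / 3600.0

# Carbon Emissions = power consumption * hours used * provider/region impact
energy_kwh = (tdp_watts / 1000.0) * hours_used
carbon_kg = energy_kwh * grid_intensity

print(f"Estimated emissions: {carbon_kg:.2f} kg CO2 eq.")
```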

The SCI Score, calculated per API call, incorporates:

SCI = ((E * I) + M) per R

where E is the energy consumed, I is the carbon intensity of that energy (gCO2eq per kWh), M is the embodied emissions of the hardware, and R is the functional unit, here one API call.
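A minimal sketch of the per-call SCI calculation is shown below, following the formula above; the function name and the numbers in the example are illustrative assumptions.

```python
# Per-call SCI following the formula above; the values are illustrative.
def sci_per_call(energy_kwh: float, intensity_g_per_kwh: float,
                 embodied_g: float, calls: int) -> float:
    """SCI = ((E * I) + M) per R, with R taken as one API call."""
    return ((energy_kwh * intensity_g_per_kwh) + embodied_g) / calls

# 0.5 kWh consumed and 20 gCO2eq embodied emissions amortised over
# 10,000 calls on a 475 gCO2eq/kWh grid.
print(f"{sci_per_call(0.5, 475.0, 20.0, 10_000):.4f} gCO2eq per call")
```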


Overview of Optimize Scan Results

The rest of this document provides a detailed guide to accessing and interpreting the results of Optimize scans, including visual representations and data analysis from the Trustwise Control Plane. It covers how to access scan results, view previous scans, analyze leaderboard metrics, and generate detailed reports for further analysis. The Trustwise Control Plane allows you to monitor and analyze the performance of your language models and to navigate the results of the scans you have conducted.

Step 1: Access Scan Results

Start by navigating to the Trustwise Control Plane. This is where all your scan activities are centralized.

Trustwise Control Plane

Minimize Configurations Panel

For a clearer view of the scan results, you can minimize the configurations tab on the left by clicking the blue button with arrows. This will expand the main panel where the results are displayed.

Trustwise Control Plane

Step 2: View Previous Scans

Click on the 'Previous Scans' button to display a list of all your past scan activities.

Optimize:ai Previous Scans

Step 3: Analyze Leaderboard and Results

Select the 'View Leaderboard' button next to the scan you wish to analyze. It may take a few seconds for the results to appear below the button.

Performance Summary

At the top of the results, you will find the Performance Summary table, which includes:

  • Max Cost Reduction %
  • Max CO2 Reduction % (Kg CO2 eq.)
  • Max Composite Performance
  • Min Composite Performance

These metrics give a quick snapshot of each model's impact on performance and efficiency.

Financial Cost Scatter Plot

Below the summary, the Financial Cost Scatter Plot visualizes the relationship between financial cost and composite performance (functional, safety, and alignment) of each model.

Optimize:ai Results

Detailed Leaderboard Table

The detailed leaderboard table provides extensive data on each model, including:

  • Model name and Model provider
  • Cost estimate for 10k inferences in USD
  • Carbon emitted (kg CO2 eq.)
  • Average latency (in ms)
  • Chunk size - 256, 512, 1024
  • Number of retrieved chunks - 2, 4, 8
  • Total tokens used, Prompt tokens used, and Response tokens used
  • Safety and Alignment Metrics including average Faithfulness, Answer Relevancy, Context Relevancy, Summarization, Simplicity, Helpfulness, Clarity, and Formality

The leaderboard allows for sorting by each column, providing a versatile view for detailed analysis.

Optimize:ai Results

Generating an Optimize:ai Scan Report

In addition to viewing Optimize:ai results, users can also generate detailed reports for their models. This guide will help you generate a comprehensive report for a selected model from the leaderboard.

Step 1: Select Model from Leaderboard

After viewing the leaderboard, locate the dropdown menu below the leaderboard table. From this menu, select the model for which you wish to generate a report.

Optimize:ai Models

Step 2: Generate Report

Once the model is selected, click on the "Generate Report" button. The report generation may take a few seconds, after which it will be displayed on the screen.

Optimize:ai Select Model

Step 3: Review Generated Report

The generated report will include detailed information and metrics for the selected model. Below are the sections included in the report:

Model Information

  • Model Name
  • Model Owner
  • Model Type
  • Model Size
  • Papers and Samples: Links to relevant documentation and sample outputs for the model.

Performance Benchmarks

  • RAISE Safety Benchmark
  • RAISE Alignment Benchmark

These benchmarks are provided in association with the RAI Institute, and links to detailed reports are included for deeper insights.

Performance Summary

The report also includes the Performance Summary table, which contains:

  • Max Cost Reduction %
  • Max CO2 Reduction % (Kg CO2 eq.)
  • Max Composite Performance
  • Min Composite Performance

These metrics give a quick snapshot of each model's impact on performance and efficiency.

Optimize:ai Report

Detailed Leaderboard Table

The detailed leaderboard table provides extensive data on each model, including:

  • Model name and Model provider
  • Cost estimate for 10k inferences in USD
  • Carbon emitted (kg CO2 eq.)
  • Average latency (in ms)
  • Chunk size - 256, 512, 1024
  • Number of retrieved chunks - 2, 4, 8
  • Total tokens used, Prompt tokens used, and Response tokens used
  • Safety and Alignment Metrics including average Faithfulness, Answer Relevancy, Context Relevancy, Summarization, Simplicity, Helpfulness, Clarity, and Formality

Summary of Findings

This final section provides a summary of the key findings from the Optimize:ai scans, offering a high-level overview of the model's performance across various metrics and configurations. The insights gained from the detailed leaderboard analysis are encapsulated here, highlighting the overall efficiency, reliability, and operational cost-effectiveness of the model under test.

This summary is designed to assist stakeholders in quickly grasping the significant outcomes and implications of the scan results, facilitating informed decision-making regarding the future utilization and optimization of the model.

Optimize:ai Report

Safety Metrics Details

Summarization

Summarization evaluates the response using the context and aims to determine whether the response is factually consistent with respect to the provided context. The response is split into individual sentences and all (context, sentence) pairs are passed through a fine-tuned model, determining whether the contents of the sentence are supported by the provided context. This process results in a Summarization metric, scored between 0 and 100, with higher scores indicating that more of the claims in the response are supported by the context.
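The sketch below illustrates the sentence-level support check described above, using an off-the-shelf NLI model (roberta-large-mnli) as a stand-in for the fine-tuned model used by Optimize:ai; the sentence splitting and scoring shown are simplified assumptions.

```python
# Sentence-level support check with a generic NLI model as a stand-in.
import re
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

def summarization_score(context: str, response: str) -> float:
    # Naive sentence split; production systems use a proper sentence segmenter
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", response.strip()) if s]
    supported = 0
    for sentence in sentences:
        inputs = tokenizer(context, sentence, return_tensors="pt", truncation=True)
        with torch.no_grad():
            probs = model(**inputs).logits.softmax(dim=-1)[0]
        # roberta-large-mnli labels: 0 = contradiction, 1 = neutral, 2 = entailment
        if probs.argmax().item() == 2:
            supported += 1
    return 100.0 * supported / max(len(sentences), 1)
```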

Faithfulness

Faithfulness evaluates the response using the context and aims to determine whether the response is free from any false statements based on the provided context. The response is broken down into ‘atomic’ facts, which contain only one piece of information. Each of these facts is then verified with respect to the context. This process results in a Faithfulness metric, scored between 0 and 100, with higher scores indicating that a higher proportion of the claims in the response are supported by the context.

Context Relevancy

Context Relevancy evaluates the context using the query and aims to assess whether the context contains all the information necessary to answer the query. First, the key topics discussed in the query are identified. Then, each context chunk is evaluated to determine whether the topics are present. This is then aggregated into a score between 0 and 100, with higher scores indicating that more of the topics present in the query are discussed in the context.
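A minimal, model-free sketch of this topic-coverage idea is shown below; simple keyword overlap stands in for the topic identification performed by Optimize:ai and is an illustrative assumption only.

```python
# Model-free illustration of topic coverage: which query terms appear
# in the retrieved context chunks.
import re

STOPWORDS = {"the", "a", "an", "is", "are", "of", "in", "to", "what", "how", "and"}

def context_relevancy(query: str, chunks: list[str]) -> float:
    topics = {w for w in re.findall(r"\w+", query.lower()) if w not in STOPWORDS}
    if not topics:
        return 0.0
    combined_context = " ".join(chunks).lower()
    covered = {t for t in topics if t in combined_context}
    return 100.0 * len(covered) / len(topics)

print(context_relevancy("What is the SCI score per API call?",
                        ["The SCI score is computed for every API call."]))
```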

Answer Relevancy

Answer Relevancy evaluates the response using the query and aims to determine whether the response includes an attempt to answer the question while being free from superfluous information. The response is used to generate a question for which the response would be suitable. The actual and generated questions are then compared, resulting in a similarity score between 0 and 100. A higher score means the response is more likely to have answered the correct query.
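The sketch below illustrates the comparison step, assuming a generic sentence-transformers model; generating the candidate question from the response would normally be done by an LLM, so it is passed in directly here.

```python
# Comparison of the actual query with a question generated from the response.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def answer_relevancy(actual_query: str, generated_query: str) -> float:
    embeddings = encoder.encode([actual_query, generated_query], convert_to_tensor=True)
    similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
    # Map cosine similarity from [-1, 1] onto a 0-100 score
    return max(0.0, min(100.0, (similarity + 1.0) / 2.0 * 100.0))
```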

Hallucination Classes

  • Industry Agnostic
  • Healthcare Claims
  • Financial Services
  • Life Sciences
  • Medical Industry


Alignment Metrics Details

Tone

Tone evaluates the emotional content of text by determining which three emotions are most strongly present. The assessment uses a fine-tuned model to analyze the text against a broad spectrum of emotional examples. Each of the three detected emotions is scored between 0 and 100, where a higher score indicates a stronger presence of the corresponding emotion. This helps in understanding the emotional impact of the text.
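A minimal sketch of the top-three-emotions idea is shown below, assuming an off-the-shelf emotion classifier (j-hartmann/emotion-english-distilroberta-base) in place of the fine-tuned Trustwise model.

```python
# Top-three emotions from a generic emotion classifier, rescaled to 0-100.
from transformers import pipeline

emotion_classifier = pipeline(
    "text-classification",
    model="j-hartmann/emotion-english-distilroberta-base",
    top_k=3,
)

def tone(text: str) -> list[dict]:
    results = emotion_classifier([text])[0]  # three strongest emotions
    return [{"emotion": r["label"], "score": round(r["score"] * 100, 1)} for r in results]
```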

Clarity

Clarity focuses on how easy the text is to read, considering its grammar and structure. The clarity of the text is quantified into a score from 0 to 100, with higher scores indicating that the text is easier to read. This metric is crucial for ensuring that the text is understandable without ambiguity, making it accessible to a broader audience.

Formality

Formality measures the level of formality in text. The text is segmented into sentences, each of which is encoded and analyzed by a fine-tuned, topic-classifier model that compares it to a wide range of texts varying in formality. Each sentence receives a score, contributing to an overall formality score between 0 and 100, with higher scores indicating a more formal tone. This metric is vital for ensuring text suitability in professional or casual contexts.
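The sketch below illustrates sentence-level formality scoring with an off-the-shelf classifier; the model name and its label scheme are assumptions, not the fine-tuned model Trustwise uses.

```python
# Sentence-level formality scoring; the model and its label names are assumptions.
import re
from transformers import pipeline

formality_classifier = pipeline(
    "text-classification",
    model="s-nlp/roberta-base-formality-ranker",
)

def formality(text: str) -> float:
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    scores = []
    for sentence in sentences:
        result = formality_classifier(sentence)[0]
        # Assumes the classifier exposes a "formal" label; adjust for the model used
        p_formal = result["score"] if result["label"].lower() == "formal" else 1.0 - result["score"]
        scores.append(p_formal)
    return 100.0 * sum(scores) / max(len(scores), 1)
```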

Helpfulness

Helpfulness evaluates the effectiveness of a response in addressing the query posed. It considers the relevance, detail, and usefulness of the information provided in the response. Both the query and the response are analyzed using a fine-tuned model that benchmarks them against examples of helpful and unhelpful responses. Scores range from 0 to 100, with higher scores indicating a more helpful response.

Simplicity

Simplicity assesses the ease of understanding the text. It analyzes the complexity of the vocabulary used by breaking the text into tokens and evaluating the rarity of each token within common literature. The overall simplicity score ranges from 0 to 100, where higher scores signify that the text uses simpler, more common words and phrases, making it easier to comprehend.
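A minimal sketch of the token-rarity idea is shown below, using the open-source wordfreq package as a stand-in for the rarity statistics Optimize:ai relies on.

```python
# Token-rarity illustration using wordfreq's Zipf frequencies (0 = rare, ~8 = very common).
import re
from wordfreq import zipf_frequency

def simplicity(text: str) -> float:
    tokens = re.findall(r"[a-zA-Z']+", text.lower())
    if not tokens:
        return 0.0
    avg_zipf = sum(zipf_frequency(tok, "en") for tok in tokens) / len(tokens)
    return min(100.0, avg_zipf / 8.0 * 100.0)

print(simplicity("The cat sat on the mat."))
print(simplicity("Quantization mitigates inference latency heterogeneity."))
```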

Toxicity

Toxicity determines the presence of harmful or offensive content within the text. Utilizing a collection of fine-tuned models, each trained to recognize different forms of toxicity, the text is scored on its overall toxicity and on five specific toxic styles. Scores range from 0 to 100 for each style, with higher scores indicating more pronounced toxicity. This metric is essential for moderating content.


Additional Information

For further assistance or to report issues with the Trustwise AI Portal, please contact our support team at support@trustwise.ai.