Cost Metrics
The module calculates cost metrics to provide transparency into expenses related to each LLM call, helping users effectively manage and optimize costs. These metrics are displayed in USD and include:
- Cost per LLM Call: Represents the expense of each individual LLM call. This metric is useful for applications that require frequent or mission-critical queries.
- Cost for 10K LLM Calls: A broader view showing the cumulative cost of 10,000 LLM calls, which helps estimate expenses under high-volume usage patterns (see the example after this list).
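The 10K-call figure is a straightforward scaling of the per-call cost. As a minimal sketch (the helper name is illustrative, not part of the module's API):

```python
def cost_for_10k_calls(cost_per_call: float) -> float:
    """Scale a single-call cost (in USD) to a 10,000-call estimate."""
    return cost_per_call * 10_000

# A call costing $0.0003 implies $3.00 for 10,000 calls.
print(f"${cost_for_10k_calls(0.0003):.2f}")  # -> $3.00
```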
Token-Based Cost Calculation (For OpenAI and TogetherAI)
For LLMs provided by OpenAI and TogetherAI, costs are tracked based on token usage, which is divided into:
- Prompt Tokens: Tokens used to form the prompt for the LLM call.
- Response Tokens: Tokens generated by the LLM in response to the prompt.
For each LLM call, the cost is calculated based on both the prompt and response tokens:
- Prompt Token Cost: The number of prompt tokens multiplied by the cost per prompt token.
- Response Token Cost: The number of response tokens multiplied by the cost per response token.
- Total Cost per LLM Call: The sum of the prompt and response token costs (see the sketch after this list).
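A minimal sketch of this calculation, assuming hypothetical per-token prices (actual prices depend on the provider and model):

```python
def llm_call_cost(
    prompt_tokens: int,
    response_tokens: int,
    prompt_token_price: float,    # USD per prompt token (assumed pricing)
    response_token_price: float,  # USD per response token (assumed pricing)
) -> float:
    """Total USD cost of a single LLM call from its token counts."""
    prompt_cost = prompt_tokens * prompt_token_price
    response_cost = response_tokens * response_token_price
    return prompt_cost + response_cost

# Hypothetical pricing: $0.50 per 1M prompt tokens, $1.50 per 1M response tokens.
cost = llm_call_cost(1_200, 350, 0.50 / 1e6, 1.50 / 1e6)
print(f"${cost:.6f}")  # 1,200 * $0.5e-6 + 350 * $1.5e-6 = $0.001125
```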
HuggingFace Cost Calculations
When the LLM provider is HuggingFace, costs are calculated from the time required to run the queries, using the model's latency, the number of queries, and the instance type specified for the deployment. The following methodology applies:
- Query Execution Time: The latency per query multiplied by the total number of queries, giving the total time (in seconds) required to process all queries.
- Instance Type Pricing: The hourly price of running the selected instance type, typically provided by HuggingFace for each model and instance type combination.
- Cost Estimate: The total query execution time, converted to hours, multiplied by the instance's hourly price (illustrated in the sketch after this list).
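A minimal sketch of this estimate, assuming latency measured in seconds per query and an hourly instance price in USD (the function and parameter names are illustrative):

```python
def huggingface_cost_estimate(
    latency_per_query_s: float,      # model latency, in seconds per query
    num_queries: int,
    instance_price_per_hour: float,  # USD per hour for the instance type
) -> float:
    """Estimate the USD cost of running all queries on the chosen instance."""
    total_time_s = latency_per_query_s * num_queries  # query execution time
    total_time_h = total_time_s / 3600.0              # convert seconds to hours
    return total_time_h * instance_price_per_hour

# 0.8 s/query * 10,000 queries = 8,000 s ≈ 2.22 h; at $1.30/h that is ≈ $2.89.
print(f"${huggingface_cost_estimate(0.8, 10_000, 1.30):.2f}")
```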
Note:
- If an exact match for the instance type is found in HuggingFace's inference data, the cost is calculated directly from that price.
- If no exact match is found, the cost estimate uses the closest available instance type from the inference data (one possible matching heuristic is sketched below).
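The matching criterion for the fallback is not specified here; one plausible heuristic, sketched below, is to pick the most similarly named instance type from the price table (the instance names and prices are made up for illustration):

```python
import difflib

def instance_hourly_price(instance_type: str, price_table: dict) -> float:
    """Look up an instance's hourly USD price, falling back to the closest-named entry."""
    if instance_type in price_table:  # exact match in the inference data
        return price_table[instance_type]
    # Illustrative fallback: choose the most similarly named instance type.
    closest = difflib.get_close_matches(instance_type, list(price_table), n=1, cutoff=0.0)[0]
    return price_table[closest]

prices = {"nvidia-a10g": 1.30, "nvidia-t4": 0.60, "intel-xeon": 0.10}
print(instance_hourly_price("nvidia-a10", prices))  # no exact match; falls back to "nvidia-a10g" -> 1.3
```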
Nvidia NIM Inference Server
Currently, cost calculations are not supported for LLMs provided by Nvidia NIM Inference Server.
For time-based providers such as HuggingFace, these calculations estimate how much it would cost to run the model for the time needed to process all queries, given the selected instance type and the model's operational characteristics.