To empower developers and teams working with large language models (LLMs), OpenAI has introduced the Evals API, a new set of tools that brings programmatic evaluation capabilities to the forefront. While evaluations were previously accessible via the OpenAI dashboard, the new API allows developers to define tests, automate evaluation runs, and iterate on prompts directly from their workflows.

Why the Evals API is important

Evaluating LLM performance is often a manual and time-consuming process, especially for teams scaling applications across a variety of domains. With the Evals API, OpenAI provides a systematic approach to:

  • Evaluating model performance on custom test cases
  • Measuring improvements across iterations
  • Automating quality assurance in development pipelines

Every developer can now treat evaluations as a first-class citizen of the development cycle, similar to how unit testing is handled in traditional software engineering.

Core features of the Evals API

  1. Custom evaluation definitions: Developers can write their own evaluation logic by extending base classes.
  2. Test data integration: Seamlessly plug in evaluation datasets to test specific scenarios (see the sample below).
  3. Parameter configuration: Configure the model, temperature, maximum tokens, and other generation parameters.
  4. Automated runs: Trigger evaluations from code and retrieve results programmatically.
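
For example, a dataset for the evaluations in this post could be a JSONL file in which each line pairs an input prompt with an ideal answer; the field names below simply mirror the example["input"] and example["ideal"] accesses used in the code further down:

{"input": "What is 12 * 9?", "ideal": "108"}
{"input": "At sea level, what is the boiling point of water in degrees Celsius?", "ideal": "100"}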

The Evals API supports YAML-based configuration, which makes evaluations flexible and repeatable.
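
As a rough illustration (the exact schema is an assumption here, loosely following the registry format used by the open-source evals framework), an eval_config.yaml might look like this:

my_eval:
  id: my_eval.dev.v0
  description: Regression check for numeric predictions
  metrics: [accuracy]

my_eval.dev.v0:
  class: my_evals.regression:RegressionEval   # hypothetical module path for the class defined later in this post
  args:
    samples_jsonl: data/regression_samples.jsonl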

Get started with the Evals API

To use the Evals API, first install the OpenAI Python package (and, if you want the command-line runner shown below, the open-source evals framework; the package names below assume the standard PyPI releases):
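
pip install openai
pip install evals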

You can then run one of the built-in evaluations, e.g. factuality_qna:

oai evals registry:evaluation:factuality_qna \
  --completion_fns gpt-4 \
  --record_path eval_results.jsonl

Or define a custom evaluation in Python:

import openai.evals

class MyRegressionEval(openai.evals.Eval):
    def run(self):
        for example in self.get_examples():
            # Each example is a dict with an "input" prompt and an "ideal" answer.
            result = self.completion_fn(example["input"])
            score = self.compute_score(result, example["ideal"])
            yield self.make_result(result=result, score=score)

This example shows how to define custom evaluation logic: in this case, scoring each model response against an ideal answer.
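
The compute_score and make_result helpers are not defined in the snippet; make_result is assumed to come from the base class, while compute_score could be a small method you add to the subclass yourself. A minimal sketch, assuming exact-match scoring:

    def compute_score(self, result, ideal):
        # Hypothetical helper (not part of any published API): returns 1.0 for an
        # exact string match with the ideal answer, 0.0 otherwise.
        return 1.0 if str(result).strip() == str(ideal).strip() else 0.0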

Use case: Regression evaluation

OpenAI’s recipe example uses the API to build a regression evaluator. Here is a simplified version:

import openai.evals
from sklearn.metrics import mean_squared_error

class RegressionEval(openai.evals.Eval):
    def run(self):
        predictions, labels = [], []
        for example in self.get_examples():
            # The model is expected to return a number as plain text.
            response = self.completion_fn(example["input"])
            predictions.append(float(response.strip()))
            labels.append(float(example["ideal"]))
        # Lower MSE is better, so the score is its negation.
        mse = mean_squared_error(labels, predictions)
        yield self.make_result(result={"mse": mse}, score=-mse)

This lets developers baseline numeric predictions from a model and track changes over time.

Seamless workflow integration

Whether you are building a chatbot, a summarization engine, or a classification system, you can now trigger evaluations as part of your CI/CD pipeline. This ensures that each prompt or model update maintains or improves performance before going live.

openai.evals.run(
  eval_name="my_eval",
  completion_fn="gpt-4",
  eval_config={"path": "eval_config.yaml"}
)
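
To gate a pipeline on the outcome, one option is to fail the build when the aggregate score regresses below a threshold. This is only a sketch: it assumes run() returns an object exposing a numeric score attribute, as implied by the snippet above, and the 0.9 threshold is purely illustrative.

results = openai.evals.run(
    eval_name="my_eval",
    completion_fn="gpt-4",
    eval_config={"path": "eval_config.yaml"}
)

# Block the release if the evaluation score drops below the project threshold.
if results.score < 0.9:
    raise SystemExit("Evaluation score regressed below threshold; blocking deployment")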

In conclusion

The release of the Evals API marks a shift toward powerful, automated evaluation standards in LLM development. By providing the ability to configure, run, and analyze evaluations programmatically, OpenAI is enabling teams to build with confidence and continually improve the quality of their AI applications.

To explore further, check out the official OpenAI Evals documentation and recipe examples.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of Marktechpost, an artificial intelligence media platform that stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable to a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among readers.
