Why Benchmarking LLMs Is Important for AI Adoption
Businesses today have a wide range of language models to choose from. We are already seeing serious competition between OpenAI, Google, and Anthropic, as well as from open-source models by Meta and others.
One of the ongoing needs of the business world is to continuously evaluate the capability of the models being deployed for different tasks. While AgentBench is an aggregation of many tasks, it provides good insight into multiple areas of problem solving that individual businesses may utilise.
The Paper - Motivations and Results
It is not yet clear which model is better, and at which tasks. As models are constantly upgraded, the race to provide better models at better price points continues.
This paper is a nice introduction to how we might benchmark capability across models. The authors introduce 'AgentBench', which is effectively a way to test a multitude of tasks, each linked to a different environment.
Think of these environments as corresponding to kinds of tasks. For example, one environment might be 'web browsing' or 'web shopping', where the LLM needs to navigate a task online. This requires both understanding the task and using the appropriate tool, in this case a web browser.
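To make the environment idea concrete, here is a minimal sketch of the observe-act loop that agent benchmarks of this kind run. This is illustrative only, not the paper's actual harness: `WebShoppingEnv`, `fake_llm`, and the `click[...]` action format are invented stand-ins for a real environment and a real model call.

```python
# Illustrative agent-environment loop (not from the paper): the LLM reads an
# observation, emits an action, and the environment returns a reward.

def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; always clicks the first link."""
    return "click[0]"

class WebShoppingEnv:
    """Toy 'web shopping' environment: clicking link 0 completes the task."""
    def __init__(self):
        self.done = False

    def observe(self) -> str:
        # What the agent "sees" on the current page.
        return "Page: [0] Buy red shoes  [1] Buy blue shoes"

    def step(self, action: str) -> float:
        # Reward 1.0 only if the correct link was clicked; episode then ends.
        self.done = True
        return 1.0 if action == "click[0]" else 0.0

def run_episode(env, llm, max_turns: int = 5) -> float:
    """Run the agent until the episode ends; return the total reward."""
    total = 0.0
    for _ in range(max_turns):
        if env.done:
            break
        action = llm(env.observe())
        total += env.step(action)
    return total

print(run_episode(WebShoppingEnv(), fake_llm))  # 1.0
```

Real benchmarks differ in the details (multi-turn pages, partial credit, timeouts), but the shape is the same: the model's only interface to the tool is text in, text out.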
The agents are evaluated across the environments below:
AgentBench comprises 8 distinct environments, each tailored to evaluate the reasoning and decision-making capabilities of LLMs.
Each of these tasks / environments is evaluated, and the average capability is reported in Table 3 below. The headline numbers give some indication of the scale of capability. GPT-4 outperforms most models on individual tasks, with the exception of Web Shopping, where it is beaten by its predecessor, GPT-3.5-Turbo. This is indicative of some of the trade-offs these models make as their capabilities advance.
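As a rough sketch of how such a headline number comes about, the snippet below averages per-environment scores. The scores here are made-up placeholders, not the paper's results, and the paper's actual aggregation also normalizes each task's score before combining; this only shows the simple-average case.

```python
# Hedged illustration: combining per-environment scores into one headline
# number. Scores are invented; AgentBench's real aggregation normalizes
# per-task weights before averaging (see the paper for details).

scores = {
    "Operating System": 0.42,
    "Database": 0.32,
    "Knowledge Graph": 0.58,
    "Digital Card Game": 0.25,
    "Lateral Thinking Puzzles": 0.10,
    "House-Holding": 0.48,
    "Web Shopping": 0.61,
    "Web Browsing": 0.29,
}

# A plain mean over the 8 environments.
overall = sum(scores.values()) / len(scores)
print(f"Headline average: {overall:.3f}")  # Headline average: 0.381
```

A single average like this is convenient for ranking, but as the Web Shopping result shows, it can hide large per-task differences between models.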
In some cases the difference is small, whereas in others, like Lateral Thinking Puzzles (LTP), the gap between OpenAI's models and everyone else is significant.
For more details, check out the paper itself: