Stanford Introduces First AI Benchmark to Help Understand LLMs


In the world of artificial intelligence (AI) and machine learning (ML), 2022 has arguably been the year of foundation models, or AI models trained at large scale. From GPT-3 to DALL-E, from BLOOM to Imagen: another day, it seems, another large language model (LLM) or text-to-image model. But until now, there have been no AI benchmarks that provide a standardized way to assess these models, which have developed at a rapid pace over the past two years.


LLMs have particularly captivated the AI community, but according to the Center for Research on Foundation Models (CRFM) at the Stanford Institute for Human-Centered AI (HAI), the absence of an evaluation standard has compromised the community’s ability to understand these models, as well as their capabilities and risks.

To that end, the CRFM today announced the Holistic Evaluation of Language Models (HELM), which it says is the first benchmarking project aimed at improving the transparency of language models and the broader category of foundation models.


“Historically, benchmarks have pushed the community to come together around a set of problems that the research community believes are valuable,” said Percy Liang, associate professor of computer science at Stanford University and director of the CRFM. “One of the challenges with language models and foundation models in general is that they are multi-purpose, which makes benchmarking extremely difficult.”

HELM, he explained, takes a holistic approach to the problem by evaluating language models on three principles: recognizing the limitations of the models, multi-metric measurement, and direct comparison of models, with a goal of transparency. The core metrics HELM uses for model evaluation are accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency, which point to the key elements that make a model sufficient.
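
To make the idea of multi-metric measurement and direct model comparison concrete, here is a minimal Python sketch. It is not HELM's code or API: the model names, the toy predictions, and the helper functions are all hypothetical, and only two of the seven metrics (accuracy and a simple binned calibration error) are computed. The point is only that every model is scored by the same set of metrics on the same data, so the results can be read side by side.

```python
from collections import defaultdict

# Hypothetical toy data: (confidence, was_correct) pairs for each model.
# These values are made up for illustration; they are not HELM results.
TOY_PREDICTIONS = {
    "model_a": [(0.95, True), (0.95, True), (0.65, False), (0.65, True)],
    "model_b": [(0.95, True), (0.95, False), (0.95, False), (0.95, True)],
}


def accuracy(preds):
    """Fraction of predictions that were correct."""
    return sum(1 for _, ok in preds if ok) / len(preds)


def expected_calibration_error(preds, n_bins=10):
    """Binned gap between stated confidence and observed accuracy."""
    bins = defaultdict(list)
    for conf, ok in preds:
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins.values():
        mean_conf = sum(c for c, _ in members) / len(members)
        bin_acc = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's gap by the share of predictions it holds.
        ece += (len(members) / len(preds)) * abs(bin_acc - mean_conf)
    return ece


# Assumed metric registry: every model is scored by every metric.
METRICS = {
    "accuracy": accuracy,
    "calibration_error": expected_calibration_error,
}


def evaluate(predictions_by_model):
    """Return {model: {metric: score}} so models can be compared directly."""
    return {
        model: {name: fn(preds) for name, fn in METRICS.items()}
        for model, preds in predictions_by_model.items()
    }


if __name__ == "__main__":
    for model, scores in evaluate(TOY_PREDICTIONS).items():
        print(model, {k: round(v, 3) for k, v in scores.items()})
```

In this toy setup, a model can look acceptable on one metric and poor on another, which is the kind of trade-off a single-number benchmark would hide.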

Liang and his team evaluated 30 language models from 12 organizations: AI21 Labs, Anthropic, BigScience, Cohere, EleutherAI, Google, Meta, Microsoft, NVIDIA, OpenAI, Tsinghua University, and Yandex. Some of these models are open sourced to the public, some are available through commercial APIs, and some are private.

A ‘comprehensive approach’ to LLM assessment

“I applaud the initiative of the Stanford group,” Eric Horvitz, Microsoft’s chief scientific officer, told VentureBeat via email. “They have taken a holistic approach to evaluating language models by creating a taxonomy of scenarios and measuring multiple aspects of performance on them.”

Benchmarking neural language models is crucial for driving innovation and progress in both industry and academia, he added.

“Evaluation is essential for advancing the science and engineering of neural models, as well as for assessing their strengths and limitations,” he said. “We do rigorous benchmarking of our models at Microsoft and welcome benchmarking from the Stanford team within their holistic framework, which further enriches our insights.”

Stanford’s AI Benchmark Sets the Foundation for LLM Standards

Liang says HELM lays the foundation for a new set of industry standards and will be maintained and updated as an ongoing community effort.

“It is a living benchmark that is not going to be done; there are things that we are missing and that we must cover as a community,” he said. “This is really a dynamic process, so part of the challenge will be to maintain this benchmark over time.”

Many of the options and ideas in HELM can serve as the basis for further discussion and improvement, Horvitz agreed.

“In the future, I look forward to seeing a community-wide process to refine and expand the ideas and methods put forward by the Stanford team,” he said. “There is an opportunity to engage stakeholders from academia, industry, civil society and government, and extend the assessment to new arenas, such as interactive AI applications, where we seek to measure how well AI can empower people at work and in their daily lives.”

AI benchmarking project is a ‘dynamic’ process

Liang stressed that the benchmarking project is a “dynamic” process. “When I tell you about the results, tomorrow they could change because possibly new models will come out,” he said.

One of the main things the benchmark seeks to do, he added, is to capture the differences between the models. When this reporter suggested that it looked a bit like a Consumer Reports analysis of different car models, he said that “it’s actually a great analogy: you’re trying to provide consumers or users or the general public with information about the various products, in this case models.”

What is unique here, he added, is the pace of change. “Instead of being a year, it could be a month before things change,” he said, pointing to Galactica, Meta’s recently released language model for scientific papers, as an example.

“This is something that will add to our benchmark,” he said. “So it’s like Toyota coming out with a new model every month instead of every year.”

Another difference, of course, is the fact that LLMs are not well understood and have a “broad surface area of use cases,” unlike a car that you just drive. Also, the automotive industry has a variety of standards, something that the CRFM is trying to develop. “But we are still very early in this process,” Liang said.

HELM AI benchmark is a ‘herculean’ task

“I congratulate Percy and his team for taking on this herculean task,” Yoav Shoham, co-founder of AI21 Labs, told VentureBeat via email. “It is important that a neutral, scientifically inclined [organization] undertake it.”

The HELM benchmark should be perennial, he added, and updated regularly.

“This is for two reasons,” he said. “One of the challenges is that it is a fast-moving target and in many cases the models tested are out of date. For example, J1-Jumbo v1 is one year old and J1-Grande v1 is 6 months old, and both have newer versions that haven’t been ready for third-party testing.”

Also, deciding what to test models for is notoriously difficult, he added. “General considerations such as perplexity (which is defined objectively) or bias (which has a subjective component) are certainly relevant, but the set of criteria will also evolve, as we better understand what really matters in practice,” he said. “I hope that future versions of the document will refine and expand these measures.”

Shoham offered a parting note to Liang about the HELM benchmark: “Percy, no good deed goes unpunished,” he quipped. “You’re stuck with it.”
