September 17, 2024 By Antonia Davison 3 min read

On September 5, AI writing startup HyperWrite’s Reflection 70B, touted by CEO Matt Shumer as “the world’s top open-source model,” set the tech world abuzz. In his announcement on X, Shumer said it could hold its own against top closed-source models, adding that it “beats GPT-4o on every benchmark tested” and “clobbers Llama 3.1 405B. It’s not even close.”

These were big claims—and the LLM community immediately got to work independently verifying them. Drama ensued online in real time as third-party evals failed to replicate Shumer’s assertions. “I got ahead of myself,” he posted five days later. “I am sorry.” The model’s future now appears uncertain.

Reflection 70B and its aftermath got our Mixture of Experts thinking: What do we need to do as an industry to be able to parse signal from noise in these situations? And how do we agree on benchmarks for these models in the future? Here’s a quick snippet of their roundtable on the latest episode:

Marina Danilevsky, Senior Research Scientist: I was really happy to see so many other folks jump on right away and say ‘No, I’m going to try to reproduce the results, you need to upload your weights, what about this, what about that.’ That is science acting correctly. Good science is supposed to be reproducible.

Kate Soule, Senior Manager, Business Strategy: Right now, the norm is to train these behind a black box, put an API out there and say, ‘Hey we did this really cool thing, trust us it works.’ Can you imagine that happening in other products or industries? We need a lot more openness just in general in how these are trained.

Maya Murad, Product Manager: For me, it’s not useful to see a certain model’s performance on a benchmark because there could be a number of things that are happening. It could be that the model has seen this data before. It could be, even though it does good on [one thing], it might not generalize to my own use cases and what I care about. … [Benchmarks are] helpful but not a complete signal.

Watch the episode

Later we caught up with Danilevsky to get more of her thoughts on the gen AI community’s reproducibility crisis.

IBM: What should we do as an industry to prevent this from happening again?

MD: We should continue to champion transparency by making model weights available; to demand third-party verification and not take as gospel anything anyone says about their own model until it’s been verified by a third party (ideally not a direct competitor); and to provide support and infrastructure for the whole community to keep us honest, as they successfully did with Reflection 70B. All these efforts will pay off, as we will be less likely to be misled.

IBM: Are benchmarks the answer here?

MD: If we’re talking about benchmarks, the biggest point is to not, in general, mistake benchmarks for reality. A benchmark is meant to approximate a slice of reality, a hallmark of the scientific method where you try to control as many variables as possible so that you can really focus on the performance of a specific aspect. Performing well on a benchmark is just that—performing well on a benchmark.

IBM: How can we figure out a way to agree on benchmarks?

MD: We should never be in static agreement. Progress happens when scientists layer on top of each other’s work. We should agree that a particular benchmark tests something and mention the many things that it does not test, so that the next benchmarks address some of those holes.

IBM: What other best practices should the gen AI community adopt?

MD: We should strive to remain disciplined, thorough and data-driven. Apply constructive criticism to your own experiments and those of your peers. Demand repeatability and transparency. Have humility and acknowledge when the results do not support your intuition or hypothesis—this makes you a better and more trusted scientist rather than a worse one. Encourage the publication of negative results and failures, as many things are learned from things that do not work.

IBM: It seems like thorough scientific testing and business goals might sometimes be at odds with each other.

MD: The rapid cycle of business results and the slower march of research results are not well matched—it is not possible to have breakthroughs on a quarterly schedule! If this is set as the expectation you will inevitably end up with a bubble—you leave no space for results to go anywhere but up, and time to value to go anywhere but down.


Ready to start making evaluations of your own? Check out our latest tutorial and learn how to evaluate a RAG pipeline using Ragas in Python with watsonx.
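To give a sense of what that kind of evaluation looks like, here is a minimal sketch of scoring a RAG pipeline’s outputs with Ragas. It is not the tutorial itself: the imports and column names assume a ragas ~0.1.x-style API (which may differ in newer releases), the sample records are made up for illustration, and the default judge model must be configured separately (the linked tutorial pairs Ragas with watsonx).

```python
# Minimal Ragas sketch (assumes a ragas ~0.1.x-style API; names may differ in newer versions).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# Hypothetical records: each row pairs a question with the answer your RAG
# pipeline generated and the context passages it retrieved.
records = {
    "question": ["What did Reflection 70B claim to beat?"],
    "answer": ["It was claimed to beat GPT-4o on every benchmark tested."],
    "contexts": [[
        "Shumer said Reflection 70B beats GPT-4o on every benchmark tested."
    ]],
    "ground_truth": ["GPT-4o"],
}

dataset = Dataset.from_dict(records)

# Score the outputs. Ragas relies on an LLM judge under the hood, so a judge
# model (and its credentials) must be configured before this call will run.
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(results)
```

Each metric returns a score between 0 and 1, which makes it easier to compare pipeline changes over time instead of relying on a single headline benchmark number.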
