September 17, 2024 By Antonia Davison 3 min read

On September 5, AI writing startup HyperWrite’s Reflection 70B, touted by CEO Matt Shumer as “the world’s top open-source model,” set the tech world abuzz. In his announcement on X, Shumer said it could hold its own against top closed-source models, adding that it “beats GPT-4o on every benchmark tested” and “clobbers Llama 3.1 405B. It’s not even close.”

These were big claims—and the LLM community immediately got to work independently verifying them. Drama ensued online in real time as third-party evals failed to replicate Shumer’s assertions. “I got ahead of myself,” he posted five days later. “I am sorry.” The model’s future now appears uncertain.

Reflection 70B and its aftermath got our Mixture of Experts thinking: What do we need to do as an industry to be able to parse signal from noise in these situations? And how do we agree on benchmarks for these models in the future? Here’s a quick snippet of their roundtable on the latest episode:

Marina Danilevsky, Senior Research Scientist: I was really happy to see so many other folks jump on right away and say ‘No, I’m going to try to reproduce the results, you need to upload your weights, what about this, what about that.’ That is science acting correctly. Good science is supposed to be reproducible.

Kate Soule, Senior Manager, Business Strategy: Right now, the norm is to train these behind a black box, put an API out there and say, ‘Hey we did this really cool thing, trust us it works.’ Can you imagine that happening in other products or industries? We need a lot more openness just in general in how these are trained.

Maya Murad, Product Manager: For me, it’s not useful to see a certain model’s performance on a benchmark because there could be a number of things that are happening. It could be that the model has seen this data before. It could be, even though it does good on [one thing], it might not generalize to my own use cases and what I care about. … [Benchmarks are] helpful but not a complete signal.

Watch the episode

Later we caught up with Danilevsky to get more of her thoughts on the gen AI community’s reproducibility crisis.

IBM: What should we do as an industry to prevent this from happening again?

MD: We should continue to champion transparency by making model weights available; to demand third-party verification and not take as gospel anything anyone says about their own model until it’s been verified by a third party (ideally not a direct competitor); and to provide support and infrastructure for the whole community to keep us honest, as they successfully did with Reflection 70B. All these efforts will pay off, as we will be less likely to be misled.

IBM: Are benchmarks the answer here?

MD: If we’re talking about benchmarks, the biggest point is to not, in general, mistake benchmarks for reality. A benchmark is meant to approximate a slice of reality, a hallmark of the scientific method where you try to control as many variables as possible so that you can really focus on the performance of a specific aspect. Performing well on a benchmark is just that—performing well on a benchmark.

IBM: How can we figure out a way to agree on benchmarks?

MD: We should never be in static agreement. Progress happens when scientists layer on top of each other’s work. We should agree that a particular benchmark tests something and mention the many things that it does not test, so that the next benchmarks address some of those holes.

IBM: What other best practices should the gen AI community adopt?

MD: We should strive to remain disciplined, thorough and data-driven. Apply constructive criticism to your own experiments and those of your peers. Demand repeatability and transparency. Have humility and acknowledge when the results do not support your intuition or hypothesis—this makes you a better and more trusted scientist rather than a worse one. Encourage the publication of negative results and failures, as many things are learned from things that do not work.

IBM: It seems like thorough scientific testing and business goals might sometimes be at odds with each other.

MD: The rapid cycle of business results and the slower march of research results are not well matched—it is not possible to have breakthroughs on a quarterly schedule! If this is set as the expectation you will inevitably end up with a bubble—you leave no space for results to go anywhere but up, and time to value to go anywhere but down.


Ready to start making evaluations of your own? Check out our latest tutorial and learn how to evaluate a RAG pipeline using Ragas in Python with watsonx.
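To give a sense of what that kind of evaluation looks like, here is a minimal sketch of scoring a RAG pipeline’s outputs with Ragas. It is not the tutorial itself: the imports and column names assume a ragas ~0.1.x-style API (which may differ in newer releases), the sample records are made up for illustration, and the default judge model must be configured separately (the linked tutorial pairs Ragas with watsonx).

```python
# Minimal Ragas sketch (assumes a ragas ~0.1.x-style API; names may differ in newer versions).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# Hypothetical records: each row pairs a question with the answer your RAG
# pipeline generated and the context passages it retrieved.
records = {
    "question": ["What did Reflection 70B claim to beat?"],
    "answer": ["It was claimed to beat GPT-4o on every benchmark tested."],
    "contexts": [[
        "Shumer said Reflection 70B beats GPT-4o on every benchmark tested."
    ]],
    "ground_truth": ["GPT-4o"],
}

dataset = Dataset.from_dict(records)

# Score the outputs. Ragas relies on an LLM judge under the hood, so a judge
# model (and its credentials) must be configured before this call will run.
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(results)
```

Each metric returns a score between 0 and 1, which makes it easier to compare pipeline changes over time instead of relying on a single headline benchmark number.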
