Reflection 70B’s performance questioned, accused of ‘fraud’


It took just one weekend for the new, self-proclaimed king of open source AI models to have its crown tarnished.

Reflection 70B, a variant of Meta’s Llama 3.1 open source large language model (LLM) — or wait, was it a variant of the older Llama 3? — that had been trained and released by small New York startup HyperWrite (formerly OthersideAI) and boasted impressive, leading benchmarks on third-party tests, has now been aggressively questioned as other third-party evaluators have failed to reproduce some of said performance measures.

The model was triumphantly announced in a post on the social network X by HyperWrite AI co-founder and CEO Matt Shumer on Friday, September 6, 2024 as “the world’s top open-source model.”

I’m excited to announce Reflection 70B, the world’s top open-source model.

Trained using Reflection-Tuning, a technique developed to enable LLMs to fix their own mistakes.

405B coming next week – we expect it to be the best model in the world.

Built w/ @GlaiveAI.

Read on ⬇️: pic.twitter.com/kZPW1plJuo

— Matt Shumer (@mattshumer_) September 5, 2024

In a series of public X posts documenting some of Reflection 70B’s training process, and in a subsequent interview with VentureBeat over X Direct Messages, Shumer explained more about how the new LLM used “Reflection Tuning.” The previously documented technique, developed by researchers outside the company, has LLMs check the correctness of, or “reflect” on, their own generated responses before outputting them to users, improving accuracy on a number of tasks in writing, math, and other domains.

However, on Saturday, September 7, a day after the initial HyperWrite announcement and VentureBeat article were published, Artificial Analysis, an organization dedicated to “Independent analysis of AI models and hosting providers,” posted its own analysis on X. It stated that “our evaluation of Reflection Llama 3.1 70B’s MMLU score” (a reference to the commonly used Massive Multitask Language Understanding benchmark) “resulted in the same score as Llama 3 70B and significantly lower than Meta’s Llama 3.1 70B,” showing a major discrepancy with the results HyperWrite and Shumer originally posted.

Our evaluation of Reflection Llama 3.1 70B’s MMLU score resulted in the same score as Llama 3 70B and significantly lower than Meta’s Llama 3.1 70B.

A LocalLLaMA post (link below) also compared the diff of Llama 3.1 & Llama 3 weights to Reflection Llama 3.1 70B and concluded the… pic.twitter.com/hqvFp2TyCC

— Artificial Analysis (@ArtificialAnlys) September 7, 2024

On X that same day, Shumer stated that Reflection 70B’s weights (the learned settings of the open source model) had been “fucked up during the upload process” to Hugging Face, the third-party AI code hosting repository and company, and that this issue could have resulted in worse performance than HyperWrite’s “internal API” version.

We’ve figured out the issue. The reflection weights on Hugging Face are actually a mix of a few different models — something got fucked up during the upload process.

Will fix today. https://t.co/rKuOlTApRK

— Matt Shumer (@mattshumer_) September 7, 2024

On Sunday, September 8, 2024 at around 10 pm ET, Artificial Analysis posted on X that it had been “given access to a private API which we tested and saw impressive performance but not to the level of the initial claims. As this testing was performed on a private API, we were not able to independently verify exactly what we were testing.”

Reflection 70B update: Quick note on timeline and outstanding questions from our perspective

Timeline:

– We tested the initial Reflection 70B release and saw worse performance than Llama 3.1 70B.

– We were given access to a private API which we tested and saw impressive…

— Artificial Analysis (@ArtificialAnlys) September 9, 2024

The organization detailed two key questions that seriously call into question HyperWrite and Shumer’s initial performance claims, namely:

“We are not clear on why a version would be published which is not the version we tested via Reflection’s private API.
We are not clear why the model weights of the version we tested would not be released yet.

As soon as the weights are released on Hugging Face, we plan to re-test and compare to our evaluation of the private endpoint.”

All the while, users on various machine learning and AI subreddits have also called into question Reflection 70B’s stated performance and origins. Some have pointed out that, based on a model comparison posted on GitHub by a third party, Reflection 70B appears to be a Llama 3 variant rather than a Llama 3.1 variant, casting further doubt on Shumer and HyperWrite’s initial claims.

This has led at least one X user, Shin Megami Boson, to openly accuse Shumer of “fraud in the AI research community” as of 8:07 pm ET on Sunday, September 8, posting a long list of screenshots and other evidence.

A story about fraud in the AI research community:

On September 5th, Matt Shumer, CEO of OthersideAI, announces to the world that they’ve made a breakthrough, allowing them to train a mid-size model to top-tier levels of performance. This is huge. If it’s real.

It isn’t. pic.twitter.com/S0jWT8rDVb

— Shin Megami Boson (@shinboson) September 9, 2024

Others accuse the model of actually being a “wrapper,” or an application built atop Claude 3, the proprietary, closed-source model from rival Anthropic.

However, other X users have spoken up in defense of Shumer and Reflection 70B, and some have posted about the model’s impressive performance on their end.

I know @mattshumer_ and this does not mesh with my understanding of him. He knows his stuff and is super pragmatic and works around problems in impressive ways that most people get bogged down on for months. I would say maybe give the guy a little more time before you say stuff…

— Sasha krecinic (@SashaKrecinic) September 9, 2024

Regardless, the model’s rollout, lofty claims, and now criticism show how rapidly the AI hype cycle can come crashing down.

For now, the AI research community waits with bated breath for Shumer’s response and updated model weights on Hugging Face. VentureBeat has also reached out to Shumer for a direct response to these allegations of fraud and will update when we hear back.
