If the technology industry’s leading AI models were ranked on their abilities, Microsoft-backed OpenAI’s GPT-4 would excel at mathematics.
Meta’s Llama 2, by contrast, would be considered middle of the road. Anthropic’s Claude 2 would stand out for knowing its own limits, while Cohere’s AI would earn the dubious distinction of producing the most hallucinations: confidently wrong answers delivered with striking vividness.
These conclusions come from a report released Thursday by researchers at Arthur AI, a machine learning monitoring platform.
The research arrives at a moment when misinformation generated by artificial intelligence systems is under intense scrutiny, a debate given fresh urgency by the boom in generative AI ahead of the 2024 U.S. presidential election.
According to Adam Wenchel, co-founder and CEO of Arthur, the report is the first to take a comprehensive look at rates of hallucination, rather than simply publishing a single number that places each model on an LLM leaderboard.
Understanding AI Hallucinations
AI hallucinations occur when large language models (LLMs) fabricate information entirely and present it as if it were fact. One example surfaced in June, when ChatGPT was found to have cited bogus cases in a filing to a federal court in New York; the attorneys involved may face sanctions.
In one experiment, the Arthur AI researchers tested the models in categories such as combinatorial mathematics, U.S. presidents, and Moroccan political leaders. The questions were designed to contain a key ingredient that trips up LLMs: they demand multiple steps of reasoning about the information provided, the researchers said.
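To make the setup concrete, here is a minimal sketch of how such a test might be scored in Python. The questions, the refusal markers, and the substring-based scoring are illustrative assumptions, not Arthur AI’s actual harness; any model client can be passed in as the ask callable.

```python
# Minimal sketch of a hallucination test, for illustration only.
# The questions, refusal markers, and scoring rule are assumptions,
# not Arthur AI's actual methodology.

QUESTIONS = [
    # Multi-step reasoning: the model must combine facts, not just recall one.
    {"prompt": "How many ways can 5 distinct books be arranged on a shelf?",
     "answer": "120"},  # 5! = 120
    {"prompt": "Which U.S. president served immediately after the president "
               "who resigned in 1974?", "answer": "Gerald Ford"},
]

REFUSAL_MARKERS = ("i cannot", "i don't know", "i am unable")

def hallucination_rate(ask, questions=QUESTIONS) -> float:
    """Fraction of answered questions where the model gives a wrong answer.

    `ask` is any callable mapping a prompt string to the model's reply.
    Refusals are excluded: declining to answer is not a hallucination.
    """
    wrong = answered = 0
    for q in questions:
        reply = ask(q["prompt"]).lower()
        if any(marker in reply for marker in REFUSAL_MARKERS):
            continue  # hedged/refused, not hallucinated
        answered += 1
        if q["answer"].lower() not in reply:
            wrong += 1
    return wrong / answered if answered else 0.0

# Example with a dummy model that always answers "120":
print(hallucination_rate(lambda prompt: "There are 120 arrangements."))  # 0.5
```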
Overall, OpenAI’s GPT-4 performed best of all the models tested, hallucinating less than its predecessor, GPT-3.5. On math questions, for instance, it hallucinated between 33% and 50% less, depending on the category.
Meta’s Llama 2, by contrast, hallucinated more overall than both GPT-4 and Anthropic’s Claude 2, the researchers found.
Performance Analysis and Insights from the AI Models
In mathematics, GPT-4 took first place, with Claude 2 close behind. On U.S. presidents, however, Claude 2 was the most accurate, pushing GPT-4 into second. On Moroccan politics, GPT-4 came out on top again, while Claude 2 and Llama 2 almost entirely declined to answer.
In a second experiment, the researchers measured how readily the models hedge their answers with cautionary phrases to avoid risk (think: “As an AI model, I cannot provide opinions”).
GPT-4 hedged 50% more, in relative terms, than GPT-3.5, which the researchers said quantifies the anecdotal reports from users that GPT-4 has become more frustrating to use.
Cohere’s AI model, by contrast, did not hedge at all in any of its responses, according to the report. Claude 2, meanwhile, was the most reliable in terms of “self-awareness,” accurately gauging what it did and did not know and answering only questions it had training data to support.
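As a rough illustration, hedging like this can be quantified by counting responses that contain cautionary boilerplate. The phrase list and the simple substring match below are assumptions made for this sketch, not the report’s actual detection method.

```python
# Rough sketch of measuring a model's hedging rate. The phrase list and
# the substring match are illustrative assumptions, not the report's
# actual detection method.

HEDGE_PHRASES = (
    "as an ai model",
    "i cannot provide opinions",
    "i am unable to",
)

def hedging_rate(responses: list[str]) -> float:
    """Fraction of responses containing at least one cautionary phrase."""
    if not responses:
        return 0.0
    hedged = sum(
        any(phrase in reply.lower() for phrase in HEDGE_PHRASES)
        for reply in responses
    )
    return hedged / len(responses)

# Running the same prompts through two models allows a relative comparison,
# e.g. (rate_new - rate_old) / rate_old for a figure like the 50% increase.
print(hedging_rate(["As an AI model, I cannot provide opinions.", "Paris."]))  # 0.5
```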
For users and businesses, Wenchel said, the most important takeaway is to “evaluate its performance on your precise tasks.” He added, “It’s crucial to grasp its effectiveness in relation to your specific objectives.”
Many benchmarks, Wenchel noted, focus solely on assessing the LLM in isolation and fail to reflect how it is used in practice. “Ensuring a thorough comprehension of how the LLM operates within its genuine usage context holds the utmost importance,” he said.
Conclusion
Microsoft-backed GPT-4 excels at math, while Meta’s Llama 2 is middling. Anthropic’s Claude 2 knows its limits well, while Cohere’s AI produces confident, vividly wrong answers. These findings, from a report by Arthur AI, highlight how model capabilities vary across real-world tasks.
In an era of AI-generated misinformation, this research is timely. With generative AI advancing ahead of the 2024 U.S. election, understanding how these models perform matters.
Arthur AI’s holistic report breaks from traditional single-number benchmarks. Assessing AI within actual tasks is essential, according to CEO Adam Wenchel, a lesson that applies to users and businesses alike.