A recent report from Arthur AI, a machine learning monitoring platform, compared some of the leading AI models in the tech industry and handed out superlatives:
OpenAI’s GPT-4: Best at math.
Meta’s Llama 2: Average performer.
Anthropic’s Claude 2: Top in recognizing its own limitations.
Cohere AI: Most frequent hallucinator and often confidently incorrect.
Amid rising concerns over AI-driven misinformation, especially ahead of the 2024 U.S. presidential election, the report aims to shed light on the "hallucination rates" of these models. Rather than merely ranking them, the study digs into where each model is most likely to provide erroneous information. As Adam Wenchel, CEO of Arthur, puts it, the goal is to understand performance beyond a leaderboard.
AI hallucinations, the term for when AI systems fabricate information or present fiction as fact, have recently caused controversy. Notably, ChatGPT was found to have cited fabricated court cases in a legal filing, exposing the attorneys involved to potential sanctions.
Arthur AI's study challenged the models with complex questions that require multiple steps of reasoning. When tested on math, U.S. presidents, and Moroccan politics, GPT-4 outperformed its competitors and improved on its predecessor, GPT-3.5, cutting its hallucination rate by 33% to 50%, depending on the subject.
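To make the headline numbers concrete, here is a minimal sketch of how a hallucination rate and a 33%–50% relative reduction might be computed. The per-answer labels below are hypothetical illustrations, not Arthur's data, and the scoring is an assumption; the company has not published its evaluation code.

```python
# Minimal sketch: a hallucination rate and the relative reduction between
# two model versions. The labels are hypothetical, for illustration only.

def hallucination_rate(labels: list[bool]) -> float:
    """Fraction of answers flagged as hallucinations (True = hallucinated)."""
    return sum(labels) / len(labels)

def relative_reduction(old: float, new: float) -> float:
    """Relative drop in rate: 0.5 means the new model hallucinates 50% less."""
    return (old - new) / old

# Hypothetical flags for the same question set, one list per model version.
gpt35_flags = [True, True, False, True, False, False, True, False]   # rate 0.50
gpt4_flags  = [True, False, False, False, False, False, True, False] # rate 0.25

reduction = relative_reduction(hallucination_rate(gpt35_flags),
                               hallucination_rate(gpt4_flags))
print(f"GPT-4 hallucinates {reduction:.0%} less than GPT-3.5")  # 50% less
```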
However, in a twist, Claude 2 surpassed GPT-4 in questions about U.S. presidents, taking the top spot for accuracy. When it came to Moroccan politics, GPT-4 reclaimed the lead, while both Claude 2 and Llama 2 predominantly abstained from answering.
Another facet of the study examined whether the models would hedge their answers with cautionary phrases signaling their non-human nature, such as "As an AI model, I cannot provide opinions."
Interestingly, GPT-4 showed a 50% spike in such hedging compared to GPT-3.5, making it feel less user-friendly to some. Cohere AI's model, by contrast, did not hedge at all in its responses. Claude 2 proved the most self-aware, answering only questions it had the training data to support.
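As a rough illustration of how such hedging could be measured, the sketch below flags cautionary phrases in model responses and computes the relative increase between two model versions. The phrase list and sample responses are assumptions chosen so the arithmetic mirrors the reported 50% spike; the report's actual methodology is not public.

```python
import re

# Hypothetical cautionary phrases; the report's real phrase list is not public.
HEDGE_PATTERNS = [
    r"as an ai (language )?model",
    r"i cannot provide opinions",
    r"i am not able to",
]

def is_hedged(response: str) -> bool:
    """True if the response contains any cautionary 'non-human' phrase."""
    text = response.lower()
    return any(re.search(pattern, text) for pattern in HEDGE_PATTERNS)

def hedge_rate(responses: list[str]) -> float:
    """Fraction of responses that hedge."""
    return sum(is_hedged(r) for r in responses) / len(responses)

# Hypothetical responses; numbers chosen to mirror the reported 50% spike.
gpt35_responses = [
    "The answer is 42.",
    "As an AI model, I cannot provide opinions.",
]
gpt4_responses = [
    "As an AI model, I cannot provide opinions.",
    "As an AI language model, I am not able to verify that.",
    "I am not able to answer that reliably.",
    "Paris.",
]

old, new = hedge_rate(gpt35_responses), hedge_rate(gpt4_responses)  # 0.50, 0.75
print(f"Relative increase in hedging: {(new - old) / old:.0%}")  # 50%
```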
Wenchel emphasized the need for users and companies to test these models based on their specific requirements, underscoring the importance of understanding AI performance in real-world applications rather than just benchmarks.