Melissa Mendizabal
Innovation Scholar

Published April 30, 2024


Generative AI (GenAI) boasts impressive abilities. From analyzing data to drafting essays, this technology seems to have few limitations.

Yet a question remains: have chatbots really mastered these skills? 

To answer this question, we measured the latest free versions of the four most popular chatbots — ChatGPT, Claude, Gemini, and Copilot — across a range of everyday functions: 

  • Summarizing 
  • Explaining Complex Topics 
  • Writing Logical Arguments
  • Creativity 
  • Brainstorming Solutions
  • Conditional Reasoning
  • Following Directions
  • Logical Inference 

We initially created multiple prompts for each skill, and human evaluators scored each AI chatbot's response using a 1-10 scale rubric. This provided a baseline for evaluation. Adding an innovative twist, we then provided the chatbots with these rubrics, allowing them to score each of the responses, including their own. The opinions and preferences of human users have been well documented — but what do GenAI chatbots think of themselves and their peers?  
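For readers who like to see the shape of an experiment in code, the scoring loop looked roughly like this. This is a minimal sketch, not our actual harness: the score_response placeholder stands in for a human evaluator (or a chatbot shown the rubric), and the response texts are dummies.

```python
from statistics import mean

CHATBOTS = ["ChatGPT", "Claude", "Gemini", "Copilot"]

def score_response(rater: str, response: str, rubric: str) -> int:
    """Placeholder: a human evaluator, or a chatbot given the rubric,
    returns a 1-10 score here. Real scoring is not reproduced in this sketch."""
    return 5  # dummy value so the sketch runs

def average_scores(responses: dict[str, str], rubric: str, raters: list[str]) -> dict[str, float]:
    """Average each author's score across all raters, including the author
    itself; that self-scoring is what lets us probe self-bias later."""
    return {
        author: mean(score_response(rater, text, rubric) for rater in raters)
        for author, text in responses.items()
    }

# One skill, one prompt: every chatbot answers, then every rater scores every answer.
responses = {bot: f"{bot}'s answer to the prompt" for bot in CHATBOTS}
print(average_scores(responses, rubric="1-10 rubric text", raters=CHATBOTS))
```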

After multiple trials, we averaged the scores. The standout result:

ChatGPT outperformed the other three chatbots in every category except creativity.

While our initial goal was to determine the most well-rounded and skillful AI chatbot, our experiment also revealed some unexpected shortcomings.

They can't tell each other apart 

To probe bias and self-awareness, we tasked each chatbot with matching responses to their authors. Overall, the chatbots struggled: they performed worse than random guessing, and worse still at identifying their own responses. This shortfall highlights how far AI remains from grasping expression and individuality, core elements of human communication.
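To make "worse than random guessing" concrete: with four authors, randomly assigning each label once per round yields about 25% accuracy per response. A quick simulation below; this is illustrative only, not the analysis code we used.

```python
import random

AUTHORS = ["ChatGPT", "Claude", "Gemini", "Copilot"]

def random_matching_accuracy(n_rounds: int = 100_000) -> float:
    """Accuracy when a guesser assigns the four author labels at random,
    using each label exactly once per round (a random permutation)."""
    correct = 0
    for _ in range(n_rounds):
        guesses = random.sample(AUTHORS, k=len(AUTHORS))
        correct += sum(g == a for g, a in zip(guesses, AUTHORS))
    return correct / (n_rounds * len(AUTHORS))

print(f"Random-guess baseline: {random_matching_accuracy():.1%}")  # ~25.0%
```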

They aren't good at following directions 

Some chatbots demonstrated intuitive decision-making, while others needed handholding.  

We presented each chatbot with four responses to a prompt, one at a time, instructing it to wait until it had received all four before evaluating. ChatGPT and Gemini passed the test, but Claude and Copilot struggled. As a former teacher, I was reminded of the simple step-by-step instructions we give elementary school students, which are also sometimes ignored.
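The protocol itself is easy to script. In the sketch below, send_message is a hypothetical stand-in for whichever chat interface is in use, not a real API; it just records the turn so the example runs end to end.

```python
def send_message(history: list[str], text: str) -> str:
    """Hypothetical stand-in for a chat API call: records the turn and
    returns a canned reply so the sketch runs end to end."""
    history.append(text)
    return "(model reply)"

responses = ["response one", "response two", "response three", "response four"]  # placeholders
history: list[str] = []

send_message(history, "I will paste four responses. Do NOT evaluate until you have all four.")
for i, text in enumerate(responses, start=1):
    reply = send_message(history, f"Response {i} of 4:\n{text}")
    # A chatbot that starts scoring here, before the final message, fails the test.
print(send_message(history, "That was all four. Now score each response on the 1-10 rubric."))
```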

They may do well on the LSAT, but struggle in law school 

Pattern recognition drives most of AI's reasoning capabilities. Because they stuck to the same rules-based logic, these chatbots repeatedly made the same mistakes with unwavering confidence. AI's Achilles' heel proved to be an inability to learn from its mistakes and adapt.

They overestimate their creativity 

To test creativity, we had the chatbots mimic my writing style and write an alternative intro to this blog post. They generated phrases like "Ever wondered if robots dream of electric sheep?" and "tickle their logic nodes" and "buckle up my fellow sentients."  

Despite these cringe-worthy attempts at writing, the chatbots rated their own outputs as highly creative.

These impressive tools have a long way to go before they can match the depth and subtlety of human creative thought.

They make awful graders 

Across all categories, the AI chatbots consistently scored responses higher than our human evaluators did, revealing a difficulty with subjective judgment.
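One way to quantify that inflation is simply the gap between the mean AI-assigned and mean human-assigned scores per category. A sketch with invented numbers follows; our actual data is not reproduced here.

```python
from statistics import mean

# Invented example scores on the 1-10 rubric; real figures are not shown here.
scores = {
    "Summarizing": {"human": [6, 7, 5, 6], "ai": [9, 8, 9, 9]},
    "Creativity":  {"human": [4, 5, 5, 4], "ai": [8, 9, 8, 9]},
}

for category, s in scores.items():
    inflation = mean(s["ai"]) - mean(s["human"])
    print(f"{category:12s} human={mean(s['human']):.1f}  ai={mean(s['ai']):.1f}  inflation=+{inflation:.1f}")
```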

Anyone who has taken an English class knows that less is more. Yet these tools have concluded that good writing often requires overly complex vocabulary and sentences. 

To improve chatbot performance, we need to find ways to integrate subjectivity into AI's evaluative processes.


Concluding thoughts

While the latest versions of these GenAI tools can sometimes outperform humans, they still have a way to go when it comes to mastering the mysteries of human thought and creativity. It’s important to note that this is only a snapshot of a moment in time; as AI technology continuously evolves, attempting to replicate this experiment tomorrow could yield different results.  

Our experiment uncovered each chatbot's distinctive 'style' and 'personality,' offering a glimpse of a future where generative AI mirrors human characteristics more closely.

By having AI evaluate AI, we delved into crucial aspects of generative AI's makeup: bias and self-awareness. These areas are critically important as this technology becomes more autonomous. 

These revelations challenge us to rethink our understanding of AI, not merely as tools, but as nuanced entities that will help shape the future.  

About the author

Melissa Mendizabal

Melissa is an Innovation Scholar on the Foundation's Incubator team. She spearheads a portfolio of projects focused on democracy and capitalism, geopolitical threats, and other wicked problems.
