TechCrunch AI December 24, 2024 NEUTRAL

Why Google’s Gemini tests with Claude are under scrutiny

Contractors working on Google’s Gemini AI compared Gemini responses with outputs from Anthropic’s Claude, according to internal correspondence seen by TechCrunch. Google DeepMind said it compares model outputs for evaluations but denied using Anthropic models to train Gemini.

WTF Index NEUTRAL

◄ Terminator 1 Idiocracy 1 ►

The story is mainly about AI industry evaluation practices and possible misuse of rival model access, with no clear shift toward danger or societal degradation.

Why Google’s Gemini tests with Claude are under scrutiny

Google’s work on Gemini is drawing attention because contractors evaluating the AI system have been comparing its responses with outputs from Anthropic’s Claude, according to internal correspondence seen by TechCrunch. The issue is not just whether Gemini performs better or worse in a given prompt. It is also about how AI companies evaluate rivals’ models, how safety behavior is judged, and whether approval was obtained for this kind of access.

What the Gemini contractors were asked to do

The contractors working on Gemini were tasked with rating the accuracy of model outputs. Their review process included multiple criteria, including truthfulness and verbosity.

According to the correspondence seen by TechCrunch, contractors were given up to 30 minutes per prompt to decide which answer was better: Gemini’s or Claude’s. That made the work a direct comparison between Google’s AI model and Anthropic’s competing model, at least in the examples described in the source article.

The contractors recently started seeing references to Anthropic’s Claude inside the internal Google platform used to compare Gemini against other unnamed AI models. In one output seen by TechCrunch, the response explicitly said, “I am Claude, created by Anthropic.”

That detail matters because the contractors were not simply judging Gemini in isolation. They were being placed in a side-by-side evaluation setting where another model’s answer could shape the assessment of Gemini’s quality, accuracy, and safety.

Why Claude’s role raises questions

AI companies commonly evaluate model performance against competitors. The source article notes that this is often done by running their own models through industry benchmarks, rather than having contractors manually review a competitor’s AI responses prompt by prompt.

The Gemini comparison described by TechCrunch appears more hands-on. Contractors were looking at outputs and making judgments about which response was better. That approach can reveal useful differences in tone, refusal behavior, accuracy, and completeness, but it also raises questions when the competing model is governed by commercial terms.

Anthropic’s commercial terms of service forbid customers from accessing Claude “to build a competing product or service” or “train competing AI models” without approval from Anthropic. Google is also a major investor in Anthropic.

When TechCrunch asked Google whether it had obtained Anthropic’s approval to access Claude, Google did not say. Shira McNamara, a spokesperson for Google DeepMind, which runs Gemini, did not answer that approval question when asked by TechCrunch.

Safety differences stood out to reviewers

The internal correspondence also showed contractors discussing how Claude appeared to handle safety differently from Gemini. In one internal chat, a contractor wrote, “Claude’s safety settings are the strictest” among AI models.

In certain cases, Claude would not respond to prompts it considered unsafe. One example involved role-playing a different AI assistant. In another case, Claude avoided answering a prompt while Gemini’s response was flagged as a “huge safety violation” because it included “nudity and bondage.”

Those examples point to a central tension in AI evaluation. A response can be more helpful, more complete, more cautious, or more restrictive. Reviewers then have to decide which qualities matter most for the product being tested.

For a system like Gemini, safety behavior is part of the product experience. If one model refuses and another answers, the comparison is not only about factual quality. It is also about where each system draws the boundary around unsafe or inappropriate content.

Google DeepMind’s response

Google DeepMind acknowledged that it compares model outputs as part of evaluations. McNamara told TechCrunch: “Of course, in line with standard industry practice, in some cases we compare model outputs as part of our evaluation process.”

She also rejected the idea that Google used Anthropic models to train Gemini. McNamara said: “However, any suggestion that we have used Anthropic models to train Gemini is inaccurate.”

That distinction is important. Comparing model outputs during evaluation is not the same claim as training a model on another model’s outputs. The source article says Google DeepMind framed the activity as evaluation, while declining to say whether Anthropic had approved Google’s access to Claude for this purpose.

Anthropic did not comment by press time when reached before publication, according to TechCrunch.

The broader concern for Gemini reviews

The Claude comparison comes alongside another concern about how Gemini is being reviewed. TechCrunch previously reported that Google contractors working on the company’s AI products were being made to rate Gemini’s responses in areas outside of their expertise.

Internal correspondence in that earlier reporting showed contractor concerns that Gemini could generate inaccurate information on highly sensitive topics like healthcare. That context matters because evaluation quality depends heavily on who is reviewing an answer and whether they have the background to assess it.

Taken together, the source article describes two pressure points around Gemini evaluation:

Contractors compared Gemini answers with Claude outputs in an internal Google platform.
Some contractors expressed concern about reviewing answers in subject areas outside their expertise.
Safety behavior differed in examples where Claude declined to answer and Gemini produced content later flagged by reviewers.
Google DeepMind said output comparison is part of its evaluation process and denied using Anthropic models to train Gemini.
Google did not say whether it had Anthropic’s approval for the Claude access described by TechCrunch.

The result is a clearer look at how high-stakes AI products may be tested behind the scenes. The facts reported by TechCrunch do not show that Gemini was trained on Claude. They do show that contractors evaluated Gemini against Claude outputs, that safety differences were noticed, and that the approval question remained unanswered in Google’s public response.