TechCrunch AI December 19, 2024 IDIOCRACY

Google’s Gemini raters face a new accuracy dilemma

A new guideline from Google tells Gemini contractors not to skip prompts that require specialized domain knowledge. The change has raised concerns because contractors may now rate technical answers on sensitive topics, including healthcare, even when they lack relevant expertise.

The story centers on weaker truth and quality controls for Gemini because non-expert contractors may rate specialized answers they cannot fully assess.

Google’s Gemini is being improved with help from contractors who review chatbot responses for qualities such as truthfulness. A new internal guideline described by TechCrunch changes how those reviewers handle prompts outside their areas of expertise, and the shift has triggered concern about accuracy on sensitive subjects.

The issue centers on contractors working with GlobalLogic, an outsourcing firm owned by Hitachi. Until recently, those workers could skip tasks when a prompt required knowledge they did not have. Under the newer instruction, they are told not to skip prompts merely because the topic demands specialized domain knowledge.

What changed for Gemini contractors

Generative AI systems can appear seamless to users, but their development depends on people who evaluate chatbot outputs. At companies such as Google and OpenAI, prompt engineers and analysts review AI-generated answers and rate them so the systems can be improved.

For Gemini, contractors working through GlobalLogic are routinely asked to judge AI-written responses using factors that include “truthfulness.” That work can involve ordinary prompts, but it can also involve technical or highly specialized subjects.

Previously, contractors had a clear way to avoid rating material far beyond their expertise. The older guideline said: “If you do not have critical expertise (e.g. coding, math) to rate this prompt, please skip this task.”

GlobalLogic then announced a change from Google last week. The newer guideline says: “You should not skip prompts that require specialized domain knowledge.” Contractors are instead instructed to “rate the parts of the prompt you understand” and add a note explaining that they do not have domain knowledge.

Why the rule is raising accuracy concerns

The concern is straightforward: if a contractor without scientific or technical training rates an answer on a specialized topic, that rating may not fully reflect whether the answer is correct. TechCrunch reported that contractors can be asked to evaluate highly technical AI responses about subjects such as rare diseases, even when they have no background in those areas.

Healthcare is one of the sensitive areas mentioned in the source article. A prompt about a niche cardiology question is one example of a task that a contractor without a scientific background could previously skip. Under the new approach, the contractor would not skip solely because the prompt requires that kind of specialized knowledge.

That does not mean every rating would be treated as expert review. The new instruction tells contractors to rate only the portions they understand and to note the limits of their expertise. Still, the change has led some workers to question whether the process is moving away from the goal of routing difficult prompts to people better equipped to assess them.

“I thought the point of skipping was to increase accuracy by giving it to someone better?”

That comment, from internal correspondence seen by TechCrunch, captures the central tension. A system that depends on human feedback needs enough reviewers to handle a wide range of tasks, but specialized questions can require specialized judgment.

When contractors can still skip tasks

The new guideline does not remove skipping entirely. According to TechCrunch, contractors can still skip prompts in two situations.

They can skip when the task is “completely missing information,” such as the full prompt or the response.
They can skip when the prompt contains harmful content that requires special consent forms to evaluate.

Those exceptions are narrower than the previous rule. Lack of domain knowledge by itself is no longer listed as a reason to skip. That is the key operational change affecting Gemini contractors who rate AI responses.
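
The difference between the two rules can be sketched as a small decision function. This is only an illustration of the guideline text quoted above, not a description of any tooling used by Google or GlobalLogic: the `Task` fields and function names are hypothetical, and the old-rule function models only the one sentence quoted from the older guideline.

```python
from dataclasses import dataclass


@dataclass
class Task:
    # Hypothetical fields; the article does not describe the real rating tool.
    has_prompt: bool = True
    has_response: bool = True
    needs_consent_form: bool = False  # harmful content requiring special consent
    rater_has_domain_expertise: bool = True


def may_skip_old(task: Task) -> bool:
    """Older guideline as quoted: skip when critical expertise is lacking."""
    return not task.rater_has_domain_expertise


def may_skip_new(task: Task) -> bool:
    """Newer guideline: only the two reported exceptions allow a skip."""
    missing_info = not (task.has_prompt and task.has_response)
    return missing_info or task.needs_consent_form


def needs_expertise_note(task: Task) -> bool:
    """Newer guideline: rate the parts you understand and note the gap."""
    return not may_skip_new(task) and not task.rater_has_domain_expertise


# A niche cardiology prompt assigned to a rater without a scientific background.
cardiology = Task(rater_has_domain_expertise=False)
print(may_skip_old(cardiology), may_skip_new(cardiology), needs_expertise_note(cardiology))
```

In this sketch the same task flips from skippable to must-rate-with-a-note, which is the operational change the article describes.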

The difference matters because contractor ratings are part of how companies assess and improve generative AI products. If the task is about format or style, a non-specialist may still be able to give useful feedback. If the task is about whether a technical answer is true, the limits of the reviewer’s knowledge become more important.

Google’s response

Google did not respond to TechCrunch’s requests for comment by press time. After the story was published, Google told TechCrunch that it was “constantly working to improve factual accuracy in Gemini.” TechCrunch also reported that Google did not dispute its reporting.

Google spokesperson Shira McNamara said: “Raters perform a wide range of tasks across many different Google products and platforms.” She added: “They do not solely review answers for content, they also provide valuable feedback on style, format, and other factors. The ratings they provide do not directly impact our algorithms, but when taken in aggregate, are a helpful data point to help us measure how well our systems are working.”

That response frames contractor ratings as one input among many, rather than a direct switch that changes Gemini’s algorithms. It also broadens the role of raters beyond checking whether a response is factually correct. Even so, the source of concern remains: some tasks involve subject matter where a reviewer’s expertise can matter a great deal.

The bigger question for AI quality control

The Gemini guideline highlights a larger challenge for generative AI companies. Chatbots answer questions across many domains, from everyday topics to highly technical fields. The people asked to judge those answers may not always have the expertise the prompt requires.

For users, the practical issue is trust. If a chatbot produces an answer on a sensitive topic, the quality of the review process behind that system becomes part of the product’s credibility. For contractors, the change places more responsibility on reviewers to separate what they can assess from what they cannot.

The new rule does not say contractors should pretend to have expertise they lack. It tells them to rate the parts they understand and disclose the gap. Whether that approach improves Gemini or increases the risk of weak ratings on technical answers is the question contractors are now raising.