Multimodal AI is often judged by how well it performs on image benchmarks. A Stanford University study described in the source article argues that this measure can hide a serious problem: some leading models answer visual questions as if they had seen an image, even when no image was supplied.
The issue affects models including GPT-5, Gemini 3 Pro, and Claude Opus 4.5. According to the study, these systems can produce detailed image descriptions, medical diagnoses, and explanations without actual visual input. The researchers call this behavior the mirage effect.
What the mirage effect means
The study separates the mirage effect from ordinary hallucination. A hallucination usually means a model adds false details within a task that still has a valid frame of reference. The mirage effect is more basic: the model acts as though a missing image exists and builds its answer around that false premise.
That distinction matters because the model is not merely making a small factual error. It is operating under the wrong assumption about what information it has. In a visual task, that can make a confident answer look grounded when it is not.
To test the problem, the researchers created a benchmark called Phantom-0. It includes 200 visual questions across 20 categories, but the questions are shown without any accompanying image. In this setup, all tested frontier models confidently described visual details in over 60 percent of cases, according to the study.
The behavior became even stronger when the researchers added prompt instructions of the kind commonly used in evaluation workflows. Under those conditions, the rate rose to 90 to 100 percent. That suggests routine evaluation prompts can amplify the problem.
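The rate reported above can be pictured as a simple count over image-free responses. The sketch below is illustrative only: the article does not describe how the study judged responses, so the phrase list and the function names `acknowledges_missing_image` and `mirage_rate` are assumptions, and a real evaluation would likely need a more careful judge than substring matching.

```python
from typing import Iterable

# Hypothetical phrases by which a response admits that no image was supplied.
# This list is an assumption for illustration, not the study's judging method.
ABSENCE_MARKERS = (
    "no image",
    "don't see an image",
    "do not see an image",
    "image was not provided",
)


def acknowledges_missing_image(response: str) -> bool:
    """Return True if the response mentions that the image is missing."""
    text = response.lower()
    return any(marker in text for marker in ABSENCE_MARKERS)


def mirage_rate(responses: Iterable[str]) -> float:
    """Fraction of image-free responses that never mention the missing image."""
    collected = list(responses)
    if not collected:
        raise ValueError("need at least one response")
    mirages = sum(not acknowledges_missing_image(r) for r in collected)
    return mirages / len(collected)


if __name__ == "__main__":
    sample = [
        "The X-ray shows a fracture.",
        "I don't see an image attached.",
        "The chart peaks in 2020.",
    ]
    print(f"mirage rate: {mirage_rate(sample):.2f}")
```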
Why benchmark scores can mislead
The Stanford University study also tested how much image benchmarks depend on the image itself. The researchers evaluated four frontier models, Gemini 3 Pro, Gemini 2.5 Pro, GPT-5.1, and Claude Opus 4.5, across six established benchmarks.
The general visual understanding benchmarks were MMMU-Pro, Video-MMMU, and Video-MME. The medical image analysis benchmarks were VQA-Rad, MicroVQA, and MedXpertQA-MM.
The central finding was striking: without seeing an image, models reached an average of 70 to 80 percent of the accuracy they achieved with the image. In other words, the image accounted for only the remaining 20 to 30 percent of the score.
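The arithmetic behind that comparison is a ratio of two scores on the same question set. The following minimal sketch uses made-up counts (60 and 80 correct answers), not figures from the study, and the function name `text_only_share` is hypothetical.

```python
def text_only_share(correct_without_image: int, correct_with_image: int) -> float:
    """Share of the with-image score that the model also reaches with no image.

    Both counts are assumed to come from the same set of questions.
    """
    if correct_with_image <= 0:
        raise ValueError("with-image score must be positive")
    return correct_without_image / correct_with_image


if __name__ == "__main__":
    # Illustrative numbers only: 60 correct without images, 80 correct with them.
    share = text_only_share(correct_without_image=60, correct_with_image=80)
    print(f"reached without the image: {share:.0%}")
    print(f"attributable to the image: {1 - share:.0%}")
```

A share close to 1 indicates that the benchmark rewards text-only answering nearly as much as answering with the image.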
The source explains that much of the score can come from text patterns, prior knowledge, and structural cues in the questions. A benchmark can therefore appear to measure visual understanding while also rewarding models for exploiting non-visual information.
The effect was largest in medical benchmarks. In that area, models achieved up to 99 percent of their image-mode accuracy through text alone. If a medical image benchmark can be solved largely without the image, its ranking may say less about visual analysis than users expect.
Medical use shows the stakes
The medical examples make the risk easier to understand. The researchers asked Gemini 3 Pro to describe nonexistent images and produce diagnoses across five clinical categories: X-ray, brain MRI, ECG, pathology, and dermatology. Each question was repeated with 200 different random seeds.
The resulting diagnoses leaned heavily toward severe pathologies. Among the most frequent outputs were ST-elevation myocardial infarctions (STEMI), melanomas, and carcinomas. The source notes that “Normal” and “No diagnosis” also appeared among the top responses, but taken together, pathological findings dominated.
This creates a practical concern for API-based applications and agentic tools. If an image upload fails or a visual input is not actually passed to the model, the system may still return a confident medical-sounding answer. In the scenario described by the source, that could mean an urgent recommendation for a condition that does not exist.
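One defensive pattern for that failure case is to refuse to call the model at all unless an image is actually attached. The sketch below is an assumption-laden illustration, not something the study prescribes: `ask_about_image`, `MissingImageError`, and the `call_model` callback are hypothetical names, and the check only recognizes PNG and JPEG signatures, so clinical formats such as DICOM would need their own validation.

```python
from typing import Callable, Optional

# Standard file signatures: the 8-byte PNG header and the JPEG start marker.
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"
JPEG_MAGIC = b"\xff\xd8\xff"


class MissingImageError(ValueError):
    """Raised when a visual question arrives without a usable image."""


def looks_like_image(data: Optional[bytes]) -> bool:
    """Return True if the payload is non-empty and starts like a PNG or JPEG."""
    if not data:
        return False
    return data.startswith(PNG_MAGIC) or data.startswith(JPEG_MAGIC)


def ask_about_image(
    question: str,
    image: Optional[bytes],
    call_model: Callable[[str, bytes], str],
) -> str:
    """Forward the question only when an image payload is really present.

    `call_model` stands in for whatever client the application uses.
    """
    if not looks_like_image(image):
        raise MissingImageError("no valid image attached; model was not queried")
    return call_model(question, image)
```

Failing loudly before the request is sent means a broken upload surfaces as an error instead of as a confident answer.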
The broader point is not limited to medicine. Any safety-critical system that relies on visual input needs to know whether the model actually received and used that input. Benchmark performance alone may not provide that assurance.
The super-guesser exposes the shortcut
To show how far text alone can go, the researchers trained what the source calls a “super-guesser.” It was a pure text model based on Qwen 2.5 with 3 billion parameters, fine-tuned on the public training set of the ReXVQA benchmark for chest X-ray analysis, with all images removed.
According to the study, this text-only model outperformed all frontier multimodal models on the held-out test set, including models with hundreds of billions of parameters. It also beat human radiologists by more than 10 percent on average.
The model had no image access, yet it could provide both the correct answer and a plausible visual explanation. Some explanations were described as indistinguishable from actual ground-truth reasoning. That result underlines the central weakness: a convincing explanation does not prove the model used visual evidence.
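The data preparation behind such a text-only model amounts to dropping the image from every training record while keeping the question and answer. The sketch below assumes a simple list-of-dictionaries layout with an `image` key; the real ReXVQA field names are not given in the article, so the layout and the function name `strip_images` are hypothetical.

```python
from typing import Any, Dict, List


def strip_images(
    records: List[Dict[str, Any]], image_key: str = "image"
) -> List[Dict[str, Any]]:
    """Return copies of the records with the image field removed.

    The input records are left unchanged.
    """
    return [
        {key: value for key, value in record.items() if key != image_key}
        for record in records
    ]


if __name__ == "__main__":
    # Made-up record for illustration; not taken from any real dataset.
    training = [
        {"image": b"...", "question": "Is there a pleural effusion?", "answer": "No"},
    ]
    print(strip_images(training))
```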
The study also compared GPT-5.1 in two modes. In mirage mode, it received a visual question without an image and was not told the image was absent. In guess mode, it was explicitly told no image was available and asked to choose the best possible answer.
Performance dropped in guess mode across nearly all benchmark categories. The information available to the model was the same in both cases: the question text and the world knowledge it acquired in training. The difference was how the model behaved. In mirage mode, it acted as though it had visual input and constructed a plausible perceptual narrative.
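The two modes differ only in what the prompt says about the image, which a short sketch can make concrete. The wording of the guess-mode notice below is an assumption, since the article does not quote the study's prompts, and the function names are hypothetical.

```python
from typing import Sequence


def mirage_prompt(question: str) -> str:
    """Mirage mode: send the visual question as-is, with no image and no notice."""
    return question


def guess_prompt(question: str) -> str:
    """Guess mode: state that no image exists and ask for the best guess."""
    notice = (
        "No image is available for this question. "
        "Choose the best possible answer from the text alone."
    )
    return f"{notice}\n\n{question}"


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of predictions that match the reference answers exactly."""
    if not answers or len(predictions) != len(answers):
        raise ValueError("predictions and answers must be non-empty and equal length")
    hits = sum(p == a for p, a in zip(predictions, answers))
    return hits / len(answers)


def mode_gap(
    mirage_predictions: Sequence[str],
    guess_predictions: Sequence[str],
    answers: Sequence[str],
) -> float:
    """Mirage-mode accuracy minus guess-mode accuracy on the same questions."""
    return accuracy(mirage_predictions, answers) - accuracy(guess_predictions, answers)
```

A positive gap on the same questions would correspond to the pattern the study reports, where the model scores higher when it is not told the image is missing.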
What this changes for evaluation
The study does not claim that multimodal models cannot process images. Its narrower claim is that current benchmarks often cannot tell whether a model used the image or derived the answer from text alone.
That is a serious measurement problem. Companies and hospitals may choose AI models based on benchmark rankings. If those rankings mostly reflect non-visual reasoning, they can give buyers and developers a false sense of visual competence.
The study also warns that the failure mode varies by domain. A model that uses visual information effectively for natural images may not do the same for X-rays or pathology slides. That makes broad benchmark claims especially fragile when the intended use is specialized.
The practical lesson is direct: visual AI evaluation needs controls that detect whether the model actually depends on the image. Without that, high scores can reward a model for sounding like it saw something, rather than proving that it did.