Google’s Veo-3 shows a sharp split between visual imitation and medical understanding. In tests based on real surgical footage, the video AI could create clips that appeared credible at first glance, but the details that matter in an operating room quickly broke down.
The result is a useful warning for anyone looking at synthetic video in healthcare: a realistic-looking surgical scene is not the same thing as a medically meaningful one.
How the SurgVeo test worked
An international team created the SurgVeo benchmark to evaluate how Veo-3 handles surgical video prediction. The benchmark used 50 real videos from abdominal and brain surgeries.
The task was narrow but demanding. Veo-3 received a single image and was asked to predict how the surgery would continue over the next eight seconds. That made the test less about making a polished clip from scratch and more about whether the model could infer what should happen next in a real procedure.
Four experienced surgeons then reviewed the AI-generated clips. They scored Veo-3 across four criteria:
- Visual appearance
- Instrument use
- Tissue feedback
- Whether the actions made medical sense
The distinction between the first criterion and the other three is central. A model can make tools, tissue, and motion look plausible without understanding how instruments should be handled, how tissue should react, or what sequence of actions belongs in a procedure.
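As a rough illustration of how ratings on these four criteria might be summarized, the sketch below averages scores from several reviewers per criterion. The data layout, the function name `mean_scores`, and the sample ratings are all hypothetical and are not taken from the SurgVeo study. Only the four criteria and the scoring out of 5 come from the description above.

```python
from statistics import mean

# The four criteria named in the article; the identifiers are illustrative.
CRITERIA = ("visual_appearance", "instrument_use", "tissue_feedback", "medical_sense")


def mean_scores(ratings):
    """Average each criterion over all reviewers.

    `ratings` is a list of dicts, one per reviewer, mapping criterion -> score.
    """
    return {c: round(mean(r[c] for r in ratings), 2) for c in CRITERIA}


# Made-up ratings (out of 5) from four hypothetical reviewers for a single clip.
example_ratings = [
    {"visual_appearance": 4, "instrument_use": 2, "tissue_feedback": 2, "medical_sense": 1},
    {"visual_appearance": 4, "instrument_use": 2, "tissue_feedback": 1, "medical_sense": 2},
    {"visual_appearance": 3, "instrument_use": 1, "tissue_feedback": 2, "medical_sense": 2},
    {"visual_appearance": 4, "instrument_use": 2, "tissue_feedback": 2, "medical_sense": 1},
]

summary = mean_scores(example_ratings)
print(summary)
```

Keeping the criteria separate, rather than collapsing them into one overall score, is what lets a high visual score sit next to low scores for instrument use, tissue feedback, and medical sense.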
Strong images did not mean strong medicine
Veo-3’s best performance came from the surface-level quality of its clips. In abdominal surgery tests, it scored 3.72 out of 5 for visual plausibility after one second. Some surgeons described the visual quality as "shockingly clear."
But the more the evaluation depended on surgical correctness, the weaker the results became. For abdominal procedures, instrument handling scored only 1.78 points. Tissue response was lower at 1.64, and surgical logic came in at 1.61.
That pattern matters because the visual layer can be misleading. A clip may look like a surgical video while still showing actions that do not fit the real medical situation. In a domain where the sequence and purpose of movements are critical, visual plausibility is only a small part of the problem.
The abdominal surgery results show that Veo-3 can reproduce the appearance of an operating room scene more effectively than it can reproduce the underlying logic of the procedure. The model can imitate what surgery looks like, but it cannot reliably represent what actually happens during surgery.
Brain surgery exposed deeper limits
The benchmark also tested Veo-3 on brain surgery footage, where the required precision was even harder for the model to handle. From the first second, it struggled with the fine detail needed in neurosurgery.
For brain operations, instrument handling scored 2.77 points, against 3.36 for abdominal surgery, and surgical logic fell as low as 1.13 after eight seconds.
The longer prediction window is important. A model may preserve the look of the scene for a moment, but as it extends the imagined procedure, errors in cause and effect become more visible. The test suggests that Veo-3’s weakness is not just in isolated details, but in maintaining a medically coherent chain of events.
The researchers also examined what kinds of mistakes the model made. More than 93 percent of errors were tied to medical logic. The AI invented tools, imagined impossible tissue responses, or generated actions that made no clinical sense.
By contrast, only a small share of errors came from image quality: 6.2 percent for abdominal surgery and 2.8 percent for brain surgery. In other words, the main failure was not that the clips looked bad. It was that they looked good while being medically wrong.
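The two sets of figures are consistent with each other if one assumes that every error falls into exactly one of two groups, image quality or medical logic. That split is an assumption made here for illustration; the text does not describe the full error taxonomy. A quick check under that assumption:

```python
# Share of errors attributed to image quality, in percent, as reported above.
image_quality_share = {"abdominal": 6.2, "brain": 2.8}

# Assumption: every error is either an image-quality error or a medical-logic error,
# so the medical-logic share is simply the remainder.
medical_logic_share = {
    surgery: round(100.0 - share, 1)
    for surgery, share in image_quality_share.items()
}

print(medical_logic_share)  # {'abdominal': 93.8, 'brain': 97.2}
```

Both remainders exceed 93 percent, which matches the statement that more than 93 percent of errors were tied to medical logic.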
More context did not solve the problem
The researchers tried giving Veo-3 additional information, including the type of surgery and the exact phase of the procedure. The results did not show meaningful or consistent improvement.
That finding points to a deeper issue. If the model cannot process and use relevant surgical context, adding labels or procedural hints may not be enough. According to the team, the problem is not simply a lack of information. It is the model’s inability to understand that information in a medical way.
This is especially important for discussions about video models as "world models." The SurgVeo results suggest that current systems can imitate motion and appearance without reliably grasping physical or anatomical logic. They can produce video that seems convincing, but they do not capture the real cause-and-effect structure behind surgery.
The researchers plan to release the SurgVeo benchmark on GitHub so other teams can test and improve their models. That could help make progress more measurable, especially if future systems are judged not only by visual quality but also by whether their outputs make sense in the domain they are trying to represent.
Why this matters for medical AI
The study highlights a specific risk for AI-generated videos in medical training. Future systems could one day help train doctors, assist with surgical planning, or even guide procedures. But the current results show that today’s models are not close to that level.
The concern is not just that a generated clip might contain a visible flaw. The more serious risk is that a video could look plausible while teaching the wrong procedure, the wrong instrument behavior, or the wrong tissue response.
Nvidia’s approach, in which AI videos help train robots for general tasks, offers a contrast. In healthcare, the same kind of hallucination carries higher stakes. A synthetic surgical video that appears correct but shows medically incorrect actions could mislead robots or trainees.
The findings also sit alongside a broader split in medical AI. Text-based AI is already showing real gains in medicine. In one study, Microsoft’s "MAI Diagnostic Orchestrator" delivered diagnostic accuracy four times higher than that of experienced general practitioners in complex cases, although the study notes some methodological limitations.
For surgical video, however, the SurgVeo test shows that realistic generation is not enough. Until video AI can connect appearance with medical logic, tissue behavior, and instrument handling, its role in high-stakes clinical settings should remain limited.