Google is moving toward a future where its Gemini AI models and Veo video-generating models work more closely together. In a recent appearance on Possible, a podcast co-hosted by LinkedIn co-founder Reid Hoffman, Google DeepMind CEO Demis Hassabis said the company plans to eventually combine the two model families.
The stated reason is practical: Google wants Gemini to better understand the physical world. For a company building toward a universal digital assistant, that means models need to process more than text. They need to work across images, video, audio, and other forms of media well enough to help with real-world tasks.
Why Gemini and Veo Fit Into the Same AI Roadmap
Hassabis described Gemini as a foundation model that was designed to be multimodal from the beginning. That design choice matters because Google's vision is not limited to a chatbot that answers written questions. The company is aiming at an assistant that can operate with a richer view of the world around the user.
In Hassabis' words: "We've always built Gemini, our foundation model, to be multimodal from the beginning." He connected that approach to "this idea of a universal digital assistant," one that "actually helps you in the real world."
Veo adds another piece to that direction because it is built around video generation. If video models can learn patterns about motion, objects, and cause and effect, then combining that capability with Gemini could give the broader system a stronger grasp of how things behave outside a text prompt.
The Physical World Is the Key Goal
The clearest point from Hassabis was not simply that Google wants bigger AI models, but that it wants models with a better grasp of physical reality. Video is central to that ambition because it captures movement, timing, spatial relationships, and visible outcomes.
Hassabis said Veo 2 can learn about physics by watching YouTube videos. His explanation was direct: "Basically, by watching YouTube videos — a lot of YouTube videos — [Veo 2] can figure out, you know, the physics of the world."
That framing helps explain why Gemini and Veo are complementary. Gemini is the broad foundation model, and Veo is a video-generating model. A future system that combines them could draw on what Veo learns from video to support the wider assistant experience Google has described.
AI Models Are Moving Toward More Media Types
The source article places Google's plans inside a wider AI industry shift toward models that can understand and synthesize many forms of media. These are sometimes described as "omni" models: systems that can work across text, images, audio, video, and other inputs or outputs.
Google's newest Gemini models can generate audio as well as images and text. OpenAI's default model in ChatGPT can now create images, including Studio Ghibli-style art. Amazon has also announced plans to launch an "any-to-any" model later this year.
The common thread is that major AI systems are becoming less tied to a single format. A model that handles only text is limited to what text can express. A model that can handle text, images, audio, and video can be positioned as a more general assistant, creator, and interface.
The Data Question Behind Multimodal AI
These broader models require large amounts of training data across media types. The source article lists images, videos, audio, text, and more as part of the training mix needed for omni models.
Hassabis implied that Veo's video data comes mostly from YouTube, which Google owns. That matters because YouTube is a major source of video content inside Google's own ecosystem, and video is especially relevant to the physical-world understanding Hassabis described.
Google previously told TechCrunch that its models "may be" trained on "some" YouTube content, in accordance with its agreement with YouTube creators. The source article also says the company reportedly broadened its terms of service last year in part to tap more data for AI model training.
What to Watch Next
The important takeaway is that Google is not presenting Gemini and Veo as isolated projects. Hassabis' comments point to a longer-term plan in which the strengths of video generation and multimodal foundation models are brought together.
For users, the most visible result would not necessarily be a separate product called a Gemini-Veo hybrid. The bigger idea is a universal digital assistant with a stronger ability to interpret and act on real-world context. For Google, that means Gemini's future may depend partly on what Veo learns from video.
The source does not say how or when Google will combine Gemini and Veo. But the direction is clear: the company sees video as a route to better physical-world understanding, and it sees multimodal AI as the foundation for assistants that do more than respond to text.