TechCrunch AI December 16, 2024 NEUTRAL

Veo 2 Pushes Google DeepMind Deeper Into AI Video

Google DeepMind has announced Veo 2, a new video-generating AI model that can theoretically produce clips over two minutes long at up to 4k resolution. For now, it is available only through VideoFX, where outputs are capped at 720p and eight seconds.

WTF Index NEUTRAL

A routine AI video model launch with limited access and no clear harmful or degrading societal impact described.

Google DeepMind is moving quickly in AI video with Veo 2, a new model designed to generate video from prompts and compete directly with OpenAI’s Sora. The announcement describes a model with higher theoretical resolution and duration limits than Sora, but the version people can use today is still tightly limited.

What Veo 2 Can Generate

Veo 2 is the successor to Veo, the model already powering a growing number of products across Google’s portfolio. DeepMind says the new system can create clips longer than two minutes and at resolutions up to 4k (4096 x 2160 pixels).

That headline capability matters because it gives Veo 2 a theoretical edge over Sora. According to the source, Veo 2’s top resolution is 4x higher than Sora’s (measured by pixel count, 4k against 1080p) and its possible duration is over 6x longer (more than two minutes against 20 seconds).

The practical experience is more constrained. In VideoFX, Google’s experimental video creation tool and the only place Veo 2 is currently available, videos are capped at 720p and eight seconds. By comparison, Sora can produce clips up to 1080p and 20 seconds long.
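
The ratios behind that comparison can be checked with a few lines of arithmetic. The sketch below is illustrative only and rests on assumptions the article does not state: that 1080p means 1920 x 1080 pixels, that 720p means 1280 x 720 pixels, and that "over two minutes" is treated as a lower bound of 120 seconds. The dictionary keys and function names are made up for this example.

```python
# Limits as reported in the article. The 1080p and 720p pixel dimensions
# are assumed standard 16:9 values; the article does not spell them out.
LIMITS = {
    # "over two minutes" is modeled as a 120-second lower bound (assumption)
    "veo2_theoretical": {"width": 4096, "height": 2160, "seconds": 120},
    "sora": {"width": 1920, "height": 1080, "seconds": 20},
    "veo2_videofx": {"width": 1280, "height": 720, "seconds": 8},
}


def pixel_count(name: str) -> int:
    """Pixels per frame for the named configuration."""
    entry = LIMITS[name]
    return entry["width"] * entry["height"]


def pixel_ratio(a: str, b: str) -> float:
    """How many times more pixels per frame configuration a has than b."""
    return pixel_count(a) / pixel_count(b)


def duration_ratio(a: str, b: str) -> float:
    """How many times longer the maximum clip of a is than that of b."""
    return LIMITS[a]["seconds"] / LIMITS[b]["seconds"]


if __name__ == "__main__":
    for name in ("veo2_theoretical", "veo2_videofx"):
        print(
            f"{name} vs sora: "
            f"{pixel_ratio(name, 'sora'):.2f}x pixels, "
            f"{duration_ratio(name, 'sora'):.2f}x duration"
        )
```

Under these assumptions the theoretical Veo 2 limits work out to about 4.27x Sora's pixel count and at least 6x its duration, which matches the "4x" and "over 6x" figures. The VideoFX version comes in below Sora on both measures, at about 0.44x the pixels and 0.4x the duration.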

VideoFX is also behind a waitlist, though Google says it is expanding access this week. Eli Collins, VP of product at DeepMind, told TechCrunch that Veo 2 will come to Vertex AI “as the model becomes ready for use at scale.”

More Control Over Motion And Camera Work

Like the first Veo, Veo 2 can generate video from a text prompt or from text paired with a reference image. A prompt such as “A car racing down a freeway” can be used to direct the model toward a specific scene.

DeepMind says the new model has improved “understanding” of physics and camera controls. In plain terms, that means it is intended to place and move a virtual camera more precisely, while making objects, people, and scenes appear more consistent from different angles.

The company also says Veo 2 produces “clearer” footage. That refers to sharper textures and images, especially when a scene contains a lot of movement.

DeepMind claims the model is better at representing motion, fluid dynamics, and light. Examples include coffee being poured into a mug, shadows, reflections, lenses, cinematic effects, and “nuanced” human expression.

TechCrunch viewed selected samples from DeepMind and described them as strong for AI-generated video. The source noted Veo 2’s handling of refraction, liquids such as maple syrup, and Pixar-style animation.

The Limits Are Still Visible

Veo 2 is not being presented as a solved system. DeepMind says the model is less likely to hallucinate elements such as extra fingers or “unexpected objects,” but the source article still describes visible weaknesses in sample clips.

Those issues included lifeless eyes in a cartoon dog-like creature, a strangely slippery road, background pedestrians blending into one another, and buildings with physically impossible facades.

Collins acknowledged the remaining gaps directly. “Coherence and consistency are areas for growth,” he said. “Veo can consistently adhere to a prompt for a couple minutes, but [it can’t] adhere to complex prompts over long horizons. Similarly, character consistency can be a challenge. There’s also room to improve in generating intricate details, fast and complex motions, and continuing to push the boundaries of realism.”

Those limitations matter for anyone thinking about production use. A model may generate impressive short clips, but complex prompts, long scenes, character continuity, and intricate movement remain hard problems.

Creators, Training Data And Safety

DeepMind says Veo 2 was trained on video-description pairings. Collins described those as a video and an associated description of what happens in it.

The company has not said exactly where the training videos came from. The source notes that YouTube is one possible source because Google owns YouTube, and DeepMind previously told TechCrunch that Google models like Veo “may” be trained on some YouTube content.

DeepMind, through Google, provides tools that let webmasters block its bots from extracting training data from websites. However, it does not offer a way for creators to remove works from existing training sets. The lab and Google maintain that training models on public data is fair use.

That view is contested by some creatives. The source points to studies estimating that tens of thousands of film and TV jobs could be disrupted by AI in the coming years, as well as lawsuits against several AI companies over training on content without consent.

DeepMind says it is working with artists and producers. Collins said the team has worked with Donald Glover, the Weeknd, d4vd, and others since the start of Veo development to understand creative processes and how the technology could support their work.

Safety measures include prompt-level filters for violent, graphic, and explicit content. DeepMind also says Veo 2 uses SynthID, Google’s proprietary watermarking technology, to add invisible markers into generated frames. The source notes that SynthID, like all watermarking technology, is not foolproof.

Google’s indemnity policy, which provides a defense for certain customers against copyright infringement allegations related to use of its products, will not apply to Veo 2 until it is generally available, Collins said.

Imagen 3 Also Gets An Upgrade

Alongside Veo 2, Google DeepMind announced upgrades to Imagen 3, its commercial image generation model. A new version is rolling out to ImageFX users beginning Monday.

DeepMind says the updated Imagen 3 can produce “brighter, better-composed” images and photos in styles including photorealism, impressionism, and anime. In a blog post provided to TechCrunch, DeepMind wrote that the upgrade follows prompts more faithfully and renders richer details and textures.

ImageFX is also getting interface changes. When users write prompts, key terms can become “chiplets” with drop-down suggestions for related words. Users can adjust prompts through those chiplets or choose from auto-generated descriptors beneath the prompt.

Together, the Veo 2 and Imagen 3 updates show Google DeepMind trying to improve both video generation and image generation while keeping the newest video model inside a limited experimental release. The main question is how quickly Veo 2’s theoretical capabilities become available beyond short clips in VideoFX.