Google’s Veo 3 marks a clear shift in consumer AI video. The model can create short clips with realistic people, synchronized audio, sound effects, music and dialog, making synthetic media harder to separate from authentic footage.
The launch also shows how quickly AI video is moving from visual novelty to a more complete media format. With sound attached to generated characters and scenes, the question is no longer only whether an image looks convincing. It is whether a full clip can feel believable enough to pass casual inspection.
What Google Veo 3 Can Generate
Google introduced Veo 3 last week as its newest video generation model. It creates 8-second clips at 720p resolution from text descriptions, known as prompts, or from still image inputs.
The major addition is synchronized audio. Veo 3 can produce sound effects, music and spoken dialog within the generated video. That combination is described as a first for Google’s AI tools.
Google also launched Flow, an online AI filmmaking tool that brings together Veo 3, Imagen 4 and Gemini. Flow lets users describe scenes in natural language while managing characters, locations and visual styles in a web interface.
Together, Veo 3 and Flow point toward a more complete creative system. A user is no longer only asking for a clip, but can shape a scene, define its participants and generate both the visual and audio pieces inside one workflow.
The Price of Short AI Video
Veo 3 and Flow are available to US subscribers of Google AI Ultra. The plan costs $250 a month and includes 12,500 credits.
Each Veo 3 generation costs 150 credits. That means the plan covers 83 full generations (12,500 ÷ 150 ≈ 83.3) before the included credits run out. Extra credits cost 1 cent each and are sold in blocks of $25, $50, or $200, which works out to $1.50 per additional video generation.
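The credit arithmetic above can be checked with a short Python sketch. The prices and credit counts are the ones quoted in this article; the function names are hypothetical and chosen only for illustration, and the per-generation price of included credits is a derived figure, not one Google publishes.

```python
# Figures quoted in the article; amounts are kept in integer cents
# to avoid floating-point rounding.
PLAN_PRICE_CENTS = 25_000        # $250 per month
PLAN_CREDITS = 12_500            # credits included in the plan
CREDITS_PER_GENERATION = 150     # cost of one Veo 3 generation
EXTRA_CREDIT_PRICE_CENTS = 1     # 1 cent per extra credit


def included_generations():
    """Whole generations the included credits can pay for."""
    return PLAN_CREDITS // CREDITS_PER_GENERATION


def leftover_credits():
    """Credits left over after the last full included generation."""
    return PLAN_CREDITS % CREDITS_PER_GENERATION


def overage_cents_per_generation():
    """Price of one generation paid for entirely with extra credits."""
    return CREDITS_PER_GENERATION * EXTRA_CREDIT_PRICE_CENTS


def plan_cents_per_generation():
    """Derived figure: what one generation costs at the subscription's credit rate."""
    return PLAN_PRICE_CENTS * CREDITS_PER_GENERATION // PLAN_CREDITS


if __name__ == "__main__":
    print(included_generations(), "included generations")
    print(leftover_credits(), "credits left over")
    print(overage_cents_per_generation() / 100, "USD per extra generation")
    print(plan_cents_per_generation() / 100, "USD per included generation")
```

On these numbers, the plan covers 83 generations with 50 credits left over. An extra generation costs $1.50, while a generation paid for out of the subscription works out to $3.00, because the included credits cost 2 cents each.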
Those numbers matter because AI video quality often depends on repetition. The source testing notes that better results can come from running the same prompt multiple times and selecting the strongest output. In that context, the listed price of one generation may understate the practical cost of getting a usable clip.
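A rough Python sketch shows how repeated attempts change the bill. The prices come from this article, but the number of attempts per usable clip is a hypothetical input, since the testing gives no typical figure. The sketch also assumes extra credits can be bought in exact amounts, which ignores the $25, $50 and $200 block sizes.

```python
# Figures quoted in the article, in integer cents.
PLAN_PRICE_CENTS = 25_000        # $250 per month
PLAN_CREDITS = 12_500            # credits included in the plan
CREDITS_PER_GENERATION = 150     # cost of one Veo 3 generation
EXTRA_CREDIT_PRICE_CENTS = 1     # 1 cent per extra credit


def monthly_cost_cents(usable_clips, attempts_per_clip):
    """Total monthly spend when every usable clip takes several attempts.

    attempts_per_clip is a hypothetical assumption, not a measured value.
    Extra credits are assumed purchasable in exact amounts.
    """
    generations = usable_clips * attempts_per_clip
    credits_needed = generations * CREDITS_PER_GENERATION
    extra_credits = max(0, credits_needed - PLAN_CREDITS)
    return PLAN_PRICE_CENTS + extra_credits * EXTRA_CREDIT_PRICE_CENTS


def cost_per_usable_clip_cents(usable_clips, attempts_per_clip):
    """Average spend per clip that is actually kept."""
    return monthly_cost_cents(usable_clips, attempts_per_clip) / usable_clips


if __name__ == "__main__":
    total = monthly_cost_cents(usable_clips=30, attempts_per_clip=4)
    print(f"Total: ${total / 100:.2f}")
    print(f"Per usable clip: ${cost_per_usable_clip_cents(30, 4) / 100:.2f}")
```

With 30 usable clips at four attempts each, the 120 generations need 18,000 credits, so the month costs $305 under these assumptions, or a little over $10 per clip that is kept.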
During testing, each 8-second 720p video generated through Google’s Flow platform took around three to five minutes to complete. The tests were paid for directly, and most prompts were run only once unless noted.
How the System Works
Veo 3 is built on diffusion technology, the same general approach used by image generators such as Stable Diffusion and Flux. In training, real videos are gradually transformed into noise, and a neural network learns how to reverse that process.
When generating a new clip, the system begins with random noise and a prompt. It then refines that noise into a video that matches the description.
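The noising-and-reversal idea can be illustrated with a toy Python sketch. It uses the standard formulation from the general diffusion literature, with a linear noise schedule, applied to a short list of numbers standing in for pixel values. It is not Veo 3's implementation, which Google has not published in detail. The schedule values and function names are assumptions, and the trained neural network is replaced by a stand-in that is simply handed the true noise.

```python
import math
import random


def alpha_bar(t, num_steps, beta_start=1e-4, beta_end=0.02):
    """Fraction of the original signal's variance remaining after t noising steps.

    Uses a linear beta schedule (an assumed, commonly used choice).
    """
    product = 1.0
    for step in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (step - 1) / (num_steps - 1)
        product *= 1.0 - beta
    return product


def add_noise(x0, t, num_steps, rng):
    """Forward process: blend clean data with Gaussian noise at step t."""
    ab = alpha_bar(t, num_steps)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(ab) * x + math.sqrt(1.0 - ab) * e for x, e in zip(x0, eps)]
    return xt, eps


def recover_clean(xt, predicted_eps, t, num_steps):
    """Reverse direction: estimate the clean data from a noise prediction.

    In a real model, predicted_eps comes from a trained network that is
    conditioned on the prompt; here it is supplied by the caller.
    """
    ab = alpha_bar(t, num_steps)
    return [
        (x - math.sqrt(1.0 - ab) * e) / math.sqrt(ab)
        for x, e in zip(xt, predicted_eps)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    clean = [0.5, -1.0, 0.25]          # toy stand-in for pixel values
    noisy, true_noise = add_noise(clean, t=500, num_steps=1000, rng=rng)
    # With a perfect noise prediction, the clean values come back.
    print(recover_clean(noisy, true_noise, t=500, num_steps=1000))
```

The design point is that training reduces to predicting the noise that was added. A network that does this well can start from pure noise and remove a little of its estimated noise at each step, steered by the prompt, until a coherent result remains.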
Veo 3 is not a single model doing one task. It combines several AI models, including a large language model to interpret prompts, a video diffusion model to generate the moving image, and an audio generation model that adds sound.
The training data remains an open issue. DeepMind has not said exactly where the content used to train Veo 3 came from. YouTube is described as a strong possibility because Google owns YouTube, and DeepMind previously told TechCrunch that Google models like Veo “may” be trained on some YouTube material.
What the Tests Revealed
The most important advance in Veo 3 is the integrated audio generation. The model can create traffic sounds, music and character dialog, making the output feel more complete than silent AI video.
The testing also found flaws. Spaghetti made crunching sounds when eaten. In scenes with multiple people, dialog sometimes came from the wrong character’s mouth. Generated videos also tended to show garbled subtitles that nearly matched the spoken words, likely reflecting subtitles present in training data.
Even with those issues, Veo 3 was described as a major improvement in quality and coherence compared with models from OpenAI, Runway, Minimax, Pika, Meta, Kling and HunyuanVideo.
The tested prompts covered a wide range of synthetic scenes. Examples included a muscular barbarian beside a CRT television set, a horror-film chase involving a peanut costume, a trailer concept called The Haunted Basketball Train, an ASMR scene, a 1980s PBS-style computer segment, a 1980s fitness video with werewolf masks and a Zoom-style therapist clip.
One prompt asked for a barbarian character to speak the line: “You’ve been looking for this for years: a muscular barbarian with weapons beside a CRT television set, cinematic, 8K, studio lighting. Got that, Benj?” Another therapist-style prompt included the line: “Oh my lord, look at that Atari 800 you have behind you! I can’t believe how nice it is!”
Why Detection and Trust Are Now Central
Google is trying to limit misuse in several ways. DeepMind says it uses SynthID, its proprietary watermarking technology, to place invisible markers inside frames generated by Veo 3. These markers are intended to remain present even after videos are compressed or edited.
That may help people identify AI-generated content, but the source doubts that watermarking alone will stop deception. The closer synthetic video gets to authentic-looking footage, the more pressure falls on detection tools, platform policies and user judgment.
Google also blocks some prompts and outputs under its content agreement. During testing, generation failure messages appeared for romantic and sexual material, some types of violence, mentions of certain trademarked or copyrighted media properties, some company names, certain celebrities and some historical events.
Those filters shape what this version of Veo 3 can produce. But the broader implication is clear: once realistic video, voices and sound can be generated together, AI media becomes more persuasive. A clip can flatter, perform, explain or imitate a familiar format while being entirely synthetic.
Veo 3 is therefore not just another creative tool. It is a sign that AI video realism has entered a more consequential stage, where the technical leap is impressive and the trust problem becomes harder to ignore.