TechCrunch AI December 12, 2024 NEUTRAL

Real-time video finally reaches ChatGPT voice chats

OpenAI has released real-time video for ChatGPT’s Advanced Voice Mode after first showing the capability nearly seven months ago. The feature lets eligible users point a phone camera at objects or share a screen and get near real-time spoken responses, though access is limited by plan and region.

WTF Index NEUTRAL

Terminator: 1, Idiocracy: 1

This is mainly a routine product capability rollout, with only mild concerns about stronger real-time AI perception and user dependence.

ChatGPT’s voice experience is becoming visual. OpenAI has released real-time video capabilities for Advanced Voice Mode, giving some subscribers a way to show ChatGPT objects, screens, drawings, and settings instead of describing everything in text.

The launch arrives after OpenAI first demoed the capability nearly seven months ago. It also comes with limits: the rollout is staggered, some subscription tiers will wait until January, and several European markets have no stated timeline.

What ChatGPT can now see

OpenAI said during a Thursday livestream that Advanced Voice Mode, the conversational feature built to make ChatGPT feel more natural in spoken exchanges, is getting vision. In the ChatGPT app, users on ChatGPT Plus, Team, or Pro can point their phones at objects and ask ChatGPT to respond in near real time.

The practical change is straightforward: instead of typing a description or uploading a still image, a user can point the phone camera at something while talking to ChatGPT. The interaction becomes closer to asking for help while showing someone what is in front of you.

The feature also works with on-screen content through screen sharing. According to TechCrunch’s report, Advanced Voice Mode with vision can explain settings menus or offer suggestions on a math problem. Those examples show the feature’s main purpose: it connects spoken guidance with visual context.

OpenAI’s demo history suggests a broad range of uses, but the reported examples stay grounded in everyday tasks: identifying what a user is drawing, understanding app screens, and reacting while the user continues interacting with the real world.

How eligible users start a video or screen share

The access flow is built into the ChatGPT app’s voice interface. To use Advanced Voice Mode with vision, users tap the voice icon next to the ChatGPT chat bar, then tap the video icon on the bottom left to start video.

For screen sharing, the path is different. Users tap the three-dot menu and select “Share Screen.” That gives ChatGPT access to the screen context needed to comment on menus, work through a displayed problem, or respond to other visible material.

The distinction matters because the feature is not only about a phone camera. OpenAI is positioning vision as an added layer for the voice experience, whether the user is showing the physical world through a camera or showing digital content through screen sharing.

Who gets access now, and who has to wait

The rollout starts Thursday and is expected to wrap up in the next week, OpenAI says. Access, however, is not universal across ChatGPT accounts.

Users subscribed to ChatGPT Plus, Team, or Pro are included in the current release. ChatGPT Enterprise and Edu subscribers will not get the feature until January.

There is also a regional gap. OpenAI has no timeline for ChatGPT users in the EU, Switzerland, Iceland, Norway, or Liechtenstein. In practice, plan type and location together determine when a user can try the feature.

The staggered rollout follows a longer delay. OpenAI had promised in April that Advanced Voice Mode would reach users “within a few weeks.” Months later, the company said it needed more time. When Advanced Voice Mode arrived in early fall for some ChatGPT users, the visual analysis component was not included.

The demos show promise, but also a warning

In a recent CBS News “60 Minutes” demo, OpenAI President Greg Brockman had Advanced Voice Mode with vision quiz Anderson Cooper on anatomy. As Cooper drew body parts on a blackboard, ChatGPT commented on what it recognized in the drawing.

“The location is spot on,” ChatGPT said. “The brain is right there in the head. As for the shape, it’s a good start. The brain is more of an oval.”

That kind of exchange explains why real-time video matters for a voice assistant. The model is not just listening to a question; it is reacting to a changing visual scene. For users, that could make help feel more immediate because they can show the issue instead of translating it into a written prompt.

But the same demo showed a limitation. Advanced Voice Mode with vision made a mistake on a geometry problem, a sign that it remains prone to hallucination. That caveat is central to how the feature should be understood: the system can interpret visual input and respond conversationally, but it can still be wrong.

For tasks involving learning, troubleshooting, or interface guidance, the feature may be useful as a real-time assistant. For answers that require precision, users still need to treat its output with caution.

Why the timing matters

The release lands as other major AI companies are also working on video-aware chatbots. Rivals like Google and Meta are developing similar capabilities for their respective chatbot products. This week, Google made its real-time, video-analyzing conversational AI feature available to a group of “trusted testers” on Android.

For OpenAI, the launch fills in a missing piece of Advanced Voice Mode. The voice-only experience had already expanded to additional platforms and users in the EU before Thursday’s announcement, but the visual component remained the delayed part of the original pitch.

OpenAI also launched a seasonal addition on Thursday: “Santa Mode,” which adds Santa’s voice as a preset voice in ChatGPT. Users can find it by tapping or clicking the snowflake icon in the ChatGPT app next to the prompt bar.

The larger shift is still the arrival of Advanced Voice Mode with vision. ChatGPT can now combine speech, live camera input, and screen sharing for eligible users, moving the assistant from a text-and-voice tool toward a more visually aware conversational product.