Ars Technica AI October 17, 2024 TERMINATOR

Why AI video scraping makes any screen recording useful data

AI researcher Simon Willison showed that a short screen recording can be turned into structured data using Google’s Gemini models. The same capability could make assistants more useful, but it also raises serious privacy questions when AI systems can watch computer screens.

A simple screen recording is becoming a new kind of input for AI. In an experiment described by AI researcher Simon Willison, a video of emails was enough for Gemini to extract payment dates and dollar amounts into structured data.

The result points to a broader shift: instead of typing, copying, pasting, or explaining every detail to a chatbot, people may increasingly let multimodal AI models read what is visible on their screens.

What Willison Tested

Willison needed to add up charges from a cloud service. The relevant payment values and dates were spread across twelve different emails, which made manual entry tedious.

Rather than copy the data line by line, he recorded a 35-second video while scrolling through the emails. He then uploaded the video to Google’s AI Studio, a tool for experimenting with versions of Gemini 1.5 Pro and Gemini 1.5 Flash.

His prompt asked Gemini to identify the price data and return it as JSON, including dates and dollar amounts. After that, he converted the result into a CSV table for spreadsheet use and checked the output for mistakes.
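The JSON-to-CSV step and the check for mistakes can be sketched with Python’s standard library. This is a minimal illustration, not Willison’s actual code: the sample records, the field names `date` and `amount`, and the helper `json_to_csv` are all assumptions made for the example.

```python
import csv
import io
import json
from datetime import date
from decimal import Decimal

# Hypothetical model output. The field names "date" and "amount" and the
# values are invented for illustration; they are not Willison's real data.
model_output = """
[
  {"date": "2024-01-05", "amount": 12.50},
  {"date": "2024-02-05", "amount": 14.25}
]
"""


def json_to_csv(raw_json: str) -> tuple[str, Decimal]:
    """Validate each record, then return CSV text and the summed amount."""
    records = json.loads(raw_json)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["date", "amount"])
    total = Decimal("0")
    for record in records:
        # Basic sanity checks: a malformed date or amount raises an error
        # here instead of slipping silently into the spreadsheet.
        day = date.fromisoformat(record["date"])
        amount = Decimal(str(record["amount"]))
        writer.writerow([day.isoformat(), f"{amount:.2f}"])
        total += amount
    return buffer.getvalue(), total


csv_text, total = json_to_csv(model_output)
print(csv_text)
print(f"Total: ${total:.2f}")
```

Parsing every date and amount before writing it is a cheap first line of defense, but it only catches malformed values. A plausible-looking wrong number still has to be checked by eye against the original emails.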

The striking part was not only that the extraction worked. It was also how little the analysis appeared to cost. Willison wrote, “The cost [of running the video model] is so low that I had to re-run my calculations three times to make sure I hadn’t made a mistake.”

According to Willison, the full video analysis appeared to cost less than one-tenth of a cent on Gemini 1.5 Flash 002, using just 11,018 tokens. In practice he paid nothing, because Google AI Studio was free for some types of use at the time.
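The arithmetic behind that figure is easy to reproduce. The sketch below uses the 11,018-token count from the article, but the price per million tokens is an assumption: the article does not state the rate, and $0.075 is used here only as an illustrative input price that should be replaced with the current published one.

```python
TOKENS_USED = 11_018  # token count reported in the article

# ASSUMPTION: input price in US dollars per one million tokens. The article
# does not give the rate; substitute the provider's current published price.
ASSUMED_PRICE_PER_MILLION_USD = 0.075


def estimate_cost_usd(tokens: int, price_per_million_usd: float) -> float:
    """Return the estimated cost in dollars for a given token count."""
    return tokens / 1_000_000 * price_per_million_usd


cost_usd = estimate_cost_usd(TOKENS_USED, ASSUMED_PRICE_PER_MILLION_USD)
cost_cents = cost_usd * 100
print(f"${cost_usd:.6f} ({cost_cents:.4f} cents)")
```

Under that assumed rate the estimate comes to roughly eight-hundredths of a cent, which is consistent with the “less than one-tenth of a cent” figure above.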

Why Video Scraping Matters

Willison calls the method “video scraping.” The idea is direct: record what appears on a screen, give the video to an AI model, and ask the model to extract the useful information.
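The three steps above could also be run from a script rather than through a web interface. The following is a minimal sketch and not Willison’s method, since he used AI Studio directly. It assumes the third-party `google-generativeai` Python package is installed, that an API key is available, and that the model name `gemini-1.5-flash-002` is still served. The prompt wording and the function name `scrape_video` are hypothetical.

```python
import time

# Hypothetical prompt wording, written for this example only.
PROMPT = (
    "Identify the price data shown in this video. "
    "Return JSON with the date and dollar amount of each charge."
)


def scrape_video(video_path: str, api_key: str,
                 model_name: str = "gemini-1.5-flash-002") -> str:
    """Upload a screen recording and ask the model to extract data from it."""
    # Imported lazily so the module loads even without the package installed.
    # ASSUMPTION: the google-generativeai SDK is available in the environment.
    import google.generativeai as genai

    genai.configure(api_key=api_key)

    # Step 1: hand the recorded video to the service.
    video = genai.upload_file(path=video_path)

    # Uploaded videos are processed asynchronously; poll until ready.
    while video.state.name == "PROCESSING":
        time.sleep(2)
        video = genai.get_file(video.name)
    if video.state.name == "FAILED":
        raise RuntimeError("The service could not process the video.")

    # Steps 2 and 3: give the video and the prompt to the model together.
    model = genai.GenerativeModel(model_name)
    response = model.generate_content([video, PROMPT])
    return response.text
```

Calling `scrape_video` with the path to a recording and a valid key would return the model’s text reply, which still needs to be parsed and checked before anyone relies on it.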

That matters because many data sources are difficult to scrape with older techniques. Data can be locked inside awkward formats, displayed through web applications, or presented in ways that resist normal extraction.

For data journalists, this is a familiar problem. Willison is also a data journalist and has created tools such as the Datasette project, which lets anyone publish data as an interactive website. His interest in converting unstructured material into structured data fits that background.

He has tested the same broad idea before. In February, he used a seven-second video of books on his bookshelves and asked Gemini 1.5 Pro to identify the book titles and place them into an organized list.

The key advantage is that the AI model is not limited to a clean text field or a convenient export button. If the information appears on screen, the model can potentially process it. As Willison put it, “There’s no level of website authentication or anti-scraping technology that can stop me from recording a video of my screen while I manually click around inside a web application.”

From Text Prompts To Screen Awareness

The experiment reflects a larger change in AI interfaces. Models such as Google’s Gemini and OpenAI’s GPT-4o are multimodal, meaning they can work with audio, video, image, and text input.

These systems convert different media into tokens, or chunks of data, and then predict what tokens should come next. The original article suggests that “token prediction model” may describe today’s multimodal systems more accurately than “LLM,” though no general replacement term has taken hold.

Once video becomes an ordinary input, users no longer have to describe every screen state in words. A person could show the model what is happening instead. In the original article’s example, an AI system could help a user through a difficult pizza website interface by performing the needed mouse clicks itself.

Large AI companies are already exploring this direction, though they may describe it as “video understanding” or “vision” rather than video scraping.

In May, OpenAI demonstrated a prototype version of its ChatGPT Mac app with an option for ChatGPT to see and interact with what is on a user’s screen, but that feature has not yet shipped. Microsoft also demonstrated a “Copilot Vision” prototype concept, based on OpenAI’s technology, that will be able to “watch” the screen, help extract data, and interact with running applications.

At the same time, public access remains uneven. According to the original article, neither ChatGPT nor Anthropic’s Claude has yet shipped a public video input feature, possibly because processing the extra tokens from tokenized video is relatively computationally expensive.

The Cost Question

Willison’s result is notable because it suggests that this capability may become practical for routine work. A task that once required manual entry or custom scraping code could be handled through a short recording and a prompt.

For now, the economics are shaped by company subsidies. According to the original article, Google is heavily subsidizing users’ AI costs with its Search revenue and its own data centers, while OpenAI subsidizes its costs with investor dollars and help from Microsoft.

The broader trend the article describes is that AI compute costs are dropping, which could make these capabilities available to more users over time. If that continues, video scraping may become less like a novelty and more like a standard way to move information from screens into usable formats.

Privacy Depends On Control

The same capability that makes video scraping useful also creates risk. If an AI system can see a screen, it can potentially capture sensitive behavior and private information.

The original article contrasts Willison’s controlled experiment with systems that continuously record activity. Apps such as Rewind AI on the Mac and Microsoft’s Recall, which is being built into Windows 11, feed on-screen video into an AI model and store extracted data in a database for later recall.

That approach can create privacy concerns because it records everything a person does on a machine and collects it in one place that could later be hacked.

Willison’s approach is different because he chooses what to record and when to upload it. Even though his current method involves sending the video to Google for processing, the decision about exposure remains with him.

He summarized the benefit this way: “The great thing about this video scraping technique is that it works with anything that you can see on your screen… and it puts you in total control of what you end up exposing to the AI model.”

That control may become the dividing line for future screen-aware AI tools. The useful version helps people extract data from what they deliberately show it. The risky version watches broadly, stores too much, and turns everyday computer use into a searchable record.