How Fun-Tuning turns Gemini fine-tuning into an attack tool

Academic researchers developed Fun-Tuning, a method that uses Gemini’s fine-tuning interface to generate stronger prompt injections. The work shows how fine-tuning signals can help turn failed attacks into working ones against Gemini 1.0 Pro and Gemini 1.5 Flash.

WTF Index TERMINATOR
◄ Terminator 4 Idiocracy 1 ►

The story centers on a method that makes prompt-injection attacks against major AI systems more reliable, increasing security and control risks.

How Fun-Tuning turns Gemini fine-tuning into an attack tool

A new academic attack called Fun-Tuning shows how Gemini’s own fine-tuning API can be used to make prompt injections more reliable. The technique matters because prompt injection has already become one of the most important security problems around large language models, especially when those models process outside content.

The core issue is simple but difficult to solve. A model may be asked to follow instructions from its developer while also reading emails, code comments, documents, or other external material. If hostile instructions are hidden inside that material, the model can be pushed toward actions its operator did not intend.

Why prompt injection is hard to engineer

Indirect prompt injections exploit a weakness in how large language models handle competing instructions. The source describes them as a powerful way to attack systems such as OpenAI’s GPT-3 and GPT-4, Microsoft’s Copilot, and Google’s Gemini.

The potential consequences are serious. The source gives examples including exposure of confidential contacts or emails and false answers that could damage important calculations. In both cases, the attack does not need to break into a conventional database. It needs the model to treat outside text as if it were a valid instruction.

Until now, creating reliable injections against closed-weights models has required extensive trial and error. Models such as GPT, Anthropic’s Claude, and Google’s Gemini do not reveal their underlying code or training data to outside users. That makes them black boxes from the attacker’s point of view.

This opacity has limited attackers as well as defenders. Without direct access to the internal model, a person trying to build a working injection must keep testing variations manually. Some attempts work quickly, while others can take far longer.

What Fun-Tuning changes

Fun-Tuning replaces much of that manual guessing with an algorithmic process. According to the source, academic researchers devised a way to create computer-generated prompt injections against Gemini with much higher success rates than manually written ones.

The method misuses fine-tuning, a feature meant to adapt a model for specialized data. The source lists examples such as a law firm’s legal case files, patient files or research managed by a medical facility, and architectural blueprints. Google makes fine-tuning for Gemini’s API available free of charge.

Fun-Tuning starts with one or more prompt injections and then searches for prefix and suffix strings that make the model more likely to follow the attacker’s instruction. These additions can look like meaningless fragments to a human reader, but they are made from tokens that matter to the model.

In one example, a prompt injection tried to make Gemini accept an incorrect math-related output of '10' instead of the correct answer of 5. The injection did not succeed by itself. After Fun-Tuning added generated text before and after it, the attack worked.

The researchers’ point is that this makes the process more systematic. Earlence Fernandes, a University of California at San Diego professor and co-author of the paper, said: "A key difference is that our attack is methodical and algorithmic—run it, and you are very likely to get an attack that works against a proprietary LLM."

How the attack uses fine-tuning signals

The source says creating an optimized prompt injection with Fun-Tuning requires about 60 hours of compute time. Because the Gemini fine-tuning API used in the process is free of charge, the total cost of such attacks is about $10. The process can return useful optimizations in less than three days.

That is possible because fine-tuning reveals subtle information about how a model responds during training. Fine-tuning measures errors with a numerical score called a loss value. The loss value reflects the gap between what the model produced and what the trainer wanted.

The source explains this with a simple next-word example. If a model is trained on the sequence "Morro Bay is a beautiful..." and predicts "car," the loss score is high. If it predicts "place," the loss score is lower because that answer better matches the expected result.

Fun-Tuning uses these loss signals to search through combinations of prefixes and suffixes. The aim is to identify token sequences that make a prompt injection more likely to succeed. The researchers described this as an adversarial discrete optimization method.

Nishit Pandya, a co-author and PhD student at UC San Diego, concluded that "the training loss serves as an almost perfect proxy for the adversarial objective function when the length of the target string is long."

What the examples show

The source describes attacks against Gemini 1.5 Flash and Gemini 1.0 Pro. In one case, a prompt injection was hidden inside Python code as a harmless-looking comment. Alone, it failed against Gemini 1.5 Flash. With Fun-Tuning’s added prefix and suffix, it succeeded.

Another example used different generated additions around another failed injection. With those additions, the prompt injection worked against Gemini 1.0 Pro.

The important detail is not that the generated strings are readable. They are not meant to be persuasive to a person. They are meant to interact with the model’s token-level behavior in a way discovered through optimization.

That shifts prompt injection from a craft based mainly on repeated manual testing toward a more automated search problem. For defenders, it underlines a difficult reality: features designed to make models more useful, such as fine-tuning APIs, can also reveal signals that help attackers improve their methods.