Can Flint Push LLMs Beyond Predictable Answers?

Mainstream LLMs can produce strikingly similar answers to open-ended prompts, which is useful for reliable work but limiting for brainstorming. Springboards is testing Flint, a model designed to introduce more variety at the points where a response can branch into something less expected.

Ask several popular chatbots for something open-ended, and the answers may be less individual than the interface suggests. The issue is not that large language models cannot respond. It is that they often respond in the same direction.

That pattern matters most when the task is not coding, research, or another job where predictability can be an advantage. For brainstorming, naming, advertising, travel ideas, and other creative work, a model that keeps returning familiar options can quietly narrow the field before a person has had a chance to explore it.

The predictability problem

The source article opens with a simple test: ask a chatbot for a random number between 1 and 10. The answer is often 7. Ask again, and the next results tend to follow a recognizable path, commonly 3 or 4, then 8 or 9.

That is not proof that every model will always behave the same way. It is a demonstration of a larger issue: many LLMs favor responses that sit near the center of what they have learned to produce. In ordinary use, that can feel coherent and helpful. In creative use, it can feel like a rut.

Springboards, an Australian startup, is trying to build around that limitation. Its model, Flint, is designed to produce a wider spread of answers to open-ended prompts. The company’s cofounder and CEO, Pip Bingemann, puts the contrast bluntly. “Most language models are fighting hallucinations,” he says. “We welcome them.”

That does not mean Flint is meant to be correct in every factual setting. The point is different. Springboards wants a model that can help users move away from the most expected answer when the goal is ideation rather than certainty.

How Flint differs from mainstream LLMs

In one demonstration, ChatGPT and Claude both answered 7 when prompted for a random number between 1 and 10. Flint also first returned 7. But after restarting the session and prompting again, ChatGPT gave 7, Claude gave 7, and Flint gave 3.7916.

The same pattern appeared in other prompts. When asked to name a type of car, Bingemann expected ChatGPT and Claude to answer Toyota or Honda, and they did. Flint answered Ford F-150. His view is that models have access to a broader range of possible answers but often do not surface them.

A campaign tagline test made the same point in a more practical setting. Asked for a tagline for New Balance running shoes, Claude answered “Run your way.” ChatGPT also answered “Run your way.” Flint answered “Built to last, run to win.” The Flint line may not be a finished campaign, but it showed the variation Springboards is chasing.

That distinction is important. Flint is not being presented as a machine that automatically produces better ideas. It is being positioned as a model that can throw a different option into the room, giving human teams more material to evaluate.

Why repetition is getting attention

The article points to growing research interest in this kind of model convergence. A paper titled “Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)” found a high degree of repetition within and across LLMs. The researchers asked 25 different LLMs to write a metaphor about time 50 times each, yielding 1,250 responses.

Most of those responses clustered around versions of “Time is a river” or “Time is a weaver.” The researchers suggested that this may happen because many LLMs are trained in similar ways, on similar data, for similar tasks. The paper won the best paper award at NeurIPS.

Kieran Browne, Springboards cofounder and CTO, argues that the chat interface can hide how standardized these outputs are. A conversation may feel personal, even when many users are being steered toward the same ideas.

The article gives another example: band names. Browne says many models tend toward words such as “glass,” “neon,” “velvet,” or “static.” In the article’s test, ChatGPT produced “Glass Harbor,” “Static Empire,” “Neon Hearts,” and “Velvet Echo,” while Gemini suggested “Static Horizon.”

Building for controlled variety

Springboards built Flint on top of Qwen 3, an open-source model from Alibaba. Browne says the company is too small to train a foundation model from scratch: “Training a foundation model is not on the table for us. It’s just too expensive.”

The obvious approach would be to raise a model’s randomness setting, often called temperature. Springboards explored that route, but Browne says it was too blunt. Turning up the temperature everywhere can make output less coherent, including responses that switch from English into code halfway through a sentence.
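
For readers who want to see what the temperature setting does, here is a minimal sketch in Python. It is not Springboards' code or any vendor's API. The scores in LOGITS are invented for illustration, and softmax_with_temperature and sample are hypothetical helper names. The idea is only that dividing a model's raw scores by a temperature before converting them to probabilities sharpens the distribution when the temperature is low and flattens it when the temperature is high.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; higher temperature flattens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample(options, logits, temperature, rng):
    """Draw one option according to the temperature-adjusted probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(options, weights=probs, k=1)[0]


# Invented scores for "pick a number between 1 and 10"; 7 is the favorite.
NUMBERS = list(range(1, 11))
LOGITS = [0.0, 0.5, 1.5, 1.0, 0.5, 0.5, 3.0, 1.2, 0.8, 0.0]

low = softmax_with_temperature(LOGITS, 0.5)   # sharper: 7 dominates
high = softmax_with_temperature(LOGITS, 2.0)  # flatter: other numbers gain

if __name__ == "__main__":
    rng = random.Random()
    print("P(7) at temperature 0.5:", round(low[6], 3))
    print("P(7) at temperature 2.0:", round(high[6], 3))
    print("one draw at temperature 2.0:", sample(NUMBERS, LOGITS, 2.0, rng))
```

Because the same temperature applies at every step of a response, raising it also boosts unlikely choices in places where only one continuation makes sense, which fits the loss of coherence Browne describes.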

Springboards instead trained Flint to identify where variety is useful inside a response. For a prompt such as “Where should I go in Europe?” the model does not need more randomness in every word. It needs more variety when it chooses the destination.

That approach makes Flint less about chaos and more about targeted divergence. The aim is to keep the response usable while allowing specific words or phrases to break away from the most common path.
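
The article does not describe how Flint decides where variety belongs, so the following Python sketch is only a toy illustration of the general idea, not Springboards' method. It assumes the branch point is already known and marked by hand, and every name and number in it (STEPS, generate, the candidate phrases and their scores) is hypothetical. A low temperature is used for the connective parts of the sentence and a higher one only where the destination is chosen.

```python
import math
import random


def softmax(logits, temperature):
    """Temperature-scaled softmax over a list of raw scores."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Each step: (candidate phrases, invented scores, is this a branch point?).
# In this toy the flag is set by hand; a real system would have to learn it.
STEPS = [
    (["You", "Blue"], [5.0, 0.0], False),
    (["should visit", "def main():"], [5.0, 0.0], False),
    (["Paris", "Rome", "Ljubljana", "Porto"], [3.0, 2.5, 1.0, 0.5], True),
    (["this spring.", "}"], [5.0, 0.0], False),
]


def generate(steps, base_temp, branch_temp, rng):
    """Sample each step, raising the temperature only at branch points."""
    words = []
    for candidates, logits, is_branch in steps:
        temp = branch_temp if is_branch else base_temp
        probs = softmax(logits, temp)
        words.append(rng.choices(candidates, weights=probs, k=1)[0])
    return " ".join(words)


if __name__ == "__main__":
    rng = random.Random()
    for _ in range(5):
        print(generate(STEPS, base_temp=0.2, branch_temp=1.5, rng=rng))
```

In this setup the sentence frame stays stable while the destination varies from run to run. Passing the higher temperature for both arguments would reproduce the blunt approach, in which fragments such as "def main():" start to leak into the sentence.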

Where this could fit

Springboards already has a tool for creative professionals in advertising and marketing. It uses multiple LLMs, including ChatGPT and Claude, and lets users move text around, select useful fragments, and combine them. Flint is being pitched as another model inside that workflow, especially when teams want more variety.

Zoe Scaman, founder of Bodacious and chief strategy officer at 77X, has been testing it. She says Flint is useful when she wants to move in very different directions. In one test involving a classic MBA case study about reinventing a finance company for today’s youth, she said Claude, Gemini, and ChatGPT followed a familiar route around financial literacy. Flint suggested reframing the idea of wealth accumulation.

Flint is still a prototype, and Scaman says it does not always hold up when pushed too far. Maximilian Weigl, cofounder and chief strategy officer at Uncommon, sees value in the model’s ability to introduce an odd option, but he also says most ordinary uses do not need extremes.

That may be the practical lesson. Familiar AI output is not always a failure. Sometimes “good enough” is exactly what a user wants. But when the goal is to find a less obvious idea, LLM groupthink becomes a real constraint.

Springboards is arguing for choice. A mainstream model can still be useful for clarity, structure, and reliable phrasing. A model like Flint may be useful when the first answer is too average, the second is too similar, and the work needs a sharper turn before people decide what is actually worth keeping.