The Decoder, August 18, 2024

Why Cosine says Genie changes AI coding assistance

Cosine says its Genie assistant, trained with OpenAI on a GPT-4o variant, reached a 30 percent score on SWE-Bench. The company argues that curated data and synthetic correction cycles helped the model learn developer reasoning, though the benchmark results are self-reported.

Cosine is making a direct claim about the next phase of AI coding tools: better performance will come less from clever prompting and more from training models on the way software developers actually think through work.

The San Francisco-based AI startup has unveiled Genie, an AI model for software developers built with OpenAI on a GPT-4o variant. Cosine says Genie can help with bug fixes, new features, code restructuring, and other programming tasks, and that it can work either autonomously or in collaboration with developers.

A benchmark claim aimed at developer tools

Cosine co-founder and CEO Alistair Pullen says Genie reached a 30 percent score on the SWE-Bench test. According to the company, that is the highest score reported for an AI model in this area.

The comparison matters because the source presents SWE-Bench as a key test for coding-focused language models. Cosine says Genie’s 30 percent result puts it ahead of Amazon’s models, listed at 19 percent, and Cognition’s Devin, listed at 13.8 percent on only a portion of SWE-Bench.

Those numbers frame Genie as more than another interface around a general-purpose model. Cosine’s argument is that software engineering requires a model to follow the kind of reasoning developers use when navigating a codebase, diagnosing a problem, trying a fix, and improving it when the first attempt fails.

There is an important caveat: the available benchmark information is company-reported. That does not make the claim irrelevant, but it does mean the score should be read as Cosine’s reported result rather than an independently established industry conclusion.

Why Cosine trained rather than only prompted

Cosine says it worked with OpenAI to train a GPT-4o variant using high-quality data. Pullen’s stated view is that many coding assistants run into the same limit when they mainly wrap general-purpose models and offer them as separate products.

"Everyone working on this problem is butting up against the same limit of model intelligence, this is why we chose to train rather than prompt," Pullen says.

That sentence explains the company’s central strategy. Instead of treating the base model as fixed and trying to coax better answers from it, Cosine tried to shape the model itself with data chosen for software engineering work.

The source describes this as an attempt to "codify human reasoning." In practice, that means Cosine wants Genie to reflect the process of experienced developers, not just produce code-like output. The company says the assistant is designed to fix bugs, build features, restructure code, and handle varied programming tasks in ways that mirror developer cognition.

The data strategy behind Genie

Cosine says Genie was developed through a proprietary process that trained and fine-tuned a non-public GPT-4o variant with billions of tokens of high-quality data. The company spent nearly a year curating the data with help from experienced developers.

The dataset described in the source is heavily weighted toward common web and software development languages. JavaScript and Python each make up 21 percent, TypeScript and TSX each make up 14 percent, and other languages, ranging from Java to C++ to Ruby, account for 3 percent each.
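The stated shares can be checked with simple arithmetic. The sketch below is illustrative only: it assumes the percentages add up to 100, which the source does not state, and the names `named_shares`, `remaining_share`, and `implied_other_languages` are made up for this example.

```python
# Shares reported in the source, in percent.
named_shares = {"JavaScript": 21, "Python": 21, "TypeScript": 14, "TSX": 14}

# The source gives 3 percent "each" for the remaining languages.
OTHER_SHARE_EACH = 3


def remaining_share(shares):
    """Percent left over after the named languages, assuming a 100 percent total."""
    return 100 - sum(shares.values())


def implied_other_languages(shares, share_each):
    """How many 'other' languages the leftover share would cover (an inference, not a reported figure)."""
    return remaining_share(shares) / share_each


print("Named languages:", sum(named_shares.values()), "percent")
print("Remaining:", remaining_share(named_shares), "percent")
print("Implied other languages:", implied_other_languages(named_shares, OTHER_SHARE_EACH))
```

Under that assumption, the four named languages cover 70 percent of the data, which would leave 30 percent, or roughly ten other languages at 3 percent each.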

That mix matters because coding assistants are only as useful as their ability to operate across the languages and project types developers actually encounter. Cosine’s emphasis is not just on data volume, but on data quality and relevance.

The company also used synthetic data to address a specific training weakness. At first, Genie learned mostly from perfect, working code. That helped with correct examples, but the model struggled when it had to recover from its own wrong answers.

Cosine’s response was to show the model what improvement looked like. When Genie’s first proposed solution was incorrect, the model was shown how to move toward the correct result. According to the company, repeated cycles of this process improved Genie’s solutions and reduced the need for corrections.
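Cosine has not published how these correction cycles were built, so the following is only a minimal sketch of the general idea, not the company's method. It pairs a failing attempt and its error with a working fix to form one training record. Every name here (`CorrectionExample`, `run_candidate`, `build_correction_example`, `check_add`) is hypothetical, and the toy task stands in for real software engineering work.

```python
from dataclasses import dataclass


@dataclass
class CorrectionExample:
    """One hypothetical training record: a wrong attempt plus the path to a fix."""
    task: str
    failed_attempt: str
    error: str
    corrected: str


def run_candidate(source, check):
    """Run candidate code and its check; return an error string, or None on success.

    exec() is used only to keep the illustration short. It is not sandboxed.
    """
    namespace = {}
    try:
        exec(source, namespace)
        check(namespace)
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


def build_correction_example(task, attempt, reference, check):
    """Return a CorrectionExample if the attempt fails and the reference passes."""
    error = run_candidate(attempt, check)
    if error is None:
        return None  # the first attempt was already correct, nothing to learn from
    if run_candidate(reference, check) is not None:
        raise ValueError("reference solution does not pass its own check")
    return CorrectionExample(task, attempt, error, reference)


def check_add(namespace):
    """Toy check for the example task."""
    if namespace["add"](2, 3) != 5:
        raise AssertionError("add(2, 3) should be 5")


example = build_correction_example(
    task="Implement add(a, b)",
    attempt="def add(a, b):\n    return a - b\n",
    reference="def add(a, b):\n    return a + b\n",
    check=check_add,
)
print(example.error)
```

A record like this shows the model the wrong answer, the signal that it was wrong, and the corrected result together, which is the kind of improvement step the company describes.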

Context windows and the timing of the product

Pullen saw the potential for large language models to support human software developers in early 2022. The source says the technology was not yet advanced enough at that time to deliver the vision behind Genie.

One limitation was context. The source notes that context windows were often limited to 4,000 tokens, creating a bottleneck for code work where the model may need to understand more than a short prompt or isolated function.

The environment has changed. Models such as Gemini 1.5 Pro can process up to two million tokens in one prompt. Cosine has not disclosed the size of Genie’s context window.

Even without that detail, the broader point is clear: coding assistants become more useful when they can consider more of the surrounding project. Debugging, refactoring, and feature work often depend on relationships between files, patterns, and prior decisions. Cosine’s public positioning suggests Genie is meant to operate closer to that reality.
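To make the context bottleneck concrete, here is a rough sketch of how many project files fit in a 4,000-token window compared with a two-million-token one. It assumes about four characters per token, a common rule of thumb rather than any real tokenizer, and the file names and sizes are invented. Nothing here reflects Genie's actual limits, which are undisclosed.

```python
CHARS_PER_TOKEN = 4  # assumption: rough heuristic, not a real tokenizer


def estimate_tokens(text):
    """Approximate token count, rounding up."""
    return (len(text) + CHARS_PER_TOKEN - 1) // CHARS_PER_TOKEN


def files_that_fit(files, budget):
    """Walk files in order and keep each one that still fits in the token budget."""
    used = 0
    included = []
    for name, text in files.items():
        cost = estimate_tokens(text)
        if used + cost > budget:
            continue  # this file would overflow the window, so skip it
        used += cost
        included.append(name)
    return included


# Invented project: file contents are placeholders of a given length.
files = {
    "utils.py": "x" * 6000,
    "models.py": "x" * 8000,
    "views.py": "x" * 12000,
}

print(files_that_fit(files, 4_000))
print(files_that_fit(files, 2_000_000))
```

With the small budget, part of even this tiny project has to be left out, while the larger window holds all of it with room to spare.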

What Cosine plans next

Cosine does not present Genie as the end of its model work. The company plans to expand its portfolio with smaller, specialized models and larger, more general ones. It also plans to increase involvement in open-source communities and to improve Genie regularly based on customer feedback.

The source says full model training is not ruled out for the future, given the size of the dataset. That leaves open a path beyond fine-tuning or adapting existing model variants.

Pullen also argues that the same idea may extend outside software development.

"We truly believe that we’re able to codify human reasoning for any job and industry. Software engineering is just the most intuitive starting point and we can’t wait to show you everything else we’re working on."

For now, the product is still limited in availability. Cosine will offer Genie in two pricing tiers: a roughly $20 option with some feature and usage limitations, and an enterprise-level offering with advanced features and virtually unlimited usage. Interested parties can currently only join a waiting list.

The company, described as a Y Combinator spin-off, recently secured $2.5 million in seed funding from various venture capital firms to support Genie’s development and plans.

Cosine’s claim combines two lessons highlighted in the source: models may perform better when they imitate human work patterns, and the quality of training data is crucial. If Genie’s self-reported SWE-Bench result holds up under broader scrutiny, it would strengthen the case for AI coding assistants trained around developer reasoning rather than positioned only as general models with programming prompts.