How bacterial DNA helped Evo generate new proteins

A Stanford University team trained a genomic language model called Evo on bacterial genomes, where related genes often sit near each other. In tests, Evo generated functional antitoxins and CRISPR inhibitors, including proteins with little or no similarity to known ones.

A genomic AI system generating functional novel proteins and CRISPR inhibitors points to more powerful biology design with some dual-use risk.

AI has already shown that it can connect protein structure with protein function. The newer question is whether a system can work one level earlier, at the DNA sequence itself, and still produce biological outputs that function.

A small team at Stanford University tested that idea with Evo, a genomic language model trained on bacterial genomes. The results suggest that bacterial DNA organization can give AI enough context to generate proteins and RNA-related sequences that do more than resemble biology: in several tests, they worked.

Why bacterial genomes gave Evo useful clues

The project depends on a simple feature of many bacterial genomes. Genes with related jobs are often located next to one another. A bacterium may keep the genes for importing and digesting a sugar, or for synthesizing an amino acid, in the same neighborhood of its genome.

In many cases, those genes are transcribed into a single, large messenger RNA. That arrangement lets bacteria control an entire biochemical pathway together, which can make bacterial metabolism more efficient.

Evo was trained on an enormous collection of bacterial genomes. Its task resembled the training used for large language models: it predicted the next base in a sequence and was rewarded when the prediction was correct. Because Evo is generative, it can also take a prompt and return new sequences, with different possible outputs from the same prompt.
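As a rough illustration of next-base prediction and prompted generation, the sketch below swaps in a simple k-mer count model. It is not Evo's architecture or training procedure: the function names, the context length `k`, and the toy training sequence are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def train_next_base_model(sequences, k=3):
    """Count which base follows each k-base context in the training sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(len(seq) - k):
            counts[seq[i:i + k]][seq[i + k]] += 1
    return counts


def generate(counts, prompt, length, k=3, seed=None):
    """Extend the prompt one base at a time by sampling from the learned counts."""
    rng = random.Random(seed)
    out = prompt
    for _ in range(length):
        options = counts.get(out[-k:])
        if not options:
            # Unseen context: fall back to a uniform choice over the four bases.
            out += rng.choice("ACGT")
            continue
        bases, weights = zip(*sorted(options.items()))
        out += rng.choices(bases, weights=weights)[0]
    return out


# Toy usage: a perfectly repetitive "genome" makes every next base predictable.
model = train_next_base_model(["ACGTACGTACGT"], k=3)
print(generate(model, "ACG", 5, k=3, seed=0))
```

On real genomes, most contexts have several plausible next bases, so sampling with a different seed gives a different continuation. That is the sense in which one prompt can yield many outputs.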

The researchers described the model as able to “link nucleotide-level patterns to kilobase-scale genomic context.” In plain language, Evo was not only looking at single DNA letters in isolation. It was learning how local sequence patterns fit into larger genomic neighborhoods.

Completing genes and reading genomic context

The first tests asked whether Evo could fill in missing information from known genes. When given 30 percent of the sequence of a gene for a known protein, Evo was able to output 85 percent of the rest. When given 80 percent of the sequence, it could return all of the missing sequence.
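A completion test of this kind can be sketched as: hide the end of a known gene, ask the model for the rest, and score how much of the hidden part comes back. The article does not say how the recovered fraction was measured, so the position-by-position comparison below is an assumption, as are the function names and the toy sequences.

```python
def split_gene(gene, prompt_fraction):
    """Split a gene into the visible prompt and the held-out remainder."""
    cut = round(len(gene) * prompt_fraction)
    return gene[:cut], gene[cut:]


def recovery(held_out, predicted):
    """Fraction of held-out positions that the prediction reproduces exactly."""
    if not held_out:
        return 1.0
    matches = sum(a == b for a, b in zip(held_out, predicted))
    return matches / len(held_out)


# Toy usage with a made-up 10-base "gene" and a made-up model output.
prompt, hidden = split_gene("ATGGCCAAGT", 0.5)
print(prompt, hidden, recovery(hidden, "CAAGA"))
```

A prediction shorter than the held-out region is simply scored as missing the uncovered positions, which keeps the measure conservative.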

The model also performed a broader contextual task. When a single gene was removed from a functional cluster, Evo could identify and restore the missing gene. That mattered because the test was not only about memorizing a familiar sequence. It also asked whether the model could infer what belonged in a gene neighborhood.

When Evo changed a sequence, those changes tended to appear in parts of the protein where variation is tolerated. That suggests the training data helped the system absorb constraints that evolution places on known genes. The model appeared to distinguish between regions that can vary and regions that are more important to preserve.

Generating antitoxins that bacteria could use

The researchers then moved from completion to generation. They focused on bacterial toxins, which are typically encoded alongside an antitoxin. The antitoxin keeps the cell from killing itself when the toxin genes are activated.

This was a useful test case because many toxin and antitoxin pairs are known, and they can evolve rapidly as bacteria compete with their rivals. The team developed a toxin that was only distantly related to known ones and had no known antitoxin, then gave its sequence to Evo as a prompt.

To make the test harder, the researchers filtered out any Evo responses that looked similar to known antitoxin genes. They then tested 10 of the model’s outputs.

  • Half at least partly rescued bacteria from the toxin's effects.
  • Two fully restored growth to bacteria producing the toxin.
  • Those two antitoxins had about 25 percent sequence identity to known antitoxins.
  • They did not appear to be simple patchworks of a few known antitoxins.

Explaining either successful antitoxin as a patchwork would require stitching together pieces of at least 15 to 20 different proteins. In an additional test, an output would have needed to be patched together from parts of 40 known proteins. In other words, Evo did not merely copy a known answer: the outputs were only weakly related to known sequences yet still had the needed function.
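Both the novelty filter and the roughly 25 percent identity figure rest on comparing a candidate with known sequences. The sketch below shows one simple way to compute percent identity, using a basic Needleman-Wunsch global alignment, and to drop candidates that resemble known proteins. The scoring values, the `max_identity` cutoff, and the toy sequences are hypothetical. The article does not describe the researchers' actual tools or thresholds, and real pipelines typically rely on dedicated alignment software.

```python
def percent_identity(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align a and b; return identical columns / alignment length."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back through the table, counting identical and total columns.
    i, j, identical, columns = n, m, 0, 0
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            identical += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
        columns += 1
    return identical / columns if columns else 0.0


def filter_novel(candidates, known, max_identity=0.3):
    """Keep candidates whose identity to every known sequence is below the cutoff."""
    return [c for c in candidates
            if all(percent_identity(c, k) < max_identity for k in known)]


# Toy usage with made-up protein strings.
print(filter_novel(["MKTAYIAK", "WWWWWWWW"], known=["MKTAYIAK"], max_identity=0.5))
```

Dividing by alignment length rather than by the shorter sequence penalizes gaps, which is one of several common conventions for reporting identity.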

Evo also produced useful output beyond proteins. With a different toxin that had an RNA-based inhibitor, the system generated DNA encoding RNAs with the right structural features, even though the specific sequence was not closely related to anything known.

CRISPR inhibitors showed the same pattern

The team ran a related test using inhibitors of the CRISPR system. CRISPR is best known as a gene-editing tool, but bacteria evolved it as protection against viruses. Naturally occurring CRISPR inhibitors are highly diverse, and many do not appear closely related to one another.

Again, the researchers filtered Evo’s outputs, keeping only sequences that encoded proteins and removing any proteins that looked like known examples. Of the outputs they then produced as proteins, 17 percent inhibited CRISPR function.
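The first filtering step, checking that a generated DNA sequence could encode a protein at all, can be approximated with a simple open-reading-frame test. This is a deliberately simplified sketch and not the researchers' method: it accepts only `ATG` as a start codon (bacteria also use alternatives such as `GTG` and `TTG`), and the `min_codons` cutoff is a hypothetical parameter.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def encodes_protein(dna, min_codons=30):
    """Return True if dna is a single open reading frame: ATG ... stop, no internal stops."""
    dna = dna.upper()
    if len(dna) % 3 != 0 or not dna.startswith("ATG"):
        return False
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    if codons[-1] not in STOP_CODONS:
        return False
    if any(codon in STOP_CODONS for codon in codons[1:-1]):
        return False
    # Count coding codons only; the final stop codon is excluded.
    return len(codons) - 1 >= min_codons


# Toy usage with made-up sequences and a tiny length cutoff.
outputs = ["ATGGCCTAA", "ATGTAAGCCTAA", "ATGGCCTA"]
print([seq for seq in outputs if encodes_protein(seq, min_codons=2)])
```

Sequences that pass a check like this would still need the similarity filter and, ultimately, a lab test: the 17 percent figure comes from measured CRISPR inhibition, not from any computational score.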

Two outputs stood out. They had no similarity to any known proteins and also confused software designed to predict three-dimensional protein structure. That makes them especially notable because Evo appears to have reached functional protein outputs without directly designing around protein structure.

This is different from directed protein design work that aims at a particular protein shape or a particular useful enzyme behavior. Evo’s route starts at nucleic acids, closer to the level where evolutionary changes happen before they show up as proteins.

What this does and does not prove

After those tests, the researchers prompted Evo with 1.7 million individual genes from bacteria and the viruses that prey on them. The result was 120 billion base pairs of AI-generated DNA. Some of it contains genes already known, and some presumably contains novel material.

The source article notes that it is not clear how this resource will be used productively. The larger point is that Evo can generate a very large DNA sequence collection after learning patterns from bacterial genomes.

There are also limits. It is not clear that the same approach will work for more complex genomes. Vertebrates mostly do not organize related genes in clusters, and their genes have more intricate structures. Those differences could make it harder for a model that learns statistical patterns of bases and genomic context to pick up useful signals.

Still, the result is conceptually important. Evo suggests that AI can learn from genome-level organization, not only from protein sequences or protein structures. In bacterial systems, where related genes often sit together, that context was enough to help generate functional molecules that were weakly related to known ones or, in some cases, unlike any known proteins.