The Decoder, February 20, 2025

How Evo 2 Pushes AI Biology Toward Whole-Genome Design

Evo 2 is described by its creators as the largest AI model yet built for biological applications. It can process long DNA contexts, generate complex genetic structures, and analyze variants across bacteria, archaea and eukaryotes, and it has been released as open source.

A powerful open-source genome-scale AI system that can generate complex biological sequences raises clear bio-design and misuse concerns.

Evo 2 brings a language-model approach to DNA at a scale its research team says has not been reached before in biological AI. The model is designed to read, predict and generate biological sequences across a wide range of life forms, from molecular patterns to genome-scale structures.

The work comes from researchers at the Arc Institute, Stanford University, UC Berkeley, UC San Francisco and Nvidia. Their central claim is straightforward: by training on a very broad genome atlas, Evo 2 can learn useful patterns in DNA sequence data without being separately tuned for each biological task.

A Genome Model Built Across Life Forms

Evo 2 was trained on an atlas containing 9.3 trillion DNA base pairs. That dataset spans bacteria, archaea and eukaryotes, and represents more than 100,000 species.

That breadth matters because the model is not limited to a single narrow category of organisms. According to the researchers, Evo 2 can predict and design biological sequences from molecular to genomic scales across all life forms included in its training scope.

The team built two versions of Evo 2. One has 7 billion parameters, while the larger version has 40 billion parameters. Both versions can process sequence contexts up to 1 million base pairs long.

That long context window is one of the model's defining features. DNA function can depend on relationships spread across large stretches of sequence, so a system that can work with longer contexts has a different operating range from one limited to short fragments.
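A minimal Python sketch can make the point concrete. The 100,000-base-pair spacing between the two sites is a hypothetical value chosen for illustration, the function name is invented here, and the 8,000-base-pair short context is simply the Evo 1 figure reported later in this article.

```python
def fits_in_one_window(pos_a: int, pos_b: int, context_bp: int) -> bool:
    """Return True if both positions can fall inside a single context window."""
    # Covering both positions requires (distance + 1) bases.
    return abs(pos_a - pos_b) + 1 <= context_bp


# Hypothetical example: a regulatory site and the gene it influences.
regulatory_site = 0
gene_start = 100_000  # illustrative spacing, not a measured value

SHORT_CONTEXT_BP = 8_000      # Evo 1 context, per this article
LONG_CONTEXT_BP = 1_000_000   # Evo 2 context, per this article

for context in (SHORT_CONTEXT_BP, LONG_CONTEXT_BP):
    visible = fits_in_one_window(regulatory_site, gene_start, context)
    print(f"context {context:>9,} bp: both sites visible together = {visible}")
```

Under these assumed numbers, only the longer window can hold both sites at once, which is the sense in which a long-context model has a different operating range.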

What Evo 2 Can Generate

In testing, Evo 2 showed it could generate complete mitochondrial genomes, prokaryotic genomes, and eukaryotic chromosomes. The generated structures matched the length and complexity of natural ones, according to the source article.

The model also appears to pick up biological characteristics from sequence data alone. The researchers say it can predict how genetic variants affect function by analyzing DNA sequences, without additional task-specific training.

One test involved mutations in the breast cancer gene BRCA1. In that setting, Evo 2 nearly matched the accuracy of the best existing AI models in identifying disease-causing changes.
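One common way to score variants without task-specific training is to compare the likelihood a sequence model assigns to the reference sequence with the likelihood it assigns to the mutated one. The article does not spell out Evo 2's scoring procedure, so the Python sketch below is only an illustration under that assumption. `ToyMarkovModel` is a hypothetical stand-in for a genomic language model, and the sequences are made up rather than taken from BRCA1.

```python
import math
from collections import Counter

BASES = "ACGT"


class ToyMarkovModel:
    """Order-k Markov model over DNA. Hypothetical stand-in, not Evo 2."""

    def __init__(self, training_seqs, k=2):
        self.k = k
        self.context_counts = Counter()
        self.transition_counts = Counter()
        for seq in training_seqs:
            for i in range(k, len(seq)):
                context = seq[i - k:i]
                self.context_counts[context] += 1
                self.transition_counts[(context, seq[i])] += 1

    def log_likelihood(self, seq):
        """Sum of log-probabilities of each base given the k bases before it."""
        total = 0.0
        for i in range(self.k, len(seq)):
            context = seq[i - self.k:i]
            # Add-one smoothing so unseen transitions get a small probability.
            numerator = self.transition_counts[(context, seq[i])] + 1
            denominator = self.context_counts[context] + len(BASES)
            total += math.log(numerator / denominator)
        return total


def apply_snv(seq, position, alt_base):
    """Return seq with the base at a 0-based position replaced by alt_base."""
    return seq[:position] + alt_base + seq[position + 1:]


def variant_score(model, ref_seq, position, alt_base):
    """Log-likelihood of the variant minus that of the reference.

    Negative values mean the model finds the variant less plausible.
    """
    alt_seq = apply_snv(ref_seq, position, alt_base)
    return model.log_likelihood(alt_seq) - model.log_likelihood(ref_seq)


# Made-up training data with a strong repeating pattern.
model = ToyMarkovModel(["ACGTACGTACGTACGTACGT"] * 5, k=2)
reference = "ACGTACGTACGT"
score = variant_score(model, reference, position=5, alt_base="A")
print(f"delta log-likelihood for C>A at position 5: {score:.3f}")
```

In this toy setting the substitution breaks the learned repeat, so the score comes out negative. A real model would be judged by how well such scores separate known disease-causing changes from benign ones.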

Those results do not mean the model can yet create living, working genomes. Brian Hie of Stanford and the Arc Institute acknowledges that Evo 2's generated genomes are better than those from its predecessor but likely would not yet function in living cells.

Why Chromatin Accessibility Matters

One of Evo 2's more notable capabilities involves chromatin accessibility. This refers to how tightly DNA is packed inside the cell nucleus.

That packaging affects whether cellular proteins can reach genes and activate them. If a region is accessible, it may be available for gene activity. If it is tightly packed, it may remain inactive.

The researchers used a method called inference-time search. In this process, Evo 2 generates multiple candidate sequences, and an evaluation function then scores them so that only the best-scoring candidates are kept.
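The generate-then-filter loop can be sketched in Python as a plain best-of-N search. The article does not say which search strategy the team used, so this is an assumption made for illustration. `toy_generate` and `toy_score` are hypothetical placeholders: the first samples random bases instead of sampling from a model, and the second rewards a GC content near an arbitrary target.

```python
import random

BASES = "ACGT"


def toy_generate(length, rng):
    """Hypothetical generator: random bases stand in for model sampling."""
    return "".join(rng.choice(BASES) for _ in range(length))


def gc_fraction(seq):
    """Fraction of bases in seq that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)


def toy_score(seq, target_gc=0.6):
    """Hypothetical evaluation function: higher is better, 0 is a perfect match."""
    return -abs(gc_fraction(seq) - target_gc)


def best_of_n(generate, score, n_candidates, length, seed=0):
    """Generate n_candidates sequences and keep the highest-scoring one."""
    rng = random.Random(seed)
    candidates = [generate(length, rng) for _ in range(n_candidates)]
    return max(candidates, key=score)


few = best_of_n(toy_generate, toy_score, n_candidates=4, length=50)
many = best_of_n(toy_generate, toy_score, n_candidates=256, length=50)
print(f"best of   4: score {toy_score(few):.3f}")
print(f"best of 256: score {toy_score(many):.3f}")
```

Because both runs share a seed, the larger search includes every candidate from the smaller one, so its best score can only match or improve. That is a toy version of spending more compute at inference time to get a better result.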

Using this approach, the team found that Evo 2 could control complex epigenomic structures such as chromatin accessibility. The source describes this as the first demonstration of scaling results for inference-time compute in biology.

The practical significance is that Evo 2 can design DNA sequences with specific epigenetic regulatory patterns. In plain terms, the model can be guided toward sequences where particular regions are intended to be accessible or inactive.
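One way to picture "guiding" a design is as a score that compares a predicted accessibility profile with a target pattern. The Python sketch below assumes that framing, which the article does not confirm. `window_signal` is a deliberately crude, hypothetical proxy (GC fraction per window) and says nothing about real chromatin; an actual pipeline would use a trained accessibility predictor in its place.

```python
def window_signal(seq, window):
    """Toy per-window 'accessibility' proxy: GC fraction. NOT a biological predictor."""
    signal = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        signal.append(sum(base in "GC" for base in chunk) / len(chunk))
    return signal


def pattern_score(seq, target, window):
    """Negative mean squared error between predicted and target patterns.

    target holds one value per window: 1.0 for 'accessible', 0.0 for 'inactive'.
    """
    predicted = window_signal(seq, window)
    if len(predicted) != len(target):
        raise ValueError("target must have one entry per window")
    mse = sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)
    return -mse


# Desired pattern: accessible, inactive, accessible.
target_pattern = [1.0, 0.0, 1.0]
matching = "GCGC" + "ATAT" + "CGCG"
inverted = "ATAT" + "GCGC" + "ATAT"

print("matching design:", pattern_score(matching, target_pattern, window=4))
print("inverted design:", pattern_score(inverted, target_pattern, window=4))
```

A search procedure would keep candidates that score closer to zero, which corresponds to designs whose predicted pattern better matches the intended one.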

Open Source Release And Evo 1 Comparison

The team has released Evo 2 as completely open source. That includes model parameters, training and inference code, and the OpenGenome2 dataset.

This makes Evo 2 one of the largest fully open models in the field. Like Evo 1, it uses a hybrid architecture from the StripedHyena series.

Compared with Evo 1, the new model is a major expansion. Evo 2 was trained on 30 times more data, includes eukaryotes, and extends sequence context from 8,000 to 1 million base pairs. The source says this expansion was enabled partly by the new StripedHyena 2 architecture.
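The size of that jump can be worked out from the figures reported in this article. In the Python snippet below, the implied Evo 1 training-set size is an inference, not a number the article states: it assumes the 30-fold figure is measured in base pairs and applies to the 9.3 trillion total.

```python
# Figures reported in this article.
EVO2_TRAINING_BP = 9.3e12      # 9.3 trillion base pairs
DATA_MULTIPLIER = 30           # "30 times more data" than Evo 1
EVO1_CONTEXT_BP = 8_000
EVO2_CONTEXT_BP = 1_000_000

# How much longer the context window became.
context_growth = EVO2_CONTEXT_BP / EVO1_CONTEXT_BP

# ASSUMPTION: the 30x figure counts base pairs and applies to the full atlas.
implied_evo1_training_bp = EVO2_TRAINING_BP / DATA_MULTIPLIER

print(f"context window growth: {context_growth:.0f}x")
print(f"implied Evo 1 training data: about {implied_evo1_training_bp / 1e9:.0f} billion bp")
```

The context window grew 125-fold, a larger multiple than the reported growth in training data.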

The earlier model could only work with prokaryotes. Evo 2, by contrast, makes genome-wide predictions across all life domains with improved accuracy.

Limits, Safety Choices And Open Questions

The model's scale does not settle every scientific question. Stanford computational biologist Anshul Kundaje praised the technical architecture but questioned whether Evo 2 truly understands remote non-coding sequences that regulate gene activity.

That caveat is important because biological function is not only about generating sequences that look realistic. The harder question is whether a model captures the regulatory logic that determines how DNA operates in living systems.

The team also made specific safety decisions. Pathogens that infect humans and other complex organisms were deliberately excluded from the training data for ethical and safety reasons. The researchers also ensured that the model does not give useful responses to queries about these pathogens.

For now, Evo 2 is best understood as a large open AI system for biological sequence prediction and design, not as a finished engine for creating functioning organisms. Its importance lies in the combination of scale, openness, long-context DNA modeling and early evidence that inference-time search can help steer biological design.