Open-Source AI Model Trained on Trillions of DNA Bases

Arc Institute and NVIDIA have released Evo 2, a 40-billion-parameter AI model trained on 9.3 trillion nucleotide bases from 128,000 species spanning every domain of life. Published in Nature in March 2026, it is the largest biological AI model ever built, capable of reading and generating DNA sequences up to one million bases long. But the real story is not what Evo 2 can do today. It is that biology just got its foundation model moment, and the downstream consequences will restructure drug discovery, synthetic biology, and the competitive landscape of biotech AI for the next decade.

The Architecture That Makes It Possible

To understand why Evo 2 matters, you need to understand why genomic AI has been stuck. In a transformer, the architecture behind GPT-4 and its descendants, the cost of self-attention grows quadratically with sequence length. A human gene might span 50,000 bases. A regulatory region might interact with elements hundreds of thousands of bases away. Fitting that context into a standard transformer is computationally brutal. Prior genomic models either truncated their inputs to a few thousand bases or used coarse-grained representations that threw away single-nucleotide resolution. Either way, they were blind to the long-range dependencies that define how genomes actually work.
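To put rough numbers on that quadratic cost, here is a back-of-the-envelope sketch in Python. It counts only the query-key pairs that full self-attention has to score per head, per layer. The 8,192-base window is an illustrative assumption standing in for "a few thousand bases", not a figure taken from any specific model.

```python
def attention_pairs(seq_len: int) -> int:
    """Query-key pairs scored by full self-attention (one head, one layer)."""
    return seq_len * seq_len


# Context lengths to compare. The first entry is an illustrative assumption.
contexts = {
    "short window (8,192 bases, assumed)": 8_192,
    "single human gene (50,000 bases)": 50_000,
    "one million bases": 1_000_000,
}

baseline = attention_pairs(8_192)
for label, length in contexts.items():
    pairs = attention_pairs(length)
    ratio = pairs / baseline
    print(f"{label}: {pairs:.2e} pairs ({ratio:,.0f}x the short window)")
```

Doubling the context quadruples the work, so moving from a window of a few thousand bases to a million bases multiplies the attention cost by a factor in the thousands, before memory is even considered.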

Evo 2 solves this with StripedHyena 2, a hybrid architecture that combines input-dependent convolution operators with selective attention mechanisms. The convolution layers handle local patterns efficiently, capturing motifs like transcription factor binding sites and splice signals. The attention layers handle long-range interactions, linking a promoter to an enhancer 500,000 bases away. The result is a model that trains nearly three times faster than an equivalently sized optimized transformer while processing sequences eight times longer than its predecessor, Evo 1.
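For readers who want a feel for what an "input-dependent convolution" can look like, here is a deliberately tiny Python sketch. It is not StripedHyena 2's implementation. It assumes one simple reading of the idea: a causal convolution whose output is multiplied by a gate computed from the input itself, so the same filter responds differently to different sequences. The function names, the sigmoid gate, and the filter values are all hypothetical.

```python
import math


def causal_conv(x, kernel):
    """Causal 1D convolution: out[t] depends only on x[t], x[t-1], ..."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            if t - k >= 0:
                acc += weight * x[t - k]
        out.append(acc)
    return out


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gated_conv(x, kernel, gate_weight=1.0):
    """Toy input-dependent convolution (an assumption, not the real operator).

    The convolution output at each position is scaled by a gate that is
    itself a function of the input at that position.
    """
    conv = causal_conv(x, kernel)
    return [sigmoid(gate_weight * xi) * ci for xi, ci in zip(x, conv)]


signal = [1.0, 2.0, 3.0]
print(causal_conv(signal, [1.0, 1.0]))
print(gated_conv(signal, [0.5, 0.25]))
```

The cost of the convolution grows with sequence length times filter length rather than with the square of the sequence length, which is the intuition behind letting convolution layers do most of the local work.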

This is not an incremental improvement. Evo 1, published in Science in 2024, handled 131,000 bases and was trained on 300 billion tokens, almost exclusively from prokaryotic genomes. Evo 2 processes one million bases and was trained on 9.3 trillion tokens drawn from OpenGenome2, a dataset of roughly 8.8 trillion nucleotides that spans bacteria, archaea, viruses, plants, fungi, and animals, including the human genome. The jump from Evo 1 to Evo 2 is roughly equivalent to the jump from GPT-2 to GPT-4 in both scale and capability.

The training ran on NVIDIA's infrastructure, and the model is integrated into NVIDIA's BioNeMo framework. But critically, the weights and code are fully open source on GitHub. This is a deliberate strategic choice, and one that has enormous implications for who gets to build the next generation of biological tools.

AlphaFold Solved Structure. Evo 2 Solves the Genome.

The obvious comparison is to DeepMind's AlphaFold, which predicted the 3D structures of nearly every known protein and won the 2024 Nobel Prize in Chemistry for its creators. AlphaFold is a landmark achievement, but it operates at a fundamentally different level of biological abstraction. It takes a single protein sequence and predicts its folded shape. It does not understand why that protein exists, how its gene is regulated, or what happens when a mutation in a non-coding region 200,000 bases upstream alters its expression level.

Evo 2 works at the genome level. It reads raw DNA, the actual source code of life, at single-nucleotide resolution. It can identify pathogenic mutations in human genes, including clinically significant BRCA1 variants, without any task-specific fine-tuning. It can predict the functional impact of non-coding mutations, the kind that fall in regulatory regions and have historically been invisible to computational tools. And it can generate entirely new genomic sequences, from mitochondrial genomes to bacterial chromosomes to yeast chromosomes, that are structurally and functionally plausible.
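A common way to score variants with a sequence model and no fine-tuning is to compare how likely the model finds the reference sequence versus the mutated one. The sketch below illustrates that idea in Python. It does not call Evo 2: ToyMarkovModel is a hypothetical stand-in (a small Markov chain with add-one smoothing) and the training sequence is made up, so only the scoring logic carries over.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


class ToyMarkovModel:
    """Hypothetical stand-in for a genomic language model."""

    def __init__(self, order=2):
        self.order = order
        self.counts = defaultdict(lambda: defaultdict(int))

    def fit(self, sequences):
        # Count which base follows each context of `order` bases.
        for seq in sequences:
            for i in range(self.order, len(seq)):
                self.counts[seq[i - self.order:i]][seq[i]] += 1
        return self

    def log_likelihood(self, seq):
        # Sum of log-probabilities of each base given its context,
        # with add-one smoothing so unseen bases never get zero probability.
        total = 0.0
        for i in range(self.order, len(seq)):
            context_counts = self.counts.get(seq[i - self.order:i], {})
            numerator = context_counts.get(seq[i], 0) + 1
            denominator = sum(context_counts.values()) + len(ALPHABET)
            total += math.log(numerator / denominator)
        return total


def variant_delta(model, reference, position, alt_base):
    """Log-likelihood of the variant minus that of the reference.

    A strongly negative value means the model finds the variant surprising.
    """
    variant = reference[:position] + alt_base + reference[position + 1:]
    return model.log_likelihood(variant) - model.log_likelihood(reference)


model = ToyMarkovModel(order=2).fit(["ATG" * 20])  # made-up training data
reference = "ATGATGATGATG"
print(variant_delta(model, reference, position=5, alt_base="C"))
```

With a real genomic model in place of the toy one, the same difference in log-likelihood becomes a zero-shot estimate of how disruptive a mutation is, whether it falls in a coding or a non-coding region.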

The two models are not competitors. They are complementary layers of a stack that does not fully exist yet. Imagine Evo 2 designing a novel gene regulatory circuit, then AlphaFold predicting the structure of the proteins it encodes, then a molecular dynamics simulator testing their interactions. That pipeline, from genome design to protein structure to functional simulation, is the endgame for computational biology. Evo 2 fills the first and most foundational layer.

But the competitive dynamics are worth noting. AlphaFold comes from Google DeepMind, and its latest version was developed with Isomorphic Labs, Alphabet's drug discovery spinout. Its access model is strategically constrained. Evo 2 is fully open, backed by a nonprofit institute co-founded by Patrick Collison (Stripe's CEO), Silvana Konermann, and Patrick Hsu. Arc Institute's explicit model is to fund high-risk science without traditional grant dependencies, giving researchers $1 million over five years with no strings attached. This is not charity. It is an institutional bet that open foundational tools will generate more long-term value than proprietary ones. It is the same bet that made Linux and the internet possible.

The Synthetic Biology Implications Are Staggering

Evo 2's generative capabilities are where things get genuinely unprecedented. The model can design synthetic genomes at three levels of complexity: mitochondrial genomes (roughly 16,000 bases), minimal bacterial genomes like Mycoplasma genitalium (roughly 580,000 bases), and eukaryotic chromosomes from yeast. These are not random sequences that happen to look like DNA. They are functional designs that respect the structural logic of real genomes, including gene order, regulatory spacing, and codon usage patterns.

The model can also perform controlled generation by integrating external tools at inference time. In one demonstration, researchers combined Evo 2 with chromatin accessibility models like Enformer and Borzoi to generate DNA sequences with programmable regulatory properties, sequences specifically designed to be open or closed to transcription machinery in particular cell types. This is genome engineering guided by AI reasoning, not just pattern matching.
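The general pattern of steering generation with an outside scorer can be sketched as a small beam search. Everything here is a toy under stated assumptions: propose_chunk stands in for the generative model and simply samples random bases, and score stands in for an external predictor by rewarding closeness to a target GC fraction. Neither reflects the real interfaces of Evo 2, Enformer, or Borzoi.

```python
import random


def propose_chunk(rng, length=8):
    """Stand-in for the generative model: samples a random DNA chunk."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def gc_fraction(seq):
    return sum(base in "GC" for base in seq) / len(seq)


def score(seq, target_gc):
    """Stand-in for an external predictor. Higher is better."""
    return -abs(gc_fraction(seq) - target_gc)


def guided_generate(target_gc, steps=10, beam_width=4,
                    samples_per_beam=8, seed=0):
    """Extend sequences chunk by chunk, keeping the best-scoring beams."""
    rng = random.Random(seed)
    beams = [""]
    for _ in range(steps):
        # Each surviving beam proposes several possible continuations.
        candidates = [
            beam + propose_chunk(rng)
            for beam in beams
            for _ in range(samples_per_beam)
        ]
        # The external scorer decides which continuations survive.
        candidates.sort(key=lambda s: score(s, target_gc), reverse=True)
        beams = candidates[:beam_width]
    return beams[0]


designed = guided_generate(target_gc=0.7)
print(designed, round(gc_fraction(designed), 3))
```

The generator never sees the target directly. The property is imposed at inference time by the scorer, which is why the approach can be retargeted without retraining the underlying model.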

For the synthetic biology industry, this changes the economics of the design-build-test cycle dramatically. Companies like Ginkgo Bioworks, Twist Bioscience, and Synthetic Genomics have spent years building platforms that iterate through thousands of genetic designs to find ones that work. The bottleneck has always been the design step: knowing which sequences to synthesize and test. If Evo 2 can narrow the design space by even an order of magnitude, it collapses the cost and timeline of engineering organisms for everything from biofuel production to therapeutic protein manufacturing.

The startup implications are immediate. Any founder building in synthetic biology, gene therapy, or agricultural biotech who is not already integrating Evo 2 into their computational pipeline is falling behind. The model is free. The compute to run it is not trivial (a 40-billion-parameter model requires serious GPU resources), but NVIDIA's BioNeMo integration provides cloud-accessible inference. The barrier to entry for AI-guided genome design just dropped from "build your own foundation model" to "write an API call."
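A quick calculation shows why "serious GPU resources" is not an exaggeration. The Python sketch below estimates the memory needed just to hold 40 billion weights at a few common numeric precisions. The precision list is an assumption for illustration. It does not say which format Evo 2 is actually served in, and it ignores activations and other runtime overhead.

```python
NUM_PARAMS = 40_000_000_000

# Bytes per parameter for some common formats (illustrative choices).
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1}


def weight_memory_gb(num_params, dtype):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(NUM_PARAMS, dtype):.0f} GB")
```

At two bytes per parameter the weights alone come to about 80 GB, which is why hosted inference matters for small teams.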

The Open Source Gambit and Who It Threatens

Arc Institute's decision to release Evo 2 as fully open source is the most consequential strategic move in biotech AI since DeepMind published the original AlphaFold paper. It creates a gravitational center for the entire field. Every academic lab, every biotech startup, every pharmaceutical company can now build on Evo 2 rather than training their own genomic foundation model from scratch. This consolidation effect will be powerful.

The losers are companies that have been building proprietary genomic AI models. Several well-funded startups, along with internal teams at major pharma companies, have spent the last two years training their own DNA language models on smaller, curated datasets. Evo 2 makes most of that work redundant overnight. A 40-billion-parameter model trained on 9.3 trillion bases from the full tree of life is not something you can match with a Series B and a few hundred GPUs. The foundation layer has been commoditized.

The winners, beyond the obvious beneficiaries in academic research, are companies that sit on proprietary biological data. Evo 2 is a general-purpose foundation model. Its real power will emerge through fine-tuning on specific domains: rare disease genomics, crop genetics, microbiome engineering, viral evolution. Companies with unique datasets in these areas can now build specialized models on top of Evo 2 at a fraction of the cost. The value shifts from "who can train the biggest base model" to "who has the best data and the sharpest fine-tuning."

This mirrors exactly what happened in natural language processing after Meta released LLaMA. The foundation layer became free, and the competition moved to data, fine-tuning, and application-layer differentiation. Biotech AI is about to undergo the same structural shift, compressed into months rather than years because the playbook already exists.

What Comes Next: Three Predictions

First, within 12 months, at least one major pharmaceutical company will announce a drug candidate whose target was identified or validated using Evo 2 or a model fine-tuned from it. The model's ability to predict functional impacts of non-coding variants gives it immediate utility in target discovery for genetic diseases, particularly for the thousands of disease-associated variants that fall outside protein-coding regions and have been sitting in GWAS databases unexplained for years.

Second, the Evo architecture will expand beyond DNA. The same StripedHyena 2 approach that handles million-base genomic sequences can handle other long-range biological data: epigenomic profiles, spatial transcriptomics, long-read sequencing data. Arc Institute will likely release Evo 3 within 18 months with multimodal biological inputs, integrating DNA sequence with chromatin state, gene expression, and protein structure. The convergence of these data types into a single model will unlock capabilities that none of them can achieve alone.

Third, and most controversially, Evo 2 will force a serious policy reckoning around AI-designed organisms. The model can already generate plausible bacterial genomes from scratch. As the technology matures, the ability to design novel organisms with specific properties (metabolic pathways, virulence factors, or environmental behaviors) will become accessible to anyone with a GPU cluster and a DNA synthesis provider. The biosecurity community has been sounding alarms about this for years, but Evo 2 makes the concern concrete rather than theoretical. Expect regulatory proposals around AI-guided genome design to appear in both the US and EU before the end of 2027.

Biology's foundation model era has begun. The first generation of these models (AlphaFold for structure, ESM for protein language, and now Evo 2 for genomes) is establishing the computational infrastructure that will define how we understand and engineer life for the next several decades. The fact that the most powerful of these models is open source, built by a nonprofit, and freely available to anyone is not just a technical achievement. It is a statement about how the most important scientific tools should be distributed. Whether the rest of the industry listens is the question that matters now.
