The Bitter Lesson for Proteins: ESMFold 2 and the World Model of Protein Biology
The Core Thesis: Scaling Laws in Protein Biology
Protein biology is entering a paradigm shift in which general-purpose language models, trained on vast evolutionary data, develop deep biological understanding without explicit human-designed priors. By applying the "bitter lesson" (the observation that scaling compute and data consistently outperforms hand-crafted heuristics), BioHub has developed a world model of protein biology that can predict structure and function and design novel proteins.
ESMC and ESMFold 2: Building a World Model
BioHub has released ESMC (the fourth generation of the Evolutionary Scale Modeling family) and ESMFold 2, an open scientific engine for protein prediction and design. Unlike previous models that relied on Multiple Sequence Alignments (MSAs) or heavy inductive biases, these models use a transformer-based language model architecture trained on protein sequences at massive scale.
Key Technical Achievements
- Data Scale: The model was trained on billions of protein sequences, including a significant integration of metagenomic data (sequences from diverse biomes like hydrothermal vents and the deep ocean). This shift from curated databases (like UniRef) to metagenomics removed the diminishing returns seen in earlier versions (ESM2).
- Structure Prediction: ESMFold 2 provides atomic-resolution structure predictions in seconds, bypassing the need for MSAs and making it significantly faster than predecessors.
- Comprehensive Atlas: BioHub has resolved predicted structures for 1.1 billion proteins (clustered at 70% sequence identity) from a database of 6.8 billion non-redundant proteins.
- Multimer Capabilities: The model is the state of the art among open models for predicting protein-protein interactions.
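To make "clustered at 70% sequence identity" concrete, here is a minimal sketch of greedy clustering by identity threshold. It is a toy under stated assumptions, not BioHub's pipeline: the sequences are invented, the `identity` function assumes equal-length, already-aligned sequences, and real clustering at this scale relies on alignment-based tools.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions.

    Toy measure (assumption): sequences are equal-length and already aligned.
    """
    if not a or len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, threshold=0.7):
    """Assign each sequence to the first representative it matches
    at or above the threshold; otherwise it founds a new cluster."""
    clusters = {}  # representative -> members (dicts keep insertion order)
    for s in seqs:
        for rep in clusters:
            if identity(s, rep) >= threshold:
                clusters[rep].append(s)
                break
        else:
            clusters[s] = [s]
    return clusters


# Invented example sequences, for illustration only.
seqs = ["MKTAYIAKQR", "MKTAYIAKQK", "MKTAYLAKQR", "GSHMLEDPVA", "GSHMLEDPVG"]
clusters = greedy_cluster(seqs)
for rep, members in clusters.items():
    print(rep, len(members))
```

One representative per cluster is then enough to cover the family, which is how a structure atlas can span billions of sequences with far fewer predictions.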
Mechanistic Interpretability and Emergent Features
Using sparse autoencoders (SAEs), BioHub analyzed the representation space of the 6-billion-parameter ESMC model. They discovered a hierarchy of features that emerged spontaneously from the "next token" prediction task, mirroring concepts established by decades of reductionist biological research.
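For readers unfamiliar with SAEs, the sketch below shows the common formulation: a ReLU encoder into an overcomplete feature space, a linear decoder, and a loss that combines reconstruction error with an L1 sparsity penalty. The source does not describe the exact architecture BioHub used, so the class, dimensions, and `l1_coeff` value here are illustrative assumptions, and the weights are random rather than trained.

```python
import random


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


class SparseAutoencoder:
    """Toy SAE: f = ReLU(W_enc (x - b_dec) + b_enc), x_hat = W_dec f + b_dec."""

    def __init__(self, d_model, d_features, seed=0):
        rng = random.Random(seed)
        self.W_enc = [[rng.gauss(0, 0.1) for _ in range(d_model)]
                      for _ in range(d_features)]
        self.b_enc = [0.0] * d_features
        self.W_dec = [[rng.gauss(0, 0.1) for _ in range(d_features)]
                      for _ in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, x):
        centered = [xi - b for xi, b in zip(x, self.b_dec)]
        pre = matvec(self.W_enc, centered)
        return [max(0.0, p + b) for p, b in zip(pre, self.b_enc)]

    def decode(self, f):
        return [r + b for r, b in zip(matvec(self.W_dec, f), self.b_dec)]

    def loss(self, x, l1_coeff=1e-3):
        f = self.encode(x)
        x_hat = self.decode(f)
        recon = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
        sparsity = sum(f)  # features are non-negative, so this is the L1 norm
        return recon + l1_coeff * sparsity, f


# A 4-dimensional stand-in for one model activation (real ones are far wider).
sae = SparseAutoencoder(d_model=4, d_features=16)
total, features = sae.loss([0.5, -1.0, 0.25, 2.0])
print(total, sum(1 for v in features if v > 0))
```

After training, each feature dimension that fires on a consistent set of inputs becomes a candidate interpretable feature, which is the kind of object the analysis above refers to.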
The Nucleophilic Elbow Example
One concrete discovery is the model's ability to identify a "nucleophilic elbow"—a core functional motif. The model developed a single feature to represent this motif across evolutionarily diverse protein families with completely different structural topologies. This suggests the model learned a latent variable for the biological function that transcends sequence similarity.
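For contrast, consider what a purely sequence-based detector looks like. In serine hydrolases the nucleophilic elbow is often summarized by the consensus pattern G-X-S-X-G; that pattern is background knowledge rather than something stated in this article, and the sequences below are invented. A regular expression finds only literal matches, whereas the single learned feature described above is reported to generalize across families and topologies.

```python
import re

# Consensus often cited for serine hydrolases (assumption, not from the source).
# The lookahead lets overlapping occurrences be reported as well.
ELBOW = re.compile(r"(?=(G.S.G))")


def find_elbows(seq: str):
    """Return (start_index, matched_motif) for each G-X-S-X-G occurrence."""
    return [(m.start(), m.group(1)) for m in ELBOW.finditer(seq)]


# Invented toy sequences, for illustration only.
print(find_elbows("AAGHSMGAA"))
print(find_elbows("AAAAAAAAA"))
```

A motif that uses a different nucleophile, or that is conserved in 3D geometry but not in sequence, would be missed by this pattern.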
Programmable Biology and Therapeutic Design
BioHub is moving toward "programmable biology," where the world model is used as a search space to find molecules that satisfy specific design criteria.
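The idea of design as search can be sketched with a simple hill climb over sequences. Everything here is a stand-in: `toy_score` is a hypothetical placeholder for a model-derived objective (it just counts matches to an invented target), and the source does not say which search procedure BioHub uses.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(seq: str, target: str = "MKTAYIAK") -> int:
    """Hypothetical objective: number of positions matching an invented target.
    In practice this would be a score computed from the model."""
    return sum(a == b for a, b in zip(seq, target))


def hill_climb(start: str, score, steps: int = 2000, seed: int = 0):
    """Propose single-residue mutations; keep any that do not lower the score."""
    rng = random.Random(seed)
    best, best_score = start, score(start)
    for _ in range(steps):
        pos = rng.randrange(len(best))
        candidate = best[:pos] + rng.choice(AMINO_ACIDS) + best[pos + 1:]
        s = score(candidate)
        if s >= best_score:
            best, best_score = candidate, s
    return best, best_score


best, best_score = hill_climb("AAAAAAAA", toy_score)
print(best, best_score)
```

Swapping in a different scoring function changes the design criteria without changing the search loop, which is the sense in which the model acts as a programmable search space.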
Designing Antibodies (scFvs)
The team has successfully used ESMC to design single-chain variable fragments (scFvs), a critical therapeutic modality. Because antibodies evolve toward diversity rather than along constrained paths, they often resist traditional MSA-based prediction. ESMC's representation space has proven more effective at designing antibodies with the affinity required for therapeutic use.
The Future: From Proteins to the Virtual Cell
Alex Rives outlines a vision for a new scientific paradigm based on three principles: data generation, predictive digital representations, and feedback loops.
The Virtual Biology Initiative
BioHub has launched a $500 million initiative to accelerate the creation of cellular-scale data. This includes:
- $400 million for internal data creation and technology development to increase measurement modalities.
- $100 million to catalyze external data generation efforts.
Scaling the Complexity Ladder
To move from molecular models to a "virtual cell," BioHub is focusing on:
- Interventional Biology: Scaling perturbation experiments to see how cells respond to novel interventions.
- Spatial Biology: Understanding cells in their native tissue context rather than in isolation.
- Cross-Modality: Simultaneously measuring the genome, epigenome, transcriptome, and proteome to map the cellular information hierarchy.
- Feedback Loops: Integrating AI with automated labs and cryo-electron tomography to create an active learning system where models reason over hypotheses and validate them experimentally.
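The feedback-loop item above describes an active learning system. A minimal sketch of that pattern follows, under loud assumptions: `run_experiment` is a hypothetical stand-in for a lab measurement, the "model" is a nearest-neighbor lookup, and uncertainty is simply the distance to the nearest measured point. None of this reflects BioHub's actual system.

```python
def run_experiment(x: float) -> float:
    """Hypothetical stand-in for a lab measurement."""
    return (x - 3.0) ** 2


def predict(x: float, labeled: dict):
    """Nearest-neighbor prediction plus a crude uncertainty estimate
    (distance to the closest measured point)."""
    nearest = min(labeled, key=lambda p: abs(p - x))
    return labeled[nearest], abs(nearest - x)


def active_learning(pool, initial, rounds=3):
    """Each round, measure the candidate the model is least sure about."""
    labeled = {x: run_experiment(x) for x in initial}
    for _ in range(rounds):
        unlabeled = [x for x in pool if x not in labeled]
        if not unlabeled:
            break
        query = max(unlabeled, key=lambda x: predict(x, labeled)[1])
        labeled[query] = run_experiment(query)
    return labeled


pool = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
labeled = active_learning(pool, initial=[0.0])
print(sorted(labeled))
```

The loop spends each experiment where the model knows least, so the measurements and the model improve together.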