Shinka Evolve: Open-Ended Program Search for Scientific Discovery

The Core Thesis: Beyond Fixed-Problem Optimization

Real scientific progress requires co-evolving problems and solutions rather than optimizing solutions to a fixed, human-defined problem. Existing systems such as AlphaEvolve can optimize solutions to specific tasks, but they often get stuck in local optima because they cannot automatically invent the "stepping stones" (intermediate, potentially unrelated problems) needed to reach a major breakthrough.

Shinka Evolve: Architecture and Innovations

Shinka Evolve is designed to be a sample-efficient evolutionary framework that uses Large Language Models (LLMs) as mutation operators to search for optimal programs. Its primary goal is to democratize scientific discovery by reducing the computational cost and number of evaluations required to find state-of-the-art results.

Evolutionary Search Mechanism

Shinka Evolve maintains an archive of programs organized as a tree. The process follows an iterative loop:

  1. Sampling: Parent programs and "inspiration" programs are sampled from the database.
  2. Mutation: An LLM is prompted to improve the program via code edits, full rewrites, or crossover (combining two different programs).
  3. Evaluation: The resulting program is run through a synthetic evaluator to collect evidence.
  4. Diffusion: Knowledge gained from a successful program is diffused across the database to guide further search.
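
The four steps above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not Shinka Evolve's actual implementation: the names `Program`, `evolve`, `toy_mutate`, and `toy_evaluate` are hypothetical, the LLM call is replaced by a random numeric tweak, the parent-sampling rule (pick from the top half by score) is an arbitrary choice, and "diffusion" is reduced to a shared list of notes passed to every later mutation.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Program:
    code: str
    score: float
    parent: Optional["Program"] = None  # parent links make the archive a tree


def evolve(seed_code: str,
           mutate: Callable[[str, List[str], List[str]], str],
           evaluate: Callable[[str], float],
           generations: int = 50,
           n_inspirations: int = 2,
           rng: Optional[random.Random] = None) -> List[Program]:
    rng = rng or random.Random(0)
    archive = [Program(seed_code, evaluate(seed_code))]
    notes: List[str] = []  # assumed stand-in for knowledge shared across the archive
    for _ in range(generations):
        # 1. Sampling: parent from the top half by score, inspirations uniformly.
        ranked = sorted(archive, key=lambda p: p.score, reverse=True)
        parent = rng.choice(ranked[: max(1, len(ranked) // 2)])
        inspirations = rng.sample(archive, k=min(n_inspirations, len(archive)))
        # 2. Mutation: in the real system an LLM proposes the edit; here it is a stub.
        child_code = mutate(parent.code, [p.code for p in inspirations], notes)
        # 3. Evaluation: score the child and attach it to the tree.
        child = Program(child_code, evaluate(child_code), parent)
        archive.append(child)
        # 4. Diffusion: record what worked so later mutations can see it.
        if child.score > parent.score:
            notes.append(f"improved {parent.score:.3f} -> {child.score:.3f}")
    return archive


def toy_evaluate(code: str) -> float:
    # Toy task: the "program" is a number; the best possible value is 3.0.
    return -(float(code) - 3.0) ** 2


def make_toy_mutate(rng: random.Random):
    def toy_mutate(parent_code: str, inspiration_codes: List[str], notes: List[str]) -> str:
        # Placeholder for an LLM edit: nudge the parent's value by Gaussian noise.
        return repr(float(parent_code) + rng.gauss(0.0, 0.5))
    return toy_mutate


if __name__ == "__main__":
    archive = evolve("0.0", make_toy_mutate(random.Random(1)), toy_evaluate, generations=100)
    best = max(archive, key=lambda p: p.score)
    print(best.code, best.score)
```

Passing `mutate` and `evaluate` in as callables keeps the loop independent of any particular model or task, which is the property the description above relies on.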

Key Technical Innovations

To improve efficiency and diversity, Shinka Evolve introduces several mechanisms:

  • Model Ensembling via UCB Bandits: Instead of relying on a single LLM, Shinka Evolve uses an ensemble of frontier models (e.g., GPT-5, Sonnet 4.5, Gemini). It employs an Upper Confidence Bound (UCB) bandit algorithm to adaptively select which model to use for a specific mutation, balancing exploration of different models with the exploitation of those that have historically yielded improvements.
  • Mutable Markers: To prevent LLMs from deleting essential code (like imports), the system uses markers to define which parts of the code are mutable and evolvable, employing rejection sampling to ensure robustness.
  • Meta-Scratchpad: The system maintains a global set of insights and summaries extracted from successful programs. These insights are converted into meta-recommendations that are added to the system prompt, allowing the system to semantically grasp and propagate discoveries.
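
The bandit-based model selection described above can be illustrated with the textbook UCB1 rule. This is a sketch under assumptions rather than the system's actual code: the class name `UCBModelSelector`, the `exploration` parameter, and the reward definition (for example 1.0 if the mutated program beat its parent, else 0.0) are hypothetical choices made for illustration.

```python
import math
from typing import Dict, Iterable


class UCBModelSelector:
    """UCB1 over a set of model names; rewards are assumed to lie in [0, 1]."""

    def __init__(self, model_names: Iterable[str], exploration: float = 1.0):
        self.names = list(model_names)
        self.c = exploration
        self.counts: Dict[str, int] = {n: 0 for n in self.names}
        self.total_reward: Dict[str, float] = {n: 0.0 for n in self.names}

    def select(self) -> str:
        # Try every model once before trusting any statistics.
        for name in self.names:
            if self.counts[name] == 0:
                return name
        t = sum(self.counts.values())

        def ucb(name: str) -> float:
            mean = self.total_reward[name] / self.counts[name]            # exploitation
            bonus = self.c * math.sqrt(2.0 * math.log(t) / self.counts[name])  # exploration
            return mean + bonus

        return max(self.names, key=ucb)

    def update(self, name: str, reward: float) -> None:
        # Record the outcome of one mutation produced by `name`.
        self.counts[name] += 1
        self.total_reward[name] += reward
```

In use, each generation would call `select()` to pick the model for the next mutation and `update()` once the child has been evaluated. Models that rarely produce improvements keep being sampled occasionally because their bonus term grows as other models accumulate trials.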

Concrete Results and Applications

Shinka Evolve has demonstrated the ability to outperform human-designed or previously known algorithmic results with significantly fewer evaluations:

  • Circle Packing: The system achieved state-of-the-art results in circle packing (maximizing the sum of radii of circles in a square) in fewer than 200 LLM interactions. Robert Lange notes that using a "surrogate problem" (allowing a tiny amount of overlap before refining to an exact solution) was a key stepping stone to this success.
  • Competitive Programming: On ALE-Bench (a benchmark for long-horizon algorithm engineering), Shinka Evolve optimized initial solutions to the point where it would have placed second in an AtCoder competitive programming contest.
  • Agentic Scaffolds: Using the Automated Design of Agentic Systems (ADAS) framework, Shinka Evolve evolved agent scaffolds for AIME mathematics benchmarks, significantly improving the performance of smaller, cheaper models like GPT-4.1 nano.
  • MoE Load Balancing: The system evolved load-balancing loss functions for Mixture-of-Experts (MoE) models, illuminating a convex hull of trade-offs between model performance and load balancing.
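
The circle-packing objective and the surrogate idea from the first bullet can be made concrete with a small evaluator. This is a sketch under assumptions: the function names, the `tol` parameter, and the uniform-shrink refinement are illustrative choices, since the source only says that a tiny overlap was allowed before refining to an exact solution and does not describe how the refinement was done.

```python
import math
from typing import List, Tuple

Circle = Tuple[float, float, float]  # (center x, center y, radius)


def max_violation(circles: List[Circle]) -> float:
    """Largest constraint violation; 0.0 means an exact packing in the unit square."""
    worst = 0.0
    for i, (x, y, r) in enumerate(circles):
        # Each circle must have a non-negative radius and stay inside [0, 1] x [0, 1].
        worst = max(worst, -r, r - x, x + r - 1.0, r - y, y + r - 1.0)
        for x2, y2, r2 in circles[i + 1:]:
            # Two circles overlap when the sum of radii exceeds the center distance.
            worst = max(worst, (r + r2) - math.hypot(x - x2, y - y2))
    return worst


def score(circles: List[Circle], tol: float = 0.0) -> float:
    """Sum of radii if every violation is within `tol`, otherwise -inf.

    tol = 0.0 is the exact problem; tol > 0 is the relaxed surrogate problem.
    """
    if max_violation(circles) > tol:
        return float("-inf")
    return sum(r for _, _, r in circles)


def shrink_to_exact(circles: List[Circle], margin: float = 1e-12) -> List[Circle]:
    """Assumed refinement: shrink every radius by the worst violation plus a margin.

    This yields an exact packing as long as the shrink amount is smaller than
    every radius; otherwise a radius turns negative and is reported as invalid.
    """
    shrink = max_violation(circles) + margin
    return [(x, y, r - shrink) for x, y, r in circles]


if __name__ == "__main__":
    near = [(0.25, 0.25, 0.2505), (0.75, 0.25, 0.2505)]  # slightly overlapping pair
    print(score(near), score(near, tol=1e-2), score(shrink_to_exact(near)))
```

Searching against the relaxed score and only then refining gives the search a smoother target, which is the sense in which the surrogate acts as a stepping stone.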

The "AI Scientist" and the Future of Research

Robert Lange discusses the transition from AI Scientist v1 to v2, moving from a template-based linear execution to an agentic tree search.

From Linear to Tree Search

While v1 followed a linear path (idea → experiment → paper), v2 implements a loop of hypothesis generation, execution, and falsification based on Karl Popper's scientific method. This allows the agent to adapt its next steps based on the evidence accumulated from previous failed or successful experiments.

The "Slop" Critique and Human Agency

Addressing the concern that AI-generated papers may be "slop" (surface-level mimicry without deep understanding), Lange acknowledges that not every output is Nature-worthy. However, he argues that the system is currently at a "GPT-1 moment" for autonomous research. He posits that humans will transition from executing research to shepherding it: steering the direction of exploration and verifying the final results while the AI handles the iterative drudgery of experiment execution.

Long-term Predictions: The Rubicon Moment

Lange predicts that scientific research will be fundamentally transformed over the next 5 to 20 years. He identifies the "Rubicon moment" as the point when a major new architecture (e.g., a successor to the Transformer) is discovered by an AI system and subsequently adopted by humans. He believes that while AI can currently perform surface-level recombination, the gap to deep, grounded understanding will be closed through increased diversity, scaling, and the integration of verifiable feedback loops.
