Elicit: Building World Models for Trusted Scientific Reasoning

The Thesis: Moving Beyond Outcome-Based AI

To support high-stakes scientific decisions, AI must move from being a "black box" that provides an answer to a transparent system that provides a verifiable process. Elicit focuses on process supervision—rewarding and evaluating the step-by-step reasoning rather than just the final output—to prevent models from "hallucinating" the completion of tasks or providing persuasive but unfounded conclusions.

Trusted Reasoning via Domain-Specific Languages (DSL)

Elicit addresses the inherent "fuzziness" of large language models (LLMs) by implementing a domain-specific language (DSL) that defines reasoning primitives. This architecture allows frontier models to orchestrate structured workflows that are guaranteed to execute as defined.

Trust at Scale

A user can manually check a model's output for a few papers, but not for 10,000. Elicit's DSL ensures that the same rigorous process is applied to the first document and to the 10,000th. This systematicity is a core differentiator from general-purpose research agents, which may claim to have analyzed a large corpus but, on inspection, turn out not to have done so.
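The idea of fixed reasoning primitives applied identically to every document can be sketched as follows. This is a minimal illustration, not Elicit's actual DSL: the names Step, Workflow, extract_sample_size, and screen are hypothetical, and the "extraction" step is a toy stand-in for a model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Step:
    """One named reasoning primitive (hypothetical; not Elicit's real DSL)."""
    name: str
    fn: Callable[[dict], dict]


@dataclass(frozen=True)
class Workflow:
    steps: tuple[Step, ...]

    def run(self, doc: dict) -> tuple[dict, list[str]]:
        """Apply every step in order; return the final state and a trace."""
        state, trace = dict(doc), []
        for step in self.steps:
            state = step.fn(state)
            trace.append(step.name)
        return state, trace


def extract_sample_size(state: dict) -> dict:
    # Toy stand-in for a model call: reads a pre-parsed field if present.
    return {**state, "n": state.get("sample_size")}


def screen(state: dict) -> dict:
    # Assumed inclusion rule for illustration: at least 30 participants.
    n = state["n"]
    return {**state, "included": n is not None and n >= 30}


workflow = Workflow((Step("extract_sample_size", extract_sample_size),
                     Step("screen", screen)))

# Five documents with sample sizes 10..50, plus one with no sample size.
docs = [{"id": i, "sample_size": 10 * i} for i in range(1, 6)] + [{"id": 6}]
results = [workflow.run(d) for d in docs]
included_ids = [state["id"] for state, _ in results if state["included"]]
```

Because the workflow object, not the model, decides which steps run, every document produces the same trace, and that trace can be audited without rereading each paper.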

The Role of Process Supervision

Process supervision is critical because models trained primarily on outcomes are prone to "reward hacking," where they produce an answer that looks correct to a human evaluator without having performed the necessary work. Elicit emphasizes that the only way to ensure a result is correct for the right reasons is to monitor the process—such as tracking which specific sections of a paper a model read before forming a conclusion.
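One way to make "which sections were read" checkable is to record every read and compare it against what a conclusion cites. The sketch below assumes a hypothetical ReadingTrace class and a paper represented as a dictionary of section texts; it is not a description of Elicit's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ReadingTrace:
    """Records which sections were actually accessed (hypothetical helper)."""
    sections_read: set[str] = field(default_factory=set)

    def read(self, paper: dict[str, str], section: str) -> str:
        self.sections_read.add(section)
        return paper[section]


def unsupported_sections(cited: set[str], trace: ReadingTrace) -> set[str]:
    """Sections a conclusion cites that were never read."""
    return cited - trace.sections_read


paper = {
    "abstract": "We report a randomized trial.",
    "methods": "Participants were assigned at random.",
    "results": "The treatment group improved.",
}

trace = ReadingTrace()
trace.read(paper, "abstract")
trace.read(paper, "results")

# The conclusion claims support from methods and results,
# but the methods section was never read.
missing = unsupported_sections({"methods", "results"}, trace)
```

A non-empty result flags a conclusion that may look correct but was not reached through the required steps.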

External World Models and Knowledge Representations

To handle massive bodies of evidence (e.g., 5,000+ relevant papers on a specific cancer treatment), Elicit is moving toward external world models. Rather than relying on the model's internal weights or a massive context window, it uses structured representations that both humans and AIs can inspect.

Beyond Text Files

While simple markdown wikis (similar to the "LLM Wiki" concept) are a starting point, Elicit explores more sophisticated representations to support:

Predictions: Forecasting outcomes based on current evidence.
Interventions: Analyzing what happens if a specific variable is changed.
Counterfactuals: Determining what would have happened if a different path had been taken.
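The three query types above can be told apart with a toy structural causal model. The sketch assumes a made-up linear chain (dose affects a biomarker, which affects an outcome) with invented coefficients; it illustrates the distinction only and does not reflect any model Elicit uses.

```python
def simulate(dose, u_b=0.0, u_o=0.0, do_biomarker=None):
    """Toy linear model: biomarker = 2*dose + u_b, outcome = 3*biomarker + u_o.

    Passing do_biomarker overrides the biomarker equation (an intervention).
    """
    biomarker = 2.0 * dose + u_b if do_biomarker is None else do_biomarker
    outcome = 3.0 * biomarker + u_o
    return biomarker, outcome


# Prediction: expected outcome for dose 1 with no noise.
_, predicted = simulate(dose=1.0)

# Intervention: force the biomarker to 5, regardless of dose.
_, intervened = simulate(dose=1.0, do_biomarker=5.0)

# Counterfactual: for one observed patient, infer the noise terms
# (abduction), then replay the model with a different dose.
obs_dose, obs_biomarker, obs_outcome = 1.0, 2.5, 8.0
u_b = obs_biomarker - 2.0 * obs_dose
u_o = obs_outcome - 3.0 * obs_biomarker
_, counterfactual = simulate(dose=2.0, u_b=u_b, u_o=u_o)
```

Here the prediction is 6.0, the intervention gives 15.0, and the counterfactual for this particular patient is 14.0, which differs from the 12.0 a plain prediction at dose 2 would give because the patient's inferred noise terms are carried over.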

Heterogeneous Representations

World models are not limited to a single format. Depending on the use case, a world model might be a causal graph (nodes and arrows) for biological mechanisms, a SQL table for user metrics, or a "tech tree" for product development. The challenge lies in ensuring information propagates consistently across these different representations.

Evaluating Evidence and Confidence

In scientific research, not all evidence is equal. Elicit focuses on weighing evidence by its actual quality rather than relying on lossy proxies like citation counts or journal impact factors.

Calibrating Confidence

Verbalized calibration (asking a model how confident it is) is currently more useful than token probabilities, though models remain "easy to push around." If a user suggests a counter-argument, models often pivot their confidence levels too easily. Elicit aims to build more stable probabilities by grounding claims in explicit evidence and breaking insights down into individual, checkable claims.
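Both properties discussed here, calibration and stability under pushback, can be measured with simple statistics. The sketch below uses the Brier score (a standard calibration measure, chosen here as an assumption, not something the source names) and a mean absolute shift in stated confidence; all numbers are invented for illustration.

```python
def brier_score(confidences, outcomes):
    """Mean squared gap between stated confidence and what happened (0 is best)."""
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(confidences)


def mean_shift(before, after):
    """Average absolute change in confidence after a user pushes back."""
    return sum(abs(a - b) for a, b in zip(before, after)) / len(before)


# Hypothetical per-claim confidences stated by a model.
stated = [0.9, 0.8, 0.3, 0.6]
# Whether each claim turned out to be true (1) or false (0).
truth = [1, 1, 0, 0]
# Confidences restated after the user offers a counter-argument.
after_pushback = [0.5, 0.7, 0.3, 0.2]

calibration = brier_score(stated, truth)
instability = mean_shift(stated, after_pushback)
```

A low Brier score with a high shift would describe a model that is reasonably calibrated but easy to push around, which is the failure mode the paragraph above describes.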

The "Hard-to-Verify" Problem

Many high-level tasks (like company strategy) are "hard to verify." Elicit's approach is to reduce these fuzzy, high-level tasks into a graph of smaller, easy-to-verify tasks. While formal verification is possible in mathematics or coding, scientific reasoning requires a different kind of "certificate of reasoning"—a legible trace that proves the appropriate steps were taken.
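Reducing a fuzzy question to small checkable ones can be sketched as a tree of tasks whose leaves carry simple checks. The Task class, the example checks, and the study values below are hypothetical; a real system would likely use a richer graph and model-backed checks.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Task:
    name: str
    check: Optional[Callable[[], bool]] = None      # set on leaf tasks only
    subtasks: list["Task"] = field(default_factory=list)


def verify(task: Task, trace: list[tuple[str, bool]]) -> bool:
    """Verify a task bottom-up, appending (name, passed) to the trace."""
    if task.subtasks:
        # Evaluate every child (no short-circuit) so the trace is complete.
        results = [verify(t, trace) for t in task.subtasks]
        ok = all(results)
    else:
        ok = task.check is not None and task.check()
    trace.append((task.name, ok))
    return ok


# Invented study summary used by the leaf checks.
study = {"randomized": True, "n": 42, "ci_low": -0.1, "ci_high": 0.8}

root = Task("treatment effect is credible", subtasks=[
    Task("study is randomized", check=lambda: study["randomized"]),
    Task("sample size is at least 30", check=lambda: study["n"] >= 30),
    Task("confidence interval excludes zero",
         check=lambda: study["ci_low"] > 0 or study["ci_high"] < 0),
])

trace: list[tuple[str, bool]] = []
verdict = verify(root, trace)
failed = [name for name, ok in trace if not ok]
```

The trace plays the role of the "certificate of reasoning": it shows not only the verdict but exactly which small check caused it.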

Operationalizing AI: "The Line" and Token Economics

Elicit applies its reasoning philosophy to its own internal operations through a system called "The Line," an automated software engineering pipeline.

Automated Engineering

"The Line" handles the end-to-end process of feature development: speccing, implementation, testing (via recorded video), code review, and merging. This system currently merges 30 to 50 issues per week automatically, with humans intervening only when a spec is incomplete or a feature is too complex for automated review.

Token Economics

As token costs rise, Elicit is moving away from using the largest model for every task. Instead, it uses a "smart orchestrator" that dispatches simpler tasks to smaller, more efficient models, reserving frontier models for high-level reasoning and orchestration.
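A dispatch rule of this kind can be sketched as "pick the cheapest model that is capable enough." The model names, prices, difficulty levels, and task list below are all invented for illustration and say nothing about Elicit's real configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    cost_per_1k_tokens: float   # hypothetical price
    max_difficulty: int         # hardest task level this model may handle


MODELS = [
    Model("small", 0.1, 1),
    Model("medium", 1.0, 2),
    Model("frontier", 10.0, 3),
]


def route(difficulty: int) -> Model:
    """Return the cheapest model rated for the given difficulty."""
    eligible = [m for m in MODELS if m.max_difficulty >= difficulty]
    if not eligible:
        raise ValueError(f"no model can handle difficulty {difficulty}")
    return min(eligible, key=lambda m: m.cost_per_1k_tokens)


# (task name, difficulty, tokens) with made-up values.
tasks = [("extract title", 1, 2000), ("summarize paper", 2, 4000), ("plan review", 3, 1000)]

routed_cost = sum(route(d).cost_per_1k_tokens * t / 1000 for _, d, t in tasks)
frontier_cost = sum(MODELS[-1].cost_per_1k_tokens * t / 1000 for _, _, t in tasks)
assignments = {name: route(d).name for name, d, _ in tasks}
```

With these invented numbers, routing costs 14.2 units against 70.0 for sending everything to the frontier model. In practice the hard part is estimating difficulty, which this sketch takes as given.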

Future of AI in Science

The founders argue that the future of AI for science is not a single "winner" but a vast ecosystem of tools. They suggest that discretization (thinking in tokens/words) provides essential error correction that purely continuous latent-space representations (neuralese) lack. By maintaining legible, discrete reasoning traces, AI can remain a tool for human-steered discovery rather than an opaque oracle.
