Devin and OpenInspect: The Shift to Background Agents and Autonomous Coding
The Shift to Background Agents
AI coding is transitioning from "hand-holding" in the IDE to "background agents"—cloud-based systems that autonomously drive the development process. A critical model inflection point occurred around December 2025 (with models like Opus 4.5 and GPT 5.2), enabling a practical "spec-to-pull request" workflow where agents can move from a specification to a completed PR with minimal friction.
At Cognition, this shift resulted in a 7x growth in merged PRs and a jump in the percentage of commits attributed to Devin from 16% in January to 80% in March.
Background Agent Architecture
Building a production-ready background agent requires choosing between two primary architectural patterns for the agent's execution environment:
Harness In-the-Box vs. Out-of-the-Box
- Harness In-the-Box: The agent runs directly inside the sandbox. This makes state simpler to manage, but it poses security risks: secrets must be placed inside the box, where an unpredictable AI could accidentally exfiltrate them.
- Harness Out-of-the-Box: The "brain" (the agent's logic and control plane) runs externally, while the sandbox serves as the "hands" (the execution environment). This architecture is more complex but more secure, because it keeps high-privilege secrets off the machine the agent manipulates.
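The out-of-the-box split can be illustrated with a minimal sketch. It assumes a local subprocess as a stand-in for a remote sandbox VM; the names `Sandbox`, `Brain`, `publish`, `sandbox_can_see`, and `GIT_TOKEN` are hypothetical and are not taken from Devin or any other product.

```python
import os
import subprocess
import sys
from dataclasses import dataclass, field


@dataclass
class Sandbox:
    """The "hands": runs commands in an environment that holds no secrets."""
    workdir: str = "."

    def run(self, argv: list[str]) -> str:
        # Pass an explicit, minimal environment so that nothing from the
        # control plane's own environment leaks into the sandboxed process.
        clean_env = {"PATH": os.environ.get("PATH", "")}
        result = subprocess.run(
            argv, cwd=self.workdir, env=clean_env,
            capture_output=True, text=True, check=False,
        )
        return result.stdout


@dataclass
class Brain:
    """The "brain": keeps secrets and decides what the sandbox executes."""
    sandbox: Sandbox
    secrets: dict[str, str] = field(default_factory=dict)

    def step(self, argv: list[str]) -> str:
        # Only the command crosses the boundary; the secrets stay here.
        return self.sandbox.run(argv)

    def publish(self, diff: str) -> dict:
        # Placeholder for a privileged action (e.g. opening a PR) that the
        # brain performs itself, outside the sandbox, using its own token.
        token = self.secrets.get("GIT_TOKEN", "")
        return {"authorized": bool(token), "diff_bytes": len(diff)}


def sandbox_can_see(sandbox: Sandbox, name: str) -> bool:
    """Check from inside the sandbox whether an environment variable exists."""
    code = f"import os; print({name!r} in os.environ)"
    return sandbox.run([sys.executable, "-c", code]).strip() == "True"
```

In this sketch, a token held in `Brain.secrets` is not visible to anything the sandbox runs, even if the same variable is set in the brain's own process environment. A real deployment would replace the subprocess with a network call to an isolated VM.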
Infrastructure and Sandbox Requirements
- Full VMs vs. Docker: While Docker is useful for infrastructure, full VMs are often necessary so that agents can run real applications and perform nested virtualization (e.g., running an Android emulator). They also provide a true security boundary.
- Repo Setup: One of the hardest problems in agent deployment is "repo setup"—ensuring the agent has the correct dependencies, credentials, and environment to run and test code autonomously.
- Fast Restore: To avoid long boot times, Cognition developed a block-diff file storage format that allows VMs to be spun up and down quickly by only processing the diffs in the file system rather than the entire disk.
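The general idea behind block-level diffs can be sketched as follows. This is not Cognition's format, whose details are not given here; it is a generic illustration that assumes fixed-size blocks and an in-memory byte string standing in for a disk image. The names `split_blocks`, `make_diff`, and `restore` are hypothetical.

```python
BLOCK_SIZE = 4  # tiny for illustration; real systems use much larger blocks


def split_blocks(data: bytes, size: int = BLOCK_SIZE) -> list[bytes]:
    """Cut a byte string into consecutive fixed-size blocks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def make_diff(base: bytes, current: bytes) -> dict:
    """Record only the blocks of `current` that differ from `base`."""
    base_blocks = split_blocks(base)
    changed = {
        index: block
        for index, block in enumerate(split_blocks(current))
        if index >= len(base_blocks) or base_blocks[index] != block
    }
    return {"length": len(current), "blocks": changed}


def restore(base: bytes, diff: dict) -> bytes:
    """Rebuild the snapshot by overlaying the changed blocks on the base."""
    blocks = split_blocks(base)
    for index, block in diff["blocks"].items():
        if index < len(blocks):
            blocks[index] = block
        else:
            blocks.append(block)  # indices arrive in ascending order
    # Truncate in case the snapshot is shorter than the base image.
    return b"".join(blocks)[:diff["length"]]
```

With a shared base image, each snapshot only needs to store and transfer its changed blocks, which is what makes spinning a VM up or down cheap relative to copying the whole disk.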
The Testing Challenge
Testing is a distinct problem-solving challenge that goes beyond simple "computer use" (the ability to click coordinates on a screen). Effective autonomous testing requires the agent to:
- Reason through how to orchestrate front-end and back-end applications with the correct code versions.
- Trigger specific features, which may require admin privileges or specific feature flags.
- Verify the results using screenshots and video recordings to provide an "I know it works" merge moment for the human reviewer.
Memory and Knowledge Management
Memory remains a largely unsolved retrieval problem. Current approaches include:
- Auto-generated Memories: Devin uses a system where it suggests memories to the user (e.g., "Do you want me to remember that Cole likes draft PRs?"), which the user then approves or rejects.
- Memory Pruning and Editing: Systems are evolving to allow agents to edit existing memories as preferences change.
- File-System Based Memory: There is a trend toward treating memory as a file system (e.g., a `memory.md` file) that the agent can navigate and maintain autonomously.
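The three approaches above can be combined in a small sketch: the agent suggests a memory, the user approves or rejects it, and approved entries live in a Markdown file that can later be edited. The class `MemoryFile` and its methods are hypothetical and only illustrate the workflow; they do not describe how Devin stores memories.

```python
from pathlib import Path


class MemoryFile:
    """Approved memories stored as bullet lines in a Markdown file."""

    def __init__(self, path):
        self.path = Path(path)
        self.pending: list[str] = []  # suggestions awaiting a user decision

    def suggest(self, note: str) -> str:
        """Queue a memory and return the question to show the user."""
        self.pending.append(note)
        return f"Do you want me to remember that {note}?"

    def resolve(self, note: str, approved: bool) -> None:
        """Apply the user's decision; only approved notes reach the file."""
        self.pending.remove(note)
        if approved:
            with self.path.open("a", encoding="utf-8") as handle:
                handle.write(f"- {note}\n")

    def read(self) -> list[str]:
        """Return the stored memories, one string per bullet line."""
        if not self.path.exists():
            return []
        text = self.path.read_text(encoding="utf-8")
        return [line[2:].strip() for line in text.splitlines()
                if line.startswith("- ")]

    def edit(self, old: str, new: str) -> None:
        """Replace an existing memory when a preference changes."""
        updated = [new if note == old else note for note in self.read()]
        body = "".join(f"- {note}\n" for note in updated)
        self.path.write_text(body, encoding="utf-8")
```

Keeping the store as a plain file means the same entries are readable and editable by both the agent and the human, at the cost of leaving retrieval (deciding which memories are relevant to a task) unsolved.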
Risks of "Vibe Coding" and Codebase Decay
Uncontrolled "vibe coding"—auto-merging AI code without rigorous review—leads to codebase decay. Experiments showed that a codebase could be maintained this way for about two weeks before becoming unmanageable due to duplication and inconsistent patterns.
Key Risks:
- Regression to the Worst Engineer: If an engineer uses AI without auditing the code, the AI learns those poor patterns and replicates them, exponentially growing "slop."
- AI Code Smells: Common AI-generated patterns include excessive use of `getattr` in Python (reward hacking to avoid crashes) and unnecessary backwards compatibility imports to avoid modifying module names.
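To make the `getattr` smell concrete, here is a small sketch contrasting defensive and direct attribute access, plus a rough detector built on the standard `ast` module. The `User` class and all function names are hypothetical, and the detector is a simple heuristic that assumes the smell shows up as `getattr` called with a default value.

```python
import ast
from dataclasses import dataclass


@dataclass
class User:
    name: str
    email: str


def greeting_smelly(user) -> str:
    # Smell: "username" does not exist on User, but the default value
    # hides the mistake instead of raising AttributeError.
    return f"Hello, {getattr(user, 'username', 'unknown')}"


def greeting_direct(user: User) -> str:
    # Direct access fails loudly if the attribute name is wrong.
    return f"Hello, {user.name}"


def count_defaulted_getattr(source: str) -> int:
    """Count getattr(obj, name, default) calls in a piece of Python source."""
    return sum(
        1
        for node in ast.walk(ast.parse(source))
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == "getattr"
        and len(node.args) == 3
    )
```

The smelly version never crashes, which is why it can look like success to a model optimizing for passing runs, while the actual bug (the wrong attribute name) stays in the codebase. A count like this could feed a review checklist, though legitimate uses of `getattr` with a default exist and need human judgment.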
Production Use Cases for Background Agents
Beyond standard feature development, background agents are being deployed for:
- SRE Auto-Triage: Agents acting as first responders to alerts (from Sentry, DataDog, etc.), collecting context from logs and databases, and proposing a PR to fix the issue before a human intervenes.
- Non-Engineer Contributions: Product Managers (PMs) or marketing teams shipping quick bug fixes or changes directly via Slack prompts.
- Customer Support: Agents analyzing customer-reported bugs with full codebase context to provide immediate technical answers or triage for engineering.
- Security Scanning: Continuous autonomous security reviews of the codebase.