Building the Agent Native Office: Lessons from Datadog

Scaling AI Agents from Demos to Production

Scaling AI agents from a few prototypes to an "agent-native office" requires shifting focus from raw intelligence, which is no longer the primary bottleneck, to infrastructure, durability, and evaluation. For enterprises, the goal is to move beyond "pretty demos" toward a fleet of self-healing, cloud-deployed agents that handle diverse workloads across SRE, development, and security.

The Datadog Agent Trifecta

Datadog has implemented three primary agent types to automate core operational tasks:

AI SRE Agent: Automatically debugs system problems, reducing the manual burden on Site Reliability Engineering teams.
AI Dev Agent (Bits AI Dev): Writes code to fix errors and problems identified in the system.
Security Analyst Agent: Investigates suspicious signals in SIEM products to determine whether a security issue is real, automating initial triage.

Core Principles for Agent-Native Infrastructure

To scale to hundreds of agents, organizations must move away from simple chat interfaces and toward a structured, agent-first operational model.

Agent-First UX and the "New Bezos Mandate"

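User experience design must evolve to treat automated agents as first-class users. This means moving beyond human-centric visuals to provide agent-friendly interfaces. The name alludes to Jeff Bezos's early-2000s mandate at Amazon that every team expose its functionality through service interfaces; the updated version asks that every capability also be reachable by agents.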

Key implementations include:

Agent-Friendly Interfaces: Exposing every piece of customer-facing functionality through MCP (Model Context Protocol), APIs, and skills.
Documentation Optimization: Serving documentation as Markdown (.md) and publishing an llms.txt file so that LLMs can consume it easily.
Internal Validation: Teams should perform their own tasks using agents periodically to ensure the interfaces are functional and intuitive for non-human users.
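
As an illustration of the documentation point above, here is a minimal sketch of generating an llms.txt-style index. It assumes the commonly described layout (a title line, a short blockquote summary, then sections of Markdown links); the names DocLink and render_llms_txt, the product name, and the example.com URL are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass
class DocLink:
    title: str
    url: str
    note: str = ""  # optional one-line description shown after the link


def render_llms_txt(site_name: str, summary: str,
                    sections: dict[str, list[DocLink]]) -> str:
    """Render an llms.txt-style index: H1 title, blockquote summary, H2 link lists."""
    lines = [f"# {site_name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for link in links:
            suffix = f": {link.note}" if link.note else ""
            lines.append(f"- [{link.title}]({link.url}){suffix}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# Hypothetical content, for illustration only.
example = render_llms_txt(
    "Example Product Docs",
    "Hypothetical documentation index for agents.",
    {"Docs": [DocLink("Quickstart",
                      "https://example.com/docs/quickstart.md",
                      "first steps")]},
)
print(example)
```

Pointing the links at .md versions of each page keeps the index and the pages themselves in a form an agent can read without rendering HTML.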

Proactive, Event-Driven Architectures

Chat is a useful modality for customer interaction, but it should not be the primary trigger for enterprise agents. Most agents should be proactive and event-driven, running in the background and triggered by system events rather than human prompts.

To ensure these background agents are reliable, the following are recommended:

Durability Layers: Using tools like Temporal to ensure agents are durable and can recover from timeouts or failures.
Sandboxing: Appropriately isolating agents to prevent data loss or unauthorized system changes.
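
To make the durability idea concrete, the following is a minimal plain-Python sketch of an event-triggered workflow that checkpoints after every step and retries failed steps. It is not Temporal's API and not Datadog's implementation; run_durable, the step names, and the file-based checkpoint store are assumptions chosen for illustration. A real durability layer would also handle timers, concurrency, and worker crashes.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable

Step = Callable[[dict], dict]  # each step takes and returns JSON-serializable state


def run_durable(event_id: str, event: dict, steps: list[tuple[str, Step]],
                store: Path, max_attempts: int = 3) -> dict:
    """Run steps in order, checkpointing after each so a restart resumes mid-workflow."""
    checkpoint = store / f"{event_id}.json"
    if checkpoint.exists():
        state = json.loads(checkpoint.read_text())
    else:
        state = {"done": [], "data": dict(event)}
    for name, step in steps:
        if name in state["done"]:
            continue  # finished in an earlier run; do not repeat it
        for attempt in range(1, max_attempts + 1):
            try:
                state["data"] = step(state["data"])
                break
            except Exception:
                if attempt == max_attempts:
                    raise  # checkpoint stays on disk so a later run can resume
        state["done"].append(name)
        checkpoint.write_text(json.dumps(state))
    return state["data"]


calls = {"triage": 0}


def triage(data: dict) -> dict:
    """Hypothetical step that times out on its first call."""
    calls["triage"] += 1
    if calls["triage"] == 1:
        raise TimeoutError("simulated timeout")
    return {**data, "severity": "high"}


def notify(data: dict) -> dict:
    """Hypothetical step that records a notification."""
    return {**data, "notified": True}


with tempfile.TemporaryDirectory() as tmp:
    # The trigger is a system event (here a made-up monitor alert), not a chat prompt.
    result = run_durable("evt-1", {"monitor": "cpu"},
                         [("triage", triage), ("notify", notify)], Path(tmp))
print(result)
```

Because completed step names are stored with the state, rerunning the same event ID after a crash skips work that already finished instead of repeating it.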

The Rigor of Evaluation (Eval)

Building agents without a strong evaluation framework leads to "vibe coding," where developers tweak tools without knowing if the agent is actually improving. A robust eval system requires three stages:

Offline Eval: Using a representative, measurable, and rerunnable dataset to test changes.
Online Eval: Using observability data to monitor how agents perform in the wild.
Continuous Feedback Loops: Regularly pulling real-world interaction traces back into the offline dataset to account for drift in customer behavior or model performance.
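
The three stages above can be sketched as a small harness. This is an illustrative assumption rather than the evaluation system described in the talk: Case, offline_eval, and fold_in_traces are hypothetical names, exact-match scoring stands in for whatever metric fits the task, and the trace dictionaries stand in for labelled observability data.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Case:
    prompt: str
    expected: str


def offline_eval(agent: Callable[[str], str], dataset: list[Case]) -> float:
    """Return the fraction of cases the agent answers exactly right."""
    if not dataset:
        raise ValueError("dataset is empty")
    passed = sum(agent(c.prompt).strip() == c.expected for c in dataset)
    return passed / len(dataset)


def fold_in_traces(dataset: list[Case], traces: list[dict]) -> list[Case]:
    """Append labelled production traces whose prompts are not already covered."""
    seen = {c.prompt for c in dataset}
    merged = list(dataset)
    for trace in traces:
        label = trace.get("label")
        if label is not None and trace["prompt"] not in seen:
            merged.append(Case(trace["prompt"], label))
            seen.add(trace["prompt"])
    return merged


# Toy agent and data, for illustration only.
dataset = [Case("2+2", "4"), Case("capital of France", "Paris")]


def toy_agent(prompt: str) -> str:
    return {"2+2": "4"}.get(prompt, "unknown")


score = offline_eval(toy_agent, dataset)

traces = [
    {"prompt": "3+3", "label": "6"},   # labelled: joins the dataset
    {"prompt": "5+5", "label": None},  # unlabelled: skipped
    {"prompt": "2+2", "label": "4"},   # duplicate prompt: skipped
]
merged = fold_in_traces(dataset, traces)
print(score, len(merged))
```

Rerunning offline_eval on the merged dataset after each change gives a number to compare against, which is the alternative to adjusting tools by feel.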

The "Bitter Lesson" of Agents

In the context of agents, the "bitter lesson" is that general methods leveraging off-the-shelf models will win over highly customized, hand-tuned agent logic. As models leapfrog each other in capability, specific tweaks often become obsolete.

Model and Framework Agnosticism

Because of "jagged intelligence" (the best general model may not be the best for a specific task), organizations should:

Stay Model Agnostic: Be comfortable swapping models quickly based on the evaluation data.
Stay Framework Agnostic: Avoid top-down mandates on which framework (e.g., LangGraph, the OpenAI Agents SDK, Pydantic AI) to use, so teams can experiment and pick the best tool for their specific workload.
Leverage Memory: Use memory agents to extract semantic knowledge and context from observability traces, ensuring that improvements are preserved even when the underlying model is swapped.
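
One way to keep model choice driven by evaluation data is to route each task to whichever model currently scores best on it. The sketch below assumes a per-task score table and a registry of interchangeable callables; pick_models, build_router, the model names, and the scores are all hypothetical.

```python
from typing import Callable

Model = Callable[[str], str]  # any model wrapped behind the same call signature


def pick_models(scores: dict[str, dict[str, float]]) -> dict[str, str]:
    """Given scores[task][model], return the best-scoring model name per task."""
    return {task: max(by_model, key=by_model.get)
            for task, by_model in scores.items() if by_model}


def build_router(registry: dict[str, Model],
                 scores: dict[str, dict[str, float]]) -> Callable[[str, str], str]:
    """Return a function that sends each task's prompt to its best-scoring model."""
    choice = pick_models(scores)

    def route(task: str, prompt: str) -> str:
        return registry[choice[task]](prompt)

    return route


# Made-up scores showing "jagged" results: neither model wins every task.
scores = {
    "log_triage": {"model-a": 0.81, "model-b": 0.77},
    "code_fix": {"model-a": 0.62, "model-b": 0.74},
}
registry = {
    "model-a": lambda prompt: f"a:{prompt}",
    "model-b": lambda prompt: f"b:{prompt}",
}

best = pick_models(scores)
router = build_router(registry, scores)
answer = router("code_fix", "fix the null check")
print(best, answer)
```

Swapping a model then means adding it to the registry and rerunning the evals, with no change to the code that calls the router.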

The Future of Agent Collaboration

Multiplayer functionality is shifting from "multiple mice on a screen" to collaboration between humans and agents, and between agents themselves.

Human-Agent Collaboration: Moving toward high-bandwidth interactions, such as sharing terminals or using voice and real-time interactions to guide agents.
Agent-to-Agent Communication: Establishing secure enclaves (e.g., restricted EKS clusters) where agents can share information and trigger one another safely.
Knowledge Sharing: Creating "skills hubs" or MCP hubs where team members can share and remix the tools and skills used by their agents.

Future Predictions for Enterprise AI

Learning on the Job: A shift toward reinforcement learning (RL) in the enterprise, where agents improve based on real-world outcomes.
Synthetic Environments: The creation of "world models" for specific products—synthetic versions of services where agents can be trained and tested against modeled human behavior.
Long-Horizon Planning: A transition from tasks lasting minutes to durable agents capable of executing workflows over several days.
Generative UI: The emergence of on-the-fly, custom-generated user interfaces tailored to the specific needs of a current observability task.
