
AI agents are complex systems that go beyond simple LLM query-response patterns. The agent stack consists of four layers spanning reasoning, tool use, oversight, and orchestration, powered by different platforms and AI models. Baseten enables modular deployment for agentic workloads with independent scaling per component, while Patronus AI's Percival debugger catches failures across 20+ error types for agents in production.
Agents are more complex than traditional AI workloads. While LLMs operate on a query → response basis (a 1:1 relationship), agents take in a query, decompose it into subtasks, and take action on the user’s behalf (a 1:many relationship).
In other words, while an LLM fetches information from a variety of data sources to produce an answer, an agent both fetches information and performs actions on the user's behalf.

This means that agentic workflows must coordinate multiple moving parts: reasoning over the user's intent, deciding which tools to use, executing calls to those tools, and integrating the results. Supporting this added complexity requires a unique tech stack.
In this blog, we outline that stack: the architecture, tools, and infrastructure needed to build and deploy reliable AI agents. We'll start with the models and platforms you can use, then explain how to link those components together, deploy them to production, and evaluate performance in terms of both quality and responsiveness.
The anatomy of an AI agent
The agentic stack has four layers:
The cognitive layer: This layer is powered by LLMs. It’s where user intent is parsed, the course of action is determined, and decisions are made regarding action coordination.
Tool interaction: Agents must use different tools, such as databases, APIs, and other models to satisfy user intent. This layer interacts closely with the cognitive layer to determine which tools to access for various subtasks.
Oversight: This layer is responsible for agent safety and alignment. It consists of grounding, guardrails, and tracing.
Orchestration: Finally, your agent may have to interact with other agents; in this case, an orchestration layer coordinates across multiple specialized agents, managing communication, delegation, and review.
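To make these layers concrete, here's a minimal, framework-free sketch of how they fit together in a single agent loop. The names here (call_llm, passes_guardrails, TOOLS) are hypothetical stand-ins for your model endpoint, guardrail checks, and tool integrations, not any particular library's API.

```python
# A minimal, framework-free sketch of the agent loop described above.
# call_llm, passes_guardrails, and TOOLS are hypothetical stand-ins; in a real
# system they would wrap your model endpoint, guardrails, and tool integrations.

def call_llm(history: list[dict]) -> dict:
    """Cognitive layer stub: a real implementation would call an LLM endpoint."""
    return {"answer": "stubbed answer"}

def passes_guardrails(answer: str, history: list[dict]) -> bool:
    """Oversight layer stub: a real implementation would check grounding/safety."""
    return True

TOOLS = {  # Tool interaction layer: registry of callables the agent may invoke.
    "search_db": lambda query: f"results for {query!r}",
}

def run_agent(user_query: str, max_steps: int = 5) -> str:
    history = [{"role": "user", "content": user_query}]
    for _ in range(max_steps):
        # Cognitive layer: parse intent and decide the next action.
        decision = call_llm(history)
        if "answer" in decision:
            # Oversight layer: only return answers that pass guardrails.
            if passes_guardrails(decision["answer"], history):
                return decision["answer"]
            history.append({"role": "system", "content": "Revise: failed guardrails."})
            continue
        # Tool interaction layer: execute the chosen tool and feed results back.
        result = TOOLS[decision["action"]](**decision["input"])
        history.append({"role": "tool", "content": str(result)})
    return "Step budget exhausted without a final answer."

if __name__ == "__main__":
    print(run_agent("What were last quarter's top products?"))
```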

Building these layers requires combining different models, tools, and platforms. Let’s look at the agentic tech stack.
The agent builder’s tech stack
To power the cognitive layer of agents, builders typically leverage state-of-the-art LLMs. Popular model families include:
Llama: A good all-rounder with long-context support; Llama 4 Scout offers a 10M-token context window and Llama 4 Maverick a 1M-token window.
Qwen: A popular choice for coding tasks; typically lighter-weight to run due to the smaller size of many models in the Qwen family.
DeepSeek: Powerful for many applications, with solid built-in structured outputs and tool use in both DeepSeek V3 and R1 0528.
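Most of these models can be served behind an OpenAI-compatible API, which makes it easy to exercise their structured-output and tool-calling support. The sketch below assumes a placeholder endpoint URL, API key, and model ID; swap in your own deployment.

```python
# Sketch: exercising a model's tool-calling support through an
# OpenAI-compatible endpoint. The base_url, api_key, and model name are
# placeholders for whatever deployment you are running.
from openai import OpenAI

client = OpenAI(base_url="https://your-model-endpoint/v1", api_key="YOUR_API_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Look up the status of a customer order.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",  # placeholder model ID
    messages=[{"role": "user", "content": "Where is order 8123?"}],
    tools=tools,
)

# If the model decides a tool is needed, it returns a structured tool call
# instead of free text; the agent executes it and sends the result back.
print(response.choices[0].message.tool_calls)
```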
Popular platforms for building agents include:
Langchain: This is currently the most popular platform for building agents. It offers a large amount of flexibility and control in design.
Pydantic AI: Built by the Pydantic team, it offers strong validation capabilities and data schema enforcement, which is beneficial for producing clean and reliable data (see the sketch after this list).
smolagents: This is lightweight, easy to use, and well integrated with the Hugging Face ecosystem, making it a solid choice for development.
CrewAI: CrewAI facilitates agent collaboration, making it ideal for building task-oriented agents that work together.
AutoGen: This is a good solution for those seeking low-code or no-code platforms to get started with building agents.
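The schema-enforcement point above is easy to illustrate with plain Pydantic, independent of any particular agent framework: validate the model's raw output against a typed schema before the rest of the workflow consumes it. The schema and JSON string below are made up for illustration.

```python
# Sketch: validating an LLM's raw JSON output against a typed schema with
# Pydantic before downstream steps consume it. The schema is illustrative.
from pydantic import BaseModel, ValidationError

class TicketTriage(BaseModel):
    category: str
    priority: int
    needs_human_review: bool

raw_llm_output = '{"category": "billing", "priority": 2, "needs_human_review": false}'

try:
    triage = TicketTriage.model_validate_json(raw_llm_output)
    print(triage.category, triage.priority)
except ValidationError as err:
    # Malformed or off-schema output is caught here instead of silently
    # propagating into later tool calls.
    print("LLM output failed validation:", err)
```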
Once you’ve selected your models and tools, you’ll need to link them together into a coherent workflow.
Building agentic workflows
Modern agent systems require sophisticated orchestration that goes beyond simple request-response patterns. They need modular, scalable workflows, where each component can run on optimal hardware while maintaining frictionless communication.
Without proper orchestration, you can run into:
Cumbersome, inefficient scripts to coordinate between models
Inefficient hardware utilization or processing bottlenecks
Complex error handling and retry logic
For these reasons, building modular deployments (where each model or processing step in an agentic workflow is deployed as an independent unit) is a best practice for agentic systems. Each component can run on the hardware best suited to it (CPUs for lighter tasks, more powerful GPUs for compute-intensive models) and, ideally, scale independently as needed. Providing these capabilities is exactly what frameworks like Baseten Chains are designed for.
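For a rough sense of what this looks like in code, the sketch below follows the chainlet pattern Baseten Chains uses: each processing step is its own class that can be deployed, configured, and scaled independently. Treat it as illustrative rather than copy-paste; the names are simplified and the exact APIs (including per-chainlet compute configuration) are covered in the Chains docs.

```python
# Illustrative sketch of a modular agentic workflow in the style of
# Baseten Chains: each step is a "chainlet" that deploys and scales on its
# own. Details are simplified; consult the Chains docs for exact APIs.
import truss_chains as chains


class Retriever(chains.ChainletBase):
    # A lightweight component; in practice it could run on CPU-only hardware.
    def run_remote(self, query: str) -> list[str]:
        return [f"doc relevant to {query!r}"]  # placeholder retrieval logic


@chains.mark_entrypoint
class AgentWorkflow(chains.ChainletBase):
    # The entrypoint depends on the retriever; the framework wires the remote call.
    def __init__(self, retriever=chains.depends(Retriever, retries=2)) -> None:
        self._retriever = retriever

    def run_remote(self, user_query: str) -> str:
        context = self._retriever.run_remote(user_query)
        # A real workflow would also call an LLM chainlet (typically on a GPU
        # instance) with this context before returning an answer.
        return f"Answer based on {len(context)} retrieved documents."
```

In a full workflow, you would add further chainlets (an LLM, a reranker, a guardrail check), each sized to its own hardware needs and scaled on its own traffic pattern.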

Building your agent is half the battle—deploying it to production is the other half.
Deploying agents into production
Deploying agents involves serving and scaling multiple interdependent components such as planners, retrievers, and model backends. This requires infrastructure that goes beyond basic model hosting, with support for:
Component-specific autoscaling: Different parts of the agent pipeline experience distinct load patterns. Inference infrastructure should allow independent scaling (e.g., scale retrievers separately from LLMs) to avoid under- or overprovisioning, performance bottlenecks, or wasted money on idle GPUs.
Heterogeneous hardware support: Components should run on optimal hardware for their role—e.g., orchestration logic on CPU, embedding models on lighter-weight GPUs, and large generative models on more powerful GPU instances. This improves both performance and cost-efficiency.
Fault isolation: Since agents span multiple services, a failure in one component (e.g., a vector database or router) can take down the entire pipeline. Use active-active topologies, retry logic, and circuit breakers to support reliability (a minimal sketch of the latter two follows this list).
Latency-aware routing and orchestration: Especially for interactive agents, infrastructure should support prioritizing low-latency paths, as well as intelligent or geo-aware routing.
Observability and evaluation: Fine-grained logs, traces, and evaluation tooling (e.g., Patronus AI) are essential to debug, monitor, and improve agent performance in production settings.
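To make the fault-isolation bullet concrete, here's a minimal, framework-free sketch of retries with exponential backoff plus a simple circuit breaker around a flaky downstream call. Production systems would typically lean on hardened libraries or platform features rather than a hand-rolled version like this.

```python
# Minimal sketch of retries with exponential backoff plus a simplified circuit
# breaker around a flaky downstream component (e.g., a vector DB client).
import time


class CircuitBreaker:
    """Simplified circuit breaker: fail fast while a dependency is unhealthy."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.consecutive_failures = 0
        self.opened_at: float | None = None

    def call(self, fn, *args, **kwargs):
        # While the breaker is open, fail fast instead of hammering the dependency.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                raise RuntimeError("circuit open: dependency marked unhealthy")
            self.opened_at = None  # cool-down elapsed: allow a trial call

        # Retry with exponential backoff before counting this as a failure.
        last_error: Exception | None = None
        for attempt in range(3):
            try:
                result = fn(*args, **kwargs)
                self.consecutive_failures = 0
                return result
            except Exception as err:  # real code would catch narrower errors
                last_error = err
                time.sleep(0.5 * 2 ** attempt)

        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
        raise RuntimeError("dependency call failed after retries") from last_error


# Usage: give each external dependency its own breaker, e.g.
# vector_db_breaker = CircuitBreaker()
# results = vector_db_breaker.call(vector_db_client.search, "user query")
```

Wrapping each external dependency (vector database, router, model endpoint) in its own breaker lets one failing service degrade gracefully instead of cascading through the pipeline.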
Once deployed, the challenge shifts from getting agents to run to making sure they behave reliably and effectively in real-world conditions.
That’s where evaluation comes in.
Measuring agent success in production
Determining success is difficult with agents due to their increased complexity. While traditional evaluation methods like LLM-as-a-judge score individual outputs, evaluating agents requires analyzing the full trajectory of their behavior: their actions need to be contextualized within a dynamic environment to uncover systemic patterns and failure modes that would otherwise propagate.
Points of failure—such as miscommunication, accessing improper data sources, or delegation errors—can occur in any layer. Additionally, many agent benchmarks do not adequately capture real-world context, nuances, or performance.
To provide a more realistic alternative, Patronus AI developed TRAIL, a benchmark designed to debug and identify errors in complex AI agent workflows. Built on top of TRAIL is Percival, an AI agent debugger that evaluates agents across all major failure types and integrates with platforms like Langchain, Pydantic AI, and smolagents.
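Even before adopting dedicated tooling, a lightweight way to get trajectory-level visibility is to record every step an agent takes as a structured trace that can be replayed and scored later. The record shape below is a hypothetical illustration, not TRAIL's or Percival's format.

```python
# Sketch: capturing an agent's full trajectory as structured records so it can
# be replayed and evaluated later. The schema is hypothetical and not tied to
# TRAIL or Percival; dedicated tools provide richer traces out of the box.
import json
import time
from dataclasses import dataclass, field, asdict


@dataclass
class TraceStep:
    step: int
    layer: str           # "cognitive", "tool", "oversight", "orchestration"
    action: str          # e.g. "plan", "call:search_db", "guardrail_check"
    inputs: dict
    outputs: dict
    latency_ms: float


@dataclass
class AgentTrace:
    query: str
    steps: list[TraceStep] = field(default_factory=list)

    def record(self, layer: str, action: str, inputs: dict, fn):
        start = time.perf_counter()
        outputs = fn()
        latency_ms = (time.perf_counter() - start) * 1000
        self.steps.append(TraceStep(len(self.steps), layer, action,
                                    inputs, outputs, latency_ms))
        return outputs

    def dump(self) -> str:
        return json.dumps(asdict(self), indent=2)


# Usage: wrap each step so the whole trajectory is available for evaluation.
trace = AgentTrace(query="Where is order 8123?")
trace.record("tool", "call:get_order_status", {"order_id": "8123"},
             lambda: {"status": "shipped"})
print(trace.dump())
```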
Closing the loop
As agents grow in complexity and capability, building and evaluating them requires a new stack—spanning models, infrastructure, orchestration, and oversight.
If you're building and shipping production AI agents, you can reach out to Baseten's engineers for help deploying agentic workflows in production, and Patronus AI for support evaluating and improving their performance for real users.