The era of agentic AI demands we move beyond optimising single, monolithic models. We must now architect for 'compound AI'—complex graphs of heterogeneous models and tools. This requires a fundamental rethink of our inference stacks, from GPU scheduling to cost attribution.
The End of the Monolithic Era
The industry's recent pivot, underscored by announcements like Microsoft's agentic AI suite at Build 2026, is not merely an evolution; it is a rupture. For the past two years, the central challenge for AI platform teams has been serving a single, large foundation model efficiently. That problem is now largely solved. Frameworks like vLLM 0.5.1 with PagedAttention and TensorRT-LLM 10.2 with in-flight batching have tamed most of the throughput and latency problems of request-response workloads. We have become exceptionally good at optimising a single, well-defined task: generating text from a prompt, fast.
This architectural pattern is now a dead end. Its optimisations are predicated on a homogeneous workload and a single computational graph. Agentic systems are the antithesis of this. They are not monolithic; they are compound systems. An agent designed to act on a user's behalf (say, to analyse a quarterly sales report and draft an executive summary) does not execute a single model. It orchestrates a workflow: it might call a powerful reasoning model (a GPT-5.4 class LLM) to understand the high-level request, then delegate to a smaller, finetuned model to extract structured data from a PDF, call an external API for current market data, use a code interpreter to generate visualisations, and finally pass the synthesised context back to the primary model for summarisation.
This is a 'compound AI' workload, an execution graph of heterogeneous models, tools, and data sources. Attempting to serve this with an architecture designed for a single chatbot is like trying to run a modern containerised microservices application on a single, monolithic mainframe. The underlying assumptions are wrong, and the performance bottlenecks manifest in entirely new and unexpected places.
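To make the idea of an execution graph concrete, here is a minimal sketch of the sales-report workflow described above, expressed as a dependency graph. The step names and the `Step` type are hypothetical and chosen only for illustration; the point is that the graph itself, not any single model call, tells you which steps can run in parallel.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass
class Step:
    name: str
    kind: str  # "llm", "small_model", "api" or "tool" (illustrative labels)
    depends_on: list = field(default_factory=list)


# Hypothetical task graph for the quarterly sales report example.
STEPS = [
    Step("understand_request", "llm"),
    Step("extract_pdf_data", "small_model", ["understand_request"]),
    Step("fetch_market_data", "api", ["understand_request"]),
    Step("generate_charts", "tool", ["extract_pdf_data", "fetch_market_data"]),
    Step("summarise", "llm", ["generate_charts"]),
]


def parallel_stages(steps):
    """Group steps into stages; every step in a stage can run concurrently."""
    sorter = TopologicalSorter({s.name: set(s.depends_on) for s in steps})
    sorter.prepare()
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # steps whose dependencies are done
        stages.append(ready)
        sorter.done(*ready)
    return stages


if __name__ == "__main__":
    for i, stage in enumerate(parallel_stages(STEPS)):
        print(i, stage)
```

In this sketch the PDF extraction and the market-data call land in the same stage, which is exactly the kind of opportunity a serving stack built around one model and one request never gets to exploit.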
Architectural Patterns for Compound Inference
To effectively serve agentic workflows, we must move from model-centric to workflow-centric architectures. The primary objective is no longer maximising tokens per second from one model, but minimising the end-to-end latency and cost of a completed task graph. Two dominant patterns are emerging, each with significant trade-offs.
Most AI platforms are being built on a fundamentally outdated architectural premise. They are optimising for LLM-as-a-service, when the future is workflow-as-a-service.
The first pattern is the **Centralised Orchestrator with Distributed Specialists**. In this model, a large, state-of-the-art reasoning engine runs on a dedicated, high-performance GPU cluster (e.g., banks of H200s or B100s). This orchestrator model acts as the 'brain', decomposing tasks and calling out to other services over the network. These could be smaller, specialised models served as independent microservices: perhaps a 13B parameter model finetuned for SQL generation running on a cost-effective L40S, or a vision model on an NVIDIA L4. The primary benefit here is isolation and independent scalability. The SQL generation service can be scaled based on its specific demand, and its hardware can be right-sized for its task. The critical drawback is network latency. Every hop from the orchestrator to a specialist incurs a network round trip, plus serialisation and queueing overhead. Across a graph with many sequential hops this can accumulate to hundreds of milliseconds, which can dwarf the actual inference time of a small specialist model.
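A rough way to reason about this trade-off is to compute end-to-end latency along the critical path of the graph, adding a per-hop overhead to every remote step. The sketch below does that; the graph, the inference times and the 40 ms hop overhead are all made-up illustrative numbers, not measurements.

```python
from functools import lru_cache

# name -> (inference_ms, is_remote_hop, dependencies). Illustrative values only.
GRAPH = {
    "plan":      (400.0, False, ()),
    "sql_gen":   (30.0,  True,  ("plan",)),
    "vision":    (25.0,  True,  ("plan",)),
    "summarise": (500.0, False, ("sql_gen", "vision")),
}


def end_to_end_ms(graph, hop_overhead_ms):
    """Finish time of the slowest path, assuming independent steps run in parallel."""

    @lru_cache(maxsize=None)
    def finish(name):
        inference_ms, is_remote, deps = graph[name]
        # A step starts once its slowest dependency has finished.
        start = max((finish(d) for d in deps), default=0.0)
        return start + inference_ms + (hop_overhead_ms if is_remote else 0.0)

    return max(finish(name) for name in graph)


if __name__ == "__main__":
    print(end_to_end_ms(GRAPH, 0.0))   # no network overhead
    print(end_to_end_ms(GRAPH, 40.0))  # 40 ms assumed per remote hop
```

Note that the two specialist calls run in parallel here, so only one hop's overhead lands on the critical path. The overhead becomes painful when hops are sequential, for example an agent loop that calls a specialist, inspects the result, and calls again.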
The second pattern is the **Co-located Model Ensemble**. Here, multiple models are loaded onto the same physical server, often within the same process, managed by a serving engine such as NVIDIA's Triton Inference Server. Triton's Business Logic Scripting (BLS) and ensemble models allow a developer to define the entire workflow graph, which Triton then executes internally. A request comes in, Triton routes it to the first model, takes the output, and feeds it directly to the next model in the sequence, all within the server's memory space. This virtually eliminates inter-model network latency. The challenge, however, is the immense operational complexity. You are now playing a high-stakes game of 'GPU memory Tetris', trying to fit multiple models with different memory footprints and compute requirements onto a single device. Scheduling becomes a nightmare: a long-running batch job for one model can starve the real-time orchestrator model of compute, leading to unacceptable user-facing latency.
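The 'GPU memory Tetris' problem is, at its simplest, a bin-packing problem. The sketch below uses a first-fit-decreasing heuristic to place models onto GPUs by weight footprint alone. The model names, sizes and the 16 GiB headroom reserved for KV cache and activations are assumptions for illustration, and the heuristic deliberately ignores compute contention, which is the harder half of the problem.

```python
def pack_models(model_gib, gpu_capacity_gib, headroom_gib=0.0):
    """First-fit decreasing: return a list of GPUs, each a list of model names."""
    usable = gpu_capacity_gib - headroom_gib
    gpus = []  # each entry is [free_gib, [model names]]
    # Place the largest models first; small ones fill the remaining gaps.
    for name, size in sorted(model_gib.items(), key=lambda kv: kv[1], reverse=True):
        if size > usable:
            raise ValueError(f"{name} needs {size} GiB, more than one GPU can offer")
        for gpu in gpus:
            if gpu[0] >= size:
                gpu[0] -= size
                gpu[1].append(name)
                break
        else:
            # No existing GPU has room, so allocate another one.
            gpus.append([usable - size, [name]])
    return [names for _, names in gpus]


# Hypothetical weight footprints in GiB.
MODELS = {"orchestrator": 40, "sql": 26, "vision": 12, "reranker": 6, "embedder": 4}

if __name__ == "__main__":
    print(pack_models(MODELS, gpu_capacity_gib=80, headroom_gib=16))
```

With these numbers the orchestrator shares a device with three small models while the SQL model gets a GPU to itself. Whether that placement is acceptable depends on the scheduling problem described above: memory fit says nothing about who wins when both want the compute units at once.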
The GPU Scheduling and Memory Conundrum
Regardless of the pattern chosen, the core infrastructure challenge is managing a heterogeneous pool of GPU resources efficiently. Statically partitioning a cluster—allocating specific nodes for specific models—is simple but grossly inefficient. Utilisation rates for such clusters often struggle to exceed 40% because workloads are dynamic. The demand for your code generation model might spike for an hour and then sit idle for the rest of the day, leaving expensive silicon unused.
The critical mental shift for architects is from scheduling inference requests to scheduling computational graphs. The unit of work is no longer a single prompt; it is the entire multi-step agentic task.
This necessitates a dynamic scheduling and routing layer. This 'inference gateway' must be intelligent. It needs to inspect an incoming task graph and make real-time decisions. For a given step, should it route the request to an AWQ-quantised model on an A10G for a low-cost, quick response, or to a full FP16 model on an H100 for maximum accuracy? The gateway needs to be aware of the real-time load on all servers, the current GPU memory availability, and the priority of the request. This is analogous to the routing logic inside a Mixture-of-Experts (MoE) model, but applied at the infrastructure level: each task is routed to the best 'expert' model server for the job.
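As a minimal sketch of the routing decision just described, the function below filters backends by a quality floor, free GPU memory and queue depth, then picks the cheapest candidate (or the least loaded one for high-priority requests). The `Backend` fields, the quality scores and the per-call costs are hypothetical; a real gateway would pull these from live telemetry.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    quality: float        # relative accuracy score in [0, 1] (hypothetical)
    cost_per_call: float  # USD, illustrative
    queue_depth: int      # requests currently waiting
    free_gpu_gib: float   # memory available for KV cache and activations


def route(backends, min_quality, needed_gib, max_queue, high_priority=False):
    """Return the chosen backend, or None if nothing satisfies the constraints."""
    candidates = [
        b for b in backends
        if b.quality >= min_quality
        and b.free_gpu_gib >= needed_gib
        and b.queue_depth <= max_queue
    ]
    if not candidates:
        return None
    if high_priority:
        # Latency-sensitive work: prefer the shortest queue, then the lower cost.
        return min(candidates, key=lambda b: (b.queue_depth, b.cost_per_call))
    # Everything else: prefer the lower cost, then the shorter queue.
    return min(candidates, key=lambda b: (b.cost_per_call, b.queue_depth))


BACKENDS = [
    Backend("awq-a10g", quality=0.92, cost_per_call=0.0004, queue_depth=6, free_gpu_gib=8.0),
    Backend("fp16-h100", quality=0.97, cost_per_call=0.0060, queue_depth=1, free_gpu_gib=30.0),
]

if __name__ == "__main__":
    print(route(BACKENDS, min_quality=0.90, needed_gib=2.0, max_queue=10).name)
    print(route(BACKENDS, min_quality=0.90, needed_gib=2.0, max_queue=10,
                high_priority=True).name)
```

The same request goes to the quantised model by default and to the H100 when flagged high priority, because the selection key changes from cost-first to queue-first. Keeping that policy in a small, explicit function is one way to make the trade-off auditable.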
Managing memory is equally critical. Continuously loading and unloading models onto GPUs is slow and leads to memory fragmentation. Solutions like NVIDIA's Multi-Instance GPU (MIG) help by physically partitioning a large GPU like an H100 into up to seven smaller, isolated instances. This is an effective tool for the co-located ensemble pattern, allowing you to guarantee resources for a high-priority orchestrator while other instances handle specialist tasks. However, MIG instances come in a fixed set of profile sizes and cannot be resized while workloads are running on them, so they lack the flexibility to adapt to dynamically changing model sizes and workload demands.
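The rigidity of MIG is easiest to see with a small sizing exercise. The sketch below picks the smallest profile that fits a model and reports how much memory is stranded. The profile table lists nominal sizes for an 80 GB H100 as an assumption; usable memory is somewhat lower in practice, the table omits some variants, and real placement rules are stricter than this, so check the profiles your driver reports (for example with `nvidia-smi mig -lgip`) before relying on it.

```python
# Assumed nominal MIG profiles for an 80 GB H100: (name, compute slices, memory GB).
H100_80GB_PROFILES = [
    ("1g.10gb", 1, 10),
    ("2g.20gb", 2, 20),
    ("3g.40gb", 3, 40),
    ("4g.40gb", 4, 40),
    ("7g.80gb", 7, 80),
]


def smallest_profile(required_gb, min_compute_slices=1):
    """Return (profile name, stranded GB) for the smallest fit, or None."""
    for name, slices, memory_gb in H100_80GB_PROFILES:  # ordered small to large
        if memory_gb >= required_gb and slices >= min_compute_slices:
            return name, memory_gb - required_gb
    return None


if __name__ == "__main__":
    print(smallest_profile(14))     # a model needing 14 GB
    print(smallest_profile(35, 4))  # 35 GB, but needs at least 4 compute slices
    print(smallest_profile(90))     # does not fit on one GPU at all
```

A model that needs 14 GB is rounded up to a 20 GB instance, and that gap cannot be lent to a neighbour. Multiply this across several instances and the cost of fixed profile sizes becomes visible.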
Rethinking the Economics of Inference
In the monolithic world, cost was simple: dollars per million input and output tokens. In a compound AI system, this metric is dangerously misleading. The total cost of a task is the sum of all node executions in its graph. This fundamentally changes the optimisation game.
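A sketch of per-task cost attribution follows: each node in an executed graph is priced separately and the task cost is the sum. The model tiers and the per-million-token prices are invented for illustration, and tool or external API calls would need their own cost entries in a real system.

```python
from collections import defaultdict

# Hypothetical prices in USD per million tokens: (input, output).
PRICE_PER_M_TOKENS = {
    "frontier": (5.00, 15.00),
    "small": (0.10, 0.30),
}


def node_cost(model, input_tokens, output_tokens):
    """Cost of one model call in USD."""
    price_in, price_out = PRICE_PER_M_TOKENS[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


def task_cost(trace):
    """trace: list of (model, input_tokens, output_tokens) for one completed task."""
    by_model = defaultdict(float)
    for model, input_tokens, output_tokens in trace:
        by_model[model] += node_cost(model, input_tokens, output_tokens)
    return sum(by_model.values()), dict(by_model)


# Illustrative trace of one agentic task: plan, extract, classify, summarise.
TRACE = [
    ("frontier", 2_000, 300),
    ("small", 8_000, 500),
    ("small", 1_000, 100),
    ("frontier", 6_000, 800),
]

if __name__ == "__main__":
    total, breakdown = task_cost(TRACE)
    print(f"total ${total:.5f}", breakdown)
```

In this made-up trace the two frontier calls account for almost all of the cost, even though the small model processed more input tokens. A per-token average across the whole task would hide that completely.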
A single API call to a frontier model to perform a simple classification might cost a fraction of a cent, but if that task is performed a million times a day, the cost accumulates. If a smaller, finetuned 8B parameter model can perform that same classification task with 99.9% of the accuracy for 1/50th of the cost, the choice is obvious. The architecture must support and encourage this cost-based routing. The inference gateway should be programmable with business logic that allows it to make these trade-offs intelligently.
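The arithmetic behind that choice is worth writing down, because it also tells you when the choice stops being obvious. The sketch below takes the figures from the paragraph above (a million calls a day, 1/50th of the cost, 99.9% of the accuracy) and adds two assumptions of its own: a frontier price of $0.002 per call and a frontier accuracy of 98%.

```python
CALLS_PER_DAY = 1_000_000
FRONTIER_COST = 0.002      # USD per call; assumed "fraction of a cent"
FRONTIER_ACCURACY = 0.98   # assumed baseline accuracy

SMALL_COST = FRONTIER_COST / 50
SMALL_ACCURACY = FRONTIER_ACCURACY * 0.999


def daily_cost(calls, cost_per_call):
    return calls * cost_per_call


def break_even_error_cost(calls, cost_a, acc_a, cost_b, acc_b):
    """USD an error must cost before the pricier, more accurate model A pays off."""
    saving = daily_cost(calls, cost_a) - daily_cost(calls, cost_b)
    extra_errors = calls * (acc_a - acc_b)
    return saving / extra_errors


if __name__ == "__main__":
    print(daily_cost(CALLS_PER_DAY, FRONTIER_COST))  # frontier spend per day
    print(daily_cost(CALLS_PER_DAY, SMALL_COST))     # small-model spend per day
    print(break_even_error_cost(CALLS_PER_DAY, FRONTIER_COST, FRONTIER_ACCURACY,
                                SMALL_COST, SMALL_ACCURACY))
```

Under these assumptions the small model saves about $1,960 a day at the price of roughly 980 additional misclassifications, so it is the right choice unless a single error costs the business more than about $2. That threshold is the kind of business logic the gateway should be able to encode.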
This also means our approach to quantisation and other optimisation techniques must become more nuanced. Techniques like Activation-aware Weight Quantization (AWQ) or speculative decoding are not all-or-nothing decisions applied to the entire system. For the primary orchestrator model that interacts with the user, minimising latency is paramount. Here, you might deploy speculative decoding, trading some additional compute for a significant reduction in perceived response time. For a backend model that processes documents in batches, raw throughput is the goal, and techniques like FP8 quantisation and aggressive batching are more appropriate. Architecting the agentic stack is about building a platform that can support this diversity of models and optimisation strategies, orchestrating them into a cohesive, efficient, and cost-effective whole.