The AI Factory's Bottleneck: Architecting the Inference Stack for Compound Agentic Latency

8 June 2026 · 8 min read

The era of measuring LLM performance by simple throughput is over. As enterprises build 'AI Factories' for agentic workloads, the critical bottleneck is now 'compound latency' — the end-to-end time for complex, multi-step tasks. This requires a fundamental rethink of the inference stack.

The industry discourse has coalesced around the concept of the "AI Factory." Driven by hyperscaler-scale investments and platforms like NVIDIA's DSX, organisations are now expected to operate their AI capabilities with industrial efficiency. Yet, a critical flaw exists in how most platform teams are measuring and architecting for this new reality. We remain fixated on legacy metrics like Time-To-First-Token (TTFT) and output tokens per second. These are dangerously obsolete in the face of agentic AI.

An agentic workflow is not a single inference call. It is a graph of operations: a planning step, a series of tool calls, data retrieval, synthesis, and final response generation. The user experiences the latency of the whole graph: the sum of every step that runs in sequence, or the longest dependency chain where steps can overlap. This is "compound latency," and it is the primary performance bottleneck and architectural challenge for any serious AI platform team today. Optimising a single node in this graph while ignoring the whole is a futile exercise. To architect a true AI Factory, we must engineer the entire inference stack to minimise this compound, end-to-end latency.
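As a rough illustration of the idea, the sketch below computes compound latency for a small workflow graph. The step names and millisecond values are hypothetical, chosen only to show the arithmetic. For a strictly sequential chain the figure is the plain sum; where independent steps can run in parallel, it is the longest dependency chain.

```python
from functools import lru_cache

# Hypothetical per-step latencies in milliseconds (illustrative values only).
LATENCY_MS = {
    "plan": 400,
    "tool_search": 250,
    "tool_database": 600,
    "synthesise": 700,
    "respond": 300,
}

# Each step lists the steps it must wait for before it can start.
DEPENDS_ON = {
    "plan": [],
    "tool_search": ["plan"],
    "tool_database": ["plan"],
    "synthesise": ["tool_search", "tool_database"],
    "respond": ["synthesise"],
}


@lru_cache(maxsize=None)
def finish_ms(step: str) -> int:
    """Earliest finish time of a step: its own latency plus its slowest dependency."""
    deps = DEPENDS_ON[step]
    return LATENCY_MS[step] + max((finish_ms(d) for d in deps), default=0)


sequential_total = sum(LATENCY_MS.values())  # every step run one after another
critical_path = finish_ms("respond")         # the two tool calls overlap

print(f"sequential: {sequential_total} ms, with parallel tool calls: {critical_path} ms")
```

In this toy graph, speeding up `tool_search` changes nothing for the user, because `tool_database` is the step on the critical path.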

Figure: Abstract image of interconnected nodes. Modern AI infrastructure must be architected for complex, interconnected agentic workflows, not single-shot inference.

Serving Engine Trade-offs: Throughput vs. Latency

The foundational decision for any inference stack is the serving engine. Today, the choice most often comes down to vLLM or NVIDIA's TensorRT-LLM. This is not a matter of which is "better," but a critical trade-off between maximising throughput in a multi-tenant environment and minimising latency for a single, high-priority task.

vLLM has become the de facto standard for many, primarily due to its pioneering implementation of PagedAttention. This innovation treats the KV cache like virtual memory, dramatically improving GPU memory utilisation and enabling high-throughput, continuous batching for diverse, unpredictable workloads. For a public-facing API endpoint serving thousands of disparate users with varying prompt lengths, vLLM's ability to achieve over 85% GPU utilisation makes it an economically sound choice.
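The virtual-memory analogy can be made concrete with a toy block allocator. This is a simplified sketch of the paging idea, not vLLM's implementation; the class name, method names, and block size are all hypothetical. Each sequence gets a block table mapping its logical positions to fixed-size physical blocks, so memory is claimed one block at a time rather than reserved up front for the maximum sequence length.

```python
class PagedKVAllocator:
    """Toy allocator: maps each sequence's logical blocks to physical KV blocks."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # sequence id -> list of physical block ids
        self.lengths = {}       # sequence id -> number of tokens stored

    def append_token(self, seq_id):
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        # A new physical block is needed only when the last one is full.
        if length == len(table) * self.block_size:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; a real engine would preempt or swap")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id):
        # Finished sequences return their blocks to the shared pool immediately.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


alloc = PagedKVAllocator(num_blocks=8, block_size=16)
for _ in range(20):
    alloc.append_token("request_a")  # 20 tokens need 2 blocks
for _ in range(5):
    alloc.append_token("request_b")  # 5 tokens need 1 block
print(alloc.block_tables, len(alloc.free_blocks))
```

Because blocks need not be contiguous, a short request and a long request can share the same pool without leaving unusable gaps between them.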

Conversely, TensorRT-LLM, particularly version 0.10.0 and beyond, is engineered for raw speed on a specific task. Its strength lies in ahead-of-time (AOT) compilation of the model into an optimised engine. It performs aggressive operator fusion and kernel auto-tuning to minimise kernel launch overhead, and it uses in-flight batching (its form of continuous batching) to keep the GPU busy as requests arrive and complete. For an internal agentic system performing a critical, multi-step financial analysis, shaving 150ms off each of ten sequential LLM calls in the chain is a 1.5-second reduction in compound latency. That is a tangible performance gain that can justify the architectural rigidity. The choice is clear: architect for multi-tenant throughput with vLLM, or for dedicated task latency with TensorRT-LLM.

The New Economics of Quantisation

Quantisation is no longer just a strategy for fitting large models into limited VRAM. It has become a primary lever for unlocking performance and fundamentally altering the cost structure of your AI Factory. The conversation has moved decisively beyond 16-bit floating-point formats (FP16/BF16) towards lower-precision representations like INT4 and, critically, 8-bit floating-point (FP8).

With hardware like NVIDIA's Hopper and Blackwell architectures, FP8 is not an afterthought; it's a design prerequisite for achieving maximum theoretical performance. Using the FP8 Transformer Engine on an H100 can deliver a significant throughput increase over FP16 with negligible quality loss for many models. This isn't a minor tweak; it's a step-change in performance per watt and per dollar.

3.7x: throughput gain on Llama 3 70B using FP8 vs FP16 on H100 GPUs.
75%: reduction in memory footprint moving from FP16 to INT4 quantisation.
<1ms: per-layer latency overhead for AWQ methods, preserving model quality.

The method of quantisation is also a critical architectural choice. While early techniques like GPTQ were effective, they can be brittle. Activation-aware Weight Quantization (AWQ) has emerged as a strong production choice. By identifying the salient weights that are most critical to model performance (based on activation magnitudes) and protecting them during quantisation, AWQ achieves low-bit weight compression, typically INT4, with lower perplexity than its predecessors in its authors' evaluations. For an AI Factory operating at scale, the combination of FP8 for compute and AWQ for memory and bandwidth reduction is the new baseline for cost-efficient, high-performance inference.
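The memory side of this argument is simple arithmetic, sketched below for a 70-billion-parameter model. The group size of 128 with one FP16 scale per group is an assumed configuration used only to show that real INT4 formats carry a small metadata overhead, so a 75% reduction is the upper bound rather than the exact figure.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Memory for model weights only, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS = 70e9  # a 70B-parameter model

fp16_gb = weight_memory_gb(PARAMS, 16)
fp8_gb = weight_memory_gb(PARAMS, 8)
int4_gb = weight_memory_gb(PARAMS, 4)

# Assumption: group-wise INT4 with one 16-bit scale per 128 weights,
# which adds 16 / 128 = 0.125 bits per weight (zero-points ignored).
int4_grouped_gb = weight_memory_gb(PARAMS, 4 + 16 / 128)

ideal_reduction = 1 - int4_gb / fp16_gb
grouped_reduction = 1 - int4_grouped_gb / fp16_gb

print(f"FP16 {fp16_gb:.1f} GB, FP8 {fp8_gb:.1f} GB, INT4 {int4_gb:.1f} GB")
print(f"INT4 reduction: {ideal_reduction:.1%} ideal, {grouped_reduction:.1%} with group scales")
```

These figures cover weights only. Activations and the KV cache are additional and scale with batch size and context length.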

State Management via the KV Cache

In autoregressive models, the Key-Value (KV) cache stores the attention keys and values for every token processed so far, prompt and generated output alike, so that they need not be recomputed at each decoding step. As prompts and conversations grow, and as more sequences are served concurrently, this cache becomes a dominant consumer of GPU memory, and at long contexts or large batch sizes it can dwarf the model weights themselves. In agentic systems, its role is even more profound: it holds the working state of your entire task.
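To give a sense of scale, the sketch below estimates KV cache size from model shape. The configuration (80 layers, 8 KV heads under grouped-query attention, head dimension 128, FP16 cache) is an assumed Llama-3-70B-style shape used for illustration; substitute the values from your own model's config.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache per token: one key and one value vector per layer per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


GIB = 1024 ** 3

# Assumed model shape (illustrative, not read from any real checkpoint).
per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)

context_tokens = 8192
concurrent_sequences = 64

per_sequence_gib = per_token * context_tokens / GIB
fleet_gib = per_sequence_gib * concurrent_sequences

print(f"{per_token // 1024} KiB per token")
print(f"{per_sequence_gib:.1f} GiB per {context_tokens}-token sequence")
print(f"{fleet_gib:.0f} GiB for {concurrent_sequences} concurrent sequences")
```

Under these assumptions, 64 concurrent long-context sequences need more cache than the FP16 weights of a 70B model occupy, which is why cache management drives how many agents a GPU can host.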

The KV cache is no longer just a memory consumer; in agentic systems, it represents the state and context of an entire multi-turn task. Mismanaging it is the fastest way to performance degradation and catastrophic context loss.

Efficient management is therefore non-negotiable. PagedAttention, as popularised by vLLM, is the foundational technology here, preventing the memory fragmentation that plagued earlier systems. But the challenges evolve with agentic behaviour. An agent might explore multiple branches of a plan, generating parallel histories. How does the inference stack manage these disparate KV caches without forcing costly re-computation of shared prefixes? This is an active area of development, with techniques for cache-level prefix sharing and efficient swapping to host memory becoming crucial for enabling long-running, complex agents. Your choice of serving engine and its underlying KV cache management strategy will directly dictate the complexity of the agents you can deploy effectively.
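One way to picture prefix sharing across plan branches is reference counting on cache blocks. The sketch below is a hypothetical illustration, not the mechanism of any particular engine: the class and method names are invented, sharing happens only at whole-block granularity, and copy-on-write for a partially filled last block is left out.

```python
class SharedPrefixCache:
    """Toy reference-counted block store for branching agent histories."""

    def __init__(self):
        self.ref_counts = {}  # physical block id -> number of branches using it
        self.tables = {}      # branch id -> list of physical block ids
        self._next_block = 0

    def _new_block(self):
        block = self._next_block
        self._next_block += 1
        self.ref_counts[block] = 1
        return block

    def create(self, branch, num_blocks):
        self.tables[branch] = [self._new_block() for _ in range(num_blocks)]

    def fork(self, parent, child):
        # The child points at the parent's blocks; nothing is copied or recomputed.
        self.tables[child] = list(self.tables[parent])
        for block in self.tables[child]:
            self.ref_counts[block] += 1

    def extend(self, branch):
        # New tokens after the fork go into a block private to this branch.
        self.tables[branch].append(self._new_block())

    def release(self, branch):
        # A block is reclaimed only when no branch references it any more.
        for block in self.tables.pop(branch):
            self.ref_counts[block] -= 1
            if self.ref_counts[block] == 0:
                del self.ref_counts[block]


cache = SharedPrefixCache()
cache.create("plan", num_blocks=4)  # shared planning context
cache.fork("plan", "branch_a")
cache.fork("plan", "branch_b")
cache.extend("branch_a")
cache.extend("branch_b")
print(len(cache.ref_counts), "physical blocks in use")
```

Without sharing, the three histories would need 4 + 5 + 5 = 14 blocks; with it, the four prefix blocks are stored once and each branch adds only its own continuation.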

Advanced Scheduling for the Next Wave of Models

The final frontier of inference optimisation lies in moving beyond the model itself and innovating at the scheduling and orchestration layer. Two techniques are paramount: speculative decoding and Mixture-of-Experts (MoE) routing.

Speculative decoding uses a small, fast "draft" model to propose a sequence of several tokens, which are then validated in a single forward pass by the larger, more powerful "verifier" model. Tokens the verifier agrees with are accepted; at the first disagreement the verifier's own token is used and drafting resumes from there. This converts multiple sequential, latency-bound decoding steps into a single, parallelisable validation. For latency-sensitive applications, this can cut wall-clock generation time by a factor of 2-3 when the draft model's proposals are accepted often. Architecturally, this requires a serving system capable of co-hosting two models and orchestrating their interaction efficiently. Platforms like Triton Inference Server, with its ensemble and Business Logic Scripting (BLS) capabilities, are well suited to these complex inference graphs.
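The draft-then-verify loop can be sketched with two toy "models". Both functions below are hypothetical stand-ins that map a token list to a next token, and the sketch covers only greedy decoding; sampling-based variants use a probabilistic acceptance rule instead of an exact match. The point is that the output is identical to decoding with the large model alone, while the number of large-model passes drops.

```python
def target_next(context):
    """Stand-in for the large model: a deterministic next-token function."""
    return (sum(context) * 31 + len(context)) % 50


def draft_next(context):
    """Stand-in for the draft model: agrees with the target except at every fourth position."""
    guess = target_next(context)
    return guess if len(context) % 4 else (guess + 1) % 50


def greedy_decode(prompt, max_new_tokens):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))  # one large-model pass per token
    return tokens


def speculative_decode(prompt, max_new_tokens, k=4):
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The draft model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. The target checks every position; a real engine does this in one batched pass.
        target_passes += 1
        accepted = []
        for i in range(k):
            expected = target_next(tokens + draft[:i])
            if draft[i] != expected:
                accepted.append(expected)  # first mismatch: keep the target's token
                break
            accepted.append(draft[i])
        else:
            accepted.append(target_next(tokens + draft))  # all matched: one extra token
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens], target_passes


prompt = [1, 2, 3]
baseline = greedy_decode(prompt, 16)
output, passes = speculative_decode(prompt, 16)
print(output == baseline, passes)
```

The speed-up depends on how often the draft is right: a draft that rarely matches pays for its own passes and still triggers one target pass per token.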

Similarly, the rise of sparse MoE models like Mixtral 8x22B presents a scheduling challenge. In each MoE layer of a forward pass, the model's router network must direct each token to a small subset of "expert" sub-networks (two of eight in Mixtral). An efficient serving stack must not only hold the massive model in memory but also schedule these routing decisions with minimal overhead, ensuring the correct experts are activated on the right GPUs and that the high-bandwidth interconnects like NVLink are fully utilised. Engines like SGLang and DeepSpeed-Inference are developing specialised MoE communication and scheduling primitives to tackle this exact problem. Ignoring these advanced scheduling capabilities means leaving a significant amount of performance on the table, particularly with the next generation of open models.
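The routing step itself is small; the scheduling cost comes from regrouping tokens by expert. The sketch below shows one common top-k formulation in plain Python, with made-up router logits for three tokens and four experts. It is an illustration of the pattern, not the code of any specific model or engine.

```python
import math


def route_token(router_logits, top_k=2):
    """Pick the top_k experts for one token; weights are a softmax over the chosen logits."""
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:top_k]
    max_logit = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - max_logit) for i in top]
    total = sum(exps)
    return [(expert, e / total) for expert, e in zip(top, exps)]


def group_by_expert(batch_logits, top_k=2):
    """Regroup tokens by expert so each expert can process its tokens as one batch."""
    assignments = {}
    for token_idx, logits in enumerate(batch_logits):
        for expert, weight in route_token(logits, top_k):
            assignments.setdefault(expert, []).append((token_idx, weight))
    return assignments


# Hypothetical router logits: 3 tokens, 4 experts.
batch_logits = [
    [2.0, 0.5, 1.0, -1.0],
    [0.1, 0.2, 3.0, 2.5],
    [1.5, 1.4, -0.5, 0.0],
]

assignments = group_by_expert(batch_logits)
for expert in sorted(assignments):
    print(expert, [token for token, _ in assignments[expert]])
```

When experts live on different GPUs, each of these per-expert groups becomes a transfer over the interconnect, and uneven group sizes leave some devices idle while others are saturated.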

The era of optimising for a single model's throughput is over. The defining challenge for the next generation of AI platforms is architecting for compound latency across entire agentic workflows.

Ultimately, building a genuine AI Factory requires a shift in perspective. We must move from optimising isolated models to engineering holistic systems. Four architectural decisions separate a mere model-hosting platform from a high-performance engine for agentic AI: choosing a serving engine based on the latency/throughput trade-off, implementing a robust quantisation strategy with FP8 and AWQ, managing the KV cache with care, and adopting advanced schedulers for speculative decoding and MoE.
