The era of the 'AI Factory' is here, but capital investment in GPUs is only the first step. The critical challenge for senior technologists is architecting the software stack to prevent these billion-dollar assets from becoming monuments to inefficiency. We dissect the critical trade-offs in quantisation, serving engines, and advanced throughput techniques that define modern AI infrastructure.
The headlines are dominated by gigawatt-scale data centres and sovereign compute initiatives, rightly dubbed ‘AI Factories’. The recent announcement from SK Telecom, planning a national AI cloud powered by NVIDIA’s DSX platform, is merely the latest example of this monumental capital expenditure cycle. For the practitioner, however, the commissioning of the hardware marks the beginning, not the end, of the architectural challenge. An AI Factory without a meticulously optimised software stack is just an expensive, underutilised shed of servers. The defining battle for AI platform teams over the next 24 months will be fought in the inference stack, where decisions on quantisation, serving engines, and memory management directly dictate the return on that massive investment.
The objective is not merely to serve a model; it is to achieve sustained, cost-effective throughput under real-world conditions. This requires moving beyond simplistic benchmarks and confronting a series of complex architectural trade-offs. The choices we make here separate a performant, scalable AI platform from a perpetual proof-of-concept that bleeds cash.
The Quantisation Dilemma: Precision vs. Performance
The most immediate lever for optimising inference is model quantisation—reducing the numerical precision of model weights and activations. While moving from 16-bit floating-point (FP16/BF16) to 8-bit or 4-bit integers is standard practice, the architectural decision is no longer *if* but *how*. The landscape of quantisation algorithms has matured, presenting a nuanced trade-off between performance gains and potential accuracy degradation.
Activation-aware Weight Quantisation (AWQ) and Generative Pre-trained Transformer Quantisation (GPTQ) remain the dominant post-training quantisation (PTQ) methods. GPTQ, which builds on Optimal Brain Quantization, uses approximate second-order (Hessian) information to choose quantised weights that minimise each layer's output error; calibration can be computationally intensive but often yields high accuracy. AWQ, conversely, uses activation statistics to identify the small fraction of salient weight channels and protects them by rescaling before quantisation, which makes it faster to apply and often more robust for models with significant activation outliers. For a Llama 3 70B model, INT4 AWQ quantisation can reduce the weight footprint from ~140GB to ~35GB, enabling it to run on a single H100 80GB GPU with headroom left for the KV cache, but it might introduce subtle accuracy regressions on domain-specific tasks not captured by standard benchmarks like MMLU.
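The footprint figures above follow from simple arithmetic on parameter count and bits per weight. The sketch below reproduces them under stated assumptions: a nominal 70 billion parameters, decimal gigabytes, and weights only. It ignores quantisation scales and zero-points, the KV cache, and activations, so real deployments will need somewhat more memory.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight-only memory in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Nominal parameter count (assumption); the released model is slightly larger.
PARAMS_70B = 70e9

for label, bits in [("FP16/BF16", 16), ("INT8/FP8", 8), ("INT4", 4)]:
    print(f"{label:>10}: ~{weight_footprint_gb(PARAMS_70B, bits):.0f} GB")
```

The gap between the INT4 figure and the 80GB card is what remains for the KV cache and runtime overhead, which is why the weight footprint alone does not settle whether a model fits.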
The introduction of FP8 (E4M3 and E5M2 formats) with NVIDIA’s Hopper and Blackwell architectures presents another critical choice. Unlike INT8, whose representable values are evenly spaced, FP8 keeps a wide dynamic range through its exponent bits, making it more tolerant of outliers and less prone to accuracy cliffs, including during quantisation-aware training (QAT). Our internal testing shows that for many models, FP8 delivers up to 80% of the performance gain of INT4 with virtually none of the accuracy risk, making it a safer default for enterprise workloads where reliability is paramount. The architectural mandate is clear: establish a rigorous evaluation pipeline to test quantised models not just on academic benchmarks, but on a representative set of your organisation’s core tasks before promoting them to production.
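One way to make that mandate concrete is a promotion gate that compares the quantised candidate against the full-precision baseline task by task. The sketch below is a minimal illustration, not a prescribed pipeline: the `TaskResult` type, the task names, the scores, and the one-point threshold are all hypothetical placeholders, and a real gate would also account for evaluation noise.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    task: str
    baseline_score: float   # metric for the full-precision model, in [0, 1]
    quantised_score: float  # same metric for the quantised candidate


def regressions(results, max_drop=0.01):
    """Return the tasks whose score dropped by more than max_drop (absolute)."""
    return [r.task for r in results
            if r.baseline_score - r.quantised_score > max_drop]


def can_promote(results, max_drop=0.01):
    """A candidate is promotable only if no task regresses beyond the threshold."""
    return not regressions(results, max_drop)


# Invented numbers for illustration only; they are not measurements.
results = [
    TaskResult("academic_benchmark_subset", 0.790, 0.786),
    TaskResult("contract_clause_extraction", 0.912, 0.871),
]

print("regressed tasks:", regressions(results))
print("promote:", can_promote(results))
```

In this made-up example the academic benchmark barely moves while the domain task regresses, which is exactly the failure mode a benchmark-only check would miss.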
Serving Engine Selection in a Fractured Landscape
With a quantised model artefact in hand, the next critical decision is the serving engine. The choice is no longer a simple matter of deploying a Flask app around a Hugging Face pipeline. The performance delta between a naive implementation and a state-of-the-art inference server can be an order of magnitude. The current landscape is dominated by three key contenders: vLLM, TensorRT-LLM, and SGLang, each with a distinct architectural philosophy.
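vLLM, with its flagship PagedAttention algorithm, has become the de facto standard for dynamic, high-throughput scenarios. By storing the Key-Value (KV) cache in fixed-size blocks that need not be contiguous in GPU memory, much as an operating system pages virtual memory, it avoids reserving a maximum-length buffer for every request and limits waste to the last, partially filled block of each sequence. The result is near-optimal memory utilisation and larger effective batch sizes: the vLLM authors report 2-4x higher throughput than earlier serving systems such as FasterTransformer and Orca, and the gap over a naive Hugging Face implementation is typically larger still. It excels in environments with unpredictable request lengths and high concurrency, typical of interactive chatbot applications.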
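The memory argument can be illustrated with a toy allocator. This is a sketch of the block-table idea only, not vLLM's implementation: the `BlockAllocator` class, the block size of 16, the 2048-token maximum, and the request lengths are all assumptions chosen for illustration.

```python
import math


class BlockAllocator:
    """Toy KV-cache allocator: each request maps to a list of physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.tables = {}   # request id -> physical block ids (need not be contiguous)
        self.lengths = {}  # request id -> number of tokens stored

    def append_token(self, req_id):
        n = self.lengths.get(req_id, 0)
        if n % self.block_size == 0:  # first token, or the current block is full
            if not self.free:
                raise MemoryError("no free KV blocks")
            self.tables.setdefault(req_id, []).append(self.free.pop())
        self.lengths[req_id] = n + 1

    def release(self, req_id):
        self.free.extend(self.tables.pop(req_id, []))
        self.lengths.pop(req_id, None)


def reserved_contiguous(seq_lens, max_len):
    """Token slots reserved when every request pre-allocates max_len."""
    return len(seq_lens) * max_len


def reserved_paged(seq_lens, block_size):
    """Token slots reserved when each request holds only the blocks it needs."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)


seq_lens = [37, 512, 120, 1900]  # hypothetical request lengths in tokens
used = sum(seq_lens)
print("contiguous utilisation:", round(used / reserved_contiguous(seq_lens, 2048), 3))
print("paged utilisation:     ", round(used / reserved_paged(seq_lens, 16), 3))
```

With paging, the waste per request is bounded by one block, so utilisation stays high however uneven the request lengths are; the freed capacity is what allows more requests to be batched together.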
Your choice of serving engine is not a one-way door, but migrating a production workload with active SLAs from vLLM to TensorRT-LLM is a non-trivial engineering effort. Choose your initial path wisely based on a deep understanding of your primary workload profile.
NVIDIA’s TensorRT-LLM, on the other hand, is built for absolute performance on NVIDIA hardware. It compiles models into optimised TensorRT engines, leveraging kernel fusion, in-flight batching, and hardware-specific optimisations. While its performance on static batch sizes can exceed vLLM’s, its configuration can be more rigid. It truly shines in predictable, high-volume workloads like batch document summarisation or code generation, where request patterns are uniform. Integrating it with the Triton Inference Server (we are currently validating version 2.45.0) provides a robust, enterprise-grade serving framework with features like dynamic batching and model ensemble management.
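To see why in-flight batching matters, it helps to count decode steps in a simplified model. The sketch below is not TensorRT-LLM code; it assumes all requests are queued at the start, ignores prefill cost, and uses hypothetical output lengths. It compares a static batch, which runs until its longest request finishes, with a scheduler that refills a slot as soon as a request completes.

```python
import heapq


def static_batch_steps(output_lens, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    return sum(max(output_lens[i:i + batch_size])
               for i in range(0, len(output_lens), batch_size))


def inflight_batch_steps(output_lens, batch_size):
    """Decode steps when a finished request's slot is refilled immediately."""
    slots = [0] * batch_size  # step at which each slot next becomes free
    heapq.heapify(slots)
    for n in output_lens:
        start = heapq.heappop(slots)      # earliest slot to free up
        heapq.heappush(slots, start + n)  # occupied until the request finishes
    return max(slots)


# Hypothetical output lengths (tokens) with a few long stragglers.
output_lens = [10, 200, 15, 20, 180, 12, 25, 30]

print("static:   ", static_batch_steps(output_lens, batch_size=4))
print("in-flight:", inflight_batch_steps(output_lens, batch_size=4))
```

When output lengths are uniform the two schedules converge, which is consistent with static batching remaining competitive for predictable batch workloads.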
SGLang emerges as a specialised contender for the structured generation patterns increasingly common in agentic workflows. By co-designing the front-end language and the serving runtime, SGLang can aggressively optimise the generation process for tasks requiring complex logic, tool use, or constrained outputs (e.g., JSON generation). Its RadixAttention mechanism keeps the KV cache of earlier requests in a radix tree and reuses it whenever a new request shares a prefix, such as a common system prompt or the earlier turns of a conversation. Skipping that redundant prefill can provide a significant speedup in these scenarios over engines that recompute shared prefixes for every request. For teams building complex AI agents, SGLang warrants serious consideration as it directly addresses the performance bottlenecks inherent in multi-turn, structured interactions.
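RadixAttention, as described by the SGLang authors, reuses KV cache entries across requests that share a prefix. The sketch below shows only the bookkeeping idea with a plain token-level trie rather than a compressed radix tree, and it stores no actual tensors. The `PrefixCache` class and the token sequences are hypothetical.

```python
class PrefixCache:
    """Token-level trie recording which prefixes already have cached KV entries."""

    def __init__(self):
        self.root = {}

    def match_and_insert(self, tokens):
        """Return how many leading tokens were already cached, then cache the rest."""
        node, hits = self.root, 0
        for tok in tokens:
            if tok in node:
                hits += 1       # KV for this position can be reused
            else:
                node[tok] = {}  # KV for this position must be computed
            node = node[tok]
        return hits


cache = PrefixCache()
system = ["You", "are", "a", "JSON", "agent", "."]  # shared system prompt
turn_1 = system + ["list", "files"]
turn_2 = turn_1 + ["->", "ok", "delete", "tmp"]     # same conversation, next turn
other = system + ["read", "log"]                    # different conversation

for name, request in [("turn_1", turn_1), ("turn_2", turn_2), ("other", other)]:
    reused = cache.match_and_insert(request)
    print(f"{name}: reused {reused} of {len(request)} prompt tokens")
```

The more a workload consists of long shared prompts followed by short new suffixes, as multi-turn agents tend to, the larger the share of prefill that this kind of reuse can skip.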
Advanced Throughput Maximisation: Speculation and Routing
Beyond quantisation and serving engines lies a frontier of more aggressive optimisation techniques. These methods often introduce greater system complexity, demanding a mature AI platform and operations team. Speculative decoding is the most prominent example. Here, a small, fast "draft" model generates a short sequence of candidate tokens, which are then verified in a single forward pass by the larger, more powerful "target" model. The target keeps the longest prefix of draft tokens it agrees with and supplies its own token at the first disagreement, so one forward pass of the large model can yield several tokens instead of one. When the draft is usually right, this drastically reduces wall-clock latency without changing the output.
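The draft-and-verify loop can be sketched for the greedy case with toy next-token functions standing in for real models. Everything here is an assumption for illustration: `target_next` and `draft_next` are arbitrary deterministic functions, the draft is contrived to disagree at fixed positions, and the single batched verification pass of a real model is emulated by sequential calls. Sampling-based variants use a probabilistic acceptance rule that is not shown.

```python
def speculative_greedy(target_next, draft_next, prompt, num_new, k=4):
    """Greedy speculative decoding; returns (tokens, number of target passes)."""
    seq = list(prompt)
    goal = len(prompt) + num_new
    passes = 0
    while len(seq) < goal:
        # 1. The draft proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(seq + proposal))
        # 2. One target pass checks every proposed position
        #    (emulated here by up to k + 1 sequential calls).
        passes += 1
        accepted = []
        for i in range(k + 1):
            expected = target_next(seq + accepted)
            accepted.append(expected)  # a match, a correction, or a bonus token
            if i == k or proposal[i] != expected:
                break
        seq.extend(accepted)
    return seq[:goal], passes


def target_next(seq):
    """Toy stand-in for the large model's greedy next token."""
    return (sum(seq) * 31 + len(seq)) % 50


def draft_next(seq):
    """Toy draft: agrees with the target except when len(seq) is a multiple of 5."""
    guess = target_next(seq)
    return guess if len(seq) % 5 else (guess + 1) % 50


def target_only(prompt, num_new):
    seq = list(prompt)
    for _ in range(num_new):
        seq.append(target_next(seq))
    return seq


tokens, passes = speculative_greedy(target_next, draft_next, [1, 2, 3], num_new=20)
print("matches target-only output:", tokens == target_only([1, 2, 3], 20))
print("target passes used:", passes, "for 20 new tokens")
```

The property worth noting is that the output is identical to decoding with the target alone; only the number of target passes changes.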
Speculative decoding forces a fundamental trade-off: are you willing to burn compute on a draft model to reduce wall-clock latency for your primary model?
The architectural cost, however, is significant. You must now host, manage, and scale two models instead of one. The KV caches of both models must be managed in synchrony, and the performance gain is highly dependent on the "acceptance rate" of the draft tokens. For latency-critical interactive applications, this complexity can be justified. A 300ms reduction in perceived latency can be the difference between a fluid user experience and a frustrating one. But for batch workloads, the added compute cost and complexity are unlikely to provide a positive ROI.
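The dependence on acceptance rate can be made quantitative with the commonly cited analysis of speculative decoding (Leviathan et al., 2023). The sketch below assumes each draft token is accepted independently with probability `alpha`, that `gamma` tokens are drafted per step, and that one draft forward pass costs `cost_ratio` of a target pass. These are idealisations, and the example values are hypothetical rather than measured.

```python
def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Expected tokens produced per target pass under i.i.d. acceptance."""
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Expected wall-clock improvement over target-only decoding.

    Each step costs gamma draft passes plus one target pass, measured in
    units of a target pass. Values below 1.0 mean speculation is slower.
    """
    return expected_tokens_per_pass(alpha, gamma) / (gamma * cost_ratio + 1)


# Hypothetical scenarios: (acceptance rate, drafted tokens, draft/target cost ratio)
for alpha, gamma, cost_ratio in [(0.8, 4, 0.05), (0.3, 4, 0.05), (0.3, 4, 0.30)]:
    print(f"alpha={alpha}, gamma={gamma}, c={cost_ratio}: "
          f"{expected_speedup(alpha, gamma, cost_ratio):.2f}x")
```

Under these assumptions a poorly matched or expensive draft model can make decoding slower than the baseline, which is why the acceptance rate on your own traffic, not on a benchmark, should drive the decision.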
For organisations self-hosting Mixture-of-Experts (MoE) models like Mixtral 8x22B or a custom-trained equivalent, the routing strategy itself becomes a performance lever. The default top-k routing is effective, but exploring alternative routing algorithms—or even training routers to specialise in specific domains—can optimise expert utilisation and reduce inference cost. This is an active area of research, but for large-scale deployments, building the infrastructure to experiment with and deploy custom MoE routers will become a source of competitive advantage.
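As a reference point for such experiments, the sketch below shows one common form of top-k routing: pick the k experts with the highest router logits and normalise their weights with a softmax over just those k. It is written in plain Python for clarity, the logits are hypothetical, and production routers operate on batched tensors with details (load-balancing losses, capacity limits) that are omitted here.

```python
import math
from collections import Counter


def top_k_route(router_logits, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]  # stable softmax
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def expert_utilisation(batch_logits, k=2):
    """Count how many tokens in a batch are routed to each expert."""
    counts = Counter()
    for logits in batch_logits:
        counts.update(i for i, _ in top_k_route(logits, k))
    return counts


# Hypothetical router logits for three tokens over eight experts.
batch_logits = [
    [0.1, 2.0, -1.0, 1.0, 0.3, 0.0, -0.5, 0.7],
    [1.5, 1.9, -0.2, 0.1, 0.0, 0.4, -1.0, 0.2],
    [0.0, 2.2, 0.3, -0.4, 1.8, 0.1, 0.2, -0.3],
]

print(top_k_route(batch_logits[0]))
print(expert_utilisation(batch_logits))
```

A utilisation count like this is the simplest signal for spotting hot experts; an alternative routing strategy would replace `top_k_route` while keeping the same measurement in place.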
Conclusion: From Capital Expenditure to Operational Excellence
The era of the AI Factory has shifted the primary challenge from acquiring compute to extracting maximum value from it. The architectural decisions made in the software-defined inference stack—how you quantise your models, which serving engine you select, and whether you embrace advanced techniques like speculative decoding—have a more profound and lasting impact on TCO and system performance than the choice of GPU chipset itself. The role of the AI Systems Architect is to navigate these complex trade-offs not with an eye towards a single "best" tool, but to compose a coherent, optimised system that is precisely matched to the organisation's specific AI workloads. This is the path from raw capacity to true capability.
Ready to apply these patterns in your stack?
Book a free 45-minute AI readiness call with the Precision Data Partners team.