The Full ML Stack: From Electrons to Applications
A Deep Dive into Every Layer That Makes Modern AI Work
There’s a popular misconception that “AI” is mostly about clever algorithms. In reality, the modern ML stack is a deeply intertwined tower of engineering that stretches from power plants and silicon fabrication all the way up to the chatbot interface you’re typing into. Each layer constrains and enables the layers above it, and the most important advances in AI over the last decade have often come not from algorithmic breakthroughs alone, but from co-optimization across multiple layers simultaneously.
This piece walks through the entire stack — energy, compute hardware, systems software, modeling, and application — with particular depth on the compute and modeling layers, where the most consequential engineering decisions are being made right now.
Layer 0: Energy
Everything starts with power. Training a frontier language model like GPT-5 or Gemini 2.5 consumes on the order of tens to hundreds of gigawatt-hours of electricity. To put that in perspective, training a single frontier model can consume roughly the annual electricity usage of a small city. This is not a metaphor — it is a literal infrastructure constraint that shapes where data centers get built, which companies can compete, and how fast the field can move.
The major hyperscalers — Microsoft, Google, Amazon, Meta — are now the largest private purchasers of electricity on the planet, and they are signing multi-billion-dollar power purchase agreements years in advance. Microsoft’s deal with Constellation Energy to restart the Three Mile Island nuclear plant, Amazon’s investments in nuclear through partnerships with Energy Northwest, and Google’s agreement with Kairos Power for small modular reactors all reflect the same underlying reality: the availability of cheap, reliable, large-scale power is becoming a binding constraint on AI progress.
The energy mix matters for several reasons beyond cost. Intermittent renewables like solar and wind are cheap per kilowatt-hour but create utilization challenges for data centers that want to run 24/7. Natural gas is reliable but exposes operators to fuel price volatility and carbon costs. Nuclear is reliable and low-carbon but has extremely long lead times and regulatory complexity. The practical result is that most new AI data centers are being sited near existing natural gas infrastructure or near large hydroelectric installations (which is why places like Iowa, Texas, northern Virginia, Quebec, and the Nordic countries are popular), with nuclear as a longer-term bet.
Power delivery inside the data center is its own engineering discipline. A modern AI data center might draw 100-300+ MW, and the electrical infrastructure — substations, transformers, uninterruptible power supplies, power distribution units — represents a substantial fraction of total build cost. Power Usage Effectiveness (PUE), the ratio of total facility power to IT equipment power, is a key metric. The best modern facilities achieve PUEs of 1.05-1.10, meaning only 5-10% of power is lost to cooling, lighting, and other overhead. Cooling is the dominant non-compute power draw: air cooling, liquid cooling (direct-to-chip or immersion), and increasingly rear-door heat exchangers are all in use, with liquid cooling becoming essentially mandatory for the highest-density GPU racks that can exceed 100 kW per rack.
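To make the PUE figure concrete, here is the arithmetic in a short sketch. The numbers are hypothetical examples in the range quoted above, not measurements from any specific facility:

```python
# Illustrative PUE arithmetic (hypothetical numbers, not real facility data).

def pue(total_facility_mw: float, it_equipment_mw: float) -> float:
    """Power Usage Effectiveness: total facility power / IT equipment power."""
    return total_facility_mw / it_equipment_mw

# A 100 MW IT load at a best-in-class PUE of 1.07:
it_load_mw = 100.0
facility_mw = it_load_mw * 1.07
overhead_mw = facility_mw - it_load_mw  # cooling, lighting, conversion losses

print(f"facility draw: {facility_mw:.1f} MW, overhead: {overhead_mw:.1f} MW")
print(f"PUE = {pue(facility_mw, it_load_mw):.2f}")
```

At this efficiency, every 100 MW of compute still requires roughly 7 MW of supporting infrastructure, which is why PUE improvements compound into real money at data-center scale.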
The energy layer imposes hard constraints upward. The total amount of compute you can deploy is bounded by the power you can deliver and dissipate. The cost of that compute is substantially driven by the cost of that power. And the geographic distribution of AI capability is shaped by where power is abundant and cheap. When people talk about “scaling laws,” they are implicitly talking about a willingness and ability to spend more energy.
Layer 1: Compute Hardware (Deep Dive)
This is where electrons become matrix multiplications. The compute layer encompasses the processors, memory systems, interconnects, and physical packaging that turn electrical power into useful arithmetic. It is also where some of the most dramatic and consequential engineering advances of the last decade have occurred.
GPUs: The Workhorse
The modern ML revolution was built on GPUs, and specifically on NVIDIA GPUs. The reason is straightforward: training and running neural networks is dominated by dense matrix multiplications (and closely related operations like convolutions), and GPUs were originally designed to do exactly this kind of massively parallel arithmetic for graphics rendering. NVIDIA’s strategic bet, beginning with CUDA in 2007 and deepening with the 2014 release of cuDNN after AlexNet’s 2012 results showed what GPUs could do for neural networks, was that this hardware could be repurposed for general-purpose parallel computation, and specifically for deep learning. That bet paid off spectacularly.
The current flagship is NVIDIA’s Blackwell architecture (B200 and GB200 configurations), which succeeds the Hopper architecture (H100, H200). To understand why these chips matter, you need to understand a few key concepts.
The first is raw throughput, measured in floating-point operations per second (FLOPS). A single B200 GPU delivers roughly 2.25 petaFLOPS of FP4 throughput, about 1.1 petaFLOPS at FP8, and around 70 teraFLOPS at FP32. These numbers are achieved through Tensor Cores, specialized matrix-multiply-accumulate units that operate on small tiles of matrices (typically 16×16 or similar) per hardware instruction. Tensor Cores are the key architectural innovation that separates a modern AI GPU from a general-purpose graphics card — they sacrifice generality for enormous throughput on the specific operations that dominate neural network training and inference.
The second concept is memory bandwidth. A GPU’s compute units are useless if they can’t be fed data fast enough. The B200 uses HBM3e (High Bandwidth Memory), providing roughly 8 TB/s of memory bandwidth across 192 GB of capacity. HBM is a 3D-stacked DRAM technology where multiple memory dies are stacked vertically on a silicon interposer and connected to the GPU die via thousands of tiny through-silicon vias (TSVs). This stacking dramatically increases bandwidth compared to traditional GDDR memory by providing a very wide memory bus in a compact physical area. The ratio of compute throughput to memory bandwidth — the arithmetic intensity required to keep the compute units busy — is a fundamental design parameter. For many inference workloads and for attention computations during training, the model is “memory-bandwidth-bound,” meaning the compute units are idle waiting for data. This is why memory bandwidth improvements often matter as much as or more than raw FLOPS improvements.
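The compute-to-bandwidth ratio can be made concrete with a back-of-envelope roofline calculation. The sketch below uses the approximate figures quoted above; treat them as illustrative, since exact specs vary by configuration and precision:

```python
# Roofline-style check: is a kernel compute-bound or memory-bound?
# Spec numbers are the rough figures quoted in the text, not exact values.

def ridge_point(peak_flops: float, mem_bw_bytes_per_s: float) -> float:
    """Arithmetic intensity (FLOPs per byte moved) needed to saturate compute."""
    return peak_flops / mem_bw_bytes_per_s

PEAK_FP8_FLOPS = 1.1e15   # ~1.1 petaFLOPS at FP8
HBM_BW = 8e12             # ~8 TB/s of HBM3e bandwidth

ridge = ridge_point(PEAK_FP8_FLOPS, HBM_BW)  # ~137.5 FLOPs/byte

# An elementwise op (e.g. adding two vectors) does ~1 FLOP per ~3 bytes moved,
# far below the ridge point, so it is hopelessly memory-bound. A large matmul
# reuses each loaded element many times, so its intensity grows with size and
# can exceed the ridge point.
elementwise_intensity = 1 / 3

print(f"ridge point: {ridge:.1f} FLOPs/byte")
print("elementwise is memory-bound:", elementwise_intensity < ridge)
```

Any kernel whose arithmetic intensity falls below the ridge point leaves the Tensor Cores idle waiting on HBM, which is exactly the regime most attention and decoding workloads live in.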
The third concept is interconnect, which we’ll discuss in its own section, but the key point is that a single GPU is never sufficient for frontier workloads. Training a large model requires distributing computation across thousands to tens of thousands of GPUs, and the speed at which those GPUs can communicate determines how efficiently you can scale.
NVIDIA’s dominance rests not primarily on having the best silicon (though it’s very good) but on the CUDA software ecosystem. CUDA provides a programming model, compiler, and massive library ecosystem (cuDNN for neural network primitives, cuBLAS for linear algebra, NCCL for multi-GPU communication, TensorRT for inference optimization, and many more) that makes it dramatically easier to write and optimize GPU code compared to any competing platform. This ecosystem lock-in is NVIDIA’s real moat, and it is very deep.
TPUs and Custom ASICs
Google’s Tensor Processing Units (TPUs) are the most mature alternative to NVIDIA GPUs. The latest generation, TPU v6e (Trillium), provides substantial improvements in per-chip performance and is designed from the ground up for the specific computational patterns of transformer training and inference. TPUs use a systolic array architecture for matrix multiplication — data flows rhythmically through a grid of multiply-accumulate units in a wave-like pattern, which is highly efficient for large dense matrix operations but less flexible than a GPU’s more general-purpose streaming multiprocessor design. TPUs are tightly integrated with Google’s software stack (JAX and XLA), and are available externally through Google Cloud. The JAX/XLA programming model compiles high-level array operations down to optimized TPU (or GPU) code through a graph-based intermediate representation, which enables aggressive compiler optimizations like operation fusion, layout optimization, and automatic parallelization.
Beyond Google, there is a growing ecosystem of AI-specific chips. AMD’s MI300X and the newer MI350 series use a chiplet-based design that integrates GPU compute dies with HBM on a single package, and are competitive on paper with NVIDIA’s offerings, though the software ecosystem (ROCm) remains significantly less mature. Intel’s Gaudi accelerators (now Gaudi 3) target a slightly different point in the design space. And a wave of startups and newer entrants — Cerebras with its wafer-scale engine, Groq with its deterministic-latency LPU architecture, SambaNova, Graphcore (before its acquisition), and others — are exploring more radical architectural departures. Cerebras is particularly interesting: their WSE-3 is a single chip the size of an entire silicon wafer (roughly 46,225 mm²), containing 900,000 cores and 44 GB of on-chip SRAM, which eliminates the off-chip memory bandwidth bottleneck entirely for models that fit in that SRAM.
For inference specifically, there’s a growing market for chips optimized for the distinct computational profile of running (as opposed to training) models. Inference involves smaller batch sizes, is more latency-sensitive, and increasingly involves speculative decoding and other techniques that change the ratio of compute to memory access. NVIDIA’s inference-focused configurations, various startup chips, and even FPGAs and custom ASICs from the hyperscalers (like Amazon’s Trainium and Inferentia, and Microsoft’s Maia) are all targeting this space.
Memory Hierarchy
Understanding the memory hierarchy is essential for understanding ML hardware performance, because most workloads are memory-bound for at least some of their execution.
At the top of the hierarchy is on-chip SRAM, which is used for register files, caches, and scratchpad memory (called “shared memory” in CUDA). SRAM is extremely fast (single-cycle access) but extremely expensive per bit, so there’s not much of it — a modern GPU might have tens of megabytes total. The key optimization at this level is ensuring that data reuse patterns in your computation are structured so that data can be loaded from SRAM rather than requiring a trip to main memory. This is exactly what techniques like FlashAttention exploit: by restructuring the attention computation to operate on tiles that fit in SRAM, they avoid the quadratic memory reads and writes to HBM that a naive implementation would require.
Below SRAM is HBM, the main GPU memory. As discussed, this provides terabytes per second of bandwidth and tens to hundreds of gigabytes of capacity. The key constraint is that HBM bandwidth is shared across all the compute units on the chip, so if your kernel has low arithmetic intensity (few operations per byte loaded), you will bottleneck here.
Below HBM is host CPU memory (DRAM), accessed over PCIe or NVLink-C2C. This is much slower (tens of GB/s for PCIe Gen 5 vs. TB/s for HBM) but much larger (hundreds of GB to TB). Techniques like CPU offloading move less-frequently-accessed data (optimizer states, inactive layers in pipeline parallelism) to host memory to fit larger models in a given GPU memory budget, at the cost of transfer latency.
Below host memory is storage (NVMe SSDs), which is slower still but provides the massive capacity needed for datasets, checkpoints, and model weight storage. Efficient data loading pipelines that overlap storage reads with computation are critical for training throughput.
The fundamental game of ML systems engineering is keeping the compute units fed. Every piece of systems software, every kernel optimization, every parallelism strategy, every quantization scheme — they are all, in some sense, attempts to navigate the constraints of this memory hierarchy more cleverly.
Interconnects and Networking
No single GPU can train a frontier model. Current frontier training runs use clusters of 16,000 to over 100,000 GPUs, and communicating between them efficiently is one of the hardest engineering challenges in the stack.
Within a single server (typically containing 8 GPUs), NVIDIA uses NVLink, a proprietary high-bandwidth, low-latency interconnect. The latest NVLink generation (NVLink 5 in Blackwell) provides 1.8 TB/s of bidirectional bandwidth per GPU. The GB200 NVL72 configuration connects 72 GPUs (36 Grace-Blackwell modules) via NVLink into a single coherent domain with 130 TB/s of aggregate NVLink bandwidth, effectively creating a single logical accelerator with massive aggregate memory and compute. This is significant because it allows tensor parallelism (splitting individual matrix multiplications across GPUs) within the NVLink domain without the crippling latency penalty that would make this infeasible over a network.
Between servers, communication happens over high-speed Ethernet or InfiniBand. InfiniBand (also NVIDIA-owned, via the Mellanox acquisition) has historically dominated AI training clusters due to its lower latency, higher bandwidth, and superior support for the RDMA (Remote Direct Memory Access) operations used in collective communications like all-reduce. Current InfiniBand speeds are 400 Gb/s per port (NDR), with 800 Gb/s (XDR) deploying. However, high-speed Ethernet (400GbE and 800GbE) with RoCE (RDMA over Converged Ethernet) is closing the gap and is preferred by some hyperscalers for its lower cost and better integration with existing data center infrastructure. Google’s TPU pods use a custom toroidal mesh interconnect, and several startups are exploring novel network topologies.
Network topology matters enormously. The standard approach is a fat-tree (or folded Clos) network, which provides full bisection bandwidth — meaning any set of half the nodes can communicate with the other half at full speed. This is expensive, requiring many layers of high-radix switches. Some deployments use rail-optimized designs or other non-blocking topologies tuned to the specific communication patterns of ML training (which tend to be highly structured collectives rather than arbitrary point-to-point traffic).
The key distributed communication primitives are all-reduce (every GPU sends data to every other GPU and receives the sum — used for gradient synchronization in data parallelism), all-gather (every GPU broadcasts its data to every other GPU — used in certain model parallelism schemes like FSDP), and reduce-scatter (the inverse of all-gather: the reduction is performed, but each GPU keeps only its shard of the result). The NCCL library (and its equivalents on other platforms) implements these operations efficiently, using algorithms like ring all-reduce, tree all-reduce, or recursive halving-doubling depending on the message size and network topology.
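To make ring all-reduce concrete, here is a minimal single-process simulation of its two phases. This is illustrative only; real implementations like NCCL pipeline these steps and overlap them with computation:

```python
# Single-process simulation of ring all-reduce over n "GPUs".
# Each rank's buffer is split into n chunks (one float per chunk here).
# After a reduce-scatter phase and an all-gather phase, every rank
# holds the elementwise sum of all inputs.

def ring_all_reduce(ranks):
    n = len(ranks)
    data = [list(r) for r in ranks]  # each rank's local buffer
    # Phase 1: reduce-scatter. At each step, rank r passes one chunk a
    # single hop around the ring; after n-1 steps, rank r owns the fully
    # summed chunk (r + 1) % n.
    for step in range(n - 1):
        sends = [(r, (r - step) % n) for r in range(n)]
        vals = [data[r][c] for r, c in sends]     # snapshot before mutating
        for (r, c), v in zip(sends, vals):
            data[(r + 1) % n][c] += v
    # Phase 2: all-gather. Each finished chunk circulates around the ring,
    # overwriting stale copies.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n) for r in range(n)]
        vals = [data[r][c] for r, c in sends]
        for (r, c), v in zip(sends, vals):
            data[(r + 1) % n][c] = v
    return data

out = ring_all_reduce([[1.0, 2.0, 3.0, 4.0]] * 4)
assert all(row == [4.0, 8.0, 12.0, 16.0] for row in out)
```

The appeal of the ring algorithm is that each GPU sends and receives only about 2× its buffer size in total, regardless of cluster size, so bandwidth per link stays constant as the ring grows.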
Network failures are a constant reality at scale. At 10,000+ GPUs, some hardware component is failing essentially continuously. Fault tolerance — the ability to detect failures, checkpoint progress, and resume training with reduced or reconfigured hardware — is a critical systems engineering capability. Companies like Meta, Google, and others have published extensively on their approaches, which typically involve periodic checkpointing to distributed storage, automatic health monitoring, and the ability to drain and replace nodes without halting the entire training run.
The Semiconductor Supply Chain
It’s worth noting that the entire compute layer ultimately depends on semiconductor manufacturing, which is one of the most concentrated supply chains on Earth. TSMC manufactures essentially all cutting-edge AI chips (NVIDIA, AMD, Apple, many others), and their most advanced process nodes (N3, N4, and upcoming N2) are produced in a small number of fabs in Taiwan, with new fabs under construction in Arizona, Japan, and Germany. The extreme ultraviolet (EUV) lithography equipment required for these nodes is produced exclusively by ASML, a single Dutch company. This concentration creates geopolitical risk that is increasingly influencing AI strategy at the national level, as evidenced by the US CHIPS Act and similar initiatives worldwide.
Layer 2: Systems Software
Between the hardware and the model lives a thick layer of systems software that makes the hardware programmable and manages the complexity of distributed training and inference. This layer is often underappreciated but is where much of the real engineering difficulty lives.
Frameworks
The dominant ML frameworks are PyTorch and JAX. PyTorch, developed by Meta, dominates research and, increasingly, production. Its eager execution model (operations execute immediately, like normal Python) made it intuitive for researchers, and its evolution toward a compiled model via torch.compile and the TorchInductor compiler backend has closed much of the performance gap with graph-based approaches. JAX, developed by Google, takes a functional, compilation-first approach: you write Python functions that operate on arrays, and JAX traces them into an XLA HLO (High-Level Optimizer) graph that is then compiled and optimized for the target hardware. JAX’s functional constraints (no side effects, explicit random state) enable powerful transformations like automatic differentiation (jax.grad), automatic vectorization (jax.vmap), and automatic parallelization (jax.pmap, jax.shard_map) that are harder to achieve in PyTorch’s more imperative style.
Below the frameworks sit compiler stacks. PyTorch uses TorchInductor, which generates Triton or C++ code. JAX uses XLA. Both perform graph-level optimizations like operator fusion (combining multiple elementwise operations into a single kernel launch to reduce memory traffic), memory planning (deciding when to allocate and free intermediate tensors), and layout optimization (choosing memory layouts for tensors that match the hardware’s preferred access patterns). Triton, developed by OpenAI, deserves special mention: it’s a domain-specific language and compiler for writing GPU kernels at a higher level of abstraction than raw CUDA, making it dramatically easier to write high-performance custom kernels. Many of the most impactful optimizations in the stack (FlashAttention, various quantized kernels, fused layer norm implementations) are written in Triton or similar DSLs.
Distributed Training Systems
Training a model across thousands of GPUs requires orchestrating multiple forms of parallelism simultaneously. The key parallelism strategies are as follows.
Data parallelism is the simplest: every GPU holds a complete copy of the model, processes a different mini-batch of data, and then gradients are averaged across all GPUs via all-reduce before each weight update. Fully Sharded Data Parallelism (FSDP, or the DeepSpeed ZeRO family of techniques) is a refinement where model parameters, gradients, and optimizer states are sharded across GPUs and gathered on-demand, dramatically reducing per-GPU memory requirements at the cost of additional communication.
Tensor parallelism splits individual operations (particularly the large matrix multiplications in transformer layers) across multiple GPUs within a single node. Since this requires very frequent, fine-grained communication, it is only practical within the high-bandwidth NVLink domain. The seminal work here is Megatron-LM from NVIDIA, which showed how to partition transformer layers across GPUs with minimal communication overhead.
Pipeline parallelism splits the model by layers, with each GPU (or group of GPUs) responsible for a contiguous set of layers. Micro-batches flow through the pipeline, and the challenge is minimizing “pipeline bubbles” — idle time when a GPU is waiting for input from the previous stage or gradients from the next stage. Schedules like 1F1B (one forward, one backward), interleaved stages, and zero-bubble schedules reduce this idle time.
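For the idealized schedule, the bubble cost has a simple closed form: with p stages and m micro-batches, a fraction (p - 1)/(m + p - 1) of pipeline time is idle. A quick sketch of how micro-batch count amortizes the fill/drain cost:

```python
# Pipeline bubble fraction for an idealized GPipe/1F1B-style schedule:
# of every (m + p - 1) pipeline "ticks", (p - 1) are spent filling and
# draining the pipeline rather than doing useful work.

def bubble_fraction(stages: int, micro_batches: int) -> float:
    return (stages - 1) / (micro_batches + stages - 1)

# More micro-batches shrink the bubble, at the cost of smaller per-step
# batches and more activation memory in flight:
for m in (8, 32, 128):
    print(f"p=8, m={m}: bubble = {bubble_fraction(8, m):.1%}")
```

This is why pipeline-parallel training favors many small micro-batches, and why zero-bubble schedules, which restructure the backward pass to fill these gaps, are attractive at large stage counts.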
Sequence parallelism splits the sequence dimension of the input, which is particularly important for long-context models where the attention computation’s memory cost grows quadratically (or at least super-linearly) with sequence length. Techniques like Ring Attention distribute the attention computation across GPUs in a way that overlaps communication with computation.
Context parallelism is closely related, and the two terms are sometimes used interchangeably; it refers specifically to distributing the attention computation across the context/sequence dimension for long sequences.
Expert parallelism is specific to Mixture-of-Experts (MoE) models, where different experts (FFN sub-networks) reside on different GPUs, and an all-to-all communication step routes tokens to the appropriate experts.
In practice, frontier training runs use all of these simultaneously in a multi-dimensional parallelism configuration. For example, you might use 8-way tensor parallelism within a node, 8-way pipeline parallelism across nodes within a pod, and 512-way data parallelism (with FSDP) across pods, for a total of 32,768 GPUs. Figuring out the optimal parallelism configuration for a given model, cluster, and interconnect is itself a complex optimization problem, and tools like Alpa attempt to automate it.
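The bookkeeping for the example configuration above is just multiplication: the degrees of each parallelism dimension multiply to give the total GPU count.

```python
# The example configuration from the text, as arithmetic: parallelism
# degrees along each dimension multiply to the total GPU count.

tensor_parallel = 8      # within a node, over NVLink
pipeline_parallel = 8    # across nodes within a pod
data_parallel = 512      # across pods, with FSDP

total_gpus = tensor_parallel * pipeline_parallel * data_parallel
print(total_gpus)  # 32768
```

Choosing how to factor a fixed GPU count across these dimensions is exactly the optimization problem that tools like Alpa attempt to automate.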
Inference Systems
Inference — actually running a trained model to produce outputs — has a very different performance profile from training, and has spawned its own rich ecosystem of optimization techniques.
The dominant framework for LLM inference is vLLM, an open-source system that introduced PagedAttention, which manages the KV cache (the stored key and value tensors from previous tokens that are needed for autoregressive generation) using a virtual memory-like paging system. This dramatically reduces memory waste from fragmentation and over-allocation, enabling much higher throughput by serving more concurrent requests on the same hardware. Other inference systems include NVIDIA’s TensorRT-LLM, which leverages NVIDIA’s compiler stack for aggressive optimization, and SGLang, which focuses on programming-model innovations for structured generation.
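A toy sketch of the paging idea follows: the KV cache is a pool of fixed-size blocks, and each sequence keeps a block table mapping its logical token positions to physical blocks. This shows only the data structures; vLLM's real allocator also handles copy-on-write sharing, preemption, and eviction:

```python
# Toy PagedAttention-style KV-cache bookkeeping (data structures only).

class PagedKVCache:
    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.lengths = {}        # seq_id -> tokens written so far

    def append_token(self, seq_id):
        """Reserve KV-cache space for one new token of this sequence."""
        n = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if n % self.block_size == 0:
            # Current block full (or first token): grab a fresh block.
            # (A real system would preempt or queue when none remain.)
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=64, block_size=16)
for _ in range(40):          # a 40-token sequence needs ceil(40/16) = 3 blocks
    cache.append_token("req-1")
print(len(cache.block_tables["req-1"]))  # 3
```

Because blocks are allocated on demand and returned on completion, memory waste is bounded by at most one partially filled block per sequence, instead of the worst-case contiguous reservation a naive allocator would make.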
Continuous batching is the practice of dynamically adding new requests to a running batch as existing requests complete, rather than waiting for an entire batch to finish. This is essential for high-throughput serving because LLM requests vary enormously in length, and naive static batching leads to most GPUs sitting idle waiting for the longest request to finish.
Speculative decoding is an increasingly important technique where a small, fast “draft” model generates candidate tokens, and the large target model verifies them in parallel. Since verification is much cheaper per token than sequential generation (because you can process multiple tokens in a single forward pass), this can provide 2-3x speedups in wall-clock latency at the same output quality. Medusa and similar approaches explore using additional lightweight heads on the main model itself for speculation, avoiding the need for a separate draft model.
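The accept loop can be sketched for the greedy-decoding special case, where a draft token is kept only while it matches what the target model would have chosen anyway. The models here are stand-in functions over integer "tokens", and the general method uses a rejection-sampling rule over full distributions rather than this exact-match test:

```python
# Toy speculative-decoding accept loop under greedy decoding. The draft
# proposes k tokens; the target verifies them (in one batched pass on real
# hardware, shown sequentially here) and we keep the longest agreeing
# prefix plus one corrected token.

def speculative_step(prefix, draft_propose, target_greedy, k=4):
    draft = draft_propose(prefix, k)      # k candidate tokens
    accepted = []
    ctx = list(prefix)
    for t in draft:
        best = target_greedy(ctx)
        if t == best:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(best)         # replace the first mismatch
            return accepted
    accepted.append(target_greedy(ctx))   # bonus token if all were accepted
    return accepted

# Stand-in models: the target always continues with +1 steps; the draft
# agrees for two tokens, then guesses wrong.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx, k: [ctx[-1] + 1, ctx[-1] + 2, 99, 100][:k]

print(speculative_step([0], draft, target))  # [1, 2, 3]
```

Three tokens emerge from what would otherwise be one target-model step, which is where the wall-clock speedup comes from when the draft model's acceptance rate is high.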
Prefix caching stores the KV cache for common prompt prefixes (like system prompts) so they don’t need to be recomputed for every request. This is straightforward but can provide large speedups when many requests share a common prefix, which is the case in most deployed systems.
Disaggregated inference architectures separate the “prefill” phase (processing the input prompt, which is compute-bound) from the “decode” phase (generating output tokens one by one, which is memory-bandwidth-bound) onto different hardware, since they have different computational profiles and bottleneck on different resources. This enables better hardware utilization by matching each phase to hardware optimized for its specific bottleneck.
Kernel-Level Optimizations
At the lowest software level, individual GPU kernels — the functions that actually execute on the GPU hardware — are where theoretical hardware performance meets reality. The gap between naive and optimized kernel performance can be 10x or more.
FlashAttention (now in its third major iteration) is perhaps the most impactful single kernel optimization in the history of the transformer era. Standard attention requires materializing the full N×N attention matrix in HBM, which is both memory-intensive (quadratic in sequence length) and bandwidth-intensive. FlashAttention restructures the computation into tiles that fit in SRAM, computing attention block by block and accumulating results using the online softmax trick, never materializing the full attention matrix. This reduces memory usage from quadratic to linear and dramatically reduces HBM reads and writes, providing 2-4x speedups on attention-heavy workloads and enabling much longer context lengths.
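The online softmax trick at the heart of this restructuring can be shown in isolation: the softmax normalizer over a long vector is computed tile by tile, carrying only a running max and a rescaled running sum, so no full-length intermediate is ever materialized. A pure-Python sketch (FlashAttention additionally accumulates the weighted values per tile, omitted here):

```python
import math

def online_softmax_normalizer(scores, tile_size=4):
    """Compute max(scores) and sum(exp(s - max)) in one streaming pass
    over fixed-size tiles, holding only one tile at a time."""
    running_max, running_sum = float("-inf"), 0.0
    for i in range(0, len(scores), tile_size):
        tile = scores[i:i + tile_size]
        new_max = max(running_max, max(tile))
        # Rescale the old partial sum into the new max's frame, add the tile.
        running_sum = (running_sum * math.exp(running_max - new_max)
                       + sum(math.exp(s - new_max) for s in tile))
        running_max = new_max
    return running_max, running_sum

scores = [0.1, 2.0, -1.0, 3.5, 0.7, 1.2, -0.3, 2.8]
m, s = online_softmax_normalizer(scores, tile_size=3)
assert m == max(scores)
assert abs(s - sum(math.exp(x - m) for x in scores)) < 1e-12
```

The rescaling step is the key: when a later tile raises the running max, previously accumulated terms are multiplied by exp(old_max − new_max), keeping everything numerically stable without a second pass over earlier tiles.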
Fused kernels more broadly combine multiple operations (e.g., a matrix multiplication followed by a bias add, activation function, and dropout) into a single kernel launch, eliminating intermediate writes to HBM. The framework compilers (TorchInductor, XLA) perform some of this fusion automatically, but hand-written fused kernels for critical operations often still outperform auto-generated code.
Quantized kernels implement matrix multiplications and other operations directly in low-precision formats (INT8, INT4, FP8, FP4) with careful accumulation in higher precision to minimize accuracy loss. These kernels are specifically designed to exploit the Tensor Core’s native support for low-precision arithmetic, and achieving peak throughput requires careful attention to data layout, tile sizes, and accumulation strategies.
Layer 3: The Modeling Layer (Deep Dive)
This is the layer most people think of when they think of “AI research,” but as should be clear by now, it rests on and is deeply shaped by all the layers below it.
Architecture
The transformer remains the dominant architecture for language models, and indeed for most modalities as of early 2025, though the details have evolved substantially from the original 2017 “Attention Is All You Need” design.
Modern transformer variants almost universally use decoder-only architectures (as opposed to the original encoder-decoder design) for language modeling. Key architectural refinements that are now standard include: pre-normalization (applying layer norm before the attention and FFN sublayers, rather than after, which stabilizes training), RMSNorm (a simplified normalization that drops the mean-centering step), SwiGLU or GeGLU activations in the feedforward network (which consistently outperform ReLU), Rotary Position Embeddings (RoPE, which encode position information through rotation matrices applied to query and key vectors, enabling better length generalization than learned absolute position embeddings), and Grouped Query Attention (GQA, where multiple query heads share a single key-value head, dramatically reducing the KV cache size for inference with minimal quality loss).
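RoPE's mechanics fit in a few lines: consecutive dimension pairs of each query/key vector are rotated by a position-dependent angle, and because rotations compose, attention scores end up depending only on the relative offset between positions. A pure-Python sketch using the interleaved-pair convention (implementations differ in how they pair dimensions, and in practice this runs vectorized on full tensors):

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs (2i, 2i+1) of vec by pos * base^(-2i/d)."""
    d = len(vec)                      # head dimension, must be even
    out = []
    for i in range(0, d, 2):
        angle = pos * base ** (-i / d)
        x, y = vec[i], vec[i + 1]
        out.append(x * math.cos(angle) - y * math.sin(angle))
        out.append(x * math.sin(angle) + y * math.cos(angle))
    return out

# The key property: the q·k score depends only on the relative offset.
q, k = [1.0, 0.5, -0.3, 0.8], [0.2, -0.7, 0.9, 0.1]
dot = lambda a, b: sum(x * y for x, y in zip(a, b))
s1 = dot(rope(q, 3), rope(k, 7))     # offset 4, starting at position 3
s2 = dot(rope(q, 10), rope(k, 14))   # offset 4, starting at position 10
assert abs(s1 - s2) < 1e-9
```

Encoding position as rotation rather than as an additive learned embedding is what gives RoPE its relative-position behavior and its comparatively graceful length generalization.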
The attention mechanism itself has seen significant innovation aimed at reducing its quadratic cost in sequence length. Beyond FlashAttention (which is a systems optimization that doesn’t change the mathematical operation), there’s a family of approaches that modify the attention computation itself. Multi-head Latent Attention (MLA), introduced in DeepSeek-V2, projects keys and values into a much lower-dimensional latent space, compressing the KV cache by 10x or more. Various sparse and linear attention approximations exist, though none have fully displaced standard dense attention for the highest-quality models.
Mixture-of-Experts (MoE) models, where each transformer layer contains multiple FFN “experts” and a gating mechanism routes each token to only a small number of them, have become increasingly important. MoE enables models with very large total parameter counts (and thus large “capacity”) while keeping the computational cost per token manageable, since only a fraction of parameters are activated for any given token. DeepSeek-V3 and Mixtral are prominent examples. MoE introduces its own engineering challenges: load balancing across experts, the all-to-all communication for expert routing in distributed training, and the memory overhead of storing all expert parameters even though most are inactive for any given token.
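The gating step reduces to a top-k selection per token. The sketch below uses made-up router scores; a real router is a learned linear layer followed by a softmax, trained alongside auxiliary load-balancing losses:

```python
# Toy top-k routing for a Mixture-of-Experts layer: each token is scored
# against every expert and dispatched only to the k highest-scoring ones.

def top_k_route(router_scores, k=2):
    """Return the indices of the k highest-scoring experts for one token."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda e: router_scores[e], reverse=True)
    return sorted(ranked[:k])

# One token scored against 8 experts; only experts 2 and 5 run for it,
# so compute per token scales with k, not with the total expert count.
scores = [0.1, -0.4, 2.3, 0.0, 0.9, 1.7, -1.2, 0.3]
print(top_k_route(scores, k=2))  # [2, 5]
```

In a distributed setting, the routing decision is what triggers the all-to-all: each GPU holding a batch of tokens must ship every token to whichever GPUs host its selected experts, and collect the results back.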
State-space models (SSMs) like Mamba and its successors represent the most serious architectural alternative to transformers. SSMs process sequences through a learned linear recurrence, which can be computed very efficiently via a parallel scan, providing linear scaling with sequence length during training (compared to attention’s quadratic scaling). However, their ability to perform precise information retrieval over long contexts has historically been weaker than attention’s, leading to hybrid architectures like Jamba that interleave Mamba layers with attention layers. As of 2025, pure SSM models have not definitively surpassed transformers at the largest scales, but the architectural space is being actively explored, and hybrid approaches look promising.
Training
The training process for a modern LLM is a multi-stage pipeline, each stage with its own data, objectives, and engineering challenges.
Pre-training is the most compute-intensive phase, where the model learns general language understanding and world knowledge by predicting the next token on a massive text corpus. The corpus for a frontier model is typically on the order of 10-15+ trillion tokens, carefully curated and deduplicated from web crawls (Common Crawl), books, code repositories, scientific papers, and other sources. Data quality and mixture are among the most impactful decisions in the entire training process — models trained on cleaner, better-curated data consistently outperform those trained on larger but noisier datasets. The data pipeline — filtering, deduplication (both exact and fuzzy), quality classification (often using a smaller model to score document quality), content filtering, and tokenization — is a substantial engineering effort in itself.
The optimizer is almost universally some variant of Adam (typically AdamW, which decouples weight decay from the gradient update). Learning rate scheduling typically follows a warmup-then-cosine-decay or warmup-stable-decay pattern. Training stability at large scale is a persistent challenge; loss spikes (sudden increases in training loss) can occur due to bad data batches, numerical instability, or other factors, and teams maintain infrastructure for detecting and recovering from these events. Gradient clipping (capping the gradient norm) is standard practice for stability.
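A warmup-then-cosine-decay schedule is a pure function of the step number. The hyperparameters below are illustrative values in the typical range, not from any particular training run:

```python
import math

def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup=2000, total=100_000):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    if step < warmup:                      # linear warmup from 0 to max_lr
        return max_lr * step / warmup
    # Cosine decay from max_lr down to min_lr over the remaining steps.
    progress = (step - warmup) / (total - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

assert lr_at(0) == 0.0
assert abs(lr_at(2000) - 3e-4) < 1e-12     # peak at the end of warmup
assert abs(lr_at(100_000) - 3e-5) < 1e-12  # decayed to the floor
```

The warmup-stable-decay variant mentioned above differs only in holding max_lr flat for most of training and compressing the decay into a short final phase, which makes it easier to branch off checkpoints for continued training.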
Mixed-precision training is universal. The standard approach uses FP16 or BF16 for forward and backward passes (leveraging Tensor Core throughput) with FP32 master weights and optimizer states for numerical stability. BF16 is generally preferred over FP16 because its larger exponent range makes it more robust to overflow and underflow, eliminating the need for the loss scaling required with FP16. The Blackwell generation’s native FP8 and FP4 support is pushing the frontier of training precision even lower, with careful techniques for managing the quantization error (like per-tensor or per-channel scaling factors, and stochastic rounding).
Scaling laws, established by Kaplan et al. at OpenAI and refined by Hoffmann et al. (the “Chinchilla” paper) at DeepMind, provide empirical relationships between model size, dataset size, compute budget, and model performance (measured as pre-training loss). The Chinchilla insight was that many early models were significantly undertrained — for a given compute budget, it’s better to train a smaller model for longer on more data than to train a larger model for fewer steps. These scaling laws guide the most expensive allocation decision in the entire stack: given a compute budget, how big should the model be, and how much data should it see? More recent work has refined these laws and extended them to account for factors like data quality, mixture, and inference cost (since a smaller, better-trained model may be cheaper to deploy).
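The allocation logic can be sketched with the standard rules of thumb: training compute is roughly C ≈ 6·N·D FLOPs for N parameters and D tokens, and the Chinchilla-optimal point is roughly D ≈ 20·N. These are approximations, not the paper's exact fitted coefficients:

```python
# Back-of-envelope Chinchilla-style allocation. Solving
#   C = 6 * N * D  with  D = 20 * N
# gives N = sqrt(C / 120). Rules of thumb only.

def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    n_params = (compute_flops / (6 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# A 1e24-FLOP budget:
n, d = chinchilla_optimal(1e24)
print(f"~{n / 1e9:.0f}B params trained on ~{d / 1e12:.1f}T tokens")
# → ~91B params, ~1.8T tokens
```

Doubling the compute budget under these assumptions scales both N and D by √2, which is the quantitative sense in which "bigger model" and "more data" should grow together rather than one at the expense of the other.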
Post-training encompasses everything after pre-training and is where models are refined from raw next-token predictors into useful, aligned assistants. Supervised Fine-Tuning (SFT) trains the model on curated examples of desired input-output behavior (e.g., instruction-response pairs). RLHF (Reinforcement Learning from Human Feedback) trains a reward model on human preference data (which response is better?) and then uses reinforcement learning (typically PPO) to optimize the language model against this reward. DPO (Direct Preference Optimization) simplifies this pipeline by directly optimizing the language model on preference data, without training a separate reward model or running RL at all, and has become popular due to its simplicity. Constitutional AI and related approaches use the model itself to generate and evaluate training data, reducing reliance on human annotators.
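The DPO objective for a single preference pair is compact enough to write out. Given summed log-probabilities of the chosen and rejected responses under the policy and the frozen reference model, the loss is -log sigmoid of a scaled margin (a sketch of the per-example loss from the DPO paper; variable names are my own):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair. Inputs are summed log-probs of
    the chosen/rejected responses under the policy (pi_*) and the frozen
    reference model (ref_*). Loss = -log sigmoid(beta * margin)."""
    x = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # numerically stable form of -log(sigmoid(x)), i.e. softplus(-x)
    return max(0.0, -x) + math.log1p(math.exp(-abs(x)))
```

The loss falls as the policy moves probability mass toward the chosen response relative to the reference, which is exactly the implicit reward that replaces the separate reward model.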
Reinforcement learning for reasoning, popularized by OpenAI's o1 and openly detailed by DeepSeek-R1, trains models to produce extended chains of thought before answering, using RL with outcome-based rewards (was the final answer correct?). This has proven to be a remarkably effective way to improve performance on math, coding, and reasoning tasks, and has become the standard approach for frontier “reasoning” or “thinking” models. The training involves generating many completions per prompt, scoring them, and using algorithms like GRPO (Group Relative Policy Optimization) to update the model to favor successful reasoning traces.
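The “group relative” part of GRPO is the advantage computation: each completion's reward is normalized against the other completions for the same prompt, removing the need for a learned value function. A stripped-down sketch of just that step (the full algorithm adds PPO-style clipping and a KL penalty, omitted here):

```python
import statistics

def group_relative_advantages(rewards):
    """Advantage of each completion relative to its own group: subtract
    the group's mean reward and divide by its standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # uniform groups get zero advantage
    return [(r - mean) / std for r in rewards]
```

With binary correctness rewards, correct completions in a mostly-wrong group get large positive advantages, which is what pushes the model toward rare-but-successful reasoning traces.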
Synthetic data has become increasingly central to post-training. Rather than relying entirely on human-generated training data, models are used to generate training data for themselves or for other models. This includes using a strong model to generate instruction-response pairs, using models to judge the quality of other models’ responses, and using models to generate diverse reasoning traces for RL training. The quality and diversity of synthetic data is a key differentiator between labs.
Inference-Time Optimizations at the Model Level
Beyond the systems-level inference optimizations discussed earlier, there are model-level techniques that dramatically improve inference efficiency.
Quantization reduces the precision of model weights (and sometimes activations) after training. Post-training quantization (PTQ) applies quantization to a pre-trained model without retraining, using calibration data to determine optimal scaling factors. Common approaches include GPTQ (which uses approximate second-order information to minimize quantization error), AWQ (Activation-Aware Weight Quantization, which observes that some weights are much more important than others based on activation magnitudes), and simpler round-to-nearest approaches. Quantization-aware training (QAT) simulates quantization during training, allowing the model to adapt to the reduced precision. In practice, well-executed 4-bit weight quantization (W4A16 — 4-bit weights, 16-bit activations) preserves nearly all model quality for most tasks, while reducing model size by 4x and enabling proportional improvements in memory bandwidth utilization during inference.
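The simplest of these, round-to-nearest with a symmetric per-channel scale, is easy to show end to end (a toy sketch operating on Python lists; real W4A16 kernels pack two 4-bit values per byte and fuse dequantization into the matmul):

```python
def quantize_rtn_int4(channel):
    """Symmetric round-to-nearest quantization of one weight channel to
    signed 4-bit integers in [-8, 7]; the scale maps max|w| onto 7."""
    max_abs = max(abs(w) for w in channel) or 1.0
    scale = max_abs / 7.0
    q = [max(-8, min(7, round(w / scale))) for w in channel]
    return q, scale

def dequantize(q, scale):
    """Recover approximate FP weights from the int4 codes."""
    return [v * scale for v in q]
```

Methods like GPTQ and AWQ improve on this by choosing rounding and scales that minimize the error on actual activations rather than on the weights in isolation.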
Knowledge distillation trains a smaller “student” model to mimic a larger “teacher” model, either by matching the teacher’s output logits (soft targets) or by training on data generated by the teacher. This is a key technique for producing smaller, faster models that retain much of the teacher’s capability.
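The soft-target variant reduces to a KL divergence between temperature-softened output distributions, scaled by T² so gradient magnitudes stay comparable across temperatures (following the classic Hinton et al. formulation; a sketch, not any library's API):

```python
import math

def softened_softmax(logits, t):
    """Softmax over logits divided by temperature t."""
    exps = [math.exp(l / t) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def distill_loss(student_logits, teacher_logits, t=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by t**2 per the standard distillation recipe."""
    p = softened_softmax(teacher_logits, t)
    q = softened_softmax(student_logits, t)
    return t * t * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

Higher temperatures expose more of the teacher's "dark knowledge" in the relative probabilities of wrong answers, which is a large part of why soft targets beat training on hard labels alone.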
Pruning removes unnecessary weights or structures from the model. Unstructured pruning (zeroing out individual weights) can achieve high sparsity ratios but requires sparse matrix hardware support to achieve actual speedups. Structured pruning (removing entire attention heads, neurons, or layers) produces directly smaller dense models but typically requires more careful tuning to maintain quality.
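Unstructured magnitude pruning, the baseline most pruning papers compare against, can be sketched in a few lines (toy version over a flat weight list; in practice this runs per-tensor and is usually followed by a short fine-tuning pass to recover quality):

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest
    absolute value; all other weights are kept unchanged."""
    k = int(len(weights) * sparsity)
    drop = set(sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]
```

The zeros only translate to speed if the hardware can skip them, which is exactly the gap between unstructured sparsity ratios and realized wall-clock gains.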
Architecture search and model design for efficiency remain active research areas. Techniques like neural architecture search (NAS) can optimize model architectures for specific hardware and latency constraints, though in practice most frontier models use relatively standard transformer architectures with hand-tuned hyperparameters.
Context Length
Extending the context length of transformers — the amount of text they can process at once — has been a major research focus. The challenges are both computational (attention’s quadratic cost) and learning-related (models trained on short contexts don’t automatically generalize to long ones). RoPE allows some length extrapolation, and techniques like YaRN and NTK-aware scaling extend this further. Long-context training typically follows a progressive strategy: train on shorter contexts first, then continue training on longer-context data at a lower learning rate. Achieving reliable performance at very long contexts (100k+ tokens) also requires data that exercises long-range dependencies, which is harder to curate.
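NTK-aware scaling works by inflating RoPE's frequency base so the slowest-rotating dimensions stretch over a longer context, rather than compressing all positions uniformly. A sketch of the frequency formula and the commonly used base adjustment (the exponent below follows the standard community derivation; exact recipes vary across implementations):

```python
def rope_inv_freqs(base, head_dim):
    """Rotary frequencies: theta_i = base**(-2i / head_dim) for each
    dimension pair i. Low i rotates fast; high i rotates slowly."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def ntk_scaled_base(base, scale, head_dim):
    """NTK-aware scaling: raise the base so the slowest frequencies cover
    a context roughly `scale` times longer than the training length."""
    return base * scale ** (head_dim / (head_dim - 2.0))
```

YaRN refines this further by interpolating differently across the frequency spectrum instead of applying a single base change to every dimension.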
Evaluation
How do we know if a model is good? Evaluation is one of the most important and most difficult parts of the modeling layer. Standard benchmarks like MMLU (knowledge), HumanEval and SWE-bench (coding), GSM8K and MATH (mathematics), and various reasoning benchmarks provide quantitative scores, but each has significant limitations: benchmark contamination (the test data appearing in training data), saturation (scores approaching 100%, leaving little room to differentiate), and the gap between benchmark performance and real-world usefulness. LLM-as-judge approaches, where a strong model evaluates the outputs of other models, have become standard for more open-ended tasks but introduce their own biases. The field increasingly relies on human evaluation (e.g., Chatbot Arena’s Elo ratings from human pairwise comparisons) as the gold standard, but this is expensive and noisy. A significant open challenge is developing evaluations that reliably measure the capabilities we actually care about, particularly for safety-relevant properties.
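The rating machinery behind pairwise comparisons is worth seeing concretely. Chatbot Arena actually fits a Bradley–Terry model over all battles, but the classic online Elo update conveys the same idea in a few lines (a sketch; constants like K=32 are conventional, not Arena's):

```python
def elo_update(r_a, r_b, a_wins, k=32.0):
    """One Elo update from a single pairwise comparison: compute A's
    expected score from the logistic curve, then nudge both ratings."""
    expected_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new
```

An upset (the lower-rated model winning) moves ratings much more than an expected result, which is what lets the system converge with relatively few noisy human votes per model pair.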
Layer 4: Applications
The application layer is where models meet users and produce value. While this article is focused on the lower layers, it’s worth briefly noting the key components.
The serving stack handles routing user requests to model instances, managing load balancing, rate limiting, content filtering, and response streaming. Orchestration frameworks like LangChain, LlamaIndex, and, more recently, dedicated agent frameworks manage the increasingly complex pipelines that surround a single model call: retrieval-augmented generation (RAG, where relevant documents are retrieved from a database and provided as context), tool use (models calling APIs, executing code, browsing the web), multi-step agent loops, and structured output generation.
Retrieval-Augmented Generation has become the standard approach for grounding model outputs in up-to-date or domain-specific information. The RAG pipeline involves chunking documents, embedding them with a vector embedding model, storing them in a vector database (Pinecone, Weaviate, Chroma, pgvector, etc.), retrieving relevant chunks at query time via approximate nearest neighbor search, and providing them as context to the LLM. The quality of the retrieval step — chunk size, embedding model quality, retrieval strategy (dense, sparse, or hybrid) — often matters more than the choice of LLM.
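The retrieval core of that pipeline is just similarity search over embeddings. A toy dense-retrieval sketch with exact cosine similarity (production systems replace the exhaustive scan with approximate nearest neighbor indexes; all names here are illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query_vec, index, top_k=2):
    """Rank (chunk_text, embedding) pairs by similarity to the query
    embedding and return the top_k chunk texts as LLM context."""
    scored = sorted(index, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [chunk for chunk, _ in scored[:top_k]]
```

Hybrid retrieval adds a sparse (BM25-style) score alongside the dense one, which often rescues exact-keyword queries that embeddings handle poorly.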
Agent frameworks enable models to take actions in the world: executing code, calling APIs, browsing the web, and managing multi-step reasoning and planning. The agent paradigm is rapidly evolving, with approaches ranging from simple ReAct-style prompt-based agents to more sophisticated architectures with explicit planning, memory, and tool-use modules. The Model Context Protocol (MCP, introduced by Anthropic) is an emerging standard for structured tool integration.
Fine-tuning and customization allow organizations to specialize general-purpose models for specific domains or tasks. Parameter-efficient fine-tuning methods like LoRA (Low-Rank Adaptation, which trains small low-rank update matrices rather than modifying all weights) have made fine-tuning accessible to organizations without massive compute budgets. QLoRA extends this by quantizing the base model to 4-bit precision during LoRA training, further reducing memory requirements.
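The LoRA forward pass is small enough to write out: the frozen weight's output plus a scaled low-rank correction. A toy sketch with plain nested lists (real implementations operate on framework tensors, and B is initialized to zero so training starts exactly at the base model):

```python
def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def lora_forward(x, W, A, B, alpha=16.0, r=2):
    """LoRA: y = W x + (alpha / r) * B (A x).
    W (d_out x d_in) is frozen; only A (r x d_in) and B (d_out x r)
    are trained, so the trainable parameter count scales with r."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]
```

Because the update is a product of two thin matrices, it can be merged into W after training, so serving a LoRA-tuned model adds no inference cost.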
Guardrails and safety systems implement content filtering, output validation, and policy enforcement. These typically involve a combination of classifier models (detecting harmful content, PII, etc.), rule-based filters, and structured output validation. The tension between safety and capability — between preventing misuse and avoiding over-refusal — is one of the defining challenges of deployed LLM systems.
Cross-Cutting Themes
Several themes cut across all layers of the stack.
The first is co-design. The most impactful advances often come from optimizing across layer boundaries rather than within a single layer. FlashAttention is a kernel optimization that changes how an architectural component (attention) is computed. Quantization-aware training bridges the model and hardware layers. MoE architectures require specific interconnect and systems support. The most effective organizations have deep vertical integration across the stack.
The second is the compute-memory tradeoff. At every layer, there is a fundamental tension between doing more computation and accessing more memory. The shift to lower-precision arithmetic, the development of HBM, the design of attention variants, and the engineering of KV cache management systems are all different manifestations of navigating this same fundamental tradeoff.
The third is scale as an engineering problem. Many of the most important challenges in the stack are not about cleverness but about making things work reliably at enormous scale: training on tens of thousands of unreliable GPUs for months, serving millions of concurrent users, processing trillions of tokens of training data, and doing all of this while maintaining quality and controlling costs.
Where the Stack Is Heading
The direction of travel is toward more specialization at every layer: specialized chips for training vs. inference, specialized network topologies for different parallelism patterns, specialized kernels for different precisions and operation types, specialized model architectures for different modalities and tasks, and specialized application frameworks for different use cases. At the same time, there’s a counter-trend toward unification and abstraction — compilers that can target multiple hardware backends, frameworks that handle parallelism automatically, and foundation models that handle multiple modalities and tasks within a single architecture.
The economic dynamics are also shifting. Training costs for frontier models are growing rapidly (likely exceeding $1 billion for the next generation), concentrating frontier training among a handful of organizations. But inference costs are falling rapidly due to quantization, distillation, specialized hardware, and systems optimizations, democratizing access to powerful models. The strategic question for the industry is whether the value accrues primarily to those who can train frontier models or to those who can most efficiently deploy and customize them.
Every layer of this stack is evolving rapidly, and the interactions between layers are becoming more complex. Understanding the full stack — not just the model architecture or the application framework, but the entire tower from power plants to user interfaces — is increasingly what separates those who can build effective AI systems from those who are just calling an API.
Thanks for reading. If you found this useful, subscribe for more deep dives into the infrastructure and engineering behind modern AI systems.

