We thank the reviewers once more for visiting this anonymous link. This page primarily provides visualizations and additional results for the end-to-end benchmarks requested by the reviewers, which could not fit in the response. We first explain the live benchmark setup in detail to avoid any confusion and then present the results.
We use SGLang, a production-oriented serving framework, and run every experiment in its live server mode, so that HTTP parsing, dynamic queueing, and GPU kernel invocation are timed together. Evaluating in this mode exposes the queueing overhead and network latency absent from offline testing, revealing how GLA and MLA behave under real deployment constraints. The load generator sends 1280 prompts with a chosen concurrency limit, which controls the number of active requests at once. The server combines these active requests into small batches on the fly, so the limit controls load pressure rather than a fixed batch size. We use the pre-trained weights of DeepSeek-Coder-V2 Base (236B parameters, 21B active), quantized to FP8 and served with our kernels. For the benchmarks, we set the page size to 64. To simulate GLA, we restructure the MLA latent dimension into the GLA layout with randomly initialized weights, since this phase benchmarks performance rather than accuracy. We also employ chunked prefills (Agrawal et al., 2023) with a tile length of 8192 tokens and run the prefill kernel one block at a time. Decode batches are formed independently, so prefill tokens never mix with decode tokens by default.
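To make the load-generation protocol concrete, the sketch below shows how a concurrency-limited client could drive the live server: 1280 prompts are issued while a semaphore caps the number of in-flight requests at the chosen limit, leaving batching entirely to the server. The endpoint path, payload fields, and token budget are illustrative assumptions, not the exact harness used in our benchmarks.

```python
# Minimal sketch of a closed-loop load generator with a concurrency limit.
# The /generate endpoint and payload fields are assumptions for illustration.
import asyncio
import time

import aiohttp

SERVER_URL = "http://localhost:30000/generate"  # assumed server endpoint
CONCURRENCY_LIMIT = 64                          # active requests at once
NUM_PROMPTS = 1280

async def send_request(session, sem, prompt):
    async with sem:                              # blocks until a slot frees up
        start = time.perf_counter()
        payload = {"text": prompt, "sampling_params": {"max_new_tokens": 256}}
        async with session.post(SERVER_URL, json=payload) as resp:
            await resp.json()
        return time.perf_counter() - start       # per-request end-to-end latency

async def run_benchmark(prompts):
    sem = asyncio.Semaphore(CONCURRENCY_LIMIT)
    async with aiohttp.ClientSession() as session:
        tasks = [send_request(session, sem, p) for p in prompts]
        return await asyncio.gather(*tasks)      # server batches arrivals itself

if __name__ == "__main__":
    prompts = [f"prompt {i}" for i in range(NUM_PROMPTS)]  # placeholder prompts
    latencies = asyncio.run(run_benchmark(prompts))
    print(f"median E2E latency: {sorted(latencies)[len(latencies) // 2]:.2f}s")
```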
Every transformer block is sharded across eight GPUs with tensor parallelism, while the MoE feed-forward layers are further partitioned with expert parallelism. We also benchmark a mix of data parallelism and tensor parallelism; whenever data parallelism is enabled, only the attention submodule is replicated across data-parallel groups. Its outputs are all-gathered before the MoE feed-forward layer and then redistributed, which mitigates the KV-cache duplication of MLA. We benchmark a broad spectrum of inference workloads to assess GLA and MLA both under identical parallelism configurations and in cases where GLA employs only tensor parallelism while MLA combines tensor and data parallelism. We report four service-level metrics: end-to-end (E2E) latency, time-to-first-token (TTFT), inter-token latency (ITL), and output throughput. All values in the figures are summarized by their median, which is less sensitive to the heavy-tail behavior of large-scale interactive systems.
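For clarity on how the four metrics are derived, the sketch below summarizes per-request timing records into median E2E latency, TTFT, ITL, and aggregate output throughput. The record field names are hypothetical placeholders rather than the fields of any specific measurement tool.

```python
# Sketch of metric aggregation from per-request timing records.
# Field names (request_start, first_token_time, finish_time, num_output_tokens)
# are assumed for illustration.
from statistics import median

def summarize(records):
    e2e = [r["finish_time"] - r["request_start"] for r in records]
    ttft = [r["first_token_time"] - r["request_start"] for r in records]
    # Inter-token latency: decode time spread over the generated tokens.
    itl = [
        (r["finish_time"] - r["first_token_time"]) / max(r["num_output_tokens"] - 1, 1)
        for r in records
    ]
    wall_clock = max(r["finish_time"] for r in records) - min(
        r["request_start"] for r in records
    )
    throughput = sum(r["num_output_tokens"] for r in records) / wall_clock
    return {
        "median_e2e_s": median(e2e),       # end-to-end latency
        "median_ttft_s": median(ttft),     # time-to-first-token
        "median_itl_s": median(itl),       # inter-token latency
        "output_tok_per_s": throughput,    # aggregate output throughput
    }
```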
In this configuration, with TP degree 8 across eight H100 GPUs, GLA-8 employs eight latent heads, so each token caches a 256-dimensional latent per device, whereas MLA maintains a 512-dimensional latent cache duplicated across devices. Both methods use a decoupled RoPE dimension of 64. Figure 7 and Table 27 reveal consistent gains for GLA-8 at every load level. With 16 concurrent requests, GLA-8 reduces the median end-to-end latency from 136 to 117 seconds, a reduction of approximately 15%, while increasing token throughput by approximately 17%. When the concurrency limit rises to 64, GLA-8 completes in 179 seconds compared to 381 seconds for MLA, cutting latency by 53%; the first token now arrives after 12 seconds rather than about 3 minutes, and throughput grows by about 70% to 1461 tokens per second. Even with 128 concurrent requests, GLA-8 still reduces latency by around 24% and maintains a throughput lead of nearly 60%. These advantages stem from GLA-8's smaller per-device KV-cache footprint, which reduces memory traffic, allows more active requests to fit on the GPUs, and shortens the waiting time before computation can begin.
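As a rough sanity check on the footprint argument, the sketch below estimates the per-device cache cost per token from the dimensions quoted above. The one-byte-per-element width is an assumption (only the weights are stated to be FP8), and the RoPE part is assumed to be cached on every device for both methods.

```python
# Back-of-the-envelope per-device KV-cache footprint per token.
# GLA-8: a 256-dim latent per device; MLA: a 512-dim latent replicated on every
# device; both add a decoupled 64-dim RoPE part. bytes_per_elem=1 is an
# illustrative assumption about the cache dtype.
def cache_bytes_per_token(latent_dim, rope_dim=64, bytes_per_elem=1):
    return (latent_dim + rope_dim) * bytes_per_elem

gla8 = cache_bytes_per_token(256)  # 320 bytes/token per device
mla = cache_bytes_per_token(512)   # 576 bytes/token, duplicated on each device
print(f"GLA-8: {gla8} B/token, MLA: {mla} B/token, ratio: {mla / gla8:.2f}x")
```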

