Did Alibaba just pull off a ‘DeepSeek’?
Like its compatriot earlier this year, the Chinese group has managed to draw attention with a scientific paper on AI frugality.
Autoscaling planned down to the token level
DeepSeek caused a shock wave by presenting LLMs trained with significantly fewer compute resources than the market’s reference models.
On Alibaba’s side, the logic is the same, but applied to inference. It involves a GPU pooling system named Aegaeon.
In this field, two major approaches are traditionally implemented: multiplexing and autoscaling.
Multiplexing places several model instances on each GPU, with spatial or temporal sharing (for example via NVIDIA MPS). The mechanism is limited by the amount of available VRAM.
The autoscaling method is more “aggressive”: it adapts model placement over time, loading models from host memory or from external storage.
The effectiveness of autoscaling is limited by the ratio of active models within workloads. Yet, the execution duration of LLM queries means that at any moment a large number of models are active, even if invocations are sporadic. When all GPU instances are busy with active models, new requests must wait (you cannot batch prefill and decoding operations tied to different models in a single batch). In this context, reserving fewer instances than there are models increases the risk of compromising SLO compliance.
To overcome this limit, Alibaba implemented a scheduling mechanism not at the request level, but at the token level. The approach had already been tested in single-model configurations. In multi-model setups, it becomes all the more critical as the number of batches grows (since requests from different models cannot be batched together).
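To make the difference concrete, here is a minimal sketch (Python, with made-up switch and per-token costs; an illustration of the principle, not Aegaeon's code) comparing request-level and token-level scheduling on one instance shared by two models. Preempting at token boundaries bounds how long the second model waits for its first token, at the cost of more model switches, which is precisely why Aegaeon then strives to make those switches cheap.

```python
# Illustrative sketch only: compare request-level vs token-level scheduling
# on a single GPU instance shared by two models. All costs are hypothetical.

SWITCH_COST = 5.0   # seconds to swap in another model (made-up figure)
TOKEN_TIME = 0.05   # seconds per generated token (made-up figure)

def request_level(requests):
    """Finish each request entirely before switching models."""
    t, current, ttft = 0.0, None, {}
    for rid, model, n_tokens in requests:
        if model != current:
            t += SWITCH_COST
            current = model
        ttft.setdefault(rid, t + TOKEN_TIME)   # time to first token
        t += n_tokens * TOKEN_TIME
    return ttft

def token_level(requests, quantum=8):
    """Round-robin between requests, preempting every `quantum` tokens."""
    t, current, ttft = 0.0, None, {}
    pending = [[rid, model, n] for rid, model, n in requests]
    while pending:
        rid, model, n = pending.pop(0)
        if model != current:
            t += SWITCH_COST
            current = model
        ttft.setdefault(rid, t + TOKEN_TIME)
        step = min(quantum, n)
        t += step * TOKEN_TIME
        if n > step:
            pending.append([rid, model, n - step])
    return ttft

reqs = [("r1", "model-a", 400), ("r2", "model-b", 400)]
print("request-level TTFT:", request_level(reqs))  # r2 waits for all of r1
print("token-level TTFT:  ", token_level(reqs))    # r2 gets its first token much sooner
```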
Distinct partitions for prefill and decoding
Aegaeon schedules request execution and autoscaling in parallel. Given a set of GPU instances and target SLOs (TTFT, time to first token; TBT, time between tokens), it selects the next task (prefill or decoding) for each instance, and it may undertake a preemptive autoscale if a scheduled task uses a model other than the one currently loaded.
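As a rough illustration of that decision loop (the helpers `pick_next_task`, `run_task`, `load_weights` and the `drain` method are hypothetical stand-ins, not Aegaeon's interfaces), the idea can be sketched as follows:

```python
# Sketch of the per-instance step described above: choose the next task and,
# if it targets another model, start loading that model's weights while the
# work already running on the instance drains. Helpers are hypothetical.
from concurrent.futures import ThreadPoolExecutor

loader = ThreadPoolExecutor(max_workers=1)   # background weight loading

def schedule_step(instance, pick_next_task, run_task, load_weights):
    task = pick_next_task(instance)   # next prefill or decode task, per SLO policy
    if task is None:
        return
    if task.model != instance.active_model:
        # Preemptive autoscale: kick off the weight load asynchronously...
        future = loader.submit(load_weights, instance, task.model)
        # ...so it overlaps with draining the previously scheduled work.
        instance.drain()
        future.result()
        instance.active_model = task.model
    run_task(instance, task)
```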
Alibaba opted to disaggregate the prefill and decoding phases, splitting the GPU pool into two partitions. The decoding partition uses weighted round-robin scheduling, intended to maximize adherence to the TBT target. Prefill uses grouped scheduling: requests targeting the same model are gathered together, while a first-come, first-served order is maintained to avoid resource starvation. Tasks are added to an existing group when possible; otherwise, Aegaeon creates one and attaches it to the least-populated queue (each instance has its own queue). The prefill batch size is capped at 1, given the roughly linear relationship between token count and execution time, and because smaller batches reduce waiting times.
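A minimal sketch of this grouped prefill scheduling, using simplified data structures rather than the paper's, could look like this:

```python
# Simplified sketch of grouped prefill scheduling: requests for the same model
# join an existing group, new groups go to the least-populated queue, and the
# prefill batch size is capped at 1. Not Aegaeon's actual implementation.
from collections import deque

class PrefillScheduler:
    def __init__(self, num_instances: int):
        # One FCFS queue of groups per prefill instance.
        self.queues = [deque() for _ in range(num_instances)]

    def submit(self, request_id: str, model: str):
        # Add to an existing queued group for this model if possible.
        for queue in self.queues:
            for group in queue:
                if group["model"] == model:
                    group["requests"].append(request_id)
                    return
        # Otherwise, open a new group on the least-populated queue.
        target = min(self.queues, key=len)
        target.append({"model": model, "requests": [request_id]})

    def next_batch(self, instance_idx: int):
        # Batch size capped at 1: serve one request from the head group, FCFS.
        queue = self.queues[instance_idx]
        if not queue:
            return None
        group = queue[0]
        request_id = group["requests"].pop(0)
        if not group["requests"]:
            queue.popleft()
        return group["model"], [request_id]
```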
Reuse to avoid reinitializing everything
Preemptive autoscaling solutions tend to focus on accelerating model loading. Alibaba Cloud looked at the other steps of the procedure: engine reinitialization, memory management, and transfers of the key-value cache.
Engine reinitialization is a sequence that, without optimization, can take tens of seconds. It notably includes initializing the key-value cache, loading the weights, profiling and optimization (allocating space for the key-value cache), and starting orchestrators such as Ray to distribute execution.
Alibaba reasoned that the initialization of these components could safely be reused across models. Aegaeon therefore initializes the engine only once per instance, caching everything except the weights and the key-value cache. For the latter, it uses a preallocated pool in host memory, avoiding the need to pin pages during autoscaling. Altogether, this reduces latency by more than 80%.
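The host-memory pool idea can be illustrated with a short PyTorch sketch (hypothetical sizes, CUDA-capable machine assumed; this is not Aegaeon's code): the pinned allocation happens once at engine initialization, so autoscaling only carves fixed-size slices out of it.

```python
# Illustrative sketch: preallocate a pinned host-memory pool once, then hand
# out fixed-size KV-cache buffers from it during autoscaling without any
# page-pinning on the critical path. Sizes are hypothetical.
import torch

class HostKVPool:
    def __init__(self, total_bytes: int, block_bytes: int):
        # One large pinned allocation, done once per instance at engine init.
        self.pool = torch.empty(total_bytes, dtype=torch.uint8, pin_memory=True)
        self.block_bytes = block_bytes
        self.free_blocks = list(range(total_bytes // block_bytes))

    def allocate(self):
        # Hand out a slice of the existing pinned pool; no new pinning occurs.
        idx = self.free_blocks.pop()
        start = idx * self.block_bytes
        return idx, self.pool[start:start + self.block_bytes]

    def release(self, idx: int):
        self.free_blocks.append(idx)
```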
CUDA events put to work
Memory management is made explicit. It relies, among other things, on a self-managed VRAM buffer and a “unified” key-value cache: each memory region, in VRAM or DRAM, is divided into fixed-size fragments that host different cache blocks depending on the layout.
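One way to picture such a “unified” cache (the data structures below are assumptions for illustration, not those of the paper) is a mapping from logical cache blocks to fixed-size fragments living either in VRAM or in pinned DRAM:

```python
# Illustrative sketch of a unified KV cache: each region (VRAM or DRAM) is cut
# into fixed-size fragments, and logical blocks of the active model are mapped
# onto them. Fragment size and structures are assumptions; requires CUDA.
import torch

FRAGMENT_BYTES = 2 * 1024 * 1024   # hypothetical fragment size

class UnifiedKVCache:
    def __init__(self, vram_fragments: int, dram_fragments: int):
        self.regions = {
            "vram": torch.empty(vram_fragments * FRAGMENT_BYTES,
                                dtype=torch.uint8, device="cuda"),
            "dram": torch.empty(dram_fragments * FRAGMENT_BYTES,
                                dtype=torch.uint8, pin_memory=True),
        }
        self.free = {"vram": list(range(vram_fragments)),
                     "dram": list(range(dram_fragments))}
        # logical (model, block_id) -> (region, fragment index)
        self.block_map = {}

    def place_block(self, model: str, block_id: int, region: str):
        fragment = self.free[region].pop()
        self.block_map[(model, block_id)] = (region, fragment)

    def view(self, model: str, block_id: int) -> torch.Tensor:
        region, fragment = self.block_map[(model, block_id)]
        buffer = self.regions[region]
        return buffer[fragment * FRAGMENT_BYTES:(fragment + 1) * FRAGMENT_BYTES]
```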
As for transfers of the key-value cache between host and GPU, the objective is to overlap them with execution while minimizing data contention. CUDA events were employed for this purpose, to track each transfer individually.
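In PyTorch terms, this per-transfer tracking looks roughly like the following (a simplified stand-in for the mechanism described above, not Alibaba's implementation; a CUDA device is assumed):

```python
# Illustrative sketch: issue a KV-cache copy on a dedicated stream and record
# a CUDA event per transfer, so downstream compute waits only on the transfers
# it actually needs instead of synchronizing the whole device.
import torch

assert torch.cuda.is_available()
copy_stream = torch.cuda.Stream()          # dedicated stream for cache copies
compute_stream = torch.cuda.current_stream()

host_block = torch.empty(1 << 20, dtype=torch.uint8, pin_memory=True)
device_block = torch.empty(1 << 20, dtype=torch.uint8, device="cuda")

with torch.cuda.stream(copy_stream):
    device_block.copy_(host_block, non_blocking=True)  # async host-to-device copy
    transfer_done = torch.cuda.Event()
    transfer_done.record(copy_stream)                  # one event per transfer

# The compute that consumes this block waits on its own event only, letting
# other transfers and kernels keep overlapping.
compute_stream.wait_event(transfer_done)
```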

From 1192 to 213 GPUs for Alibaba Cloud’s Model Studio
To evaluate Aegaeon, Alibaba chose a two-node setup with 8 GPUs per node (H800, 80 GB), 2 TB of DRAM (DDR5), and 192 Xeon Platinum 8469C CPU cores. They ran LLMs from several families (Qwen, Llama, InternLM, Yi, etc., predominantly 6 to 14 billion parameters) on the ShareGPT dataset and two “augmented” variants (longer inputs and outputs). The comparison was conducted against MuxServe and ServerlessLLM, two solutions that respectively rely on multiplexing and autoscaling.
Illustrating the limits of multiplexing, MuxServe never managed to place more than two models per GPU, for lack of VRAM.
At a rate of 0.1 requests per second, Aegaeon sustains twice the useful throughput of ServerlessLLM and handles up to 70 models with 10 decoding instances. ServerlessLLM suffers from long wait times. ServerlessLLM+ (an ad hoc variant adding Shortest-Job-First scheduling based on an oracle that knows output lengths) mitigates the effect, but performance inevitably degrades as more models become active.
At 0.5 requests per second, the useful-throughput gap over ServerlessLLM reaches 2.5×.

This gap persists on the “augmented” datasets and, albeit to a lesser extent, with stricter SLOs that leave less room for pooling. It is also observed on hardware-constrained configurations (for example, a node with 4 A10 GPUs). For Alibaba, this is evidence that Aegaeon could apply to a wide range of workloads.

The system has been powering Alibaba Cloud’s Model Studio for several months. It runs on a multi-region cluster of 213 H20 GPUs serving 47 models from 1.8 to 72 billion parameters. These models were originally served by 1192 H20 GPUs. The fleet has thus shrunk by 82%.
