Small models are quickly becoming more capable and relevant across a wide variety of enterprise use cases. At the same time, every new GPU generation packs dramatically more compute and memory bandwidth. The result? Even under high-concurrency workloads, small LLMs often leave a large fraction of GPU compute and memory bandwidth idle.
With use cases such as code completion, retrieval, grammar correction, and specialized models, our enterprise customers serve many such small language models on Databricks, and we are constantly pushing GPUs to their limits. NVIDIA's Multi-Process Service (MPS) looked like a promising tool: it allows multiple inference processes to share a single GPU context, enabling their memory and compute operations to overlap and effectively squeezing far more work out of the same hardware.
We set out to rigorously test whether MPS delivers higher throughput per GPU in our production environments. We found that MPS delivers meaningful throughput wins in these regimes:
- Very small language models (≤3B parameters) with short-to-medium context (<2k tokens)
- Very small language models (<3B) in prefill-only workloads
- Engines with significant CPU overhead
The key explanation, based on our ablations, is twofold: at the GPU level, MPS enables meaningful kernel overlap when individual engines leave compute or memory bandwidth underutilized, particularly during attention-dominant phases in small models; and, as a useful side effect, it can also mitigate CPU bottlenecks such as scheduler overhead or image-processing overhead in multimodal workloads by sharding the total batch across engines, reducing per-engine CPU load.
What Is MPS?
NVIDIA's Multi-Process Service (MPS) is a feature that allows multiple processes to share a single GPU more efficiently by multiplexing their CUDA kernels onto the hardware. As NVIDIA's official documentation puts it:
The Multi-Process Service (MPS) is an alternative, binary-compatible implementation of the CUDA Application Programming Interface (API). The MPS runtime architecture is designed to transparently enable co-operative multi-process CUDA applications.
In simpler terms, MPS provides a binary-compatible CUDA implementation within the driver that lets multiple processes (such as inference engines) share the GPU more efficiently. Instead of processes serializing access (and leaving the GPU idle between turns), their kernels and memory operations are multiplexed and overlapped by the MPS server whenever resources are available.
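To make this concrete, here is a minimal sketch of what running two engines under MPS can look like, assuming vLLM's `vllm serve` entry point; the pipe/log directories, model name, and memory fractions are illustrative rather than a recommended configuration:

```python
import os
import subprocess

# Minimal sketch: start the MPS control daemon, then launch two vLLM servers
# that share the same GPU. Paths and the model name are illustrative.
GPU_ID = "0"
PIPE_DIR = "/tmp/nvidia-mps"      # hypothetical pipe directory
LOG_DIR = "/tmp/nvidia-mps-log"   # hypothetical log directory

env = os.environ.copy()
env.update({
    "CUDA_VISIBLE_DEVICES": GPU_ID,
    "CUDA_MPS_PIPE_DIRECTORY": PIPE_DIR,
    "CUDA_MPS_LOG_DIRECTORY": LOG_DIR,
})

# 1) Start the MPS control daemon (it forks into the background with -d).
subprocess.run(["nvidia-cuda-mps-control", "-d"], env=env, check=True)

# 2) Launch two inference engines; both become MPS clients because they
#    inherit the MPS pipe directory and see the same GPU. Each one claims
#    a bit less than half of GPU memory so the pair fits together.
engines = [
    subprocess.Popen(
        ["vllm", "serve", "Qwen/Qwen2.5-1.5B-Instruct",
         "--port", str(port), "--gpu-memory-utilization", "0.45"],
        env=env,
    )
    for port in (8000, 8001)
]

for p in engines:
    p.wait()
```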
The Scaling Landscape: When Does MPS Help?
On a given hardware setup, effective utilization depends heavily on model size, architecture, and context length. Since recent large language models tend to converge on similar architectures, we use the Qwen2.5 model family as a representative example to explore the impact of model size and context length.
The experiments below compared two identical inference engines running on the same NVIDIA H100 GPU (with MPS enabled) against a single-instance baseline, using perfectly balanced homogeneous workloads.
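For intuition about the measurement setup (not the exact harness we used), a comparison of this shape can be approximated with a simple closed-loop client that splits the same request stream across one or two OpenAI-compatible endpoints and reports aggregate generation throughput; the endpoints, model name, and request counts below are placeholders:

```python
import time
import requests
from concurrent.futures import ThreadPoolExecutor

PROMPT = "Write a haiku about GPUs."  # placeholder workload

def one_request(url: str) -> int:
    # Send one completion request and return the number of generated tokens.
    resp = requests.post(
        f"{url}/v1/completions",
        json={"model": "Qwen/Qwen2.5-1.5B-Instruct",
              "prompt": PROMPT, "max_tokens": 128},
        timeout=300,
    )
    return resp.json()["usage"]["completion_tokens"]

def aggregate_throughput(endpoints: list[str], total_requests: int,
                         concurrency: int) -> float:
    # Round-robin the same request set across the endpoints
    # (1 endpoint = single-engine baseline, 2 endpoints = MPS pair).
    start = time.time()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(one_request, endpoints[i % len(endpoints)])
                   for i in range(total_requests)]
        tokens = sum(f.result() for f in futures)
    return tokens / (time.time() - start)

print("baseline:", aggregate_throughput(["http://localhost:8000"], 512, 64))
print("2x MPS  :", aggregate_throughput(
    ["http://localhost:8000", "http://localhost:8001"], 512, 64))
```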
Key observations from the scaling study:
- MPS delivers >50% throughput uplift for small models with short contexts.
- Gains drop log-linearly as context length increases, for the same model size.
- Gains also shrink rapidly as model size grows, even at short contexts.
- For the 7B model or 2k context, the benefit falls below 10% and eventually turns into a slowdown.

Key observations from the scaling study on prefill-heavy workloads:
- Small models (<3B): MPS consistently delivers a throughput improvement of over 100%.
- Mid-sized models (~3B): Benefits diminish as context length increases, eventually leading to a performance regression.
- Large models (>3B): MPS provides no performance benefit at these model sizes.
The scaling results above show that the benefits of MPS are most pronounced in low-GPU-utilization setups (small models and short contexts), which leave room for effective overlapping.
Dissecting the Gains: Where Do MPS Benefits Really Come From?
To pinpoint exactly why, we broke the problem down along the two core building blocks of modern transformers: the MLP (multi-layer perceptron) layers and the attention mechanism. By isolating each component (and removing other confounding factors like CPU overhead), we could attribute the gains more precisely.
GPU Resources Needed

| N = Context Length | Prefill (Compute) | Decode (Memory Bandwidth) | Decode (Compute) |
| --- | --- | --- | --- |
| MLP | O(N) | O(1) | O(1) |
| Attn | O(N²) | O(N) | O(N) |
Transformers consist of attention and MLP layers with different scaling behavior (a back-of-envelope sketch of these per-token costs follows the list below):
- MLP: Loads weights once; processes each token independently → constant memory bandwidth and compute per token.
- Attention: Loads the KV cache and computes dot products with all previous tokens → linear memory bandwidth and compute per token.
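As a rough illustration (our own simplification, ignoring grouped-query attention, quantization, and kernel-level details; the model dimensions are made-up values for a 3B-class model), the per-token decode costs can be estimated like this:

```python
# Back-of-envelope per-token decode costs: MLP cost is flat in context length,
# attention cost grows linearly with it.
def per_token_decode_cost(hidden: int, n_layers: int, ffn_mult: int,
                          context_len: int, bytes_per_param: int = 2) -> dict:
    # MLP: both projection matrices are read once per token, independent of N.
    mlp_params = n_layers * 2 * hidden * (hidden * ffn_mult)
    mlp_flops = 2 * mlp_params            # ~2 FLOPs per weight -> O(1) in N
    mlp_bytes = mlp_params * bytes_per_param

    # Attention: each new token reads the whole KV cache and dots against it.
    kv_elems = n_layers * 2 * hidden * context_len
    attn_flops = 2 * kv_elems             # QK^T and AV -> O(N)
    kv_bytes = kv_elems * bytes_per_param

    return {"mlp_flops": mlp_flops, "mlp_bytes": mlp_bytes,
            "attn_flops": attn_flops, "kv_bytes": kv_bytes}

# Illustrative 3B-class dimensions: attention costs grow 8x from 1k to 8k
# context, while MLP costs stay flat.
for n in (1024, 8192):
    print(n, per_token_decode_cost(hidden=2048, n_layers=36,
                                   ffn_mult=4, context_len=n))
```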
With this in mind, we ran targeted ablations.
MLP-only models (attention removed)
For small models, the MLP layers may not saturate compute even with more tokens per batch. We isolated the impact of the MLP by removing the attention block from the model.

As shown in the figure above, the gains are modest and vanish quickly. As model size or context length increases, a single engine already saturates the compute (more FLOPs per token in larger MLPs, more tokens with longer sequences). Once an engine is compute-bound, running two saturated engines adds almost no benefit: 1 + 1 <= 1.
Attention-only models (MLP removed)
After seeing limited gains from the MLP, we took Qwen2.5-3B and measured the attention-only setup analogously.


The results were striking:
- Attention-only workloads show significantly larger MPS gains than the full model, for both prefill and decode.
- For decode, the gains diminish linearly with context length, which matches our expectation: in the decode stage, the resource requirements of attention grow with context length.
- For prefill, the gains drop more rapidly than for decode.
Does the MPS gain come purely from attention, or is there some attention-MLP overlapping effect? To test this, we calculated a Full Model Expected Gain as a weighted average of the attention-only and MLP-only gains, with the weights being their contributions to the wall time. This Full Model Expected Gain captures gains purely from Attn-Attn and MLP-MLP overlaps; it does not account for Attn-MLP overlap.
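Concretely, the expected gain is just a wall-time-weighted average; the sketch below shows the calculation with placeholder numbers, not our measurements:

```python
# Full Model Expected Gain: wall-time-weighted average of the attention-only
# and MLP-only MPS gains. Only Attn-Attn and MLP-MLP overlap is represented;
# Attn-MLP overlap is deliberately excluded.
def expected_full_model_gain(gain_attn: float, gain_mlp: float,
                             t_attn: float, t_mlp: float) -> float:
    w_attn = t_attn / (t_attn + t_mlp)   # fraction of wall time in attention
    w_mlp = 1.0 - w_attn
    return w_attn * gain_attn + w_mlp * gain_mlp

# Placeholder example: attention-only +60%, MLP-only +10%,
# attention takes 40% of the wall time -> expected full-model gain of +30%.
print(expected_full_model_gain(0.60, 0.10, t_attn=0.4, t_mlp=0.6))  # 0.30
```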
For decode workloads, the Full Model Expected Gain is slightly higher than the actual gain, which indicates a limited impact from Attn-MLP overlap. For prefill workloads, the actual full-model gain is much lower than the expected gain at sequence length 128; a hypothetical explanation is that there are fewer opportunities for the unsaturated attention kernel to be overlapped, because the other engine spends a significant fraction of its time in saturated MLP kernels. Therefore, the majority of the MPS gain comes from two engines whose attention kernels are unsaturated.
Bonus Benefit: Recovering GPU Time Lost to CPU Overhead
The ablations above focused on GPU-bound workloads, but the most severe form of underutilization happens when the GPU sits idle waiting for CPU work, such as scheduling, tokenization, or image preprocessing in multimodal models.
In a single-engine setup, these CPU stalls directly waste GPU cycles. With MPS, a second engine can take over the GPU whenever the first is blocked on the CPU, turning dead time into productive compute.
To isolate this effect, we deliberately chose a regime where the earlier GPU-level gains had vanished: Gemma-4B (a size and context length where attention and MLP are already well-saturated, so kernel-overlap benefits are minimal).

At a latency of 8s, the baseline single engine (blue) is limited by scheduler CPU overhead, which can be lifted either by enabling asynchronous scheduling in vLLM (green line, +33% throughput) or by running two engines with MPS without asynchronous scheduling (yellow line, +35% throughput). This near-identical gain confirms that, in CPU-constrained scenarios, MPS reclaims essentially the same idle GPU time that async scheduling eliminates. MPS can still be useful because vanilla vLLM v1.0 retains CPU overhead in the scheduler layer where optimizations like asynchronous scheduling are not fully available.
A Bullet, Not a Silver Bullet
Based on our experiments, MPS can yield significant gains for small-model inference in a few operating zones:
- Engines with significant CPU overhead
- Very small language models (≤3B parameters) with short-to-medium context (<2k tokens)
- Very small language models (<3B) in prefill-heavy workloads
Outside of these sweet spots (e.g., 7B+ models, long contexts >8k, or already compute-bound workloads), MPS cannot easily capture GPU-level benefits.
However, MPS also introduced operational complexity:
- Extra moving parts: the MPS daemon, client environment setup, and a router/load-balancer to split traffic across engines (a minimal example of such a router is sketched after this list)
- Increased debugging complexity: no isolation between engines → a memory leak or OOM in one engine can corrupt or kill all others sharing the GPU
- Monitoring burden: we now have to monitor daemon health, client connection state, inter-engine load balance, etc.
- Fragile failure modes: because all engines share a single CUDA context and MPS daemon, a single misbehaving client can corrupt or starve the entire GPU, directly affecting every co-located engine
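To give a sense of what that extra routing layer looks like, here is a minimal round-robin proxy sketch. It is purely illustrative, assumes two OpenAI-compatible engines on hypothetical local ports, and uses aiohttp rather than anything from our production stack:

```python
import itertools
from aiohttp import ClientSession, web

# Illustrative round-robin router: forwards OpenAI-style completion requests
# to the two co-located MPS engines.
ENGINES = itertools.cycle(["http://localhost:8000", "http://localhost:8001"])

async def proxy(request: web.Request) -> web.Response:
    backend = next(ENGINES)                # pick the next engine in rotation
    payload = await request.read()
    async with ClientSession() as session:
        async with session.post(backend + request.path_qs, data=payload,
                                headers={"Content-Type": "application/json"}) as r:
            body = await r.read()
            return web.Response(body=body, status=r.status,
                                content_type="application/json")

app = web.Application()
app.router.add_post("/v1/completions", proxy)
app.router.add_post("/v1/chat/completions", proxy)

if __name__ == "__main__":
    web.run_app(app, port=9000)
```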
In short: MPS is a sharp, specialized tool, extremely effective in the narrow regimes described above but rarely a general-purpose win. We really enjoyed pushing the boundaries of GPU sharing and figuring out where the real performance cliffs are. There is still a huge amount of untapped performance and cost-efficiency across the entire inference stack. If you're excited about distributed serving systems, or about making LLMs run 10× cheaper in production, we're hiring!
Author: Xiaotong Jiang
