Wednesday, February 4, 2026

Tencent Hunyuan Releases HPC-Ops: A High-Performance LLM Inference Operator Library


Tencent Hunyuan has open-sourced HPC-Ops, a production-grade operator library for large language model inference on NVIDIA SM90 architecture devices. HPC-Ops focuses on low-level CUDA kernels for core operators such as Attention, Grouped GEMM, and Fused MoE, and exposes them through a compact C++ and Python API for integration into existing inference stacks.

HPC-Ops runs at large scale in Tencent's internal services. In these deployments it delivers about a 30% queries-per-minute (QPM) improvement for Tencent-HY models and about a 17% improvement for DeepSeek models on mainstream inference cards. These gains are reported at the service level, so they reflect the cumulative effect of faster kernels inside a real inference pipeline.

Scope and design of HPC-Ops

HPC-Ops is a production-grade, high-performance, easy-to-use operator library for LLM inference, developed by the Tencent Hunyuan AI Infra team. The project does not try to replace serving frameworks. Instead it provides kernels and clean APIs that can be called from systems that already handle scheduling, KV cache management, batching, and transport.

The API is designed for seamless use inside popular inference frameworks such as vLLM and SGLang. This means a framework team can swap in HPC-Ops kernels behind their own abstractions without changing the external behavior of their servers.
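The sketch below shows what such a swap might look like from the framework side. The module name `hpc_ops` and the function `paged_attention_decode` are illustrative assumptions, not the library's confirmed API; a real integration should follow the signatures documented in the repo.

```python
# Hypothetical sketch of routing a framework's decode attention call to an
# HPC-Ops kernel, with a naive fallback so external behavior is unchanged.
import torch

try:
    import hpc_ops  # assumed package name, not confirmed by the repo
    HAVE_HPC_OPS = True
except ImportError:
    HAVE_HPC_OPS = False

def decode_attention(q, k_cache, v_cache, block_tables, seq_lens):
    """Dispatch one-token-per-sequence decode attention."""
    if HAVE_HPC_OPS:
        # Assumed signature: query tokens against a paged KV cache.
        return hpc_ops.paged_attention_decode(
            q, k_cache, v_cache, block_tables, seq_lens)
    # Reference fallback: naive, unpaged attention (ignores block_tables
    # and seq_lens for brevity). q: (b, h, d), k/v_cache: (b, s, h, d).
    scores = torch.einsum("bhd,bshd->bhs", q, k_cache) / q.shape[-1] ** 0.5
    probs = torch.softmax(scores, dim=-1)
    return torch.einsum("bhs,bshd->bhd", probs, v_cache)
```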

HPC-Ops uses C++ and CUDA with CuTe and CUTLASS as building blocks. Kernels are written as relatively small examples that also serve as a modern CUDA tutorial.

Kernel performance characteristics

The project publishes maximum observed speedup numbers for each operator relative to established baselines. These are microbenchmarks, and the research team stresses that performance varies across shapes and workloads, but they show the optimization ceiling.

For Attention in bf16, compared with FlashInfer, FlashAttention-2, FlashAttention-3, and TensorRT-LLM, HPC-Ops reports up to 1.33x speedup in prefill and up to 2.22x in decode. For Attention in fp8, compared with FlashInfer, FlashAttention-3, and TensorRT-LLM, it reports up to 1.12x in prefill and up to 2.0x in decode.

For FusedMoE fp8, compared with TensorRT-LLM and vLLM, the maximum observed speedup is up to 1.49x in prefill and 1.14x in decode. For GroupGEMM fp8, compared with DeepGEMM, the reported gains are up to 1.1x in prefill and 1.88x in decode.
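For context, "maximum observed speedup" figures like these usually come from timing a single operator shape in isolation and taking the best ratio across the shapes tested. A minimal sketch of that kind of harness, using plain `torch.matmul` as a stand-in baseline (the HPC-Ops call would replace it):

```python
# Minimal per-operator microbenchmark: time one shape with CUDA events.
import torch

def bench(fn, *args, warmup=10, iters=100):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    for _ in range(warmup):   # warm up caches and JIT paths
        fn(*args)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per call

a = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
baseline_ms = bench(torch.matmul, a, b)
# speedup = baseline_ms / candidate_ms, computed per shape; the published
# numbers are the maximum of this ratio over the shapes tested.
```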

These numbers matter because decode is usually the latency bottleneck in autoregressive generation, where batch sizes shrink and memory traffic dominates. The fact that Attention and GroupGEMM show the largest relative gains in decode suggests that HPC-Ops focuses on the part of the pipeline that most users notice.
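A back-of-envelope calculation shows why decode is memory bound: every new token must stream the entire KV cache, so bytes moved dwarf the arithmetic performed. The model shape below is illustrative, not a specific Tencent-HY or DeepSeek configuration.

```python
# Why decode is memory bound: KV cache bytes read per generated token.
layers, kv_heads, head_dim = 60, 8, 128   # assumed GQA-style model shape
seq_len, dtype_bytes = 32_768, 2          # 32k context, bf16

# Factor of 2 covers both the K and the V cache.
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes
print(f"KV cache read per decoded token: {kv_bytes_per_token / 1e9:.1f} GB")
# ~8.1 GB per token here: at roughly 1 TB/s of usable bandwidth that is
# ~8 ms of pure memory traffic, so faster decode kernels show up directly
# in end-to-end latency.
```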

Supported kernels and precision

The current release groups its functionality into three operator families:

  • Attention kernels cover both prefill and decode and include support for paged attention. Paged attention is the memory layout that frameworks like vLLM use to place key and value cache blocks in a paged structure, which improves memory reuse for long sequences (a minimal layout sketch follows this list).
  • Grouped GEMM is implemented as quantized GroupGEMM with fp8 weights. HPC-Ops supports block-wise and per-tensor scaling, so teams can trade off quantization granularity against parameter storage and calibration cost.
  • Fused MoE combines mixture-of-experts routing and expert computation in a single quantized operator. It also uses fp8 expert weights and supports block-wise and per-tensor scaling strategies.
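The following sketch shows the paged KV cache layout that a paged attention kernel consumes: the cache is a pool of fixed-size blocks, and a per-sequence block table maps logical token positions to physical blocks. All shapes are illustrative, not HPC-Ops defaults.

```python
# Minimal paged KV cache: a block pool plus a per-sequence block table.
import torch

num_blocks, block_size, kv_heads, head_dim = 1024, 16, 8, 128
k_cache = torch.zeros(num_blocks, block_size, kv_heads, head_dim)
v_cache = torch.zeros_like(k_cache)

# Block table for one sequence: logical block i lives at physical index
# block_table[i]. The sequence currently spans three blocks.
block_table = torch.tensor([3, 17, 9])

def read_kv(pos):
    """Fetch K and V for one token position via the block table."""
    logical_block, offset = divmod(pos, block_size)
    phys = block_table[logical_block].item()
    return k_cache[phys, offset], v_cache[phys, offset]

# Token 37 -> logical block 2, offset 5 -> physical block 9.
k, v = read_kv(pos=37)
```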

Across these kernels, HPC-Ops provides native support for bf16 and fp8 data types. That matches the current production trend of moving inference toward lower-precision formats that preserve accuracy while reducing memory bandwidth and improving tensor core utilization.
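The scaling choice mentioned above is the key lever in fp8 quantization. A minimal sketch of the contrast, assuming a 128x128 block granularity (a common choice, but not confirmed as the HPC-Ops default); rounding to the actual fp8 grid is omitted, so this models only the scaling step.

```python
# Per-tensor vs block-wise fp8 scaling: one global scale is cheap to store,
# but a scale per tile tracks local dynamic range much better.
import torch

w = torch.randn(1024, 1024) * torch.logspace(-3, 0, 1024)  # uneven columns
FP8_MAX = 448.0  # largest representable value of float8_e4m3fn

# Per-tensor: a single scale; outliers crush small values toward zero.
s_tensor = w.abs().max() / FP8_MAX
q_tensor = (w / s_tensor).clamp(-FP8_MAX, FP8_MAX)

# Block-wise: one scale per 128x128 tile preserves local precision.
B = 128
blocks = w.reshape(1024 // B, B, 1024 // B, B)
s_block = blocks.abs().amax(dim=(1, 3), keepdim=True) / FP8_MAX
q_block = (blocks / s_block).clamp(-FP8_MAX, FP8_MAX)

# Dequantize with matching scales and compare reconstruction error.
err_tensor = (q_tensor * s_tensor - w).abs().mean()
err_block = (q_block * s_block).reshape(1024, 1024).sub(w).abs().mean()
print(f"per-tensor err {err_tensor:.5f} vs block-wise err {err_block:.5f}")
```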

Key Takeaways

  • Tencent Hunyuan open-sourced HPC-Ops as a production-grade operator library for LLM inference on NVIDIA SM90 GPUs, including H20, with C++ and CUDA kernels built on CuTe and CUTLASS.
  • In production deployments HPC-Ops reports about a 30% QPM gain for Tencent-HY models and about a 17% QPM gain for DeepSeek models on mainstream inference cards.
  • Operator microbenchmarks show maximum speedups of up to 2.22x for bf16 Attention decode, up to 2.0x for fp8 Attention decode, up to 1.49x for fp8 FusedMoE prefill, and up to 1.88x for fp8 GroupGEMM decode, compared with strong baselines such as FlashInfer, FlashAttention, TensorRT-LLM, and DeepGEMM.
  • The library focuses on three operator families: Attention with paged attention support, quantized GroupGEMM with fp8 weights, and quantized Fused MoE with fp8 expert weights, with both block-wise and per-tensor scaling, and native bf16 plus fp8 precision support.
  • HPC-Ops is designed as an operator layer that integrates into existing inference frameworks such as vLLM and SGLang, and the roadmap targets sparse attention for long-context LLMs, extended quantization including 4-bit and 8-bit strategies, and kernels that better overlap computation with multi-GPU communication.

Check out the Repo here.


Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.
