Managed Tiered KV Cache and Clever Routing for Amazon SageMaker HyperPod

December 2, 2025

64

Managed Tiered KV Cache and Clever Routing for Amazon SageMaker HyperPod — ML 19892 1 630x630

Fashionable AI functions demand quick, cost-effective responses from giant language fashions, particularly when dealing with lengthy paperwork or prolonged conversations. Nonetheless, LLM inference can turn out to be prohibitively gradual and costly as context size will increase, with latency rising exponentially and prices mounting with every interplay.

LLM inference requires recalculating consideration mechanisms for the earlier tokens when producing every new token. This creates vital computational overhead and excessive latency for lengthy sequences. Key-value (KV) caching addresses this bottleneck by storing and reusing key-value vectors from earlier computations, decreasing inference latency and time-to-first-token (TTFT). Clever routing in LLMs is a way that sends requests with shared prompts to the identical inference occasion to maximise the effectivity of the KV cache. It routes a brand new request to an occasion that has already processed the identical prefix, permitting it to reuse the cached KV information to speed up processing and scale back latency. Nonetheless, clients have informed us that establishing and configuring the precise framework for KV caching and clever routing at manufacturing scale is difficult and takes lengthy experimental cycles.

At present we’re excited to announce that Amazon SageMaker HyperPod now helps Managed Tiered KV Cache and Clever Routing capabilities by the HyperPod Inference Operator. These new capabilities can ship vital efficiency enhancements for LLM inference workloads by decreasing time to first token (TTFT) by as much as 40%, rising throughput, and reducing compute prices by as much as 25% when used for lengthy context prompts and multi-turn chat conversations utilizing our inside instruments. These capabilities can be found to be used with the HyperPod Inference Operator, which routinely manages the routing and distributed KV caching infrastructure, considerably decreasing operational overhead whereas delivering enterprise-grade efficiency for manufacturing LLM deployments. By utilizing the brand new Managed Tiered KV Cache function you’ll be able to effectively offload consideration caches to CPU reminiscence (L1 cache) and distribute L2 cache for cross-instance sharing by a tiered storage structure in HyperPod for optimum useful resource utilization and price effectivity at scale.

Environment friendly KV caching mixed with clever routing maximizes cache hits throughout staff so you’ll be able to obtain larger throughput and decrease prices on your mannequin deployments. These options are significantly helpful in functions which might be processing lengthy paperwork the place the identical context or prefix is referenced, or in multi-turn conversations the place context from earlier exchanges must be maintained effectively throughout a number of interactions.

For instance, authorized groups analyzing 200 web page contracts can now obtain prompt solutions to follow-up questions as an alternative of ready 5+ seconds per question, healthcare chatbots preserve pure dialog stream throughout 20+ flip affected person dialogues, and customer support programs course of hundreds of thousands of each day requests with each higher efficiency and decrease infrastructure prices. These optimizations make doc evaluation, multi-turn conversations, and high-throughput inference functions economically viable at enterprise scale.

Optimizing LLM inference with Managed Tiered KV Cache and Clever Routing

Let’s break down the brand new options:

Managed Tiered KV Cache: Computerized administration of consideration states throughout CPU reminiscence (L1) and distributed tiered storage (L2) with configurable cache sizes and eviction insurance policies. SageMaker HyperPod handles the distributed cache infrastructure by the newly launched tiered storage, assuaging operational overhead for cross node cache sharing throughout clusters. KV cache entries are accessible cluster-wide (L2) so {that a} node can profit from computations carried out by different nodes.
Clever Routing: Configurable request routing to maximise cache hits utilizing methods like prefix-aware, KV-aware, and round-robin routing.
Observability: Constructed-in HyperPod Observability integration for observability of metrics and logs for Managed Tiered KV Cache and Clever Routing in Amazon Managed Grafana.

Pattern stream for inference requests with KV caching and Clever Routing

As a consumer sends an inference request to HyperPod Load Balancer, it forwards the request to the Clever Router inside the HyperPod cluster. The Clever Router dynamically distributes requests to essentially the most acceptable mode pod (Occasion A or Occasion B) based mostly on the routing technique to maximise KV cache hit and reduce inference latency. Because the request reaches the mannequin pod, the pod first checks L1 cache (CPU) for ceaselessly used key-value pairs, then queries the shared L2 cache (Managed Tiered KV Cache) if wanted, earlier than performing full computation of the token. Newly generated KV pairs are saved in each cache tiers for future reuse. After computation completes, the inference consequence flows again by the Clever Router and Load Balancer to the consumer.

Managed Tiered KV Cache

Managed Tiered KV Cache and Clever Routing are configurable opt-in options. When enabling Managed KV Cache, L1 cache is enabled by default, whereas each L1 and L2 cache might be configured to be enabled or disabled. The L1 cache resides regionally on every inference node using CPU reminiscence. This native cache supplies considerably quick entry, making it splendid for ceaselessly accessed information inside a single mannequin occasion. The cache routinely manages reminiscence allocation and eviction insurance policies to optimize for essentially the most priceless cached content material. The L2 cache operates as a distributed cache layer spanning the whole cluster, enabling cache sharing throughout a number of mannequin situations. We help two backend choices for L2 cache, every with the next advantages:

Managed Tiered KV Cache (Really helpful): A HyperPod disaggregated reminiscence resolution that gives wonderful scalability to Terabyte swimming pools, low latency, AWS community optimized, GPU-aware design with zero-copy help, and price effectivity at scale.
Redis: Easy to arrange, works properly for small to medium workloads, and presents a wealthy surroundings of instruments and integrations.

The 2-tier structure works collectively seamlessly. When a request arrives, the system first checks the L1 cache for the required KV pairs. If discovered, they’re used instantly with minimal latency. If not present in L1, the system queries the L2 cache. If discovered there, the information is retrieved and optionally promoted to L1 for quicker future entry. Provided that the information shouldn’t be current in both cache does the system carry out the total computation, storing the ends in each L1 and L2 for future reuse.

Clever Routing

Our Clever Routing system presents 4 configurable methods to optimize request distribution based mostly in your workload traits, with the routing technique being user-configurable at deployment time to match your software’s particular necessities.

Prefix-aware routing serves because the default technique, sustaining a tree construction to trace which prefixes are cached on which endpoints, delivering sturdy general-purpose efficiency for functions with widespread immediate templates comparable to multi-turn conversations, customer support bots with commonplace greetings, and code technology with widespread imports.
KV-aware routing supplies essentially the most subtle cache administration by a centralized controller that tracks cache areas and handles eviction occasions in real-time, excelling at lengthy dialog threads, doc processing workflows, and prolonged coding classes the place most cache effectivity is important.
Spherical-robin routing presents essentially the most simple method, distributing requests evenly throughout the obtainable staff, greatest fitted to situations the place requests are unbiased, comparable to batch inference jobs, stateless API calls, and cargo testing situations.

Technique	Finest for
Prefix-aware routing (default)	Multi-turn conversations, customer support bots, code technology with widespread headers
KV-aware routing	Lengthy conversations, doc processing, prolonged coding classes
Spherical-robin routing	Batch inference, stateless API calls, load testing

Deploying the Managed Tiered KV Cache and Clever Routing resolution

Conditions

Create a HyperPod cluster with Amazon EKS as an orchestrator.

In Amazon SageMaker AI console, navigate to HyperPod Clusters, then Cluster Administration.
On the Cluster Administration web page, choose Create HyperPod cluster, then Orchestrated by Amazon EKS.
You need to use one-click deployment from the SageMaker AI console. For cluster arrange particulars see Making a SageMaker HyperPod cluster with Amazon EKS orchestration.
Confirm that the HyperPod cluster standing is InService.

Confirm that the inference operator is up and working. The Inference add-on is put in as a default possibility whenever you create the HyperPod cluster from the console. If you wish to use an present EKS cluster, see Establishing your HyperPod clusters for mannequin deployment to manually set up the inference operator.

From the command line, run the next command:

kubectl get pods -n hyperpod-inference-system

Output:

hyperpod-inference-operator-conroller-manager-xxxxxx pod is in working state in namespace hyperpod-inference-system

Or, confirm that the operator is working from console. Navigate to EKS cluster, Assets, Pods, Decide namespace, hyperpod-inference-system.

Making ready your mannequin deployment manifest recordsdata

You possibly can allow these options by including configurations to your InferenceEndpointConfig customized CRD file.

For the whole instance, go to the AWS samples GitHub repository.

export MODEL_NAME="Llama-3.1-8B-Instruct"
export INSTANCE_TYPE="ml.g5.24xlarge"
export MODEL_IMAGE="public.ecr.aws/deep-learning-containers/vllm:0.11.1-gpu-py312-cu129-ubuntu22.04-ec2-v1.0"
export S3_BUCKET="my-model-bucket"
export S3_MODEL_PATH="fashions/Llama-3.1-8B-Instruct"
export AWS_REGION="us-west-2"
export CERT_S3_URI="s3://my-bucket/certs/"
export NAMESPACE="default"
export NAME="demo"

cat << EOF > inference_endpoint_config.yaml
apiVersion: inference.sagemaker.aws.amazon.com/v1
variety: InferenceEndpointConfig
metadata:
  title: ${NAME}
  namespace: ${NAMESPACE}
spec:
  modelName: ${MODEL_NAME}
  instanceType: ${INSTANCE_TYPE}
  replicas: 1
  invocationEndpoint: v1/chat/completions
  modelSourceConfig:
    modelSourceType: s3
    s3Storage:
      bucketName: ${S3_BUCKET}
      area: ${AWS_REGION}
    modelLocation: ${S3_MODEL_PATH}
    prefetchEnabled: false
  kvCacheSpec:
    enableL1Cache: true
    enableL2Cache: true
    l2CacheSpec:
      l2CacheBackend: "tieredstorage" # can be "redis"
      # Set l2CacheLocalUrl if deciding on "redis"
      # l2CacheLocalUrl: "redis:redisdefaultsvcclusterlocal:6379"
  intelligentRoutingSpec:
    enabled: true
    routingStrategy: prefixaware
  tlsConfig:
    tlsCertificateOutputS3Uri: ${CERT_S3_URI}
  metrics:
    enabled: true
    modelMetrics:
      port: 8000
  loadBalancer:
    healthCheckPath: /well being
  employee:
    sources:
      limits:
        nvidia.com/gpu: "4"
      requests:
        cpu: "6"
        reminiscence: 30Gi
        nvidia.com/gpu: "4"
    picture: ${MODEL_IMAGE}
    args:
      - "--model"
      - "/choose/ml/mannequin"
      - "--max-model-len"
      - "20000"
      - "--tensor-parallel-size"
      - "4"
    modelInvocationPort:
      containerPort: 8000
      title: http
    modelVolumeMount:
      title: model-weights
      mountPath: /choose/ml/mannequin
    environmentVariables:
      - title: OPTION_ROLLING_BATCH
        worth: "vllm"
      - title: SAGEMAKER_SUBMIT_DIRECTORY
        worth: "/choose/ml/mannequin/code"
      - title: MODEL_CACHE_ROOT
        worth: "/choose/ml/mannequin"
      - title: SAGEMAKER_MODEL_SERVER_WORKERS
        worth: "1"
      - title: SAGEMAKER_MODEL_SERVER_TIMEOUT
        worth: "3600"
EOF

kubectl apply -f inference_endpoint_config.yaml

# Verify inferenceendpointconfig standing
kubectl get inferenceendpointconfig ${NAME} -n ${NAMESPACE}
NAME  AGE
demo  8s

# Verify pods standing - you must see employee pods
kubectl get pods -n ${NAMESPACE}
NAME                    READY   STATUS    RESTARTS        AGE
demo-675886c7bb-7bhhg   3/3     Working   0               30s

# Router pods are beneath hyperpod-inference-system namespace
kubectl get pods -n hyperpod-inference-system
NAME                                                             READY   STATUS    RESTARTS   AGE
hyperpod-inference-operator-controller-manager-dff64b947-m5nqk   1/1     Working   0          5h49m
demo-default-router-8787cf46c-jmgqd                              2/2     Working   0          2m16s

Observability

You possibly can monitor Managed KV Cache and Clever Routing metrics by the SageMaker HyperPod Observability options. For extra data, see Speed up basis mannequin improvement with one-click observability in Amazon SageMaker HyperPod.

KV Cache Metrics can be found within the Inference dashboard.

Benchmarking

We carried out complete benchmarking to validate real-world efficiency enhancements for manufacturing LLM deployments. Our benchmarks have been run with Managed Tiered KV Cache and Clever Routing function utilizing the Llama-3.1-70B-Instruct mannequin deployed throughout 7 replicas on p5.48xlarge situations (every geared up with eight NVIDIA GPUs), beneath a steady-load site visitors sample. The benchmark surroundings used a devoted shopper node group—with one c5.12xlarge occasion per 100 concurrent requests to generate a managed load, and a devoted server node group, ensuring mannequin servers operated in isolation to assist forestall useful resource competition beneath excessive concurrency.

Our benchmarks display {that a} mixture of L1 and L2 Managed Tiered KV Cache and Clever Routing delivers substantial efficiency enhancements throughout a number of dimensions. For medium context situations (8k tokens), we noticed a 40% discount in time to first token (TTFT) at P90, 72% discount at P50, 24% improve in throughput, and 21% price discount in comparison with baseline configurations with out optimization. The advantages are much more pronounced for lengthy context workloads (64K tokens), attaining a 35% discount in TTFT at P90, 94% discount at P50, 38% throughput improve, and 28% price financial savings. The optimization advantages scale dramatically with context size. Whereas 8K token situations display strong enhancements throughout the metrics, 64K token workloads expertise transformative good points that basically change the consumer expertise. Our testing additionally confirmed that AWS-managed tiered storage constantly outperformed Redis-based L2 caching throughout the situations. The tiered storage backend delivered higher latency and throughput with out requiring the operational overhead of managing separate Redis infrastructure, making it the advisable selection for many deployments. Lastly, not like conventional efficiency optimizations that require tradeoffs between price and velocity, this resolution delivers each concurrently.

TTFT (P90)

TTFT (P50)

Throughput (TPS)

Value/1000 token ($)

Conclusion

Managed Tiered KV Cache and Clever Routing in Amazon SageMaker HyperPod Mannequin Deployment provide help to optimize LLM inference efficiency and prices by environment friendly reminiscence administration and sensible request routing. You will get began at this time by including these configurations to your HyperPod mannequin deployments in the AWS Areas the place SageMaker HyperPod is out there.

To be taught extra, go to the Amazon SageMaker HyperPod documentation or observe the mannequin deployment getting began information.

In regards to the authors

Chaitanya Hazarey is the Software program Improvement Supervisor for SageMaker HyperPod Inference at Amazon, bringing in depth experience in full-stack engineering, ML/AI, and information science. As a passionate advocate for accountable AI improvement, he combines technical management with a deep dedication to advancing AI capabilities whereas sustaining moral concerns. His complete understanding of recent product improvement drives innovation in machine studying infrastructure.

Pradeep Cruz is a Senior SDM at Amazon Net Providers (AWS), driving AI infrastructure and functions at enterprise scale. Main cross-functional organizations at Amazon SageMaker AI, he has constructed and scaled a number of high-impact providers for enterprise clients together with SageMaker HyperPod-EKS Inference, Activity Governance, Characteristic Retailer, AIOps, and JumpStart Mannequin Hub at AWS, alongside enterprise AI platforms at T-Cell and Ericsson. His technical depth spans distributed programs, GenAI/ML, Kubernetes, cloud computing, and full-stack software program improvement.

Vinay Arora is a Specialist Answer Architect for Generative AI at AWS, the place he collaborates with clients in designing cutting-edge AI options leveraging AWS applied sciences. Previous to AWS, Vinay has over 20 years of expertise in finance—together with roles at banks and hedge funds—he has constructed danger fashions, buying and selling programs, and market information platforms. Vinay holds a grasp’s diploma in pc science and enterprise administration.

Piyush Daftary is a Senior Software program Engineer at AWS, engaged on Amazon SageMaker with a give attention to constructing performant, scalable inference programs for big language fashions. His technical pursuits span AI/ML, databases, and search applied sciences, the place he makes a speciality of creating production-ready options that allow environment friendly mannequin deployment and inference at scale. His work entails optimizing system efficiency, implementing clever routing mechanisms, and designing architectures that help each analysis and manufacturing workloads, with a ardour for fixing complicated distributed programs challenges and making superior AI capabilities extra accessible to builders and organizations. Outdoors of labor, he enjoys touring, climbing, and spending time with household.

Ziwen Ning is a Senior Software program Improvement Engineer at AWS, at present engaged on SageMaker Hyperpod Inference with a give attention to constructing scalable infrastructure for large-scale AI mannequin inference. His technical experience spans container applied sciences, Kubernetes orchestration, and ML infrastructure, developed by in depth work throughout the AWS ecosystem. He has deep expertise in container registries and distribution, container runtime improvement and open supply contributions, and containerizing ML workloads with customized useful resource administration and monitoring. Ziwen is enthusiastic about designing production-grade programs that make superior AI capabilities extra accessible. In his free time, he enjoys kickboxing, badminton, and immersing himself in music.

Roman Blagovirnyy is a Sr. Consumer Expertise Designer on the SageMaker AI group with 19 years of various expertise in interactive, workflow, and UI design, engaged on enterprise and B2B functions and options for the finance, healthcare, safety, and HR industries previous to becoming a member of Amazon. At AWS Roman was a key contributor to the design of SageMaker AI Studio, SageMaker Studio Lab, information and mannequin governance capabilities, and HyperPod. Roman’s at present works on new options and enhancements to the administrator expertise for HyperPod. Along with this, Roman has a eager curiosity in design operations and course of.

Caesar Chen is the Software program Improvement Supervisor for SageMaker HyperPod at AWS, the place he leads the event of cutting-edge machine studying infrastructure. With in depth expertise in constructing production-grade ML programs, he drives technical innovation whereas fostering group excellence. His work in scalable mannequin internet hosting infrastructure empowers information scientists and ML engineers to deploy and handle fashions with larger effectivity and reliability.

Chandra Lohit Reddy Tekulapally is a Software program Improvement Engineer with the Amazon SageMaker HyperPod group. He’s enthusiastic about designing and constructing dependable, high-performance distributed programs that energy large-scale AI workloads. Outdoors of labor, he enjoys touring and exploring new espresso spots.

Kunal Jha is a Principal Product Supervisor at AWS. He’s targeted on constructing Amazon SageMaker Hyperpod because the best-in-class selection for Generative AI mannequin’s coaching and inference. In his spare time, Kunal enjoys snowboarding and exploring the Pacific Northwest.

Vivek Gangasani is a Worldwide Lead GenAI Specialist Options Architect for SageMaker Inference. He drives Go-to-Market (GTM) and Outbound Product technique for SageMaker Inference. He additionally helps enterprises and startups deploy, handle, and scale their GenAI fashions with SageMaker and GPUs. At the moment, he’s targeted on creating methods and content material for optimizing inference efficiency and GPU effectivity for internet hosting Giant Language Fashions. In his free time, Vivek enjoys climbing, watching films, and making an attempt completely different cuisines.

Managed Tiered KV Cache and Clever Routing for Amazon SageMaker HyperPod

Optimizing LLM inference with Managed Tiered KV Cache and Clever Routing

Pattern stream for inference requests with KV caching and Clever Routing

Managed Tiered KV Cache

Clever Routing

Deploying the Managed Tiered KV Cache and Clever Routing resolution

Conditions

Making ready your mannequin deployment manifest recordsdata

Observability

Benchmarking

Conclusion

In regards to the authors

Related Articles

High 10 MCP Servers for AI Builders in 2026

Enterprise AI for Manufacturing & Logistics

Ugee U1200 / U1600 Evaluation:- A small, cute & inexpensive pen show to take wherever |

LEAVE A REPLY Cancel reply

Latest Articles

High 10 MCP Servers for AI Builders in 2026

Enterprise AI for Manufacturing & Logistics

Ugee U1200 / U1600 Evaluation:- A small, cute & inexpensive pen show to take wherever |

How Software program Testing Prepares College students for Actual-World Tech Careers

What’s In a Title? Mainframe GDGs Get the Job Accomplished