Monday, December 1, 2025

Reduce cost and latency for AI using Amazon ElastiCache as a semantic cache with Amazon Bedrock


Large language models (LLMs) are the foundation for generative AI and agentic AI applications that power many use cases, from chatbots and search assistants to code generation tools and recommendation engines. As we have seen with growing database workloads, the increasing use of AI applications in production is driving customers to seek ways to optimize cost and performance. Most AI applications invoke the LLM for every user query, even when queries are repeated or very similar. For example, consider an IT support chatbot where thousands of customers ask the same question, invoking the LLM to regenerate the same answer from a shared enterprise knowledge base. Semantic caching is a technique to reduce cost and latency in generative AI applications by reusing responses for identical or semantically similar requests using vector embeddings. As detailed in the Impact section of this post, our experiments with semantic caching reduced LLM inference cost by up to 86 percent and improved average end-to-end latency for queries by up to 88 percent.

This post shows how to build a semantic cache using vector search on Amazon ElastiCache for Valkey. At the time of writing, Amazon ElastiCache for Valkey delivers the lowest-latency vector search with the highest throughput and best price-performance at a 95%+ recall rate among popular vector databases on AWS. This post addresses common questions such as:

  1. What is a semantic cache and how does it reduce the cost and latency of generative AI applications?
  2. Why is ElastiCache well suited as a semantic cache for Amazon Bedrock?
  3. What real-world accuracy, cost, and latency savings can you achieve with a semantic cache?
  4. How do you set up a semantic cache with ElastiCache and Amazon Bedrock AgentCore?
  5. What are some key considerations and best practices for semantic caching?

Overview of semantic caching

Unlike traditional caches that rely on exact string matches, a semantic cache retrieves data based on semantic similarity. A semantic cache uses vector embeddings produced by models like Amazon Titan Text Embeddings to capture semantic meaning in a high-dimensional vector space. In generative AI applications, a semantic cache stores vector representations of queries and their corresponding responses. The system compares the vector embedding of each new query against cached vectors of prior queries to see if a similar query has been answered before. If the cache contains a similar query, the system returns the previously generated response instead of invoking the LLM again. Otherwise, the system invokes an LLM to generate a response and caches the query embedding and response together for future reuse.
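
To make this flow concrete, the following minimal sketch (separate from the solution code later in this post) illustrates the hit-or-miss decision using plain cosine similarity over embeddings; embed_fn, generate_fn, and the in-memory cache list are hypothetical placeholders for an embedding model, an LLM call, and a vector store:

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer_query(query, cache, embed_fn, generate_fn, threshold=0.8):
    query_vec = embed_fn(query)                       # embed the incoming query
    best = max(cache, key=lambda entry: cosine_similarity(query_vec, entry[0]),
               default=None)
    if best is not None and cosine_similarity(query_vec, best[0]) >= threshold:
        return best[1]                                # cache hit: reuse the stored response
    response = generate_fn(query)                     # cache miss: invoke the LLM
    cache.append((query_vec, response))               # store the pair for future reuse
    return response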

This approach is particularly effective for generative AI applications that handle repeated queries, including Retrieval Augmented Generation (RAG)-based assistants and copilots, where many queries are duplicate requests from different users. For example, an IT support chatbot might see queries like "how do I install the VPN app on my laptop?" and "can you guide me through setting up the company VPN?", which are semantically equivalent and can reuse the same cached answer. Agentic AI applications break down tasks into multiple small steps that may repeatedly look up similar information, allowing reuse of cached tool outputs or answers. For example, a compliance agent can reuse information retrieved from a policy tool to handle queries such as "Is my data storage in compliance with our privacy policies?" and "Set my data retention period to comply with our policies." Semantic caching also works for data beyond text, such as matching similar audio segments in automated phone systems to reuse the same guidance for repeated requests like checking store hours or locations. Across all of these applications, the key benefits of using a semantic cache include:

  • Reduced costs – Reusing answers for similar questions reduces the number of LLM calls and overall inference spend.
  • Lower latency – Serving answers from the cache provides faster responses to users than running LLM inference.
  • Improved scalability – Reducing LLM calls for similar or repeated queries lets you serve more requests within the same model throughput limits without increasing capacity.
  • Improved consistency – Using the same cached response for semantically similar requests helps deliver a consistent answer for the same underlying question.

ElastiCache as a semantic cache store for Amazon Bedrock

Semantic caching workloads continuously write, search, and evict cache entries to serve the stream of incoming user queries while keeping responses fresh. Therefore, a semantic cache must support real-time vector updates so new queries and responses are immediately available in the cache, sustaining cache hit rates and enabling dynamic data changes. Because the semantic cache sits in the online request path of every query, it must provide low-latency lookups to minimize the impact on end-user response time. Finally, a semantic cache must efficiently manage an ephemeral hot set of entries that are written, read, and evicted frequently.

ElastiCache for Valkey is a fully managed and scalable cache service trusted by hundreds of thousands of AWS customers. Vector search in ElastiCache lets you index, search, and update billions of high-dimensional vector embeddings from providers like Amazon Bedrock, Amazon SageMaker, Anthropic, and OpenAI, with latency as low as microseconds and up to 99% recall. Vector search on ElastiCache uses a multithreaded architecture that supports real-time vector updates and high write throughput while maintaining low latency for search requests. Built-in cache features such as time to live (TTL), eviction policies, and atomic operations help you manage the ephemeral hot set of entries that semantic caching creates. These properties make ElastiCache well suited to implement a semantic cache. Further, ElastiCache for Valkey integrates with Amazon Bedrock AgentCore through the LangGraph framework, so you can implement a Valkey-backed semantic cache for agents built on Amazon Bedrock, following the guidance provided in this post. AgentCore and LangGraph provide higher-level agent orchestration with tools and a managed runtime that handles scaling and Bedrock integration.

Solution overview

The following architecture implements a read-through semantic cache for an agent on AgentCore. The key components of this solution are an agent, an embedding model that converts text queries into vectors, an LLM to generate answers, and a vector store to cache these embeddings with their associated responses for similarity search. In this example, LangGraph orchestrates the workflow, Amazon Bedrock AgentCore Runtime hosts the agent and calls the Amazon Titan Text Embeddings and Amazon Nova Premier models, and Amazon ElastiCache for Valkey is the semantic cache store. In this application, AgentCore calls the embedding model for each user query to generate a vector representation. AgentCore then invokes a semantic cache tool to send this vector to ElastiCache for Valkey to search for similar past queries stored in the cache. A request follows one of two paths in the application:

  • Cache hit: If the tool finds a prior query above a configured similarity threshold, AgentCore returns the cached answer directly to the user. This path only invokes the embedding model and doesn't require an LLM inference. Therefore, this path has millisecond-level end-to-end latency and doesn't incur an LLM inference cost.
  • Cache miss: If the application doesn't find a similar prior query, AgentCore invokes a LangGraph agent that calls the Amazon Nova Premier model to generate a new answer and returns it to the user. The application then caches this result by sending the prompt's embedding and answer to Valkey so that future similar prompts can be served from the semantic cache.

Deploy the solution

You must have the following prerequisites:

  1. An AWS account with access to Amazon Bedrock, including Amazon Bedrock AgentCore Runtime, the Amazon Titan Text Embeddings v2 model, and Amazon Nova Premier enabled in the US East (N. Virginia) Region.
  2. The AWS Command Line Interface (AWS CLI) configured with Python 3.11 or later.
  3. SSH into your Amazon Elastic Compute Cloud (Amazon EC2) instance within your Virtual Private Cloud (VPC) and install the Valkey Python client and related SDKs, for example as shown below:
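
The install commands are not reproduced in this extract, so the following is a minimal sketch of the packages the rest of the walkthrough assumes; the package names for the LangGraph Valkey integration and the AgentCore SDK are assumptions and may differ in your environment:

# Assumed package names; verify against the current AWS and LangChain documentation
pip install --upgrade boto3 valkey langchain langchain-aws langgraph
pip install --upgrade langgraph-checkpoint-aws bedrock-agentcore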

Set up an ElastiCache for Valkey cluster and the Amazon Titan embedding model

Launch an ElastiCache for Valkey cluster with version 8.2 or later, which supports vector search, using the AWS CLI:

# Creates 1 shard with 1 replica on cache.r7g.large instances
aws elasticache create-replication-group \
  --replication-group-id "valkey-semantic-cache" \
  --replication-group-description "Semantic cache for Bedrock" \
  --cache-node-type cache.r7g.large \
  --engine valkey --engine-version 8.2 \
  --num-node-groups 1 --replicas-per-node-group 1

From your application code running on your EC2 instance, connect to the Valkey configuration endpoint:
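
The connection snippet is not included in this extract; the following is a minimal sketch using the valkey-py cluster client, assuming in-transit encryption (TLS) is enabled and using a placeholder configuration endpoint that you should replace with your own:

from valkey.cluster import ValkeyCluster

# Placeholder configuration endpoint; replace with your cluster's endpoint
VALKEY_ENDPOINT = "valkey-semantic-cache.xxxxxx.clustercfg.use1.cache.amazonaws.com"

valkey_client = ValkeyCluster(host=VALKEY_ENDPOINT, port=6379, ssl=True)
valkey_client.ping()   # verify connectivity before wiring the client into the store below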

Set up Amazon Bedrock Titan embeddings:

from langchain_aws import BedrockEmbeddings

embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v2:0",
                                region_name="us-east-1")

ElastiCache for Valkey uses an index to provide fast and accurate vector search. See the ElastiCache documentation to learn more. Configure a ValkeyStore that automatically embeds the "query" field using a Hierarchical Navigable Small World (HNSW) index and COSINE as the distance metric for vector search:

from langgraph_checkpoint_aws import ValkeyStore
from hashlib import md5

store = ValkeyStore(
    client=valkey_client,
    index={"collection_name": "semantic_cache",
           "embed": embeddings,
           "fields": ["query"],             #Fields to vectorize
           "index_type": "HNSW",            #Vector search algorithm
           "distance_metric": "COSINE",     #Similarity metric
           "dims": 1024})                   #Titan V2 produces 1024-d vectors
store.setup()

def cache_key_for_query(query: str):        #Generate a cache key for this query
    return md5(query.encode("utf-8")).hexdigest()

Set up functions to search and update the semantic cache

Look up a semantically similar cached response from Valkey that is above a similarity threshold:

def search_cache(user_message: str, k: int = 3, min_similarity: float = 0.8) -> str | None:
    hits = store.search(namespace="semantic-cache",
                        query=user_message,
                        limit=k)
    if not hits:
        return None

    hits = sorted(hits, key=lambda h: h["score"], reverse=True)
    top_hit = hits[0]
    score = top_hit["score"]
    if score < min_similarity:               #Impose similarity threshold
        return None

    return top_hit["value"]["answer"]        #Return cached answer text

Update the semantic cache with new query and answer pairs for future reuse:

def store_cache(user_message: str, result_message: str) -> None:
    key = cache_key_for_query(user_message)
    store.put(namespace="semantic-cache",
              key=key,
              value={"query": user_message,
                     "answer": result_message})

Set up an AgentCore Runtime app with a read-through semantic cache

Follow the configuration steps in this guide to configure AgentCore Runtime to call ElastiCache from within a VPC. Implement a read-through semantic cache with the Valkey store using the AgentCore entrypoint:

from bedrock_agentcore.runtime import BedrockAgentCoreApp
from langchain_aws import ChatBedrock
from langchain.agents import create_agent
from langchain_core.messages import HumanMessage

model = ChatBedrock(model_id="amazon.nova-premier-v1:0", region_name="us-east-1")
app = BedrockAgentCoreApp()
app_agent = create_agent(model)

@app.entrypoint
def invoke(payload):
    user_message = payload.get("prompt", "Hello! How can I help you today?")
    threshold = payload.get("min_similarity", 0.8)

    cached = search_cache(user_message,       # 1. Try the semantic cache
                          min_similarity=threshold)
    if cached:
        return cached

    result = app_agent.invoke({               # 2. Call the LLM agent on a cache miss
        "messages": [HumanMessage(content=user_message)]})
    answer = result["messages"][-1].content

    store_cache(user_message, answer)         # 3. Store the new result in the cache
    return answer

if __name__ == "__main__":
    app.run()

Although the solution in this post uses Amazon Bedrock, semantic caching is deployment, framework, and model agnostic. You can apply the same principles to build a semantic cache with your preferred tools, for example by using an AWS Lambda function to orchestrate these steps. To learn more, refer to the following sample implementation of a semantic cache for a generative AI application, using ElastiCache for Valkey as the vector store with AWS Lambda.
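
For illustration only (this is not the referenced sample), a Lambda handler could wire the same read-through steps together; it reuses the search_cache and store_cache helpers defined earlier and calls the Bedrock Converse API with an example model ID:

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def lambda_handler(event, context):
    query = event.get("prompt", "")
    cached = search_cache(query)                     # 1. Try the semantic cache first
    if cached:
        return {"answer": cached, "cache": "hit"}

    response = bedrock.converse(                     # 2. Cache miss: call the LLM via Bedrock
        modelId="amazon.nova-premier-v1:0",          #    example model ID
        messages=[{"role": "user", "content": [{"text": query}]}])
    answer = response["output"]["message"]["content"][0]["text"]

    store_cache(query, answer)                       # 3. Store the new pair for future reuse
    return {"answer": answer, "cache": "miss"}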

Clean up

To avoid incurring additional cost, delete all the resources you created previously, for example:
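
You can delete the ElastiCache cluster created earlier with the AWS CLI as shown below; how you remove the AgentCore runtime and your EC2 instance depends on how you provisioned them:

aws elasticache delete-replication-group \
  --replication-group-id "valkey-semantic-cache"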

The impact of a semantic cache

To quantify the impact, we set up a semantic cache on 63,796 real user chatbot queries and their paraphrased variants from the public SemBenchmarkLmArena dataset. This dataset captures user interactions with the public Chatbot Arena platform across general assistant use cases such as question answering, writing, and analysis.

The application used an ElastiCache cache.r7g.large instance as the semantic cache store, Amazon Titan Text Embeddings V2 for embeddings, and Claude 3 Haiku for LLM inference.

The cache started empty, and the roughly 63,000 queries were streamed as random incoming user traffic, similar to what an application might experience in a day. As with traditional caching, we define cache hit rate as the fraction of queries answered from the cache. However, unlike traditional caching, a semantic cache hit may not always answer the intent of the user's query. To quantify the quality of cached responses, we define accuracy as the fraction of cache hits whose responses correctly answer the user's intent. Cost is measured as the sum of embedding generation, LLM inference, and ElastiCache instance costs. The cost of ancillary components such as EC2 instances is excluded since it is lower. Response latency is measured as the end-to-end time for embedding generation, vector search, LLM processing, and network transfers.

Table 1 summarizes the trade-off between cost and latency reduction versus accuracy across different similarity thresholds for the semantic cache. You can tune the similarity threshold to choose a balance between cost savings and accuracy that best fits your workload. In this evaluation, enabling a semantic cache reduced LLM cost by up to 86% while maintaining 91% answer accuracy at a similarity threshold of 0.75. Note that the choice of LLM, embedding model, and backing store for the semantic cache affects both cost and latency. In our evaluation, we used Claude 3 Haiku, a smaller, lower-cost, fast model, but semantic caching can often deliver larger benefits when used with bigger, higher-cost LLMs.

Table 1: Impact of Semantic Cache on Cost and Accuracy
Similarity Threshold | Number of Cache Hits | Cache Hit Ratio | Accuracy of Cached Responses | Total Daily Cost | Savings with Cache | Average Latency with Cache (s) | Latency Reduction
Baseline (no cache) | – | – | – | $49.5 | – | 4.35 | –
0.99 | 14,989 | 23.5% | 92.1% | $41.70 | 15.8% | 3.60 | 17.1%
0.95 | 35,749 | 56.0% | 92.6% | $23.8 | 51.9% | 1.84 | 57.7%
0.9 | 47,544 | 74.5% | 92.3% | $13.6 | 72.5% | 1.21 | 72.2%
0.8 | 55,902 | 87.6% | 91.8% | $7.6 | 84.6% | 0.60 | 86.1%
0.75 | 57,577 | 90.3% | 91.2% | $6.8 | 86.3% | 0.51 | 88.3%
0.5 | 60,126 | 94.3% | 87.5% | $5.9 | 88.0% | 0.46 | 89.3%

Table 2 shows examples of user queries with similar intents and the impact of a semantic cache on response latency. A cache hit reduced latency by up to 59x, from several seconds to a few hundred milliseconds.

Table 2: Impact of Semantic Cache on Individual Query Latency
Intent | Query | Cache Miss (s) | Cache Hit (s) | Reduction
Intent 1.a | Are there instances where SI prefixes deviate from denoting powers of 10, excluding their application? | 6.51 | 0.11 | 59x
Intent 1.b | Are there situations where SI prefixes deviate from denoting powers of 10, apart from how they are conventionally used? | | |
Intent 2.a | Sally is a girl with 3 brothers, and each of her brothers has 2 sisters. How many sisters are there in Sally's family? | 1.64 | 0.13 | 12x
Intent 2.b | Sally is a girl with 3 brothers. If each of her brothers has 2 sisters, how many sisters does Sally have in total? | | |

Best Practices for Semantic Caching

Choosing data that can be cached: Semantic caching is well suited for repeated queries whose responses are relatively stable, while real-time or highly dynamic responses are often poor candidates for caching. You can use tag and numeric filters derived from existing application context (such as product ID, category, region, or user segment) to decide which queries and responses are eligible for semantic caching and to improve the relevance of cache hits. For example, in an e-commerce shopping assistant, you might route queries about static product information through the semantic cache while sending inventory and order status queries directly to the LLM and underlying systems. When users ask about a specific product, your application can pass the product ID and category from the product page as filters to the semantic cache so it retrieves relevant prior queries and responses for that product, as in the sketch below.
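
As a sketch of this pattern, assuming the store's search call accepts a metadata filter (the filter argument and field names here are illustrative rather than a confirmed ValkeyStore API), a product-scoped lookup might look like this:

def search_product_cache(user_message, product_id, category, min_similarity=0.8):
    hits = store.search(namespace="semantic-cache",
                        query=user_message,
                        filter={"product_id": product_id,    # assumed filter support
                                "category": category},
                        limit=3)
    if not hits:
        return None
    top_hit = max(hits, key=lambda h: h["score"])            # most similar prior query
    return top_hit["value"]["answer"] if top_hit["score"] >= min_similarity else None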

Standalone queries versus conversations: An AI assistant can receive either isolated, one-off queries or multi-turn conversations that build on earlier messages. If your application handles mostly standalone queries, you can apply the semantic cache directly to the user query text. For multi-turn conversational bots, first use your conversation memory (for example, session state or a memory store) to retrieve the key facts and recent messages needed to answer the current turn. Then apply semantic caching to the combination of the current user message and that retrieved context, instead of embedding the entire raw dialogue. This lets similar questions reuse cached answers without making the embeddings overly sensitive to small changes.
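
A minimal sketch of this pattern follows, where summarize_context is a hypothetical helper that returns the condensed facts and recent messages from your conversation memory:

def answer_turn(user_message, session_memory):
    context = summarize_context(session_memory)      # hypothetical helper
    cache_text = f"{context}\n{user_message}"        # embed context + current message, not the full dialogue
    cached = search_cache(cache_text)
    if cached:
        return cached
    result = app_agent.invoke({"messages": [HumanMessage(content=cache_text)]})
    answer = result["messages"][-1].content
    store_cache(cache_text, answer)
    return answer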

Set cache invalidation periods: Cached responses must be refreshed over time to keep answers accurate as underlying data such as product information, pricing, and policies evolves, or as model behavior changes. You can use a TTL to control how long cached responses are served before they are regenerated on a cache miss. Choose a TTL that matches your application use case, as well as how often your data or model outputs change, to balance response freshness, cache efficiency, and application cost. Longer TTLs improve cache hit rates but increase the risk of stale answers, while shorter TTLs keep responses fresher but lower cache hit rates and require more LLM inference, increasing latency and cost. For example, if your product catalog, pricing, or support knowledge base is updated daily, you might set the TTL to around 23 hours and add random jitter to spread out cache invalidations over time.
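
A minimal sketch of the 23-hour TTL with jitter, shown against a plain Valkey key for illustration (how you attach a TTL to cache entries depends on the store implementation you use):

import random

BASE_TTL_SECONDS = 23 * 60 * 60                      # roughly 23 hours

def ttl_with_jitter(max_jitter_seconds=30 * 60):
    # Add up to 30 minutes of random jitter so entries don't all expire at once
    return BASE_TTL_SECONDS + random.randint(0, max_jitter_seconds)

valkey_client.set("example-cache-key", "cached answer", ex=ttl_with_jitter())   # illustrative key and value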

Personalizing responses: On a cache hit, the application might return the same response for every user, without taking their profile or current context into account. To serve personalized responses, you can either scope cache lookups to the user or segment, or generate personalized outputs by feeding the cached response plus user context into a lightweight, low-cost model for final response generation. While this approach yields smaller cost savings, it lets you deliver personalized responses to users.

Conclusion

In this post, we explored semantic caching for generative AI applications using ElastiCache, covering a sample implementation of semantic caching, performance insights, and best practices. Vector search on ElastiCache provides low-latency, in-memory storage of vectors, with zero-downtime scalability and high-performance search across billions of vectors. To get started, create a new ElastiCache Valkey 8.2 cluster using the AWS console, AWS SDK, or AWS CLI. You can use this capability with popular Valkey client libraries such as valkey-glide, valkey-py, valkey-java, and valkey-go. To learn more about vector search or the list of supported commands to get started with semantic caching, see the ElastiCache documentation.


About the authors

Meet Bhagdev

Meet is a Senior Manager, PMT-ES at AWS. He leads the product management team for Amazon ElastiCache and Amazon MemoryDB. Meet is passionate about open source, databases, and analytics, and spends his time working with customers to understand their requirements and building delightful experiences.

Chaitanya Nuthalapati

Chaitanya is a Senior Technical Product Manager in AWS In-Memory Database Services, focused on Amazon ElastiCache for Valkey. Previously, he built solutions with generative AI, machine learning, and graph networks. Off the clock, Chaitanya is busy collecting hobbies, which currently include tennis, skateboarding, and paddle-boarding.

Utkarsh Shah

Utkarsh is a Software Engineer at AWS who has made significant contributions to AWS non-relational database products, including Amazon ElastiCache. Over the past 9 years, he has led the design and delivery of complex initiatives that have had a lasting impact on the ElastiCache product, technology, and architecture. Utkarsh is also actively involved in the broader engineering community, sharing his expertise through trainings and publications.

Jungwoo Song

Jungwoo is a Solutions Architect at AWS. He works with customers to design optimal architectures for achieving their business outcomes. He also enjoys contributing to open source projects.
