Monday, December 15, 2025

Amazon Ads upgrades to Amazon ElastiCache for Valkey to achieve 12% higher throughput and save over 45% in infrastructure costs


Amazon Ads enables businesses to meaningfully engage with customers throughout their shopping journey, reaching over 300 million viewers in the US alone. Delivering the right ad to the right customer in real time at global scale requires highly available, low-latency infrastructure capable of processing tens of millions of requests per second.

In this post, we provide a behind-the-scenes look at how Amazon Sponsored Products, one of the core advertising products at Amazon, relies on Amazon ElastiCache to deliver billions of ad impressions a day using a custom multi-cluster architecture that coordinates data across large-scale cache deployments. We also walk through how upgrading from ElastiCache for Redis to ElastiCache for Valkey delivered improved performance while reducing infrastructure cost by over 45%, with zero downtime and while maintaining strict latency and availability requirements.

The challenge

Amazon Sponsored Products uses in-memory caching to deliver billions of ad impressions a day to hundreds of millions of our worldwide shoppers, peaking at tens of millions of read requests per second across terabytes of data. We consistently deliver p99 latencies below 5 ms with 99.99% availability, even during traffic surges like Prime Day or the holiday season. Additionally, we hold strict latency and availability requirements even as traffic, throughput, and memory footprint continue to grow.

At this scale, continuously optimizing for performance and cost is critical. Even small efficiency gains can translate into significant savings and improved responsiveness for our shoppers. However, given the strict availability and latency requirements, any disruption to our caching layer directly impacts ad delivery and thus the customer experience. To facilitate fast deployment of improvements to ad retrieval strategies, we designed a multi-tenant environment that must maintain isolation while enabling resource utilization across diverse workloads.

To meet these requirements at scale, we rely on Amazon ElastiCache, which provides the necessary reliability, flexibility, and operational controls to execute system upgrades and optimizations without requiring downtime or service degradation.

The solution

To achieve the scale and performance we required, we invested in two key areas. First, we built a multi-cluster architecture that isolates workloads and scales each one independently. Second, we adopted ElastiCache for Valkey through a zero-downtime migration path that strengthens reliability while improving efficiency.

Multi-cluster architecture for scale

Our cache management service, Horus, uses ElastiCache to scale out a cluster across 500 nodes. To keep pace with ever-increasing data volume, we developed a multi-cluster solution. Different ad retrieval strategies are distributed across cache clusters depending on data volume and scaling needs. For example, if one of the strategies needs a large cache space, we create a dedicated cluster for it, allowing for independent capacity planning, scaling, and performance and cost monitoring.

The Horus cache client uses AWS AppConfig to dynamically route each retrieval strategy to its cache cluster and employs multi-threading to issue parallel cache calls. Within each child thread, failures are handled in isolation so that a single key failure doesn't impact others, ensuring requests are routed correctly and system resilience is maintained without cascading impact. The following diagram provides a high-level view of how Horus coordinates dynamic request routing and cross-cluster data publication within its multi-cluster ElastiCache architecture, and the sketch after it illustrates the routing pattern in code.
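To make the routing pattern concrete, the following is a minimal sketch of the idea, not the actual Horus implementation: the strategy-to-cluster mapping is assumed to be refreshed from AWS AppConfig out of band, and RoutingCacheFacade and its method names are illustrative inventions layered on the AdsDeliveryCacheClient interface shown later in this post.

import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;

public class RoutingCacheFacade {
    // Strategy name -> cache client for that strategy's dedicated cluster;
    // the mapping is assumed to be refreshed dynamically from AWS AppConfig
    private final Map<String, AdsDeliveryCacheClient<String, byte[]>> clusterByStrategy;
    private final ExecutorService executor;

    public RoutingCacheFacade(Map<String, AdsDeliveryCacheClient<String, byte[]>> clusterByStrategy,
                              ExecutorService executor) {
        this.clusterByStrategy = clusterByStrategy;
        this.executor = executor;
    }

    /** Fetches one key per retrieval strategy in parallel. A failure in one
     *  child call degrades to an empty result instead of failing the request. */
    public Map<String, Optional<byte[]>> getForStrategies(Map<String, String> keyByStrategy) {
        Map<String, CompletableFuture<Optional<byte[]>>> futures = new HashMap<>();
        keyByStrategy.forEach((strategy, key) ->
            futures.put(strategy, CompletableFuture.supplyAsync(() -> {
                try {
                    return Optional.ofNullable(clusterByStrategy.get(strategy).get(key));
                } catch (Exception ex) {
                    return Optional.empty(); // isolate per-strategy failures
                }
            }, executor)));
        Map<String, Optional<byte[]>> results = new HashMap<>();
        futures.forEach((strategy, future) -> results.put(strategy, future.join()));
        return results;
    }
}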

Upgrading to ElastiCache for Valkey with zero downtime

While evaluating the option of migrating to Graviton-based clusters, we wanted a solution that would provide zero disruption to our ad delivery and offer the option of instant rollbacks in case of any failure. The move to ElastiCache for Valkey 8.0 allowed us to meet both goals.

We first used ElastiCache's streamlined deployment capabilities to create a zero-downtime migration plan. The support for Valkey as a drop-in replacement for Redis OSS eliminated the need for any application code changes. We then provisioned new Valkey 8.0 clusters on Graviton-based instances, which were spun up in parallel to our existing Redis-based clusters. The comprehensive Amazon CloudWatch metrics emitted by ElastiCache allowed us to monitor both clusters and provided baseline performance metrics before starting the migration.
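As a concrete illustration, provisioning such a cluster with the AWS SDK for Java v2 might look like the following sketch. The replication group ID, node type, and shard counts are placeholder assumptions, not our production configuration.

import software.amazon.awssdk.services.elasticache.ElastiCacheClient;
import software.amazon.awssdk.services.elasticache.model.CreateReplicationGroupRequest;

public class ProvisionValkeyCluster {
    public static void main(String[] args) {
        try (ElastiCacheClient elastiCache = ElastiCacheClient.create()) {
            // Create a cluster-mode-enabled Valkey 8.0 replication group on
            // Graviton-based nodes, running alongside the existing Redis clusters
            elastiCache.createReplicationGroup(CreateReplicationGroupRequest.builder()
                .replicationGroupId("horus-valkey")                 // placeholder ID
                .replicationGroupDescription("Valkey 8.0 cluster for migration validation")
                .engine("valkey")
                .engineVersion("8.0")
                .cacheNodeType("cache.r7g.2xlarge")                 // Graviton-based instance family
                .numNodeGroups(100)                                 // shards; sized to the workload
                .replicasPerNodeGroup(2)
                .automaticFailoverEnabled(true)
                .cacheParameterGroupName("default.valkey8.cluster.on")
                .build());
        }
    }
}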

Next, a dual-write implementation configured the Horus cache client to write to both the Redis and Valkey clusters simultaneously. This approach ensured data consistency across both clusters and validated that all new data was properly replicated, creating a safety net for the transition.

In the following code sample, we show the implementation of MultiWriteCacheClient, a specialized cache wrapper that uses a multi-write pattern to concurrently publish data to multiple cache clusters. It uses CompletableFuture for parallel execution across all cache clusters with a 100 ms timeout and aggregates exceptions so that partial failures don't block other operations. The client deliberately blocks read operations, since those are handled by separate clients during the migration phase, while providing proper resource cleanup when closing underlying connections.

import lombok.AllArgsConstructor;
import lombok.NonNull;
import lombok.extern.slf4j.Slf4j;

import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

@FunctionalInterface
public interface CacheOperation<K, V> {
    void execute(AdsDeliveryCacheClient<K, V> client) throws Exception;
}

/** A cache client that performs multi writes across multiple caches.
 * This client is designed for write-through scenarios where data needs to be written to multiple
 * cache clusters simultaneously for redundancy or migration purposes.
 */
@AllArgsConstructor
@Slf4j
public class MultiWriteCacheClient<K, V> implements AdsDeliveryCacheClient<K, V> {
    private static final long PUT_TIMEOUT = 100;
    private static final TimeUnit PUT_TIMEOUT_UNIT = TimeUnit.MILLISECONDS;

    @NonNull
    private final List<AdsDeliveryCacheClient<K, V>> cacheClients;
    @NonNull
    private final ExecutorService executorService;

    @Override
    public V get(K key) throws UnsupportedOperationException {
        throw new UnsupportedOperationException("This operation is not supported in MultiWriteCacheClient");
    }

    @Override
    public Map<K, V> getAll(List<K> keys) throws UnsupportedOperationException {
        throw new UnsupportedOperationException("This operation is not supported in MultiWriteCacheClient");
    }

    @Override
    public void put(K key, V value, long ttl, TimeUnit ttlTimeUnit) throws AdsDeliveryCacheClientException {
        executeParallelCacheOperation(client -> client.put(key, value, ttl, ttlTimeUnit));
    }

    @Override
    public void putAll(Map<K, V> items, long ttl, TimeUnit ttlTimeUnit) throws AdsDeliveryCacheClientException {
        executeParallelCacheOperation(client -> client.putAll(items, ttl, ttlTimeUnit));
    }

    /** Core logic for multi-write parallel execution **/
    private void executeParallelCacheOperation(CacheOperation<K, V> operation) throws AdsDeliveryCacheClientException {
        // Step 1: Create futures for each cache client in parallel
        List<CompletableFuture<Exception>> futures = cacheClients.stream()
                .map(client -> CompletableFuture.supplyAsync(() -> {
                    try {
                        operation.execute(client);
                        return null; // Success
                    } catch (Exception ex) {
                        return ex; // Return the exception
                    }
                }, executorService))
                .collect(Collectors.toList());

        // Step 2: Aggregate the operation results and collect exceptions if any
        AdsDeliveryCacheClientException adcException = null;
        for (CompletableFuture<Exception> future : futures) {
            try {
                Exception ex = future.get(PUT_TIMEOUT, PUT_TIMEOUT_UNIT);
                if (ex != null) {
                    adcException = addException(adcException, ex);
                }
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
                adcException = addException(adcException, ex);
            } catch (ExecutionException | TimeoutException ex) {
                adcException = addException(adcException, new Exception(ex));
            }
        }

        // Step 3: Throw aggregated exception after all operations complete
        if (adcException != null) {
            throw adcException;
        }
    }

    private AdsDeliveryCacheClientException addException(AdsDeliveryCacheClientException existing, Exception newException) {
        return existing != null ? existing.add(newException) : new AdsDeliveryCacheClientException(newException);
    }

    /** Resource cleanup: safely close all underlying cache connections **/
    @Override
    public void close() {
        IntStream.range(0, cacheClients.size())
            .forEach(i -> closeCacheClient(cacheClients.get(i), "MultiWriteCacheClient[" + i + "]"));
    }

    /** Fault-tolerant cleanup: continue closing other clients even if one fails **/
    private void closeCacheClient(AdsDeliveryCacheClient<K, V> cacheClient, String name) {
        try {
            cacheClient.close();
        } catch (Exception ex) {
            log.error("Failed to close " + name, ex);
        }
    }
}
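For context, wiring the dual-writer during the migration phase could look like the following hypothetical snippet. createRedisClient and createValkeyClient are stand-ins for whatever factories produce clients backed by the two clusters; they are not part of the code above.

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class DualWriteWiring {
    public static void main(String[] args) throws Exception {
        // Hypothetical factories for clients backed by the Redis and Valkey clusters
        AdsDeliveryCacheClient<String, byte[]> redisClient = createRedisClient();
        AdsDeliveryCacheClient<String, byte[]> valkeyClient = createValkeyClient();
        ExecutorService pool = Executors.newFixedThreadPool(8); // sized to write volume

        // Writes fan out to both clusters; reads continue through the existing path
        AdsDeliveryCacheClient<String, byte[]> dualWriter =
                new MultiWriteCacheClient<>(List.of(redisClient, valkeyClient), pool);
        dualWriter.put("ad:123", new byte[] {1, 2, 3}, 15, TimeUnit.MINUTES);
        dualWriter.close(); // closes both underlying clients, tolerating per-client failures
    }
}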

We then gradually shifted the read traffic to the new Valkey cluster while closely monitoring performance metrics and error rates. Throughout this process, we retained the ability to roll back to the Redis cluster as a backup. After successful validation of the Valkey cluster under full load, the old Redis cluster was decommissioned and its resources were reclaimed. Throughout the migration we had no impact on either service availability or performance, ensuring continuous ad delivery for our shoppers.
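We don't describe the exact cut-over mechanism here, but a dial-based read router along the following lines, with the percentage raised gradually from a configuration flag (for example, via AWS AppConfig) and dropped back to zero for an instant rollback, captures the idea. This is an illustrative sketch, not the production Horus router.

import java.util.concurrent.ThreadLocalRandom;

public class MigrationReadRouter<K, V> {
    private final AdsDeliveryCacheClient<K, V> redisClient;
    private final AdsDeliveryCacheClient<K, V> valkeyClient;
    private volatile int valkeyReadPercentage; // 0..100, raised gradually during migration

    public MigrationReadRouter(AdsDeliveryCacheClient<K, V> redisClient,
                               AdsDeliveryCacheClient<K, V> valkeyClient,
                               int initialPercentage) {
        this.redisClient = redisClient;
        this.valkeyClient = valkeyClient;
        this.valkeyReadPercentage = initialPercentage;
    }

    /** Updated from dynamic configuration; setting 0 rolls all reads back to Redis. */
    public void setValkeyReadPercentage(int percentage) {
        this.valkeyReadPercentage = percentage;
    }

    public V get(K key) throws Exception {
        // Route a configurable slice of reads to the new Valkey cluster
        if (ThreadLocalRandom.current().nextInt(100) < valkeyReadPercentage) {
            return valkeyClient.get(key);
        }
        return redisClient.get(key);
    }
}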

With the cost efficiency of Graviton-based instances, Valkey's pricing advantage, and its performance advantages over ElastiCache for Redis, this migration effort reduced our infrastructure spend by over 45% while improving throughput by 12%.

Considerations

Evaluating single-cluster and multi-cluster setups: While the multi-cluster architecture provides significant benefits, it introduces additional operational overhead that must be carefully managed. This approach is most suitable for larger cache systems (>30 TB), while smaller cache requirements (<30 TB) may be better served by a single-cluster setup for operational simplicity and reduced maintenance needs.

Ensuring data consistency: Data consistency presents another critical consideration during transitions. The dual-write approach requires careful monitoring to ensure consistency across environments. Maintaining data integrity across cluster boundaries is crucial to prevent service degradation or incorrect ad delivery.

Performance monitoring: Careful performance monitoring is essential when operating across multiple clusters. We relied on CloudWatch metrics emitted by ElastiCache to track cache health, including engine CPU utilization, memory usage, read/write transactions per second (TPS), keys reclaimed or expired, and current connection counts. On the client side, we captured detailed service performance metrics such as cache hit rate and TPS across multiple dimensions (cache cluster, sourcing strategy, marketplace, widget, and more), as well as read/write latencies and error rates. These metrics were visualized through dashboards and backed by alerting systems, enabling us to quickly detect anomalies and proactively address issues before they cause service degradation.
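As one example of the alerting side, an alarm on the EngineCPUUtilization metric that ElastiCache publishes to CloudWatch could be defined with the AWS SDK for Java v2 as follows. The alarm name, threshold, and node ID are illustrative placeholders.

import software.amazon.awssdk.services.cloudwatch.CloudWatchClient;
import software.amazon.awssdk.services.cloudwatch.model.ComparisonOperator;
import software.amazon.awssdk.services.cloudwatch.model.Dimension;
import software.amazon.awssdk.services.cloudwatch.model.PutMetricAlarmRequest;
import software.amazon.awssdk.services.cloudwatch.model.Statistic;

public class CacheCpuAlarm {
    public static void main(String[] args) {
        try (CloudWatchClient cloudWatch = CloudWatchClient.create()) {
            // Alarm when average engine CPU stays above 80% for five minutes
            cloudWatch.putMetricAlarm(PutMetricAlarmRequest.builder()
                .alarmName("horus-valkey-engine-cpu-high")
                .namespace("AWS/ElastiCache")
                .metricName("EngineCPUUtilization")
                .dimensions(Dimension.builder()
                    .name("CacheClusterId")
                    .value("horus-valkey-0001-001") // placeholder node ID
                    .build())
                .statistic(Statistic.AVERAGE)
                .period(60)
                .evaluationPeriods(5)
                .threshold(80.0)
                .comparisonOperator(ComparisonOperator.GREATER_THAN_THRESHOLD)
                .build());
        }
    }
}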

Cost savings: Balancing cost against performance is always a key consideration, especially at the scale Horus operates. While migrations, such as adopting Valkey 8.0 and moving to Graviton instances, require careful planning, the long-term gains can be substantial. In this case, the performance improvements and infrastructure cost savings delivered by Valkey 8.0 clearly outweighed the short-term complexity of the migration, making it a high-impact upgrade that paid off quickly.

Conclusion

The successful migration to ElastiCache for Valkey, combined with a scalable multi-cluster architecture, helped Amazon Sponsored Products meet the growing demands of real-time ad delivery at Amazon scale. By leveraging ElastiCache's performance, observability, and ease of use, along with Valkey's improved memory efficiency and throughput, we executed a zero-downtime upgrade without any changes to application code or disruption to ad delivery. Through this migration, we reduced infrastructure spend by over 45% while improving throughput by 12%.

More importantly, the upgrade established a repeatable recipe for modernizing cache infrastructure with minimal risk. Whether you're looking to boost performance, cut costs, or improve scalability, ElastiCache for Valkey provides the flexibility and reliability to do so at any scale.

Learn more and get started with Amazon ElastiCache for Valkey by visiting the ElastiCache product pages and documentation. For a step-by-step guide to upgrading to the latest version of ElastiCache for Valkey, see the upgrade documentation.


About the authors

Jiahui Tao

Jiahui is a Senior Software Engineer at Amazon Ads, where she works on Ad Sourcing Services, designing and scaling large-scale, low-latency distributed systems. She enjoys tackling complex engineering challenges and turning solutions into reusable frameworks that drive impact across teams. Outside of work, Jiahui loves spending time with her dogs, traveling, and enjoying live music at concerts around New York City.

Hao Yuan

Hao is a Software Engineering Manager at Amazon Ads, where he leads engineering teams building infrastructure and products for Sponsored Products and Sponsored Brands Sourcing retrieval systems. He focuses on building talented teams to solve complex technical and business problems that make meaningful impact for customers.

Zheyu Zhu

Zheyu is a Software Engineer at Amazon Ads, working on Ad Sourcing Services. He focuses on building scalable distributed systems and optimizing infrastructure for high-performance, low-latency services. Outside of work, Zheyu enjoys playing tennis, exploring new hiking adventures, and snowboarding.

Mas Kubo

Mas is a Product Manager on the In-Memory Databases team at AWS focused on Valkey, the open-source high-performance datastore engine for Amazon ElastiCache. Outside of work, Mas follows the wind and the sea while freediving, paragliding, kitesurfing, and sailing.
