Sunday, January 25, 2026

How the Amazon.com Catalog Team built self-learning generative AI at scale with Amazon Bedrock


The Amazon.com Catalog is the foundation of every customer’s shopping experience—the definitive source of product information with attributes that power search, recommendations, and discovery. When a seller lists a new product, the catalog system must extract structured attributes—dimensions, materials, compatibility, and technical specifications—while generating content such as titles that match how customers search. A title isn’t a simple enumeration like color or size; it must balance seller intent, customer search behavior, and discoverability. This complexity, multiplied across millions of daily submissions, makes catalog enrichment an ideal proving ground for self-learning AI.

In this post, we demonstrate how the Amazon Catalog Team built a self-learning system that continuously improves accuracy while reducing costs at scale using Amazon Bedrock.

The challenge

In generative AI deployment environments, improving model performance requires constant attention. Because models process millions of products, they inevitably encounter edge cases, evolving terminology, and domain-specific patterns where accuracy can degrade. The traditional approach—applied scientists analyzing failures, updating prompts, testing changes, and redeploying—works, but it is resource-intensive and struggles to keep pace with real-world volume and variety. The challenge isn’t whether we can improve these systems, but how to make improvement scalable and automatic rather than dependent on manual intervention. At Amazon Catalog, we faced this challenge head-on. The tradeoffs seemed impossible: large models would deliver accuracy but wouldn’t scale economically to our volume, while smaller models struggled with the complex, ambiguous cases where sellers needed the most help.

Solution overview

Our breakthrough came from an unconventional experiment. Instead of choosing a single model, we deployed multiple smaller models to process the same item. When these models agreed on an attribute extraction, we could trust the result. But when they disagreed—whether from genuine ambiguity, missing context, or one model making an error—we discovered something profound. These disagreements weren’t always errors, but they were almost always signals of complexity.

This led us to design a self-learning system that reimagines how generative AI scales. Multiple smaller models process routine cases through consensus, invoking larger models only when disagreements occur. The larger model is implemented as a supervisor agent with access to specialized tools for deeper investigation and analysis. But the supervisor doesn’t just resolve disputes; it generates reusable learnings, stored in a dynamic knowledge base, that help prevent entire classes of future disagreements. We invoke more powerful models only when the system detects high learning value at inference time, while correcting the output. The result is a self-learning system where costs decrease and quality increases—because the system learns to handle the edge cases that previously triggered supervisor calls. Error rates fell consistently, not through retraining but through accumulated learnings from resolved disagreements injected into smaller models’ prompts. The following figure shows the architecture of this self-learning system.

In the self-learning architecture, product data flows through generator-evaluator workers, with disagreements routed to a supervisor for investigation. Post-inference, the system also captures feedback signals from sellers (such as listing updates and appeals) and customers (such as returns and negative reviews). Learnings from these sources are stored in a hierarchical knowledge base and injected back into worker prompts, creating a continuous improvement loop.
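As a minimal illustration of this consensus-or-escalate routing, the Python sketch below runs several workers on one item and calls the supervisor only on disagreement. The `route_extraction` interface and the `workers` and `supervisor` callables are hypothetical, not our production APIs:

```python
from collections import Counter

def route_extraction(item: dict, workers: list, supervisor) -> dict:
    """Run every worker on the item; accept unanimous results, escalate the rest."""
    # Each worker returns its extraction for the item, e.g. {"material": "oak"}.
    results = [worker(item) for worker in workers]
    distinct = Counter(str(r) for r in results)

    if len(distinct) == 1:
        # Consensus path: the workers agree, so trust the result at minimal cost.
        return {"value": results[0], "source": "consensus"}

    # Disagreement path: invoke the more capable supervisor, which both resolves
    # the dispute and emits a reusable learning for the knowledge base.
    resolution = supervisor(item, results)
    return {
        "value": resolution["value"],
        "source": "supervisor",
        "learning": resolution.get("learning"),
    }
```

The consensus path never touches the larger model; only genuine disagreements pay for supervision.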

The following describes a simplified reference architecture that demonstrates how this self-learning pattern can be implemented using AWS services. While our production system has additional complexity, this example illustrates the core components and data flows.

This system can be built with Amazon Bedrock, which provides the essential infrastructure for multi-model architectures. The ability of Amazon Bedrock to access diverse foundation models lets teams deploy smaller, efficient models like Amazon Nova Lite as workers and more capable models like Anthropic Claude Sonnet as supervisors—optimizing both cost and performance. For even greater cost efficiency at scale, teams can also deploy open source small models on Amazon Elastic Compute Cloud (Amazon EC2) GPU instances, providing full control over worker model selection and batch throughput optimization. For productionizing a supervisor agent with its specialized tools and dynamic knowledge base, Amazon Bedrock AgentCore provides the runtime scalability, memory management, and observability needed to deploy self-learning systems reliably at scale.
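To make the model split concrete, the sketch below invokes a worker through the Amazon Bedrock Converse API; the same helper would reach the supervisor model only on disagreement. The region, model IDs, and prompts are illustrative assumptions, and model availability varies by account:

```python
import boto3

# Bedrock Runtime client; the region and model IDs below are illustrative.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

WORKER_MODEL_ID = "amazon.nova-lite-v1:0"
SUPERVISOR_MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"

def invoke(model_id: str, system_prompt: str, user_prompt: str) -> str:
    """Call a Bedrock model through the Converse API and return its text reply."""
    response = bedrock.converse(
        modelId=model_id,
        system=[{"text": system_prompt}],
        messages=[{"role": "user", "content": [{"text": user_prompt}]}],
        inferenceConfig={"maxTokens": 512, "temperature": 0.0},
    )
    return response["output"]["message"]["content"][0]["text"]

# A lightweight worker handles the routine extraction; the larger supervisor
# model is reached through the same helper only when workers disagree.
extraction = invoke(
    WORKER_MODEL_ID,
    "Extract product attributes as JSON.",
    "Title: Solid Oak Dining Table, 72 x 36 inches, seats 6",
)
```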

Our supervisor agent integrates with Amazon’s extensive Selection and Catalog Systems. The preceding diagram is a simplified view showing the key features of the agent and some of the AWS services that make it possible. Product data flows through generator-evaluator workers (Amazon EC2 and Amazon Bedrock Runtime), with agreements stored directly and disagreements routed to a supervisor agent (Amazon Bedrock AgentCore). The learning aggregator and memory manager use Amazon DynamoDB for the knowledge base, with learnings injected back into worker prompts. Human review (Amazon Simple Queue Service (Amazon SQS)) and observability (Amazon CloudWatch) complete the architecture. Production implementations will likely require additional components for scale, reliability, and integration with existing systems.

But how did we arrive at this architecture? The key insight came from an unexpected place.

The insight: Turning disagreements into opportunities

Our perspective shifted during a debugging session. When multiple smaller models (such as Nova Lite) disagreed on product attributes—interpreting the same specification differently based on how they understood technical terminology—we initially saw this as a failure. But the data told a different story: products where our smaller models disagreed correlated with cases requiring more manual review and clarification. When models disagreed, those were precisely the products that needed additional investigation. The disagreements were surfacing learning opportunities, but we couldn’t have engineers and scientists deep-dive on every case. The supervisor agent does this automatically at scale. And crucially, the goal isn’t just to determine which model was right—it’s to extract learnings that help prevent similar disagreements in the future. This is the key to efficient scaling.

Disagreements don’t just come from AI workers at inference time. Post-inference, sellers express disagreement through listing updates and appeals—signals that our original extraction might have missed important context. Customers disagree through returns and negative reviews, often indicating that product information didn’t match expectations. These post-inference human signals feed into the same learning pipeline, with the supervisor investigating patterns and generating learnings that help prevent similar issues across future products.

We found a sweet spot: attributes with moderate AI worker disagreement rates yielded the richest learnings—high enough to surface meaningful patterns, low enough to indicate solvable ambiguity. When disagreement rates are too low, they typically reflect noise or fundamental model limitations rather than learnable patterns—for those, we consider using more capable workers. When disagreement rates are too high, it signals that worker models or prompts aren’t yet mature enough, triggering excessive supervisor calls that undermine the efficiency gains of the architecture. These thresholds will vary by task and domain; the key is identifying your own sweet spot where disagreements represent genuine complexity worth investigating, rather than fundamental gaps in worker capability or random noise.
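One simple way to operationalize that sweet spot is to bucket each attribute by its observed disagreement rate before enabling supervision. The thresholds in the sketch below are purely illustrative; as noted above, the productive range varies by task and domain:

```python
def classify_disagreement_rate(rate: float,
                               low: float = 0.02,
                               high: float = 0.30) -> str:
    """Bucket an attribute's worker disagreement rate; thresholds are illustrative."""
    if rate < low:
        # Mostly noise or fundamental model limitations: little learnable signal.
        return "low: consider more capable workers"
    if rate > high:
        # Workers or prompts not yet mature: supervisor calls would dominate cost.
        return "high: refine worker models or prompts before enabling supervision"
    # The sweet spot: enough disagreements to surface meaningful patterns,
    # few enough that supervisor investigation stays economical.
    return "moderate: route disagreements to the supervisor"
```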

Deep dive: How it works

At the heart of our system are multiple lightweight worker models running in parallel—some as generators extracting attributes, others as evaluators assessing those extractions. These workers can be implemented in a non-agentic manner with fixed inputs, making them batch-friendly and scalable. The generator-evaluator pattern creates productive tension, conceptually similar to the productive tension in generative adversarial networks (GANs), though our approach operates at inference time through prompting rather than training. We explicitly prompt evaluators to be critical, instructing them to scrutinize extractions for ambiguities, missing context, or potential misinterpretations. This adversarial dynamic surfaces disagreements that represent genuine complexity rather than letting ambiguous cases pass through undetected. When the generator and evaluator agree, we have high confidence in the result and process it at minimal computational cost. This consensus path handles most product attributes. When they disagree, we’ve identified a case worth investigating—triggering the supervisor to resolve the dispute and extract reusable learnings.
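The prompt pair below sketches how that critical stance can be encoded. The wording is illustrative rather than our production prompts, and `is_disagreement` is a hypothetical helper:

```python
import json

GENERATOR_PROMPT = (
    "You are an attribute extractor. Given a product listing, return the "
    "requested attributes as JSON. If a value is not stated, use null."
)

# The evaluator is explicitly prompted to be adversarial rather than agreeable:
# its job is to surface ambiguity, not to rubber-stamp the generator.
EVALUATOR_PROMPT = (
    "You are a critical reviewer of attribute extractions. Scrutinize the "
    "extraction for ambiguities, missing context, or misinterpretations. "
    'Return JSON: {"verdict": "agree" | "needs_improvement", "reason": "..."}'
)

def is_disagreement(evaluator_reply: str) -> bool:
    """A disagreement is any evaluation that flags the extraction."""
    return json.loads(evaluator_reply)["verdict"] == "needs_improvement"
```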

Our architecture treats disagreement as a universal learning signal. At inference time, worker-to-worker disagreements catch ambiguity. Post-inference, seller feedback catches misalignments with intent and customer feedback catches misalignments with expectations. The three channels feed the supervisor, which extracts learnings that improve accuracy across the board.

When workers disagree, we invoke a supervisor agent—a more capable model that resolves the dispute and investigates why it occurred. The supervisor determines what context or reasoning the workers lacked, and these insights become reusable learnings for future cases. For example, when workers disagreed about usage classification for a product based on certain technical terms, the supervisor investigated and clarified that those terms alone were insufficient—visual context and other signals needed to be considered together. The supervisor generated a learning about how to properly weight different signals for that product category. This learning immediately updated our knowledge base, and when injected into worker prompts for similar products, helped prevent future disagreements across thousands of items.

While the workers could theoretically be the same model as the supervisor, using smaller models is crucial for efficiency at scale. The architectural advantage emerges from this asymmetry: lightweight workers handle routine cases through consensus, while the more capable supervisor is invoked only when disagreements surface high-value learning opportunities. As the system accumulates learnings and disagreement rates drop, supervisor calls naturally decline—efficiency gains are baked directly into the architecture.

This worker-supervisor heterogeneity also enables richer investigation. Because supervisors are invoked selectively, they can afford to pull in additional signals—customer reviews, return reasons, seller history—that would be impractical to retrieve for every product but provide crucial context when resolving complex disagreements. When these signals yield generalizable insights about how customers want product information presented—which attributes to highlight, what terminology resonates, how to frame specifications—the resulting learnings benefit future inferences across similar products without retrieving those resource-intensive signals again. Over time, this creates a feedback loop: better product information leads to fewer returns and negative reviews, which in turn reflects improved customer satisfaction.
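A learning can be represented as a small structured record that the memory manager files into the knowledge base and that workers later receive in their prompts. The `Learning` shape and the example contents below are hypothetical, echoing the usage-classification case above:

```python
from dataclasses import dataclass, field

@dataclass
class Learning:
    """A reusable learning distilled from one resolved disagreement."""
    category: str   # product category the learning applies to
    guidance: str   # concrete instruction injected into worker prompts
    evidence: list = field(default_factory=list)  # item IDs that motivated it

# What a supervisor resolution might produce for the usage-classification
# example (contents illustrative):
learning = Learning(
    category="power_tools",
    guidance=(
        "Terms like 'industrial-grade' alone are insufficient to classify "
        "usage; weigh them together with visual context and other signals."
    ),
    evidence=["ITEM-0001", "ITEM-0002"],
)
```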

The knowledge base: Making learnings scalable

The supervisor investigates disagreements at the individual product level. With millions of items to process, we need a scalable way to transform these product-specific insights into reusable learnings. Our aggregation strategy adapts to context: high-volume patterns get synthesized into broader learnings, while unique or critical cases are preserved individually. We use a hierarchical structure where a large language model (LLM)-based memory manager navigates the knowledge tree to place each learning. Starting from the root, it traverses categories and subcategories, deciding at each level whether to continue down an existing path, create a new branch, merge with existing knowledge, or replace outdated information. This dynamic organization allows the knowledge base to evolve with emerging patterns while maintaining a logical structure.

During inference, workers receive relevant learnings in their prompts based on product category, automatically incorporating domain knowledge from past disagreements. The knowledge base also introduces traceability—when an extraction seems incorrect, we can pinpoint exactly which learning influenced it. This shifts auditing from an unscalable task to a practical one: instead of reviewing a sample of millions of outputs—where human effort grows proportionally with scale—teams can audit the knowledge base itself, which remains relatively fixed in size regardless of inference volume. Domain experts can contribute directly by adding or refining entries, no retraining required. A single well-crafted learning can immediately improve accuracy across thousands of products. The knowledge base bridges human expertise and AI capability, where automated learnings and human insights work together.
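The retrieval side of this hierarchy can be sketched as a simple walk from the root to the item’s category, accumulating learnings from general to specific for prompt injection. The `KnowledgeNode` structure is an assumption for illustration, and the LLM-based placement logic is omitted:

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeNode:
    """One node in the hierarchical knowledge base (e.g., a product category)."""
    name: str
    learnings: list = field(default_factory=list)
    children: dict = field(default_factory=dict)

def learnings_for(root: KnowledgeNode, category_path: list) -> list:
    """Collect learnings from the root down to the item's category."""
    collected, node = list(root.learnings), root
    for name in category_path:
        node = node.children.get(name)
        if node is None:
            break  # no deeper knowledge yet; inherited learnings still apply
        collected.extend(node.learnings)
    return collected

# At inference time, the relevant learnings are appended to the worker prompt:
# prompt = BASE_PROMPT + "\nApply these learnings:\n" + "\n".join(learnings)
```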

Lessons learned and best practices

When this self-learning architecture works best:

  • High-volume inference where input diversity drives compounded learning
  • Quality-critical applications where consensus provides natural quality assurance
  • Evolving domains where new patterns and terminology constantly emerge

It’s less suitable for low-volume scenarios (insufficient disagreements for learning) or use cases with fixed, unchanging rules.

Critical success factors:

  • Defining disagreements: With a generator-evaluator pair, disagreement occurs when the evaluator flags the extraction as needing improvement. With multiple workers, scale thresholds accordingly. The key is maintaining productive tension between workers. If disagreement rates fall outside the productive range (too low or too high), consider more capable workers or refined prompts.
  • Monitoring learning effectiveness: Disagreement rates should decrease over time—this is your primary health metric (see the sketch after this list). If rates stay flat, investigate knowledge retrieval, prompt injection, or evaluator criticality.
  • Knowledge organization: Structure learnings hierarchically and keep them actionable. Abstract guidance doesn’t help; specific, concrete learnings directly improve future inferences.
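As a minimal example of tracking that health metric, per-attribute disagreement rates can be published to Amazon CloudWatch and alarmed on when the trend flattens. The namespace and dimension names below are illustrative:

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

def emit_disagreement_rate(attribute: str, disagreements: int, total: int) -> None:
    """Publish one attribute's disagreement rate so its trend can be monitored."""
    cloudwatch.put_metric_data(
        Namespace="SelfLearningExtraction",  # illustrative namespace
        MetricData=[{
            "MetricName": "DisagreementRate",
            "Dimensions": [{"Name": "Attribute", "Value": attribute}],
            "Value": disagreements / max(total, 1),
            "Unit": "None",
        }],
    )
```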

Common pitfalls

  • Focusing on cost over intelligence: Cost reduction is a byproduct, not the goal
  • Rubber-stamp evaluators: Evaluators that simply approve generator outputs won’t surface meaningful disagreements—prompt them to actively challenge and critique extractions
  • Poor learning extraction: Supervisors must identify generalizable patterns, not just fix individual cases
  • Knowledge rot: Without organization, learnings become unsearchable and unusable

The key insight: treat declining disagreement rates as your north star metric—they show the system is truly learning.

Deployment strategies: Two approaches

  • Learn-then-deploy: Start with basic prompts and let the system learn aggressively in a pre-production environment. Domain experts then audit the knowledge base—not individual outputs—to confirm learned patterns align with desired outcomes. Once approved, deploy with the validated learnings. This is ideal for new use cases where you don’t yet know what good looks like—disagreements help discover the right patterns, and knowledge base auditing lets you shape them before production.
  • Deploy-and-learn: Start with refined prompts and good initial quality, then continuously improve through ongoing learning in production. This works best for well-understood use cases where you can define quality upfront but still want to capture domain-specific nuances over time.

Both approaches use the same architecture—the choice depends on whether you’re exploring new territory or optimizing familiar ground.

Conclusion

What started as an experiment in catalog enrichment revealed a fundamental truth: AI systems don’t have to be frozen in time. By embracing disagreements as learning signals rather than failures, we’ve built an architecture that accumulates domain knowledge through actual usage. We watched the system evolve from generic understanding to domain-specific expertise. It learned industry-specific terminology. It discovered contextual rules that vary across categories. It adapted to requirements no pre-trained model would encounter—all without retraining, through learnings stored in a knowledge base and injected back into worker prompts. For teams operationalizing similar architectures, Amazon Bedrock AgentCore offers purpose-built capabilities:

  • AgentCore Runtime handles fast consensus decisions for routine cases while supporting extended reasoning when supervisors investigate complex disagreements
  • AgentCore Observability provides visibility into which learnings drive impact, helping teams refine knowledge propagation and maintain reliability at scale

The implications extend beyond catalog management. High-volume AI applications can benefit from this approach—and the ability of Amazon Bedrock to access diverse models makes this architecture straightforward to implement. The key insight is this: we’ve shifted from asking “which model should we use?” to “how do we build systems that learn our specific patterns?” Whether you learn-then-deploy for new use cases or deploy-and-learn for established ones, the implementation is straightforward: start with workers suited to your task, choose a supervisor, and let disagreements drive learning. With the right architecture, every inference can become an opportunity to capture domain knowledge. That’s not just scaling—that’s building institutional knowledge into your AI systems.

Acknowledgement

This work wouldn’t have been possible without the contributions and support of Ankur Datta (Senior Principal Applied Scientist and science leader in Everyday Essentials Stores), Zhu Cheng (Applied Scientist), Xuan Tang (Software Engineer), and Mohammad Ghasemi (Applied Scientist). We sincerely appreciate their contributions to the designs and implementations, the numerous fruitful brainstorming sessions, and all the insightful ideas and suggestions.


About the authors

Tarik Arici is a Principal Scientist at Amazon Selection and Catalog Systems (ASCS), where he pioneers the design of self-learning generative AI systems for catalog quality enhancement at scale. His work focuses on building AI systems that automatically accumulate domain knowledge through production usage—learning from customer reviews and returns, seller feedback, and model disagreements to improve quality while reducing costs. Tarik holds a PhD in Electrical and Computer Engineering from the Georgia Institute of Technology.

Sameer Thombare is a Senior Product Manager at Amazon with over a decade of experience in product management and category/P&L management across diverse industries, including heavy engineering, telecommunications, finance, and eCommerce. Sameer is passionate about creating continuously improving closed-loop systems and leads strategic initiatives within Amazon Selection and Catalog Systems (ASCS) to build an advanced self-learning closed-loop system that synthesizes signals from customers, sellers, and supply chain operations to optimize outcomes. Sameer holds an MBA from the Indian Institute of Management Bangalore and an engineering degree from Mumbai University.

Amin Banitalebi received his PhD in Digital Media from the University of British Columbia (UBC), Canada, in 2014. Since then, he has held various applied science roles spanning computer vision, natural language processing, recommendation systems, classical machine learning, and generative AI. Amin has co-authored over 90 publications and patents. He is currently an Applied Science Manager in Amazon Everyday Essentials.

Puneet Sahni is a Senior Principal Engineer at Amazon Selection and Catalog Systems (ASCS), where he has spent over 8 years improving the completeness, consistency, and correctness of catalog data. He focuses on catalog data modeling and its application to improving Selling Partner and customer experiences, while using ML/DL and LLM-based enrichment to drive improvements in catalog data quality.

Erdinc Basci joined Amazon in 2015 and brings over 23 years of technology industry experience. At Amazon, he has led the evolution of Catalog system architectures—including ingestion pipelines, prioritized processing, and traffic shaping—as well as catalog data architecture improvements such as segmented offers, product specifications for manufacture-on-demand products, and catalog data experimentation. Erdinc has championed a hands-on performance engineering culture across Amazon services, unlocking $1B+ in annualized cost savings and 20%+ latency wins across core Stores services. He is currently focused on improving generative AI application performance and GPU efficiency across Amazon. Erdinc holds a BS in Computer Science from Bilkent University, Turkey, and an MBA from Seattle University, US.

Mey Meenakshisundaram is a Director in Amazon Selection and Catalog Systems, where he leads innovative GenAI solutions to establish Amazon’s worldwide catalog as the best-in-class source for product information. His team pioneers advanced machine learning techniques, including multi-agent systems and large language models, to automatically enrich product attributes and improve catalog quality at scale. High-quality product information in the catalog is critical for delighting customers in finding the right products, empowering selling partners to list their products effectively, and enabling Amazon operations to reduce manual effort.
