Wednesday, February 4, 2026

Construct dependable Agentic AI resolution with Amazon Bedrock: Study from Pushpay’s journey on GenAI analysis


This submit was co-written with Saurabh Gupta and Todd Colby from Pushpay.

Pushpay is a market-leading digital giving and engagement platform designed to assist church buildings and faith-based organizations drive neighborhood engagement, handle donations, and strengthen generosity fundraising processes effectively. Pushpay’s church administration system supplies church directors and ministry leaders with insight-driven reporting, donor improvement dashboards, and automation of monetary workflows.

Utilizing the ability of generative AI, Pushpay developed an revolutionary agentic AI search characteristic constructed for the distinctive wants of ministries. The strategy makes use of pure language processing so ministry employees can ask questions in plain English and generate real-time, actionable insights from their neighborhood knowledge. The AI search characteristic addresses a crucial problem confronted by ministry leaders: the necessity for fast entry to neighborhood insights with out requiring technical experience. For instance, ministry leaders can enter “present me people who find themselves members in a bunch, however haven’t given this 12 months” or “present me people who find themselves not engaged in my church,” and use the outcomes to take significant motion to raised assist people of their neighborhood. Most neighborhood leaders are time-constrained and lack technical backgrounds; they will use this resolution to acquire significant knowledge about their congregations in seconds utilizing pure language queries.

By empowering ministry employees with sooner entry to neighborhood insights, the AI search characteristic helps Pushpay’s mission to encourage generosity and connection between church buildings and their neighborhood members. Early adoption customers report that this resolution has shortened their time to insights from minutes to seconds. To realize this outcome, the Pushpay workforce constructed the characteristic utilizing agentic AI capabilities on Amazon Net Companies (AWS) whereas implementing sturdy high quality assurance measures and establishing a fast iterative suggestions loop for steady enhancements.

On this submit, we stroll you thru Pushpay’s journey in constructing this resolution and discover how Pushpay used Amazon Bedrock to create a customized generative AI analysis framework for steady high quality assurance and establishing fast iteration suggestions loops on AWS.

Resolution overview: AI powered search structure

The answer consists of a number of key parts that work collectively to ship an enhanced search expertise. The next determine exhibits the answer structure diagram and the general workflow.

Determine 1: AI Search Resolution Structure

  • Consumer interface layer: The answer begins with Pushpay customers submitting pure language queries by way of the prevailing Pushpay utility interface. By utilizing pure language queries, church ministry employees can get hold of knowledge insights utilizing AI capabilities with out studying new instruments or interfaces.
  • AI search agent: On the coronary heart of the system lies the AI search agent, which consists of two key parts:
    • System immediate: Incorporates the massive language mannequin (LLM) position definitions, directions, and utility descriptions that information the agent’s habits.
    • Dynamic immediate constructor (DPC): robotically constructs extra personalized system prompts based mostly on the person particular data, similar to church context, pattern queries, and utility filter stock. Additionally they use semantic search to pick out solely related filters amongst lots of of accessible utility filters. The DPC improves response accuracy and person expertise.
  • Amazon Bedrock superior characteristic: The answer makes use of the next Amazon Bedrock managed companies:
    • Immediate caching: Reduces latency and prices by caching regularly used system immediate.
    • LLM processing: Makes use of Claude Sonnet 4.5 to course of prompts and generate JSON output required by the applying to show the specified question outcomes as insights to customers.
  • Analysis system: The analysis system implements a closed-loop enchancment resolution the place person interactions are instrumented, captured and evaluated offline. The analysis outcomes feed right into a dashboard for product and engineering groups to research and drive iterative enhancements to the AI search agent. Throughout this course of, the information science workforce collects a golden dataset and constantly curates this dataset based mostly on the precise person queries coupled with validated responses.

The challenges of preliminary resolution with out analysis

To create the AI search characteristic, Pushpay developed the primary iteration of the AI search agent. The answer implements a single agent configured with a rigorously tuned system immediate that features the system position, directions, and the way the person interface works with detailed rationalization of every filter software and their sub-settings. The system immediate is cached utilizing Amazon Bedrock immediate caching to cut back token price and latency. The agent makes use of the system immediate to invoke an Amazon Bedrock LLM which generates the JSON doc that Pushpay’s utility makes use of to use filters and current question outcomes to customers.

Nonetheless, this primary iteration rapidly revealed some limitations. Whereas it demonstrated a 60-70% success charge with fundamental enterprise queries, the workforce reached an accuracy plateau. The analysis of the agent was a guide and tedious course of Tuning the system immediate past this accuracy threshold proved difficult given the varied spectrum of person queries and the applying’s protection of over 100 distinct configurable filters. These introduced crucial blockers for the workforce’s path to manufacturing.

Figure 2: AI Search First Solution

Determine 2: AI Search First Resolution

Bettering the answer by including a customized generative AI analysis framework

To handle the challenges of measuring and enhancing agent accuracy, the workforce carried out a generative AI analysis framework built-in into the prevailing structure, proven within the following determine. This framework consists of 4 key parts that work collectively to offer complete efficiency insights and allow data-driven enhancements.

Figure 3: Introducing the GenAI Evaluation Framework

Determine 3: Introducing the GenAI Analysis Framework

  1. The golden dataset: A curated golden dataset containing over 300 consultant queries, every paired with its corresponding anticipated output, types the muse of automated analysis. The product and knowledge science groups rigorously developed and validated this dataset to attain complete protection of real-world use circumstances and edge circumstances. Moreover, there’s a steady curation means of including consultant precise person queries with validated outcomes.
  2. The evaluator: The evaluator part processes person enter queries and compares the agent-generated output in opposition to the golden dataset utilizing the LLM as a decide sample This strategy generates core accuracy metrics whereas capturing detailed logs and efficiency knowledge, similar to latency, for additional evaluation and debugging.
  3. Area class: Area classes are developed utilizing a mixture of generative AI area summarization and human-defined common expressions to successfully categorize person queries. The evaluator determines the area class for every question, enabling nuanced, category-based analysis as a further dimension of analysis metrics.
  4. Generative AI analysis dashboard: The dashboard serves because the mission management for Pushpay’s product and engineering groups, displaying area category-level metrics to evaluate efficiency and latency and information selections. It shifts the workforce from single combination scores to nuanced, domain-based efficiency insights.

The accuracy dashboard: Pinpointing weaknesses by area

As a result of person queries are categorized into area classes, the dashboard incorporates statistical confidence visualization utilizing a 95% Wilson rating interval to show accuracy metrics and question volumes at every area stage. By utilizing classes, the workforce can pinpoint the AI agent’s weaknesses by area. Within the following instance , the “exercise” area exhibits considerably decrease accuracy than different classes.

Figure 4: Pinpointing Agent Weaknesses by Domain

Determine 4: Pinpointing Agent Weaknesses by Area

Moreover, a efficiency dashboard, proven within the following determine, visualizes latency indicators on the area class stage, together with latency distributions from p50 to p90 percentiles. Within the following instance, the exercise area reveals notably increased latency than others.

Identifying Latency Bottlenecks by Domain

Determine 5: Figuring out Latency Bottlenecks by Area

Strategic rollout by way of domain-Degree insights

Area-based metrics revealed various efficiency ranges throughout semantic domains, offering essential insights into agent effectiveness. Pushpay used this granular visibility to make strategic characteristic rollout selections. By quickly suppressing underperforming classes—similar to exercise queries—whereas present process optimization, the system achieved 95% general accuracy. By utilizing this strategy, customers skilled solely the highest-performing options whereas the workforce refined others to manufacturing requirements.

Determine 6: Reaching 95% Accuracy with Area-Degree Function Rollout

Strategic prioritization: Specializing in high-impact domains

To prioritize enhancements systematically, Pushpay employed a 2×2 matrix framework plotting subjects in opposition to two dimensions (proven within the following determine): Enterprise precedence (vertical axis) and present efficiency or feasibility (horizontal axis). This visualization positioned subjects with each excessive enterprise worth and powerful present efficiency within the top-right quadrant. The workforce then targeted on these areas as a result of they required much less heavy lifting to attain additional accuracy enchancment from already-good ranges to an distinctive 95% accuracy for the enterprise targeted subjects.

The implementation adopted an iterative cycle: after every spherical of enhancements, they re-analyze the outcomes to determine the subsequent set of high-potential subjects. This systematic, cyclical strategy enabled steady optimization whereas sustaining concentrate on business-critical areas.

Figure 7: Strategic Prioritization Framework for Domain Category Optimization

Determine 7: Strategic Prioritization Framework for Area Class Optimization

Dynamic immediate development

The insights gained from the analysis framework led to an architectural enhancement: the introduction of a dynamic immediate constructor. This part enabled fast iterative enhancements by permitting fine-grained management over which area classes the agent might tackle. The structured subject stock – beforehand embedded within the system immediate – was remodeled right into a dynamic component, utilizing semantic search to assemble contextually related prompts for every person question. This strategy tailors the immediate filter stock based mostly on three key contextual dimensions: question content material, person persona, and tenant-specific necessities. The result’s a extra exact and environment friendly system that generates extremely related responses whereas sustaining the flexibleness wanted for steady optimization.

Enterprise affect

The generative AI analysis framework grew to become the cornerstone of Pushpay’s AI characteristic improvement, delivering measurable worth throughout three dimensions:

  • Consumer expertise: The AI search characteristic decreased time-to-insight from roughly 120 seconds (skilled customers manually navigating complicated UX) to below 4 seconds – a 15-fold acceleration that immediately helps improve ministry leaders’ productiveness and decision-making pace. This characteristic democratized knowledge insights, in order that customers of various technical ranges can entry significant intelligence with out requiring specialised experience.
  • Growth velocity: The scientific analysis strategy remodeled optimization cycles. Fairly than debating immediate modifications, the workforce now validates adjustments and measures domain-specific impacts inside minutes, changing extended deliberations with data-driven iteration.
  • Manufacturing readiness: Enhancements from 60–70% accuracy to greater than 95% accuracy utilizing high-performance domains offered the quantitative confidence required for customer-facing deployment, whereas the framework’s structure permits steady refinement throughout different area classes.

Key takeaways on your AI agent journey

The next are key takeaways from Pushpay’s expertise that you should utilize in your personal AI agent journey.

1/ Construct with manufacturing in thoughts from day one

Constructing agentic AI programs is simple, however scaling them to manufacturing is difficult. Builders ought to undertake a scaling mindset in the course of the proof-of-concept part, not after. Implementing sturdy tracing and analysis frameworks early, supplies a transparent pathway from experimentation to manufacturing. By utilizing this technique, groups can determine and tackle accuracy points systematically earlier than they grow to be blockers.

2/ Benefit from the superior options of Amazon Bedrock

Amazon Bedrock immediate caching considerably reduces token prices and latency by caching regularly used system prompts. For brokers with massive, steady system prompts, this characteristic is important for production-grade efficiency.

3/ Assume past combination metrics

Combination accuracy scores can generally masks crucial efficiency variations. By evaluating agent efficiency on the area class stage, Pushpay uncovered weaknesses past what a single accuracy metric can seize. This granular strategy permits focused optimization and knowledgeable rollout selections, ensuring customers solely expertise high-performing options whereas others are refined.

4/ Knowledge safety and accountable AI

When growing agentic AI programs, take into account data safety and LLM safety issues from the outset, following the AWS Shared Accountability Mannequin, as a result of safety necessities essentially affect the architectural design. Pushpay’s clients are church buildings and faith-based organizations who’re stewards of delicate data—together with pastoral care conversations, monetary giving patterns, household struggles, prayer requests and extra. On this implementation instance, Pushpay set a transparent strategy to incorporating AI ethically inside its product ecosystem, sustaining strict safety requirements to make sure church knowledge and personally identifiable data (PII) stays inside its safe partnership ecosystem. Knowledge is shared solely with safe and acceptable knowledge protections utilized and is rarely used to coach exterior fashions. To be taught extra about Pushpay’s requirements for incorporating AI inside their merchandise, go to the Pushpay Information Heart for a extra in-depth evaluation of firm requirements.

Conclusion: Your Path to Manufacturing-Prepared AI Brokers

Pushpay’s journey from a 60–70% accuracy prototype to a 95% correct production-ready AI agent demonstrates that constructing dependable agentic AI programs requires extra than simply subtle prompts—it calls for a scientific, data-driven strategy to analysis and optimization. The important thing breakthrough wasn’t within the AI expertise itself, however in implementing a complete analysis framework constructed on sturdy observability basis that offered granular visibility into agent efficiency throughout completely different domains. This systematic strategy enabled fast iteration, strategic rollout selections, and steady enchancment.

Able to construct your personal production-ready AI agent?

  • Discover Amazon Bedrock: Start constructing your agent with Amazon Bedrock
  • Implement LLM-as-a-judge: Create your personal analysis system utilizing the patterns described on this LLM-as-a-judge on Amazon Bedrock Mannequin Analysis
  • Construct your golden dataset: Begin curating consultant queries and anticipated outputs on your particular use case

In regards to the authors

Roger Wang is a Senior Resolution Architect at AWS. He’s a seasoned architect with over 20 years of expertise within the software program trade. He helps New Zealand and world software program and SaaS corporations use cutting-edge expertise at AWS to resolve complicated enterprise challenges. Roger is obsessed with bridging the hole between enterprise drivers and technological capabilities and thrives on facilitating conversations that drive impactful outcomes.

Melanie LiMelanie Li, PhD, is a Senior Generative AI Specialist Options Architect at AWS based mostly in Sydney, Australia, the place her focus is on working with clients to construct options leveraging state-of-the-art AI and machine studying instruments. She has been actively concerned in a number of Generative AI initiatives throughout APJ, harnessing the ability of Giant Language Fashions (LLMs). Previous to becoming a member of AWS, Dr. Li held knowledge science roles within the monetary and retail industries.

Frank Huang, PhD, is a Senior Analytics Specialist Options Architect at AWS based mostly in Auckland, New Zealand. He focuses on serving to clients ship superior analytics and AI/ML options. All through his profession, Frank has labored throughout quite a lot of industries similar to monetary companies, Web3, hospitality, media and leisure, and telecommunications. Frank is raring to make use of his deep experience in cloud structure, AIOps, and end-to-end resolution supply to assist clients obtain tangible enterprise outcomes with the ability of information and AI.

Saurabh Gupta is a knowledge science and AI skilled at Pushpay based mostly in Auckland, New Zealand, the place he focuses on implementing sensible AI options and statistical modeling. He has intensive expertise in machine studying, knowledge science, and Python for knowledge science purposes, with specialised expertise coaching in database brokers and AI implementation. Previous to his present position, he gained expertise in telecom, retail and monetary companies, growing experience in advertising and marketing analytics and buyer retention applications. He has a Grasp’s in Statistics from College of Auckland and a Grasp’s in Enterprise Administration from the Indian Institute of Administration, Calcutta.

Todd Colby is a Senior Software program Engineer at Pushpay based mostly in Seattle. His experience is concentrated on evolving complicated legacy purposes with AI, and translating person wants into structured, high-accuracy options. He leverages AI to extend supply velocity and produce innovative metrics and enterprise choice instruments.

Related Articles

LEAVE A REPLY

Please enter your comment!
Please enter your name here

Latest Articles