How to Build a Production-Ready Multi-Agent Incident Response System Using OpenAI Swarm and Tool-Augmented Agents

January 5, 2026



In this tutorial, we build an advanced yet practical multi-agent system using OpenAI Swarm that runs in Colab. We demonstrate how to orchestrate specialized agents, such as a triage agent, an SRE agent, a communications agent, and a critic, to collaboratively handle a real-world production incident scenario. By structuring agent handoffs, integrating lightweight tools for knowledge retrieval and decision ranking, and keeping the implementation clean and modular, we show how Swarm lets us design controllable, agentic workflows without heavy frameworks or complex infrastructure.

!pip -q install -U openai
!pip -q install -U "git+https://github.com/openai/swarm.git"


import os


def load_openai_key():
    # Prefer Colab secrets; outside Colab the import fails and we fall through.
    try:
        from google.colab import userdata
        key = userdata.get("OPENAI_API_KEY")
    except Exception:
        key = None
    # Fall back to a hidden interactive prompt.
    if not key:
        import getpass
        key = getpass.getpass("Enter OPENAI_API_KEY (hidden): ").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY not provided")
    return key


os.environ["OPENAI_API_KEY"] = load_openai_key()

We set up the environment and securely load the OpenAI API key so the notebook can run safely in Google Colab. We ensure the key is fetched from Colab secrets when available and fall back to a hidden prompt otherwise. This keeps authentication simple and reusable across sessions.

import json
import re
from typing import List, Dict
from swarm import Swarm, Agent


client = Swarm()

We import the core Python utilities and initialize the Swarm client that orchestrates all agent interactions. This snippet establishes the runtime backbone that allows agents to communicate, hand off tasks, and execute tool calls. It serves as the entry point for the multi-agent workflow.

KB_DOCS = [
   {
       "id": "kb-incident-001",
       "title": "API Latency Incident Playbook",
       "text": "If p95 latency spikes, validate deploys, dependencies, and error rates. Rollback, cache, rate-limit, scale. Compare p50 vs p99 and inspect upstream timeouts."
   },
   {
       "id": "kb-risk-001",
       "title": "Risk Communication Guidelines",
       "text": "Updates must include impact, scope, mitigation, owner, and next update. Avoid blame and separate internal vs external messaging."
   },
   {
       "id": "kb-ops-001",
       "title": "On-call Handoff Template",
       "text": "Include summary, timeline, current status, mitigations, open questions, next actions, and owners."
   },
]


def _normalize(s: str) -> List[str]:
    # Lowercase, replace anything that is not a letter, digit, or whitespace, then split.
    return re.sub(r"[^a-z0-9\s]", " ", s.lower()).split()


def search_kb(query: str, top_k: int = 3) -> str:
    q = set(_normalize(query))
    scored = []
    for d in KB_DOCS:
        # Score = number of distinct query tokens found in the title or text.
        score = len(q.intersection(set(_normalize(d["title"] + " " + d["text"]))))
        scored.append((score, d))
    scored.sort(key=lambda x: x[0], reverse=True)
    # Keep matching documents; if none match, return the top-ranked one anyway.
    docs = [d for s, d in scored[:top_k] if s > 0] or [scored[0][1]]
    return json.dumps(docs, indent=2)

We define a lightweight internal knowledge base and implement a retrieval function to surface relevant context during agent reasoning. By using simple token-based matching, we allow agents to ground their responses in predefined operational documents. This demonstrates how Swarm can be augmented with domain-specific memory without external dependencies.
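To make the token-overlap idea concrete, here is a minimal standalone sketch. It assumes a hypothetical two-document corpus (`DOCS`, shortened from the knowledge base) and a helper named `overlap_scores` that is ours for illustration, not part of the tutorial code or of Swarm.

```python
import re

# Hypothetical two-document corpus, shortened from the knowledge base entries.
DOCS = {
    "kb-incident-001": "API Latency Incident Playbook. If p95 latency spikes, validate deploys.",
    "kb-risk-001": "Risk Communication Guidelines. Updates must include impact and scope.",
}


def tokens(text: str) -> set:
    """Lowercase, replace non-alphanumeric characters with spaces, split into a set."""
    return set(re.sub(r"[^a-z0-9\s]", " ", text.lower()).split())


def overlap_scores(query: str) -> dict:
    """Count how many distinct query tokens appear in each document."""
    q = tokens(query)
    return {doc_id: len(q & tokens(text)) for doc_id, text in DOCS.items()}


scores = overlap_scores("p95 latency after deploy")
print(scores)  # {'kb-incident-001': 2, 'kb-risk-001': 0}
```

Only `p95` and `latency` match here: `deploy` does not match `deploys` because this kind of matching does no stemming. That is an acceptable trade-off for a small demo corpus, but it is worth keeping in mind before relying on it for larger document sets.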

def estimate_mitigation_impact(options_json: str) -> str:
    try:
        options = json.loads(options_json)
    except Exception as e:
        # Return the parse error as JSON so the agent can see and correct it.
        return json.dumps({"error": str(e)})
    ranking = []
    for o in options:
        conf = float(o.get("confidence", 0.5))
        risk = o.get("risk", "medium")
        # Higher risk subtracts more from the confidence.
        penalty = {"low": 0.1, "medium": 0.25, "high": 0.45}.get(risk, 0.25)
        ranking.append({
            "option": o.get("option"),
            "confidence": conf,
            "risk": risk,
            "score": round(conf - penalty, 3)
        })
    ranking.sort(key=lambda x: x["score"], reverse=True)
    return json.dumps(ranking, indent=2)

We introduce a structured tool that evaluates and ranks mitigation strategies based on confidence and risk. Each option's score is its confidence minus a risk penalty (0.1 for low, 0.25 for medium, 0.45 for high), and options are returned highest score first. This allows agents to move beyond free-form reasoning and produce semi-quantitative decisions. We show how tools can enforce consistency and decision discipline in agent outputs.
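As a worked example of the scoring rule, the sketch below ranks three candidate mitigations. The penalties come from the tool itself; the candidate options and their confidence values are hypothetical, and `rank` is a condensed stand-in for the tool rather than the tutorial's function.

```python
import json

# Risk penalties as used by the ranking tool.
RISK_PENALTY = {"low": 0.1, "medium": 0.25, "high": 0.45}


def rank(options):
    """Score each option as confidence minus risk penalty, highest first."""
    ranked = [
        {**o, "score": round(o["confidence"] - RISK_PENALTY.get(o["risk"], 0.25), 3)}
        for o in options
    ]
    return sorted(ranked, key=lambda o: o["score"], reverse=True)


# Hypothetical candidate mitigations; the numbers are illustrative only.
candidates = [
    {"option": "fail over database", "confidence": 0.7, "risk": "high"},
    {"option": "rollback deploy", "confidence": 0.8, "risk": "low"},
    {"option": "scale out API tier", "confidence": 0.6, "risk": "medium"},
]

ranked = rank(candidates)
print(json.dumps(ranked, indent=2))
# Scores: rollback deploy 0.7, scale out API tier 0.35, fail over database 0.25
```

Note how the high-risk option falls to last place even though its confidence (0.7) is higher than that of scaling out (0.6): the 0.45 penalty outweighs the confidence gap.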

def handoff_to_sre():
   return sre_agent


def handoff_to_comms():
   return comms_agent


def handoff_to_handoff_writer():
   return handoff_writer_agent


def handoff_to_critic():
   return critic_agent

We define explicit handoff functions that enable one agent to transfer control to another. Each function simply returns the target agent, which Swarm treats as a signal to switch the active agent. This snippet illustrates how we model delegation and specialization within Swarm. It makes agent-to-agent routing clear and easy to extend.
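The handoff convention can be illustrated without calling any model. The sketch below is a simplified stand-in, not Swarm's actual implementation: `ToyAgent` and `call_tool` are hypothetical names introduced here, and the only behavior modeled is "a tool that returns an agent makes that agent active".

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ToyAgent:
    """Hypothetical stand-in for swarm.Agent; only the fields this sketch needs."""
    name: str
    functions: List[Callable] = field(default_factory=list)


sre = ToyAgent(name="SRE")


def handoff_to_sre():
    # Returning an agent object is the signal to transfer control.
    return sre


triage = ToyAgent(name="Triage", functions=[handoff_to_sre])


def call_tool(active: ToyAgent, tool_name: str) -> ToyAgent:
    """Run the named tool; if it returns an agent, that agent becomes active."""
    tool = {f.__name__: f for f in active.functions}[tool_name]
    result = tool()
    return result if isinstance(result, ToyAgent) else active


active = call_tool(triage, "handoff_to_sre")
print(active.name)  # SRE
```

Because the handoff functions look up the target agent only when they are called, they can be defined before the agents themselves, which is the order the tutorial uses.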

triage_agent = Agent(
    name="Triage",
    model="gpt-4o-mini",
    instructions="""
Decide which agent should handle the request.
Use SRE for incident response.
Use Comms for customer or executive messaging.
Use HandoffWriter for on-call notes.
Use Critic for review or improvement.
""",
    functions=[search_kb, handoff_to_sre, handoff_to_comms, handoff_to_handoff_writer, handoff_to_critic]
)


sre_agent = Agent(
    name="SRE",
    model="gpt-4o-mini",
    instructions="""
Produce a structured incident response with triage steps,
ranked mitigations, ranked hypotheses, and a 30-minute plan.
""",
    functions=[search_kb, estimate_mitigation_impact]
)


comms_agent = Agent(
    name="Comms",
    model="gpt-4o-mini",
    instructions="""
Produce an external customer update and an internal technical update.
""",
    functions=[search_kb]
)


handoff_writer_agent = Agent(
    name="HandoffWriter",
    model="gpt-4o-mini",
    instructions="""
Produce a clear on-call handoff document with standard headings.
""",
    functions=[search_kb]
)


critic_agent = Agent(
    name="Critic",
    model="gpt-4o-mini",
    instructions="""
Critique the previous answer, then produce a refined final version and a checklist.
"""
)

We configure several specialized agents, each with a clearly scoped responsibility and instruction set. By separating triage, incident response, communications, handoff writing, and critique, we demonstrate a clean division of labor.

def run_pipeline(user_request: str):
    messages = [{"role": "user", "content": user_request}]
    # Stage 1: triage routes the request to a specialist agent.
    r1 = client.run(agent=triage_agent, messages=messages, max_turns=8)
    # Stage 2: the critic reviews the full conversation and refines the answer.
    messages2 = r1.messages + [{"role": "user", "content": "Review and improve the last answer"}]
    r2 = client.run(agent=critic_agent, messages=messages2, max_turns=4)
    return r2.messages[-1]["content"]


request = """
Production p95 latency jumped from 250ms to 2.5s after a deploy.
Errors slightly increased, DB CPU stable, upstream timeouts rising.
Provide a 30-minute action plan and a customer update.
"""


print(run_pipeline(request))

We assemble the full orchestration pipeline that executes triage, specialist reasoning, and critical refinement in sequence. This snippet shows how we run the end-to-end workflow with a single function call. It ties together all agents and tools into a coherent, production-style agentic system.
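When debugging a pipeline like this, it helps to see which agents actually spoke and in what order. The helper below is a sketch under one assumption: that each assistant message in the returned history carries a `sender` key naming the agent. We believe Swarm adds this field, but verify it against the version you install. The function name `trace_senders` and the sample history are hypothetical.

```python
from typing import Dict, List


def trace_senders(messages: List[Dict]) -> List[str]:
    """Return the agents that produced assistant messages, collapsing adjacent repeats."""
    path: List[str] = []
    for m in messages:
        sender = m.get("sender")
        if m.get("role") == "assistant" and sender and (not path or path[-1] != sender):
            path.append(sender)
    return path


# Hypothetical history shaped like a chat transcript; not real model output.
history = [
    {"role": "user", "content": "p95 latency jumped after a deploy"},
    {"role": "assistant", "sender": "Triage", "content": None},
    {"role": "tool", "content": "handoff"},
    {"role": "assistant", "sender": "SRE", "content": "Step 1: check the deploy diff."},
    {"role": "assistant", "sender": "SRE", "content": "Step 2: consider rollback."},
]

path = trace_senders(history)
print(" -> ".join(path))  # Triage -> SRE
```

Applied to the message list returned by the second stage of the pipeline, a trace like this would show whether triage routed the request to the specialist we expected before the critic ran.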

In conclusion, we established a clear pattern for designing agent-oriented systems with OpenAI Swarm that emphasizes clarity, separation of responsibilities, and iterative refinement. We showed how to route tasks intelligently, enrich agent reasoning with local tools, and improve output quality via a critic loop, all while maintaining a simple, Colab-friendly setup. This approach allows us to scale from experimentation to real operational use cases, making Swarm a solid foundation for building reliable, production-grade agentic AI workflows.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.
