Friday, December 19, 2025

My AI System Works…But Is It Safe to Use?


Software is a means of communicating human intent to a machine. When developers write software code, they are providing precise instructions to the machine in a language the machine is designed to understand and respond to. For complex tasks, these instructions can become lengthy and difficult to check for correctness and security. Artificial intelligence (AI) offers the alternative possibility of interacting with machines in ways that are native to humans: plain language descriptions of goals, spoken words, even gestures or references to physical objects visible to both the human and the machine. Because it is so much easier to describe complex goals to an AI system than it is to develop millions of lines of software code, it is not surprising that many people see the possibility that AI systems could consume greater and greater portions of the software world. However, greater reliance on AI systems may expose mission owners to novel risks, necessitating new approaches to test and evaluation.

SEI researchers and others in the software community have spent decades studying the behavior of software systems and their developers. This research has advanced software development and testing practices, increasing our confidence in complex software systems that perform critical functions for society. In contrast, there has been far less opportunity to study and understand the potential failure modes and vulnerabilities of AI systems, particularly those AI systems that employ large language models (LLMs) to match or exceed human performance on difficult tasks.

In this blog post, we introduce System Theoretic Process Analysis (STPA), a hazard analysis technique uniquely suited to dealing with the complexity of AI systems. From preventing outages at Google to improving safety in the aviation and automotive industries, STPA has proven to be a versatile and powerful method for analyzing complex sociotechnical systems. In our work, we have also found that applying STPA clarifies the safety and security objectives of AI systems. Based on our experience applying it, we describe four specific ways that STPA has reliably provided insights to enhance the safety and security of AI systems.

The Rationale for System Theoretic Process Analysis (STPA)

If we were to treat a system with AI components like any other system, common practice would call for following a systematic analysis process to identify hazards. Hazards are conditions within a system that could lead to mishaps in its operation, resulting in death, injury, or damage to equipment. System Theoretic Process Analysis (STPA) is a recent innovation in hazard analysis that stands out as a promising approach for AI systems. The four-step STPA workflow leads the analyst to identify unsafe interactions between the components of complex systems, as illustrated by the basic security-related example in Figure 1. In the example, an LLM agent has access to a sandbox computer and a search engine, which are tools the LLM can employ to better address user needs. The LLM can use the search engine to retrieve information relevant to a user's request, and it can write and execute scripts on the sandbox computer to run calculations or generate data plots. However, giving the LLM the ability to autonomously search and execute scripts on the host system potentially exposes the system owner to security risks, as in this example from the GitHub blog. STPA offers a structured way to define these risks and then identify, and ultimately prevent, the unsafe system interactions that give rise to them.

Figure 1. STPA Steps and LLM Agent with Tools Example
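To make the setup in Figure 1 concrete, below is a minimal sketch of an LLM agent that can call a search tool or execute a script on a sandbox computer. The `llm_complete`, `search`, and `run_in_sandbox` functions are hypothetical placeholders rather than any real framework's API; the point is only the shape of the control relationships that STPA later analyzes.

```python
# Minimal sketch of the Figure 1 architecture: an LLM agent with a search
# tool and a sandbox script runner. llm_complete() stands in for a real
# model call; the tool wiring is the part relevant to the hazard analysis.
import subprocess
import sys

def llm_complete(prompt: str) -> str:
    """Placeholder for a real LLM call; returns a canned tool request."""
    return "SEARCH: site reliability engineering postmortems"

def search(query: str) -> str:
    """Placeholder search tool; a real agent would call a search API."""
    return f"Top results for {query!r} ..."

def run_in_sandbox(script: str, timeout_s: int = 5) -> str:
    """Execute an LLM-written script in a subprocess.
    NOTE: a bare subprocess is NOT a hardened sandbox; see the mitigations
    discussed later in this post."""
    proc = subprocess.run(
        [sys.executable, "-c", script],
        capture_output=True, text=True, timeout=timeout_s,
    )
    return proc.stdout + proc.stderr

def agent_step(user_query: str) -> str:
    """One control-loop iteration: the LLM decides whether to use a tool."""
    decision = llm_complete(f"User asks: {user_query}. Choose a tool.")
    if decision.startswith("SEARCH:"):
        return search(decision.removeprefix("SEARCH:").strip())
    if decision.startswith("RUN:"):
        return run_in_sandbox(decision.removeprefix("RUN:").strip())
    return decision

if __name__ == "__main__":
    print(agent_step("Summarize recent SRE outage postmortems"))
```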

Historically, hazard analysis methods have focused on identifying and preventing unsafe conditions that arise due to component failures, such as a cracked seal or a valve stuck in the open position. These types of hazards typically call for greater redundancy, maintenance, or inspection to reduce the probability of failure. A failure-based accident framework is not a good fit for AI (or software, for that matter), because AI hazards are not the result of the AI component failing in the same way that a seal or a valve might fail. AI hazards arise when fully functioning programs faithfully follow flawed instructions. Adding redundancy of such components would do nothing to reduce the probability of failure.

STPA posits that, in addition to component failures, complex systems enter hazardous states because of unsafe interactions among imperfectly controlled components. This foundation is a better fit for systems that have software components, including components that rely on AI. Instead of pointing to redundancy as a solution, STPA emphasizes constraining system interactions to prevent the software and AI components from taking certain normally allowable actions at times when those actions would lead to a hazardous state. Research at MIT comparing STPA with traditional hazard-analysis methods reported that, “In all of these evaluations, STPA found all the causal scenarios found by the more traditional analyses, but it also identified many more, often software-related and non-failure, scenarios that the traditional methods did not find.” Past SEI research has also applied STPA to analyze the safety and security of software systems. Recently, we have used this approach to analyze AI systems as well. Every time we apply STPA to AI systems, even ones in widespread use, we discover new system behaviors that could lead to hazards.

Introduction to System Theoretic Process Analysis (STPA)

STPA begins by identifying the set of harms, or losses, that system developers must prevent. In Figure 1 above, system developers must prevent a loss of privacy for their customers, which could result in the customers becoming victims of criminal activity. A safe and secure system is one that cannot cause customers to lose control over their personal information.

Next, STPA considers hazards: system-level states or conditions that could cause losses. The example system in Figure 1 could cause a loss of customer privacy if any of its component interactions cause it to become unable to protect the customers' private information from unauthorized users. These harm-inducing states provide a target for developers. If the system design always maintains its ability to protect customers' information, then the system cannot cause a loss of customer privacy.

At this point, system theory becomes more prominent. STPA considers the relationships between the components as control loops, which together compose the control structure. A control loop specifies the goals of each component and the commands it can issue to other parts of the system to achieve those goals. It also considers the feedback available to the component, enabling it to know when to issue different commands. In Figure 1, the user enters queries to the LLM and reviews its responses. Based on the user queries, the LLM decides whether to search for information and whether to execute scripts on the sandbox computer, each of which produces results that the LLM can use to better address the user's needs.
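As a sketch of what writing down this control structure might look like, the snippet below records the control loops from Figure 1 as plain data so they can be reviewed and queried. The representation and component names are notional, paraphrasing the example rather than reproducing an official STPA artifact.

```python
# Notional encoding of the Figure 1 control structure: who commands whom,
# with which control actions, and what feedback flows back.
from dataclasses import dataclass

@dataclass
class ControlLoop:
    controller: str             # component that issues commands
    controlled: str             # component that receives commands
    control_actions: list[str]  # commands the controller can issue
    feedback: list[str]         # information flowing back to the controller

control_structure = [
    ControlLoop("User", "LLM agent",
                control_actions=["submit query"],
                feedback=["response text"]),
    ControlLoop("LLM agent", "Search engine",
                control_actions=["issue search"],
                feedback=["search results"]),
    ControlLoop("LLM agent", "Sandbox computer",
                control_actions=["execute script"],
                feedback=["stdout/stderr", "exit status"]),
]

# Example query over the structure: which commands can reach the sandbox?
for loop in control_structure:
    if loop.controlled == "Sandbox computer":
        print(loop.controller, "->", loop.control_actions)
```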

This control structure is a powerful lens for viewing safety and security. Designers can use control loops to identify unsafe control actions: combinations of control actions and conditions that would create one of the hazardous states. For example, if the LLM executes a script that enables access to private information and transmits it outside of the session, this could leave the system unable to protect sensitive information.

Finally, given these potentially unsafe commands, STPA prompts designers to ask: what are the scenarios in which the component would issue such a command? For example, what combination of user inputs and other conditions could lead the LLM to execute commands that it should not? These scenarios form the basis of safety fixes that constrain the commands so the system operates within a safe envelope.
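One way to keep these STPA outputs organized is to record each unsafe control action alongside the hazard it can create and the scenarios that could produce it. The sketch below is a notional structure for doing so; the wording of the entries paraphrases the example in this post and is not a complete analysis.

```python
# Notional records linking an unsafe control action (UCA) to its hazard
# and to a loss scenario explaining how the controller could issue it.
from dataclasses import dataclass

@dataclass
class UnsafeControlAction:
    controller: str
    action: str
    context: str   # condition under which the action is unsafe
    hazard: str    # system-level state the action can create

@dataclass
class LossScenario:
    uca: UnsafeControlAction
    description: str  # how the controller could come to issue the UCA

uca = UnsafeControlAction(
    controller="LLM agent",
    action="execute script",
    context="script transmits private information outside the session",
    hazard="system is unable to protect customers' private information",
)

scenario = LossScenario(
    uca=uca,
    description=("Search results contain an adversarial prompt; the LLM "
                 "follows it and writes a script that exfiltrates data."),
)
print(scenario.uca.action, "->", scenario.uca.hazard)
```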

STPA scenarios can also be applied to system security. In the same way that a safety analysis develops scenarios in which a controller in the system might issue unsafe control actions on its own, a security analysis considers how an adversary could exploit these flaws. What if the adversary intentionally tricks the LLM into executing an unsafe script by requesting that the LLM test it before responding?

In sum, safety scenarios point to new requirements that prevent the system from causing hazards, and security scenarios point to new requirements that prevent adversaries from bringing hazards upon the system. If these requirements prevent unsafe control actions from causing the hazards, the system is safe/secure from the losses.

Four Ways STPA Produces Actionable Insights in AI Systems

We discussed above how STPA can contribute to better system safety and security. In this section we describe how STPA reliably produces insights when our team performs hazard analyses of AI systems.

1. STPA produces a clear definition of safety and security for a system. The NIST AI Risk Management Framework identifies 14 AI-specific risks, while the NIST Generative Artificial Intelligence Profile outlines 12 additional categories that are unique to or amplified by generative AI. For example, generative AI systems may confabulate, reinforce harmful biases, or produce abusive content. These behaviors are widely considered undesirable, and mitigating them remains an active focus of academic and industry research.

However, from a system-safety perspective, AI risk taxonomies can be both overly broad and incomplete. Not all risks apply to every use case. Furthermore, new risks may emerge from interactions between the AI and other system components (e.g., a user might submit an out-of-scope request, or a retrieval agent might rely on outdated information from an external database).

STPA offers a more direct approach to assessing safety in systems, including those incorporating AI components. It begins by identifying potential losses, defined as the loss of something valued by system stakeholders, such as human life, property, environmental integrity, mission success, or organizational reputation. In the case of an LLM integrated with a code interpreter on an organization's internal infrastructure, potential losses could include damage to property, wasted time, or mission failure if the interpreter executes code with effects beyond its sandbox. Additionally, it could lead to reputational harm or exposure of sensitive information if the code compromises system integrity.

These losses are context specific and depend on how the system is used. This definition aligns closely with standards such as MIL-STD-882E, which defines safety as freedom from conditions that can cause death, injury, occupational illness, damage to or loss of equipment or property, or damage to the environment. The definition also aligns with the foundational ideas of system security engineering.

Losses, and therefore safety and security, are determined by the system's purpose and context of use. By shifting focus from mitigating general AI risks to preventing specific losses, STPA offers a clearer and more actionable definition of system safety and security.

2. STPA steers the design toward ensuring safety and security. Accidents can result from component failures: instances where a component no longer operates as intended, such as a disk crash in an information system. Accidents can also arise from errors: cases where a component operates as designed but still produces incorrect or unexpected behavior, such as a computer vision model returning the wrong object label. Unlike failures, errors are not resolved through reliability or redundancy but through changes in system design.

A responsibility table is an STPA artifact that lists the controllers that make up a system, along with the responsibilities, control actions, process models, and inputs and feedback associated with each. Table 1 defines these terms and gives examples using an LLM integrated with tools, including a code interpreter running on an organization's internal infrastructure.


Table 1. Notional Responsibility Table for LLM Agent with Tools Example
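Table 1 appears as an image in the original post. As a stand-in, here is a notional sketch of what one row of such a responsibility table might look like as data; the entries paraphrase the discussion in the surrounding text and are illustrative only.

```python
# Notional responsibility-table row for the LLM controller, paraphrasing
# the discussion in the text; not the authors' exact table.
from dataclasses import dataclass

@dataclass
class ResponsibilityEntry:
    controller: str
    responsibilities: list[str]
    control_actions: list[str]
    process_model: str
    inputs_and_feedback: list[str]

llm_row = ResponsibilityEntry(
    controller="LLM agent",
    responsibilities=[
        # The text below discusses this first responsibility as unsupportable:
        "Never produce code that exposes the system to compromise",
        # Table 1 also lists a second, more modest responsibility (not reproduced here).
    ],
    control_actions=["issue search query", "execute script on sandbox"],
    process_model="probabilistic completion of token sequences",
    inputs_and_feedback=["user queries", "search results", "script output"],
)
print(llm_row.controller, llm_row.responsibilities)
```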

Accidents in AI systems can, and have, occurred due to design errors in specifying each of the elements in Table 1. The box below contains examples of each. In all of these examples, none of the system components failed: each behaved exactly as designed. Yet the systems were still unsafe because their designs were flawed.

The responsibility table provides an opportunity to evaluate whether the responsibilities of each controller are appropriate. Returning to the example of the LLM agent, Table 1 leads the analyst to consider whether the control actions, process model, and feedback for the LLM controller enable it to fulfill its responsibilities. The first responsibility, never producing code that exposes the system to compromise, is unsupportable. To fulfill this responsibility, the LLM's process model would need a high level of awareness of when generated code is not secure, so that it could correctly determine when not to issue the execute script command because of a security risk. An LLM's actual process model is limited to probabilistically completing token sequences. Though LLMs are trained to refuse some requests for insecure code, these steps reduce, but do not eliminate, the risk that the LLM will produce and execute a harmful script. Thus, the second responsibility represents a more modest and appropriate goal for the LLM controller, while other system design choices, such as security constraints for the sandbox computer, are necessary to fully prevent the hazard.


Figure 2: Examples of accidents in AI systems that have occurred due to design errors in specifying each of the elements defined in Table 1.

By shifting the focus from individual components to the system as a whole, STPA provides a framework for identifying and addressing design flaws. We have found that glaring omissions are often revealed by even the simple step of designating which component is responsible for each aspect of safety and then evaluating whether that component has the information inputs and available actions it needs to fulfill its responsibilities.

3. STPA helps developers consider holistic mitigation of risks. Generative AI models can contribute to hundreds of different types of harm, from helping malware coders to promoting violence. To combat these potential harms, AI alignment research seeks to develop better model guardrails, either directly instructing models to refuse harmful requests or adding other components to screen inputs and outputs.

Continuing the example from Figure 1 and Table 1, system designers should include alignment tuning of their LLM so that it refuses requests to generate scripts that resemble known patterns of cyberattack. Still, it might not be possible to create an AI system that is simultaneously capable of solving the most difficult problems and incapable of producing harmful content. Alignment tuning can contribute to preventing the hazard, but it cannot accomplish the task on its own. In these cases, STPA steers developers to leverage all of the system's components to prevent the hazards, under the assumption that the behavior of the AI component cannot be fully assured.

Consider the potential mitigations for a security risk, such as the one from the scenario in Figure 1. STPA helps developers consider a wider range of options by revealing ways to adapt the system control structure to reduce or, ideally, eliminate hazards. Table 2 contains some example mitigations grouped according to the DoD's system safety design order of precedence categories. The categories are ordered from most effective to least effective. While an LLM-centric safety approach would focus on aligning the LLM to prevent it from generating harmful commands, STPA suggests a collection of options for preventing the hazard even if the LLM does attempt to run a harmful script. The order of precedence first points to architecture choices that eliminate the problematic behavior as the most effective mitigations. Table 2 describes ways to harden the sandbox to prevent the private information from escaping, such as applying and enforcing principles of least privilege. Moving down through the order of precedence categories, developers could consider reducing the risk by limiting the tools available within the sandbox, screening inputs with a guardrail component, and monitoring activity on the sandbox computer to alert security personnel to potential attacks. Even signage and procedures, such as instructions in the LLM system prompt or user warnings, can contribute to a holistic mitigation of this risk. However, the order of precedence presupposes that these mitigations are likely to be the least effective, pushing developers not to rely solely on human intervention to prevent the hazard.



Category | Example for LLM Agent with Tools

Scenario
An attacker leaves an adversarial prompt on a commonly searched website that gets pulled into the search results. The LLM agent adds all search results to the system context, follows the adversarial prompt, and uses the sandbox to transmit the user's sensitive information to a website controlled by the attacker.

1. Eliminate hazard through design selection
Harden the sandbox to protect against external communication. Steps include applying and enforcing principles of least privilege for LLM agents and the infrastructure supporting and surrounding them when provisioning and configuring the sandboxed environment and allocating resources (CPU, memory, storage, networking, etc.).

2. Reduce risk through design alteration

  • Limit LLM access within the sandbox, for example, to Python interpreters running in virtual environments with a limited set of packages (a minimal sketch of this kind of constrained execution follows the table). Encrypt data at rest and control it using appropriately configured permissions for read, write, and execute actions, following principles of least privilege.
  • Segment, if not isolate, network access, and close unused ports to limit lateral movement and/or external resources that the LLM can leverage.
  • Restrict all network traffic except explicitly allowed source and destination addresses (and ports) for inbound and outbound traffic.
  • Avoid the use of open-ended extensions and employ extensions with granular functionality.
  • Implement strict sandboxing to limit model exposure to unverified data sources. Use anomaly detection techniques to filter out adversarial data.
  • During inference, integrate retrieval-augmented generation (RAG) and grounding techniques to reduce the risk of hallucinations (OWASP LLM04:2025).


3. Incorporate engineered features or devices
Incorporate host, container, network, and data guardrails by leveraging stateful firewalls, IDS/IPS, host-based monitoring, data-loss prevention software, and user-access controls that constrain the LLM using rules and heuristics.

4. Provide warning devices
Automatically notify security, interrupt sessions, or execute preconfigured rules in response to unauthorized or unexpected resource usage or actions (a simple import-screening sketch appears after the discussion below). Triggers could include:

  • Flagging packages or methods in the Python script that attempt OS, memory, or network manipulation
  • Attempts at privilege escalation
  • Attempts at network modification
  • Attempts at data access or manipulation
  • Attempts at data exfiltration, detected via network traffic community deviation (D3FEND D3-NTCD), per-host download-upload ratio analysis (D3FEND D3-PHDURA), and network traffic filtering (D3FEND D3-NTF)


5. Incorporate signage, procedures, training, and protective equipment

  • Add warnings against unauthorized behaviors to the LLM's system prompt.
  • Require user approval for high-impact actions (OWASP LLM06:2025).


Table 2: Design Order of Precedence and Example Mitigations
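Following up on categories 1 and 2 in Table 2, below is a minimal sketch, assuming a POSIX host, of running LLM-generated scripts in a constrained subprocess: CPU, memory, and time limits, a scratch working directory, and a stripped environment. It is illustrative only; real deployments would add the container, network, and least-privilege controls the table describes, which live outside Python and are only noted in comments.

```python
# Sketch of constrained execution for LLM-generated scripts (POSIX only).
import resource
import subprocess
import sys
import tempfile

def _apply_limits():
    # Limit CPU seconds and address space for the child process.
    resource.setrlimit(resource.RLIMIT_CPU, (2, 2))
    resource.setrlimit(resource.RLIMIT_AS, (256 * 1024 * 1024,) * 2)

def run_untrusted(script: str) -> subprocess.CompletedProcess:
    with tempfile.TemporaryDirectory() as workdir:
        return subprocess.run(
            [sys.executable, "-I", "-c", script],  # -I: isolated mode
            cwd=workdir,                           # empty scratch directory
            env={"PATH": "/usr/bin"},              # stripped environment
            preexec_fn=_apply_limits,              # rlimits applied in the child
            capture_output=True, text=True, timeout=5,
        )

# Network egress controls and privilege boundaries still require container
# and firewall configuration (categories 1 and 3 in Table 2).
print(run_untrusted("print(sum(range(10)))").stdout)
```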

Because of their flexibility and capability, controlling the behavior of AI systems in all possible cases remains an open problem. Determined users can often find ways to bypass sophisticated guardrails despite the best efforts of system designers. Further, guardrails that are too strict might limit the model's functionality. STPA allows analysts to think outside the AI components and consider holistic ways to mitigate possible hazards.
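As one concrete example of the engineered features and warning devices from Table 2, the sketch below statically flags LLM-generated Python that imports modules commonly used for OS, network, or process manipulation before the script reaches the sandbox. The module list is an illustrative assumption, and, as discussed above, a heuristic like this reduces risk but cannot serve as the only line of defense.

```python
# Sketch of a warning device: flag suspicious imports in generated Python
# before execution. A static check like this supplements, not replaces,
# sandbox hardening and monitoring.
import ast

SUSPICIOUS_MODULES = {"os", "socket", "subprocess", "ctypes", "shutil"}

def flag_suspicious_imports(script: str) -> set[str]:
    """Return the set of suspicious top-level modules the script imports."""
    flagged = set()
    for node in ast.walk(ast.parse(script)):
        if isinstance(node, ast.Import):
            flagged |= {a.name.split(".")[0] for a in node.names} & SUSPICIOUS_MODULES
        elif isinstance(node, ast.ImportFrom) and node.module:
            flagged |= {node.module.split(".")[0]} & SUSPICIOUS_MODULES
    return flagged

print(flag_suspicious_imports("import socket\nprint('hi')"))  # {'socket'}
```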

4. STPA points to the tests that are necessary to confirm safety. For traditional software, system testers create tests based on the context and inputs the systems will face and the expected outputs. They run each test once, leading to a pass/fail result depending on whether the system produced the correct behavior. The scope for testing is helpfully limited by the duality between system development and assurance (i.e., design the system to do things, and confirm that it does them).

Safety testing faces a different problem. Instead of confirming that the system achieves its goals, safety testing must determine which of all possible system behaviors must be avoided. Identifying these behaviors for AI components presents even greater challenges because of the vast space of potential inputs. Modern LLMs can accept up to 10 million tokens representing input text, images, and potentially other modes, such as audio. Autonomous vehicles and robotic systems have even more potential sensors (e.g., light detection and ranging, or LiDAR), further expanding the range of possible inputs.

In addition to the impossibly large space of potential inputs, there is rarely a single expected output. The utility of outputs depends heavily on the system user and context. It is difficult to know where to begin testing AI systems like these, and, as a result, there is an ever-proliferating ecosystem of benchmarks that measure different elements of their performance.

STPA shouldn’t be an entire answer to those and different challenges inherent in testing AI techniques. Nevertheless, simply as STPA enhances security by limiting the scope of doable losses to these specific to the system, it additionally helps outline the required set of security exams by limiting the scope to the eventualities that produce the hazards specific to the system. The construction of STPA ensures analysts have alternative to evaluate how every command may end in a hazardous system state, leading to a probably massive, but finite, set of eventualities. Builders can hand this record of eventualities off to the check workforce, who can then choose the suitable check circumstances and knowledge to analyze the eventualities and decide whether or not mitigations are efficient.

As illustrated in Table 3 below, STPA clarifies the specific safety and security attributes of a system, along with the proper placement of responsibility for that safety and security, holistic risk mitigation, and the link to testing. This yields a more complete approach to evaluating and enhancing the safety of the notional use case. A secure system, for example, will protect customer privacy based on design choices made to protect sensitive customer information. This design ensures that all components work together to prevent a misdirected or rogue LLM from leaking private information, and it identifies the scenarios that testers must examine to confirm that the design will enforce safety constraints.

Benefit: Creates an actionable definition of safety/security
Application to example: A secure system will not result in a loss of customer privacy. To prevent this loss, the system must protect sensitive customer information at all times.

Benefit: Ensures the right structure to enforce safety/security responsibilities
Application to example: Responsibility for protecting sensitive customer data is broader than the LLM and includes the sandbox computer.

Benefit: Mitigates risks through control structure specification
Application to example: Since even an alignment-tuned LLM might leak information or generate and execute a harmful script, ensure that other system components are designed to protect sensitive customer information.

Benefit: Identifies tests necessary to confirm safety
Application to example: In addition to testing LLM vulnerability to adversarial prompts, test sandbox controls on privilege escalation, communication outside the sandbox, warnings tied to prohibited commands, and data encryption in the event of unauthorized access. These tests should include routine security scans of the host and container/VM using up-to-date signatures/plugins relevant to the system. Security frameworks (e.g., RMF) or guides (e.g., STIG checklists) can assist in verifying that appropriate controls are in place using scripts and manual checks.

Table 3. Summary of STPA Benefits for the Notional Example of Customer Data Management

Preserving Safety in the Face of Increasing AI Complexity

The long-standing trend in AI, and in software generally, is to continually expand capabilities to meet growing user expectations. This often results in increasing complexity, driving more advanced approaches such as multimodal models, reasoning models, and agentic AI. An unfortunate consequence is that confident assurances of safety and security have become increasingly difficult to make.

We have found that applying STPA provides clarity in defining the safety and security goals of AI systems, yielding useful design insights, innovative risk mitigation strategies, and improved development of the tests required to build assurance. Systems thinking proved effective for addressing the complexity of industrial systems in the past, and, through STPA, it remains an effective approach for managing the complexity of current and future information systems.
