These are two attacks against the system components surrounding LLMs:
We propose that LLM Flowbreaking, following jailbreaking and prompt injection, joins as the third on the growing list of LLM attack types. Flowbreaking is less about whether prompt or response guardrails can be bypassed, and more about whether user inputs and generated model outputs can adversely affect these other components in the broader implemented system.
[…]
When confronted with a sensitive topic, Microsoft 365 Copilot and ChatGPT answer questions that their first-line guardrails are supposed to stop. After a few lines of text they halt—seemingly having “second thoughts”—before retracting the original answer (also known as Clawback), and replacing it with a new one without the offensive content, or a simple error message. We call this attack “Second Thoughts.”
[…]
After asking the LLM a question, if the user clicks the Stop button while the answer is still streaming, the LLM will not engage its second-line guardrails. As a result, the LLM will provide the user with the answer generated thus far, even though it violates system policies.
In other words, pressing the Stop button halts not only the answer generation but also the guardrails sequence. If the stop button isn’t pressed, then ‘Second Thoughts’ is triggered.
What’s interesting here is that the model itself isn’t being exploited. It’s the code around the model:
By attacking the application architecture components surrounding the model, and specifically the guardrails, we manipulate or disrupt the logical chain of the system, taking these components out of sync with the intended data flow, or otherwise exploiting them, or, in turn, manipulating the interaction between these components in the logical chain of the application implementation.
In modern LLM systems, there is a lot of code between what you type and what the LLM receives, and between what the LLM produces and what you see. All of that code is exploitable, and I expect many more vulnerabilities to be discovered in the coming year.
Bruce Schneier
Go to bruce schneier