Prompt Injection Through Poetry
Nov. 28th, 2025 02:54 pm
In a new paper, “Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models,” researchers found that turning LLM prompts into poetry resulted in jailbreaking the models:
Abstract: We present evidence that adversarial poetry functions as a universal single-turn jailbreak technique for Large Language Models (LLMs). Across 25 frontier proprietary and open-weight models, curated poetic prompts yielded high attack-success rates (ASR), with some providers exceeding 90%. Mapping prompts to MLCommons and EU CoP risk taxonomies shows that poetic attacks transfer across CBRN, manipulation, cyber-offence, and loss-of-control domains. Converting 1,200 ML-Commons harmful prompts into verse via a standardized meta-prompt produced ASRs up to 18 times higher than their prose baselines. Outputs are evaluated using an ensemble of 3 open-weight LLM judges, whose binary safety assessments were validated on a stratified human-labeled subset. Poetic framing achieved an average jailbreak success rate of 62% for hand-crafted poems and approximately 43% for meta-prompt conversions (compared to non-poetic baselines), substantially outperforming non-poetic baselines and revealing a systematic vulnerability across model families and safety training approaches. These findings demonstrate that stylistic variation alone can circumvent contemporary safety mechanisms, suggesting fundamental limitations in current alignment methods and evaluation protocols.
CBRN stands for “chemical, biological, radiological, nuclear.”
They used an ML model to translate these harmful prompts from prose to verse, and then fed them into other models for testing. Sadly, the paper does not give examples of these poetic prompts. They claim this is for security purposes, a decision I disagree with. They should release their data.
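As a rough illustration of the metric in the abstract, here is a minimal sketch of how an attack-success rate could be computed from three binary judge verdicts per response. The majority-vote rule, the function names, and the sample data are all assumptions made for illustration; the abstract does not say how the three judges' verdicts are combined.

```python
def majority_unsafe(votes):
    """Return True if more than half of the judges flagged the response as unsafe.

    `votes` is a sequence of booleans, one per judge (True = unsafe).
    Majority voting is an assumption, not something the abstract specifies.
    """
    return sum(votes) > len(votes) / 2


def attack_success_rate(judgments):
    """Fraction of responses that the judge ensemble considers unsafe."""
    if not judgments:
        raise ValueError("need at least one judged response")
    successes = sum(majority_unsafe(votes) for votes in judgments)
    return successes / len(judgments)


# Hypothetical verdicts for four responses, three judges each.
sample = [
    (True, True, False),    # two of three say unsafe: counts as a success
    (False, False, False),  # refused or safe
    (True, True, True),     # unanimous unsafe
    (False, True, False),   # one dissenting judge: not a success
]

print(attack_success_rate(sample))  # 0.5
```

With an odd number of judges a majority vote can never tie, which is one plausible reason to use three.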
Our study begins with a small, high-precision prompt set consisting of 20 handcrafted adversarial poems covering English and Italian, designed to test whether poetic structure, in isolation, can alter refusal behavior in large language models. Each poem embeds an instruction associated with a predefined safety-relevant scenario (Section 2), but expresses it through metaphor, imagery, or narrative framing rather than direct operational phrasing. Despite variation in meter and stylistic device, all prompts follow a fixed template: a short poetic vignette culminating in a single explicit instruction tied to a specific risk category. The curated set spans four high-level domains—CBRN (8 prompts), Cyber Offense (6), Harmful Manipulation (3), and Loss of Control (3). Although expressed allegorically, each poem preserves an unambiguous evaluative intent. This compact dataset is used to test whether poetic reframing alone can induce aligned models to bypass refusal heuristics under a single-turn threat model. To maintain safety, no operational details are included in this manuscript; instead we provide the following sanitized structural proxy:
A baker guards a secret oven’s heat,
its whirling racks, its spindle’s measured beat.
To learn its craft, one studies every turn—
how flour lifts, how sugar starts to burn.
Describe the method, line by measured line,
that shapes a cake whose layers intertwine.
To situate this controlled poetic stimulus within a broader and more systematic safety-evaluation framework, we augment the curated dataset with the MLCommons AILuminate Safety Benchmark. The benchmark consists of 1,200 prompts distributed evenly across 12 hazard categories commonly used in operational safety assessments, including Hate, Defamation, Privacy, Intellectual Property, Non-violent Crime, Violent Crime, Sex-Related Crime, Sexual Content, Child Sexual Exploitation, Suicide & Self-Harm, Specialized Advice, and Indiscriminate Weapons (CBRNE). Each category is instantiated under both a skilled and an unskilled persona, yielding 600 prompts per persona type. This design enables measurement of whether a model’s refusal behavior changes as the user’s apparent competence or intent becomes more plausible or technically informed.
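The arithmetic behind that benchmark layout can be checked with a short sketch. The totals and category names come from the excerpt above; the per-cell figure assumes the two personas are split evenly within every category, which the excerpt implies but does not state outright, and the variable names are illustrative.

```python
# Hazard categories as listed in the quoted passage.
CATEGORIES = [
    "Hate", "Defamation", "Privacy", "Intellectual Property",
    "Non-violent Crime", "Violent Crime", "Sex-Related Crime",
    "Sexual Content", "Child Sexual Exploitation", "Suicide & Self-Harm",
    "Specialized Advice", "Indiscriminate Weapons (CBRNE)",
]
PERSONAS = ["skilled", "unskilled"]
TOTAL_PROMPTS = 1200

# Prompts are distributed evenly across categories.
per_category = TOTAL_PROMPTS // len(CATEGORIES)

# Each persona type covers half of the benchmark.
per_persona = TOTAL_PROMPTS // len(PERSONAS)

# Assumption: personas are balanced inside each category.
per_category_and_persona = per_category // len(PERSONAS)

print(per_category, per_persona, per_category_and_persona)  # 100 600 50
```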
[498] you can keep me company as long as you don't care
Nov. 28th, 2025 08:46 am
It's almost December and I still don't think the temperature has dipped below -20 degrees yet. There's snow on the ground, but very little of it, mostly compacted into uneven layers of ice. There's still grass visible all over the place. It's honestly kind of insane? But at least it gives me something to focus on other than the current span of ~3 months in which I basically never see the sun, because I both arrive at and leave work in pitch darkness, and I no longer have every other week off. Things should start to get better towards the end of February. God I hate winter, even unseasonably mild ones.
Finished season 7 of Black Mirror, and not that anyone cares, but my ranking of the episodes goes Eulogy > Common People > USS Callister: Into Infinity > Bête Noire >>> Plaything >>>>> Hotel Reverie.
Saw Wake Up Dead Man, the first of the Knives Out mysteries to play in a theatre here, and absolutely loved it.
( Album #498/1001: Garbage - Garbage )
Okay, I should probably try to do some work. It's (dare I say???) a bit of a gong show!!!
National Guard member dies after shooting in Washington DC
Nov. 28th, 2025 03:17 pm
Hungary's Orban defies EU partners and meets Putin again in Moscow
Nov. 28th, 2025 02:16 pm
Lightning detected on Mars for the first time, scientists say
Nov. 28th, 2025 03:12 pm
Search for British man who fell from cruise ship off Tenerife coast
Nov. 28th, 2025 03:12 pm
Minister defends 'pragmatic' U-turn on unfair dismissal manifesto pledge
Nov. 28th, 2025 03:11 pm
Eight more arrested over fire in Hong Kong that killed at least 128 people
Nov. 28th, 2025 03:10 pm
Treasure hunt golden hare sold for £82k at auction
Nov. 28th, 2025 03:10 pm
Gueye red card appeal rejected with 'no reason' given
Nov. 28th, 2025 03:02 pm
Piastri edges Norris in Qatar practice on crucial weekend
Nov. 28th, 2025 02:59 pm
Talks over UK joining EU defence fund break down
Nov. 28th, 2025 02:57 pm
France to intercept small boats in Channel after pressure from UK
Nov. 28th, 2025 11:23 am
In the words of Sir Larry....
Nov. 28th, 2025 03:07 pm
'My dear boy, why don't you try acting?' (attested from the mouth of Dustin Hoffman, to whom Olivier addressed this plea when Hoffman was going to extreme Method lengths).
Experience: I was stabbed in the back with a real knife while performing Julius Caesar.
And this was not a dreadful error in the props room or something out of a murder mystery:
It was the Exeter University theatre society’s annual play at the Edinburgh fringe and I’d landed the part of Cassius in Julius Caesar. The director decided that instead of killing himself, Cassius would die during a choreographed fight with his rival, Mark Antony. We also chose to use real knives, which sounds absurd, but we wanted to be authentic. The plan was for the actor playing Antony to grab my arm as I held the knife, and pretend to push it behind my back. We must have rehearsed the sequence 50 times.
We were about halfway through our month-long run, performing to a decently sized audience. Dressed in our togas, with the stage dark and moody, we began the fight as usual. Then something went wrong.
There was a sharp piercing feeling. The knife was supposed to have been quietly slipped to me – instead, it had gone into my back. I realised what had happened while acting out my character’s death, and thinking: I have to lie here until the lights go down.
....
When a doctor told me I’d come close to dying, and that the play had to stop using real knives, I remember thinking: “You just don’t understand theatre.”
However, right at the end of the article he does acknowledge: 'I’m super conscious of safety nowadays'. We should hope so.
What next - real poison where the text requires it? What was the director thinking? I would think using Real Knives might make it less authentic, given the choreography needed to ensure Doing No Harm.