Bulwark — five-layer defense against prompt injection

Honest framing, up front

Prompt injection is the #1 risk for AI apps.

It's also not a solved problem — no library can promise 100% protection against an adversary who controls the model's input. Anyone claiming otherwise is selling snake oil. What Bulwark does is apply every robust, well-understood mitigation at once, so the easy attacks fail outright and the hard ones get caught or contained.

Without a shield

✕ A one-line comment in a web page hijacks the summary
✕ Invisible Unicode smuggles instructions past you
✕ The model leaks data through a crafted link or image
✕ One trick is all it takes

With Bulwark

✓ An attacker needs a novel, model-specific jailbreak
✓ …and must defeat input sanitization
✓ …and structural isolation and a hardened prompt
✓ …and output validation — all at the same time

How it works

Five layers around the model.

Untrusted content enters at the top. By the time a summary reaches you, it has been cleaned, scored, isolated, framed as hostile data, and validated on the way out.

Sanitize strip the invisible tricks

Remove what humans can't see but models can read: Unicode Tag characters (ASCII smuggling), bidirectional controls (Trojan Source), zero-width splitters, variation-selector smuggling and control characters; HTML comments, <script> and hidden display:none subtrees; then NFKC-fold confusables and cross-script homoglyphs (Cyrillic/Greek "іgnоrе" → "ignore").

↓

Detect score the intent

Score the cleaned text against dozens of injection signatures across English and several other languages, combined with heuristics using a noisy-OR model. The result can block, flag, or simply report — your call, depending on how strict you want to be.

↓

Spotlight make it unmistakably data

Wrap the content in a random nonce boundary so a fake </close> tag can't escape it, and optionally data-mark or base64-encode it. The model is shown the content as clearly delimited data, never as instructions.

↓

Harden frame the content as hostile

A strict system prompt, a secret canary token, and a "sandwich" reminder after the content. The model is explicitly told the material is untrusted data that must never be obeyed — reinforced both before and after the payload.

↓

★

Your model

OpenAI, Anthropic, a local model — anything. Bulwark is model-agnostic and adds zero required dependencies; it wraps whatever summarizer you bring.

↓

Validate inspect the reply

Normalize the model's reply and inspect it: did the secret canary leak? Did the nonce boundary leak? Any image, link or data-URL exfiltration? Tell-tale signs of compliance with a hidden instruction? Redact or block before the summary ever reaches you — and hand back a full report of everything that was caught.

Threat model

The specific attacks it targets.

⌨

Direct instruction injection

"Ignore your instructions and…" hidden in page text, a comment, or alt text. Caught by detection and neutralized by spotlighting + hardening.

◌

Invisible-character smuggling

Instructions encoded in Unicode Tag characters or zero-width joiners that render as nothing. Stripped before the model ever sees them.

⇄

Trojan Source & homoglyphs

Bidi controls that reorder text, and look-alike letters from other scripts. Folded back to their plain form during sanitization.

↥

Data exfiltration on output

A reply that tries to leak data through a crafted markdown link, image URL, or data-URL. Detected and redacted by output validation before display.

In Searxly

Built into the page-content guard.

Whenever Searxly AI reads the text of a page or a search result to ground an answer, that text passes through Bulwark first. It's the same defense, applied at the exact moment untrusted web content meets the model — so the agentic features stay useful without becoming an open door.

License

MIT — open source, free to inspect and reuse

Languages

Python · TypeScript · Swift

Dependencies

Zero required — wraps any model

Approach

Defense in depth — Microsoft spotlighting, canary tokens, output validation