
Deterministic sanitization for AI coding tools

5 min read
claude-code · mcp · security · prompt-injection

Claude Code asks permission before every web fetch. The problem is that after your hundredth harmless fetch, you stop reading those prompts.

So I built mcp-safe-fetch, an MCP server that sanitizes web content before it reaches your LLM — stripping prompt injection vectors deterministically, no LLM call needed. Along the way I found it also cuts token usage by ~95% vs WebFetch.

The problem

Most real-world prompt injection in web content relies on hiding instructions where the user can't see them but the LLM can. Hidden HTML elements, zero-width characters between visible words, base64-encoded payloads in data attributes, off-screen positioned text. These are the techniques documented on PayloadsAllTheThings — and they're present on real pages today.

Claude Code's WebFetch isn't defenseless. It runs content through Turndown (HTML → markdown) and a secondary Haiku model that summarizes the result before it reaches your main context. That strips structural HTML junk — scripts, CSS, nav chrome — and reduces what the primary model sees.

But that pipeline wasn't designed as a security boundary. Two problems:

Turndown doesn't catch text-level attacks. Zero-width characters, base64-encoded payloads, and markdown image exfiltration URLs survive HTML-to-markdown conversion. Other vectors like fake LLM delimiters may be partially degraded — Turndown escapes square brackets, HTML parsers can mangle angle-bracket tokens — but the content still reaches the Haiku summarization model in a form that may influence it.

The Haiku summarization layer is itself vulnerable. Using an LLM to filter adversarial content is circular — the summarization model is processing the exact payloads designed to manipulate it. It might follow a hidden "ignore previous instructions" payload instead of filtering it.

And outside Claude Code, the problem is worse. API-level web_fetch, other MCP clients, curl output, cloned repos — raw untrusted content enters the LLM context with no filtering at all.

What it does

mcp-safe-fetch provides deterministic sanitization — regex, cheerio, string processing. No model in the loop, nothing to prompt-inject.

The pipeline runs in 8 stages, operating on the raw HTML first and on the resulting markdown after conversion. That ordering is critical: once HTML has been flattened to markdown text, you lose the DOM structure needed to detect display:none or other computed styles.

User prompt → Claude calls safe_fetch →
  1. Fetch raw HTML
  2. Parse with cheerio
  3. Strip hidden/off-screen/same-color elements
  4. Convert to markdown (turndown)
  5. Strip invisible unicode, normalize
  6. Remove encoded payloads
  7. Neutralize exfiltration URLs
  8. Strip fake LLM delimiters + custom patterns
→ Clean markdown returned to Claude

HTML-level — Hidden elements (display:none, visibility:hidden, opacity:0), off-screen positioning (left:-9999px, clip:rect(0,0,0,0)), same-color text, <script>, <style>, <noscript>, <meta>, comments.
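To make the HTML-level check concrete, here is a simplified sketch of a "visually hidden" heuristic. It's my own illustration, not the tool's code: the real implementation walks the cheerio-parsed DOM, while this only inspects a single element's inline style string.

```javascript
// Heuristic for "visually hidden" inline styles (simplified sketch).
// Lowercase and strip whitespace so "display: none" and "display:none"
// are treated the same.
function isHiddenStyle(style) {
  const s = style.toLowerCase().replace(/\s+/g, "");
  return (
    /display:none/.test(s) ||
    /visibility:hidden/.test(s) ||
    /opacity:0(?:[;"]|$)/.test(s) ||              // opacity:0, but not opacity:0.9
    /(?:left|top):-\d{3,}px/.test(s) ||           // off-screen positioning
    /clip:rect\(0(?:px)?,0(?:px)?,0(?:px)?,0(?:px)?\)/.test(s)
  );
}
```

Elements matching a heuristic like this get dropped before the markdown conversion, which is why the stage must run on the parsed HTML rather than the flattened text.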

Character-level — Zero-width characters, soft hyphens, BOM markers, bidi overrides, control characters. NFKC normalization to collapse fullwidth and homoglyph characters.
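A minimal sketch of the character-level pass (my own illustration; the code point ranges are representative, not necessarily the tool's exact list):

```javascript
// Strip invisible code points, then NFKC-normalize to fold
// fullwidth/compatibility forms back to their ASCII equivalents.
function stripInvisible(text) {
  return text
    // zero-width space/joiners, word joiner, soft hyphen, BOM
    .replace(/[\u200B-\u200D\u2060\u00AD\uFEFF]/g, "")
    // bidi embedding/override/isolate controls
    .replace(/[\u202A-\u202E\u2066-\u2069]/g, "")
    // C0 control characters except \t, \n, \r
    .replace(/[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F]/g, "")
    .normalize("NFKC");
}
```

This is what defeats the "ig​nore" trick: a zero-width space between visible letters disappears, and fullwidth homoglyphs like Ｈｅｌｌｏ collapse to plain Hello.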

Encoded payloads — Base64 strings that decode to instruction-like text. Hex-encoded instruction sequences. Text data URIs.
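The base64 stage can be sketched like this. The instruction-phrase list and the 24-character minimum are illustrative assumptions of mine, not the tool's actual heuristics:

```javascript
// Find long base64-looking runs, decode them, and flag ones that
// decode to printable, instruction-like text. Phrase list is
// illustrative only.
const INSTRUCTION_RE =
  /ignore\s+(all\s+)?(previous|prior)\s+instructions|system\s+prompt|you\s+are\s+now/i;

function findEncodedPayloads(text) {
  const candidates = text.match(/[A-Za-z0-9+/]{24,}={0,2}/g) || [];
  return candidates.filter((c) => {
    const decoded = Buffer.from(c, "base64").toString("utf8");
    // only flag cleanly printable decodes that look like instructions
    return /^[\x20-\x7E\s]*$/.test(decoded) && INSTRUCTION_RE.test(decoded);
  });
}
```

Requiring both a printable decode and an instruction-like phrase is what keeps false positives down: most long alphanumeric runs (hashes, tokens, minified identifiers) decode to binary garbage and are left alone.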

Structural injection — Fake LLM delimiters (<|im_start|>, [INST], <<SYS>>, \n\nHuman:). Markdown image exfiltration URLs. Custom user-defined patterns.
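A sketch of the structural cleanup on the markdown output. The pattern lists are illustrative; the real tool also accepts user-defined patterns from its config:

```javascript
// Remove fake LLM delimiters that try to open a new "turn".
function stripFakeDelimiters(md) {
  return md
    .replace(/<\|im_(start|end)\|>/g, "")          // ChatML tokens
    .replace(/\[\/?INST\]|<<\/?SYS>>/g, "")        // Llama-style delimiters
    .replace(/\n\n(Human|Assistant):/g, "\n\n");   // fake conversation turns
}

// Neutralize markdown image URLs carrying query parameters, which a
// client that auto-fetches images could use to exfiltrate data.
function neutralizeExfilImages(md) {
  return md.replace(/!\[([^\]]*)\]\(\s*https?:\/\/[^)\s]*\?[^)]*\)/g, "![$1](#)");
}
```

Note the asymmetry: plain images without query strings pass through untouched, so ordinary page content survives.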

What it preserves — All visible text, links, code blocks, images. The content you actually fetched the page for.

What I found

To validate the sanitization, I tested 4 real-world sites, running the same URLs and the same tasks through both WebFetch and safe_fetch.

When I tested against PayloadsAllTheThings itself — the prompt injection reference page — safe_fetch caught 3 hidden elements and 4 LLM delimiter patterns. The sanitization worked.

The side effect I didn't expect was the token reduction.

| Site | WebFetch (tokens) | safe_fetch (tokens) | Reduction | Threats caught |
| --- | --- | --- | --- | --- |
| Node.js docs | ~3,400 | ~208 | 94% | 2 hidden elements, 1 off-screen |
| FotMob (news article) | ~42,300 | ~318 | 99% | Clean page |
| PayloadsAllTheThings | ~30,500 | ~307 | 99% | 3 hidden elements, 4 LLM delimiters |
| Express.js | ~2,300 | ~244 | 89% | Clean page |

~95% average reduction vs WebFetch. Zero false positives. All visible page content preserved.

The token savings matter most outside Claude Code — for API-level web_fetch calls, other MCP clients, or any tool that passes raw content into context. Within Claude Code, WebFetch's Haiku summarization already reduces what hits the main model. But the security gap remains: those text-level injection vectors survive Turndown and reach the Haiku model unsanitized.

Three tools

safe_fetch replaces WebFetch. Web pages are always untrusted content, and deterministic sanitization is a stronger security guarantee than LLM-based filtering. Every fetch shows what was stripped:

[safe-fetch] Stripped: 5 hidden elements, 2 off-screen | 284127 → 12720 bytes (219ms)

safe_read is a safe alternative to Read for untrusted files — cloned repos, downloaded archives, vendored dependencies, .cursorrules files from external projects. Your own source code is fine with native Read. This is an underserved attack surface: research published on Promptfoo documented hidden Unicode instructions in markdown and config files that are invisible to humans but interpreted by LLMs.

safe_exec is a safe alternative to Bash for commands that return untrusted content — curl, gh pr view, git log on external repos, npm info. Normal dev commands don't need sanitization.

What it doesn't solve

The philosophy: catch attacks that rely on hidden or invisible content, deterministically. No LLM call, no latency, no false positives on normal pages. This covers a significant category of attacks, not all of them:

  • Semantic injection — malicious instructions written as natural-looking visible text. This requires LLM-level understanding.
  • Image-based injection — steganography or text embedded in images.
  • Novel encoding schemes — attackers will always find new encodings. The tool raises the bar, it doesn't eliminate the category.
  • Legitimate-looking instructions — "Please run npm install malicious-package" in actual documentation. User judgment is the defense.

Setup

npx -y mcp-safe-fetch init

One command sets up Claude Code: it registers the MCP server and adds a deny rule for the built-in WebFetch tool. Restart Claude Code and safe_fetch is your default web fetcher. The server itself works with any MCP client.

Optional config (.mcp-safe-fetch.json) if you want logging, custom patterns, or to tune detection thresholds.
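For a sense of shape, a config could look something like the following. These field names are my own illustrative guesses based on the features mentioned above (logging, custom patterns, detection thresholds), not the tool's documented schema; check the project README for the real options.

```json
{
  "logging": true,
  "customPatterns": ["CONFIDENTIAL-INTERNAL-TAG"],
  "base64MinLength": 24
}
```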


mcp-safe-fetch on GitHub — MIT licensed. Dependencies: cheerio, turndown, zod, and the MCP SDK.