Cloud-based AI costs you your privacy. AI platforms leak data, and it's not a hypothetical concern but something already happening (see the exposed DeepSeek database and the OmniGPT user data leak). Every prompt is a potential data leak: personal details, API keys, proprietary business data… Once you send it, you lose control over it.
Fortunately, small local LLMs are rapidly improving. Models with 0.1B to 10B parameters can run on your own devices, from a phone to a laptop. Unfortunately, they often lack the raw power of their cloud-based cousins.
The dilemma is obvious:
- Use Cloud AI and risk our privacy.
- Use Local AI and lose the smarts.
But what if we could get the best of both? How can we use the world’s most powerful AI without handing over our secrets?
Why The Obvious Fix Doesn’t Work
The usual fix is redaction, or "masking." You'll find this feature in many observability platforms. Langfuse, a YC-backed prompt engineering platform, for example, offers masking based on regular expressions (regex). Many open-source tools on GitHub follow the same pattern.
Regex is fast, but it’s also rigid and brittle:
- Global Incompatibility: A regex pattern that works for a Canadian Social Insurance Number is useless for a US Social Security Number or a UK National Insurance Number. Phone numbers, postal codes, and ID formats vary wildly across countries, so no single set of rules covers every country and service; a universal rule set is impossible (see the sketch after this list).
- Unstructured Secrets: How do you write a regex for a password? Or an API key that doesn't have a standard prefix like sk-? Many services, especially newer startups, have unpredictable key formats. Plaintext credentials and other unstructured secrets are simply invisible to regex.
- Constant Maintenance: The digital world is always changing. New platforms emerge, and key formats evolve. Maintainers have to constantly, tediously update the rule-based system.
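To make the brittleness concrete, here is a minimal sketch with made-up inputs: a US-centric SSN pattern silently misses a UK National Insurance number and an unstructured credential.

```python
import re

# Hypothetical, US-centric pattern for a Social Security Number.
US_SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

prompts = [
    "My SSN is 123-45-6789, please update my file.",    # US format: caught
    "My National Insurance number is QQ 12 34 56 C.",   # UK format: missed
    "Use my API key hunter2-prod-key for the deploy.",  # unstructured secret: missed
]

for text in prompts:
    print(bool(US_SSN.search(text)), "->", text)
```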
Why [REDACTED] Ruins the Conversation
Even perfect regex produces placeholders that break context. Most tools simply replace sensitive data with a generic placeholder like [REDACTED].
Imagine a lawyer asking an AI to analyze a legal document:
Original Prompt: “Summarize the dispute between John Doe and Jane Smith regarding the property at 123 Main St. John’s wife, Mary Doe, is also a witness.”
Dumb Redaction: “Summarize the dispute between [REDACTED] and [REDACTED] regarding the property at [REDACTED]. [REDACTED]’s wife, [REDACTED], is also a witness.”
The AI is now lost. Who is married to whom? Are the first two [REDACTED] tokens the same person?
A better approach, sometimes called semantic masking, labels each entity with its role:
Smarter Redaction: “Summarize the dispute between [PERSON_1] and [PERSON_2] regarding the property at [LOCATION_1]. [PERSON_1]’s wife, [PERSON_3], is also a witness.”
Here, the AI understands that [PERSON_1] and [PERSON_3] are distinct individuals with a relationship, and [LOCATION_1] is a place. The AI needs to understand these roles to solve complex problems.
While basic NER tools can identify a name as a [PERSON], they still miss the crucial context. The best solution understands granular roles and the relationships between entities, something that may require an LLM's intelligence.
Consider this:
Ideal Redaction: “Please inform [DOCTOR_1_NAME] that the treatment plan for [PATIENT_1_NAME] (ID: [PATIENT_1_ID]) has been approved by his guardians, [GUARDIAN_1_NAME] and [GUARDIAN_2_NAME]. A follow-up is scheduled with [GUARDIAN_1_NAME] at [GUARDIAN_1_PHONE].”
With this relational context, the LLM understands not just individual entities, but the social structure connecting them. AI needs this richness to comprehend and act upon complex human interactions.
Using a Local LLM as a Privacy Filter
Since rules are too brittle, the answer is to use another AI as a gatekeeper. A small, trusted LLM running locally acts as a “privacy filter” before your prompt ever reaches the powerful cloud AI.
The workflow is fully automated:
- You write a prompt containing sensitive data.
- A local LLM intercepts the prompt, identifies the sensitive parts, and generates a “mask map” that replaces them with semantic placeholders (e.g., {"Jensen Huang": "${CLIENT_1_NAME}"}).
- Only the anonymized, placeholder-filled prompt is sent to the remote cloud AI.
- The cloud AI processes the request and sends back a response that includes the placeholders (e.g., “The plan for ${CLIENT_1_NAME} looks solid.”).
- The local layer intercepts the response and uses the original mask map to replace the placeholders with the real data.
- You receive a final, complete response, and your secrets never left your machine.
The result: the powerful cloud AI gets the semantic context it needs, while your secrets stay on your machine.
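The mask/unmask round trip is simple enough to sketch. This is not PromptMask's internal code, just a minimal illustration of the mask-map idea, assuming the cloud model echoes the placeholders back verbatim:

```python
# Illustrative mask map; in PromptMask this is generated by the local model.
mask_map = {"Jensen Huang": "${CLIENT_1_NAME}", "A12345678": "${CLIENT_1_PASSPORT_ID}"}

def mask(text: str, mapping: dict) -> str:
    # Replace every secret with its semantic placeholder before the prompt leaves the machine.
    for secret, placeholder in mapping.items():
        text = text.replace(secret, placeholder)
    return text

def unmask(text: str, mapping: dict) -> str:
    # Restore the real values in the cloud model's response.
    for secret, placeholder in mapping.items():
        text = text.replace(placeholder, secret)
    return text

prompt = "Draft an email to Jensen Huang confirming passport A12345678 is on file."
safe_prompt = mask(prompt, mask_map)   # this is all the cloud AI ever sees
cloud_reply = "Done. The email to ${CLIENT_1_NAME} is ready to send."  # simulated response
print(unmask(cloud_reply, mask_map))   # -> "Done. The email to Jensen Huang is ready to send."
```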
But Can Small Local Models Really Do the Job?
Don’t I need a monster GPU with tons of VRAM for that?
For the specific task of identifying secrets, a small, efficient model is surprisingly effective, especially with carefully crafted prompts. You don't need a massive model to spot a password. Our benchmarks show that models as small as 0.3B or 1.5B parameters can perform this task effectively on modest hardware.
No capable local hardware at all? No problem. You can use a remote AI provider that you truly trust as your masking engine.
How did you adapt prompt engineering for a small-sized LLM on affordable devices?
You can't treat small models like GPT-4, and you can't treat a mobile phone like a DGX B200. Prompting has to be tuned around a few hard truths:
- Complex instructions fail. Small LLMs get confused easily. The most reliable method is few-shot prompting, showing clear examples of what to do, instead of writing a long list of rules.
- Limited VRAM. Generating a very long output on an affordable GPU can cause out-of-memory errors.
- Limited compute power. Longer output, longer wait.
Hardware limitations led to a core design principle: the local model's output must be as short as possible. The first consequence is to rule out thinking/reasoning models.
You might think the local model should just rewrite the user’s input, but that’s a bad idea. A long input produces a long output, making it slow, resource-heavy, and prone to hallucinations that distort the original prompt’s meaning.
I was just reviewing my travel plans for next month, and I realized I need to update my contact information. My name is ${USER_NAME}, and I’m using the passport with the ID ${USER_PASSPORT_ID}. I’m trying to book a flight to Rome, and the website keeps giving me an error when I try to proceed to payment. It says there’s an issue with my personal details, but I’ve double-checked everything. Could you please help me troubleshoot this? …
It's still possible to segment the user's input into sentences and feed them to the local model one by one, but redaction consistency is lost that way.
In fact, the local model doesn't need to rewrite the whole user message at all. It just needs to generate a simple JSON map of the secrets it finds. For the example above, the entire output from the local model can be just this:
{"Jensen Huang":"${USER_NAME}","A12345678":"${USER_PASSPORT_ID}"}
This output is fast to generate and eliminates the risk of altering the original prompt. It scales with the number of secrets found, not with the length of the full input, which is what makes small models viable here. Generating a JSON masking map also simplifies recovering the original data later.
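What does the instruction to the local model look like? PromptMask's actual prompt isn't reproduced here; the sketch below is only a hedged illustration of the few-shot style described above, with invented example names and formats.

```python
# Illustrative few-shot instruction for a small local model (not PromptMask's actual prompt).
MASKING_PROMPT = """\
Find private data in the user text. Output ONLY a JSON object mapping each secret
to a placeholder like ${ROLE_FIELD}. If nothing is private, output {}.

Text: My name is Alice Chen and my card number is 4111 1111 1111 1111.
JSON: {"Alice Chen": "${USER_NAME}", "4111 1111 1111 1111": "${USER_CARD_NUMBER}"}

Text: What's the capital of France?
JSON: {}

Text: %s
JSON: """

user_text = "I'm Jensen Huang, passport A12345678, booking a flight to Rome."
prompt_for_local_model = MASKING_PROMPT % user_text
# The small model's entire completion should be the JSON map, e.g.
# {"Jensen Huang": "${USER_NAME}", "A12345678": "${USER_PASSPORT_ID}"}
```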
Can you prove the effectiveness?
Of course, no model or instructional prompt is perfect. To validate the approach, I built an evaluation framework. Here's how it works:
- Create a Test Set: I had SOTA models (Grok-4, claude-sonnet-4, etc.) synthesize a dataset of realistic prompts containing private data, along with the "ground truth" answers of what should be redacted. It's uploaded to Hugging Face: cxumol/privacymask.
- Run the Test: I evaluated candidate local models against this test set.
- Measurement: I tracked key metrics:
- Recall (True Positives): Did it find the secrets?
- False Positives: Did it redact non-secrets?
- False Negatives: Did it miss secrets?
- Errors: Did it refuse, hallucinate, timeout, or break the output format?
The goal is to choose a model with the lowest error rate and the highest recall. The full benchmark results are here, showing that even tiny 0.3B and 1.5B models are quite capable.
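If you want to reproduce the idea, the scoring can be as simple as comparing the set of secrets a model reported against the ground truth. The sketch below is an assumption about how such a comparison could look, not the benchmark's exact code:

```python
def score(predicted: set, ground_truth: set) -> dict:
    # True positives: secrets the model actually found.
    tp = len(predicted & ground_truth)
    # False positives: non-secrets it redacted anyway.
    fp = len(predicted - ground_truth)
    # False negatives: secrets it missed.
    fn = len(ground_truth - predicted)
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return {"recall": recall, "false_positives": fp, "false_negatives": fn}

found = {"Jensen Huang", "gpt-4o"}     # model output ("gpt-4o" is a false positive)
truth = {"Jensen Huang", "A12345678"}  # ground truth from the dataset
print(score(found, truth))             # -> {'recall': 0.5, 'false_positives': 1, 'false_negatives': 1}
```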
I highly encourage you to run your own benchmarks with a dataset and prompt that reflect your specific use case, so you can find the optimal model and prompt for your hardware and your needs.
Implementation: PromptMask
These ideas became PromptMask, an open-source, local-first privacy layer for LLMs. It works silently and transparently in the background, so your workflow doesn't change.
It offers two main integration methods:
1. For Python Developers
PromptMask provides a drop-in replacement for the official OpenAI Python SDK. Integrating it is as simple as changing one line of code:
# Before: from openai import OpenAI
from promptmask import OpenAIMasked as OpenAI
# The client now automatically handles masking and unmasking.
client = OpenAI()
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "user", "content": "My user ID is johndoe and my phone number is 456-7890. Please help me write an application letter."}
]
)
# The response may contain your raw secrets, but the remote LLM never saw them.
print(response.choices[0].message.content)
It supports both non-streaming and streaming requests, unmasking the response on the fly.
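Assuming the drop-in client mirrors the official SDK's streaming interface, a streaming call would look like the sketch below; the unmasking happens chunk by chunk as the text arrives:

```python
from promptmask import OpenAIMasked as OpenAI

client = OpenAI()
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "My user ID is johndoe. Draft a short bio for me."}],
    stream=True,
)
# Chunks arrive with the placeholders already swapped back to the real values.
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```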
2. For General Users
If you don't write Python, the local API gateway is for you. Just install and run it:
pip install "promptmask[web]"
promptmask-web
This starts a local server that acts as a proxy. Point any existing OpenAI-compatible application (like a desktop client or a custom script) to the local endpoint (http://localhost:8000/gateway/v1/chat/completions), and PromptMask will automatically secure your data before forwarding it to the actual cloud AI service.
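For instance, even the plain official Python SDK can be routed through the gateway by changing only the base URL. The path below is derived from the endpoint above; how the gateway handles your upstream API key is an assumption to check against the project docs.

```python
from openai import OpenAI  # the unmodified official SDK this time

# Point any OpenAI-compatible client at the local PromptMask gateway.
# Assumes OPENAI_API_KEY is set; upstream provider configuration is handled by the gateway.
client = OpenAI(base_url="http://localhost:8000/gateway/v1")
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "My phone number is 456-7890. Write me a reminder note."}],
)
print(response.choices[0].message.content)
```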
Conclusion: We Can Have Both Smarts and Privacy
You don’t have to choose between powerful AI and your privacy. We can, and should, have both. By using a trusted local model as an intelligent privacy filter, we can harness the cloud’s capabilities without surrendering control of our sensitive information.
PromptMask is my implementation of this vision. It's open-source, easy to integrate, and puts you back in control of your data.
Give it a try, and let’s make AI both smart and safe.