Priveil: pseudonymisation for Australian financial data

Introducing Priveil — a pragmatic tool for detecting and replacing PII in financial text. Not anonymisation. Not a silver bullet. But better than nothing.



GitHub — mitchelllisle/priveil ↗ · pip install priveil · Python · FastAPI · MIT

Before I talk about what Priveil does, I want to be honest about what it doesn’t do — because that distinction matters more than any feature.

Priveil does not anonymise data.

It pseudonymises it. Those two words sound similar, but the difference is the difference between a locked door and an open one with a sign saying “nothing to see here.” True anonymisation — the kind that withstands an adversary with auxiliary information, today and in ten years — is an extremely hard problem. The only approach that comes with a mathematical guarantee is differential privacy, and differential privacy works on aggregations, not on text. If you need data that is safe to publish without downstream controls, no pattern-matching tool will get you there.

With that said: most systems in practice aren’t trying to publish to the world. They’re trying to keep PII out of logs, reduce exposure when data crosses trust boundaries, improve compliance posture, and stop names and Tax File Numbers showing up in Slack. For those purposes, a good pseudonymisation service is genuinely useful — and that’s what Priveil is.

Why anonymisation is hard

The privacy research literature has established three reasons why find-and-replace approaches can’t produce truly anonymous data:

Data is more identifying than it appears. A name and a postcode together uniquely identify most people. A sequence of transactions, a writing style, a combination of fields that each look innocuous — any of these can be as identifying as a name. You can’t enumerate what an attacker might use, so you can’t enumerate what to remove.

Auxiliary data is an unknown variable. Information that looks private may already be public for specific individuals: politicians, athletes, executives. Data that’s safe today may become identifying after an unrelated breach. A pseudonymisation scheme that doesn’t account for what an attacker already knows provides no robust guarantee, and it only needs to be wrong once.

Attacks improve over time. AI-assisted reconstruction, linkage attacks, and re-identification techniques are improving. The 2016 Australian Medicare dataset that was publicly released and then withdrawn is a case in point: patterns that appeared safe turned out to be linkable. Mitigating only known attacks isn’t enough.

I’ve written about this at more length in Too Unique to Hide and Database Reconstruction Attacks.

What Priveil actually is

Priveil is a pseudonymisation service built for Australian financial services contexts. It runs as a FastAPI service and wraps Microsoft Presidio with a set of purpose-built recognisers for Australian identifiers — Tax File Numbers, Medicare numbers, BSBs, ABNs, ACNs, and Australian phone formats — each with checksum validation where the issuing authority publishes an algorithm.

Entity             Description                 Validation
AU_TFN             Tax File Number             ATO mod-11 checksum
AU_MEDICARE        Medicare card               DVA checksum
AU_ABN             Australian Business Number  ATO mod-89
AU_ACN             Australian Company Number   ASIC complement-of-10
AU_BSB             Bank State Branch           Format
AU_ACCOUNT_NUMBER  Bank account                Requires BSB context
AU_PHONE           Mobile / landline           04XX, +61 4XX, STD

Standard Presidio types (PERSON, EMAIL_ADDRESS, CREDIT_CARD, LOCATION, DATE_TIME) are detected alongside these.
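
To make the checksum column in the table above concrete, here is a minimal sketch of three of those checks in plain Python. It is not Priveil’s recogniser code: the function names are invented for illustration, the weights are the publicly documented ones, only 9-digit TFNs are handled, and Medicare is left out.

```python
# Illustrative only: hypothetical helpers, not Priveil's recognisers.
TFN_WEIGHTS = (1, 4, 3, 7, 5, 8, 6, 9, 10)
ABN_WEIGHTS = (10, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19)
ACN_WEIGHTS = (8, 7, 6, 5, 4, 3, 2, 1)


def _digits(value):
    """Keep only ASCII digits so '123 456 782' and '123-456-782' behave alike."""
    return [int(ch) for ch in value if ch in "0123456789"]


def is_valid_tfn(value):
    """9-digit TFN: the weighted sum must be divisible by 11."""
    d = _digits(value)
    if len(d) != 9:
        return False
    return sum(w * x for w, x in zip(TFN_WEIGHTS, d)) % 11 == 0


def is_valid_abn(value):
    """ABN: subtract 1 from the first digit, then the weighted sum must be divisible by 89."""
    d = _digits(value)
    if len(d) != 11:
        return False
    d[0] -= 1
    return sum(w * x for w, x in zip(ABN_WEIGHTS, d)) % 89 == 0


def is_valid_acn(value):
    """ACN: the ninth digit is the complement to 10 of the weighted sum's last digit."""
    d = _digits(value)
    if len(d) != 9:
        return False
    remainder = sum(w * x for w, x in zip(ACN_WEIGHTS, d[:8])) % 10
    return (10 - remainder) % 10 == d[8]


print(is_valid_tfn("123 456 782"))     # the TFN used in the examples in this post
print(is_valid_abn("51 824 753 556"))
print(is_valid_acn("004 085 616"))
```

A checksum only says a number is well formed, not that it was ever issued, which is one reason format-only entities such as AU_BSB need surrounding context to be trusted.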


Three endpoints

/detect — find what’s there

curl -X POST http://localhost:8000/detect \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Jane Smith TFN 123 456 782, BSB 062-000, jane@bank.com.au",
    "mode": "judge"
  }'

{
  "entities": [
    { "text": "Jane Smith",       "entity_type": "PERSON",        "sensitivity": "high",     "score": 0.85 },
    { "text": "123 456 782",      "entity_type": "AU_TFN",        "sensitivity": "critical", "score": 1.0  },
    { "text": "062-000",          "entity_type": "AU_BSB",        "sensitivity": "high",     "score": 0.85 },
    { "text": "jane@bank.com.au", "entity_type": "EMAIL_ADDRESS", "sensitivity": "medium",   "score": 1.0  }
  ],
  "input_hash": "sha256:..."
}

The mode field controls whether detections are passed through an LLM to remove false positives ("judge") or returned raw ("fast"). The input_hash is a SHA-256 audit trail of the original text — useful when you want to prove you processed a document without storing it.
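
As a rough illustration of the audit-trail idea, the sketch below recomputes a hash for a document you still hold and compares it with a recorded one. The exact recipe Priveil uses is not specified in this post, so the UTF-8 encoding, the lack of any normalisation, and the hex digest after the sha256: prefix are all assumptions, and both function names are hypothetical.

```python
import hashlib
import hmac


def input_hash(text):
    """Assumed recipe: SHA-256 over the UTF-8 bytes, hex-encoded, 'sha256:' prefix."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"sha256:{digest}"


def matches_record(text, recorded_hash):
    """True if this text produces the hash stored in the audit record."""
    return hmac.compare_digest(input_hash(text), recorded_hash)


record = input_hash("Jane Smith TFN 123 456 782")
print(matches_record("Jane Smith TFN 123 456 782", record))
print(matches_record("Jane Smith TFN 123 456 783", record))
```

Keep in mind that a hash of a short, low-entropy string (a bare TFN, say) can be brute-forced, so the hash is evidence of processing rather than a safe stand-in for the text.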

mode="judge" sends raw text to your LLM provider. The un-redacted input — TFNs, Medicare numbers, names, account details — goes to whatever model you’ve configured before any pseudonymisation happens. If that model is hosted by Anthropic or OpenAI, your most sensitive data just left your infrastructure. More on this below.

/anonymise — replace what you found

curl -X POST http://localhost:8000/anonymise \
  -H "Content-Type: application/json" \
  -d '{"text": "Jane Smith TFN 123 456 782", "mode": "judge"}'

{
  "anonymised_text": "<PERSON> TFN ***-***-***",
  "entity_map": {
    "Jane Smith":  "<PERSON>",
    "123 456 782": "***-***-***"
  }
}

Replacements use sensible defaults by entity type (TFNs become ***-***-***, credit cards get last-four masking, locations become <LOCATION>), but every operator is overridable per-request:

{
  "text": "Contact Jane Smith on 0412 345 678",
  "operator_overrides": { "PERSON": "redact", "AU_PHONE": "mask" }
}

Available operators: replace, mask, redact, hash.
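
To give a feel for what the four operator names could mean, here is a small self-contained sketch. It is not Priveil’s or Presidio’s implementation: the function names, the span format, and the "replace" fallback are assumptions, and this mask keeps the original separators rather than producing the ***-***-*** shape shown earlier.

```python
import hashlib


# Hypothetical stand-ins for the four operator names; not Priveil's code.
def op_replace(span, entity_type):
    return f"<{entity_type}>"  # swap the span for its entity label


def op_mask(span, entity_type):
    # Star out letters and digits, keep spaces and punctuation.
    return "".join("*" if ch.isalnum() else ch for ch in span)


def op_redact(span, entity_type):
    return ""  # drop the span entirely


def op_hash(span, entity_type):
    return hashlib.sha256(span.encode("utf-8")).hexdigest()


OPERATORS = {"replace": op_replace, "mask": op_mask,
             "redact": op_redact, "hash": op_hash}


def pseudonymise(text, entities, overrides=None):
    """entities: non-overlapping (start, end, entity_type) spans."""
    overrides = overrides or {}
    # Work right to left so earlier offsets stay valid as the text changes length.
    for start, end, entity_type in sorted(entities, reverse=True):
        operator = OPERATORS[overrides.get(entity_type, "replace")]
        text = text[:start] + operator(text[start:end], entity_type) + text[end:]
    return text


text = "Contact Jane Smith on 0412 345 678"
name_at = text.index("Jane Smith")
phone_at = text.index("0412 345 678")
entities = [(name_at, name_at + len("Jane Smith"), "PERSON"),
            (phone_at, phone_at + len("0412 345 678"), "AU_PHONE")]

print(pseudonymise(text, entities))
# Contact <PERSON> on <AU_PHONE>
print(pseudonymise(text, entities, {"PERSON": "redact", "AU_PHONE": "mask"}))
# Contact  on **** *** ***
```

An unsalted hash of a phone number or TFN is easy to reverse by enumeration, so treat hash as a consistency aid, not as protection.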

One note on the entity_map: it records original PII spans as keys and must be treated as sensitive data. It’s useful for audits, but it’s not a reversible index — multiple spans may collapse to the same label, so it doesn’t reconstruct the original document on its own.
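
The collapse described above is easy to see with a toy map. The sketch below inverts an entity_map shaped like the one in the response and lists labels that more than one original maps to; "John Citizen" and the helper names are made up for illustration.

```python
from collections import defaultdict

# Same shape as the entity_map in the response above; the second name is invented.
entity_map = {
    "Jane Smith": "<PERSON>",
    "John Citizen": "<PERSON>",
    "123 456 782": "***-***-***",
}


def invert(mapping):
    """Group original spans by the replacement they were given."""
    inverse = defaultdict(list)
    for original, replacement in mapping.items():
        inverse[replacement].append(original)
    return dict(inverse)


def ambiguous_labels(mapping):
    """Replacements that stand for more than one original span."""
    return {label: originals
            for label, originals in invert(mapping).items()
            if len(originals) > 1}


print(ambiguous_labels(entity_map))
# {'<PERSON>': ['Jane Smith', 'John Citizen']}
```

Given only the pseudonymised text and this map, there is no way to tell which <PERSON> was which, yet anyone holding the map still learns every original value.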

/assess — risk profile a document

The most interesting endpoint. Backed by an LLM judge, it produces a risk profile: overall sensitivity tier, applicable Australian regulatory frameworks, and handling guidance.

curl -X POST http://localhost:8000/assess \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Applicant Jane Smith TFN 123 456 782. BSB 062-000.",
    "context": "Australian home loan application"
  }'

{
  "overall_sensitivity": "critical",
  "risk_summary": "Contains TFN and BSB — highest regulatory exposure",
  "categories": ["identity", "financial"],
  "regulatory_flags": ["Privacy Act s16B", "ATO data standards"],
  "recommended_handling": "Encrypt at rest, restrict to need-to-know, purge after 90 days",
  "entity_breakdown": [
    { "entity_type": "AU_TFN", "sensitivity": "critical", "count": 1 },
    { "entity_type": "AU_BSB", "sensitivity": "high",     "count": 1 }
  ]
}

This is the endpoint that answers “I found some PII — but how worried should I actually be, and what do I need to do about it?” It understands Australian context: Privacy Act obligations, ATO data standards, ASIC requirements.
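
One way a caller might act on that response is to gate what happens next on the sensitivity tier. The sketch below is a hypothetical consumer, not part of Priveil: the tier ordering, the existence of a "low" tier, and the function names are assumptions, while the field names come from the example response above.

```python
# Assumed ordering; "medium", "high" and "critical" appear in this post, "low" is a guess.
TIERS = ("low", "medium", "high", "critical")


def tier_rank(tier):
    return TIERS.index(tier)


def within_ceiling(assessment, ceiling="medium"):
    """True if the document's overall tier is at or below the ceiling."""
    return tier_rank(assessment["overall_sensitivity"]) <= tier_rank(ceiling)


def entity_types_above(assessment, ceiling="medium"):
    """Entity types whose own tier exceeds the ceiling."""
    return [entry["entity_type"]
            for entry in assessment["entity_breakdown"]
            if tier_rank(entry["sensitivity"]) > tier_rank(ceiling)]


assessment = {
    "overall_sensitivity": "critical",
    "entity_breakdown": [
        {"entity_type": "AU_TFN", "sensitivity": "critical", "count": 1},
        {"entity_type": "AU_BSB", "sensitivity": "high", "count": 1},
    ],
}

print(within_ceiling(assessment))        # False
print(entity_types_above(assessment))    # ['AU_TFN', 'AU_BSB']
```

Because the assessment comes from an LLM, a gate like this is best treated as one signal alongside the deterministic detections, not as the sole control.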


The AI provider problem

Priveil’s LLM judge is genuinely useful: it removes false positives that pure pattern-matching can’t catch. But it creates a problem that deserves a straight answer. mode="judge" and /assess send your raw, un-redacted text to whatever LLM provider you’ve configured. That text may contain TFNs, Medicare numbers, bank account details, and full names. If your judge is anthropic:claude-sonnet-4-6 or openai:gpt-4o, that data goes to a commercial third-party API, where it is processed on the provider’s infrastructure and subject to their data retention policies and their own disclosure obligations.

This is the central irony. You’re using Priveil to comply with the Privacy Act and ATO data standards — and if you configure it carelessly, you’ve handed that same data to a party you may not be authorised to share it with. Regulations don’t stop applying because the data was in a JSON request body.

What you should actually do:

For real regulated data, run the judge on a self-hosted or locally-run model. Priveil supports any OpenAI-compatible endpoint via PRIVEIL_JUDGE_BASE_URL, which means you can point it at:

  • Ollama running locally — llama3, mistral, or any model with reasonable instruction-following
  • A self-hosted vLLM or llama.cpp server
  • An internal enterprise gateway in front of a cloud model, with appropriate data processing agreements, training opt-outs, and regional controls in place

If you must use a cloud provider, you need three things confirmed before you do: a data processing agreement (DPA) covering this category of data, training on your inputs disabled, and explicit authorisation under your organisation’s information security policy. Without all three, you are likely breaching the very obligations Priveil is supposed to help with.

mode="fast" keeps everything local. No LLM involvement means no PII egress beyond Priveil itself. You’ll get more false positives, but for many preprocessing workflows — especially where a human reviewer sees the output — that’s an acceptable tradeoff. For batch pipelines that feed into downstream systems, fast mode plus a review step is often the right architecture.

If you skip the judge entirely (mode="fast" everywhere, no /assess calls), Priveil never makes an outbound network request for your data. That’s a legitimate and often correct deployment.
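
If you do enable the judge, a start-up check can catch a configuration that would send text off the machine. The sketch below is a hypothetical guard, not something Priveil ships: it only inspects the PRIVEIL_JUDGE_BASE_URL value, counts nothing but loopback addresses as local, and assumes an unset base URL means the provider’s public API is used.

```python
import ipaddress
from urllib.parse import urlparse


def judge_stays_local(base_url):
    """True only when the judge endpoint is on this machine (localhost or loopback IP)."""
    if not base_url:
        return False  # assumption: no base URL means the provider's hosted API
    host = urlparse(base_url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # A DNS name: cannot be judged without resolving it, so fail closed.
        return False


print(judge_stays_local("http://localhost:11434/v1"))   # True
print(judge_stays_local("https://api.openai.com/v1"))   # False
```

A self-hosted vLLM server on another internal host would fail this check by design; widening it to an allow-list of internal hostnames is a policy decision for whoever owns the deployment.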


MCP server

Priveil also ships an MCP server, which means you can wire it directly into Claude Desktop, Cursor, or any other MCP-aware client.

Stop before you do this with real data. When you paste a document into Claude Desktop, that document — raw, with whatever PII it contains — is sent to Anthropic as part of your chat. That’s true regardless of what Priveil does afterwards. The MCP integration is genuinely useful for working with synthetic or already-pseudonymised data, or for personal/development workflows where you understand and accept what’s happening. For production financial data, it is not the right setup.

With that said, here’s how it works. Add this to ~/Library/Application Support/Claude/claude_desktop_config.json:

{
  "mcpServers": {
    "priveil": {
      "command": "priveil-mcp",
      "env": {
        "PRIVEIL_JUDGE_MODEL": "ollama:llama3",
        "PRIVEIL_JUDGE_BASE_URL": "http://localhost:11434/v1"
      }
    }
  }
}

Using a local model via PRIVEIL_JUDGE_BASE_URL means Priveil’s judge stays on your machine. That doesn’t change the fact that pasting raw data into Claude Desktop sends it to Anthropic — but it does mean the judge isn’t also forwarding it to a second external service.

If you’re using this for exploration or development against non-production data, and you’re comfortable with Anthropic receiving the content, the anthropic:claude-sonnet-4-6 judge works well. Just be deliberate about that choice.

Three tools become available to the model: detect, anonymise, and assess. The practical use case: you paste a document into a chat, ask the AI to clean it before it goes into a log or gets sent downstream, and the AI calls Priveil’s tools directly rather than attempting pattern-matching itself.


Running it

git clone https://github.com/mitchelllisle/priveil
cd priveil

# copy and configure .env
cp .env.example .env

# install and serve
make install
make serve

The API comes up at http://localhost:8000, with interactive docs at http://localhost:8000/docs. Docker is also supported: make docker-serve.

The LLM judge is optional. mode="fast" works with no judge configured; mode="judge" and /assess require PRIVEIL_JUDGE_MODEL to be set (format: provider:model). To keep the judge local, set PRIVEIL_JUDGE_BASE_URL to point at an Ollama or llama.cpp server and use the name of a model that server exposes through its OpenAI-compatible API. For regulated data in any context, this is the right path.
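
For illustration, the provider:model format can be split as below. This is a sketch of how such a value might be parsed, not Priveil’s own configuration code; the function names are hypothetical, and splitting on the first colon is an assumption made so that model tags containing colons survive.

```python
def parse_judge_model(value):
    """Split 'provider:model' on the first colon; both halves must be non-empty."""
    provider, sep, model = value.partition(":")
    if not sep or not provider or not model:
        raise ValueError(f"expected provider:model, got {value!r}")
    return provider, model


def judge_configured(env):
    """True if the environment mapping carries a usable PRIVEIL_JUDGE_MODEL."""
    value = env.get("PRIVEIL_JUDGE_MODEL", "")
    try:
        parse_judge_model(value)
    except ValueError:
        return False
    return True


print(parse_judge_model("ollama:llama3"))                # ('ollama', 'llama3')
print(judge_configured({}))                              # False
```

In a real deployment you would pass os.environ as the mapping; a plain dict is used here so the example has no side effects.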


What it’s good for, and what it isn’t

Good for: keeping PII out of logs and analytics pipelines; reducing accidental exposure when data crosses trust boundaries; improving compliance posture in Australian financial services; making data less obviously identifying for operational purposes. These are real and valuable things.

Not good for: publishing data publicly; satisfying a regulator that the data is “truly anonymous”; any scenario where an adversary might have auxiliary information that could enable linkage.

Also not a substitute for thinking about where your AI runs. The judge and assess features are powerful, but using them with a commercial API provider without appropriate authorisation and agreements undermines the entire point. The tool helps — the configuration determines whether that help is net positive or net negative for your privacy posture. Use a local model, or know exactly what you’ve agreed to with whoever is hosting yours.

The codebase is deliberate about this distinction — even the README warns that the word “anonymise” appears throughout because it’s what practitioners say, not because it’s accurate.

If you’re handling Australian financial data and want something that keeps PII from leaking across service boundaries, check it out on GitHub. For further reading on why this stuff is genuinely hard, Damien Desfontaines’ What anonymization techniques can you trust? is the place to start.