Underhill

Why I Am a Luddite

Fri, 01 May 2026 13:00:00 +0000

The story most people know goes like this: in early 19th century England, textile workers who feared the future smashed the machines that threatened their jobs. Ignorance versus progress. The Luddites lost, technology won, end of story.

Brian Merchant, who spent years researching the movement, found something messier and more useful. The Luddites were not anti-technology; they were anti-poverty. They were skilled workers who understood the machines intimately and objected not to machinery itself, but to the conditions of its deployment. Their phrase for what they opposed was “machinery hurtful to commonality.”

They were not asking whether the machines worked. They were asking who they worked for.

That reframing matters for AI in 2026.

I use AI every day. It is genuinely useful, occasionally magical, occasionally strange, and worth taking seriously. I am not arguing that it does not matter. What I have become more skeptical of is the claim that the current shape of AI adoption is inevitable, and that asking questions about it is just resistance.

That framing usually sounds like this: the future is already decided, the only question is whether you are keeping up.

Who decided this deployment model? What alternatives were considered? Who absorbs the cost when it fails? These are not anti-technology questions. They are basic governance questions.

I do not know the future of AI, and neither does anyone else, whatever confidence they project. What has helped is having a framework for the present: evaluate specific deployments, not the technology in the abstract.

The Luddite question is that framework: not is this impressive, but who does this serve, and on what terms?

AI Does Not Fix Systems, It Accelerates Them

One pattern keeps repeating: AI does not transform how an organization works. It accelerates whatever is already there.

A team with broken processes, heavy stage gates, stale documentation, and meaningless metrics introduces AI. Now it has governance bots enforcing the same gates, AI-generated versions of documents nobody reads, and automated summaries of reports that were already being ignored.

Everything gets faster. Nothing gets better.

I have seen this play out in data work repeatedly: dashboards nobody uses produced at higher volume, strategy documents polished into confident illegibility, status updates that are longer and cleaner but carry less signal than what they replaced.

Velocity increases and noise increases with it, for the same reason: the tool was applied to a broken model rather than to the question of whether the model was worth keeping.

The roughest implementations are often not caused by careless people. They happen because pressure to be seen adopting AI outpaces the harder work of deciding what adoption should mean.

The label changes. The thinking does not.

The IKEA Effect And Prompting

Dan Ariely and colleagues studied what they called the IKEA effect: people overvalue things they help construct. In one of the better-known experiments, participants made origami cranes and then stated how much they would pay to keep them. Neutral observers were asked to price the same cranes.

The builders were willing to pay roughly five times more.

The mechanism is effort, not quality. Putting work into something changes how we value it, independent of what it objectively is.

That is interesting in an AI context.

When you prompt something into existence, you frame the question, iterate outputs, and shape the final result. That is real effort, even if it differs from producing every word or line yourself. If the threshold for ownership feelings is low, then AI-assisted output may feel more trustworthy than it deserves simply because we invested effort in producing it.

I do not think this means AI-assisted work is inherently bad. It means our confidence in that work may be less objective than we assume.

So I use a simple rule: if I have put significant effort into shaping AI-assisted output that I am about to act on, I get someone who was not in the room to review it.

Not because AI must be wrong. Because I may be the least reliable judge of something that feels like I made it.

At team scale, this matters even more. When everyone has invested effort in an AI-assisted deliverable, you can end up with a room full of ownership bias.

Skills Are Shifting, Not Disappearing

This is not a case for denial. Skills always depreciate and compound over time.

The skills that are depreciating fastest are the ones AI can increasingly perform: fast SQL drafting, standard dashboard production, report formatting, basic summarization.

The skills that compound are judgment-intensive: knowing what to ask, reading context, seeing when an analysis is technically correct but organizationally wrong, and helping stakeholders clarify what they actually need.

As the cost of producing mediocre output approaches zero, judgment becomes more scarce and more valuable.

If anyone can generate a passable dashboard in five minutes, the value shifts to deciding which dashboards should exist at all.

Whether workers are compensated for that shift, or whether gains are extracted elsewhere, is the Luddite question applied to a career.

A Useful Question, Even Without Guarantees

The Luddites lost in the most literal sense. The state crushed the movement and the machines continued.

That is not an argument against asking the question. It is a reminder that asking it does not guarantee the answer you want.

And yet, versions of this question have won before: collective bargaining over automation, labor protections, organizations that deploy AI to augment skilled work instead of deskilling it.

The framework does not promise a good outcome. It gives you a way to see clearly enough to push for one.

Sometimes the answer is good. AI can remove tedious work and free people for higher-judgment work they find meaningful.

But on the surface, that can look very similar to AI deployed to deskill, justify layoffs, and extract more from fewer people under the language of progress.

The Luddite frame helps you distinguish between those two paths.

Not refuse the technology. Ask what it is for, who decided, and who benefits.

Nobody else is going to ask that on your behalf.

You should be a Luddite too.

Why Estimates Always Lie — And What to Do About It

Fri, 20 Mar 2026 13:00:00 +0000

Ask any developer how long a project will take, and then ask again once it’s done. The numbers will rarely match. This isn’t an occasional failure — it’s one of the most consistent and documented patterns in software development.

And yet, most of us just keep doing it the same way. We stare at a Jira board, assign story points, add a 20% buffer, and hand something over with a confidence we fundamentally do not have. Then we spend months quietly explaining why things are taking longer than expected.

I wanted to understand why estimation fails so reliably — and build something that makes it harder to lie to yourself.

The Problem Isn’t Laziness

The tempting narrative is that developers are just bad at scoping things. They’re naive optimists who forget about edge cases and tech debt. Fix the person, fix the problem.

But that framing misses what’s actually going on.

When you estimate a project, you almost always estimate the work — the actual build. The feature, the screen, the API endpoint. The thing you can see and reason about. The thing that ends up in your Jira ticket.

The problem is that the work is never just the work.

Every project comes wrapped in a thick layer of invisible effort that we don’t put in the Jira ticket because it isn’t the thing we’re building. It’s the meetings, the config, the debugging sessions, the backslide on a dependency upgrade, the scope conversations, the infrastructure that breaks on a Friday afternoon.

We’re not bad at estimating the work. We’re consistently ignoring everything around it.

Dave Stewart’s Taxonomy of Invisible Work

A few years ago, Dave Stewart published a fantastic deep dive on why projects always take longer — the result of a brutal postmortem on a project that ran far, far over. He also published an accompanying gist that catalogues, in painstaking detail, all the things you don’t think about when you quote for a project.

Reading it is one of those experiences where you nod continuously while quietly reflecting on every project you’ve ever been part of.

His key insight is that project work can be broken into distinct categories, and only one of them is what we actually estimate:

Category	What it means
The work around the work	Meetings, reviews, project management
The work to get the work	Research, scoping, quoting, pitching
The work before the work	Setup, config, infrastructure, services
The work	The actual build, product, design, docs, tests
The work between the work	Debugging, refactoring, iteration, tooling
The work beyond the work	Scope creep, omissions, nice-to-haves
The work outside the work	Surprises, contingency, unknown unknowns
The work after the work	Hosting, deployment, security, ongoing support

Looking at this list, the actual work — the thing that goes in the estimate — is one entry out of eight. And Dave’s rough analysis suggests execution might represent as little as 20% of total project effort.

That number feels extreme until you think about the last project you shipped. How much time was spent in stakeholder meetings? How long did the initial environment setup take? How many days got consumed by a third-party API that didn’t behave as documented? How many afternoons were eaten by “quick questions” that turned into scope renegotiations?

Add it all up honestly and 20% starts to seem plausible. Maybe even generous.

Why We Keep Getting It Wrong

There are a few cognitive traps that make this so persistent.

The planning fallacy — our tendency to anchor on best-case scenarios and discount known risks — is well-documented. We don’t just forget the adjacent work; we actively don’t want to include it because doing so makes the estimate “more expensive” and harder to sell.

Invisible work is invisible. If it doesn’t have a ticket, it doesn’t exist in the estimate. But it still exists in the calendar.

We estimate outcomes, not processes. “Build a search feature” gets an estimate. “Spend two days understanding why Elasticsearch index updates are inconsistent across environments” doesn’t. But the second thing is what actually happens.

The practical result is that estimates consistently represent a best-case path through the visible work, while everything else accumulates silently.

Building a Better Tool

I built true-estimate to make this hidden work visible — and to make it slightly harder to accidentally produce a naive estimate.

The tool is directly inspired by Dave Stewart’s framework. Rather than a flat list of tasks, it organises your estimate into the eight phases above. You can add tasks under each phase with optional week estimates. As you fill it in, you get three numbers:

Estimated — only the execution work, the thing you’d normally quote
Hidden — everything outside the execution phase
Total — what it actually costs
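The arithmetic behind those three numbers is simple. The following is a minimal sketch, not true-estimate’s actual code: the summarise helper, the task list, and the week values are all made up for illustration, and it assumes that only the phase called “The work” counts as the quoted estimate.

```python
# Assumption: only this phase is what people normally quote.
EXECUTION_PHASE = "The work"


def summarise(tasks):
    """Return (estimated, hidden, total) weeks for (phase, task, weeks) tuples.

    A task with no week estimate (None) is counted as zero.
    """
    estimated = sum((weeks or 0.0) for phase, _, weeks in tasks if phase == EXECUTION_PHASE)
    total = sum((weeks or 0.0) for _, _, weeks in tasks)
    return estimated, total - estimated, total


# Hypothetical project, with invented week values.
tasks = [
    ("The work around the work", "Stakeholder meetings", 1.0),
    ("The work before the work", "Environment setup", 1.0),
    ("The work", "Build search feature", 2.0),
    ("The work between the work", "Debugging and iteration", 2.0),
    ("The work beyond the work", "Nice-to-haves", None),  # not estimated yet
    ("The work outside the work", "Contingency", 1.0),
    ("The work after the work", "Deployment and handover", 1.0),
]

estimated, hidden, total = summarise(tasks)
print(f"Estimated: {estimated} weeks")
print(f"Hidden:    {hidden} weeks")
print(f"Total:     {total} weeks")
print(f"Execution share of total: {estimated / total:.0%}")
```

In this invented example the quoted figure is a quarter of the total, which is in the same territory as the 20% suggested above.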

The goal isn’t to produce a precise forecast, because that’s largely impossible. It’s to force the question: what am I not accounting for? The admin load, the setup time, the inevitable bugs and scope conversations — they’re going to happen regardless of whether you estimate them. The only question is whether you’re planning for them or absorbing them silently.

There’s also a sample project you can load to see what a realistic breakdown might look like. The hidden work being consistently larger than the estimated work is, in my experience, not a bug in the sample — it’s about right.

An Honest Estimate Isn’t a Pessimistic One

There’s sometimes a reluctance to estimate comprehensively because it feels like pessimism or padding. If you include two weeks for “general iteration and debugging,” it looks like you’re hedging. Shouldn’t a good developer be more efficient than that?

But this is exactly backwards. An honest estimate is a professional one. It signals that you understand how software projects actually work — that there is always invisible work, always iteration, always surprises. Hiding that work doesn’t make it go away. It just means someone absorbs it unexpectedly, whether that’s you, the project timeline, or the client.

The developers and teams who build trust over time are the ones whose estimates are reliable — not necessarily short.

Try It

If you’ve got a project in front of you — a new feature, a refactor, a greenfield build — give true-estimate a try before you submit that Jira estimate. Work through each phase and be honest about what you’re probably going to spend time on. Then compare your execution estimate to the total.

The gap between those two numbers is the amount of work you were planning to do for free.

The true-estimate tool is open source — code is on GitHub. Dave Stewart’s original article, which inspired the structure, is well worth reading in full.

Mapping Fire: Five Decades of Bushfires in NSW

Sat, 03 Jan 2026 13:00:00 +0000

Australia and fire are inseparable. For millennia, bushfires have shaped our landscapes, ecology, and communities. But as our climate changes and populations grow along bushland fringes, understanding fire patterns has never been more important.

I’ve built an interactive dashboard that explores over 50 years of fire history in New South Wales—from 1970 to 2024. Using data from NSW’s Department of Planning, Industry and Environment, it tells the story of where, when, and how fires have burned across the state.

What the Data Reveals

Since 1970, NSW has recorded 18,814 fire events, burning more than 15 million hectares—roughly 2% of Australia’s entire landmass. These fires fall into two categories: wildfires (11,503 events) which have burnt 14 million hectares, and prescribed burns (7,311 events) used for hazard reduction, clearing 1.8 million hectares.

The numbers alone don’t capture the human cost. The dashboard documents the deadliest fires, including the Badja Forest Road fire that claimed six lives during Black Summer, and the Green Wattle Creek fire that killed two volunteer firefighters when a tree struck their tanker.

Patterns in Time and Space

The visualisations reveal several clear patterns:

Geographic clustering shows fires concentrate heavily along coastal ranges where eucalypt forests meet urban development. The Blue Mountains and Central Coast are among the most fire-prone areas, with some locations experiencing dozens of fire events over the period.

Seasonal cycles are stark—summer and early autumn (December to March) dominate fire activity. But the 2019-2020 season broke patterns with unprecedented late-spring fires, signalling how changing conditions are shifting traditional fire seasons.

Drought years stand out. Wildfire frequency spikes dramatically during major droughts, particularly 2001-2002 and 2019-2020. Meanwhile, prescribed burns maintain a relatively steady baseline as fire services work to reduce fuel loads.

The Black Summer Context

The 2015-2020 period saw the most area burnt in any five-year window, driven entirely by the catastrophic Black Summer fires of 2019-2020. That season alone burnt over 5 million hectares—dwarfing every previous year on record.

Three fires during Black Summer deserve particular attention. The Gospers Mountain fire, started by a single lightning strike, ultimately burned 512,626 hectares after merging with five other fires into a megablaze exceeding one million hectares. The Currowan fire earned the name “The Forever Fire” for its 74-day duration. The Badja Forest Road fire travelled 40 kilometres in hours under catastrophic conditions, destroying 418 homes around Cobargo on New Year’s Eve.

Only 87 fires since 1970 have exceeded 50,000 hectares. Nearly all sparked from lightning strikes in remote bushland during extreme drought conditions. The dashboard shows how these mega-fires cluster in summer months when temperatures peak and fuel is driest.

Why This Matters

This isn’t just historical data—it’s a window into our future. Fire seasons now start earlier, last longer, and burn with unprecedented intensity. Understanding these patterns helps us prepare.

The dashboard shows how fires behave under different conditions, where they’re most likely to occur, and which periods have been most destructive. For anyone living in NSW or interested in fire management, these patterns matter.

It’s also worth noting what this data doesn’t capture. Historical records, especially pre-1990s, vary in accuracy. Fire boundaries are approximations. Some casualties may be unrecorded. The true human toll of these fires extends far beyond the numbers—displaced communities, destroyed homes, psychological trauma, and ecosystems fundamentally altered.

Building the Dashboard

I built this using Observable Framework with data from NSW’s Department of Planning, Industry and Environment. The dataset includes every recorded fire since 1970, with details on location, size, type, and timing. I’ve supplemented this with research from official inquiries and historical records to document casualties and home losses for the largest fires.

The goal was to make complex fire data accessible and interactive. You can explore specific years, compare wildfire versus prescribed burn patterns, see seasonal variations, and understand which areas face the highest risk.

Looking Ahead

Fire is part of Australia’s identity. Aboriginal Australians used fire as a land management tool for over 60,000 years. But the scale and intensity of modern fires—driven by climate change, fuel accumulation, and expanding urban-bushland interfaces—presents challenges we’re still learning to navigate.

This dashboard doesn’t offer solutions, but it does offer context. By seeing how fires have behaved over five decades, we can better understand what we’re facing and where we need to focus our efforts in fire management, hazard reduction, and community preparedness.

Explore the dashboard at mitchelllisle.github.io/fires-nsw-dashboard and see what patterns emerge from half a century of fire history.

Explore the dashboard: History of Bushfires in NSW

Data source: NSW DPIE Fire History Dataset

Too Unique to Hide: Understanding Re-identification Risk in Australia

Wed, 24 Dec 2025 10:00:00 +0000

We’ve all been told that our data is “de-identified” or “anonymised.” Healthcare providers, government agencies, and companies assure us that after removing names and addresses, our information is safe. But how safe is it really?

This question led me to create Too Unique to Hide, an interactive calculator that shows Australians how identifiable they might be from supposedly anonymous datasets.

How Unique Are You?

Even without your name or address, a few basic demographic facts can be quite distinctive. A combination of your postcode, age group, gender, and occupation might sound generic—but together, they can create a unique profile.

The calculator uses real Australian Bureau of Statistics (ABS) census data to show this. Enter your details, and it shows how many people in Australia share that same demographic profile. The results can be surprising.

Understanding the Numbers

When fewer people share your characteristics, linking different datasets becomes easier. For example, if an organisation releases “anonymous” health data with postcode, age, and gender, it’s possible that cross-referencing with other datasets could reveal identities—especially in smaller population groups.

The calculator shows four risk categories based on how many people match your profile, from very high risk (fewer than 10 matches) to lower risk (1,000+ matches). These estimates help you understand your potential visibility in anonymised datasets.

Real-World Examples

Re-identification isn’t just theoretical. In 2016, the Australian Department of Health released “de-identified” Medicare data, but researchers showed it was possible to re-identify individuals, leading to the dataset being withdrawn. Similar issues arose with Netflix viewing data and location tracking from apps.

Most often, this isn’t about bad actors—it’s organisations not fully appreciating how unique demographic combinations can be when sharing data for legitimate research or policy purposes.

The Combination Effect

Each demographic factor on its own is common. Millions share your age group or postcode. But combine them with gender and occupation, and you’re often in a much smaller group. The calculator visualises this, showing how rare you are for each attribute individually and combined.
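A rough sketch of that combination effect, using a handful of entirely made-up records rather than ABS data. The count_matches helper and the field names are hypothetical and are not taken from the calculator’s code.

```python
# Entirely fictional records: (postcode, age group, gender, occupation).
FIELDS = ("postcode", "age_group", "gender", "occupation")
people = [
    ("2000", "30-39", "F", "Nurse"),
    ("2000", "30-39", "M", "Teacher"),
    ("2000", "40-49", "F", "Nurse"),
    ("2150", "30-39", "F", "Teacher"),
    ("2150", "30-39", "F", "Nurse"),
    ("2000", "30-39", "F", "Teacher"),
]


def count_matches(records, profile, fields):
    """Count records that agree with `profile` on every field in `fields`."""
    positions = [FIELDS.index(field) for field in fields]
    return sum(all(record[i] == profile[i] for i in positions) for record in records)


me = ("2000", "30-39", "F", "Nurse")

# How many people share each attribute on its own...
per_field = {field: count_matches(people, me, [field]) for field in FIELDS}
# ...and how many share all of them at once.
combined = count_matches(people, me, FIELDS)

for field, matches in per_field.items():
    print(f"{field}: {matches} of {len(people)} match")
print(f"all four combined: {combined} of {len(people)} match")
```

Every attribute is shared by at least half of this tiny population, yet only one record matches all four at once.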

What You Can Do

Understanding your profile is a useful first step:

Be mindful with surveys that collect detailed demographics along with postcodes
Think about combinations when sharing information across multiple platforms
Ask questions when organisations claim data is anonymous—what demographics remain?
Support privacy protections that go beyond simple de-identification

Building the Tool

I built this using Observable Framework and real ABS census data, inspired by research from Imperial College London. All calculations happen in your browser—nothing you enter is collected or transmitted.

The goal is education, not alarm. Many Australians don’t realise how distinctive basic demographics can be. This tool makes that concept tangible.

Looking Forward

Data sharing for research and policy is valuable, and we shouldn’t stop it. But we do need better approaches. This includes being realistic about de-identification limits, using stronger privacy techniques like differential privacy, and being thoughtful about what demographic detail gets shared.

Try the calculator at Too Unique to Hide and see where you stand. Whether you’re one in thousands or more unique, understanding your demographic fingerprint is worth knowing in our data-driven world.

Try the calculator: Too Unique to Hide - Australian Edition

Learn more: Office of the Australian Information Commissioner

One schema library to rule them all

Sat, 30 Sep 2023 12:01:35 +0000

Generating and Testing PySpark DataFrames with Sparkdantic

1. Introduction of the Problem Sparkdantic Solves

In the world of Big Data, PySpark has become a go-to framework for processing large datasets. However, as with any framework, there are challenges. One of the most cumbersome is defining schemas for DataFrames and generating realistic test data. PySpark often does a good job of inferring schemas, but in some cases you need to define one explicitly to ensure your data arrives in the shape you expect.

Pydantic is another hugely popular library that provides excellent capabilities for validating your data. Up until now, there hasn’t been an easy way to use both together.

Traditionally, developers would manually define schemas and write custom code to generate test data. This process is not only tedious but also error-prone. While PySpark provides a way to define schemas, it doesn’t take advantage of Python’s built-in data types, which means you have to define your schema the way PySpark wants you to.

What if there was a more streamlined way to handle schemas, interoperability between Python and Spark and an easy way to generate fake / test data?

Enter Sparkdantic, which offers a seamless integration between Pydantic models and PySpark DataFrames. With Sparkdantic, you can define DataFrame schemas using Pydantic models and generate realistic test data based on custom specifications.

To read more about Sparkdantic and install it, see my GitHub profile here

pip install sparkdantic

2. Creating Schemas and How Sparkdantic Makes It Easy

With PySpark, defining a schema usually involves creating a StructType object with a list of StructField objects. While this method is powerful, it can become verbose and hard to manage for complex schemas. You also can’t use this schema outside of PySpark.

Using Sparkdantic, you can leverage Pydantic models to define your DataFrame schema. Pydantic models are Python classes that define data shapes and validation. They are concise, readable, and offer powerful validation capabilities. A basic Pydantic model may look like this:

from pydantic import BaseModel

class User(BaseModel):
    name: str
    age: int
    email: str

With the SparkModel class from Sparkdantic, you can declare the same fields and then generate a valid PySpark schema with the model_spark_schema method:

from sparkdantic import SparkModel

class UserSparkSchema(SparkModel):
    name: str
    age: int
    email: str

schema = UserSparkSchema.model_spark_schema()

This will output a PySpark StructType schema, ready to be used in your DataFrames.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

StructType([
    StructField('name', StringType(), False), 
    StructField('age', IntegerType(), False), 
    StructField('email', StringType(), False)
])
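To make the idea concrete, here is a standard-library-only sketch of how type annotations could be mapped to Spark type names. It is not Sparkdantic’s implementation: the SPARK_TYPE_NAMES table and the describe_fields helper are hypothetical, and the type names are plain strings rather than real PySpark objects. It also assumes that an Optional field should become nullable.

```python
from typing import Optional, Union, get_args, get_origin, get_type_hints

# Hypothetical lookup table: Python type -> Spark SQL type name (as a string).
SPARK_TYPE_NAMES = {
    str: "StringType",
    int: "IntegerType",
    float: "DoubleType",
    bool: "BooleanType",
}


def describe_fields(cls):
    """Return (field name, Spark type name, nullable) for each annotated field."""
    fields = []
    for name, hint in get_type_hints(cls).items():
        nullable = False
        # Optional[X] is Union[X, None]: unwrap it and mark the field nullable.
        if get_origin(hint) is Union and type(None) in get_args(hint):
            nullable = True
            hint = next(arg for arg in get_args(hint) if arg is not type(None))
        fields.append((name, SPARK_TYPE_NAMES[hint], nullable))
    return fields


class User:
    name: str
    age: int
    email: Optional[str]  # optional here only to show the nullable case


user_fields = describe_fields(User)
for field in user_fields:
    print(field)
```

The appeal of this kind of mapping is that the class stays an ordinary Python definition you can use outside Spark, while the Spark schema is derived from it instead of being written by hand.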

3. Generating Realistic Fake Data for Unit Tests / Populating a Development Database

Once you have your schema, the next challenge is populating it with realistic data. Sparkdantic provides the ColumnGenerationSpec class, which lets you define specifications for generating data for each column.

For instance, if you want the age column to have random values between 20 and 50:

from sparkdantic.generation import ColumnGenerationSpec

age_spec = ColumnGenerationSpec(min_value=20, max_value=50, random=True)

You may also want a list of names to use for the name column. For this, we can leverage other libraries such as the well-known Faker library:

from faker import Faker

faker = Faker()

names = [faker.name() for _ in range(1000)]

name_spec = ColumnGenerationSpec(values=names, random=True)

Using the generate_data method of the SparkModel in Sparkdantic, you can then generate a DataFrame with the desired number of rows:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("demo").getOrCreate()
data = UserSparkSchema.generate_data(spark, n_rows=1000, specs={"age": age_spec, "name": name_spec})
data.show()

This will produce a DataFrame with 1000 rows, with the age column populated with random values between 20 and 50 and a randomly chosen name from a list of 1000 fake names generated by faker.

4. Conclusion

Defining PySpark DataFrame schemas and generating test data doesn’t have to be a cumbersome process. With the integration of Pydantic models and Sparkdantic, you can streamline these tasks, making your development process more efficient and error-free.

Whether you’re a data engineer writing unit tests, a data scientist experimenting with data, or a developer populating a development database, Sparkdantic offers a powerful toolset to make your life easier. Give it a try and elevate your PySpark game!

A friendly encryption CLI tool

Fri, 01 Sep 2023 12:01:35 +0000

🧟 Monstermash: A Simple CLI Tool for Data Encryption

Introduction

In today’s digital landscape, data privacy is a growing concern. While there are many tools available for data encryption, Monstermash offers a straightforward command-line interface (CLI) solution for those who prefer simplicity. Let’s explore its basic functionalities: encrypting and decrypting data.

To read more about Monstermash and install it, see my GitHub profile here

pip install monstermash

Getting Started: Generating Keys

Before Alice and Bob can exchange encrypted messages, they each need a set of keys. Monstermash provides a basic command to generate these.

For Alice:

monstermash generate

Output:

-----------------
Private Key (Alice's)
a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2
Public Key (Alice's)
0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef
-----------------

For Bob:

monstermash generate

Output:

-----------------
Private Key (Bob's)
abcdef1234567890abcdef1234567890abcdef1234567890abcdef1234567890
Public Key (Bob's)
fedcba0987654321fedcba0987654321fedcba0987654321fedcba0987654321
-----------------

Encrypting Data

Suppose Alice wants to send Bob a line from the song “Monster Mash”. She can use her private key and Bob’s public key to encrypt the message.

monstermash encrypt \
  --private-key a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2 \
  --public-key fedcba0987654321fedcba0987654321fedcba0987654321fedcba0987654321 \
  --data "They did the mash, they did the Monster Mash!"

Output:

Encrypted Data: 0123abcd4567ef890123abcd4567ef890123abcd4567ef890123abcd4567ef89

Decrypting Data

Upon receiving the encrypted message, Bob can decrypt it using his private key.

monstermash decrypt \
  --private-key abcdef1234567890abcdef1234567890abcdef1234567890abcdef1234567890 \
  --data 0123abcd4567ef890123abcd4567ef890123abcd4567ef890123abcd4567ef89

Output:

Decrypted Data: They did the mash, they did the Monster Mash!

Conclusion

Monstermash is a simple CLI tool designed for basic encryption tasks. It doesn’t claim to revolutionize the encryption landscape but offers a simple solution for those familiar with the command line. If you’re looking for a no-frills way to encrypt and decrypt data, Monstermash might be worth a try.

Protecting Sensitive Data: Understanding Database Reconstruction Attacks

Thu, 23 Feb 2023 12:01:35 +0000

There are a number of reasons businesses and governments want to share information about people. One of the most common and useful ways data is shared is through a census. A census is particularly interesting because it contains some extremely personal information about individuals and, as a result, it must be carefully protected to ensure any statistical information that is released doesn’t encroach on everyone’s right to privacy. In a number of cases, aggregate data does little to stop attackers from re-creating a database that is either very close to, or exactly the same as, the original data. In this blog post, we will explore a little about how these attacks work with a simple example.

This blog post and the subsequent code is adapted from a paper on database reconstruction attacks. You can find the paper here

Imagine we work for a company called Acme Data Inc. that has the following database containing information on people within a certain geographic area.

name	age	married	smoker	employed
Sara Gray	8	False	False	False
Joseph Collins	18	False	True	True
Vincent Porter	24	False	False	True
Tiffany Brown	30	True	True	True
Brenda Small	36	True	False	False
Dr. Tina Ayala	66	True	False	False
Rodney Gonzalez	84	True	True	False

Note: All data here is fake generated data, and likeness to a real person is entirely coincidental.

We have 7 people in total in this block. Alongside age, we also have each resident’s smoking status, employment status and whether they are married or not. From here, we publish a variety of statistics about this block. You have probably seen something similar if you’ve ever done a census.

📓 To simplify the example, this fictional world has:

Two marriage statuses: Married (True) or Single (False)
Two smoking statuses: Non-Smoker (False) or Smoker (True)
Two employment statuses: Unemployed (False) or Employed (True)

👾 One additional piece of logic we know is that any statistic with a count of less than 3 is suppressed. Suppression of statistics with low counts is often used as a tactic for protecting privacy. The fewer people there are behind a statistic, the more they stick out in a dataset, meaning their privacy is often more at risk than those who ‘blend in with the crowd’. As we’ll see, simply knowing that a statistic is suppressed can even be used to attack a dataset.

As a Data Analyst working for Acme Data, we have been tasked with producing the following summary statistics that we can publish on our website for anyone to view. After running our analysis, this is the output that we intend to publish:

id	name	count	median-age	mean-age
A1	total-population	7.0	30.0	38.0
A2	non-smoker	4.0	30.0	33.0
B2	smoker	3.0	30.0	44.0
C2	unemployed	4.0	51.0	48.0
D2	employed	3.0	24.0	24.0
A3	single-adults	NaN	NaN	NaN
B3	married-adults	4.0	51.0	54.0
A4	unemployed-non-smoker	3.0	36.0	37.0

The stat A1 represents the total population count, median age, and mean age of individuals in the database. The count is the total number of individuals in the database, the median age is the age that separates the database into two equal halves, and the mean age is the average age of all individuals in the database. The other stats show the same information for various cohorts.
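To make the publishing step concrete, here is a minimal sketch of how stats like these could be computed from the database above. The `summarise` helper and the adult age of 18 are assumptions made for illustration, not details from the original analysis, and the unrounded means it produces may differ slightly from the whole-number means in the published table.

```python
from statistics import mean, median

# (name, age, married, smoker, employed) rows copied from the example database.
PEOPLE = [
    ("Sara Gray", 8, False, False, False),
    ("Joseph Collins", 18, False, True, True),
    ("Vincent Porter", 24, False, False, True),
    ("Tiffany Brown", 30, True, True, True),
    ("Brenda Small", 36, True, False, False),
    ("Dr. Tina Ayala", 66, True, False, False),
    ("Rodney Gonzalez", 84, True, True, False),
]

SUPPRESSION_THRESHOLD = 3  # counts below this are withheld
ADULT_AGE = 18             # assumption: the post does not define "adult"


def summarise(ages):
    """Return (count, median age, mean age), or all None when suppressed."""
    if len(ages) < SUPPRESSION_THRESHOLD:
        return (None, None, None)
    return (len(ages), median(ages), mean(ages))


# A1: everyone in the block.
total = summarise([age for _, age, _, _, _ in PEOPLE])

# D2: employed people only.
employed = summarise([age for _, age, _, _, emp in PEOPLE if emp])

# A3: single adults, which falls below the threshold and is suppressed.
single_adults = summarise(
    [age for _, age, married, _, _ in PEOPLE if age >= ADULT_AGE and not married]
)

print(total, employed, single_adults)
```

Only two residents are single adults under this assumption, which is why that cohort comes back empty while the others are published.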

Note that we have suppressed A3 in order to protect the identity of the individuals who have a higher risk of being re-identified. What’s interesting about this stat is that the suppression itself is information we can encode into our model to help us come up with a better re-construction. We can infer that it is suppressed because fewer than 3 people represent this cohort, since we know that other stats (such as D2) contain 3 people and are not suppressed.
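As a rough illustration of how much a suppressed cell still gives away, the sketch below works only from the published counts. It assumes every adult is either single or married, as in the simplified world described earlier; the variable names are made up for this example.

```python
# Published counts taken from the summary table.
TOTAL_POPULATION = 7       # A1
MARRIED_ADULTS = 4         # B3
SUPPRESSION_THRESHOLD = 3  # counts below this are withheld

# A suppressed count can only be 0, 1 or 2.
possible_single_adults = list(range(SUPPRESSION_THRESHOLD))

# Everyone who is not a married adult is either a single adult or a minor.
not_married_adults = TOTAL_POPULATION - MARRIED_ADULTS

# For each candidate single-adult count, the remaining people must be minors.
possible_minors = sorted({not_married_adults - n for n in possible_single_adults})

print(possible_single_adults)  # [0, 1, 2]
print(possible_minors)         # [1, 2, 3]
```

So even with A3 hidden, an attacker already knows the block contains at least one minor.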

To encode these constraints into a model that we can use to re-construct the data, we can use a library such as Z3. Z3 is a constraint solver: we describe the constraints and then ask for an answer that fits within them. Effectively, each stat above is a constraint, and we can ask the solver to generate every combination of age, smoker status, employment status and married status that satisfies all the constraints. An example of modelling a constraint can be done like this:

import z3

# create a solver object, that houses all our constraints
solver = z3.Solver()

# create a representation of the variables we want to receive an answer for, such as ages;
# the array maps each person's index (0 to 6) to that person's age
ages: z3.ArrayRef = z3.Array('ages', z3.IntSort(), z3.IntSort())

# define a constraint on these variables (we know there are 7 people, so we range over that number)
# the constraint we add here is to ensure all 7 people have a realistic age (between 0 and 125, inclusive)
min_age = 0
max_age = 125

for i in range(7):
    solver.add(
        z3.And(
            z3.Select(ages, i) >= min_age,
            z3.Select(ages, i) <= max_age
        )
    )

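# check that our constraints can be satisfied before asking for a model
if solver.check() == z3.sat:
    model = solver.model()
    # read the age the solver picked for each of the 7 people
    print([model.evaluate(z3.Select(ages, i), model_completion=True) for i in range(7)])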

The result of the constraints above is a list of values for ages that fit within our constraints. For example, the model we end up with might look like this:

[45, 34, 67, 34, 123, 1, 8]

Of course, many different answers satisfy so few constraints, and the model may output a different one depending on which it finds first. With each new constraint added, we reduce the search space until we ideally get down to 1 answer that fits all the constraints. At this point, we’ve re-constructed the database!

If you want to see this in action, check out this repo with a full implementation.

Conclusion

In this article, we’ve explored how aggregate data does little to hinder hackers from being able to re-create a database that is either very close, or exactly the same as the original data. It’s important to consider this when releasing data.

Before we wrap up, you may be asking why this is possible. The answer comes from the same people who came up with the best technique we know of for protecting against this type of attack:

“[Giving] overly accurate answers to too many questions will destroy privacy in a spectacular way”

Cynthia Dwork and Aaron Roth, authors of ‘The Algorithmic Foundations of Differential Privacy’

The next question you may be asking is “How do I protect against this attack?”. A couple of things you can look at include:

Differential privacy: DP is a great fit for protecting this type of data. In fact, the US Census Bureau has adopted DP to avoid disclosure of private information about individuals.
Data minimisation: Releasing too much information can lead to a simpler re-construction attack vector, so minimising the data you release can be a simple way to limit what people can infer about your data.
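For a sense of what differential privacy looks like in practice, here is a minimal sketch of the Laplace mechanism applied to a single count. The function names and the epsilon value are illustrative assumptions, and a real release should rely on a vetted DP library and a carefully chosen privacy budget rather than hand-rolled noise.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two exponential draws with the same rate
    # follows a Laplace distribution centred on zero.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_count(true_count, epsilon, rng):
    # Adding or removing one person changes a count by at most 1,
    # so the noise scale is 1 / epsilon.
    return true_count + laplace_noise(1 / epsilon, rng)


rng = random.Random(0)  # seeded only so the example is repeatable

# Publish a noisy version of the D2 (employed) count instead of the exact 3.
published = noisy_count(3, epsilon=1.0, rng=rng)
print(published)
```

Because the published value no longer pins down the exact count, an attacker cannot treat it as a hard constraint in a solver. A smaller epsilon adds more noise and gives stronger protection at the cost of accuracy.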

If you can, consult the privacy experts in your organisation and have them do a privacy review before you share data with third parties or with the public.

Thanks!
