<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>Underhill</title>
        <description>...</description>
        <link>/</link>
        <atom:link href="/feed.xml" rel="self" type="application/rss+xml"/>
        <pubDate>Sat, 02 May 2026 07:50:38 +0000</pubDate>
        <lastBuildDate>Sat, 02 May 2026 07:50:38 +0000</lastBuildDate>
        <generator>Jekyll v3.10.0</generator>
        
            <item>
                <title>Why I Am a Luddite</title>
                <description>&lt;p&gt;The story most people know goes like this: in early 19th century England, textile workers who feared the future smashed the machines that threatened their jobs. Ignorance versus progress. The Luddites lost, technology won, end of story.&lt;/p&gt;

&lt;p&gt;Brian Merchant, who spent years researching the movement, found something messier and more useful. The Luddites were not anti-technology; they were anti-poverty. They were skilled workers who understood the machines intimately and objected not to machinery itself, but to the conditions of its deployment. Their phrase for what they opposed was “machinery hurtful to commonality.”&lt;/p&gt;

&lt;p&gt;They were not asking whether the machines worked. They were asking who they worked for.&lt;/p&gt;

&lt;p&gt;That reframing matters for AI in 2026.&lt;/p&gt;

&lt;p&gt;I use AI every day. It is genuinely useful, occasionally magical, occasionally strange, and worth taking seriously. I am not arguing that it does not matter. What I have become more skeptical of is the claim that the current shape of AI adoption is inevitable, and that asking questions about it is just resistance.&lt;/p&gt;

&lt;p&gt;That framing usually sounds like this: the future is already decided; the only question is whether you are keeping up.&lt;/p&gt;

&lt;p&gt;Who decided this deployment model? What alternatives were considered? Who absorbs the cost when it fails? These are not anti-technology questions. They are basic governance questions.&lt;/p&gt;

&lt;p&gt;I do not know the future of AI, and neither does anyone else, whatever confidence they project. What has helped is having a framework for the present: evaluate specific deployments, not the technology in the abstract.&lt;/p&gt;

&lt;p&gt;The Luddite question is that framework: not is this impressive, but who does this serve, and on what terms?&lt;/p&gt;

&lt;h2 id=&quot;ai-does-not-fix-systems-it-accelerates-them&quot;&gt;AI Does Not Fix Systems, It Accelerates Them&lt;/h2&gt;

&lt;p&gt;One pattern keeps repeating: AI does not transform how an organization works. It accelerates whatever is already there.&lt;/p&gt;

&lt;p&gt;A team with broken processes, heavy stage gates, stale documentation, and meaningless metrics introduces AI. Now it has governance bots enforcing the same gates, AI-generated versions of documents nobody reads, and automated summaries of reports that were already being ignored.&lt;/p&gt;

&lt;p&gt;Everything gets faster. Nothing gets better.&lt;/p&gt;

&lt;p&gt;I have seen this play out in data work repeatedly: dashboards nobody uses produced at higher volume, strategy documents polished into confident illegibility, status updates that are longer and cleaner but carry less signal than what they replaced.&lt;/p&gt;

&lt;p&gt;Velocity increases and noise increases with it, for the same reason: the tool was applied to a broken model rather than to the question of whether the model was worth keeping.&lt;/p&gt;

&lt;p&gt;The roughest implementations are often not caused by careless people. They happen because pressure to be seen adopting AI outpaces the harder work of deciding what adoption should mean.&lt;/p&gt;

&lt;p&gt;The label changes. The thinking does not.&lt;/p&gt;

&lt;h2 id=&quot;the-ikea-effect-and-prompting&quot;&gt;The IKEA Effect And Prompting&lt;/h2&gt;

&lt;p&gt;Dan Ariely and colleagues studied what they called the IKEA effect: people overvalue things they help construct. In one of the better-known experiments, participants made origami cranes and then stated how much they would pay to keep them. Neutral observers were asked to price the same cranes.&lt;/p&gt;

&lt;p&gt;The builders were willing to pay roughly five times as much.&lt;/p&gt;

&lt;p&gt;The mechanism is effort, not quality. Putting work into something changes how we value it, independent of what it objectively is.&lt;/p&gt;

&lt;p&gt;That is interesting in an AI context.&lt;/p&gt;

&lt;p&gt;When you prompt something into existence, you frame the question, iterate outputs, and shape the final result. That is real effort, even if it differs from producing every word or line yourself. If the threshold for ownership feelings is low, then AI-assisted output may feel more trustworthy than it deserves simply because we invested effort in producing it.&lt;/p&gt;

&lt;div id=&quot;ikea-effect-widget&quot; style=&quot;max-width: 620px; margin: 2rem auto; padding: 1.1rem; border: 2px solid #d8ccb8; border-radius: 10px; background: radial-gradient(80% 130% at 20% 10%, #fff8eb 0%, transparent 60%), linear-gradient(140deg, #f7f4ee 0%, #e8dfd0 100%); box-shadow: 0 10px 30px rgba(40,30,20,0.08); font-family: &apos;Courier New&apos;, Courier, monospace; color: #2a251e;&quot;&gt;
  &lt;div style=&quot;display: flex; justify-content: space-between; align-items: center; gap: 0.75rem; margin-bottom: 0.9rem; flex-wrap: wrap;&quot;&gt;
    &lt;div style=&quot;font-size: 0.88rem; text-transform: uppercase; letter-spacing: 0.08em;&quot;&gt;IKEA Effect Motif&lt;/div&gt;
    &lt;div style=&quot;display: inline-flex; border: 2px solid #d8ccb8; border-radius: 8px; overflow: hidden; background: #f5ecde;&quot;&gt;
      &lt;button id=&quot;mode-observer&quot; style=&quot;border: 0; padding: 0.35rem 0.65rem; background: #fff; color: #2a251e; font-family: inherit; font-size: 0.76rem; cursor: pointer;&quot;&gt;observer&lt;/button&gt;
      &lt;button id=&quot;mode-builder&quot; style=&quot;border: 0; padding: 0.35rem 0.65rem; background: transparent; color: #756b5d; font-family: inherit; font-size: 0.76rem; cursor: pointer;&quot;&gt;builder&lt;/button&gt;
    &lt;/div&gt;
  &lt;/div&gt;

  &lt;div style=&quot;display: grid; grid-template-columns: 1fr 1fr; gap: 1rem; align-items: center;&quot; id=&quot;ikea-scene&quot;&gt;
    &lt;div style=&quot;position: relative; min-height: 180px; border: 2px solid #d8ccb8; border-radius: 8px; background: linear-gradient(#ede3d2 1px, transparent 1px), linear-gradient(90deg, #ede3d2 1px, transparent 1px), #f9f5ee; background-size: 20px 20px; display: grid; place-items: center; overflow: hidden;&quot;&gt;
      &lt;div id=&quot;crane&quot; style=&quot;--crane-color: #b7b0a6; --wing-color: #cec7be; width: 128px; height: 128px; position: relative; transform: translateY(4px); image-rendering: pixelated; animation: crane-bob 5s ease-in-out infinite;&quot;&gt;
        &lt;div id=&quot;wing-left&quot; class=&quot;crane-wing&quot;&gt;&lt;/div&gt;
        &lt;div id=&quot;wing-right&quot; class=&quot;crane-wing crane-wing-right&quot;&gt;&lt;/div&gt;
        &lt;div class=&quot;crane-fold&quot;&gt;&lt;/div&gt;
        &lt;div class=&quot;crane-beak&quot;&gt;&lt;/div&gt;
      &lt;/div&gt;
      &lt;div style=&quot;position: absolute; bottom: 0.55rem; left: 0.55rem; right: 0.55rem; font-size: 0.72rem; background: rgba(255,255,255,0.9); border: 1px solid #d8ccb8; border-radius: 6px; padding: 0.4rem 0.5rem;&quot;&gt;Effort can increase attachment without increasing quality.&lt;/div&gt;
    &lt;/div&gt;

    &lt;div style=&quot;border: 2px solid #d8ccb8; border-radius: 8px; padding: 0.8rem; background: #fff9ef;&quot;&gt;
      &lt;h3 id=&quot;mode-title&quot; style=&quot;margin: 0 0 0.35rem; font-size: 0.9rem;&quot;&gt;Observer view&lt;/h3&gt;
      &lt;p id=&quot;mode-note&quot; style=&quot;margin: 0.25rem 0; color: #756b5d; font-size: 0.8rem; line-height: 1.45;&quot;&gt;Looks like output. Low attachment.&lt;/p&gt;
      &lt;p id=&quot;mode-score&quot; style=&quot;margin-top: 0.6rem; display: inline-block; font-size: 0.76rem; border: 1px solid #d8ccb8; border-radius: 999px; padding: 0.22rem 0.55rem; color: #2a251e; background: #fff;&quot;&gt;valuation: 1x&lt;/p&gt;
      &lt;p style=&quot;margin: 0.5rem 0 0; font-size: 0.72rem; color: #756b5d;&quot;&gt;&lt;strong style=&quot;color: #2a251e;&quot;&gt;How to read:&lt;/strong&gt; same crane, different valuation signal. The output can look identical while attachment changes with effort.&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;style&gt;
  #crane::before {
    content: &apos;&apos;;
    position: absolute;
    inset: 0;
    background: var(--crane-color);
    clip-path: polygon(50% 10%, 62% 26%, 86% 26%, 68% 40%, 82% 56%, 58% 52%, 50% 66%, 42% 52%, 18% 56%, 32% 40%, 14% 26%, 38% 26%);
    filter: drop-shadow(0 6px 0 rgba(30,20,10,0.08));
  }
  .crane-wing {
    position: absolute;
    width: 42px;
    height: 29px;
    background: var(--wing-color);
    top: 44%;
    left: 18%;
    transform-origin: 100% 50%;
    clip-path: polygon(0 50%, 100% 0, 100% 100%);
    animation: crane-flap-left 3.6s steps(2, end) infinite;
  }
  .crane-wing-right {
    left: auto;
    right: 18%;
    transform-origin: 0 50%;
    clip-path: polygon(0 0, 100% 50%, 0 100%);
    animation: crane-flap-right 3.6s steps(2, end) infinite;
  }
  .crane-fold {
    position: absolute;
    width: 34px;
    height: 22px;
    top: 39%;
    left: 50%;
    transform: translateX(-50%);
    background: rgba(255, 255, 255, 0.16);
    clip-path: polygon(50% 0, 100% 100%, 0 100%);
    pointer-events: none;
  }
  .crane-beak {
    position: absolute;
    width: 18px;
    height: 7px;
    top: 24%;
    right: 14%;
    background: var(--crane-color);
    clip-path: polygon(0 50%, 100% 0, 100% 100%);
    transform: rotate(-5deg);
    pointer-events: none;
  }
  @keyframes crane-bob {
    0%, 100% { transform: translateY(4px); }
    50% { transform: translateY(0); }
  }
  @keyframes crane-flap-left {
    0%, 75%, 100% { transform: rotate(0deg); }
    80%, 90% { transform: rotate(-12deg); }
  }
  @keyframes crane-flap-right {
    0%, 75%, 100% { transform: rotate(0deg); }
    80%, 90% { transform: rotate(12deg); }
  }
  @media (max-width: 680px) {
    #ikea-scene {
      grid-template-columns: 1fr !important;
    }
  }
&lt;/style&gt;

&lt;script&gt;
(function () {
  var copy = {
    observer: {
      title: &apos;Observer view&apos;,
      note: &apos;Looks like output. Low attachment.&apos;,
      score: &apos;valuation: 1x&apos;,
      color: &apos;#b7b0a6&apos;,
      wing: &apos;#cec7be&apos;
    },
    builder: {
      title: &apos;Builder view&apos;,
      note: &apos;Effort creates ownership. Attachment rises.&apos;,
      score: &apos;valuation: 5x&apos;,
      color: &apos;#cf5d2e&apos;,
      wing: &apos;#db8664&apos;
    }
  };

  var mode = &apos;observer&apos;;

  var btnObserver = document.getElementById(&apos;mode-observer&apos;);
  var btnBuilder = document.getElementById(&apos;mode-builder&apos;);
  var title = document.getElementById(&apos;mode-title&apos;);
  var note = document.getElementById(&apos;mode-note&apos;);
  var score = document.getElementById(&apos;mode-score&apos;);
  var crane = document.getElementById(&apos;crane&apos;);
  var wingLeft = document.getElementById(&apos;wing-left&apos;);
  var wingRight = document.getElementById(&apos;wing-right&apos;);

  if (!btnObserver || !btnBuilder || !title || !note || !score || !crane || !wingLeft || !wingRight) return;

  function paint() {
    var c = copy[mode];
    title.textContent = c.title;
    note.textContent = c.note;
    score.textContent = c.score;
    crane.style.setProperty(&apos;--crane-color&apos;, c.color);
    crane.style.setProperty(&apos;--wing-color&apos;, c.wing);
    wingLeft.style.background = c.wing;
    wingRight.style.background = c.wing;

    btnObserver.style.background = mode === &apos;observer&apos; ? &apos;#fff&apos; : &apos;transparent&apos;;
    btnObserver.style.color = mode === &apos;observer&apos; ? &apos;#2a251e&apos; : &apos;#756b5d&apos;;
    btnBuilder.style.background = mode === &apos;builder&apos; ? &apos;#fff&apos; : &apos;transparent&apos;;
    btnBuilder.style.color = mode === &apos;builder&apos; ? &apos;#2a251e&apos; : &apos;#756b5d&apos;;
  }

  btnObserver.addEventListener(&apos;click&apos;, function () {
    mode = &apos;observer&apos;;
    paint();
  });

  btnBuilder.addEventListener(&apos;click&apos;, function () {
    mode = &apos;builder&apos;;
    paint();
  });

  paint();
})();
&lt;/script&gt;

&lt;p&gt;I do not think this means AI-assisted work is inherently bad. It means our confidence in that work may be less objective than we assume.&lt;/p&gt;

&lt;p&gt;So I use a simple rule: if I have put significant effort into shaping AI-assisted output that I am about to act on, I get someone who was not in the room to review it.&lt;/p&gt;

&lt;p&gt;Not because AI must be wrong. Because I may be the least reliable judge of something that feels like I made it.&lt;/p&gt;

&lt;p&gt;At team scale, this matters even more. When everyone has invested effort in an AI-assisted deliverable, you can end up with a room full of ownership bias.&lt;/p&gt;

&lt;h2 id=&quot;skills-are-shifting-not-disappearing&quot;&gt;Skills Are Shifting, Not Disappearing&lt;/h2&gt;

&lt;p&gt;This is not a case for denial. Skills have always depreciated and compounded at different rates; AI is changing which skills fall into which category.&lt;/p&gt;

&lt;p&gt;The skills that are depreciating fastest are the ones AI can increasingly perform: fast SQL drafting, standard dashboard production, report formatting, basic summarization.&lt;/p&gt;

&lt;p&gt;The skills that compound are judgment-intensive: knowing what to ask, reading context, seeing when an analysis is technically correct but organizationally wrong, and helping stakeholders clarify what they actually need.&lt;/p&gt;

&lt;p&gt;As the cost of producing mediocre output approaches zero, judgment becomes more scarce and more valuable.&lt;/p&gt;

&lt;p&gt;If anyone can generate a passable dashboard in five minutes, the value shifts to deciding which dashboards should exist at all.&lt;/p&gt;

&lt;p&gt;Whether workers are compensated for that shift, or whether gains are extracted elsewhere, is the Luddite question applied to a career.&lt;/p&gt;

&lt;h2 id=&quot;a-useful-question-even-without-guarantees&quot;&gt;A Useful Question, Even Without Guarantees&lt;/h2&gt;

&lt;p&gt;The Luddites lost in the most literal sense. The state crushed the movement and the machines continued.&lt;/p&gt;

&lt;p&gt;That is not an argument against asking the question. It is a reminder that asking it does not guarantee the answer you want.&lt;/p&gt;

&lt;p&gt;And yet, versions of this question have won before: collective bargaining over automation, labor protections, organizations that deploy AI to augment skilled work instead of deskilling it.&lt;/p&gt;

&lt;p&gt;The framework does not promise a good outcome. It gives you a way to see clearly enough to push for one.&lt;/p&gt;

&lt;p&gt;Sometimes the answer is good. AI can remove tedious work and free people for higher-judgment work they find meaningful.&lt;/p&gt;

&lt;p&gt;But on the surface, that can look very similar to AI deployed to deskill, justify layoffs, and extract more from fewer people under the language of progress.&lt;/p&gt;

&lt;p&gt;The Luddite frame helps you distinguish between those two paths.&lt;/p&gt;

&lt;p&gt;Not refuse the technology. Ask what it is for, who decided, and who benefits.&lt;/p&gt;

&lt;p&gt;Nobody else is going to ask that on your behalf.&lt;/p&gt;

&lt;p&gt;You should be a Luddite too.&lt;/p&gt;
</description>
                <pubDate>Fri, 01 May 2026 13:00:00 +0000</pubDate>
                <link>/blog/why-i-am-a-luddite</link>
                <guid isPermaLink="true">/blog/why-i-am-a-luddite</guid>
                
                <category>ai</category>
                
                <category>work</category>
                
                <category>technology</category>
                
                <category>organisations</category>
                
                
            </item>
        
            <item>
                <title>Why Estimates Always Lie — And What to Do About It</title>
                <description>&lt;p&gt;Ask any developer how long a project will take, and then ask again once it’s done. The numbers will rarely match. This isn’t an occasional failure — it’s one of the most consistent and documented patterns in software development.&lt;/p&gt;

&lt;p&gt;And yet, most of us just keep doing it the same way. We stare at a Jira board, assign story points, add a 20% buffer, and hand something over with a confidence we fundamentally do not have. Then we spend months quietly explaining why things are taking longer than expected.&lt;/p&gt;

&lt;p&gt;I wanted to understand &lt;em&gt;why&lt;/em&gt; estimation fails so reliably — and build something that makes it harder to lie to yourself.&lt;/p&gt;

&lt;h2 id=&quot;the-problem-isnt-laziness&quot;&gt;The Problem Isn’t Laziness&lt;/h2&gt;

&lt;p&gt;The tempting narrative is that developers are just bad at scoping things. They’re naive optimists who forget about edge cases and tech debt. Fix the person, fix the problem.&lt;/p&gt;

&lt;p&gt;But that framing misses what’s actually going on.&lt;/p&gt;

&lt;p&gt;When you estimate a project, you almost always estimate &lt;em&gt;the work&lt;/em&gt; — the actual build. The feature, the screen, the API endpoint. The thing you can see and reason about. The thing that ends up in your Jira ticket.&lt;/p&gt;

&lt;p&gt;The problem is that the work is never just the work.&lt;/p&gt;

&lt;p&gt;Every project comes wrapped in a thick layer of invisible effort that we don’t put in the Jira ticket because it isn’t &lt;em&gt;the thing&lt;/em&gt; we’re building. It’s the meetings, the config, the debugging sessions, the backslide on a dependency upgrade, the scope conversations, the infrastructure that breaks on a Friday afternoon.&lt;/p&gt;

&lt;p&gt;We’re not bad at estimating &lt;em&gt;the work&lt;/em&gt;. We’re consistently ignoring everything around it.&lt;/p&gt;

&lt;h2 id=&quot;dave-stewarts-taxonomy-of-invisible-work&quot;&gt;Dave Stewart’s Taxonomy of Invisible Work&lt;/h2&gt;

&lt;p&gt;A few years ago, Dave Stewart published a &lt;a href=&quot;https://davestewart.co.uk/blog/work/project-estimation/&quot;&gt;fantastic deep dive&lt;/a&gt; on why projects always take longer — the result of a brutal postmortem on a project that ran far, far over. He also published an &lt;a href=&quot;https://gist.github.com/davestewart/643ffc55aa7c173618d2707b776a1443&quot;&gt;accompanying gist&lt;/a&gt; that catalogues, in painstaking detail, all the things you don’t think about when you quote for a project.&lt;/p&gt;

&lt;p&gt;Reading it is one of those experiences where you nod continuously while quietly reflecting on every project you’ve ever been part of.&lt;/p&gt;

&lt;p&gt;His key insight is that project work can be broken into distinct categories, and only one of them is what we actually estimate:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Category&lt;/th&gt;
      &lt;th&gt;What it means&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;The work around the work&lt;/td&gt;
      &lt;td&gt;Meetings, reviews, project management&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;The work to get the work&lt;/td&gt;
      &lt;td&gt;Research, scoping, quoting, pitching&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;The work before the work&lt;/td&gt;
      &lt;td&gt;Setup, config, infrastructure, services&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;The work&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;The actual build, product, design, docs, tests&lt;/strong&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;The work between the work&lt;/td&gt;
      &lt;td&gt;Debugging, refactoring, iteration, tooling&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;The work beyond the work&lt;/td&gt;
      &lt;td&gt;Scope creep, omissions, nice-to-haves&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;The work outside the work&lt;/td&gt;
      &lt;td&gt;Surprises, contingency, unknown unknowns&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;The work after the work&lt;/td&gt;
      &lt;td&gt;Hosting, deployment, security, ongoing support&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Looking at this list, the &lt;em&gt;actual work&lt;/em&gt; — the thing that goes in the estimate — is one entry out of eight. And Dave’s rough analysis suggests execution might represent as little as &lt;strong&gt;20% of total project effort&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That number feels extreme until you think about the last project you shipped. How much time was spent in stakeholder meetings? How long did the initial environment setup take? How many days got consumed by a third-party API that didn’t behave as documented? How many afternoons were eaten by “quick questions” that turned into scope renegotiations?&lt;/p&gt;

&lt;p&gt;Add it all up honestly and 20% starts to seem plausible. Maybe even generous.&lt;/p&gt;

&lt;h2 id=&quot;why-we-keep-getting-it-wrong&quot;&gt;Why We Keep Getting It Wrong&lt;/h2&gt;

&lt;p&gt;There are a few cognitive traps that make this so persistent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The planning fallacy&lt;/strong&gt; — our tendency to anchor on best-case scenarios and discount known risks — is well-documented. We don’t just forget the adjacent work; we actively don’t want to include it because doing so makes the estimate “more expensive” and harder to sell.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Invisible work is invisible.&lt;/strong&gt; If it doesn’t have a ticket, it doesn’t exist in the estimate. But it still exists in the calendar.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We estimate outcomes, not processes.&lt;/strong&gt; “Build a search feature” gets an estimate. “Spend two days understanding why Elasticsearch index updates are inconsistent across environments” doesn’t. But the second thing is what actually happens.&lt;/p&gt;

&lt;p&gt;The practical result is that estimates consistently represent a best-case path through the &lt;em&gt;visible&lt;/em&gt; work, while everything else accumulates silently.&lt;/p&gt;

&lt;h2 id=&quot;building-a-better-tool&quot;&gt;Building a Better Tool&lt;/h2&gt;

&lt;p&gt;I built &lt;a href=&quot;https://mitchelllisle.github.io/true-estimate/&quot;&gt;true-estimate&lt;/a&gt; to make this hidden work visible — and to make it slightly harder to accidentally produce a naive estimate.&lt;/p&gt;

&lt;p&gt;The tool is directly inspired by Dave Stewart’s framework. Rather than a flat list of tasks, it organises your estimate into the eight phases above. You can add tasks under each phase with optional week estimates. As you fill it in, you get three numbers:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Estimated&lt;/strong&gt; — only the execution work, the thing you’d normally quote&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Hidden&lt;/strong&gt; — everything outside the execution phase&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Total&lt;/strong&gt; — what it actually costs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal isn’t to produce a precise forecast, because that’s largely impossible. It’s to force the question: &lt;em&gt;what am I not accounting for?&lt;/em&gt; The admin load, the setup time, the inevitable bugs and scope conversations — they’re going to happen regardless of whether you estimate them. The only question is whether you’re planning for them or absorbing them silently.&lt;/p&gt;

&lt;p&gt;There’s also a sample project you can load to see what a realistic breakdown might look like. The hidden work being consistently larger than the estimated work is, in my experience, not a bug in the sample — it’s about right.&lt;/p&gt;

&lt;h2 id=&quot;an-honest-estimate-isnt-a-pessimistic-one&quot;&gt;An Honest Estimate Isn’t a Pessimistic One&lt;/h2&gt;

&lt;p&gt;There’s sometimes a reluctance to estimate comprehensively because it feels like pessimism or padding. If you include two weeks for “general iteration and debugging,” it looks like you’re hedging. Shouldn’t a good developer be more efficient than that?&lt;/p&gt;

&lt;p&gt;But this is exactly backwards. An honest estimate is a professional one. It signals that you understand how software projects actually work — that there is always invisible work, always iteration, always surprises. Hiding that work doesn’t make it go away. It just means someone absorbs it unexpectedly, whether that’s you, the project timeline, or the client.&lt;/p&gt;

&lt;p&gt;The developers and teams who build trust over time are the ones whose estimates are reliable — not necessarily short.&lt;/p&gt;

&lt;h2 id=&quot;try-it&quot;&gt;Try It&lt;/h2&gt;

&lt;p&gt;If you’ve got a project in front of you — a new feature, a refactor, a greenfield build — give &lt;a href=&quot;https://mitchelllisle.github.io/true-estimate/&quot;&gt;true-estimate&lt;/a&gt; a try before you submit that Jira estimate. Work through each phase and be honest about what you’re probably going to spend time on. Then compare your execution estimate to the total.&lt;/p&gt;

&lt;p&gt;The gap between those two numbers is the amount of work you were planning to do for free.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;em&gt;The true-estimate tool is open source — code is &lt;a href=&quot;https://github.com/mitchelllisle/true-estimate&quot;&gt;on GitHub&lt;/a&gt;. Dave Stewart’s original article, which inspired the structure, is &lt;a href=&quot;https://davestewart.co.uk/blog/work/project-estimation/&quot;&gt;well worth reading in full&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
</description>
                <pubDate>Fri, 20 Mar 2026 13:00:00 +0000</pubDate>
                <link>/blog/why-estimates-always-lie</link>
                <guid isPermaLink="true">/blog/why-estimates-always-lie</guid>
                
                <category>software</category>
                
                <category>estimation</category>
                
                <category>productivity</category>
                
                <category>tools</category>
                
                
            </item>
        
            <item>
                <title>Mapping Fire: Five Decades of Bushfires in NSW</title>
                <description>&lt;p&gt;Australia and fire are inseparable. For millennia, bushfires have shaped our landscapes, ecology, and communities. But as our climate changes and populations grow along bushland fringes, understanding fire patterns has never been more important.&lt;/p&gt;

&lt;p&gt;I’ve built an &lt;a href=&quot;https://mitchelllisle.github.io/fires-nsw-dashboard/&quot;&gt;interactive dashboard&lt;/a&gt; that explores over 50 years of fire history in New South Wales—from 1970 to 2024. Using data from NSW’s Department of Planning, Industry and Environment, it tells the story of where, when, and how fires have burned across the state.&lt;/p&gt;

&lt;h2 id=&quot;what-the-data-reveals&quot;&gt;What the Data Reveals&lt;/h2&gt;

&lt;p&gt;Since 1970, NSW has recorded &lt;strong&gt;18,814 fire events&lt;/strong&gt;, burning more than &lt;strong&gt;15 million hectares&lt;/strong&gt;—roughly 2% of Australia’s entire landmass. These fires fall into two categories: wildfires (11,503 events), which have burnt 14 million hectares, and prescribed burns (7,311 events) used for hazard reduction, which have cleared 1.8 million hectares.&lt;/p&gt;

&lt;p&gt;The numbers alone don’t capture the human cost. The dashboard documents the deadliest fires, including the Badja Forest Road fire that claimed six lives during Black Summer, and the Green Wattle Creek fire that killed two volunteer firefighters when a tree struck their tanker.&lt;/p&gt;

&lt;h2 id=&quot;patterns-in-time-and-space&quot;&gt;Patterns in Time and Space&lt;/h2&gt;

&lt;p&gt;The visualisations reveal several clear patterns:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Geographic clustering&lt;/strong&gt; shows fires concentrate heavily along coastal ranges where eucalypt forests meet urban development. The Blue Mountains and Central Coast are among the most fire-prone areas, with some locations experiencing dozens of fire events over the period.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Seasonal cycles&lt;/strong&gt; are stark—summer and early autumn (December to March) dominate fire activity. But the 2019-2020 season broke patterns with unprecedented late-spring fires, signalling how changing conditions are shifting traditional fire seasons.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Drought years stand out&lt;/strong&gt;. Wildfire frequency spikes dramatically during major droughts, particularly 2001-2002 and 2019-2020. Meanwhile, prescribed burns maintain a relatively steady baseline as fire services work to reduce fuel loads.&lt;/p&gt;

&lt;h2 id=&quot;the-black-summer-context&quot;&gt;The Black Summer Context&lt;/h2&gt;

&lt;p&gt;The 2015-2020 period saw the most area burnt in any five-year window, driven entirely by the catastrophic Black Summer fires of 2019-2020. That season alone burnt over &lt;strong&gt;5 million hectares&lt;/strong&gt;—dwarfing every previous year on record.&lt;/p&gt;

&lt;p&gt;Three fires during Black Summer deserve particular attention. The Gospers Mountain fire, started by a single lightning strike, ultimately burned 512,626 hectares after merging with five other fires into a megablaze exceeding one million hectares. The Currowan fire earned the name “The Forever Fire” for its 74-day duration. The Badja Forest Road fire travelled 40 kilometres in hours under catastrophic conditions, destroying 418 homes around Cobargo on New Year’s Eve.&lt;/p&gt;

&lt;p&gt;Only 87 fires since 1970 have exceeded 50,000 hectares. Nearly all sparked from lightning strikes in remote bushland during extreme drought conditions. The dashboard shows how these mega-fires cluster in summer months when temperatures peak and fuel is driest.&lt;/p&gt;

&lt;h2 id=&quot;why-this-matters&quot;&gt;Why This Matters&lt;/h2&gt;

&lt;p&gt;This isn’t just historical data—it’s a window into our future. Fire seasons now start earlier, last longer, and burn with unprecedented intensity. Understanding these patterns helps us prepare.&lt;/p&gt;

&lt;p&gt;The dashboard shows how fires behave under different conditions, where they’re most likely to occur, and which periods have been most destructive. For anyone living in NSW or interested in fire management, these patterns matter.&lt;/p&gt;

&lt;p&gt;It’s also worth noting what this data doesn’t capture. Historical records, especially pre-1990s, vary in accuracy. Fire boundaries are approximations. Some casualties may be unrecorded. The true human toll of these fires extends far beyond the numbers—displaced communities, destroyed homes, psychological trauma, and ecosystems fundamentally altered.&lt;/p&gt;

&lt;h2 id=&quot;building-the-dashboard&quot;&gt;Building the Dashboard&lt;/h2&gt;

&lt;p&gt;I built this using Observable Framework with data from NSW’s Department of Planning, Industry and Environment. The dataset includes every recorded fire since 1970, with details on location, size, type, and timing. I’ve supplemented this with research from official inquiries and historical records to document casualties and home losses for the largest fires.&lt;/p&gt;

&lt;p&gt;The goal was to make complex fire data accessible and interactive. You can explore specific years, compare wildfire versus prescribed burn patterns, see seasonal variations, and understand which areas face the highest risk.&lt;/p&gt;

&lt;h2 id=&quot;looking-ahead&quot;&gt;Looking Ahead&lt;/h2&gt;

&lt;p&gt;Fire is part of Australia’s identity. Aboriginal Australians used fire as a land management tool for over 60,000 years. But the scale and intensity of modern fires—driven by climate change, fuel accumulation, and expanding urban-bushland interfaces—presents challenges we’re still learning to navigate.&lt;/p&gt;

&lt;p&gt;This dashboard doesn’t offer solutions, but it does offer context. By seeing how fires have behaved over five decades, we can better understand what we’re facing and where we need to focus our efforts in fire management, hazard reduction, and community preparedness.&lt;/p&gt;

&lt;p&gt;Explore the dashboard at &lt;a href=&quot;https://mitchelllisle.github.io/fires-nsw-dashboard/&quot;&gt;mitchelllisle.github.io/fires-nsw-dashboard&lt;/a&gt; and see what patterns emerge from half a century of fire history.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;em&gt;Explore the dashboard: &lt;a href=&quot;https://mitchelllisle.github.io/fires-nsw-dashboard/&quot;&gt;History of Bushfires in NSW&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Data source: &lt;a href=&quot;https://datasets.seed.nsw.gov.au/dataset/fire-history-wildfires-and-prescribed-burns-1e8b6&quot;&gt;NSW DPIE Fire History Dataset&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
</description>
                <pubDate>Sat, 03 Jan 2026 13:00:00 +0000</pubDate>
                <link>/blog/nsw-fires-dashboard</link>
                <guid isPermaLink="true">/blog/nsw-fires-dashboard</guid>
                
                <category>data</category>
                
                <category>visualisation</category>
                
                <category>australia</category>
                
                <category>climate</category>
                
                
            </item>
        
            <item>
                <title>Too Unique to Hide: Understanding Re-identification Risk in Australia</title>
                <description>&lt;p&gt;We’ve all been told that our data is “de-identified” or “anonymised.” Healthcare providers, government agencies, and companies assure us that after removing names and addresses, our information is safe. But how safe is it really?&lt;/p&gt;

&lt;p&gt;This question led me to create &lt;a href=&quot;https://mitchelllisle.github.io/too-unique-to-hide-aus/&quot;&gt;Too Unique to Hide&lt;/a&gt;, an interactive calculator that shows Australians how identifiable they might be from supposedly anonymous datasets.&lt;/p&gt;

&lt;h2 id=&quot;how-unique-are-you&quot;&gt;How Unique Are You?&lt;/h2&gt;

&lt;p&gt;Even without your name or address, a few basic demographic facts can be quite distinctive. A combination of your postcode, age group, gender, and occupation might sound generic—but together, they can create a unique profile.&lt;/p&gt;

&lt;p&gt;The calculator uses real Australian Bureau of Statistics (ABS) census data to show this. Enter your details, and it shows how many people in Australia share that same demographic profile. The results can be surprising.&lt;/p&gt;

&lt;h2 id=&quot;understanding-the-numbers&quot;&gt;Understanding the Numbers&lt;/h2&gt;

&lt;p&gt;When fewer people share your characteristics, linking different datasets becomes easier. For example, if an organisation releases “anonymous” health data with postcode, age, and gender, it’s possible that cross-referencing with other datasets could reveal identities—especially in smaller population groups.&lt;/p&gt;

&lt;p&gt;The calculator shows four risk categories based on how many people match your profile, from very high risk (fewer than 10 matches) to lower risk (1,000+ matches). These estimates help you understand your potential visibility in anonymised datasets.&lt;/p&gt;
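
&lt;p&gt;As a rough sketch, that banding comes down to a simple threshold check. Only the outer bounds (fewer than 10 matches, and 1,000+) are stated above; the two middle thresholds and labels here are illustrative assumptions, not the calculator’s exact values.&lt;/p&gt;

```python
def risk_category(matches):
    # Outer bounds follow the post (under 10, and 1,000 plus);
    # the two middle bands are assumed purely for illustration.
    if matches >= 1000:
        return 'lower'
    if matches >= 100:  # assumed threshold
        return 'moderate'
    if matches >= 10:   # assumed threshold
        return 'high'
    return 'very high'

print(risk_category(7))     # very high
print(risk_category(2500))  # lower
```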

&lt;h2 id=&quot;real-world-examples&quot;&gt;Real-World Examples&lt;/h2&gt;

&lt;p&gt;Re-identification isn’t just theoretical. In 2016, the Australian Department of Health released “de-identified” Medicare data, but researchers showed it was possible to re-identify individuals, leading to the dataset being withdrawn. Similar issues arose with Netflix viewing data and location tracking from apps.&lt;/p&gt;

&lt;p&gt;Most often, this isn’t about bad actors—it’s organisations not fully appreciating how unique demographic combinations can be when sharing data for legitimate research or policy purposes.&lt;/p&gt;

&lt;h2 id=&quot;the-combination-effect&quot;&gt;The Combination Effect&lt;/h2&gt;

&lt;p&gt;Each demographic factor on its own is common. Millions share your age group or postcode. But combine them with gender and occupation, and you’re often in a much smaller group. The calculator visualises this, showing how rare you are for each attribute individually and combined.&lt;/p&gt;
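
&lt;p&gt;A back-of-the-envelope sketch shows why combining attributes narrows things down so quickly. It assumes the attributes are statistically independent, which real census data is not (the calculator uses actual joint counts), and the population shares below are invented examples:&lt;/p&gt;

```python
# Invented attribute shares; assumes independence, which real data violates.
population = 26_000_000  # approximate Australian population

shares = {
    'postcode':   0.0004,  # one suburb's share of the population
    'age group':  0.06,
    'gender':     0.5,
    'occupation': 0.01,
}

group = population
for attribute, share in shares.items():
    group = group * share
    print(f'after {attribute}: about {group:,.0f} people match')
# by the last attribute, only a handful of people remain
```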

&lt;h2 id=&quot;what-you-can-do&quot;&gt;What You Can Do&lt;/h2&gt;

&lt;p&gt;Understanding your profile is a useful first step:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Be mindful with surveys&lt;/strong&gt; that collect detailed demographics along with postcodes&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Think about combinations&lt;/strong&gt; when sharing information across multiple platforms&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Ask questions&lt;/strong&gt; when organisations claim data is anonymous—what demographics remain?&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Support privacy protections&lt;/strong&gt; that go beyond simple de-identification&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;building-the-tool&quot;&gt;Building the Tool&lt;/h2&gt;

&lt;p&gt;I built this using Observable Framework and real ABS census data, inspired by research from Imperial College London. All calculations happen in your browser—nothing you enter is collected or transmitted.&lt;/p&gt;

&lt;p&gt;The goal is education, not alarm. Many Australians don’t realise how distinctive basic demographics can be. This tool makes that concept tangible.&lt;/p&gt;

&lt;h2 id=&quot;looking-forward&quot;&gt;Looking Forward&lt;/h2&gt;

&lt;p&gt;Data sharing for research and policy is valuable, and we shouldn’t stop it. But we do need better approaches. This includes being realistic about de-identification limits, using stronger privacy techniques like differential privacy, and being thoughtful about what demographic detail gets shared.&lt;/p&gt;
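
&lt;p&gt;One of those stronger techniques, differential privacy, works by adding carefully calibrated random noise to statistics before release. A minimal sketch of its most common building block, the Laplace mechanism, might look like this (the count and epsilon values are made-up examples):&lt;/p&gt;

```python
import random

def laplace_noise(scale):
    # The difference of two independent exponentials is Laplace-distributed.
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def noisy_count(true_count, epsilon):
    # A counting query changes by at most 1 per person (sensitivity 1),
    # so the noise scale is 1 / epsilon.
    return true_count + laplace_noise(1 / epsilon)

random.seed(42)
print(noisy_count(37, epsilon=1.0))  # a noisy version of 37
```

&lt;p&gt;Smaller values of epsilon mean more noise and stronger privacy guarantees, at the cost of less accurate statistics.&lt;/p&gt;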

&lt;p&gt;Try the calculator at &lt;a href=&quot;https://mitchelllisle.github.io/too-unique-to-hide-aus/&quot;&gt;Too Unique to Hide&lt;/a&gt; and see where you stand. Whether you’re one in thousands or far more unique, your demographic fingerprint is worth understanding in our data-driven world.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;em&gt;Try the calculator: &lt;a href=&quot;https://mitchelllisle.github.io/too-unique-to-hide-aus/&quot;&gt;Too Unique to Hide - Australian Edition&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Learn more: &lt;a href=&quot;https://www.oaic.gov.au/&quot;&gt;Office of the Australian Information Commissioner&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
</description>
                <pubDate>Wed, 24 Dec 2025 10:00:00 +0000</pubDate>
                <link>/blog/too-unique-to-hide</link>
                <guid isPermaLink="true">/blog/too-unique-to-hide</guid>
                
                <category>privacy</category>
                
                <category>data</category>
                
                <category>security</category>
                
                <category>australia</category>
                
                
            </item>
        
            <item>
                <title>One schema library to rule them all</title>
                <description>&lt;h2 id=&quot;generating-and-testing-pyspark-dataframes-with-sparkdantic&quot;&gt;Generating and Testing PySpark DataFrames with Sparkdantic&lt;/h2&gt;

&lt;h3 id=&quot;1-introduction-of-the-problem-sparkdantic-solves&quot;&gt;1. Introduction of the Problem Sparkdantic Solves&lt;/h3&gt;

&lt;p&gt;In the world of Big Data, PySpark has become a go-to framework for processing large datasets. However, as with any 
framework, there are challenges. One of the most cumbersome is defining schemas for DataFrames and generating 
realistic test data. PySpark often does a good job of inferring schemas, but in some cases you need to define a schema
explicitly to ensure your data arrives in the expected shape.&lt;/p&gt;

&lt;p&gt;Pydantic is another hugely popular library that provides excellent capabilities for validating your data. Up until 
now, there hasn’t been an easy way to use the two together.&lt;/p&gt;

&lt;p&gt;Traditionally, developers would manually define schemas and write custom code to generate test data. This process is not
only tedious but also error-prone. While PySpark provides a way to define schemas, it doesn’t take advantage of Python’s
built-in data types, which means you have to define your schema the way PySpark wants you to.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;What if there was a more streamlined way to handle schemas, provide interoperability between Python and Spark, and
generate fake / test data?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Enter Sparkdantic, which offers a seamless integration between Pydantic models and PySpark DataFrames. With Sparkdantic,
you can define DataFrame schemas using Pydantic models and generate realistic test data based on custom specifications.&lt;/p&gt;

&lt;p&gt;To read more about Sparkdantic and install it, see my GitHub profile &lt;a href=&quot;https://github.com/mitchelllisle/sparkdantic&quot;&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pip install sparkdantic&lt;/code&gt;&lt;/p&gt;

&lt;h3 id=&quot;2-creating-schemas-and-how-sparkdantic-makes-it-easy&quot;&gt;2. Creating Schemas and How Sparkdantic Makes It Easy&lt;/h3&gt;

&lt;p&gt;With PySpark, defining a schema usually involves creating a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;StructType&lt;/code&gt; object with a list of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;StructField&lt;/code&gt; objects. 
While this method is powerful, it can become verbose and hard to manage for complex schemas. You also can’t use this 
schema outside of PySpark.&lt;/p&gt;

&lt;p&gt;Using Sparkdantic, you can leverage Pydantic models to define your DataFrame schema. Pydantic models are Python classes 
that define data shapes and validation. They are concise, readable, and offer powerful validation capabilities. A basic
Pydantic model may look like this:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;pydantic&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;BaseModel&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;User&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;BaseModel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;age&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;email&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;With the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SparkModel&lt;/code&gt; class from Sparkdantic, you can convert this Pydantic model into a valid PySpark
schema using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;model_spark_schema&lt;/code&gt; method:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sparkdantic&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SparkModel&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;UserSparkSchema&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SparkModel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;age&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;email&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;schema&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;UserSparkSchema&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model_spark_schema&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This will output a PySpark &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;StructType&lt;/code&gt; schema, ready to be used in your DataFrames.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;pyspark.sql.types&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;StructType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;StructField&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;StringType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;IntegerType&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;StructType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;StructField&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;name&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;StringType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; 
    &lt;span class=&quot;n&quot;&gt;StructField&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;age&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;IntegerType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; 
    &lt;span class=&quot;n&quot;&gt;StructField&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;email&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;StringType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
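
&lt;p&gt;Conceptually, the conversion walks the model’s type annotations and maps each Python type to a Spark type. The standard-library-only sketch below shows just that core idea; it is not how Sparkdantic is actually implemented, which also handles nested models, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Optional&lt;/code&gt; fields, enums, and more.&lt;/p&gt;

```python
from typing import get_type_hints

# Simplified mapping from Python types to Spark type names.
SPARK_TYPES = {str: 'StringType', int: 'IntegerType', float: 'DoubleType', bool: 'BooleanType'}

class User:
    name: str
    age: int
    email: str

# Walk the annotations and translate each field's Python type.
fields = {name: SPARK_TYPES[tp] for name, tp in get_type_hints(User).items()}
print(fields)  # {'name': 'StringType', 'age': 'IntegerType', 'email': 'StringType'}
```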

&lt;h3 id=&quot;3-generating-realistic-fake-data-for-unit-tests--populating-a-development-database&quot;&gt;3. Generating Realistic Fake Data for Unit Tests / Populating a Development Database&lt;/h3&gt;

&lt;p&gt;Once you have your schema, the next challenge is populating it with realistic data. Sparkdantic provides the 
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ColumnGenerationSpec&lt;/code&gt; class, which lets you define specifications for generating data for each column.&lt;/p&gt;

&lt;p&gt;For instance, if you want the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;age&lt;/code&gt; column to have random values between 20 and 50:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sparkdantic.generation&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ColumnGenerationSpec&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;age_spec&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ColumnGenerationSpec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;min_value&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max_value&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;50&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;You may also want a list of names to use for the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;name&lt;/code&gt; column. For this, we can leverage other libraries such as the
well-known &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;faker&lt;/code&gt; library:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;faker&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Faker&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;faker&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Faker&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;names&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;faker&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;name_spec&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ColumnGenerationSpec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;values&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;names&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;generate_data&lt;/code&gt; method of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SparkModel&lt;/code&gt; in Sparkdantic, you can then generate a DataFrame with the desired
number of rows:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;pyspark.sql&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SparkSession&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;spark&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SparkSession&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;builder&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;appName&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;demo&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;getOrCreate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;UserSparkSchema&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;generate_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;spark&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n_rows&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;specs&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;age&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;age_spec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name_spec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;})&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;show&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This will produce a DataFrame with 1000 rows, with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;age&lt;/code&gt; column populated with random values between 20 and 50 and
a randomly chosen name from a list of 1000 fake names generated by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;faker&lt;/code&gt;.&lt;/p&gt;

&lt;h3 id=&quot;4-conclusion&quot;&gt;4. Conclusion&lt;/h3&gt;

&lt;p&gt;Defining PySpark DataFrame schemas and generating test data doesn’t have to be a cumbersome process. With the 
integration of Pydantic models via Sparkdantic, you can streamline these tasks, making your development process more 
efficient and less error-prone.&lt;/p&gt;

&lt;p&gt;Whether you’re a data engineer writing unit tests, a data scientist experimenting with data, or a developer populating a
development database, Sparkdantic offers a powerful toolset to make your life easier. Give it a try and elevate your 
PySpark game!&lt;/p&gt;
</description>
                <pubDate>Sat, 30 Sep 2023 12:01:35 +0000</pubDate>
                <link>/blog/pyspark-pydantic-schema-library</link>
                <guid isPermaLink="true">/blog/pyspark-pydantic-schema-library</guid>
                
                <category>pydantic</category>
                
                <category>python</category>
                
                <category>pyspark</category>
                
                <category>schema</category>
                
                
            </item>
        
            <item>
                <title>A friendly encryption CLI tool</title>
                <description>&lt;h2 id=&quot;-monstermash-a-simple-cli-tool-for-data-encryption&quot;&gt;🧟 Monstermash: A Simple CLI Tool for Data Encryption&lt;/h2&gt;

&lt;h3 id=&quot;introduction&quot;&gt;Introduction&lt;/h3&gt;

&lt;p&gt;In today’s digital landscape, data privacy is a growing concern. While there are many tools available for data encryption,
Monstermash offers a straightforward command-line interface (CLI) solution for those who prefer simplicity. 
Let’s explore its basic functionalities: encrypting and decrypting data.&lt;/p&gt;

&lt;p&gt;To read more about Monstermash and install it, see my GitHub profile &lt;a href=&quot;https://github.com/mitchelllisle/monstermash&quot;&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pip install monstermash&lt;/code&gt;&lt;/p&gt;

&lt;h3 id=&quot;getting-started-generating-keys&quot;&gt;Getting Started: Generating Keys&lt;/h3&gt;

&lt;p&gt;Before Alice and Bob can exchange encrypted messages, they each need a set of keys. Monstermash provides a basic command
to generate these.&lt;/p&gt;

&lt;p&gt;For Alice:&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;monstermash generate
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;div class=&quot;language-text highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-----------------
Private Key (Alice&apos;s)
a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2
Public Key (Alice&apos;s)
0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef
-----------------
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For Bob:&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;monstermash generate
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;div class=&quot;language-text highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-----------------
Private Key (Bob&apos;s)
abcdef1234567890abcdef1234567890abcdef1234567890abcdef1234567890
Public Key (Bob&apos;s)
fedcba0987654321fedcba0987654321fedcba0987654321fedcba0987654321
-----------------
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;encrypting-data&quot;&gt;Encrypting Data&lt;/h3&gt;

&lt;p&gt;Suppose Alice wants to send Bob a line from the song “Monster Mash”. She can use her private key and Bob’s public key to
encrypt the message.&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;monstermash encrypt &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;--private-key&lt;/span&gt; a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;--public-key&lt;/span&gt; fedcba0987654321fedcba0987654321fedcba0987654321fedcba0987654321 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;--data&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;They did the mash, they did the Monster Mash!&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;div class=&quot;language-text highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Encrypted Data: 0123abcd4567ef890123abcd4567ef890123abcd4567ef890123abcd4567ef89
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;decrypting-data&quot;&gt;Decrypting Data&lt;/h3&gt;

&lt;p&gt;Upon receiving the encrypted message, Bob can decrypt it using his private key.&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;monstermash decrypt &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;--private-key&lt;/span&gt; abcdef1234567890abcdef1234567890abcdef1234567890abcdef1234567890 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;--data&lt;/span&gt; 0123abcd4567ef890123abcd4567ef890123abcd4567ef890123abcd4567ef89
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;div class=&quot;language-text highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Decrypted Data: They did the mash, they did the Monster Mash!
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h3&gt;

&lt;p&gt;Monstermash is a simple CLI tool designed for basic encryption tasks. It doesn’t claim to revolutionize the encryption 
landscape but offers a simple solution for those familiar with the command line. If you’re looking for a no-frills way 
to encrypt and decrypt data, Monstermash might be worth a try.&lt;/p&gt;
</description>
                <pubDate>Fri, 01 Sep 2023 12:01:35 +0000</pubDate>
                <link>/blog/monstermash</link>
                <guid isPermaLink="true">/blog/monstermash</guid>
                
                <category>NaCL</category>
                
                <category>python</category>
                
                <category>encryption</category>
                
                <category>cli</category>
                
                
            </item>
        
            <item>
                <title>Protecting Sensitive Data: Understanding Database Reconstruction Attacks</title>
                <description>&lt;h1 id=&quot;protecting-sensitive-data-understanding-database-reconstruction-attacks&quot;&gt;Protecting Sensitive Data: Understanding Database Reconstruction Attacks&lt;/h1&gt;

&lt;p&gt;There are a number of reasons businesses and governments want to share information about people. One of the most common and useful ways data is shared is through a census. A census is particularly interesting because it contains some extremely personal information about individuals, and as a result it must be carefully protected to ensure any statistical information that is released doesn’t encroach on anyone’s right to privacy. In many cases, however, aggregate data does little to stop attackers from re-creating a database that is either very close to, or exactly the same as, the original data. In this blog post, we will explore how these attacks work with a simple example.&lt;/p&gt;

&lt;p&gt;This blog post and the subsequent code is adapted from a paper on database reconstruction attacks. You can find the paper &lt;a href=&quot;https://queue.acm.org/detail.cfm?id=3295691&quot;&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Imagine we work for a company called Acme Data Inc. and that we have the following database containing information about people within a certain geographic area.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;name&lt;/th&gt;
      &lt;th&gt;age&lt;/th&gt;
      &lt;th&gt;married&lt;/th&gt;
      &lt;th&gt;smoker&lt;/th&gt;
      &lt;th&gt;employed&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Sara Gray&lt;/td&gt;
      &lt;td&gt;8&lt;/td&gt;
      &lt;td&gt;False&lt;/td&gt;
      &lt;td&gt;False&lt;/td&gt;
      &lt;td&gt;False&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Joseph Collins&lt;/td&gt;
      &lt;td&gt;18&lt;/td&gt;
      &lt;td&gt;False&lt;/td&gt;
      &lt;td&gt;True&lt;/td&gt;
      &lt;td&gt;True&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Vincent Porter&lt;/td&gt;
      &lt;td&gt;24&lt;/td&gt;
      &lt;td&gt;False&lt;/td&gt;
      &lt;td&gt;False&lt;/td&gt;
      &lt;td&gt;True&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Tiffany Brown&lt;/td&gt;
      &lt;td&gt;30&lt;/td&gt;
      &lt;td&gt;True&lt;/td&gt;
      &lt;td&gt;True&lt;/td&gt;
      &lt;td&gt;True&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Brenda Small&lt;/td&gt;
      &lt;td&gt;36&lt;/td&gt;
      &lt;td&gt;True&lt;/td&gt;
      &lt;td&gt;False&lt;/td&gt;
      &lt;td&gt;False&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Dr. Tina Ayala&lt;/td&gt;
      &lt;td&gt;66&lt;/td&gt;
      &lt;td&gt;True&lt;/td&gt;
      &lt;td&gt;False&lt;/td&gt;
      &lt;td&gt;False&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Rodney Gonzalez&lt;/td&gt;
      &lt;td&gt;84&lt;/td&gt;
      &lt;td&gt;True&lt;/td&gt;
      &lt;td&gt;True&lt;/td&gt;
      &lt;td&gt;False&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;&lt;em&gt;Note&lt;/em&gt;: All data here is fake generated data, and likeness to a real person is entirely coincidental.&lt;/p&gt;

&lt;p&gt;We have &lt;em&gt;7&lt;/em&gt; people in total in this block. Alongside &lt;strong&gt;age&lt;/strong&gt;, we also have each resident’s &lt;strong&gt;smoking status&lt;/strong&gt;, &lt;strong&gt;employment status&lt;/strong&gt; and whether they are &lt;strong&gt;married&lt;/strong&gt; or not. From here, we publish a variety of statistics about this block. You have probably seen something similar if you’ve ever done a census.&lt;/p&gt;

&lt;p&gt;📓 To simplify the example, this fictional world has:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Two marriage statuses: Married (&lt;strong&gt;True&lt;/strong&gt;) or Single (&lt;strong&gt;False&lt;/strong&gt;)&lt;/li&gt;
  &lt;li&gt;Two smoking statuses: Non-Smoker (&lt;strong&gt;False&lt;/strong&gt;) or Smoker (&lt;strong&gt;True&lt;/strong&gt;)&lt;/li&gt;
  &lt;li&gt;Two employment statuses: Unemployed (&lt;strong&gt;False&lt;/strong&gt;) or Employed (&lt;strong&gt;True&lt;/strong&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👾 One additional piece of logic we know is that any statistic with a &lt;strong&gt;count of less than 3&lt;/strong&gt; is suppressed. Suppressing statistics with low counts is a common tactic for protecting privacy. The fewer people there are behind a statistic, the more they stick out in a dataset, meaning their privacy is at greater risk than that of people who ‘blend in with the crowd’. As we’ll see, simply knowing that a statistic is suppressed can even be used to attack a dataset.&lt;/p&gt;
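&lt;p&gt;As a quick sketch of how such a rule might work (the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;suppress&lt;/code&gt; function and its shape are illustrative, not from the paper), suppression simply withholds any statistic computed over fewer than 3 people:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;def suppress(stats, threshold=3):
    # withhold any statistic computed over fewer than `threshold` people
    return {name: (s if s['count'] &amp;gt;= threshold else None)
            for name, s in stats.items()}

published = suppress({
    'married-adults': {'count': 4, 'median-age': 51, 'mean-age': 54},
    'single-adults': {'count': 2, 'median-age': 21, 'mean-age': 21},
})
# 'single-adults' is withheld, 'married-adults' is published unchanged
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;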

&lt;p&gt;As Data Analysts working for Acme Data, we have been tasked with producing summary statistics that we can publish on our website for anyone to view. After running our analysis, this is the output we intend to publish:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;id&lt;/th&gt;
      &lt;th&gt;name&lt;/th&gt;
      &lt;th&gt;count&lt;/th&gt;
      &lt;th&gt;median-age&lt;/th&gt;
      &lt;th&gt;mean-age&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;A1&lt;/td&gt;
      &lt;td&gt;total-population&lt;/td&gt;
      &lt;td&gt;7.0&lt;/td&gt;
      &lt;td&gt;30.0&lt;/td&gt;
      &lt;td&gt;38.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;A2&lt;/td&gt;
      &lt;td&gt;non-smoker&lt;/td&gt;
      &lt;td&gt;4.0&lt;/td&gt;
      &lt;td&gt;30.0&lt;/td&gt;
      &lt;td&gt;33.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;B2&lt;/td&gt;
      &lt;td&gt;smoker&lt;/td&gt;
      &lt;td&gt;3.0&lt;/td&gt;
      &lt;td&gt;30.0&lt;/td&gt;
      &lt;td&gt;44.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;C2&lt;/td&gt;
      &lt;td&gt;unemployed&lt;/td&gt;
      &lt;td&gt;4.0&lt;/td&gt;
      &lt;td&gt;51.0&lt;/td&gt;
      &lt;td&gt;48.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;D2&lt;/td&gt;
      &lt;td&gt;employed&lt;/td&gt;
      &lt;td&gt;3.0&lt;/td&gt;
      &lt;td&gt;24.0&lt;/td&gt;
      &lt;td&gt;24.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;A3&lt;/td&gt;
      &lt;td&gt;single-adults&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;B3&lt;/td&gt;
      &lt;td&gt;married-adults&lt;/td&gt;
      &lt;td&gt;4.0&lt;/td&gt;
      &lt;td&gt;51.0&lt;/td&gt;
      &lt;td&gt;54.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;A4&lt;/td&gt;
      &lt;td&gt;unemployed-non-smoker&lt;/td&gt;
      &lt;td&gt;3.0&lt;/td&gt;
      &lt;td&gt;36.0&lt;/td&gt;
      &lt;td&gt;37.0&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The stat &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A1&lt;/code&gt; summarises the whole database: the count is the total number of individuals, the median age is the age that separates those individuals into two equal halves, and the mean age is the average age across everyone. The other stats show the same information for various cohorts.&lt;/p&gt;
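&lt;p&gt;For example, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A1&lt;/code&gt; row can be reproduced directly from the ages in the table above:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;from statistics import mean, median

# the seven ages from the Acme Data table
ages = [8, 18, 24, 30, 36, 66, 84]

count = len(ages)   # 7
med = median(ages)  # 30, the middle value when sorted
avg = mean(ages)    # 38, since the ages sum to 266 and 266 / 7 = 38
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;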

&lt;p&gt;Note that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A3&lt;/code&gt; has been suppressed in order to protect the identity of individuals who have a higher risk of being re-identified. What’s interesting about this stat is that the suppression itself is information we can encode into our model to help us come up with a better reconstruction. Since other stats (such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;D2&lt;/code&gt;) contain 3 people and are not suppressed, we can infer that fewer than 3 people represent this cohort.&lt;/p&gt;

&lt;p&gt;To encode these constraints into a model we can use to reconstruct the data, we can use a library such as &lt;a href=&quot;https://github.com/Z3Prover/z3&quot;&gt;Z3&lt;/a&gt;, a constraint solver: we describe the constraints, then ask it for an answer that satisfies them. Effectively, each stat above is a constraint we can model, and we can ask Z3 to generate the permutations of age, smoker status, employment status and married status that must exist in order to satisfy all the constraints. Modelling a single constraint looks like this:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;z3&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# create a solver object, that houses all our constraints
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;solver&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Solver&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# create representations of the variables we want to receive an answer for; such as ages
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ages&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Array&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;ages&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;IntSort&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;IntSort&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;())&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# define a constraint on these variables (we know there are 7 people, so we range over that number)
# the constraint we add here is to ensure all 7 people have a realistic age (between 0 and 125)
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;min_age&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;max_age&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;125&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;solver&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;add&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;z3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;And&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;z3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Select&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ages&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;min_age&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;z3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Select&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ages&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max_age&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;solver&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;check&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# this checks that our constraints can produce a valid model
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;solver&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# we can then access that model
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The constraints above produce a list of age values that fit within them. For example, the model we end up with might look like this:&lt;/p&gt;

&lt;div class=&quot;language-text highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[45, 34, 67, 34, 123, 1, 8]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Of course, there could be many permutations, and the model may output different answers depending on which one it finds first. With each new constraint added, we reduce the search space until, ideally, we get down to a single answer that fits all the constraints. At that point, we’ve reconstructed the database!&lt;/p&gt;
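&lt;p&gt;To give a flavour of how the published stats narrow the search space, here is the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A1&lt;/code&gt; row encoded as constraints on the same &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ages&lt;/code&gt; array (the median encoding below is one illustrative approach; the full implementation may do it differently):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import z3

n = 7
solver = z3.Solver()
ages = z3.Array('ages', z3.IntSort(), z3.IntSort())

# everyone has a realistic age
for i in range(n):
    solver.add(z3.Select(ages, i) &amp;gt;= 0, z3.Select(ages, i) &amp;lt;= 125)

# A1: the mean age is 38, so the seven ages must sum to 7 * 38 = 266
solver.add(z3.Sum([z3.Select(ages, i) for i in range(n)]) == 266)

# A1: the median age is 30; with 7 people the median is the 4th smallest
# value, so at least 4 ages are &amp;lt;= 30 and at least 4 ages are &amp;gt;= 30
solver.add(z3.Sum([z3.If(z3.Select(ages, i) &amp;lt;= 30, 1, 0) for i in range(n)]) &amp;gt;= 4)
solver.add(z3.Sum([z3.If(z3.Select(ages, i) &amp;gt;= 30, 1, 0) for i in range(n)]) &amp;gt;= 4)

solver.check()  # sat: the constraints are satisfiable
model = solver.model()
candidate = [model.eval(z3.Select(ages, i)).as_long() for i in range(n)]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Any &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;candidate&lt;/code&gt; the solver returns now has the correct total population, mean and median; each further stat we encode rules out more of the remaining candidates.&lt;/p&gt;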

&lt;p&gt;If you want to see this in action, check out &lt;a href=&quot;https://github.com/mitchelllisle/database-reconstruction-attacks&quot;&gt;this repo with a full implementation&lt;/a&gt;.&lt;/p&gt;
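&lt;p&gt;The suppression inference from earlier can be encoded in the same style. Assuming (for illustration) that an adult is anyone aged 18 or over, the suppressed &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A3&lt;/code&gt; stat becomes a constraint that fewer than 3 people are unmarried adults:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import z3

solver = z3.Solver()
ages = z3.Array('ages', z3.IntSort(), z3.IntSort())
married = z3.Array('married', z3.IntSort(), z3.BoolSort())

# count how many of the 7 residents are unmarried adults
single_adults = z3.Sum([
    z3.If(z3.And(z3.Select(ages, i) &amp;gt;= 18,
                 z3.Not(z3.Select(married, i))), 1, 0)
    for i in range(7)
])

# A3 is suppressed, so we know the cohort holds fewer than 3 people
solver.add(single_adults &amp;lt; 3)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;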

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;In this article, we’ve explored how aggregate data can do little to stop attackers from re-creating a database that is either very close to, or exactly the same as, the original data. It’s important to consider this when releasing data.&lt;/p&gt;

&lt;p&gt;Before we wrap up, you may be asking why this is possible. The answer comes from the same people who came up with the best technique we know of for protecting against this type of attack:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;“[Giving] overly accurate answers to too many questions will destroy privacy in a spectacular way”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Cynthia Dwork and Aaron Roth, authors of ‘The Algorithmic Foundations of Differential Privacy’&lt;/p&gt;

&lt;p&gt;The next question you may be asking is “How do I protect against this attack?”. A couple of things you can look at include:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://desfontain.es/privacy/friendly-intro-to-differential-privacy.html&quot;&gt;Differential privacy&lt;/a&gt;: DP is a great fit for protecting this type of data. In fact, the US Census Bureau has adopted DP to avoid disclosure of private information about individuals&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://martinfowler.com/bliki/Datensparsamkeit.html&quot;&gt;Data minimisation&lt;/a&gt;: Releasing too much information gives attackers a simpler reconstruction attack vector, so minimising the data you release is a simple way to limit what people can infer about it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you can, consult the privacy experts in your organisation and have them do a privacy review before sharing data with third parties or with the public.&lt;/p&gt;

&lt;p&gt;Thanks!&lt;/p&gt;
</description>
                <pubDate>Thu, 23 Feb 2023 12:01:35 +0000</pubDate>
                <link>/blog/database-reconstruction-attacks</link>
                <guid isPermaLink="true">/blog/database-reconstruction-attacks</guid>
                
                <category>privacy</category>
                
                <category>python</category>
                
                <category>databases</category>
                
                <category>z3</category>
                
                
            </item>
        
    </channel>
</rss>