<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Mitchell Lisle</title>
    <description></description>
    <link>/</link>
    <atom:link href="/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Thu, 04 Jun 2026 10:20:57 +0000</pubDate>
    <lastBuildDate>Thu, 04 Jun 2026 10:20:57 +0000</lastBuildDate>
    <generator>Jekyll v4.4.1</generator>
    
      <item>
        <title>Designing the Room</title>
        <description>&lt;p style=&quot;font-size: 0.85rem; color: var(--color-base-text-2); margin: 0 0 1.25rem;&quot;&gt;Photo by &lt;a href=&quot;https://unsplash.com/@andersjilden?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText&quot;&gt;Anders Jildén&lt;/a&gt; on &lt;a href=&quot;https://unsplash.com/photos/low-angle-photography-of-gray-building-at-daytime-Sc5RKXLBjGg?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText&quot;&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In 1941, Mervin Kelly designed the new &lt;a href=&quot;https://en.wikipedia.org/wiki/Bell_Labs&quot;&gt;Bell Labs&lt;/a&gt; building at Murray Hill with a single, deliberate constraint: one long corridor. Not separate wings for physics and chemistry and engineering. One corridor, so that every researcher walking to their office would pass everyone else. Kelly believed the building’s layout would determine the quality of what happened inside it. He was right. The corridor produced the transistor, information theory, Unix, and the C language. The room did something. That was the point.&lt;/p&gt;

&lt;p&gt;The Bell Labs example hints at something worth exploring: the environments we design for teams shape performance in ways that often go unrecognized. It suggests the environment matters as much as, if not more than, traditional notions of leadership.&lt;/p&gt;

&lt;p&gt;Research and experience suggest that many factors affecting team performance relate to environmental design rather than individual traits. Who gets heard in a meeting. Whether mistakes are escalated or buried. How work is assigned and explained. Whether someone feels safe saying “I don’t know.” These appear to be shaped by environmental settings—the conditions and norms of how teams operate—rather than by personality traits alone. Managers, whether intentionally or not, often help create these conditions.&lt;/p&gt;

&lt;div style=&quot;text-align: center; margin: 3rem auto; max-width: 640px;&quot;&gt;
  &lt;p style=&quot;font-size: 0.9rem; color: var(--color-base-text); margin: 0 0 1rem; font-weight: 500;&quot;&gt;Behaviour is a function of the person and the environment&lt;/p&gt;
  &lt;div style=&quot;font-size: 5.5rem; font-weight: 800; color: var(--color-base-text); font-family: &apos;Courier New&apos;, Courier, monospace; letter-spacing: -0.05em; font-style: italic;&quot;&gt;B = f(P, E)&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Kurt_Lewin&quot;&gt;Kurt Lewin&lt;/a&gt; spent decades trying to convince people of a deceptively simple idea: behaviour is a function of both the person and the environment, and the environment is systematically underweighted in how we explain what happens. His equation, B = f(P, E), looks like a truism until you start noticing how rarely the E is examined. When a data team underdelivers, the conversation almost always focuses on P. Skill gaps. Attitude. The wrong hire. The individual who isn’t stepping up or who doesn’t get what the business is trying to do. The E barely gets a mention. Nobody says “what is it about how we structured this work, or ran these meetings, or handled the last incident, that produced this behaviour?” Yet Lewin’s claim, and the evidence that has accumulated around it since, is that the E is doing most of the work.&lt;/p&gt;

&lt;div id=&quot;lewin-viz&quot; style=&quot;font-family: Georgia, serif; margin: 2rem 0; padding: 2rem 1.5rem; color: var(--color-base-text); background: var(--color-base-bg); border: 1px solid var(--color-base-bg-3); border-radius: 4px;&quot;&gt;
  &lt;h2 style=&quot;font-size: 1.25rem; font-weight: 600; margin: 0 0 0.4rem; color: var(--color-base-text);&quot;&gt;B = f(P, E): Where Are You Looking?&lt;/h2&gt;
  &lt;p style=&quot;font-size: 0.9rem; color: var(--color-base-text-2); margin: 0 0 1.5rem; line-height: 1.5;&quot;&gt;When behaviour on your team goes wrong, how much of your diagnosis focuses on the person versus the environment?&lt;/p&gt;
  &lt;div style=&quot;display: flex; align-items: center; gap: 0.75rem; margin-bottom: 0.75rem;&quot;&gt;
    &lt;span style=&quot;font-size: 0.85rem; font-weight: 600; color: #b94a2c; white-space: nowrap;&quot;&gt;Person (P)&lt;/span&gt;
    &lt;input id=&quot;lewin-slider&quot; type=&quot;range&quot; min=&quot;0&quot; max=&quot;100&quot; value=&quot;70&quot; style=&quot;flex: 1; accent-color: #555; cursor: pointer;&quot; /&gt;
    &lt;span style=&quot;font-size: 0.85rem; font-weight: 600; color: #2c6e49; white-space: nowrap;&quot;&gt;Environment (E)&lt;/span&gt;
  &lt;/div&gt;
  &lt;div style=&quot;display: flex; height: 2rem; border-radius: 3px; overflow: hidden; margin-bottom: 0.5rem;&quot;&gt;
    &lt;div id=&quot;bar-p&quot; style=&quot;background: #b94a2c; display: flex; align-items: center; justify-content: center; transition: width 0.15s ease; min-width: 2rem; width: 70%;&quot;&gt;&lt;span id=&quot;bar-p-text&quot; style=&quot;font-size: 0.75rem; color: white; font-weight: 600;&quot;&gt;70%&lt;/span&gt;&lt;/div&gt;
    &lt;div id=&quot;bar-e&quot; style=&quot;background: #2c6e49; display: flex; align-items: center; justify-content: center; transition: width 0.15s ease; min-width: 2rem; width: 30%;&quot;&gt;&lt;span id=&quot;bar-e-text&quot; style=&quot;font-size: 0.75rem; color: white; font-weight: 600;&quot;&gt;30%&lt;/span&gt;&lt;/div&gt;
  &lt;/div&gt;
  &lt;div id=&quot;framing-label&quot; style=&quot;text-align: center; font-size: 0.85rem; color: var(--color-base-text-2); margin-bottom: 1.5rem; font-style: italic;&quot;&gt;Person-heavy&lt;/div&gt;
  &lt;div style=&quot;display: grid; grid-template-columns: 1fr 1fr; gap: 1rem; margin-bottom: 1.5rem;&quot;&gt;
    &lt;div&gt;
      &lt;h3 style=&quot;font-size: 0.8rem; font-weight: 600; margin: 0 0 0.3rem; text-transform: uppercase; letter-spacing: 0.04em; color: #b94a2c;&quot;&gt;Interventions you&apos;ll reach for&lt;/h3&gt;
      &lt;div id=&quot;person-interventions&quot;&gt;&lt;/div&gt;
    &lt;/div&gt;
    &lt;div&gt;
      &lt;h3 style=&quot;font-size: 0.8rem; font-weight: 600; margin: 0 0 0.3rem; text-transform: uppercase; letter-spacing: 0.04em; color: #2c6e49;&quot;&gt;Interventions you&apos;ll reach for&lt;/h3&gt;
      &lt;div id=&quot;env-interventions&quot;&gt;&lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
  &lt;p style=&quot;font-size: 0.8rem; color: var(--color-base-text-2); margin: 0; line-height: 1.5; border-top: 1px solid var(--color-base-bg-3); padding-top: 1rem;&quot;&gt;Lewin&apos;s equation is multiplicative in effect. A great environment enables people who would otherwise underperform. The slider shows how the framing of a problem shapes the solutions you see.&lt;/p&gt;
&lt;/div&gt;

&lt;script&gt;
(function() {
  var personInterventions = [&quot;Performance improvement plan&quot;, &quot;Training or coaching&quot;, &quot;Role reassignment&quot;, &quot;Hire differently next time&quot;];
  var envInterventions = [&quot;Redesign meeting norms&quot;, &quot;Clarify how decisions get made&quot;, &quot;Change how errors are handled&quot;, &quot;Audit what silence costs&quot;, &quot;Reduce over-specification of work&quot;];

  function renderItem(text, type) {
    var isDark = document.documentElement.getAttribute(&apos;data-bs-theme&apos;) === &apos;dark&apos;;
    var bg = isDark ? (type === &apos;p&apos; ? &apos;#3d2420&apos; : &apos;#1d3d2f&apos;) : (type === &apos;p&apos; ? &apos;#fdf0ed&apos; : &apos;#edf5f1&apos;);
    var border = type === &apos;p&apos; ? &apos;#b94a2c&apos; : &apos;#2c6e49&apos;;
    var color = isDark ? &apos;#f3f4f6&apos; : (type === &apos;p&apos; ? &apos;#5a2010&apos; : &apos;#1a4030&apos;);
    return &apos;&lt;div style=&quot;font-size:0.85rem;padding:0.4rem 0.6rem;border-radius:3px;line-height:1.4;background:&apos; + bg + &apos;;border-left:3px solid &apos; + border + &apos;;color:&apos; + color + &apos;;margin-bottom:0.4rem;&quot;&gt;&apos; + text + &apos;&lt;/div&gt;&apos;;
  }

  function update(pWeight) {
    var eWeight = 100 - pWeight;
    document.getElementById(&apos;bar-p&apos;).style.width = pWeight + &apos;%&apos;;
    document.getElementById(&apos;bar-e&apos;).style.width = eWeight + &apos;%&apos;;
    document.getElementById(&apos;bar-p-text&apos;).textContent = pWeight + &apos;%&apos;;
    document.getElementById(&apos;bar-e-text&apos;).textContent = eWeight + &apos;%&apos;;
    var label = pWeight &gt; 60 ? &apos;Person-heavy&apos; : pWeight &lt; 40 ? &apos;Environment-heavy&apos; : &apos;Balanced&apos;;
    document.getElementById(&apos;framing-label&apos;).textContent = label;
    var pCount = Math.ceil((pWeight / 100) * personInterventions.length);
    var eCount = Math.ceil((eWeight / 100) * envInterventions.length);
    var pHtml = pCount === 0 ? &apos;&lt;div style=&quot;font-size:0.85rem;color:var(--color-base-text-2);font-style:italic;&quot;&gt;None&lt;/div&gt;&apos; : personInterventions.slice(0, pCount).map(function(t){ return renderItem(t, &apos;p&apos;); }).join(&apos;&apos;);
    var eHtml = eCount === 0 ? &apos;&lt;div style=&quot;font-size:0.85rem;color:var(--color-base-text-2);font-style:italic;&quot;&gt;None&lt;/div&gt;&apos; : envInterventions.slice(0, eCount).map(function(t){ return renderItem(t, &apos;e&apos;); }).join(&apos;&apos;);
    document.getElementById(&apos;person-interventions&apos;).innerHTML = pHtml;
    document.getElementById(&apos;env-interventions&apos;).innerHTML = eHtml;
  }

  var slider = document.getElementById(&apos;lewin-slider&apos;);
  slider.addEventListener(&apos;input&apos;, function() { update(parseInt(this.value, 10)); });
  update(70);
})();
&lt;/script&gt;

&lt;p&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Barry_Schwartz_(psychologist)&quot;&gt;Barry Schwartz&lt;/a&gt; traced the same logic through the history of work itself. Industrial job design started with an assumption: workers are essentially lazy and will do as little as possible unless compelled. So work was designed accordingly: fragmented, supervised, stripped of discretion. The workers duly behaved as assumed. Schwartz’s point is that this was not a discovery about human nature. It was an experiment in environmental design, and it produced exactly the behaviour it expected. Lewin would have recognised it as a predictable application of his equation. The data equivalent is alive and well: if you treat a team’s work as a service desk, structure it around ticket throughput, and specify requests so tightly that no judgment is required, you will produce a team that behaves like a service desk. You designed the room. Don’t be surprised by who shows up to sit in it.&lt;/p&gt;

&lt;p&gt;The most important piece of empirical evidence on team performance comes from &lt;a href=&quot;https://psychsafety.com/googles-project-aristotle/&quot;&gt;Google’s Project Aristotle&lt;/a&gt;, which spent two years studying 180 teams and arrived at a conclusion that surprised even the researchers: who is on the team explains very little about how the team performs. What explains it is how the team operates. The dominant predictor, by a distance, was psychological safety, the shared belief that it is safe to speak up, ask questions, surface mistakes, and disagree, without it counting against you. Not talent density. Not technical depth. The quality of the environment, specifically: whether people experience it as safe to be honest. &lt;a href=&quot;https://www.ted.com/talks/margaret_heffernan_why_it_s_time_to_forget_the_pecking_order_at_work&quot;&gt;Margaret Heffernan&lt;/a&gt; reached the same conclusion from a different direction, observing that teams built around individual star performance (she calls it the super-chicken model, and the label earns its keep) consistently underperform teams built around social capital, mutual helpfulness, and equal voice. The room matters more than the roster.&lt;/p&gt;

&lt;p&gt;What does “designing the environment” actually mean in practice? It means that when a data engineer stays quiet in a planning meeting, the most interesting question is not “why is she disengaged?” but “what signal has this room been sending about what happens when you speak up?” It means that when you explain the context behind a piece of work, not just the specification, you are supporting the autonomy that &lt;a href=&quot;https://en.wikipedia.org/wiki/Self-determination_theory&quot;&gt;Deci and Ryan&lt;/a&gt; identified as a basic psychological need, the one whose frustration reliably degrades quality and creativity. It means that when something goes wrong and you respond with curiosity rather than blame, you are calibrating the room’s safety level, and that calibration will echo in how your team handles the next ten things that go wrong. Tony Manganiello’s observation that loyalty is built in moments of tension is exactly this: the tension is the test of the environment, and people are watching closely. They are always watching closely.&lt;/p&gt;

&lt;p&gt;The practical implication is that almost everything you do as a data leader is environmental design, whether you think of it that way or not. The way you run a retrospective. Whether you share the reasoning behind a priority call or just the call. Whether you let a dominant voice run the room or actively create space. Whether errors get surfaced to you or hidden from you, which is itself a readout of safety. These are not soft decisions. They are architectural ones.&lt;/p&gt;

&lt;p&gt;Lewin had a phrase worth keeping: there is nothing so practical as a good theory. The theory here is simple. The people on your team are largely behaving in response to the environment you have built for them. Improving that environment will produce better behaviour more reliably than improving the individuals. Which means the most useful question when auditing your team’s performance is not “who are the problems?” It is “what is the room doing?”&lt;/p&gt;
</description>
        <pubDate>Wed, 03 Jun 2026 14:00:00 +0000</pubDate>
        <link>/blog/2026-06-04-designing-the-room/</link>
        <guid isPermaLink="true">/blog/2026-06-04-designing-the-room/</guid>
        
        <category>leadership</category>
        
        <category>management</category>
        
        <category>data teams</category>
        
        <category>environment</category>
        
        <category>psychology</category>
        
        
      </item>
    
      <item>
        <title>Why Estimates Always Lie — And What to Do About It</title>
        <description>&lt;p&gt;Ask any developer how long a project will take, and then ask again once it’s done. The numbers will rarely match. This isn’t an occasional failure — it’s one of the most consistent and documented patterns in software development.&lt;/p&gt;

&lt;p&gt;And yet, most of us just keep doing it the same way. We stare at a Jira board, assign story points, add a 20% buffer, and hand something over with a confidence we fundamentally do not have. Then we spend months quietly explaining why things are taking longer than expected.&lt;/p&gt;

&lt;p&gt;I wanted to understand &lt;em&gt;why&lt;/em&gt; estimation fails so reliably — and build something that makes it harder to lie to yourself.&lt;/p&gt;

&lt;h2 id=&quot;the-problem-isnt-laziness&quot;&gt;The Problem Isn’t Laziness&lt;/h2&gt;

&lt;p&gt;The tempting narrative is that developers are just bad at scoping things. They’re naive optimists who forget about edge cases and tech debt. Fix the person, fix the problem.&lt;/p&gt;

&lt;p&gt;But that framing misses what’s actually going on.&lt;/p&gt;

&lt;p&gt;When you estimate a project, you almost always estimate &lt;em&gt;the work&lt;/em&gt; — the actual build. The feature, the screen, the API endpoint. The thing you can see and reason about. The thing that ends up in your Jira ticket.&lt;/p&gt;

&lt;p&gt;The problem is that the work is never just the work.&lt;/p&gt;

&lt;p&gt;Every project comes wrapped in a thick layer of invisible effort that we don’t put in the Jira ticket because it isn’t &lt;em&gt;the thing&lt;/em&gt; we’re building. It’s the meetings, the config, the debugging sessions, the backslide on a dependency upgrade, the scope conversations, the infrastructure that breaks on a Friday afternoon.&lt;/p&gt;

&lt;p&gt;We’re not bad at estimating &lt;em&gt;the work&lt;/em&gt;. We’re consistently ignoring everything around it.&lt;/p&gt;

&lt;h2 id=&quot;dave-stewarts-taxonomy-of-invisible-work&quot;&gt;Dave Stewart’s Taxonomy of Invisible Work&lt;/h2&gt;

&lt;p&gt;A few years ago, Dave Stewart published a &lt;a href=&quot;https://davestewart.co.uk/blog/work/project-estimation/&quot;&gt;fantastic deep dive&lt;/a&gt; on why projects always take longer — the result of a brutal postmortem on a project that ran far, far over. He also published an &lt;a href=&quot;https://gist.github.com/davestewart/643ffc55aa7c173618d2707b776a1443&quot;&gt;accompanying gist&lt;/a&gt; that catalogues, in painstaking detail, all the things you don’t think about when you quote for a project.&lt;/p&gt;

&lt;p&gt;Reading it is one of those experiences where you nod continuously while quietly reflecting on every project you’ve ever been part of.&lt;/p&gt;

&lt;p&gt;His key insight is that project work can be broken into distinct categories, and only one of them is what we actually estimate:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Category&lt;/th&gt;
      &lt;th&gt;What it means&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;The work around the work&lt;/td&gt;
      &lt;td&gt;Meetings, reviews, project management&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;The work to get the work&lt;/td&gt;
      &lt;td&gt;Research, scoping, quoting, pitching&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;The work before the work&lt;/td&gt;
      &lt;td&gt;Setup, config, infrastructure, services&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;The work&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;The actual build, product, design, docs, tests&lt;/strong&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;The work between the work&lt;/td&gt;
      &lt;td&gt;Debugging, refactoring, iteration, tooling&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;The work beyond the work&lt;/td&gt;
      &lt;td&gt;Scope creep, omissions, nice-to-haves&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;The work outside the work&lt;/td&gt;
      &lt;td&gt;Surprises, contingency, unknown unknowns&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;The work after the work&lt;/td&gt;
      &lt;td&gt;Hosting, deployment, security, ongoing support&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Looking at this list, the &lt;em&gt;actual work&lt;/em&gt; — the thing that goes in the estimate — is one entry out of eight. And Dave’s rough analysis suggests execution might represent as little as &lt;strong&gt;20% of total project effort&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That number feels extreme until you think about the last project you shipped. How much time was spent in stakeholder meetings? How long did the initial environment setup take? How many days got consumed by a third-party API that didn’t behave as documented? How many afternoons were eaten by “quick questions” that turned into scope renegotiations?&lt;/p&gt;

&lt;p&gt;Add it all up honestly and 20% starts to seem plausible. Maybe even generous.&lt;/p&gt;

&lt;h2 id=&quot;why-we-keep-getting-it-wrong&quot;&gt;Why We Keep Getting It Wrong&lt;/h2&gt;

&lt;p&gt;There are a few cognitive traps that make this so persistent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The planning fallacy&lt;/strong&gt; — our tendency to anchor on best-case scenarios and discount known risks — is well-documented. We don’t just forget the adjacent work; we actively don’t want to include it because doing so makes the estimate “more expensive” and harder to sell.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Invisible work is invisible.&lt;/strong&gt; If it doesn’t have a ticket, it doesn’t exist in the estimate. But it still exists in the calendar.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We estimate outcomes, not processes.&lt;/strong&gt; “Build a search feature” gets an estimate. “Spend two days understanding why Elasticsearch index updates are inconsistent across environments” doesn’t. But the second thing is what actually happens.&lt;/p&gt;

&lt;p&gt;The practical result is that estimates consistently represent a best-case path through the &lt;em&gt;visible&lt;/em&gt; work, while everything else accumulates silently.&lt;/p&gt;

&lt;h2 id=&quot;building-a-better-tool&quot;&gt;Building a Better Tool&lt;/h2&gt;

&lt;p&gt;I built &lt;a href=&quot;https://mitchelllisle.github.io/true-estimate/&quot;&gt;true-estimate&lt;/a&gt; to make this hidden work visible — and to make it slightly harder to accidentally produce a naive estimate.&lt;/p&gt;

&lt;p&gt;The tool is directly inspired by Dave Stewart’s framework. Rather than a flat list of tasks, it organises your estimate into the eight phases above. You can add tasks under each phase with optional week estimates. As you fill it in, you get three numbers:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Estimated&lt;/strong&gt; — only the execution work, the thing you’d normally quote&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Hidden&lt;/strong&gt; — everything outside the execution phase&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Total&lt;/strong&gt; — what it actually costs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal isn’t to produce a precise forecast, because that’s largely impossible. It’s to force the question: &lt;em&gt;what am I not accounting for?&lt;/em&gt; The admin load, the setup time, the inevitable bugs and scope conversations — they’re going to happen regardless of whether you estimate them. The only question is whether you’re planning for them or absorbing them silently.&lt;/p&gt;
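
&lt;p&gt;As a rough illustration of how the three numbers relate, here is a minimal sketch. It is not the true-estimate source code: the &lt;code&gt;phases&lt;/code&gt; array, the &lt;code&gt;execution&lt;/code&gt; flag, the &lt;code&gt;summarise&lt;/code&gt; function, and the week values are all hypothetical, chosen only to show the arithmetic.&lt;/p&gt;

```javascript
// Hypothetical phase data: names follow the table above, week values are made up.
var phases = [
  { name: 'The work around the work', execution: false, weeks: [1, 0.5] },
  { name: 'The work before the work', execution: false, weeks: [1] },
  { name: 'The work', execution: true, weeks: [3, 1] },
  { name: 'The work between the work', execution: false, weeks: [2] },
  { name: 'The work after the work', execution: false, weeks: [0.5] }
];

// Add up a list of week estimates.
function sum(values) {
  return values.reduce(function (total, value) { return total + value; }, 0);
}

// Split the weeks into the quoted (execution) part and the hidden remainder.
function summarise(allPhases) {
  var estimated = 0;
  var hidden = 0;
  allPhases.forEach(function (phase) {
    var weeks = sum(phase.weeks);
    if (phase.execution) {
      estimated += weeks;
    } else {
      hidden += weeks;
    }
  });
  return { estimated: estimated, hidden: hidden, total: estimated + hidden };
}

var result = summarise(phases);
console.log(result); // { estimated: 4, hidden: 5, total: 9 }
```

&lt;p&gt;In this made-up example the quoted figure is 4 weeks while the total is 9, so more than half of the effort never appears in the estimate.&lt;/p&gt;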

&lt;p&gt;There’s also a sample project you can load to see what a realistic breakdown might look like. The hidden work being consistently larger than the estimated work is, in my experience, not a bug in the sample — it’s about right.&lt;/p&gt;

&lt;h2 id=&quot;an-honest-estimate-isnt-a-pessimistic-one&quot;&gt;An Honest Estimate Isn’t a Pessimistic One&lt;/h2&gt;

&lt;p&gt;There’s sometimes a reluctance to estimate comprehensively because it feels like pessimism or padding. If you include two weeks for “general iteration and debugging,” it looks like you’re hedging. Shouldn’t a good developer be more efficient than that?&lt;/p&gt;

&lt;p&gt;But this is exactly backwards. An honest estimate is a professional one. It signals that you understand how software projects actually work — that there is always invisible work, always iteration, always surprises. Hiding that work doesn’t make it go away. It just means someone absorbs it unexpectedly, whether that’s you, the project timeline, or the client.&lt;/p&gt;

&lt;p&gt;The developers and teams who build trust over time are the ones whose estimates are reliable — not necessarily short.&lt;/p&gt;

&lt;h2 id=&quot;try-it&quot;&gt;Try It&lt;/h2&gt;

&lt;p&gt;If you’ve got a project in front of you — a new feature, a refactor, a greenfield build — give &lt;a href=&quot;https://mitchelllisle.github.io/true-estimate/&quot;&gt;true-estimate&lt;/a&gt; a try before you submit that Jira estimate. Work through each phase and be honest about what you’re probably going to spend time on. Then compare your execution estimate to the total.&lt;/p&gt;

&lt;p&gt;The gap between those two numbers is the amount of work you were planning to do for free.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;em&gt;The true-estimate tool is open source — code is &lt;a href=&quot;https://github.com/mitchelllisle/true-estimate&quot;&gt;on GitHub&lt;/a&gt;. Dave Stewart’s original article, which inspired the structure, is &lt;a href=&quot;https://davestewart.co.uk/blog/work/project-estimation/&quot;&gt;well worth reading in full&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
</description>
        <pubDate>Fri, 20 Mar 2026 13:00:00 +0000</pubDate>
        <link>/blog/2026-03-21-why-estimates-always-lie/</link>
        <guid isPermaLink="true">/blog/2026-03-21-why-estimates-always-lie/</guid>
        
        <category>software</category>
        
        <category>estimation</category>
        
        <category>productivity</category>
        
        <category>tools</category>
        
        
      </item>
    
      <item>
        <title>Mapping Fire: Five Decades of Bushfires in NSW</title>
        <description>&lt;p&gt;Australia and fire are inseparable. For millennia, bushfires have shaped our landscapes, ecology, and communities. But as our climate changes and populations grow along bushland fringes, understanding fire patterns has never been more important.&lt;/p&gt;

&lt;p&gt;I’ve built an &lt;a href=&quot;https://mitchelllisle.github.io/fires-nsw-dashboard/&quot;&gt;interactive dashboard&lt;/a&gt; that explores over 50 years of fire history in New South Wales—from 1970 to 2024. Using data from NSW’s Department of Planning, Industry and Environment, it tells the story of where, when, and how fires have burned across the state.&lt;/p&gt;

&lt;h2 id=&quot;what-the-data-reveals&quot;&gt;What the Data Reveals&lt;/h2&gt;

&lt;p&gt;Since 1970, NSW has recorded &lt;strong&gt;18,814 fire events&lt;/strong&gt;, burning more than &lt;strong&gt;15 million hectares&lt;/strong&gt;, roughly 2% of Australia’s entire landmass. These fires fall into two categories: wildfires (11,503 events), which have burnt 14 million hectares, and prescribed burns (7,311 events), which are used for hazard reduction and have cleared 1.8 million hectares.&lt;/p&gt;

&lt;p&gt;The numbers alone don’t capture the human cost. The dashboard documents the deadliest fires, including the Badja Forest Road fire that claimed six lives during Black Summer, and the Green Wattle Creek fire that killed two volunteer firefighters when a tree struck their tanker.&lt;/p&gt;

&lt;h2 id=&quot;patterns-in-time-and-space&quot;&gt;Patterns in Time and Space&lt;/h2&gt;

&lt;p&gt;The visualisations reveal several clear patterns:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Geographic clustering&lt;/strong&gt; shows fires concentrate heavily along coastal ranges where eucalypt forests meet urban development. The Blue Mountains and Central Coast are among the most fire-prone areas, with some locations experiencing dozens of fire events over the period.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Seasonal cycles&lt;/strong&gt; are stark—summer and early autumn (December to March) dominate fire activity. But the 2019-2020 season broke patterns with unprecedented late-spring fires, signalling how changing conditions are shifting traditional fire seasons.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Drought years stand out&lt;/strong&gt;. Wildfire frequency spikes dramatically during major droughts, particularly 2001-2002 and 2019-2020. Meanwhile, prescribed burns maintain a relatively steady baseline as fire services work to reduce fuel loads.&lt;/p&gt;

&lt;h2 id=&quot;the-black-summer-context&quot;&gt;The Black Summer Context&lt;/h2&gt;

&lt;p&gt;The 2015-2020 period saw the most area burnt in any five-year window, driven entirely by the catastrophic Black Summer fires of 2019-2020. That season alone burnt over &lt;strong&gt;5 million hectares&lt;/strong&gt;—dwarfing every previous year on record.&lt;/p&gt;

&lt;p&gt;Three fires during Black Summer deserve particular attention. The Gospers Mountain fire, started by a single lightning strike, ultimately burned 512,626 hectares and merged with five other fires into a megablaze exceeding one million hectares. The Currowan fire earned the name “The Forever Fire” for its 74-day duration. The Badja Forest Road fire travelled 40 kilometres in hours under catastrophic conditions, destroying 418 homes around Cobargo on New Year’s Eve.&lt;/p&gt;

&lt;p&gt;Only 87 fires since 1970 have exceeded 50,000 hectares. Nearly all were sparked by lightning strikes in remote bushland during extreme drought conditions. The dashboard shows how these mega-fires cluster in summer months when temperatures peak and fuel is driest.&lt;/p&gt;

&lt;h2 id=&quot;why-this-matters&quot;&gt;Why This Matters&lt;/h2&gt;

&lt;p&gt;This isn’t just historical data—it’s a window into our future. Fire seasons now start earlier, last longer, and burn with unprecedented intensity. Understanding these patterns helps us prepare.&lt;/p&gt;

&lt;p&gt;The dashboard shows how fires behave under different conditions, where they’re most likely to occur, and which periods have been most destructive. For anyone living in NSW or interested in fire management, these patterns matter.&lt;/p&gt;

&lt;p&gt;It’s also worth noting what this data doesn’t capture. Historical records, especially pre-1990s, vary in accuracy. Fire boundaries are approximations. Some casualties may be unrecorded. The true human toll of these fires extends far beyond the numbers—displaced communities, destroyed homes, psychological trauma, and ecosystems fundamentally altered.&lt;/p&gt;

&lt;h2 id=&quot;building-the-dashboard&quot;&gt;Building the Dashboard&lt;/h2&gt;

&lt;p&gt;I built this using Observable Framework with data from NSW’s Department of Planning, Industry and Environment. The dataset includes every recorded fire since 1970, with details on location, size, type, and timing. I’ve supplemented this with research from official inquiries and historical records to document casualties and home losses for the largest fires.&lt;/p&gt;

&lt;p&gt;The goal was to make complex fire data accessible and interactive. You can explore specific years, compare wildfire versus prescribed burn patterns, see seasonal variations, and understand which areas face the highest risk.&lt;/p&gt;

&lt;h2 id=&quot;looking-ahead&quot;&gt;Looking Ahead&lt;/h2&gt;

&lt;p&gt;Fire is part of Australia’s identity. Aboriginal Australians used fire as a land management tool for over 60,000 years. But the scale and intensity of modern fires, driven by climate change, fuel accumulation, and expanding urban-bushland interfaces, present challenges we’re still learning to navigate.&lt;/p&gt;

&lt;p&gt;This dashboard doesn’t offer solutions, but it does offer context. By seeing how fires have behaved over five decades, we can better understand what we’re facing and where we need to focus our efforts in fire management, hazard reduction, and community preparedness.&lt;/p&gt;

&lt;p&gt;Explore the dashboard at &lt;a href=&quot;https://mitchelllisle.github.io/fires-nsw-dashboard/&quot;&gt;mitchelllisle.github.io/fires-nsw-dashboard&lt;/a&gt; and see what patterns emerge from half a century of fire history.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;em&gt;Explore the dashboard: &lt;a href=&quot;https://mitchelllisle.github.io/fires-nsw-dashboard/&quot;&gt;History of Bushfires in NSW&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Data source: &lt;a href=&quot;https://datasets.seed.nsw.gov.au/dataset/fire-history-wildfires-and-prescribed-burns-1e8b6&quot;&gt;NSW DPIE Fire History Dataset&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
</description>
        <pubDate>Sat, 03 Jan 2026 13:00:00 +0000</pubDate>
        <link>/blog/2026-01-04-nsw-fires-dashboard/</link>
        <guid isPermaLink="true">/blog/2026-01-04-nsw-fires-dashboard/</guid>
        
        <category>data</category>
        
        <category>visualisation</category>
        
        <category>australia</category>
        
        <category>climate</category>
        
        
      </item>
    
      <item>
        <title>Too Unique to Hide: Understanding Re-identification Risk in Australia</title>
        <description>&lt;p&gt;We’ve all been told that our data is “de-identified” or “anonymised.” Healthcare providers, government agencies, and companies assure us that after removing names and addresses, our information is safe. But how safe is it really?&lt;/p&gt;

&lt;p&gt;This question led me to create &lt;a href=&quot;https://mitchelllisle.github.io/too-unique-to-hide-aus/&quot;&gt;Too Unique to Hide&lt;/a&gt;, an interactive calculator that shows Australians how identifiable they might be from supposedly anonymous datasets.&lt;/p&gt;

&lt;h2 id=&quot;how-unique-are-you&quot;&gt;How Unique Are You?&lt;/h2&gt;

&lt;p&gt;Even without your name or address, a few basic demographic facts can be quite distinctive. A combination of your postcode, age group, gender, and occupation might sound generic—but together, they can create a unique profile.&lt;/p&gt;

&lt;p&gt;The calculator uses real Australian Bureau of Statistics (ABS) census data to show this. Enter your details, and it shows how many people in Australia share that same demographic profile. The results can be surprising.&lt;/p&gt;

&lt;h2 id=&quot;understanding-the-numbers&quot;&gt;Understanding the Numbers&lt;/h2&gt;

&lt;p&gt;When fewer people share your characteristics, linking different datasets becomes easier. For example, if an organisation releases “anonymous” health data with postcode, age, and gender, it’s possible that cross-referencing with other datasets could reveal identities—especially in smaller population groups.&lt;/p&gt;
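
&lt;p&gt;As a rough illustration of that cross-referencing, the sketch below joins a made-up “anonymous” health release against a made-up public list on postcode, age group and gender. Every record, the field names and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;link_records&lt;/code&gt; helper are hypothetical and are not taken from the calculator’s code:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;
```python
from collections import defaultdict

# Hypothetical "de-identified" release: names removed, demographics kept.
health_release = [
    {"postcode": "2000", "age_group": "30-34", "gender": "F", "condition": "asthma"},
    {"postcode": "2000", "age_group": "30-34", "gender": "M", "condition": "diabetes"},
    {"postcode": "2850", "age_group": "60-64", "gender": "M", "condition": "arthritis"},
]

# Hypothetical public dataset that still carries names.
public_list = [
    {"name": "Person A", "postcode": "2000", "age_group": "30-34", "gender": "F"},
    {"name": "Person B", "postcode": "2000", "age_group": "30-34", "gender": "M"},
    {"name": "Person C", "postcode": "2000", "age_group": "30-34", "gender": "M"},
    {"name": "Person D", "postcode": "2850", "age_group": "60-64", "gender": "M"},
]

QUASI_IDENTIFIERS = ("postcode", "age_group", "gender")


def link_records(release, public):
    """Return (name, condition) pairs where the profile matches exactly one person."""
    names_by_profile = defaultdict(list)
    for person in public:
        profile = tuple(person[field] for field in QUASI_IDENTIFIERS)
        names_by_profile[profile].append(person["name"])

    matches = []
    for row in release:
        profile = tuple(row[field] for field in QUASI_IDENTIFIERS)
        candidates = names_by_profile.get(profile, [])
        if len(candidates) == 1:  # a unique profile re-identifies the row
            matches.append((candidates[0], row["condition"]))
    return matches


reidentified = link_records(health_release, public_list)
print(reidentified)
```
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Only rows whose profile belongs to a single person are linked. The row whose profile matches two people stays ambiguous, which is why larger groups lower the risk.&lt;/p&gt;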

&lt;p&gt;The calculator shows four risk categories based on how many people match your profile, from very high risk (fewer than 10 matches) to lower risk (1,000+ matches). These estimates help you understand your potential visibility in anonymised datasets.&lt;/p&gt;

&lt;h2 id=&quot;real-world-examples&quot;&gt;Real-World Examples&lt;/h2&gt;

&lt;p&gt;Re-identification isn’t just theoretical. In 2016, the Australian Department of Health released “de-identified” Medicare data, but researchers showed it was possible to re-identify individuals, leading to the dataset being withdrawn. Similar issues arose with Netflix viewing data and location tracking from apps.&lt;/p&gt;

&lt;p&gt;Most often, this isn’t about bad actors—it’s organisations not fully appreciating how unique demographic combinations can be when sharing data for legitimate research or policy purposes.&lt;/p&gt;

&lt;h2 id=&quot;the-combination-effect&quot;&gt;The Combination Effect&lt;/h2&gt;

&lt;p&gt;Each demographic factor on its own is common. Millions share your age group or postcode. But combine them with gender and occupation, and you’re often in a much smaller group. The calculator visualises this, showing how rare you are for each attribute individually and combined.&lt;/p&gt;
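
&lt;p&gt;The effect is easy to reproduce on a toy dataset. The sketch below is only an illustration: the six records and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;count_matches&lt;/code&gt; helper are hypothetical, whereas the real calculator works from ABS census counts:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;
```python
# Hypothetical toy population; the real calculator uses ABS census counts.
population = [
    {"postcode": "2000", "age_group": "30-34", "gender": "F", "occupation": "nurse"},
    {"postcode": "2000", "age_group": "30-34", "gender": "M", "occupation": "nurse"},
    {"postcode": "2000", "age_group": "60-64", "gender": "F", "occupation": "teacher"},
    {"postcode": "2850", "age_group": "30-34", "gender": "F", "occupation": "teacher"},
    {"postcode": "2850", "age_group": "30-34", "gender": "F", "occupation": "nurse"},
    {"postcode": "2000", "age_group": "30-34", "gender": "F", "occupation": "teacher"},
]

# The profile we want to look up (also made up).
me = {"postcode": "2000", "age_group": "30-34", "gender": "F", "occupation": "nurse"}


def count_matches(people, profile, fields):
    """Count people who share the profile's value for every listed field."""
    return sum(all(p[f] == profile[f] for f in fields) for p in people)


# How many people share each attribute on its own.
single_counts = {field: count_matches(population, me, [field]) for field in me}

# How many people share all of the attributes at once.
combined_count = count_matches(population, me, list(me))

print(single_counts)
print(combined_count)
```
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In this toy population every single attribute is shared by at least three people, yet only one person matches all four at once.&lt;/p&gt;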

&lt;h2 id=&quot;what-you-can-do&quot;&gt;What You Can Do&lt;/h2&gt;

&lt;p&gt;Understanding your profile is a useful first step:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Be mindful with surveys&lt;/strong&gt; that collect detailed demographics along with postcodes&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Think about combinations&lt;/strong&gt; when sharing information across multiple platforms&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Ask questions&lt;/strong&gt; when organisations claim data is anonymous—what demographics remain?&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Support privacy protections&lt;/strong&gt; that go beyond simple de-identification&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;building-the-tool&quot;&gt;Building the Tool&lt;/h2&gt;

&lt;p&gt;I built this using Observable Framework and real ABS census data, inspired by research from Imperial College London. All calculations happen in your browser—nothing you enter is collected or transmitted.&lt;/p&gt;

&lt;p&gt;The goal is education, not alarm. Many Australians don’t realise how distinctive basic demographics can be. This tool makes that concept tangible.&lt;/p&gt;

&lt;h2 id=&quot;looking-forward&quot;&gt;Looking Forward&lt;/h2&gt;

&lt;p&gt;Data sharing for research and policy is valuable, and we shouldn’t stop it. But we do need better approaches. This includes being realistic about de-identification limits, using stronger privacy techniques like differential privacy, and being thoughtful about what demographic detail gets shared.&lt;/p&gt;

&lt;p&gt;Try the calculator at &lt;a href=&quot;https://mitchelllisle.github.io/too-unique-to-hide-aus/&quot;&gt;Too Unique to Hide&lt;/a&gt; and see where you stand. Whether you’re one in thousands or far more unique, your demographic fingerprint is worth understanding in our data-driven world.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;em&gt;Try the calculator: &lt;a href=&quot;https://mitchelllisle.github.io/too-unique-to-hide-aus/&quot;&gt;Too Unique to Hide - Australian Edition&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Learn more: &lt;a href=&quot;https://www.oaic.gov.au/&quot;&gt;Office of the Australian Information Commissioner&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
</description>
        <pubDate>Wed, 24 Dec 2025 10:00:00 +0000</pubDate>
        <link>/blog/2025-12-24-too-unique-to-hide/</link>
        <guid isPermaLink="true">/blog/2025-12-24-too-unique-to-hide/</guid>
        
        <category>privacy</category>
        
        <category>data</category>
        
        <category>security</category>
        
        <category>australia</category>
        
        
      </item>
    
      <item>
        <title>One schema library to rule them all</title>
        <description>&lt;h2 id=&quot;generating-and-testing-pyspark-dataframes-with-sparkdantic&quot;&gt;Generating and Testing PySpark DataFrames with Sparkdantic&lt;/h2&gt;

&lt;h3 id=&quot;1-introduction-of-the-problem-sparkdantic-solves&quot;&gt;1. Introduction of the Problem Sparkdantic Solves&lt;/h3&gt;

&lt;p&gt;In the world of Big Data, PySpark has become a go-to framework for processing large datasets. However, as with any 
framework, there are challenges. One of the most cumbersome is defining schemas for DataFrames and generating 
realistic test data. PySpark often does a good job of inferring schemas, but in some cases you need to define a schema
explicitly to ensure your data arrives with the types you expect.&lt;/p&gt;

&lt;p&gt;Pydantic is another hugely popular library, with excellent capabilities for validating
your data. Up until now, there hasn’t been an easy way to use the two together.&lt;/p&gt;

&lt;p&gt;Traditionally, developers would manually define schemas and write custom code to generate test data. This process is not
only tedious but also error-prone. While PySpark provides a way to define schemas, it doesn’t take advantage of Python’s
built-in data types, which means you have to define your schema the way PySpark wants you to.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;What if there was a more streamlined way to handle schemas, interoperability between Python and Spark and an easy way
to generate fake / test data?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Enter Sparkdantic, which offers a seamless integration between Pydantic models and PySpark DataFrames. With Sparkdantic,
you can define DataFrame schemas using Pydantic models and generate realistic test data based on custom specifications.&lt;/p&gt;

&lt;p&gt;To read more about Sparkdantic and install it, see the GitHub repository &lt;a href=&quot;https://github.com/mitchelllisle/sparkdantic&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pip install sparkdantic&lt;/code&gt;&lt;/p&gt;

&lt;h3 id=&quot;2-creating-schemas-and-how-sparkdantic-makes-it-easy&quot;&gt;2. Creating Schemas and How Sparkdantic Makes It Easy&lt;/h3&gt;

&lt;p&gt;With PySpark, defining a schema usually involves creating a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;StructType&lt;/code&gt; object with a list of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;StructField&lt;/code&gt; objects. 
While this method is powerful, it can become verbose and hard to manage for complex schemas. You also can’t use this 
schema outside of PySpark.&lt;/p&gt;

&lt;p&gt;Using Sparkdantic, you can leverage Pydantic models to define your DataFrame schema. Pydantic models are Python classes 
that define data shapes and validation. They are concise, readable, and offer powerful validation capabilities. A basic
Pydantic model may look like this:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pydantic&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;BaseModel&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;User&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;BaseModel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;age&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;email&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;With the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SparkModel&lt;/code&gt; class from Sparkdantic, you can turn this model into a PySpark schema: subclass &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SparkModel&lt;/code&gt; instead of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BaseModel&lt;/code&gt;, then
call the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;model_spark_schema&lt;/code&gt; method:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sparkdantic&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SparkModel&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;UserSparkSchema&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SparkModel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;age&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;email&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;schema&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;UserSparkSchema&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;model_spark_schema&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This returns a PySpark &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;StructType&lt;/code&gt; schema, ready to be used in your DataFrames. It is equivalent to writing the following by hand:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pyspark.sql.types&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;StructType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;StructField&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;StringType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;IntegerType&lt;/span&gt;

&lt;span class=&quot;nc&quot;&gt;StructType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;
    &lt;span class=&quot;nc&quot;&gt;StructField&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;StringType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; 
    &lt;span class=&quot;nc&quot;&gt;StructField&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;age&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;IntegerType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; 
    &lt;span class=&quot;nc&quot;&gt;StructField&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;email&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;StringType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
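
&lt;p&gt;The conversion can be pictured with a sketch that depends on neither library and uses only the standard library. It maps annotated Python fields to Spark type names. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TYPE_NAMES&lt;/code&gt; table and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;to_spark_fields&lt;/code&gt; helper are hypothetical names chosen for illustration, and this is not how Sparkdantic itself is implemented:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;
```python
from dataclasses import dataclass, fields
from typing import get_type_hints

# Assumed mapping from Python built-ins to PySpark type names (illustrative subset).
TYPE_NAMES = {
    str: "StringType",
    int: "IntegerType",
    float: "DoubleType",
    bool: "BooleanType",
}


@dataclass
class User:
    name: str
    age: int
    email: str


def to_spark_fields(model):
    """Return (column, spark_type_name, nullable) tuples for a dataclass."""
    hints = get_type_hints(model)
    # Fields without Optional[...] are treated as non-nullable in this sketch.
    return [(f.name, TYPE_NAMES[hints[f.name]], False) for f in fields(model)]


spark_fields = to_spark_fields(User)
print(spark_fields)
```
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The resulting tuples line up with the three &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;StructField&lt;/code&gt; entries shown above, which is the kind of bookkeeping a schema library takes off your hands.&lt;/p&gt;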

&lt;h3 id=&quot;3-generating-realistic-fake-data-for-unit-tests--populating-a-development-database&quot;&gt;3. Generating Realistic Fake Data for Unit Tests / Populating a Development Database&lt;/h3&gt;

&lt;p&gt;Once you have your schema, the next challenge is populating it with realistic data. Sparkdantic provides the 
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ColumnGenerationSpec&lt;/code&gt; class, which lets you define specifications for generating data for each column.&lt;/p&gt;

&lt;p&gt;For instance, if you want the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;age&lt;/code&gt; column to have random values between 20 and 50:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sparkdantic.generation&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ColumnGenerationSpec&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;age_spec&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;ColumnGenerationSpec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;min_value&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max_value&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;50&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;You may also want a list of names to use for the name column. For this, we can leverage other libraries such as the well
known &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;faker&lt;/code&gt; library:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;faker&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Faker&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;faker&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Faker&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;names&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;faker&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;name_spec&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;ColumnGenerationSpec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;values&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;names&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;generate_data&lt;/code&gt; method of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SparkModel&lt;/code&gt; in Sparkdantic, you can then generate a DataFrame with the desired
number of rows:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pyspark.sql&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SparkSession&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;spark&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SparkSession&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;builder&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;appName&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;demo&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;getOrCreate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;UserSparkSchema&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;generate_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;spark&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n_rows&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;specs&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;age&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;age_spec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name_spec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;})&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;show&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This will produce a DataFrame with 1000 rows, where the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;age&lt;/code&gt; column holds random values between 20 and 50 and
the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;name&lt;/code&gt; column holds names chosen at random from the list of 1000 fake names generated by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;faker&lt;/code&gt;.&lt;/p&gt;

&lt;h3 id=&quot;4-conclusion&quot;&gt;4. Conclusion&lt;/h3&gt;

&lt;p&gt;Defining PySpark DataFrame schemas and generating test data doesn’t have to be a cumbersome process. With the 
integration of Pydantic models and Sparkdantic, you can streamline these tasks, making your development process more 
efficient and less error-prone.&lt;/p&gt;

&lt;p&gt;Whether you’re a data engineer writing unit tests, a data scientist experimenting with data, or a developer populating a
development database, Sparkdantic offers a powerful toolset to make your life easier. Give it a try and elevate your 
PySpark game!&lt;/p&gt;
</description>
        <pubDate>Sat, 30 Sep 2023 12:01:35 +0000</pubDate>
        <link>/blog/2023-09-30-pyspark-pydantic-schema-library/</link>
        <guid isPermaLink="true">/blog/2023-09-30-pyspark-pydantic-schema-library/</guid>
        
        <category>pydantic</category>
        
        <category>python</category>
        
        <category>pyspark</category>
        
        <category>schema</category>
        
        
      </item>
    
      <item>
        <title>A friendly encryption CLI tool</title>
        <description>&lt;h2 id=&quot;-monstermash-a-simple-cli-tool-for-data-encryption&quot;&gt;🧟 Monstermash: A Simple CLI Tool for Data Encryption&lt;/h2&gt;

&lt;h3 id=&quot;introduction&quot;&gt;Introduction&lt;/h3&gt;

&lt;p&gt;In today’s digital landscape, data privacy is a growing concern. While there are many tools available for data encryption,
Monstermash offers a straightforward command-line interface (CLI) solution for those who prefer simplicity. 
Let’s explore its basic functionalities: encrypting and decrypting data.&lt;/p&gt;

&lt;p&gt;To read more about Monstermash and install it, see the GitHub repository &lt;a href=&quot;https://github.com/mitchelllisle/monstermash&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pip install monstermash&lt;/code&gt;&lt;/p&gt;

&lt;h3 id=&quot;getting-started-generating-keys&quot;&gt;Getting Started: Generating Keys&lt;/h3&gt;

&lt;p&gt;Before Alice and Bob can exchange encrypted messages, they each need a set of keys. Monstermash provides a basic command
to generate these.&lt;/p&gt;

&lt;p&gt;For Alice:&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;monstermash generate
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;div class=&quot;language-text highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-----------------
Private Key (Alice&apos;s)
a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2
Public Key (Alice&apos;s)
0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef
-----------------
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For Bob:&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;monstermash generate
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;div class=&quot;language-text highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-----------------
Private Key (Bob&apos;s)
abcdef1234567890abcdef1234567890abcdef1234567890abcdef1234567890
Public Key (Bob&apos;s)
fedcba0987654321fedcba0987654321fedcba0987654321fedcba0987654321
-----------------
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;encrypting-data&quot;&gt;Encrypting Data&lt;/h3&gt;

&lt;p&gt;Suppose Alice wants to send Bob a line from the song “Monster Mash”. She can use her private key and Bob’s public key to
encrypt the message.&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;monstermash encrypt &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;--private-key&lt;/span&gt; a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;--public-key&lt;/span&gt; fedcba0987654321fedcba0987654321fedcba0987654321fedcba0987654321 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;--data&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;They did the mash, they did the Monster Mash!&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;div class=&quot;language-text highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Encrypted Data: 0123abcd4567ef890123abcd4567ef890123abcd4567ef890123abcd4567ef89
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;decrypting-data&quot;&gt;Decrypting Data&lt;/h3&gt;

&lt;p&gt;Upon receiving the encrypted message, Bob can decrypt it using his private key.&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;monstermash decrypt &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;--private-key&lt;/span&gt; abcdef1234567890abcdef1234567890abcdef1234567890abcdef1234567890 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;--data&lt;/span&gt; 0123abcd4567ef890123abcd4567ef890123abcd4567ef890123abcd4567ef89
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;div class=&quot;language-text highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Decrypted Data: They did the mash, they did the Monster Mash!
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h3&gt;

&lt;p&gt;Monstermash is a simple CLI tool designed for basic encryption tasks. It doesn’t claim to revolutionize the encryption 
landscape but offers a simple solution for those familiar with the command line. If you’re looking for a no-frills way 
to encrypt and decrypt data, Monstermash might be worth a try.&lt;/p&gt;
</description>
        <pubDate>Fri, 01 Sep 2023 12:01:35 +0000</pubDate>
        <link>/blog/2023-08-01-monstermash/</link>
        <guid isPermaLink="true">/blog/2023-08-01-monstermash/</guid>
        
        <category>NaCL</category>
        
        <category>python</category>
        
        <category>encryption</category>
        
        <category>cli</category>
        
        
      </item>
    
      <item>
        <title>Protecting Sensitive Data: Understanding Database Reconstruction Attacks</title>
        <description>&lt;h1 id=&quot;protecting-sensitive-data-understanding-database-reconstruction-attacks&quot;&gt;Protecting Sensitive Data: Understanding Database Reconstruction Attacks&lt;/h1&gt;

&lt;p&gt;There are a number of reasons businesses and governments want to share information about people. One of the most common and useful ways data is shared is through a census. A census is particularly interesting because it contains some extremely personal information about individuals and, as a result, it must be carefully protected to ensure that any statistical information released doesn’t encroach on everyone’s right to privacy. In a number of cases, publishing only aggregate data does little to stop an attacker from re-creating a database that is either very close to, or exactly the same as, the original data. In this blog post, we will explore how these attacks work with a simple example.&lt;/p&gt;

&lt;p&gt;This blog post and the subsequent code is adapted from a paper on database reconstruction attacks. You can find the paper &lt;a href=&quot;https://queue.acm.org/detail.cfm?id=3295691&quot;&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Imagine we work for a company called Acme Data Inc. and that we have the following database, which contains information about people within a certain geographic area.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;name&lt;/th&gt;
      &lt;th&gt;age&lt;/th&gt;
      &lt;th&gt;married&lt;/th&gt;
      &lt;th&gt;smoker&lt;/th&gt;
      &lt;th&gt;employed&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Sara Gray&lt;/td&gt;
      &lt;td&gt;8&lt;/td&gt;
      &lt;td&gt;False&lt;/td&gt;
      &lt;td&gt;False&lt;/td&gt;
      &lt;td&gt;False&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Joseph Collins&lt;/td&gt;
      &lt;td&gt;18&lt;/td&gt;
      &lt;td&gt;False&lt;/td&gt;
      &lt;td&gt;True&lt;/td&gt;
      &lt;td&gt;True&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Vincent Porter&lt;/td&gt;
      &lt;td&gt;24&lt;/td&gt;
      &lt;td&gt;False&lt;/td&gt;
      &lt;td&gt;False&lt;/td&gt;
      &lt;td&gt;True&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Tiffany Brown&lt;/td&gt;
      &lt;td&gt;30&lt;/td&gt;
      &lt;td&gt;True&lt;/td&gt;
      &lt;td&gt;True&lt;/td&gt;
      &lt;td&gt;True&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Brenda Small&lt;/td&gt;
      &lt;td&gt;36&lt;/td&gt;
      &lt;td&gt;True&lt;/td&gt;
      &lt;td&gt;False&lt;/td&gt;
      &lt;td&gt;False&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Dr. Tina Ayala&lt;/td&gt;
      &lt;td&gt;66&lt;/td&gt;
      &lt;td&gt;True&lt;/td&gt;
      &lt;td&gt;False&lt;/td&gt;
      &lt;td&gt;False&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Rodney Gonzalez&lt;/td&gt;
      &lt;td&gt;84&lt;/td&gt;
      &lt;td&gt;True&lt;/td&gt;
      &lt;td&gt;True&lt;/td&gt;
      &lt;td&gt;False&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;&lt;em&gt;Note&lt;/em&gt;: All data here is fake, generated data, and any likeness to a real person is entirely coincidental.&lt;/p&gt;

&lt;p&gt;We have &lt;em&gt;7&lt;/em&gt; people in total in this block. Alongside &lt;strong&gt;age&lt;/strong&gt;, we also have each resident’s &lt;strong&gt;smoking status&lt;/strong&gt;, &lt;strong&gt;employment status&lt;/strong&gt; and whether they are &lt;strong&gt;married&lt;/strong&gt; or not. From here, we publish a variety of statistics about this block. You have probably seen something similar if you’ve ever done a census.&lt;/p&gt;

&lt;p&gt;📓 To simplify the example, this fictional world has:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Two marriage statuses: Married (&lt;strong&gt;True&lt;/strong&gt;) or Single (&lt;strong&gt;False&lt;/strong&gt;)&lt;/li&gt;
  &lt;li&gt;Two smoking statuses: Non-Smoker (&lt;strong&gt;False&lt;/strong&gt;) or Smoker (&lt;strong&gt;True&lt;/strong&gt;)&lt;/li&gt;
  &lt;li&gt;Two employment statuses: Unemployed (&lt;strong&gt;False&lt;/strong&gt;) or Employed (&lt;strong&gt;True&lt;/strong&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👾 One additional piece of logic we know is that any statistic with a &lt;strong&gt;count of less than 3&lt;/strong&gt; is suppressed. Suppressing statistics with low counts is a common tactic for protecting privacy: the fewer people there are behind a statistic, the more they stick out in a dataset, so their privacy is more at risk than that of those who ‘blend in with the crowd’. As we’ll see, simply knowing that a statistic is suppressed can even be used to attack a dataset.&lt;/p&gt;

&lt;p&gt;As data analysts working for Acme Data, we have been tasked with producing the following summary statistics that we can publish on our website for anyone to view. After running our analysis, this is the output that we intend to publish:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;id&lt;/th&gt;
      &lt;th&gt;name&lt;/th&gt;
      &lt;th&gt;count&lt;/th&gt;
      &lt;th&gt;median-age&lt;/th&gt;
      &lt;th&gt;mean-age&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;A1&lt;/td&gt;
      &lt;td&gt;total-population&lt;/td&gt;
      &lt;td&gt;7.0&lt;/td&gt;
      &lt;td&gt;30.0&lt;/td&gt;
      &lt;td&gt;38.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;A2&lt;/td&gt;
      &lt;td&gt;non-smoker&lt;/td&gt;
      &lt;td&gt;4.0&lt;/td&gt;
      &lt;td&gt;30.0&lt;/td&gt;
      &lt;td&gt;33.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;B2&lt;/td&gt;
      &lt;td&gt;smoker&lt;/td&gt;
      &lt;td&gt;3.0&lt;/td&gt;
      &lt;td&gt;30.0&lt;/td&gt;
      &lt;td&gt;44.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;C2&lt;/td&gt;
      &lt;td&gt;unemployed&lt;/td&gt;
      &lt;td&gt;4.0&lt;/td&gt;
      &lt;td&gt;51.0&lt;/td&gt;
      &lt;td&gt;48.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;D2&lt;/td&gt;
      &lt;td&gt;employed&lt;/td&gt;
      &lt;td&gt;3.0&lt;/td&gt;
      &lt;td&gt;24.0&lt;/td&gt;
      &lt;td&gt;24.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;A3&lt;/td&gt;
      &lt;td&gt;single-adults&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;B3&lt;/td&gt;
      &lt;td&gt;married-adults&lt;/td&gt;
      &lt;td&gt;4.0&lt;/td&gt;
      &lt;td&gt;51.0&lt;/td&gt;
      &lt;td&gt;54.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;A4&lt;/td&gt;
      &lt;td&gt;unemployed-non-smoker&lt;/td&gt;
      &lt;td&gt;3.0&lt;/td&gt;
      &lt;td&gt;36.0&lt;/td&gt;
      &lt;td&gt;37.0&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The stat &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A1&lt;/code&gt; represents the total population count, median age, and mean age of individuals in the database. The count refers to the total number of individuals in the database, the median age refers to the age that separates the database into two equal halves, and the mean age refers to the average age of all individuals in the database. The other stats are all showing the same information for various cohorts.&lt;/p&gt;
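
&lt;p&gt;As a rough illustration of how one of these rows could be produced, here is a minimal sketch in plain Python. The function name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;summarise&lt;/code&gt; and the constant &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SUPPRESSION_THRESHOLD&lt;/code&gt; are made up for this example; only the ages and the suppression rule come from the tables above.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;from statistics import mean, median

SUPPRESSION_THRESHOLD = 3  # counts below this are withheld (the rule described earlier)


def summarise(ages):
    """Return (count, median, mean) for a cohort, or None if it is suppressed."""
    if len(ages) in range(SUPPRESSION_THRESHOLD):  # cohort of 0, 1 or 2 people
        return None
    return len(ages), median(ages), mean(ages)


# Ages of the seven residents in the block
block_ages = [8, 18, 24, 30, 36, 66, 84]

print(summarise(block_ages))  # the A1 row: count, median age, mean age
print(summarise([18, 24]))    # a hypothetical two-person cohort is withheld
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;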

&lt;p&gt;Note that we have suppressed &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A3&lt;/code&gt; in order to protect the individuals in that cohort, who have a higher risk of being re-identified. What’s interesting is that the suppression itself is information we can encode into our model to help us come up with a better re-construction. We can infer that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A3&lt;/code&gt; is suppressed because fewer than 3 people are in this cohort, since other stats (such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;D2&lt;/code&gt;) contain exactly 3 people and are not suppressed.&lt;/p&gt;

&lt;p&gt;To encode these constraints into a model that we can use to re-construct the data, we can use a constraint solver such as &lt;a href=&quot;https://github.com/Z3Prover/z3&quot;&gt;Z3&lt;/a&gt;. Each stat above becomes a constraint, and we ask the solver for assignments of age, smoker status, employment status and married status for all 7 people that satisfy every constraint at once. A single constraint can be modelled like this:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z3&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# create a solver object, that houses all our constraints
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;solver&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;Solver&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# create representations of the variables we want to receive an answer for; such as ages
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ages&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ArrayRef&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;Array&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;ages&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;IntSort&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;IntSort&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;())&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# define a constraint on these variables (we know there are 7 people, so we range over that number)
# the constraint we add here is to ensure all 7 people have a realistic age (between 0 and 125)
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;min_age&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;max_age&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;125&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;solver&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;add&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;z3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;And&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;z3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;Select&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ages&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;min_age&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;z3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;Select&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ages&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max_age&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

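&lt;span class=&quot;c1&quot;&gt;# check() reports whether the constraints can be satisfied; model() is only valid when they can
&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;solver&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;check&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sat&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;solver&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# we can then access that model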
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Solving the constraints above would produce a list of values for ages that fit within our constraints. For example, the model we end up with might look like this:&lt;/p&gt;

&lt;div class=&quot;language-text highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[45, 34, 67, 34, 123, 1, 8]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Of course, many different assignments could satisfy these constraints, and the model may output a different one depending on which it finds first. With each new constraint added, we reduce the search space until we ideally get down to 1 answer that fits all the constraints. At this point, we’ve re-constructed the database!&lt;/p&gt;
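
&lt;p&gt;To get a feel for how quickly published statistics shrink the search space, here is a brute-force sketch in plain Python. It is not Z3 and not the approach used in the full implementation; it only looks at the employed cohort &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;D2&lt;/code&gt; (3 people, median age 24, mean age 24). It assumes the published mean is exact and reuses the 0 to 125 age range from the example above. All variable names are illustrative.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;from itertools import combinations_with_replacement

MAX_AGE = 125          # assumed upper bound on a realistic age
COHORT_SIZE = 3        # published count for D2
PUBLISHED_MEDIAN = 24  # published median age for D2
PUBLISHED_MEAN = 24    # published mean age for D2, assumed to be exact

# Every sorted triple of ages is a candidate when we only know the count
all_triples = list(combinations_with_replacement(range(MAX_AGE + 1), COHORT_SIZE))

# Each triple is sorted, so its middle element is the median
median_ok = [t for t in all_triples if t[1] == PUBLISHED_MEDIAN]

# A mean of 24 over 3 people fixes the sum of their ages
both_ok = [t for t in median_ok if sum(t) == PUBLISHED_MEAN * COHORT_SIZE]

print(len(all_triples), len(median_ok), len(both_ok))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Knowing only the count leaves 341,376 candidate triples. Adding the median cuts that to 2,550, and adding the mean leaves 25, one of which is the true set of ages (18, 24, 30). Statistics for overlapping cohorts would narrow this further, which is the job we hand to the solver.&lt;/p&gt;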

&lt;p&gt;If you want to see this in action, check out &lt;a href=&quot;https://github.com/mitchelllisle/database-reconstruction-attacks&quot;&gt;this repo with a full implementation&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;In this article, we’ve explored how publishing only aggregate data does little to stop an attacker from re-creating a database that is very close to, or exactly the same as, the original. It’s important to consider this when releasing data.&lt;/p&gt;

&lt;p&gt;Before we wrap up, you may be asking why this is possible. The answer comes from the same people who came up with the best technique we know of for protecting against this type of attack:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;“[Giving] overly accurate answers to too many questions will destroy privacy in a spectacular way”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Cynthia Dwork and Aaron Roth, authors of ‘The Algorithmic Foundations of Differential Privacy’&lt;/p&gt;

&lt;p&gt;The next question you may be asking is “How do I protect against this attack?”. A couple of things you can look at include:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://desfontain.es/privacy/friendly-intro-to-differential-privacy.html&quot;&gt;Differential privacy&lt;/a&gt;: DP is a great fit for protecting this type of data. In fact, the US Census Bureau has adopted DP to avoid disclosure of private information about individuals&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://martinfowler.com/bliki/Datensparsamkeit.html&quot;&gt;Data minimisation&lt;/a&gt;: Releasing too much information can lead to a simpler re-construction attack vector, so minimising the data you release can be a simple way to limit what people can infer about your data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you can, consult the privacy experts in your organisation and ask them to do a privacy review before you share data with third parties or with the public.&lt;/p&gt;

&lt;p&gt;Thanks!&lt;/p&gt;
</description>
        <pubDate>Thu, 23 Feb 2023 12:01:35 +0000</pubDate>
        <link>/blog/2019-10-14-database-reconstruction-attacks/</link>
        <guid isPermaLink="true">/blog/2019-10-14-database-reconstruction-attacks/</guid>
        
        <category>privacy</category>
        
        <category>python</category>
        
        <category>databases</category>
        
        <category>z3</category>
        
        
      </item>
    
  </channel>
</rss>