Quis custodiet ipsos custodes? – Why We Should Care about Government Data Mining

If you are anything like me, you will agree that this summer’s PRISM/NSA scandal marked a new low in the history of 21st-century digital surveillance. But by far the scariest prospect is that the NSA’s attitude towards privacy and data mining might represent a new normal in how large institutions relate to their customers’ data. True to form, for the last half-year the media have been reinforcing just this narrative with breathless coverage of how “big data” concepts are revolutionizing the future of both security and corporate administration, offering untold benefits to analysts interested in getting an edge. But even with the copious amounts of digital ink spilled on “big data” and data mining, very little consideration is given to explaining the technology involved.

To a certain extent, this is understandable. The NSA scandals, along with other corporate data-farming controversies, are primarily about the privacy and ownership of social data, not the use of the data once it is collected. When the government (or a corporate entity) has already absconded with our information, do we really care what they do with it? Still, absent even a rough overview of the technology, several of the potential dangers of our lax attitude towards data privacy may not be fully apparent to the average news consumer.

In the broadest sense, data mining is simply the use of computer algorithms to adaptively classify a dataset in order to discover subtle relationships. “Adaptively” is the key word in that sentence. Data mining is not the straightforward testing of relationships over a large dataset (that would be standard statistical analysis). For this reason data mining is often categorized as an extension of artificial intelligence or machine learning. The advantage of data-mining techniques over traditional methods lies in their ability to perceive relationships that would confound human perception. As humans, we are programmed to look for certain types of patterns in datasets. We have certain sensitivities (perceiving faces and voices) and insensitivities (very small or large numbers), and when approaching a dataset these biases can blind us to relationships that contain a good deal of predictive power. Unlike standard statistical methods, most data-mining algorithms simply begin with a metric to track and then use examples (called a training set) to create categories that explain the metric’s variance. Since no categories are proposed at the outset, this adaptive technique greatly resembles “learning” and can turn up relationships across thousands of distinct inputs that a human would likely never discover.
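The contrast can be sketched in a few dozen lines of code. Everything below is invented for illustration: the two features, the outcome, and the thresholds searched. The “standard” approach tests the one relationship the analyst thought to propose; the “adaptive” approach proposes nothing up front and simply hunts for whatever split best explains the metric’s variance.

```python
import random

random.seed(0)

# Hypothetical dataset: two measured features and one outcome metric.
# The outcome actually depends on feature "b"; the analyst's pet
# hypothesis concerns feature "a".
def make_record():
    a, b = random.random(), random.random()
    outcome = (1.0 if b > 0.5 else 0.0) + random.gauss(0, 0.05)
    return {"a": a, "b": b, "outcome": outcome}

data = [make_record() for _ in range(2000)]

def corr(xs, ys):
    """Pearson correlation, written out for self-containment."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Standard approach: test the pre-specified hypothesis ("a predicts
# the outcome"). The correlation comes back near zero -- nothing found.
r = corr([d["a"] for d in data], [d["outcome"] for d in data])

def weighted_variance(left, right):
    def var(rows):
        m = sum(row["outcome"] for row in rows) / len(rows)
        return sum((row["outcome"] - m) ** 2 for row in rows) / len(rows)
    n = len(left) + len(right)
    return (len(left) * var(left) + len(right) * var(right)) / n

# Adaptive approach: no hypothesis up front. Search every feature and
# threshold for the split that best explains the metric's variance.
best = None
for feature in ("a", "b"):
    for threshold in (0.25, 0.5, 0.75):
        left = [d for d in data if d[feature] <= threshold]
        right = [d for d in data if d[feature] > threshold]
        if left and right:
            score = weighted_variance(left, right)
            if best is None or score < best[0]:
                best = (score, feature, threshold)

_, found_feature, found_threshold = best
# The search discovers the b > 0.5 rule on its own.
print(f"hypothesis test on 'a': r = {r:.3f}")
print(f"adaptive search found: {found_feature} > {found_threshold}")
```

Real mining systems iterate this kind of split search recursively over thousands of features (decision trees, rule induction), but the core move is the same: the categories come out of the data, not out of the analyst.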

Before my readers’ eyes glaze over, perhaps I can offer an example showing the difference between the two approaches. Say we would like to understand what factors influence teacher performance in primary schools. We can imagine a large dataset comprising the dossiers of teachers along with a metric representing the overall effectiveness of those teachers in educating their students. In a standard statistical approach, one might test whether classroom size, income, location, or even a linear combination of all three affects the ultimate performance metric. From these standard methods we would expect an answer telling us whether any one of our proposed relationships is statistically valid (e.g. a 0.85 correlation between class size and performance at 95% confidence). However, this approach is limited: there might be very subtle relationships lurking in the many thousands of variables we have on each teacher. Enter data mining. At the end of running a mining algorithm we might arrive at entirely unexpected rules that would never have been discovered by a pre-hoc query. We might discover, for instance, that teachers below the age of 35 have an improved performance metric, but only if they are unmarried and living in an urban area; we might discover that teachers above the age of 40 in suburbs actually improve their performance metric when class size is increased above 20 students. Any number of complicated relationships might be generated from the very subtle interplay of the thousands of variables describing each teacher, some of which no one would ever expect.
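A toy version of that rule discovery can be written directly. All of the data below is synthetic and every effect size is an assumption made for illustration: the hidden ground truth is a two-condition interaction (under 35 AND unmarried) that no single-variable test would propose, yet a greedy tree-style search surfaces it by chasing the performance metric.

```python
import random

random.seed(1)

# Synthetic "teacher" records (all values invented for illustration).
# Hidden ground truth: performance is higher only for teachers who are
# BOTH under 35 AND unmarried; neither condition alone explains much.
def make_teacher():
    age = random.randint(22, 65)
    unmarried = random.random() < 0.5
    perf = random.gauss(50, 5)
    if age < 35 and unmarried:
        perf += 15
    return {"age": age, "unmarried": unmarried, "perf": perf}

teachers = [make_teacher() for _ in range(5000)]

def mean_perf(rows):
    return sum(r["perf"] for r in rows) / len(rows)

CONDITIONS = [
    ("age<35", lambda r: r["age"] < 35),
    ("unmarried", lambda r: r["unmarried"]),
]

def best_condition(rows):
    """Pick the condition that most separates the performance metric."""
    best = None
    for name, cond in CONDITIONS:
        yes = [r for r in rows if cond(r)]
        no = [r for r in rows if not cond(r)]
        if not yes or not no:
            continue
        gap = abs(mean_perf(yes) - mean_perf(no))
        if best is None or gap > best[1]:
            best = (name, gap, cond)
    return best

# Greedy, tree-style search: split once, then split again inside the
# promising branch -- exactly how an interaction rule gets assembled.
first, _, cond1 = best_condition(teachers)
branch = [r for r in teachers if cond1(r)]
second, _, cond2 = best_condition(branch)
rule_group = [r for r in branch if cond2(r)]

print(f"discovered rule: {first} AND {second}")
print(f"rule group mean {mean_perf(rule_group):.1f}"
      f" vs overall {mean_perf(teachers):.1f}")
```

With thousands of variables instead of two, the same procedure produces exactly the kind of odd compound rules described above.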

At this point one might ask: so what’s the problem with this technology? Surely it can only help to ground our decisions in reality-based facts. If we assume no malice on the part of those conducting the investigation, there would seem to be nothing to lose from implementing a new approach to data. Well, like most things, the devil is in the details.

In the previous example, the rules I offered were odd but easy to understand. In a more realistic scenario, rules could involve the interplay of thousands of variables and might not be logically discernible to non-experts. In many, if not most, cases, huge multivariate relationships could be standing in for causal relationships between data not even contained in the original dataset. Of course a domain expert could, with some work, come to understand why these rules hold predictive power, but this is expensive, and given the nature of current data-mining practice, such rigorous methods are almost never pursued. In fact, to some analysts, data-mining rules are seen as a possible replacement for domain knowledge. This can lead to an insane scenario where data-mining applications actually blind analysts to the causal relationships in play. When the data becomes large and complex, it is tempting to let the so-called “learning” of the data-mining algorithm replace expertise. But despite the nomenclature, the algorithm hasn’t really applied any intelligence; and so a modern data-mining system could proceed with its powers of discernment completely disabled.

All of this sounds very esoteric, so perhaps I can return to our original example of tracking social and personal data. The great and abiding problem with allowing large institutions free rein to mine these personal data sources is that citizens will have only a vague notion of what causal inferences are being used to track them. In all likelihood the institutions have no clear idea either, and because of this, any number of things might be unwittingly inferred about subjects. With a large dataset, a machine-learning algorithm might discover an exact fingerprint that will statistically identify people as gun owners, transgender, Muslims, home-schoolers, or any number of things not even tracked in the initial dataset. Of course, these relationships will be obscure to all but the most rigorous analyst, but in a poorly supervised system these fingerprints (standing in for firsthand data) will be used to make decisions.
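The fingerprint danger is easy to make concrete. In the sketch below, everything is an invented assumption: the sensitive attribute (“hobbyist”), the three innocuous correlates, and all the probabilities. The sensitive attribute is never recorded, yet a crude rule over the variables that were recorded recovers it with high accuracy.

```python
import random

random.seed(2)

# Hypothetical sketch: a sensitive attribute ("hobbyist") is never
# stored in the dataset. Three innocuous observed variables merely
# correlate with it (all probabilities invented for illustration).
def make_person():
    hobbyist = random.random() < 0.2          # hidden ground truth
    def correlate(p_if_true, p_if_false):
        return random.random() < (p_if_true if hobbyist else p_if_false)
    record = {
        "reads_magazine_x": correlate(0.8, 0.1),
        "visits_forum_y": correlate(0.7, 0.1),
        "zip_cluster_z": correlate(0.6, 0.2),
    }
    return hobbyist, record

people = [make_person() for _ in range(10000)]

# A crude learned "fingerprint": flag anyone matching at least two of
# the three correlates. No one ever asked about the hobby, and yet...
def flagged(record):
    return sum(record.values()) >= 2

hits = [hidden for hidden, record in people if flagged(record)]
precision = sum(hits) / len(hits)     # flagged people who really are hobbyists
recall = sum(hits) / sum(h for h, _ in people)

print(f"precision {precision:.2f}, recall {recall:.2f}")
```

Both numbers come out well above what chance would give (the base rate is 20%): the dossier “contains” the attribute without ever having collected it.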

Privacy advocates frequently bring up examples where someone’s sexual and dietary habits might be inferred from access to an Amazon wish list or a public Facebook profile. This scenario is often dismissed by cooler-headed individuals as paranoid tripe. Don’t the analysts at Amazon or the NSA have better things to do than sit around compiling personal dossiers on citizens? And who could be spiteful enough to snoop into strangers’ personal lives? The cooler heads are certainly right to assume a lack of spite on the part of data miners; unfortunately, sloth might accomplish what malice does not. Your employer’s or the government’s data-analysis system might flag you as a suspicious or “interesting” individual simply because you have a hobby of collecting antique guns. Of course no one would ever have asked whether you were a gun collector to begin with, but they are nevertheless getting an answer (in so many words) from the algorithm. A data-mining algorithm that reads through thousands of different variables across a dataset of millions of individuals will likely have fingerprints (statistical proxies) for any number of personal facts. As a result, government or corporate dossiers could implicitly contain any number of pieces of information about an individual.

I suppose at this point one could introduce a larger philosophical objection: really, what is the problem? If the relationships are obscured within the system, no individual is the wiser; and if some groups are statistically prone to being more violent, better customers for embarrassing products, or simply not well suited to some social environments, shouldn’t we use this information to improve society? After all, no person is ever likely to discover or know these relationships. But the problem is just that: no one knows! The human transaction of investigation, which should contain a level of discretion, has been replaced by a headless automaton. In the past, Americans decided that, at some level, there was a core concept of privacy: some predictive information would not be used to make judgments about an individual, regardless of the benefits to society. Data mining, in effect, could circumvent this concept, using peripheral data to compose an implicit portrait of people’s more intimate biographical details.

All of this brings me back to two recent news stories: the revelation of extensive and racially biased “stop and frisk” policies within Bloomberg’s New York City, and the increasing prevalence of CCTV surveillance throughout most urban areas in Europe and America. Certainly both the surveillance of citizens and the use of racial profiling are troubling in themselves. However, in both cases, the extent of the evil was mitigated by technological limitations and a firm desire on the part of the public not to transgress common-sense privacy norms. CCTV could never be the Orwellian panopticon of Big Brother for the simple reason that, for each camera, there had to be an official at the other end watching (an intractable staffing issue). Moreover, despite having tolerated racially profiled “stop and frisk” policies in the past, New Yorkers are now demonstrating that they find such policies unethical even if they do help law enforcement reduce crime rates. There are too many people who value privacy above security, and too few watchers willing to test their limits, for a nightmare scenario ever to be realized.

However, with the advent of data mining, both the technological barriers and our ability to detect ethical problems in the invasion of our privacy are fundamentally undermined. Whereas previously a city-wide CCTV surveillance system could never realistically be staffed, a CCTV database paired with a data-mining algorithm trained on the faces, actions, and motion patterns of criminals could quite easily provide an effective means of detecting potential crimes in progress. Conversely, police officials forbidden from using race as a means of tracking suspects may turn to subtler methods, employing data-mining algorithms trained on everything BUT race and race’s more obvious correlates. What is likely to emerge are large, complicated data-fingerprints that proxy for race in everything but name. The New York Mayor’s Office was able to deny the use of race in its “stop and frisk” policy for years by hiding behind an obvious proxy (saying it was targeting “neighborhoods”). Future regimes able to couch their implicit racial strategy in complicated datasets may be virtually undetectable by the public. Of course those rigorous enough to check (or cynical enough to guess) will know that society is essentially tracking its members by race anyway; and so, at the end of the day, society will be using technology to do an end run around its ethical principles of privacy and fairness.

The ancient Latin conundrum Quis custodiet ipsos custodes? (“Who watches the watchers themselves?”) expressed clearly the problem of limiting the reach of authorities while holding those same authorities accountable. In answer, Western common law proposed the dual values of privacy and transparency: authorities needed to be restrained from using some information (even if readily available), and citizens had the right to know how and where they were being watched. I may sound paranoid, but in a society locked into a system of machine surveillance, neither the principle of privacy nor that of transparency can truly be realized. Extensive data collection and mining may have untold benefits for policing, marketing, and general convenience, but without an enormous effort on the part of society to manage that system, citizens will no longer understand the means and mechanisms through which they are constantly watched and judged. The machine will see it, the machine will flag it, the machine will determine the boundaries for judgment and set the conditions for the interaction between citizen and authorities. In a society so ruled, “Who watches the watchers?” not only ceases to have a meaningful answer; it ceases to be a meaningful question.


4 thoughts on “Quis custodiet ipsos custodes? – Why We Should Care about Government Data Mining”

  1. It’s funny that with data mining not only can we voluntarily forgo the services of a domain expert, we can draw conclusions and correlations about matters where no experts exist, and potentially even where no causal connections exist at all. We will be tempted to throw data into our machines, and simply trust the machine when it tells us that a town with three grocery stores, one butcher shop, and a no street parking policy on Tuesdays is at really extreme risk for destruction by flood. Because data mining substitutes for human pattern matching and intuition, and works beyond the limitations of the human brain in these matters, the human brain’s ability to spot its errors is going to be hamstrung by the very nature of the practice.
    If our assumptions about statistics and correlation are in error, we are about to collide with the error, whether we recognize it or not. It is quite possible that statistical correlation by mere coincidence is much more common than we thought, but because we have always investigated statistics with an eye towards spotting real causes (i.e. we begin by hypothesizing a connection between data, and so our intuition limits what data will even enter into the analysis) our risk of producing false correlations and anomalies has been small. This process that you describe sounds more like we are simply collecting every measured human behavior of every type, and mechanically seeking connections. Such a system WILL succeed in finding connections; if the data is rich enough and the process thorough enough, correlations most certainly will be ferreted out. The question to me is whether this expanded scope of statistics reveals that 95% statistical certainty is simply much more common in nature than we thought, as improved instruments of astronomy revealed that the stars are radically more numerous than previously thought.
    And the other question is that if such spurious outcomes do appear, will they be recognized as spurious, and the standard of statistical certainty be raised, or will the town I imagined have to demolish one of its grocery stores to reduce its flood risk? It may be a worst-case scenario, but if the statistical nature of the universe turns out to be more full of coincidence than we thought, a really advanced practice of data mining could become an oracle pronouncing new superstitious dooms upon the world.

    • Well, it is certainly the case that poorly run statistical methods lead to persistent false beliefs, chiefly because we confuse the uncertainty described within the model with the actual uncertainty (including uncertainty not modeled and uncertainty about the validity of the model). It’s always hard to express this to statistical novices. I usually refer people to the excellent xkcd webcomic that illustrates the problem.
      Also, for a more extended treatment of this subject, see The Black Swan by Nassim Nicholas Taleb.

      • You’re right, of course; statistics isn’t anywhere near my field, and statistical thought is actually quite foreign to the kind of analysis I do, and to day-to-day analysis. Sometimes I think a statistical claim doesn’t mean anything like what we in the general public think it means. And I don’t think the error depends on bad methods either; it’s more that we always want to draw a kind of conclusion from statistics that may or may not be justified, but is fundamentally different from the claim about probability that the statistician is trying to make (as best I can tell).

        The point I was trying to make is that in creating what you’ve called fingerprints from large numbers of variables about large numbers of people, it’s very possible to discover the fingerprint that identifies, say, serial killers AND five innocent people, but it will be very hard to tell it apart from the fingerprint for just serial killers. As the number of variables collected about each person increases, the possibility of finding a strong identifying pattern for any set of people increases, including arbitrarily selected sets. Again, not being a statistician, I would speculate that if we knew an infinite number of facts about each person, we could take absolutely any group of people and connect them by a combination of facts along the lines of the fingerprint you described, even if there were no deterministic connection. By mining deep enough into the data (approaching an infinite number of variables) will we be able to find proof of the validity of the zodiac? Or do statisticians have a method for avoiding the trap? I haven’t the slightest, but it’s what sprang to mind when I read your piece.
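The worry raised in this thread (that a rich enough dataset will yield “significant” fingerprints even for arbitrary groups) can be demonstrated directly. In the sketch below every number is arbitrary and every value is pure noise: there is no real relationship anywhere in the data, yet searching enough variables produces correlations that clear a conventional significance bar on schedule.

```python
import random

random.seed(3)

# Arbitrary setup: 100 "towns", 300 measured variables, and one
# outcome ("flood damage"), with every value drawn independently at
# random -- so no genuine relationship exists anywhere in the data.
n, k = 100, 300
outcome = [random.gauss(0, 1) for _ in range(n)]
variables = [[random.gauss(0, 1) for _ in range(n)] for _ in range(k)]

def corr(xs, ys):
    """Pearson correlation, written out for self-containment."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# A rough 95%-level cutoff for a correlation over n samples is ~2/sqrt(n).
cutoff = 2 / n ** 0.5
spurious = [i for i in range(k) if abs(corr(variables[i], outcome)) > cutoff]

# Roughly 5% of the purely random variables clear the bar by chance.
print(f"{len(spurious)} of {k} unrelated variables look 'significant'")
```

Statisticians do have countermeasures (multiple-comparison corrections such as Bonferroni, and validation on held-out data), but they only help when the analyst remembers how many hypotheses the machine silently tested.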

  2. Pingback: Slate.com : The Government Can Learn More From your Data than you might Think | Data Distributist
