If you are anything like me, you will agree that this summer’s PRISM/NSA scandal marked a new low in the history of 21st-century digital surveillance. But by far the scariest prospect is that the NSA’s attitude towards privacy and data-mining might represent a new normal in how large institutions relate to their customers’ data. True to form, for the last half-year the media have been reinforcing just this narrative with their breathless coverage of how “big data” concepts are revolutionizing the future of both security and corporate administration, offering untold benefits to analysts interested in getting an edge. But even with the copious amounts of digital ink spilled on “big data” and data-mining, very little consideration is given to explaining the technology involved.
To a certain extent, this is understandable. The NSA scandals, along with other corporate data-farming controversies, are primarily about the privacy and ownership of social data, not the use of the data once it is collected. When the government (or a corporate entity) has already absconded with our information, do we really care what it does with it? Still, absent even a rough overview of the technology, several of the potential dangers of our lax attitudes towards data privacy may not be fully apparent to the average news consumer.
In the broadest sense, data mining is simply the use of computer algorithms to adaptively classify a dataset in order to discover subtle relationships. “Adaptively” is the key word in that sentence. Data mining is certainly not the straightforward testing of relationships over a large dataset (that would be standard statistical analysis). For this reason, data mining is often categorized as an extension of artificial intelligence or machine learning. The advantage of data-mining techniques over traditional methods lies in their ability to perceive relationships that would elude human perception. As humans, we are programmed to look for certain types of patterns in data sets: we have certain sensitivities (perceiving faces and voices) and insensitivities (very small or large numbers), and when approaching a dataset these biases can blind us to relationships that contain a good deal of predictive power. Unlike standard statistical methods, most data-mining algorithms simply begin with a metric to track and then proceed to use examples (called a training set) to create categories that explain the metric’s variance. Since no categories are proposed at the outset, this adaptive technique greatly resembles “learning” and can uncover relationships across thousands of distinct inputs that a human would likely never discover.
Before my readers’ eyes glaze over, perhaps I can offer an example showing the difference between the two approaches. Say we would like to understand what factors influence teacher performance in primary schools. We can imagine a large dataset comprising the dossiers of teachers along with a metric representing the overall effectiveness of those teachers in educating their students. In a standard statistical approach, one might test whether classroom size, income, or location, or even a linear combination of all the variables, affects the ultimate performance metric. From these standard methods we would expect an answer telling us whether any one of our proposed relationships is statistically valid (e.g. a 0.85 correlation between class size and performance at 95% confidence). However, this approach is limited: there might be very subtle relationships lurking in the many thousands of variables we have on each teacher. Enter data mining. At the end of running a mining algorithm we might arrive at entirely unexpected rules that would never have been discovered by a pre-specified query. We might discover, for instance, that teachers below the age of 35 have an improved performance metric, but only if they are unmarried and living in an urban area; or that teachers above the age of 40 in suburbs actually improve their performance metric when class size is increased above 20 students. Any number of complicated relationships might be generated from the very subtle interplay of the thousands of variables describing each teacher, relationships that no one would ever have thought to propose.
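For the programmatically inclined, here is a minimal sketch of the difference, written in Python with the scikit-learn library. The file name and column names are hypothetical stand-ins for the teacher dossiers described above, not a real dataset:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor, export_text

# Hypothetical dossier data: one row per teacher, numeric columns.
df = pd.read_csv("teachers.csv")
X = df[["class_size", "age", "married", "urban"]]
y = df["performance"]  # the metric whose variance we want to explain

# Standard statistical approach: propose a single linear relationship
# up front and measure how well it fits.
linear = LinearRegression().fit(X, y)
print("R^2 of the pre-specified linear hypothesis:", linear.score(X, y))

# Data-mining approach: propose nothing, and let a tree learner carve
# the training set into categories on its own.
tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```

The printed tree is exactly the kind of rule described above, e.g. a branch reading “age <= 35, married <= 0.5, urban > 0.5 → high performance,” a conditional relationship that no one proposed in advance.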
At this point one might ask: so what is the problem with this technology? Surely a tool that helps us ground our decisions in reality-based facts can only be a good thing. If we assume no malice on the part of the one conducting the investigation, there can be nothing to lose from implementing a new approach to data. Well, like most things, the devil is in the details.
In the previous example, the rules I offered were odd but easy to understand. In a more realistic scenario, rules could involve the interplay of thousands of variables and might not be logically discernible to non-experts. In many, if not most, cases, huge multivariate relationships could be standing in for causal relationships between data not even contained in the original dataset. Of course a domain expert could, with some work, gain an understanding of why these rules hold predictive power, but this is expensive, and given the nature of current data-mining practices, such rigorous methods are almost never pursued. In fact, to some analysts, data-mining rules are seen as a possible replacement for domain knowledge. This can lead to a perverse scenario in which data-mining applications actually blind analysts to the causal relationships in play. When the data becomes large and complex, it is tempting to let the so-called “learning” of the data-mining algorithm replace expertise. But despite the nomenclature, the algorithm hasn’t really applied any intelligence; a modern data-mining system can proceed with its powers of discernment completely disabled.
All of this sounds very esoteric, so perhaps I can return to our original example of tracking social and personal data. The great and abiding problem with allowing large institutions free rein to mine these personal data sources is that citizens will have only a vague notion of what causal inferences are being used to track them. In all likelihood the institutions have no clear idea either, and because of this, any number of things might be unwittingly inferred about subjects. With a large dataset, a machine-learning algorithm might discover an exact fingerprint that statistically identifies people as gun owners, transgender, Muslims, home-schoolers, or any number of things not even tracked in the initial dataset. Of course, these relationships will be obscure to all but the most rigorous analyst, but in a poorly supervised system these fingerprints (standing in for first-hand data) will be used to make decisions.
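To make the mechanics concrete, here is a hedged sketch of how such a fingerprint arises in practice. It assumes a small sample in which the sensitive attribute happens to be known (say, from a survey), and a much larger population in which it was never collected; every file and column name here is hypothetical:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# A small labeled sample where the sensitive attribute is known...
labeled = pd.read_csv("survey_sample.csv")
# ...and the large population where it was never asked about.
population = pd.read_csv("population.csv")

# Innocuous, numerically encoded variables: purchases, locations, etc.
features = ["purchase_freq", "zip_income", "page_views", "magazine_subs"]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(labeled[features], labeled["gun_owner"])

# The fitted model is now a statistical proxy for an attribute the
# institution never collected; one line "labels" millions of people.
population["inferred_gun_owner"] = clf.predict(population[features])
```

Nothing in the population file says anything about guns; the inference rides entirely on the fingerprint learned from the sample, and any downstream system will treat the inferred column as though it were first-hand data.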
Privacy advocates frequently bring up examples where someone’s sexual and dietary habits might be inferred from access to an Amazon wish list or a public Facebook profile. This scenario is often dismissed by cooler-headed individuals as paranoid tripe. Don’t the analysts at Amazon or the NSA have better things to do than sit around compiling personal dossiers on citizens? And how could anyone be spiteful enough to snoop into strangers’ personal lives? The cooler heads are certainly right in assuming a lack of spite on the part of data-miners; unfortunately, sloth might accomplish what malice does not. Your employer or the government might be alerted by their data-analysis system that you are a suspicious or “interesting” individual simply because you have a hobby of collecting antique guns. Of course, no one would ever have asked whether you were a gun collector to begin with, but nevertheless they are getting an answer (in so many words) from the algorithm. A data-mining algorithm that reads through thousands of different variables across a dataset of millions of individuals will likely have fingerprints (statistical proxies) for any number of personal facts. As a result, government or corporate dossiers could implicitly contain any number of pieces of information about an individual.
I suppose at this point one could introduce another, larger philosophical objection: but really, what is the problem? If the relationships are obscured within the system, no individual is the wiser; and if statistically some groups are prone to being more violent, better customers of embarrassing products, or simply not well suited to some social environments, shouldn’t we be using this information to improve society? After all, no person will likely ever discover or know these relationships. But the problem is just that. No one does know! The human transaction of investigation, which should contain a level of discretion, has been replaced by a headless automaton. In the past, Americans decided that, at some level, there was a core concept of privacy: some predictive information would not be used to make judgments about an individual regardless of the benefits to society. Data mining, in effect, could circumvent this concept, using peripheral data to compose an implicit portrait of people’s more intimate biographical details.
All of this brings me back to two recent news stories: the revelation of extensive and racially biased “stop and frisk” policies within Bloomberg’s New York City, and the increasing prevalence of CCTV surveillance throughout most urban areas in Europe and America. Certainly both the surveillance of citizens and the use of racial profiling are troubling in themselves. However, in both cases, the extent of the evil was mitigated by technological limitations and a firm desire on the part of the public not to transgress common-sense privacy norms. CCTV could never become the Orwellian Panopticon of Big Brother for the simple reason that, for each camera, there had to be an official at the other end watching (an intractable staffing issue). Moreover, despite having tolerated racially profiled “stop and frisk” policies in the past, New Yorkers are now demonstrating that they find such policies unethical even if they do help law enforcement reduce crime rates. There are too many people who value privacy above security, and too few watchers willing to test their limits, for a nightmare scenario ever to be realized.
However, with the advent of data mining, both the technological barriers and our ability to detect ethical problems in the invasion of our privacy are fundamentally undermined. Whereas a city-wide CCTV surveillance system could previously never realistically be staffed, a CCTV database paired with a data-mining algorithm trained on the facial, action, and motion patterns of criminals could quite easily provide an effective means of detecting potential crimes in progress. Conversely, police officials forbidden from using race as a means to track suspects may turn to subtler methods of tracking criminal activity: data-mining algorithms trained on everything BUT race and race’s more obvious correlates. What is likely to emerge are large, complicated data-fingerprints that proxy for race in everything but name. The New York Mayor’s Office was able to deny the use of race in its “stop and frisk” policy for years by hiding behind an obvious proxy (saying it was targeting “neighborhoods”). Future regimes able to couch their implicit racial strategy in complicated datasets may be virtually undetectable by the public. Of course those rigorous enough to check (or cynical enough to guess) will know that society is essentially tracking its members by race anyway; and so, at the end of the day, society will be using technology to do an end run around its ethical principles of privacy and fairness.
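A short sketch shows how easy this end run is, and why it is so hard to see from the outside. Suppose race is recorded for auditing purposes but deliberately withheld from training; the file and column names are again hypothetical:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

df = pd.read_csv("stops.csv")

# Train on everything BUT race: the model never sees that column.
features = [c for c in df.columns if c not in ("race", "outcome")]
model = GradientBoostingClassifier().fit(df[features], df["outcome"])

# The "race-blind" risk score assigned to each subject.
df["risk_score"] = model.predict_proba(df[features])[:, 1]

# The audit the public never sees: how strongly does the race-blind
# score track race anyway, via neighborhood, income, and other proxies?
print(df.groupby("race")["risk_score"].mean())
```

If the final line prints sharply different average scores by race, the model has reconstructed race from its correlates in everything but name; and unless someone runs exactly this audit, no one will ever know.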
The ancient Latin conundrum Quis custodiet ipsos custodes? (“Who watches the watchers themselves?”) expressed clearly the problem of limiting the reach of authorities while holding those same authorities accountable. In answer, Western common law proposed the dual values of privacy and transparency: authorities needed to be restrained from using some information (even if readily available), and citizens had the right to know how and where they were being watched. I may sound paranoid, but in a society locked into a system of machine surveillance, neither the principle of privacy nor that of transparency can be truly realized. Extensive data collection and mining may have untold benefits for policing, marketing, and general convenience, but without an enormous effort on the part of society to manage that system, citizens will no longer understand the means and mechanisms through which they are constantly watched and judged. The machine will see it, the machine will flag it, the machine will determine the boundaries for judgment and set the conditions for the interaction between the citizen and the authorities. In a society so ruled, “Who watches the watchers?” not only ceases to have a meaningful answer; it ceases to be a meaningful question.