Mining Healthcare Data : A Modern Rumpelstiltskin Story

Posted on February 5, 2014 by Dave "the distributist" Donovan

Via Megan McArdle: the New York Times and the Washington Post are reporting on recent problems stemming from the Obama Administration’s healthcare-data project. Apparently, data analysts studying the health impacts of new programs are not controlling their experimental samples. Whereas ideally the government’s analysis would be the basis for crafting intelligent policy, the New York Times’s description calls into question the robustness of the research being conducted.

“The studies that are regarded as the most reliable randomly assign people or institutions to participate in a program or to go on as usual, and then compare outcomes for the two groups to see if the intervention had an effect.

Instead, the Innovation Center has so far mostly undertaken demonstration projects; about 40 of them are now underway. Those projects test an idea, like a new payment system that might encourage better medical care — with all of a study’s participants, and then rely on mathematical modeling to judge the results.”

The superficial approach described above is odd because it seemingly flies in the face of conventional approaches to statistical modeling. For those not familiar, establishing a randomized control is essential to getting results that don’t just confirm the hypothesis being tested. You can see this problem in the infamous Israeli Air Force study (a really informative overview of this concept can be found on YouTube), and it’s a long-standing statistical understanding that, when possible, randomized control samples are always preferable.
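To make the point concrete, here is a minimal simulation sketch of regression to the mean, the effect usually associated with the flight-instructor example. Everything in it is an assumption for illustration: the pilots, the noise model, and the names `simulate_flights` and `mean_change` are invented, and no real data is involved. It shows that the worst performers on a first flight "improve" on the second with no intervention at all, and that a randomly assigned control group is what exposes a do-nothing treatment as useless.

```python
import random
import statistics

def simulate_flights(n_pilots=10_000, seed=0):
    """Two flight scores per pilot: fixed skill plus independent luck (all invented)."""
    rng = random.Random(seed)
    skill = [rng.gauss(0, 1) for _ in range(n_pilots)]
    first = [s + rng.gauss(0, 1) for s in skill]
    second = [s + rng.gauss(0, 1) for s in skill]
    return first, second

def mean_change(pairs):
    """Average change in score from the first flight to the second."""
    return statistics.mean(b - a for a, b in pairs)

first, second = simulate_flights()
pairs = list(zip(first, second))

# Cut points for the bottom and top 10% of first-flight scores.
deciles = statistics.quantiles(first, n=10)
worst = [p for p in pairs if p[0] <= deciles[0]]
best = [p for p in pairs if p[0] >= deciles[-1]]

worst_change = mean_change(worst)   # positive: the worst "improve" on their own
best_change = mean_change(best)     # negative: the best "decline" on their own
all_change = mean_change(pairs)     # about zero: nothing actually changed

# Randomly split the worst pilots into a "reprimanded" group and a control group.
# The reprimand has no effect in this model, so the two groups should change alike.
rng = random.Random(1)
rng.shuffle(worst)
half = len(worst) // 2
treated_change = mean_change(worst[:half])
control_change = mean_change(worst[half:])

print("worst decile change:", round(worst_change, 2))
print("best decile change:", round(best_change, 2))
print("overall change:", round(all_change, 2))
print("treated minus control:", round(treated_change - control_change, 2))
```

Without the control group, the positive change among the worst pilots looks like evidence that the reprimand worked. With it, the treated and control groups move together, which is the whole argument for randomization.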

So why do government analysts feel so confident that they can dispense with what has, until recently, been an essential feature in any statistical experiment? Well, because they’ve got great data-mining technology! Here, the phrase “mathematical modeling” does a lot of work in obscuring the real methods that the government is using. Mathematical modeling can really mean anything, and ironically the NYT’s link on this description is broken.

Megan McArdle has a good take on the possible sources of the mistake: sloppy thinking on the part of federal bureaucrats. Says McArdle:

Gold’s article implies that the administration is looking at gross savings — which is to say, it’s just reporting the amount of money saved by the accountable-care organizations that ended up on the positive side of the ledger, even though this is less than half the total. Statisticians have a term for this: the Texas sharpshooter fallacy…..
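The gross-versus-net distinction can be sketched with a toy simulation. This is only an illustration under loud assumptions: the organizations, the dollar units, and the function `simulate_savings` are hypothetical, and the true effect of the program is set to exactly zero so that every measured "saving" is noise.

```python
import random

def simulate_savings(n_orgs=400, noise_sd=1.0, seed=1):
    """Measured savings per organization (hypothetical $ millions), true effect zero."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, noise_sd) for _ in range(n_orgs)]

savings = simulate_savings()

# Gross savings: count only the organizations that landed on the positive side.
gross = sum(s for s in savings if s > 0)

# Net savings: count everyone, including those that cost more than before.
net = sum(savings)

share_positive = sum(s > 0 for s in savings) / len(savings)

print("share of organizations with savings:", round(share_positive, 2))
print("gross savings:", round(gross, 1))
print("net savings:", round(net, 1))
```

Even though the program does nothing in this model, the gross figure comes out large and positive, while the net figure hovers near zero. Drawing the target around the hits after the fact is what makes the result look impressive.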

Perhaps I am even more cynical than McArdle, but my take is somewhat different.

Given that the administration has been unable to produce evidence of healthcare savings from increased coverage, it is fair to say that the president is feeling pressure to come up with some statistical result that will make costs appear more reasonable (at least ahead of the next CBO estimate). Moreover, without speculating too much as to the overall structure of bureaucratic management, I don’t think it is unlikely that individual analysts are also feeling the pressure to deliver “good” results, especially with all of these cool new “big data” tools so prominently featured in the news.

The result is predictable: a sort of magical thinking arises where data-mining and complex models become a panacea for turning poorly conducted statistical tests into predictive models showing large savings from new “innovative” approaches to delivering healthcare. Of course the results are all confirmation bias, but who’s going to look a gift horse in the mouth? Certainly not an administration desperate for good news on the healthcare front.

Now admittedly, I have no inside information, but if this kind of sloppy analysis is indeed going on then it is certainly a cause for concern. The one-sided use of over-optimistic healthcare predictions could lead the CBO to perennially underestimate the cost of supporting programs like Medicare in their current state. This in turn could ultimately doom these programs’ long-term solvency (not to mention the long-term solvency of the country), since politicians are all too willing to forgo necessary reform in the light of CBO reports that tell them healthcare costs will come down of their own accord.

But ultimately this problem is not political. It stems from a cultural approach to data analysis that is far too prevalent in industry and in government. I like to think of it as a modern-day Rumpelstiltskin story. What do we have? Reams of uncontrolled data. What do we want? Optimistic predictive results. With this point of view, it’s tempting to simply lock analysts in a room and ask them to build mathematical models until they finally manage to spin the straw data into golden predictive models, like the miller’s daughter from the aforementioned fairy tale.

But just as in the fairy tale, when we force someone to spin straw into gold, it shouldn’t be surprising when magical methods play a large role in their process. Moreover, in the case of the government’s own analysis, the Rumpelstiltskin metaphor can be taken yet further. For in trusting their magic numbers, our current leaders may have put the next generation on the line for the results.

The Truth About Data Science

Posted on February 4, 2014 by Dave "the distributist" Donovan

From a recent conference on data science :

“A data scientist is a statistician who lives in San Francisco.”

“Data Science is statistics on a Mac.”

“A data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician.”

True words.

For the Weekend : Seven Quick Takes #7QT

Posted on January 18, 2014 by Dave "the distributist" Donovan

Using a tried-and-true internet meme: seven quick reactions to interesting things on the internet this week.

1.) Math : One of the Strangest Numerical Identities Ever

Who would have thought that 1 + 2 + 3 + 4 + 5 + … = -1/12? The guys at the YouTube channel Numberphile have the details:

The result is sound but a little deceptive. The idea of a non-converging series having a finite numerical identity is already such an abstract concept that it’s hard for me to really expect a rational-looking result moving forward. Still, the result is weird, valid, and apparently applicable to string theory. But how can you add a bunch of positive numbers together and still get a negative one?
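One way to see where the -1/12 comes from is a numerical sketch with a smooth cutoff. This is an illustration under stated assumptions, not necessarily the argument made in the video: the exponential damping factor and the function names `smoothed_sum` and `finite_part` are my own choices. The series itself diverges; the value -1/12 is what remains after the divergent part is subtracted, and it agrees with the analytically continued zeta function at -1.

```python
import math

def smoothed_sum(eps):
    """Sum of n * exp(-eps * n) for n = 1, 2, 3, ... (truncated once terms are negligible)."""
    n_terms = int(60 / eps)  # beyond this, exp(-eps * n) is below roughly 1e-26
    return math.fsum(n * math.exp(-eps * n) for n in range(1, n_terms + 1))

def finite_part(eps):
    """Subtract the divergent 1/eps**2 piece; what is left tends to -1/12 as eps -> 0."""
    return smoothed_sum(eps) - 1.0 / eps**2

for eps in (0.1, 0.05, 0.01):
    print(eps, finite_part(eps))

print("target:", -1 / 12)
```

As the cutoff is relaxed, the damped sum blows up like 1/eps**2, which is the honest "infinity" in the series, while the leftover constant settles near -0.0833. So nobody is really adding positive numbers and getting a negative one; the negative number is the finite remainder after the infinite part is set aside.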

2.) Privacy – Good News : Obama Scales back the NSA spying program

The Obama Administration looks like it’s finally embarrassed itself enough and is now making some retractions on spying. Though, I suspect this ultimately may amount to little more than a strategic repositioning. It looks like there is still going to be some kind of data retention mandate that forces third parties to retain data in case the government has “legitimate” reasons to make queries. Of course, we all know that we can count on large corporations to securely hold vast collections of personal data they do not expect to profit from. Right? Right?

3.) Privacy – Bad News : Target’s Personal Data Breach May Be Worse than expected

So, unsurprisingly, Target has lost an astronomical amount of customer data including e-mail, telephone numbers, and PINs. It’s looking like negligence on the part of Target played a role in the breach but I am not expecting anyone to ultimately care about this. Even if we can expect a large amount of identity theft coming from the data showing up on a Russian hacker site, is the public really going to link the increase in identity theft back to bad corporate data management?

4.) Folk Music : Is Inside Llewyn Davis Accurate?

It looks like a number of folk singers have pushed back against the portrayal of early 60s Greenwich Village. No comment here, but the discussion is quite interesting.

5.) Entertainment : The Academy Award Nominated Animated Shorts

Probably the highlight of the Academy Award season is the Animated Shorts category. It’s usually worth it to catch the shorts in theaters and now we have the lineup with trailers:

This year’s fare seems a little disappointing, however. Most look like very straightforward Pixar clones. In fact, I’m finding myself looking forward more to a feature-length nomination, namely, The Wind Rises, Miyazaki’s last film detailing the life of Jiro Horikoshi, the father of Japanese aviation and the infamous creator of the Zero.

6.) Religion : Atheist “Church” in Schism Several months after starting

I blogged several months ago about the coming of an “Atheist Babylon”, where atheist communities would schism along ideological lines so stark that most non-believers would prefer to associate with like-minded believers rather than their atheist brethren. Sure enough, in the news today:

The squabbles led to a tiff and finally a schism between two factions within Sunday Assembly NYC (Atheist congregations DL). Jones reportedly told Moore that his faction was no longer welcome in the Sunday Assembly movement.

Moore promises that his group, Godless Revival, will be more firmly atheistic than the Sunday Assembly, which he now dismisses as “a humanistic cult.”

Someone still needs to explain what “humanistic cult” means to a militant atheist; it sounds like something a conservative Catholic would say.

7.) Data : What will the Impact of Big Data be on Pharmaceutical Marketing?

Finally, a new speculation about uses of data-mining in the pharmaceuticals industry. This looks like a long shot to me, but it’s worth keeping an eye on. I might start worrying if the results are more successful.

Anyway, Happy Martin Luther King Day Weekend! I’ll stay updated.

Byte Counter Byte: Human vs. Machine Judgment

Posted on January 10, 2014 by Dave "the distributist" Donovan

The central question in the digital age may be “who owns our data?” but this could just as easily be rephrased as “who makes the decisions over how our data is used?”. So far the decisions have been increasingly made by machines. The conversation shown here illustrates that there may be some caveats to this assumption.

In a nutshell, this is more or less the disagreement as it’s seen from a data-scientist’s perspective. But I think that there are more fundamental questions for consumers. Beyond which of the two models will ultimately dominate the market, do we want machines to manage how our data is used and analyzed?

Distributist Resolutions for the Digital Age

Posted on January 3, 2014 by Dave "the distributist" Donovan

I’ve been thinking a lot about becoming more responsible for my digital property in 2014. It’s not just the scandal with the NSA. It’s realizing how much of one’s life is essentially tied up in strings of “1’s” and “0’s” stored on large corporate-owned servers. If there is one thing I’ve learned in 2013, it’s how fundamentally essential my digital information is to my personal well-being.

This general attitude has only been reinforced since I heard from a friend who lost $10,000 in Bitcoin when he cancelled a cloud account without making a proper backup. At first it sounded outlandish to be so invested in pieces of information that essentially didn’t exist beyond their presence on a third-party server. Then I thought of the copious amounts of ebooks, apps, and music I “owned” but that could be easily rescinded by the party in charge of the DRM.

So in 2014 I will be trying out some new resolutions, not just to ensure my own information is secure, but to really become part of the collective solution that will eventually be needed to solve the issue.

1.) Use Non-Proprietary and DRM-free file formats

I think one of the main ways to ensure privacy is to establish boundaries between data owned by the user and the data owned by the service. Nothing has hurt this distinction more than the existence of pervasive DRM. By now everyone is familiar with horror stories of people losing their collection of ebooks or mp3’s based on legalistic mismanagement on the part of Amazon or Apple. But beyond ridiculous worst-case scenarios, the truly destructive part of DRM is the implicit understanding it embodies that a user does not own a digital piece of media the way they own a physical copy of the same material. In order for any sort of rational concept of information ownership to emerge, DRM must go.

For myself this is a daunting task. Like most users of my generation, I bought into the digital marketplace early and without thinking of the infrastructure I was creating. As a result I have invested thousands in media formats that are DRM locked. Yes, I can strip it off, but this takes time and is not exactly legal. For the time being, at least I can stick to the formats that are open. No more Kindle books or iTunes media.

2.) Use the Open Source Alternative

Alright, alright. I have already written about the general futility of trying to work without proprietary software. But I am also sick and tired of companies abusing their market dominance of applications to rope people’s data into their own personal cloud systems. There is no way to survive (in the corporate world at least) without Microsoft Office; but I’ve noticed increasingly that the application is trying to move my documents from the hard drive to the cloud. Creepy, but especially creepy considering that, due to Microsoft’s dominance, the open source alternatives for word processing and spreadsheet management provide no real alternative in a modern workflow.

For the time being it’s baby steps: using Firefox instead of Google Chrome, GIMP instead of Adobe Photoshop. Though, it might be a while before I can ditch iTunes or Microsoft Office.

3.) Keep Updated With Privacy News and Networks

There are plenty of ways to keep abreast of the various updates to the status of online privacy. However, I have to confess, as much as I love talking about privacy in the abstract, I hate actually following the day-to-day news concerning which new groups have, most recently, been abusing digital privacy. Still, there is no real way to handle the issue without being informed. Not to mention, I’d be a bit of a hypocrite to complain about user apathy when I can’t be bothered to read a three page article about the new Google terms of service.

I should have my work cut out for me for the next year. I also plan to exercise regularly, sustain a low-carb diet, and lose ten pounds; but, of course, that resolution should be relatively easy to keep.

Why I’m Praying for More Judicial Activism against Online Privacy

Posted on December 29, 2013 by Dave "the distributist" Donovan

We all knew this was coming. Yesterday, the courts pushed back against earlier rulings on privacy and the NSA’s data-collection schemes. From the New York Times :

“A federal judge on Friday ruled that a National Security Agency program that collects enormous troves of phone records is legal, making the latest contribution to an extraordinary debate among courts and a presidential review group about how to balance security and privacy in the era of big data.

In just 11 days, the two judges and the presidential panel reached the opposite of consensus on every significant question before them, including the intelligence value of the program, the privacy interests at stake and how the Constitution figures in the analysis.”

I do hope that the US Supreme Court picks this case up. Not that I’m expecting the court to rule in favor of privacy; I just want some definitive status quo so that an honest discussion of the issue can take place. I get the sense reading the news that no one really understands what’s at stake or the relevant legal precedent for online privacy. The technology is changing fast and, consequently, no one feels like it’s worth developing a strong opinion. At this time, a large judicial decision might help people wake up and become involved with the issue of digital privacy.

I think this is more or less the role that Roe vs. Wade had on the issue of abortion. Before the landmark ruling, the anti-abortion movement was a disorganized coalition of church groups shell-shocked by the sexual revolution and unable to put forward any argument beyond dogma. Forty years later, with the specter of Roe vs. Wade still looming, the Pro-Life community has formed itself into a cohesive and burgeoning movement dwarfing its opposition on the national stage.

I would hope something similar might be possible for the advocates of online privacy. A setback wrought through judicial activism would be bad; but could anything be worse than the slow deterioration of privacy through apathy and public ignorance?

An Ode to the Small Victories

Posted on December 17, 2013 by Dave "the distributist" Donovan

I’m a cynic by nature, so it’s great to occasionally revel in a small victory even though one might have an eerie foreboding that the gains seen today will be wiped away by tomorrow’s ill omens.

First, several weeks ago, I heard a very interesting story about a segment of engineers at Google working towards advancing the cause of data ownership. Recently, these folks introduced a new function to easily download one’s data off of the Google platform. This might seem to be a pedantic thing for people not on the techie side of things, but this is a pretty major concession on the part of a company that makes its money from exerting control over personal information. More than the actual utility of downloading 4 gigabytes of e-mail messages and “having them” on one’s drive, the step of accommodating the download tacitly concedes that users have a distinct ownership over the data they upload onto Google. This is a big step.

I’ve always been a little concerned about how dependent we have become on Google to hold the information infrastructure that supports our lives. I still don’t trust the company, but it’s good to know that there are people who are aware of the problems.

But just as the advocates for privacy must be ever vigilant, so too must Justice’s sword be swift at avenging its offense, and oh, did the avenging sword fall today:

“I cannot imagine a more ‘indiscriminate’ and ‘arbitrary’ invasion than this systematic and high-tech collection and retention of personal data on virtually every single citizen for purposes of querying and analyzing it without prior judicial approval,”

That quote comes from Judge Richard L. Leon of the Federal District Court for DC, in the first ruling of what I hope will be an utter judicial repudiation of the Obama administration’s assault on privacy and data ownership. It’s hard not to be optimistic after a ruling that strikes at the core of the NSA’s efforts to co-opt all privately collected data. However, this is not as definitive as it might at first seem. The ruling is just an injunction against further collection of data, and this step will likely be overturned by a higher court considering the liberties Judge Leon took with interpreting Smith v. Maryland.

Nonetheless, one has to celebrate the small victories, raise a glass, and sing something appropriate to the occasion, maybe by Robert Burns….

Slate.com : The Government Can Learn More From your Data than you might Think

Posted on November 24, 2013 by Dave "the distributist" Donovan

In today’s Slate, Dahlia Lithwick echoes a concern that I’ve voiced previously in my post : Quis custodiet ipsos custodes?. Says Lithwick :

…our metadata in fact tells the government a lot more about us than we might realize, especially when different types of metadata are aggregated together. Consider calls to single-purpose hotlines: NSA collection of our metadata means the government knows when we’ve called a rape hotline, a domestic violence hotline, an addiction hotline, or a support line for gay teens. Hotlines for whistleblowers in every agency are fair game, as are police hotlines for “anonymous” reports of crimes. Charities that make it possible to text a donation to a particular cause (say, Planned Parenthood) or political candidate or super PAC could reveal an enormous amount about our political activities.

I’m glad to see that the issue is receiving more of the legal attention that it deserves. The article emphasizes an important issue, namely, how the new data-mining technology will allow obscure facts to be inferred from seemingly innocuous data, independent of individual human observers. The conversation is certainly advancing. The first step is making the average citizen aware that the data they think is available about them online is only the tip of the iceberg of the information actually available to a data-miner.
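The kind of inference described above takes very little machinery. The sketch below is purely illustrative: the phone numbers, caller names, hotline directory, and the function `infer_profiles` are all invented, and it makes no claim about how any real agency processes records. It only shows that a join between call metadata and a public directory is enough to label people, without anyone listening to a single call.

```python
from collections import defaultdict

# Hypothetical directory of single-purpose numbers (all values invented).
HOTLINES = {
    "555-0101": "addiction support",
    "555-0102": "domestic violence support",
    "555-0103": "agency whistleblower line",
}

# Hypothetical metadata only: (caller, number dialed, duration in seconds).
CALL_RECORDS = [
    ("alice", "555-0101", 620),
    ("alice", "555-0199", 45),
    ("bob", "555-0103", 300),
    ("alice", "555-0101", 910),
    ("carol", "555-0188", 30),
]

def infer_profiles(records, directory):
    """Count, per caller, how many calls went to each labeled hotline."""
    profiles = defaultdict(lambda: defaultdict(int))
    for caller, dialed, _duration in records:
        label = directory.get(dialed)
        if label is not None:
            profiles[caller][label] += 1
    return {caller: dict(labels) for caller, labels in profiles.items()}

profiles = infer_profiles(CALL_RECORDS, HOTLINES)
for caller, labels in profiles.items():
    print(caller, labels)
```

Nothing in the loop knows or cares what the labels mean, so the sensitivity of the output is never weighed by anyone unless a human chooses to look.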

Still, there is a missing piece. None of the mainstream articles on this subject, so far, have talked about how data-mining might obscure from the data-collectors themselves the intrusiveness of their queries. With the kind of automation that is available, it is not hard to imagine an algorithm-developed personal profile being created with information much more intimate than the developers of said algorithms intended. A data-miner ignorant of his domain (or asleep at the switch), might be much more dangerous than any nosey bureaucrat.

For the time being, my concern is still more in the realm of science fiction. Nonetheless, I am expecting that it won’t take too long for a major scandal to break where the authorities’ lack of self-awareness about their own intrusiveness will be all too obvious. It wasn’t long ago that large-scale accurate digital surveillance was itself science fiction. Technology, especially when automated, has a way of surpassing our own awareness of it.

The Distributist

Information, Ownership, Insight

Tag Archives: datamining
