Mining Healthcare Data : A Modern Rumpelstiltskin Story

Via Megan McArdle: the New York Times and the Washington Post are reporting on recent problems stemming from the Obama Administration’s healthcare-data project. Apparently, data analysts studying the health impacts of new programs are not controlling their experimental samples. Whereas ideally the government’s analysis would be the basis for crafting intelligent policy, the New York Times’s description calls into question the robustness of the research being conducted.

“The studies that are regarded as the most reliable randomly assign people or institutions to participate in a program or to go on as usual, and then compare outcomes for the two groups to see if the intervention had an effect.

Instead, the Innovation Center has so far mostly undertaken demonstration projects; about 40 of them are now underway. Those projects test an idea, like a new payment system that might encourage better medical care — with all of a study’s participants, and then rely on mathematical modeling to judge the results.”

The superficial approach described above is odd because it seemingly flies in the face of conventional approaches to statistical modeling. For those not familiar, establishing a randomized control is essential to getting results that don’t just confirm the hypothesis being tested. You can see this problem in the infamous Israeli Air Force Study (a really informative overview of this concept can be found on YouTube), and it has been a long-standing statistical understanding that, when possible, randomized control samples are always preferable.
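To make the point concrete, here is a minimal simulation sketch. It assumes the study referred to above is the flight-instructor example usually used to illustrate regression to the mean, and every name and number in it (simulate_flights, the 10,000 pilots, the decile cutoffs) is hypothetical. No feedback effect is modeled at all, yet the uncontrolled comparison still appears to show that the worst performers improve after being singled out.

```python
import random

def simulate_flights(n_pilots=10_000, seed=0):
    """Each pilot flies twice; a score is fixed skill plus independent luck."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pilots):
        skill = rng.gauss(0, 1)
        first = skill + rng.gauss(0, 1)
        second = skill + rng.gauss(0, 1)  # feedback has no effect in this model
        pairs.append((first, second))
    return pairs

def mean_change(pairs):
    """Average change in score from the first flight to the second."""
    return sum(second - first for first, second in pairs) / len(pairs)

pairs = simulate_flights()
ranked = sorted(pairs, key=lambda p: p[0])
worst = ranked[: len(ranked) // 10]      # pilots who would have been criticized
best = ranked[-(len(ranked) // 10):]     # pilots who would have been praised

# Uncontrolled comparison: look only at what happened after the "treatment".
print("worst decile, change in score:", round(mean_change(worst), 2))
print("best decile, change in score:", round(mean_change(best), 2))

# Randomized comparison: a coin flip decides who in the worst decile is "criticized".
rng = random.Random(1)
treated, control = [], []
for pair in worst:
    (treated if rng.random() < 0.5 else control).append(pair)
print("treated minus control:", round(mean_change(treated) - mean_change(control), 2))
```

The worst decile improves and the best decile declines purely because extreme scores drift back toward the average. Only the randomized split, where treated and control pilots are compared against each other, shows that the "intervention" did nothing.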

So why do government analysts feel so confident that they can dispense with what has, until recently, been an essential feature of any statistical experiment? Well, because they’ve got great data-mining technology! Here, the phrase “mathematical modeling” does a lot of work in obscuring the real methods that the government is using. Mathematical modeling can really mean anything, and ironically the NYT’s link on this description is broken.

Megan McArdle has a good take on the possible sources of the mistake: sloppy thinking on the part of federal bureaucrats. Says McArdle:

Gold’s article implies that the administration is looking at gross savings — which is to say, it’s just reporting the amount of money saved by the accountable-care organizations that ended up on the positive side of the ledger, even though this is less than half the total. Statisticians have a term for this: the Texas sharpshooter fallacy…..
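The problem with gross savings can be shown in a few lines. This is only a sketch with made-up numbers: the 200 organizations and their results are hypothetical, drawn as pure noise around zero, so the true savings are nil by construction.

```python
import random

def simulate_savings(n_orgs=200, seed=0):
    """Hypothetical organizations whose results are noise around zero (in $ millions)."""
    rng = random.Random(seed)
    return [rng.gauss(0, 1.0) for _ in range(n_orgs)]

savings = simulate_savings()

# Gross: count only the organizations that ended up on the positive side of the ledger.
gross = sum(s for s in savings if s > 0)

# Net: count every organization, winners and losers alike.
net = sum(savings)

print("gross savings:", round(gross, 1))
print("net savings:", round(net, 1))
```

Reporting only the winners produces a large positive figure even though nothing in the simulation saves any money, which is exactly the target painted around the bullet holes.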

Perhaps I am even more cynical than McArdle, but my take is somewhat different.

Given that the administration has been unable to produce evidence of healthcare savings from increased coverage, it is fair to say that the president is feeling pressure to come up with some statistical result that will make costs appear more reasonable (at least ahead of the next CBO estimate). Moreover, without speculating too much as to the overall structure of bureaucratic management, I think it is likely that individual analysts are also feeling the pressure to deliver “good” results, especially with all of these cool new “big data” tools so prominently featured in the news.

The result is predictable: a sort of magical thinking arises where data-mining and complex models become a panacea for turning poorly conducted statistical tests into predictive models showing large savings from new “innovative” approaches to delivering healthcare. Of course the results are all confirmation bias, but who’s going to look a gift horse in the mouth? Certainly not an administration desperate for good news on the healthcare front.

Now admittedly, I have no inside information, but if this kind of sloppy analysis is indeed going on then it is certainly a cause for concern. The one-sided use of over-optimistic healthcare predictions could lead the CBO to perennially underestimate the cost of supporting programs like Medicare in their current state. This in turn could ultimately doom these programs’ long-term solvency (not to mention the long-term solvency of the country), since politicians are all too willing to forgo necessary reform in the light of CBO reports that tell them healthcare costs will come down of their own accord.

But ultimately this problem is not political. It stems from a cultural approach to data analysis that is far too prevalent in industry and in government. I like to think of it as a modern-day Rumpelstiltskin story. What do we have? Reams of uncontrolled data. What do we want? Optimistic predictive results. With this point of view, it’s tempting to simply lock analysts in a room and ask them to build mathematical models until they finally manage to spin the straw data into golden predictive models, like the miller’s daughter from the aforementioned fairy tale.

But just as in the fairy tale, when we force someone to spin straw into gold, it shouldn’t be surprising when magical methods play a large role in their process. Moreover, in the case of the government’s own analysis, the Rumpelstiltskin metaphor can be taken yet further. For in trusting their magic numbers, our current leaders may have put the next generation on the line for the results.

The Truth About Data Science

From a recent conference on data science :

“A data scientist is a statistician who lives in San Francisco.”

“Data Science is statistics on a Mac.”

“A data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician.”

True words.

For the Weekend : Seven Quick Takes #7QT

Using a tried-and-true internet meme: seven quick reactions to interesting things on the internet this week.

1.) Math : One of the Strangest Numerical Identities Ever

Who would have thought that 1 + 2 + 3 + 4 + 5 ……. = -1/12? The guys at the YouTube channel Numberphile have the details:

The result is sound but a little deceptive. The idea of a non-converging series having a finite numerical identity is already such an abstract concept that it’s hard for me to really expect a rational-looking result moving forward. Still, the result is weird, valid and apparently applicable to string theory. But how can you add a bunch of positive numbers together and still get a negative one?
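For the curious, the heuristic manipulation usually given for this identity runs roughly as follows. This is a sketch, not a proof: S stands for 1 + 2 + 3 + 4 + … and the two helper series S_1 and S_2 are notation introduced here. None of these series converges in the ordinary sense, so each "=" assigns a value by a summation method rather than by adding terms.

```latex
\begin{align*}
S_1 &= 1 - 1 + 1 - 1 + \cdots = \tfrac{1}{2} && \text{(Ces\`aro value, not an ordinary sum)} \\
S_2 &= 1 - 2 + 3 - 4 + \cdots = \tfrac{1}{4} && \text{(Abel value, not an ordinary sum)} \\
S - S_2 &= (1 + 2 + 3 + 4 + \cdots) - (1 - 2 + 3 - 4 + \cdots) \\
        &= 0 + 4 + 0 + 8 + \cdots = 4S \\
\Rightarrow\ -3S &= S_2 = \tfrac{1}{4}, \qquad S = -\tfrac{1}{12}
\end{align*}

% The rigorous statement is about the Riemann zeta function:
\[
\zeta(s) = \sum_{n=1}^{\infty} n^{-s} \quad (\operatorname{Re} s > 1),
\qquad \zeta(-1) = -\tfrac{1}{12} \ \text{by analytic continuation.}
\]
```

So the answer to the question above is that nothing is really being added up. The partial sums 1, 3, 6, 10, … grow without bound, and -1/12 is the value that analytic continuation assigns to the series at a point where the defining sum no longer converges.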

2.) Privacy – Good News : Obama Scales back the NSA spying program

The Obama Administration looks like it’s finally embarrassed itself enough and is now making some retractions on spying. Though I suspect this ultimately may amount to little more than a strategic repositioning. It looks like there is still going to be some kind of data retention mandate that forces third parties to retain data in case the government has “legitimate” reasons to make queries. Of course, we all know that we can count on large corporations to securely hold vast collections of personal data they do not expect to profit from. Right? Right?

3.) Privacy – Bad News : Target’s Personal Data Breach May Be Worse Than Expected

So, unsurprisingly, Target has lost an astronomical amount of customer data including e-mail, telephone numbers, and PINs. It’s looking like negligence on the part of Target played a role in the breach but I am not expecting anyone to ultimately care about this. Even if we can expect a large amount of identity theft coming from the data showing up on a Russian hacker site, is the public really going to link the increase in identity theft back to bad corporate data management?

4.) Folk Music : Is Inside Llewyn Davis Accurate?

It looks like a number of folk singers have pushed back against the portrayal of early 60s Greenwich Village. No comment here, but the discussion is quite interesting.

5.) Entertainment : The Academy Award Nominated Animated Shorts

Probably the highlight of the Academy Award season is the Animated Shorts category. It’s usually worth it to catch the shorts in theaters and now we have the lineup with trailers:

This year’s fare seems a little disappointing, however. Most look like very straightforward Pixar clones. In fact, I’m finding myself looking forward more to a feature-length nomination, namely The Wind Rises, Miyazaki’s last film, detailing the life of Jiro Horikoshi, the father of Japanese aviation and the infamous creator of the Zero.

6.) Religion : Atheist “Church” in Schism Several months after starting

I blogged several months ago about the coming of an “Atheist Babylon”, where atheist communities would schism along ideological lines so stark that most non-believers would prefer to associate with like-minded believers rather than their atheist brethren. Sure enough, in the news today:

The squabbles led to a tiff and finally a schism between two factions within Sunday Assembly NYC (Atheist congregations DL). Jones reportedly told Moore that his faction was no longer welcome in the Sunday Assembly movement.

Moore promises that his group, Godless Revival, will be more firmly atheistic than the Sunday Assembly, which he now dismisses as “a humanistic cult.”

Someone still needs to explain what “humanistic cult” means to a militant atheist; it sounds like something a conservative Catholic would say.

7.) Data : What will the Impact of Big Data be on Pharmaceutical Marketing?

Finally, a new speculation about uses of data-mining in the pharmaceuticals industry. This looks like a long shot to me, but it’s worth keeping an eye on. I might start worrying if the results are more successful.

Anyway, Happy Martin Luther King Day Weekend! I’ll stay updated.

Byte Counter Byte: Human vs. Machine Judgment

The central question in the digital age may be “who owns our data?” but this could just as easily be rephrased as “who makes the decisions over how our data is used?”. So far the decisions have increasingly been made by machines. The conversation shown here illustrates that there may be some caveats to this assumption.

In a nutshell, this is more or less the disagreement as it’s seen from a data-scientist’s perspective. But I think that there are more fundamental questions for consumers. Beyond which of the two models will ultimately dominate the market, do we want machines to manage how our data is used and analyzed?

Distributist Resolutions for the Digital Age

I’ve been thinking a lot about becoming more responsible for my digital property in 2014. It’s not just the scandal with the NSA. It’s realizing how much of one’s life is essentially tied up in strings of “1’s” and “0’s” stored on large corporate-owned servers. If there is one thing I’ve learned in 2013, it’s how fundamentally essential my digital information is to my personal well-being.

This general attitude has only been reinforced since I heard from a friend who lost $10,000 in Bitcoin when he cancelled a cloud account without making a proper backup. At first it sounded outlandish to be so invested in pieces of information that essentially didn’t exist beyond their presence on a third-party server. Then I thought of the copious amounts of ebooks, apps, and music I “owned” but that could be easily rescinded by the party in charge of the DRM.

So in 2014 I will be trying out some new resolutions, not just to ensure my own information is secure, but to really become part of the collective solution that will eventually be needed to solve the issue.

1.) Use Non-Proprietary and DRM-free file formats

I think one of the main ways to ensure privacy is to establish boundaries between data owned by the user and the data owned by the service. Nothing has hurt this distinction more than the existence of pervasive DRM. By now everyone is familiar with horror stories of people losing their collection of ebooks or mp3’s based on legalistic mismanagement on the part of Amazon or Apple. But beyond ridiculous worst-case scenarios, the truly destructive part of DRM is the implicit understanding it embodies that a user does not own a digital piece of media the way they own a physical copy of the same material. In order for any sort of rational concept of information ownership to emerge, DRM must go.

For myself this is a daunting task. Like most users of my generation, I bought into the digital marketplace early and without thinking of the infrastructure I was creating. As a result I have invested thousands in media formats that are DRM locked. Yes, I can strip it off, but this takes time and is not exactly legal. For the time being, at least I can stick to the formats that are open. No more Kindle books or iTunes media.

2.) Use the Open Source Alternative

Alright, alright. I have already written about the general futility of trying to work without proprietary software. But I am also sick and tired of companies abusing their market dominance in applications to rope people’s data into their own personal cloud systems. There is no way to survive (in the corporate world at least) without Microsoft Office; but I’ve noticed increasingly that the application is trying to move my documents from the hard drive to the cloud. Creepy, but especially creepy considering that, due to Microsoft’s dominance, the open source alternatives for word processing and spreadsheet management provide no real alternative in a modern workflow.

For the time being it’s baby steps: using Firefox instead of Google Chrome, GIMP instead of Adobe Photoshop. Though, it might be a while before I can ditch iTunes or Microsoft Office.

3.) Keep Updated With Privacy News and Networks

There are plenty of ways to keep abreast of the various updates to the status of online privacy. However, I have to confess, as much as I love talking about privacy in the abstract, I hate actually following the day-to-day news concerning which new groups have, most recently, been abusing digital privacy. Still, there is no real way to handle the issue without being informed. Not to mention, I’d be a bit of a hypocrite to complain about user apathy when I can’t be bothered to read a three-page article about the new Google terms of service.

I should have my work cut out for me for the next year. I also plan to exercise regularly, sustain a low-carb diet, and lose ten pounds; but, of course, that resolution should be relatively easy to keep.

Why I’m Praying for More Judicial Activism against Online Privacy

We all knew this was coming. Yesterday, the courts pushed back against earlier rulings on privacy and the NSA’s data-collection schemes. From the New York Times :

“A federal judge on Friday ruled that a National Security Agency program that collects enormous troves of phone records is legal, making the latest contribution to an extraordinary debate among courts and a presidential review group about how to balance security and privacy in the era of big data

In just 11 days, the two judges and the presidential panel reached the opposite of consensus on every significant question before them, including the intelligence value of the program, the privacy interests at stake and how the Constitution figures in the analysis.”

I do hope that the US Supreme Court picks this case up. Not that I’m expecting the court to rule in favor of privacy; I just want some definitive status quo so that an honest discussion of the issue can take place. I get the sense reading the news that no one really understands what’s at stake or the relevant legal precedents for online privacy. The technology is changing fast and, consequently, no one feels like it’s worth developing a strong opinion. At this time, a large judicial decision might help people wake up and become involved with the issue of digital privacy.

I think this is more or less the role that Roe vs. Wade played on the issue of abortion. Before the landmark ruling, the anti-abortion movement was a disorganized coalition of church groups shell-shocked by the sexual revolution and unable to put forward any argument beyond dogma. Forty years later, with the specter of Roe vs. Wade still looming, the Pro-Life community has formed itself into a cohesive and burgeoning movement dwarfing its opposition on the national stage.

I would hope something similar might be possible for the advocates of online privacy. A setback wrought through judicial activism would be bad; but could anything be worse than the slow deterioration of privacy through apathy and public ignorance?