Peter W Frey (Rank 13th)
Posts 19
Thanks 3
Joined 7 Aug '10

The objective for participants in the HHP contest is to forecast how many days persons in the database will spend in the hospital in the next year.  This outcome measure would seem to depend on how healthy the person is and on the person’s propensity for seeking medical attention for a perceived illness.

The data provided by the sponsors include age and gender but there is no information on family medical history, smoking history, dietary preferences, liquor consumption or exercise activity.  Standard data from annual checkups such as blood test information (providing 40 or so measures), a list of drugs the person is taking and physical measurements such as height, weight and waist circumference are also missing.

Information on the individual's past history in seeking medical help is also very limited. Some individuals seek medical attention when they feel a minor twinge and others only seek medical assistance if they are incapacitated.

One would think that the measures described above would be readily available to medical practitioners.  If the participants had access to this information, their forecasts would be more accurate.   One of the objectives of the contest is to determine if modern predictive analytical techniques can make useful medical predictions.  Given this, why have the sponsors organized a contest that handicaps the participants by not providing relevant data?

The sponsors have also mangled the data.  Drug count and lab count are truncated and length of stay (the outcome measure) has been converted into a non-linear numeric.  These conversions degrade the estimate of the cost of future hospitalizations and ignore the value of methodologies that are effective with outliers.
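To make the point about outliers concrete, here is a minimal sketch. It assumes the non-linear conversion is a log(1 + days) transform and that counts are truncated at a cap; the cap of 15 and the sample stays are made up for illustration and are not taken from the contest data.

```python
import math

def top_code(value, cap):
    """Truncate a count at a cap (the cap is a hypothetical value)."""
    return min(value, cap)

def log_days(days):
    """Assumed non-linear conversion: log(1 + days)."""
    return math.log1p(days)

# Made-up lengths of stay: three short stays and one outlier.
stays = [0, 1, 3, 45]
capped = [top_code(d, 15) for d in stays]
transformed = [log_days(d) for d in capped]

# Share of the total contributed by the outlier at each stage.
raw_share = stays[-1] / sum(stays)
capped_share = capped[-1] / sum(capped)
log_share = transformed[-1] / sum(transformed)
print(f"raw: {raw_share:.2f}, capped: {capped_share:.2f}, log: {log_share:.2f}")
```

Under these assumptions the outlier's weight shrinks at each step, which is why a method that handles extreme stays well gets little credit for it.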

Much has been said about protecting the patients’ privacy but research efforts in other areas such as financial services in which highly sensitive personal data is used have been subjected to a less draconian privacy stance.  The person’s name, address, phone number and social security number have been removed from each record.  In theory, an individual’s pattern of financial activity might be used to identify that person.  However, the probability that a single record among several hundred thousand records could be linked to one of the several hundred million people in this country is extremely small.

A cost-benefit analysis would surely indicate that the improvements in healthcare and the reduction in healthcare costs that could result from more sophisticated medical data processing would outweigh, by many orders of magnitude, the negative impact of an occasional identification of a person in the database.

 
Zach (Rank 31st)
Posts 292
Thanks 64
Joined 2 Mar '11

Where is there a freely available dataset of personal financial transactions, with only name/address/phone #/SSN removed?

 
Peter W Frey (Rank 13th)
Posts 19
Thanks 3
Joined 7 Aug '10

Financial data are only provided to data analysts within a business relationship in which strong confidentiality agreements are in place. However, the privacy issue still remains. The financial institution that provides the data has solid commitments to protect the privacy of its customers and that includes privacy from third parties. Stripping the data of all obvious identification information is generally considered an adequate precaution to protect the customers' privacy.

 
Zach (Rank 31st)
Posts 292
Thanks 64
Joined 2 Mar '11

Peter W Frey wrote:

Financial data are only provided to data analysts within a business relationship in which strong confidentiality agreements are in place. However, the privacy issue still remains. The financial institution that provides the data has solid commitments to protect the privacy of its customers and that includes privacy from third parties. Stripping the data of all obvious identification information is generally considered an adequate precaution to protect the customers' privacy.

This is a different situation from the Heritage Prize, where the data is posted on a public website that any malicious person can access.

Furthermore, with a 3 million dollar prize at stake, the incentive to "cheat" through data de-anonymization is much greater than in the case of a contractor with an ongoing business relationship with a financial institution.

Need I remind you what happened to the Netflix Prize 2?

 

Thanked by Chris Raimondi
 
Peter W Frey (Rank 13th)
Posts 19
Thanks 3
Joined 7 Aug '10

If the HHP sponsors had addressed privacy by withholding the standard identity information on each record, I fail to see how a malicious person could do significant damage.   The perpetrator would have to have access to medical records for a significant number of patients.  These records could then be compared with records in the database.   Since the malicious person would not know whether the known patients were in the HHP database, he or she could not guarantee that a high degree of similarity actually indicated a match, especially with a large database.  To have 100% confidence in a match, the malicious person would have to have medical information on each known person from another source that includes most of the variables in the HHP database.   In this case, the privacy of the patient’s information would already have been compromised.  Am I missing something?

 
Signipinnis
Posts 94
Thanks 25
Joined 8 Apr '11

Peter W Frey wrote:

Am I missing something?

Yes.

You keep looking at this issue while wearing your shoes.

It looks different if one is standing in Kaggle's/Heritage Provider Network's shoes.

Much as I regret it, that is reality. We could do better, if we had more to work with, no question whatsoever about that. But we must make do with what we are given.

 

 
Chris Raimondi (Rank 38th)
Posts 194
Thanks 90
Joined 9 Jul '10

Am I missing something?

Yes - lawyers, Class Action, Summary Judgement :)

Right now it seems close to impossible for me to know anything about a specific patient. If I spent a whole lot of time I could probably figure out stuff about the bigger hospitals, but not about the patients - there simply isn't enough data.

People were able to deanonymize the data in the Netflix prize. Granted - these people had basically blogged about their entries or put them on IMDB if I recall correctly, but there were specific titles and people were able to make a connection.

All it takes is one or two people and they can file a lawsuit. It isn't a matter of who is right - or who is wrong. Everyone (especially those who haven't been in this situation) acts all tough and says they will fight things, but as a practical matter - if they can survive summary judgement - you are screwed. I am not sure of the exact costs, but it would be manageable to fight up to summary judgement - after that - it gets ridiculous.

IMHO - None of the theories people have put forth matter.  They matter from a logical standpoint on the way things SHOULD be, but none of that matters.  All they need is someone who can make a claim that would survive summary judgement. 

Me suing them because I want them to give the prize to the person with the most submissions would not make it past summary judgement. You could (in theory) pay a lawyer a few thousand dollars and it would go away.

Someone suing them for accidentally disclosing their genital warts WOULD most likely survive past summary judgement. It doesn't matter if it would be almost impossible for someone to link it without knowing x,y,z. If THEY know how to link it -- or can provide a theory of how it can be linked -- that is a question of fact -- you are screwed. They will probably end up getting a class certified. There is a cottage industry of lawyers who file these suits strategically - not because they think they can win, but because they know they can spend x and cost you 100 * x - and basically you are forced to settle.

To have 100% confidence in a match, the malicious person...

In this case the person I am concerned about is the lawyer.

I don't think you are going to convince their lawyers any differently.  Lawyers want to play it safe.

I agree they could take more risk. They could release more data, but where is the line - I don't know. But I know it isn't where we are now - and I don't think anyone in their right mind would either. I am disappointed as well - I had all kinds of theories of what I was going to do with the prescription data and such. I don't like it, but I understand it.

ps.  I have nothing against lawyers - and even class action lawsuits.

Just lawyers who file suits solely to extort money - same thing with patent trolls.

off soap box for now....

 

Thanked by Zach
 
LuckyLindy
Posts 3
Thanks 1
Joined 14 Jul '11

As much as I'd love additional data, I agree that it would jeopardize privacy. For example, let's say they released actual drugs administered (something that would be EXTREMELY helpful to us). A pharmacy tech/assistant at a place like Walgreens could easily write down the people with the top 100 oddball drug combinations and determine which people matched those on the list. That tech would then be a phone call away from cashing in on an anti-privacy lawsuit (and losing his/her job, but still, it could happen).
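A rough sketch of why oddball combinations are the risk: any record whose drug combination appears only once in the release can be pinned to one person by someone who already knows that person's prescriptions. The records, member ids and drug names below are all hypothetical; nothing here comes from the HHP data.

```python
from collections import Counter

# Hypothetical released records: (member id, set of drugs administered).
records = [
    ("m1", {"drugA", "drugB"}),
    ("m2", {"drugA", "drugB"}),
    ("m3", {"drugA", "drugC", "drugD"}),
    ("m4", {"drugE"}),
    ("m5", {"drugE"}),
]

def unique_combinations(rows):
    """Return member ids whose exact drug combination occurs only once."""
    counts = Counter(frozenset(drugs) for _, drugs in rows)
    return [member for member, drugs in rows
            if counts[frozenset(drugs)] == 1]

exposed = unique_combinations(records)
print(exposed)
```

In this toy set only the three-drug combination is unique, so only that member would be exposed; common combinations hide in the crowd.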

This prize is an awesome opportunity for us, and we're all on the exact same playing field. Sure, as analysts we'd love more info, but finding partial answers from insufficient data is a fact of life.

 
Peter W Frey (Rank 13th)
Posts 19
Thanks 3
Joined 7 Aug '10

We live in a weird society in which legal entanglements with precious little merit stand in the way of applying technology advancements that could improve health care, save lives and reduce health care costs.  Isn’t this a bit like the tail wagging the dog?  The issue goes beyond whether the contest is fair or whether the sponsors are optimizing the benefits of having a contest.    Being able to detect adverse drug interactions and to provide early warning of serious health complications would benefit millions of people.  Instead, useful technology is being held hostage to the greed of a few unscrupulous lawyers.  Our elected representatives, who routinely complain about out-of-control health care costs, should enact a legislative solution to this problem.

 
Zarko
Posts 1
Joined 28 Jun '11

It looks to me like a waste of $3M. Somebody will win, by skill or by chance. I am sure that there is a correlation between the number of prescriptions and the number of days spent in hospital. But what is the actionable value of such a model? Pay fewer claims, approve fewer prescriptions and you reduce the cost by "optimizing" the number of prescriptions per sex/age? If it turns out that the group with the fewest visits is the one with two prescriptions, will the third one be denied and the healthy be asked to pick up two drugs, any two drugs?

 
Sarkis
Posts 41
Thanks 5
Joined 5 Apr '11

Zarko wrote:

It looks to me like a waste of $3M. Somebody will win, by skill or by chance. I am sure that there is a correlation between the number of prescriptions and the number of days spent in hospital. But what is the actionable value of such a model? Pay fewer claims, approve fewer prescriptions and you reduce the cost by "optimizing" the number of prescriptions per sex/age? If it turns out that the group with the fewest visits is the one with two prescriptions, will the third one be denied and the healthy be asked to pick up two drugs, any two drugs?

Thank you for the message. I'd like to make a couple of notes about this. First, it looks like it is going to be $500k instead of $3M. Second, according to my conservative calculations, if one were to hire a team of experts to work on this problem, it would cost more than $5M to come up with a similar solution:

20 ($ per hour) x 10 (hours per week) x 4 (weeks per month) x 24 (months) x 300 (teams) = $5,760,000.
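For anyone who wants to vary the assumptions, the estimate above can be reproduced with a few lines. The variable names are mine; the figures are the ones in the calculation.

```python
# Figures from the estimate above; change any of them to test other scenarios.
hourly_rate = 20        # dollars per hour
hours_per_week = 10
weeks_per_month = 4
months = 24
teams = 300

total = hourly_rate * hours_per_week * weeks_per_month * months * teams
print(f"${total:,}")
```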

Ultimately it is for HPN to decide the utility of the final solution. I don't expect to win anything, but for me personally, it's a win-win situation; I'm learning new things and enjoying the competition. Cheers!

Thanked by FrogEater
 
