More data would be cool n' stuff

Aeoliana's image
Posts 17
Thanks 15
Joined 4 Apr '11

So I am finding there are 24,327 unique condition combinations in the data, and there are ~77k points of data for Days in Hospital... Best case, that's like a whopping 3 data points for any given set of conditions. (In actuality it's a lot of single entries, with the more common condition combinations reaching as many as 4,145.)

So I know it is unlikely, but is there any way we could get like... 50 times as much info as we have here? :) Kind of hard to make any statistically significant observations with only one point of data, even if we may get 2 more with Y2/3 data.
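
For anyone wanting to reproduce this kind of count, here is a rough sketch. The member IDs and condition labels are made-up toy data, not the actual competition columns, so treat it as an illustration of the tally rather than the exact query used above.

```python
from collections import Counter

# Hypothetical toy records: member id -> set of condition labels.
# These names are illustrative only, not the real competition columns.
members = {
    1: {"DIABETES", "HYPERTENSION"},
    2: {"DIABETES", "HYPERTENSION"},
    3: {"ASTHMA"},
    4: {"ASTHMA", "DIABETES"},
    5: {"DIABETES", "HYPERTENSION"},
}


def combination_counts(members):
    """Count how many members share each exact condition combination."""
    return Counter(frozenset(conds) for conds in members.values())


counts = combination_counts(members)
n_combos = len(counts)                       # distinct combinations
mean_per_combo = len(members) / n_combos     # best-case average support
singletons = sum(1 for c in counts.values() if c == 1)
largest = max(counts.values())               # most common combination
```

With the figures quoted above, ~77k rows over 24,327 combinations works out to a little over 3 rows per combination on average, and the singleton count shows how far the real distribution is from that best case.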

 
Justin Washtell's image
Posts 48
Thanks 15
Joined 26 Aug '10

Well, the conditions have already been generalized massively (and somebody else was complaining about *that*). So perhaps you need to try and find useful ways to generalize them further? By the way, I do agree, the data is a nightmare.

 
Michael Benjamin's image
Posts 4
Thanks 1
Joined 30 Mar '11
GIGO, as they say
 
Anthony Goldbloom (Kaggle)'s image
Anthony Goldbloom (Kaggle)
Competition Admin
Kaggle Admin
Posts 383
Thanks 73
Joined 20 Jan '10
From Kaggle
You will get some procedure code information in the May 4 release. I understand the frustration but data privacy is a priority for HPN.
 
Justin Washtell's image
Posts 48
Thanks 15
Joined 26 Aug '10

Michael Benjamin wrote:

GIGO, as they say

Likewise this applies to assumptions, attitudes and approaches... even in the face of good data.
There's presently no foolproof methodology for separating the two scenarios.

 
Aeoliana's image
Posts 17
Thanks 15
Joined 4 Apr '11
All I am saying is that there are some members whose profiles are so unique they cannot be matched to anyone else's with any degree of confidence. For example, I get an RMSE of .505112 with a certain model, but when I also incorporate, for comparison, members who have only one additional condition that the member I am considering does not have, it jumps to ~.52. But if I don't consider members+1, in some cases my algorithm cannot generate a confident estimate because of the minimal data. I will hold out for the next set of data; labs etc. will no doubt provide a great level of detail.
 
Zach's image
Rank 9th
Posts 360
Thanks 91
Joined 2 Mar '11

Aeoliana wrote:

All I am saying is that there are some members whose profiles are so unique they cannot be matched to anyone else's with any degree of confidence. For example, I get an RMSE of .505112 with a certain model, but when I also incorporate, for comparison, members who have only one additional condition that the member I am considering does not have, it jumps to ~.52. But if I don't consider members+1, in some cases my algorithm cannot generate a confident estimate because of the minimal data. I will hold out for the next set of data; labs etc. will no doubt provide a great level of detail.

Maybe you need a more general model for members with very unique profiles. Or a way of generalizing the categorical profiles into some kind of continuous or ordinal variable.
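
One possible reading of "generalizing the categorical profiles into a continuous variable" is a smoothed per-condition score averaged over a member's conditions. This is only a sketch of that idea: the function names, the `prior_weight` parameter, and the toy rows are all assumptions for illustration, not anything from the competition data or from a specific poster's model.

```python
from collections import defaultdict


def fit_condition_scores(rows, prior_weight=5.0):
    """Smoothed mean outcome per individual condition.

    rows: iterable of (conditions, days_in_hospital) pairs.
    prior_weight: hypothetical smoothing strength that pulls rarely
    seen conditions toward the global mean.
    """
    rows = list(rows)
    global_mean = sum(days for _, days in rows) / len(rows)
    total = defaultdict(float)
    count = defaultdict(int)
    for conds, days in rows:
        for c in conds:
            total[c] += days
            count[c] += 1
    scores = {
        c: (total[c] + prior_weight * global_mean) / (count[c] + prior_weight)
        for c in count
    }
    return scores, global_mean


def profile_score(conds, scores, global_mean):
    """Collapse a set of conditions into one continuous number."""
    if not conds:
        return global_mean
    # Conditions never seen in training fall back to the global mean.
    return sum(scores.get(c, global_mean) for c in conds) / len(conds)


# Toy data: (set of condition labels, days in hospital next year).
rows = [({"A"}, 0), ({"A", "B"}, 2), ({"B"}, 4), ({"C"}, 0)]
scores, global_mean = fit_condition_scores(rows, prior_weight=1.0)
rare_profile = profile_score({"A", "B"}, scores, global_mean)
unseen_profile = profile_score({"Z"}, scores, global_mean)
```

The appeal is that a profile nobody else shares still gets a usable number, because it is built from conditions that many members do share.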

 
Justin Washtell's image
Posts 48
Thanks 15
Joined 26 Aug '10

@Aeoliana: I more-or-less understand you. I assume you have considered using several degrees of generalisation/aggregation simultaneously, or "backing off" to a slightly more general model when there is not enough matching fine-grained evidence for a given case?

 
Aeoliana's image
Posts 17
Thanks 15
Joined 4 Apr '11

Yea, it starts at the highest level of specificity and falls through to more generalized constraints until it finds usable information.

So like if the member doesn't match up, the first thing it drops is the gender constraint and so on.
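
A minimal sketch of that kind of fall-through, for readers following along. Only "gender is dropped first" comes from the post; the other field names, the rest of the drop order, and the `min_support` threshold are assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical drop order, most expendable constraint first. Only the
# "gender goes first" part is from the thread; the rest is assumed.
DROP_ORDER = ["gender", "age_band", "conditions"]


def build_tables(records):
    """One lookup table per back-off level; level i ignores DROP_ORDER[:i]."""
    tables = []
    for level in range(len(DROP_ORDER) + 1):
        kept = DROP_ORDER[level:]
        table = defaultdict(list)
        for rec in records:
            table[tuple(rec[f] for f in kept)].append(rec["days"])
        tables.append((kept, table))
    return tables


def predict(member, tables, min_support=2):
    """Return (mean days, level used), relaxing constraints until
    at least min_support matching records are found."""
    for level, (kept, table) in enumerate(tables):
        matches = table.get(tuple(member[f] for f in kept), [])
        if len(matches) >= min_support:
            return sum(matches) / len(matches), level
    # Fewer than min_support records overall: use whatever exists.
    everything = tables[-1][1].get((), [])
    if not everything:
        raise ValueError("no training records available")
    return sum(everything) / len(everything), len(tables) - 1


def rec(gender, age_band, conditions, days=None):
    return {"gender": gender, "age_band": age_band,
            "conditions": frozenset(conditions), "days": days}


records = [
    rec("F", "50-59", {"DIABETES"}, 2),
    rec("F", "50-59", {"DIABETES"}, 0),
    rec("M", "50-59", {"ASTHMA"}, 1),
    rec("F", "50-59", {"ASTHMA"}, 3),
    rec("M", "20-29", {"ASTHMA"}, 5),
]
tables = build_tables(records)
exact = predict(rec("F", "50-59", {"DIABETES"}), tables)
no_gender = predict(rec("M", "50-59", {"ASTHMA"}), tables)
cond_only = predict(rec("M", "20-29", {"ASTHMA"}), tables)
fallback = predict(rec("F", "30-39", {"GOUT"}), tables)
```

Returning the level alongside the estimate makes it easy to check how often the model is running on the fully specific match versus a relaxed one.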

 
Justin Washtell's image
Posts 48
Thanks 15
Joined 26 Aug '10

@Aeoliana Ah, sounds good! Despite all the talk, I've not yet come up with anything groundbreaking in my corner. Lots of cool ideas. Very average results. Keep chipping away I guess!

 
Signipinnis's image
Posts 95
Thanks 25
Joined 8 Apr '11

Anthony Goldbloom wrote:

You will get some procedure code information in the May 4 release. I understand the frustration but data privacy is a priority for HPN.

And understandably so.

But consider this re the Condition codes, aka categorical roll-ups of diagnosis codes .... which I will pointedly note are NOT included in the HIPAA proscribed list of PHI (Protected Health Information) data elements .... there's an implicit assumption that some combination of Dx codes potentially makes a unique "Dx-fingerprint" that can be traced back to an individual. And based on that possibility, 90% of the information content that's available via the Dx codes has been withheld.

But think about the fingerprint metaphor .... having a fingerprint from a perp does you no good ... unless you have a master database of fingerprints to check your specimen against.

And who has such a master database of dx-fingerprints ??? Nobody legitimately, other than people who already work with the exact data as a part of their jobs. And those people are legally prohibited from using that access to the data for this purpose.

Well, some might be tempted to breach that wall of trust, you say.

True .... but here's the point ... if they're vulnerable to that temptation .... they ALREADY HAVE illegitimate/unethical access to the source data. The Kaggle-ified version gives them NOTHING THEY DON'T ALREADY HAVE.

So for a mythical vulnerability, we're compromising the prospects that contestants ... and thus also the sponsor Heritage ... will be able to get truly useful insights out of the data.

Furthermore, I can't think of any reason to believe that Dx codes will be any more likely to create unique fingerprints than procedure codes ... which apparently will be released. Or pharmaceutical information, which supposedly will be released to some degree. Or lab results, ditto.

Please join me in the chant of liberation:

"Free the diagnosis codes !"

"Free the diagnosis codes!"

"Free the diagnosis codes!"

:)

 
FrogEater's image
Posts 3
Thanks 1
Joined 18 Feb '11

A jug fills drop by drop - Buddha

 
Zach's image
Rank 9th
Posts 360
Thanks 91
Joined 2 Mar '11

Signipinnis wrote:

But think about the fingerprint metaphor .... having a fingerprint from a perp does you no good ... unless you have a master database of fingerprints to check your specimen against.

I will point out (again) that the second Netflix prize was cancelled because some enterprising individuals (without access to Netflix's database) were able to de-anonymize the database and identify actual people along with their movie renting habits.  Netflix was sued, which is generally considered to be a Bad Thing.


Now, imagine that this happened with a dataset of medical records...

 
Signipinnis's image
Posts 95
Thanks 25
Joined 8 Apr '11

Zach wrote:

I will point out (again) that the second Netflix prize was cancelled because some enterprising individuals (without access to Netflix's database) were able to de-anonymize the database and identify actual people along with their movie renting habits.  Netflix was sued, which is generally considered to be a Bad Thing.


Now, imagine that this happened with a dataset of medical records...

The Netflix data had exact dates. The IMDB database of reviews had exact dates. When you have an obscure title, and an exact date in two large databases to also compare as a key, finding a small number of hits isn't unexpected. (I'm not saying necessarily that the number of hits was small, merely that finding only a small number of matches was necessary to prove the point that the linkage was possible.)

I don't recall how the researchers in the Netflix case established that they had in fact positively made a correct linkage, unless they contacted the people by email and asked "Are you the person who rented 'Pitfalls of being a Vegetarian in the Piranha Pool' on April 26, 2005, and posted a review to IMDB on April 30th of that same year?"

Assume for the sake of argument that there's a place on the internet where large numbers of people have posted information about their medical condition that is so voluminous and detailed that it's comparable to the IMDB ratings database. I don't think there is such a thing, but maybe there is.

Without dates, the "dx-fingerprint" is very fuzzy.

That being the case, the "confirmation" becomes "Are you a 50-59 year old female who had the following ICD-9 diagnosis codes in a recent year .... [list of codes, which mean nothing to most people] ... and saw your PCP three times during the year, and a podiatrist once?"

(A) It's unknowable if there is ONLY ONE person who fits the criteria,

(B) It takes a voluntary contribution from the probable matchee to "confirm" that match ... the person can say "No" or "Not Interested, Go Away" and the anonymity is preserved,

(C) any "researchers" who attempted to match such linkages, after agreeing not to, would be potentially liable for the financial impact on the organizations running and sponsoring the contest, and likely liable under federal law for HIPAA violations.

So while I am aware of the Netflix contest precedent, there are differences that are .... arguably ... different.

Can you make a coherent argument why diagnosis data should be withheld, but procedure, pharmaceutical and lab data should be made available during a later phase of data release ?

I think it's hard to make such a distinction.
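
The point above about dates can be checked on any table by measuring how many records are unique on a given set of fields, with and without the date column. The sketch below uses invented toy records and hypothetical field names; it says nothing about the real data, only how the comparison could be run.

```python
from collections import Counter


def unique_fraction(records, fields):
    """Share of records whose values on `fields` match no other record."""
    keys = [tuple(r[f] for f in fields) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(keys)


# Toy records with made-up values; field names are assumptions.
records = [
    {"age_band": "50-59", "sex": "F", "dx": "250", "date": "2005-04-26"},
    {"age_band": "50-59", "sex": "F", "dx": "250", "date": "2005-06-02"},
    {"age_band": "50-59", "sex": "F", "dx": "401", "date": "2005-04-26"},
    {"age_band": "60-69", "sex": "M", "dx": "401", "date": "2005-01-15"},
]

with_dates = unique_fraction(records, ["age_band", "sex", "dx", "date"])
without_dates = unique_fraction(records, ["age_band", "sex", "dx"])
```

A record that is unique inside the released table is still not necessarily unique in the population, which is the caveat in (A) above.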

 
Zach's image
Rank 9th
Posts 360
Thanks 91
Joined 2 Mar '11

Signipinnis wrote:
Can you make a coherent argument why diagnosis data should be withheld, but procedure, pharmaceutical and lab data should be made available during a later phase of data release ?

I think it's hard to make such a distinction.

No I can't, but I'm sure kaggle's lawyers can, and have.  I guess we should ask Anthony to chime in here.

 
Aeoliana's image
Posts 17
Thanks 15
Joined 4 Apr '11

If a corporation wants to find which employees are high-risk for future health problems, they'll know roughly what a person was sick with, when they were sick, and how long they were out. They won't know what specific labs or prescriptions they had ordered. Just one example...

Just to be clear, I don't think that more specific data is needed, just a greater volume of it.

 
Zach's image
Rank 9th
Posts 360
Thanks 91
Joined 2 Mar '11

Aeoliana wrote:

If a corporation wants to find which employees are high-risk for future health problems, they'll know roughly what a person was sick with, when they were sick, and how long they were out. They won't know what specific labs or prescriptions they had ordered. Just one example...

Just to be clear, I don't think that more specific data is needed, just a greater volume of it.

The host of this contest is a healthcare provider who is interested in decreasing member hospitalization rates.

 
Aeoliana's image
Posts 17
Thanks 15
Joined 4 Apr '11

I was giving a reason for not disseminating more detailed diagnoses....

The more detailed the diagnosis, the easier it is to find anonymous 'members' that match profiles of your employees. Whereas your employer doesn't really have access to the procedures and prescriptions, so those don't matter too much to release.

Thanked by Zach
 
Zach's image
Rank 9th
Posts 360
Thanks 91
Joined 2 Mar '11

Aeoliana wrote:

I was giving a reason for not disseminating more detailed diagnoses....

The more detailed the diagnosis, the easier it is to find anonymous 'members' that match profiles of your employees. Whereas your employer doesn't really have access to the procedures and prescriptions, so those don't matter too much to release.

Oh, I'm sorry!  I misunderstood your post.  Thanks for clarifying.

Thanked by Justin Washtell
 
