Dave Mullen
Posts 8
Thanks 6
Joined 5 Aug '10

So what are your initial impressions of the dataset (in its partially released state)?

For me, the major problem I see is trying to build a coherent model using many variables, when so much data is blank or unknown or downright nonsense.

Just looking at ClaimsY1, what are we to assume where paydelay is blank? Has it not yet been paid? Has the claim been denied outright? These missing data might play a part in the final result (off the top of my head, if a patient makes many claims and has many denied, it may be suspected he/she is a hypochondriac and is less likely to be admitted) ... but with an empty value, and no indication of what it represents (or rather fails to represent), it's not much use, is it?

LengthOfStay is another one with mostly blank values, and while the consensus view is that a blank value means the patient did not stay at all, that is again not clear.

Are we going to get an official statement on what a NULL value implies in each column of these tables?
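For anyone who wants to put numbers on this, here is a minimal sketch of a blank-value audit. It assumes the claims table is a plain CSV whose blanks are empty strings; the sample rows are made up for illustration and are not taken from the real ClaimsY1 file, and `blank_fraction` is just a hypothetical helper name.

```python
import csv
import io

# Made-up sample rows for illustration only; NOT real ClaimsY1 data.
# MemberID is an assumed key column; paydelay and LengthOfStay are the
# columns discussed above.
SAMPLE_CSV = """MemberID,paydelay,LengthOfStay
1,30,
2,,
3,45,2 days
4,,
"""

def blank_fraction(csv_text):
    """Return {column: fraction of rows whose value is empty or whitespace}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []
    counts = {name: 0 for name in columns}
    total = 0
    for row in reader:
        total += 1
        for name in columns:
            # Short rows are padded with None by DictReader, so guard for that.
            if (row[name] or "").strip() == "":
                counts[name] += 1
    if total == 0:
        return {name: 0.0 for name in columns}
    return {name: counts[name] / total for name in columns}

fractions = blank_fraction(SAMPLE_CSV)
for column, fraction in fractions.items():
    print(f"{column}: {fraction:.0%} blank")
```

This only tells you how much is missing, not why. Whether a blank paydelay means "unpaid", "denied" or "not recorded" still needs an answer from the organisers.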

There are so many nonsensical data points, I suspect this really is going to be a crapshoot.

On TWO separate occasions, patient 911633904 spent 3 DAYS in an ambulance? WTF?

There are instances of 0-9 year old boys being pregnant? And I'm sure many other strange aspects will come to light.
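One way to hunt for more of these oddities is a small set of plausibility rules run over the records. The sketch below is only an illustration: the field names (`age_band`, `sex`, `condition`, `place`, `stay_days`) and the three records are hypothetical, not the real table layout, and a rule hit means "worth a second look", not "definitely wrong".

```python
# Hypothetical field names and made-up records, for illustration only.
RECORDS = [
    {"member": "A", "age_band": "0-9", "sex": "M",
     "condition": "PREGNANCY", "place": "Office", "stay_days": 0},
    {"member": "B", "age_band": "30-39", "sex": "F",
     "condition": "PREGNANCY", "place": "Inpatient Hospital", "stay_days": 2},
    {"member": "C", "age_band": "60-69", "sex": "M",
     "condition": "TRAUMA", "place": "Ambulance", "stay_days": 3},
]

def young_male_pregnancy(rec):
    """Pregnancy-coded claim for a male in the youngest age band."""
    return (rec["age_band"] == "0-9" and rec["sex"] == "M"
            and rec["condition"] == "PREGNANCY")

def multi_day_ambulance(rec):
    """Stay of more than one day where the place of service is an ambulance."""
    return rec["place"] == "Ambulance" and rec["stay_days"] > 1

RULES = {
    "young_male_pregnancy": young_male_pregnancy,
    "multi_day_ambulance": multi_day_ambulance,
}

def flag_records(records, rules):
    """Return (member, [names of rules that fired]) for each suspicious record."""
    flagged = []
    for rec in records:
        hits = [name for name, rule in rules.items() if rule(rec)]
        if hits:
            flagged.append((rec["member"], hits))
    return flagged

flags = flag_records(RECORDS, RULES)
for member, hits in flags:
    print(member, hits)
```

Keeping each rule as a separate named function makes it easy to count how often each oddity occurs before deciding whether to drop, recode or keep those rows.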

Seriously, I understand the need for randomizing and anonymizing the data, but unless they have some way to unrandomize it afterwards, any algorithms we create will have no real-world application.

This project is all about finding correlations and links between historical conditions, claims, medicines etc ... if the data is garbage, the result will be overfitted garbage suited for this dataset only and no other.

So far, I'm very disappointed ... I competed on Netflix Prize, and IMHO the dataset was far higher quality.

 
trog
Posts 3
Thanks 1
Joined 9 Nov '10

I was thinking the same thing.... 

It's also bothersome that the prediction target is the number of days rather than a straightforward binary: returned / did not return. How many of us have gone to a doctor who is so paranoid about being sued that they practice defensive medicine (longer stays and more tests) instead of focusing on what's medically necessary?

Another issue: depending on the insurance provider, some will pay for an extra day or two, whereas others will not. When my wife had our first child she was allowed to stay 3 days; with our second child she was out the next day because we had different insurance.

 
Justin Washtell
Posts 48
Thanks 15
Joined 26 Aug '10

trog wrote:

... depending on insurance provider, some will pay for an extra day or two, whereas others not so much. When my wife had our first child she was allowed to stay 3 days; our second child she was out the next day because we had different insurance.

The provider/payer is part of the supplied information, so your models at least stand a chance of factoring that sort of thing out.

 
Anthony Goldbloom (Kaggle)
Competition Admin
Kaggle Admin
Posts 382
Thanks 72
Joined 20 Jan '10
From Kaggle

daveime wrote:

Seriously, I understand the need for randomizing and anonymizing the data, but unless they have some way to unrandomize it afterwards, any algorithms we create will serve no real world application.

@daveime, the data is messy not because it's been perturbed but because it's real-world data. Anonymization focused on generalizing (again, not perturbing). The nine-year-old pregnant males actually exist in the raw data.

For info, I'm told that this is one of the cleaner medical claims datasets around.

 
Phil
Posts 3
Joined 31 Mar '11

anthony.goldbloom wrote:

daveime wrote:

Seriously, I understand the need for randomizing and anonymizing the data, but unless they have some way to unrandomize it afterwards, any algorithms we create will serve no real world application.

@daveime, the data is messy not because it's been perturbed but because it's real-world data. Anonymization focused on generalizing (again, not perturbing). The nine-year-old pregnant males actually exist in the raw data.

For info, I'm told that this is one of the cleaner medical claims datasets around.

 

To be fair, I always assumed the 0-9 year old pregnancies actually referred to the newborn child (e.g. a premature birth needing additional care), so this data could be valid. But I agree there are other data issues (e.g. the even vs odd days in hospital mentioned elsewhere), which lead me to worry that we're actually trying to model the specific data acquisition techniques rather than the underlying medical situation.

 
