So what are your initial impressions of the dataset (in its partially released state)?
For me, the major problem I see is trying to build a coherent model using many variables, when so much data is blank or unknown or downright nonsense.
Just looking at ClaimsY1, what are we to assume where paydelay is blank? Has it not yet been paid? Has the claim been denied outright? These missing data might play a part in the final result (off the top of my head, if a patient makes many claims and has many denied, he/she may be suspected of being a hypochondriac and so be less likely to be admitted) ... but with an empty value, and no indication of what it represents (or rather fails to represent), it's not much use, is it?
LengthOfStay is another one with mostly blank values, and while the consensus view is that a blank value means the patient did not stay at all, it is again not clear.
Are we going to get an official statement on what a NULL value implies in each column of these tables?
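For anyone who wants to put a number on how blank these columns are, here is a minimal sketch using only the Python standard library. Only the column names paydelay and LengthOfStay come from the tables discussed above; the MemberID column, the sample rows and the helper name blank_fraction are my own assumptions for illustration, not the real data or an official tool.

```python
import csv
import io

# Hypothetical sample rows. Only the column names paydelay and LengthOfStay
# come from the discussion; MemberID and all values are made up.
SAMPLE = """MemberID,paydelay,LengthOfStay
1,30,
2,,
3,45,2 days
4,,
"""

def blank_fraction(rows, columns):
    """Return {column: fraction of rows whose value is empty or whitespace}."""
    rows = list(rows)
    total = len(rows)
    result = {}
    for col in columns:
        blanks = sum(1 for r in rows if not (r.get(col) or "").strip())
        result[col] = blanks / total if total else 0.0
    return result

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
fractions = blank_fraction(rows, ["paydelay", "LengthOfStay"])
for col, frac in fractions.items():
    print(f"{col}: {frac:.0%} blank")
```

This only tells you how much is missing, not what a blank means. That part still needs an answer from the organizers.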
There are so many nonsensical data points, I suspect this really is going to be a crapshoot.
On TWO separate occasions, patient 911633904 spent 3 DAYS in an Ambulance? WTF?
There are instances of 0-9 year old boys being pregnant? And I'm sure many other strange aspects will come to light.
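Oddities like these can be hunted down systematically with a handful of sanity rules. The following is a rough sketch under stated assumptions: the record layout, the field names (place, stay_days, age, sex, pregnant) and the two rules are hypothetical and would need mapping onto the real table columns.

```python
# Hypothetical records; field names and values are assumptions for illustration.
RECORDS = [
    {"id": "A", "place": "Ambulance", "stay_days": 3,
     "age": "0-9", "sex": "M", "pregnant": False},
    {"id": "B", "place": "Inpatient Hospital", "stay_days": 3,
     "age": "30-39", "sex": "F", "pregnant": True},
    {"id": "C", "place": "Office", "stay_days": 0,
     "age": "0-9", "sex": "M", "pregnant": True},
]

# Each rule maps a readable name to a predicate that is True for a suspect row.
RULES = {
    "multi-day ambulance stay":
        lambda r: r["place"] == "Ambulance" and r["stay_days"] > 1,
    "pregnant male":
        lambda r: r["sex"] == "M" and r["pregnant"],
}

def flag_implausible(records, rules):
    """Return (record id, rule name) for every rule a record trips."""
    return [
        (r["id"], name)
        for r in records
        for name, check in rules.items()
        if check(r)
    ]

flags = flag_implausible(RECORDS, RULES)
for record_id, reason in flags:
    print(record_id, reason)
```

Keeping the rules in a dictionary makes it cheap to add a new check each time another strange case turns up.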
Seriously, I understand the need for randomizing and anonymizing the data, but unless they have some way to unrandomize it afterwards, any algorithms we create will have no real-world application.
This project is all about finding correlations and links between historical conditions, claims, medicines etc ... if the data is garbage, the result will be overfitted garbage suited for this dataset only and no other.
So far, I'm very disappointed ... I competed in the Netflix Prize, and IMHO that dataset was of far higher quality.