1. First, I would like to thank Dr. Richard Merkin for posing such an interesting task to the Kaggle community. It takes wisdom and foresight to set up a challenge this complex with a prize fund of 3 million dollars. I am sure it was exactly this "3 × Nobel Prize" sum that drew such a huge number of experts from all over the world to the problem.
2. I also want to thank Anthony Goldbloom and the whole Kaggle team for building kaggle.com, the site around which a community of experts in Data Mining and Artificial Intelligence has gathered. That's great!
3. I want to congratulate the participants who made it into the group of "potential winners". I hope we will soon see an original solution that enriches our professional knowledge.
4. Now to the main point. Surprisingly, in two years not one participant beat the threshold of 0.4 that the organizer set for the prediction error of the target variable DaysInHospital. In other words, more than 5 thousand Data Mining experts failed to solve the problem in two years. That does not mean we are weak or incapable of solving problems like this. Absolutely not. I believe the task was not posed quite correctly.
5. Let me explain why. Take any one participant (patient) from the Target data set. We are asked to use that participant's medical indicators in the current year to predict the value of the target variable in the next year. For this we can use the data of one year and, in part, the data of the previous year. You have probably guessed that this is a Markov chain with a time step of one year. So we have a very short Markov chain of three links: Y1, Y2 and Y3. From these we must build a model and learn to predict DaysInHospital_Y3. After that we must predict the target variable DaysInHospital_Y4. This is the limiting case of a problem setup, which could be called "trained on one step, predicted the next", and it is ineffective both for the customer and for those solving the problem.
Another feature of the task is that, of all the participants in Y2 for whom the DaysInHospital_Y3 prediction model was built, only about 70% are present in the Target sample. The remaining participants, judging by their identification numbers, are new to the model. This raises a question: do these 30% of new participants carry properties that were absent from the original Y2 sample, and how will that affect prediction accuracy?
What success there has been comes only from the fact that we have not one Markov chain to predict but a whole set of such individual chains. We obtain a solution only by searching for regularities in the relations between object properties within a single year.
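The overlap question above can be checked directly once the two lists of identification numbers are at hand. Below is a minimal Python sketch under one reading of the ~70% figure (the share of Target participants who were also present in Y2). The function name `share_seen` and the ID values are invented for illustration and do not come from the competition data.

```python
def share_seen(train_ids, target_ids):
    """Fraction of distinct target-sample IDs that also occur in the training IDs."""
    train = set(train_ids)
    target = set(target_ids)
    if not target:
        raise ValueError("target_ids is empty")
    return len(target & train) / len(target)


# Toy identification numbers, invented for illustration only.
y2_ids = [101, 102, 103, 104, 105, 106, 107]
target_ids = [101, 102, 103, 104, 105, 106, 107, 201, 202, 203]

print(f"{share_seen(y2_ids, target_ids):.0%} of Target participants were seen in Y2")
```

Whatever share is found, the participants outside the overlap are the ones whose properties the model has never seen.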
6. I propose changing the problem definition: instead of a step of 1 year, use a step of 3 months. This would cost the organizers nothing extra, but would allow the problem to be solved with high accuracy. The following conditions would have to be met:
a) The source data for years Y2 and Y3 should each be split by quarter into 4 parts and given to fully registered competition participants from the Kaggle community for analysis and for building a prediction model. This gives 8 samples:
Claims_Y2_1, Claims_Y2_2, Claims_Y2_3, Claims_Y2_4,
Claims_Y3_1, Claims_Y3_2, Claims_Y3_3, Claims_Y3_4
Having this data would let participants build a strong model and reach high prediction accuracy.
b) The source data for the first 9 months of year Y4 should likewise be split by quarter into 3 parts. Each part should then be split at random into two samples in a 30:70 proportion. It is extremely important that objects are assigned to the samples at random. The first sample (30%) should be given to participants for improving and adapting the developed model, and the second sample (70%) should be used for testing the results. Participants would then receive three samples
Claims_Y4_30_1, Claims_Y4_30_2, Claims_Y4_30_3
for improving and adapting the developed model, and three more samples
Target_Y4_70_1, Target_Y4_70_2, Target_Y4_70_3
for which the missing target variable has to be predicted.
Based on the results of the competition, a group of participants with the best results should be formed. It should include those participants for whom
- all three prediction errors, on Target_Y4_70_1, Target_Y4_70_2 and Target_Y4_70_3, are below 0.4,
- and the average of these three errors is lower than that of the other participants.
To determine the winner, 100% of the data for the last quarter of Y4 should be used. The participant who again achieves a prediction error below 0.4, and whose error is the smallest to the 6th decimal place, is the winner. In the very unlikely event that two participants' errors coincide to the 6th decimal place, the 7th, 8th and further decimal places should be compared. If the results still coincide completely (which again is unlikely), the participants are considered equal winners.
c) Indicators that carry no meaning for this task, namely ProviderId and Vendor, should be excluded from the source data.
d) It is desirable to add new attributes relevant to this task to the data, if any are available.
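To make conditions a) and b) and the winner rule concrete, here is a minimal Python sketch of the proposed procedure. It is only an illustration under assumptions of my own: each claim record is taken to carry a hypothetical `month` field (1 to 12), the 30:70 split is done over participant IDs, and all function names are invented here, not part of the competition.

```python
import random
from collections import defaultdict


def split_by_quarter(claims):
    """Group claim records into quarters 1..4 using a hypothetical 'month' field (1-12)."""
    quarters = defaultdict(list)
    for claim in claims:
        quarter = (claim["month"] - 1) // 3 + 1
        quarters[quarter].append(claim)
    return dict(quarters)


def split_30_70(member_ids, seed=0):
    """Randomly assign 30% of the distinct IDs to the adaptation sample, the rest to the test sample."""
    ids = sorted(set(member_ids))
    random.Random(seed).shuffle(ids)
    cut = len(ids) * 3 // 10  # integer arithmetic keeps the 30% cut exact
    return ids[:cut], ids[cut:]


def finalists(quarter_errors, threshold=0.4):
    """Names with all three quarterly errors below the threshold, best average first."""
    averages = {
        name: sum(errors) / len(errors)
        for name, errors in quarter_errors.items()
        if all(e < threshold for e in errors)
    }
    return sorted(averages, key=averages.get)


def winners(final_errors, threshold=0.4):
    """Names with the smallest final-quarter error below the threshold.

    Errors are compared at full precision, which covers the 6th, 7th, 8th, ...
    decimal places; an exact tie returns several equal winners.
    """
    eligible = {name: e for name, e in final_errors.items() if e < threshold}
    if not eligible:
        return []
    best = min(eligible.values())
    return sorted(name for name, e in eligible.items() if e == best)
```

For example, `finalists({"a": [0.39, 0.38, 0.37], "b": [0.41, 0.30, 0.30]})` keeps only `"a"`, because `"b"` exceeds 0.4 in one quarter even though its average is lower.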
7. Given the above, I appeal to the organizers of the Heritage Health Prize Competition, and specifically to Dr. Richard Merkin, founder and CEO of Heritage Provider Network, with a proposal to hold a second stage, Heritage Health Prize Competition-2, lasting 6 months on the conditions proposed above, with a prize fund equal to the original one less the consolation prize and other associated expenses.
8. I ask all participants who would like to keep competing to support my appeal to the organizers.
In conclusion, I want to note once again the success of our colleagues who made it into the group of the best. They did well and achieved the best results possible under this problem definition.
However, let us think about what a prediction error of 0.435583 means for the customer.
Taking into account the relative frequencies of the various DaysInHospital_Y2 and DaysInHospital_Y3 values, we can artificially generate an example prediction of DaysInHospital_Y4 values that produces the following prediction errors:
9670 (1), 614 (2), 438 (3), 323 (4), 175 (5), 143 (6), 124 (7), 112 (8), 104 (9), 101 (10), 67 (11), 52 (12), 40 (13), 41 (14), 210 (15)
Here 9670 (1) means that there were 9670 cases in which the actual and predicted values differed by 1 day; similarly, 614 (2) means that there were 614 cases in which they differed by 2 days, and so on.
Here we assume that actual DaysInHospital_Y4 values of 1, 2, … 15 are wrongly predicted as 0 and, correspondingly, the same number of zeros are replaced by the values 1, 2, … 15. These are more serious mistakes than the other options. If we compute the error using the formula given in the competition rules, we get 0.4334, i.e. slightly better than the best result. For the customer, these values mean the following:
a) The total number of wrongly predicted days for this hypothetical solution with a prediction error of 0.4334 is 24,552 hospital-days. (And that is only for the 30% of participants provided for analysis.)
9670*1 + 614*2 + ... + 210*15 = 24552
How much does an error of 1 hospital-day cost, and is it necessary to obtain a better solution?
b) The error in the predicted DaysInHospital_Y4, averaged per participant over the group of 70,942, is 0.346085, i.e. on average the predictions for any three objects will together contain an error of about 1 hospital-day.
c) Besides, what does a prediction that a given object will spend 15 days in hospital next year mean for the customer? What actions can be planned without knowing the month in which this event will partly or wholly occur? In a model with a quarterly forecast this uncertainty is reduced by at least a factor of 4.
d) I am not sure that prediction accuracy will hold up under this problem definition in the future, for example for DaysInHospital_Y5…
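The figures in points a) and b) can be checked with a short Python sketch. It rests on two assumptions: that the competition formula is the root mean squared logarithmic error, sqrt(mean((ln(p + 1) - ln(a + 1))^2)), and that the mean is taken over the group of 70,942 from point b). The function names are invented for illustration.

```python
import math

# Error counts listed above: {difference in days: number of cases}.
error_counts = {
    1: 9670, 2: 614, 3: 438, 4: 323, 5: 175,
    6: 143, 7: 124, 8: 112, 9: 104, 10: 101,
    11: 67, 12: 52, 13: 40, 14: 41, 15: 210,
}
N_PARTICIPANTS = 70942  # group size from point b); assumed to be the denominator


def total_days(counts):
    """Total number of wrongly predicted hospital-days."""
    return sum(days * cases for days, cases in counts.items())


def rmsle_zero_swaps(counts, n):
    """RMSLE when every error of size d is a 0 confused with d (in either direction).

    Each such case contributes ln(d + 1) ** 2; all other cases contribute 0.
    """
    squared = sum(cases * math.log1p(days) ** 2 for days, cases in counts.items())
    return math.sqrt(squared / n)


days = total_days(error_counts)
print(days)                   # total hospital-days in error
print(days / N_PARTICIPANTS)  # average error per participant
print(rmsle_zero_swaps(error_counts, N_PARTICIPANTS))
```

Under these assumptions the script gives 24552 hospital-days, an average of about 0.3461 per participant, and an error of about 0.4334, in agreement with the values quoted above.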
Yours faithfully,
ericgrig