On why the Heritage Health Prize Competition task was not solved satisfactorily, and why the competition must continue. EricGrig ericgrig@gmail.com April 18, 2013


1. First, I would like to thank Dr. Richard Merkin for posing such an interesting task to the Kaggle community. It takes wisdom and foresight to organize a challenge of this complexity with a prize fund of 3 million dollars. I am sure it was precisely this 3*NobelPrize sum that attracted such a huge number of experts from all over the world to the problem.

2. I also want to thank Anthony Goldbloom and the whole Kaggle team for organizing the site kaggle.com, around which a community of experts in Data Mining and Artificial Intelligence has gathered. That's great!

3. I congratulate the participants who made it into the group of "potential winners". I hope we will soon see the original solutions, which will enrich our professional knowledge.

4. Now to the main point. Surprisingly, in two years not one participant crossed the threshold of 0.4 prediction accuracy for Target_DaysInHospital set by the organizer of the competition. It turns out that more than five thousand Data Mining experts could not solve the task in two years. This does not mean that we are weak or incapable of solving such problems. Not at all. I believe the task was not posed quite correctly.

5. Let me explain why. Consider any one participant from the Target data set. We are asked to predict, from a series of the participant's medical indicators in the current year, the value of a target variable in the following year; we may also partially use data from the preceding year. You have probably guessed that this is a Markov chain whose step length is one year. So we have a very short Markov chain of three links: Y1, Y2, and Y3. On this basis we must build a model and learn to predict DaysInHospital_Y3, and then predict the target variable DaysInHospital_Y4. This is a limiting case of problem formulation, which might be called "train on one step, predict the next", and it is ineffective both for the customer and for those solving the problem.

Another feature of the task is that of all the participants in Y2 for whom the DaysInHospital_Y3 prediction model was built, only about 70% are present in the Target sample. The remaining participants, judging by their identification numbers, are new to the constructed model. The question is: do these 30% of new participants carry properties that were absent from the original Y2 sample, and how does that affect prediction accuracy?

What success there is in solving the task comes only from the fact that we have not one Markov-chain object requiring a prediction, but a whole set of such individual Markov-chain objects, and the solution is obtained only by searching for regularities in the structure of the relations among object properties within a single year.

6. I propose changing the problem statement: instead of a step of one year, use a step of three months. This would require no additional expense whatsoever from the organizers of the competition, but would allow the problem to be solved with high precision. For this, the following conditions must be met:

a)     the raw data for years Y2 and Y3 should each be divided quarterly into 4 parts and provided to fully registered competition participants from the Kaggle community for research and for building a prediction model. We would thus obtain 8 samples:

            Claims_Y2_1,     Claims_Y2_2,     Claims_Y2_3,     Claims_Y2_4,

            Claims_Y3_1,     Claims_Y3_2,     Claims_Y3_3,     Claims_Y3_4

Having such data would allow participants to build a strong model and achieve high prediction accuracy.

b) the raw data for the first 9 months of year Y4 should likewise be divided quarterly into 3 parts. Each part should then be split in a 30:70 proportion, forming two samples at random. It is extremely important here that the assignment of objects to the groups be random. The first sample of 30% should be provided to participants for refining and adapting the developed model, and the second sample of 70% for testing the results achieved. Participants would thus receive three samples

            Claims_Y4_30_1,     Claims_Y4_30_2,     Claims_Y4_30_3

for refining and adapting the developed model, together with three more samples

            Target_Y4_70_1,      Target_Y4_70_2,      Target_Y4_70_3

for prediction of the absent target variable.

Based on the results of the competition, a group of participants with the best results should be formed. The group of the best should include those participants for whom:

-  all three prediction errors, on Target_Y4_70_1, Target_Y4_70_2, and Target_Y4_70_3, are less than 0.4,

-  and the average of these three estimates is lower than that of the other participants.

To determine the winner, 100% of the data for the last quarter of year Y4 should be used. The participant who again obtains a prediction error below 0.4, and whose error is the smallest to the 6th decimal place, is the winner. In the very improbable event that two participants' scores coincide exactly to the 6th decimal place, the 7th, 8th, etc. decimal places should be considered. If even then the results coincide exactly (again improbable), the participants should be declared joint winners.

c) indicators that carry no meaning for this task should be excluded from the raw data, namely ProviderId and Vendor;

d) it is desirable to add new attributes relevant to this task, if any are available.
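The random 30:70 partition described in (b) can be sketched as follows. This is a minimal illustration, not part of the proposal: the member IDs and the seed are made up, and the real data would supply actual MemberIDs.

```python
import random

def split_30_70(member_ids, seed=0):
    """Randomly partition members 30:70, as proposed for each Y4 quarter.
    The seed and the toy IDs are illustrative only."""
    rng = random.Random(seed)
    ids = list(member_ids)
    rng.shuffle(ids)  # random assignment, as the proposal requires
    cut = len(ids) * 3 // 10
    return ids[:cut], ids[cut:]  # (30% for model adaptation, 70% for testing)

adapt, test = split_30_70(range(1000))
print(len(adapt), len(test))  # 300 700
```

The shuffle-then-cut approach guarantees the two groups are disjoint and together cover every member, which is exactly the property the proposal insists on.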

7. In view of the above, I want to address the organizers of the Heritage Health Prize Competition, namely Dr. Richard Merkin, founder and CEO of the Heritage Provider Network, with a proposal to organize a second stage, Heritage Health Prize Competition-2, lasting 6 months on the conditions proposed above, with a prize fund smaller than the original by the size of the consolation prize and other associated expenses.

8. I call on all competition participants who would like to continue competing to support my appeal to the organizers.

In conclusion, I want to note once again the success of our colleagues who made it into the group of the best. They did well and achieved the best results possible under this problem statement.

But let us consider: what does a prediction accuracy of 0.435583 mean for the customer?

Taking into account the proportions in which the various DaysInHospital_Y2 and DaysInHospital_Y3 values occur, we can artificially generate an example DaysInHospital_Y4 prediction that yields the following distribution of prediction errors:

9670 (1),    614 (2),    438 (3),    323 (4),    175 (5),    143 (6),    124 (7),    112 (8),    104 (9),  101 (10),    67 (11),    52 (12),    40 (13),    41 (14),    210 (15)

9670(1) means that there were 9670 cases where the true and predicted values disagreed and the difference between them was 1 day; similarly, 614(2) means 614 cases where the difference was 2 days, and so on.

Here we assume that true DaysInHospital_Y4 values of 1, 2, …, 15 are mistakenly predicted as 0, and that the same numbers of zeros are mistakenly changed to the values 1, 2, …, 15. These are more painful mistakes than the other options. If we compute the error by the formula specified in the competition rules, we obtain 0.4334, i.e. slightly better than the best result. For the customer, these values mean the following:

a)     the total number of mistakenly predicted days for such a hypothetical solution, with a prediction error of 0.4334, comes to 24 552 hospital-days (and that is only for the 30% of participants provided for analysis):

                        9670*1 + 614*2 +... +210*15 = 24552

How much does a mistake of 1 hospital-day cost, and is it not worth obtaining a better solution?

b) the error in the predicted DaysInHospital_Y4, averaged over one participant from the group of 70 942, comes to 0.346085; in other words, roughly every three objects will, between them, generate an error of 1 hospital-day.

c) besides, what does the prediction that a given object will spend 15 days in hospital next year actually mean for the customer? What actions can be planned without knowing even the month in which this event will partially or fully occur? In a model with quarterly forecasts, this uncertainty decreases at least 4-fold.

d) I am not sure whether the prediction accuracy would even hold up under this problem statement in the future, e.g. for DaysInHospital_Y5…
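The arithmetic behind these figures can be checked directly. The counts and the sample size of 70 942 are taken from the post above; the assumed metric is the RMSE between log(1 + prediction) and log(1 + actual), as specified in the competition rules:

```python
import math

# Hypothetical discrepancy counts quoted above, as {difference_in_days: cases}.
counts = {1: 9670, 2: 614, 3: 438, 4: 323, 5: 175, 6: 143, 7: 124, 8: 112,
          9: 104, 10: 101, 11: 67, 12: 52, 13: 40, 14: 41, 15: 210}
n_members = 70942  # size of the Target sample quoted above

# Total mis-predicted hospital-days, as in the sum above.
total_days = sum(c * d for d, c in counts.items())

# Each case is a true value d predicted as 0 (or vice versa), so its
# log-scale error is |log(1 + d) - log(1 + 0)| = log(1 + d).
sq_err = sum(c * math.log1p(d) ** 2 for d, c in counts.items())
rmsle = math.sqrt(sq_err / n_members)

print(total_days)       # 24552
print(round(rmsle, 4))  # 0.4334
print(total_days / n_members)  # ~0.3461 days per member
```

Both headline numbers, the 24 552 hospital-days and the 0.4334 score, follow from the stated error distribution.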

Yours faithfully,

                                               ericgrig  

Eric Grig, Odessa wrote:

...

What success there is in solving the task comes only from the fact that we have not one Markov-chain object requiring a prediction, but a whole set of such individual Markov-chain objects, and the solution is obtained only by searching for regularities in the structure of the relations among object properties within a single year.

...

Some work along these lines has been done at Johns Hopkins. See "Morbidity trajectories as predictors of utilization" by Chang et al. (2011).

These days you can easily find such literature references using Google and Google Scholar, so it isn't necessary to give the full reference. What gets in the way of performing more detailed analyses of this kind is that the Heritage data had to be "fuzzified" in various ways to preserve patient privacy. Unfortunately, this reduces the possible predictive accuracy of models built using this data alone.

But even adding more and more possible predictors doesn't necessarily mean that more and more predictive accuracy can be obtained. There are some misconceptions in both the statistical and machine learning literatures over this point.

Debugging model code can be difficult, because a model can still generate halfway reasonable answers and still be wrong. One could even feed the data into SAS PROC REG (even though its underlying model is wrong for the Heritage data.) One would probably get an answer, but it might not be a very GOOD answer.

I just discovered a major problem with my own code, which gets back to a Forum thread right at the beginning of the competition - "The "Optimized Constant Value" Benchmark". It was noted that there was a major discrepancy between the optimized constant value and the arithmetic mean of the DaysInHospital field. This issue was not resolved. It's also not clear that the original analysis was fully correct, and I'm now working on what needs to be done to get improved answers. It's not at all obvious, and I don't recall having seen this specific problem addressed in the standard literature. 

Actually, my perspective is that key data was missing from the competition for political reasons; it seems political correctness carries more weight than seeking reality and the truth. Attributes like race (white, black, hispanic, etc.), address/zip code, income bracket, smoker/non-smoker, drug offenses, criminal history, and marital status (single/married/divorced) are necessary in order to develop a comprehensive and holistic model; everything else is fluff without them. But what would be the ramifications if prejudice were made, or implied, on the basis of true demographics and sustained lifestyle choices? Clearly the media would target the competition if the winning entry took advantage of such metrics.

We all probably have different political views here and there's some consensus on what's too much butting in by insurance companies. Many of us may not have participated if we thought the results were going to be used to discriminate on the basis of those criteria.

Some of the data which we did have to work with, such as the pay delay, could act as proxies for income bracket, or perhaps for traits that have to do with health.

Someone earlier mentioned the average days in hospital, and now I am kind of kicking myself because I accidentally normalized my log(1+value) mean to the optimized constant benchmark instead of to log(1+optimized constant).

GoldenHind, I think the discrepancy is to be expected, because the goodness-of-fit statistic is the RMS between log(1+guess) and log(1+value). The best constant guess is not the mean value. Maybe it was OK for my mean to be significantly higher than the optimized benchmark? If you actually WERE able to predict the days in hospital with great accuracy, you'd end up sending in predictions very close to the actual days in hospital, and thus your mean would be close to their mean. I imagined our entries would be perturbations from the optimized constant based on our information, but that we'd keep to that same mean... I played more with the code and method than with the normalization.
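To make that point concrete: under the log(1+x) RMS metric, the best constant guess equalizes the log-scale residuals, not the raw ones, so for skewed data it sits below the arithmetic mean. A minimal sketch with a made-up, skewed DaysInHospital-like sample:

```python
import math

# Toy, skewed sample (made up for illustration): most members have
# 0 days in hospital, a few have long stays.
days = [0] * 85 + [1] * 8 + [2] * 3 + [5] * 2 + [10, 15]

# The metric is RMS between log(1+guess) and log(1+actual), so the best
# constant guess c satisfies log(1 + c) = mean(log(1 + y)):
log_mean = sum(math.log1p(d) for d in days) / len(days)
best_constant = math.expm1(log_mean)

arithmetic_mean = sum(days) / len(days)
print(best_constant < arithmetic_mean)  # True: the optimized constant is below the mean
```

This is why a discrepancy between the optimized constant benchmark and the arithmetic mean of DaysInHospital is expected rather than a bug.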

One trick I used for prediction normalization for my submissions was to submit multiple copies of the same forecasts with different multipliers (applied to log(1+guess)) and then do a simple quadratic fit of the resulting leaderboard scores to the multiplier value.  This lets one find the "optimal" multiplier for this prediction set.  One needs at least 3 points (I often chose multiplier values 0.95, 1.00, and 1.05), and I tried a few more if necessary so that the best leaderboard score of my set lay at one of the points between the lowest and highest chosen multiplier values.
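The multiplier-probing trick above can be sketched as follows. The multipliers 0.95, 1.00, and 1.05 are the ones mentioned in the post, but the leaderboard scores here are made up for illustration, and the exact-fit-through-three-points approach is one straightforward way to implement the quadratic fit:

```python
def fit_parabola(xs, ys):
    """Coefficients (a, b, c) of y = a*x^2 + b*x + c through three points,
    via Newton's divided differences."""
    (x0, x1, x2), (y0, y1, y2) = xs, ys
    d1 = (y1 - y0) / (x1 - x0)
    d2 = ((y2 - y1) / (x2 - x1) - d1) / (x2 - x0)
    return d2, d1 - d2 * (x0 + x1), y0 - d1 * x0 + d2 * x0 * x1

multipliers = (0.95, 1.00, 1.05)   # multipliers applied to log(1+guess)
scores = (0.4640, 0.4628, 0.4635)  # hypothetical leaderboard RMSLE values

a, b, c = fit_parabola(multipliers, scores)
best = -b / (2 * a)    # parabola's vertex: the estimated optimal multiplier
print(round(best, 4))  # 1.0066
```

The vertex formula is only meaningful when a > 0 (the scores form a valley), which is why it helps to choose probe points so that the best score lies strictly between the lowest and highest multipliers, as the post describes.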

ADP wrote:

Actually, my perspective is that key data was missing from the competition for political reasons; it seems political correctness carries more weight than seeking reality and the truth. Attributes like race (white, black, hispanic, etc.), address/zip code, income bracket, smoker/non-smoker, drug offenses, criminal history, and marital status (single/married/divorced) are necessary in order to develop a comprehensive and holistic model; everything else is fluff without them. But what would be the ramifications if prejudice were made, or implied, on the basis of true demographics and sustained lifestyle choices? Clearly the media would target the competition if the winning entry took advantage of such metrics.

This question has come up before in other contexts, in particular in the field of consumer credit - the granting of home mortgages and credit cards. The original practice in the US was to deny credit on the basis of such things as race and residential area (the well known practice of "redlining"). As the result of public outcry, the criteria for granting credit were changed to supposedly be less discriminatory.

There are a lot of technical issues here concerning statistical prediction and modeling. It's been a while since I looked at all this, so I don't know how it all turned out, but I do know that because many predictors of creditworthiness are quite strongly correlated among themselves, it should be possible to model creditworthiness without explicitly including race or residence.

I've been working on these statistical issues for some time now. A proper solution would, I think, be of value in many application areas. One consideration is that of "proxy predictors", for which I earlier referenced the Kraemer book. Another technique, which I am sure many entrants in this competition would be familiar with, is principal components, with its eigenvectors and eigenvalues. This was used in the Netflix competition, but has been around for over a hundred years (Karl Pearson, 1901).

David J. Slate wrote:

One trick I used for prediction normalization for my submissions was to submit multiple copies of the same forecasts with different multipliers (applied to log(1+guess)) and then do a simple quadratic fit of the resulting leaderboard scores to the multiplier value.  This lets one find the "optimal" multiplier for this prediction set.  One needs at least 3 points (I often chose multiplier values 0.95, 1.00, and 1.05), and I tried a few more if necessary so that the best leaderboard score of my set lay at one of the points between the lowest and highest chosen multiplier values.

Thanks for sharing this trick. My guess is that most of the top 10 teams used leaderboard scores to do some sort of blending. I don't think this can be used in practice, though. That's why I suggest, for future competitions, disallowing the use of leaderboard scores in the Prediction Algorithm.

I remember reading Exploiting knowledge of test set distribution post from Merck Molecular Activity Challenge forum where a similar topic has been discussed. Cheers!

Looking back at what I did, I think I must have adjusted log means to the Leaderboard before the 3rd Milestone and then forgotten the need to do that, and kept the same normalization. In retrospect, I might have improved if I didn't keep that normalization fixed--I still moved up 295 places on the Leaderboard in the very last week, but perhaps could have done better. I think the fact that adjusting to the Leaderboard beyond a certain point was detrimental indicates this wasn't that serious a problem with the competition. 

Couldn't agree more. Ensemble/blended models should be limited to a maximum of, say, 5 or thereabouts, to prevent a 2-year competition from attracting ensemble submissions based on possibly 300+ prior submissions, for example, a situation that could be considered a computationally impractical overall solution.

Sarkis wrote:

Thanks for sharing this trick. My guess is that most of the top 10 teams used leaderboard scores to do some sort of blending. I don't think this can be used in practice, though. That's why I suggest, for future competitions, disallowing the use of leaderboard scores in the Prediction Algorithm.

I remember reading Exploiting knowledge of test set distribution post from Merck Molecular Activity Challenge forum where a similar topic has been discussed. Cheers!

We used the leaderboard scores in our blends. The ridge regression trick is really clever and as with any algorithm, you make assumptions and then run tests to verify those assumptions. I don't think it's fair to limit modeling techniques in a competition based on what is computationally feasible today. I agree that some of these blends may not be ready for production, but the individual models are probably still very valuable.

I figured I'd chime in since I actually wrote a rather overly lengthy blog-post about this recently.

In short, while the target 0.4 was obviously VERY difficult to reach, I think it was chosen appropriately, or at least reasonably close.  My basic logic is that an exceptional prize ($3 Million) deserves an exceptional solution.  $500,000 is well worth squeezing a lot of juice out of existing algorithms (which, as of the last few milestone prizes, is more or less what happened), but it's difficult to say that anything revolutionary has really occurred.

But also, I don't think the threshold was so difficult to reach that it stopped anyone from trying, and that's equally important to their goals.  Of course, the guaranteed $500,000 was probably enough to push competition anyway, but if it had been 0.4 or nothing I bet many people would have lost interest in the last year.

I'd be very curious to see what work the HHP/Kaggle folks put into the competition before choosing the 0.4 threshold.  How much of it was a SWAG, or just rounding-to-the-next-lowest-tenth?  They're certainly capable of putting a lot of work into it, but three years of experts and amateurs was a lot of brainpower... I don't envy the task of making the decision that put millions of dollars on the line!

