"That would be a VERY BAD IDEA, all around"

Since you feel bad for killing off the forum conversation [!] I thought I'd drop a reply...

I'm not sure whether I agree with your points or not, really, from the point of view of whether it's the right decision to make NOW; I think some of the concern is that there was an expectation set in the rules and the forums before the competition started about the detail available for review at the milestones, and it's unclear if it's being met.

Remember that there was some call to boycott the competition early on, and concerns about visibility, openness, the right to publish results, and similar issues about the clarity of the rules scattered through the forums.  Clearly some of us were expecting to be able to precisely match the milestone winner's results.  In my case I hoped for a learning exercise, and the papers as published do that sufficiently, but for the variety of reasons that have come up in the forums, this sort of diversion from expectation may be more of an issue.  If, for example, someone chose to enter, rather than boycott, because of this expectation of openness which is in the rules, then I think there's an obligation to meet that expectation.

The rules state that "documentation [...] must be written so that individuals trained in computer science can replicate the winning results" and (separately) that "Sponsor will deliver the Prediction Algorithm and documentation to the judges and also post the information on the Website for review and testing by other Entrants".  Even though the final decision is at the "sole discretion" of the judges, the expectation has been set.

I do think you could achieve that expectation without code, by the way, but you'd have to describe the specific algorithms and parameters, data selection and cleansing, modeling and ensembling techniques in great detail. This level of detail is just not provided currently.

ChipMonkey wrote:
I'm not sure whether I agree with your points or not, really, from the point of view of whether it's the right decision to make NOW; I think some of the concern is that there was an expectation set in the rules and the forums before the competition started about the detail available for review at the milestones, and it's unclear if it's being met.

Remember that there was some call to boycott the competition early on, and concerns about visibility, openness, the right to publish results, and similar issues about the clarity of the rules scattered through the forums. 

That is absolutely correct, and to my mind then, and to my mind now, the responses left uncertainty about precisely what the Rules meant to say, and how they would be interpreted.

For example, I did not participate in any phase of the NetFlix contest, so I never read those rules as critically & carefully as I did the rules for this contest, but my take on the rules for this contest was that they were inspired by and "like NetFlix" in terms of expected papers and disclosure by prize-winning teams to other teams. Hence, as I've said previously, in general the write-ups as delivered have met my expectations. (Some questions raised by other people have made me realize there are some spots in the descriptions that are tantamount to "magic happens here" moments to me.)

But I can't say that anyone's expectation, based on the early forum conversations, that he/she would see a runnable script or equivalent was wrong. The administrators could have squelched that line of thought, early, and didn't.

(However I think that in fairness we have to also concede that conducting big-dollar, crowd-sourced contests is a very new practice; there's not a lot of history to learn from yet. With every new contest we're establishing the norms and precedents for the future.)

ChipMonkey wrote:
I do think you could achieve that expectation without code, by the way, but you'd have to describe the specific algorithms and parameters, data selection and cleansing, modeling and ensembling techniques in great detail. This level of detail is just not provided currently.

Agree with that too.

So here we are, and (perhaps) what happens turns on the finely nuanced interpretations of just a few key phrases in the Rules:

* Does "individuals trained in computer science" mean the Judges, or other entrants? Reasonable people could interpret that either way.

* If it's other entrants, does "replicate the winning results" mean "validate to 4 decimal places", which is (from memory) the standard imposed on the Judges, or does it mean "replicate" in the same sense that the results of peer-reviewed academic papers can be "replicated" by other scientists, with a separate independent study? (And does the 30-day window apply only to the validation by the Judges, or is that also applicable to validation efforts by other contestants?) Reasonable people could interpret these things either way.

* Perhaps most critically, does the "also post the information" part of "Sponsor will deliver the Prediction Algorithm and documentation to the judges and also post the information on the Website for review and testing by other Entrants" mean ALL the information/documentation provided to the Judges, or only the portion of the documentation that was prepared specifically as the methodology write-up for other contestants? Reasonable people could interpret this either way, too.

However, if you are a provisional Prize-Winner, and you think "ALL" is the intent and requirement of the Rules, then you would give the Judges a runnable script, or a series of them, and you would expect the Judges to share your script/s fully with all the other contestants. Therefore you would say "I submit these code scripts, which, when run on the prescribed software and hardware platform, will precisely re-create my target file submission of mm-dd-yy. Since it will do that, this code script is complete and sufficient documentation of the methodology, as required by Rule ##." And you certainly would not spend additional time & effort making an additional narrative explanation.

So far in contests like this ... i.e., with mile-post prizes ... I'm not aware that a mile-post winner's response has ever been "the script IS the doco, that's all you're getting."

How such things have been interpreted and handled before doesn't necessarily apply to THIS contest, of course, but that does seem to have become the norm. To me, these Rules as written don't clearly say "be aware, the methodology sharing rules of this contest will break unprecedented new ground for the level of detailed disclosure." (But maybe I wasn't paying enough attention.)

* * *

Regardless of how I thought, at the beginning, that the share & disclose rules would be interpreted, I'd still be entered in this contest, and puttering along in my sporadic, erratic way.

You ?

We have read through the commentary and thank all the readers for their feedback. We have aggregated the following responses to hopefully answer the questions and fill in any gaps that were pointed out in the methodology.

We would like to say that even though the rules seem to lean toward a source-code-oriented deliverable, we were asked by the organizers to format the methodology more like a paper, and we were given the Netflix Prize winners' papers as the example. This format may be less conducive to replication by a computer scientist, but perhaps more conducive to replication by an experienced predictive modeler. Nevertheless, we did provide a fair amount of source code to assist with the replication process. If others disagree with this format, they should probably take it up directly with the organizers.

However, we do happen to agree with this format. We believe that most of the top teams used ensemble models which have some level of randomness, which means some adaptation is required to adjust to the variation in running the same technique twice. Providing exact coefficients could therefore be misleading, but providing the method for combining (e.g. linear regression) allows for this adaptation to be made and therefore provides for a better replication.

Phil, Dave & Randy

@Edward

  1. Thanks for the example software, but shouldn't we get the models and parameters in order to reproduce the results? Is this one of the models used in the ensembling, or just an example?

The example code provided was just an example. A more accurate model can be achieved by including the remaining variables and fine tuning the algorithm parameters.

  2. This paper suggests that they improved their results by assigning a score to each PCG based on medical knowledge. I think the actual values used should be published, as this could be considered additional data and is absolutely required to reproduce their results.

PCG        Risk   Risk (Age > 70)
AMI         5          5
APPCHOL     1          1
ARTHSPIN    1          2
CANCRA      3          4
CANCRB      3          4
CANCRM      4          3
CATAST      3          4
CHF         5          5
COPD        5          5
FLaELEC     1          1
FXDISLC     1          1
GIBLEED     4          3
GIOBSENT    3          2
GYNEC1      1          1
GYNECA      4          3
HEART2      4          3
HEART4      1          2
HEMTOL      3          2
HIPFX       1          4
INFEC4      1          1
LIVERDZ     3          3
METAB1      3          4
METAB3      1          1
MISCHRT     2          3
MISCL1      1          1
MISCL5      1          1
MSC2a3      1          1
NEUMENT     2          1
ODaBNCA     1          1
PERINTL     1          1
PERVALV     4          5
PNCRDZ      2          3
PNEUM       1          2
PRGNCY      1          1
RENAL1      2          3
RENAL2      4          3
RENAL3      2          2
RESPR4      1          1
ROAMI       3          4
SEIZURE     2          2
SEPSIS      2          5
SKNAUT      2          1
STROKE      5          5
TRAUMA      1          2
UTI         1          2

Specifically the two predictors derived from this chart were the mean and max of the risk score across all of a member's claims. These predictors were the most influential predictors in our model.
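The derivation described above can be sketched as follows. The table, member IDs, and column names here are illustrative assumptions, not the team's actual schema, and only a few PCG rows from the chart are included.

```python
# Sketch: deriving the two risk-score predictors (mean and max per member)
# from the PCG risk chart, with the score split by age band.
# Column names and the claims data are hypothetical.
import pandas as pd

# A few rows of the chart above: PCG -> (risk, risk if age > 70)
risk = {"AMI": (5, 5), "ARTHSPIN": (1, 2), "HIPFX": (1, 4), "UTI": (1, 2)}

claims = pd.DataFrame({
    "member_id": [1, 1, 1, 2, 2],
    "pcg":       ["AMI", "UTI", "ARTHSPIN", "HIPFX", "UTI"],
    "age":       [75, 75, 75, 40, 40],
})

# Look up the age-appropriate risk score for each claim
claims["risk"] = [
    risk[p][1] if a > 70 else risk[p][0]
    for p, a in zip(claims["pcg"], claims["age"])
]

# The two derived predictors: mean and max risk across a member's claims
features = claims.groupby("member_id")["risk"].agg(["mean", "max"])
print(features)
```

Member 1 (age 75) has risks 5, 2, 2, giving mean 3.0 and max 5; member 2 (age 40) has risks 1, 1, giving mean 1.0 and max 1.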

  3. For the Neural Networks it looks like some proprietary software has been used (Tiberius); is this correct? What are the models used, the number of hidden layers, functions, number of neurons, parameters, etc.? None of this has been clarified.

We did use Tiberius for building the Neural Networks, primarily out of familiarity, as it was developed by a team member. The same ideas could easily be implemented in R.

We used a single hidden layer with 3-4 neurons, the backpropagation of error weight update rule, with a learning rate of 0.007 and 500 epochs. 50 models were built with random weight initialisation and then the predictions averaged (this is essentially a single giant neural network).

For those wanting to develop their own neural network code please see http://www.philbrierley.com/code.html
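A minimal sketch of the setup described above, using scikit-learn's `MLPRegressor` as a stand-in for Tiberius (an assumption, not the team's actual code) on synthetic data:

```python
# Sketch of the described ensemble: 50 single-hidden-layer networks with
# random weight initialisation, predictions averaged. MLPRegressor is a
# stand-in for Tiberius; the data here is synthetic.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([0.5, -0.2, 0.1, 0.0, 0.3]) + rng.normal(scale=0.1, size=200)

preds = []
for seed in range(50):  # 50 models, each with a different random init
    net = MLPRegressor(
        hidden_layer_sizes=(4,),   # single hidden layer, 3-4 neurons
        solver="sgd",              # backpropagation weight updates
        learning_rate_init=0.007,  # learning rate from the write-up
        max_iter=500,              # 500 epochs
        random_state=seed,
    )
    net.fit(X, y)
    preds.append(net.predict(X))

# Average the 50 predictions -- effectively one giant neural network
ensemble = np.mean(preds, axis=0)
print(ensemble[:3])
```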

  4. For the other algorithms (GBM, bagged trees and linear regression), there are also many models derived (20 is mentioned, but with some models consisting of even more models), but the models themselves are unclear to me. Note that many models were generated (up to 60? the exact number is unknown), so there are many ways to incorporate different models with different subsets, different parameters, and possibly different initialisations of weights. I cannot guess what these subsets or parameters are, so I am not able to reproduce the results. A good description of every model with the subsets used is necessary for me to reproduce the results.


The algorithms we used have very few learning parameters that need to be initialised. For each algorithm we tried to use the most appropriate parameters, as determined by a combination of experience and techniques such as cross-validation.

We tried numerous subsets of variables and records, ultimately relying on the results to tell us what was working. The best models were tuned GBMs using all records and all variables.

  5. This paper suggests that they improved their results by assigning a score to each PCG based on medical knowledge. I think the actual values used should be published, as this could be considered additional data and is absolutely required to reproduce their results.

See 2)

  6. Combinations of each PCG, Specialty and PG resulted in many fields which were reduced. Where is the model used for this? This is unclear. "Building classification models" seems to be a very vague description for this.

 The task here is to remove 'useless' variables in order to reduce the data set size. For example, if there is a Specialty * PCG combination that occurs for only one patient, then including this combination as a variable is futile: it will add little to the overall model accuracy but will contribute to overfitting.

 There are numerous techniques commonly used in classification problems for variable elimination, such as stepwise logistic regression. To make the problem binary (0/1), we considered it as a 'did you go to hospital' prediction task (if DIH >= 1 then DIH = 1).

 We considered the counts of each pairing individually (PCG * Specialty, PCG * PG, Specialty * PG). For each we built a logistic regression model using all combinations (i.e. variables), and then calculated each combination's importance to the model. If removing the least important one did not affect the model accuracy, it was removed and the process started again. This was repeated until all combinations in the model appeared to be important.

 To calculate variable importance, each variable is in turn randomly permuted and the model accuracy (AUC/Gini) recalculated. This permutation is repeated several times and the average resulting accuracy taken. If the model accuracy with the permuted variable is not significantly diminished, then that variable can be safely removed, in the knowledge that it has little effect on the overall model.

 Due to the random permutation, repeating this process will not always result in the same subset of selected variables, but each run should produce the same model accuracy. Hence we do not expect you to be able to exactly replicate the variables we ended up with; if we repeated the process we would ourselves end up with a different subset in our final modelling data set. This is of little concern: the important variables would be there, just perhaps not the same variables of lesser importance.
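The backward-elimination-by-permutation procedure described above can be sketched roughly as follows. The synthetic data, the 0.005 tolerance, and the five permutation repeats are illustrative assumptions, not the team's actual settings:

```python
# Sketch: fit logistic regression on the binary 'did you go to hospital'
# target, permute each variable in turn, and drop the one whose
# permutation hurts AUC least, until every remaining variable matters.
# Data and thresholds are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
X = rng.poisson(2.0, size=(500, 6)).astype(float)  # toy pairing counts
y = (X[:, 0] + X[:, 1] + rng.normal(size=500) > 4).astype(int)  # DIH >= 1 -> 1

def permuted_auc(model, X, y, col, repeats=5):
    """Average AUC after randomly permuting one column."""
    scores = []
    for _ in range(repeats):
        Xp = X.copy()
        Xp[:, col] = rng.permutation(Xp[:, col])
        scores.append(roc_auc_score(y, model.predict_proba(Xp)[:, 1]))
    return np.mean(scores)

keep = list(range(X.shape[1]))
while len(keep) > 1:
    model = LogisticRegression(max_iter=1000).fit(X[:, keep], y)
    base = roc_auc_score(y, model.predict_proba(X[:, keep])[:, 1])
    aucs = [permuted_auc(model, X[:, keep], y, i) for i in range(len(keep))]
    least = int(np.argmax(aucs))      # permuting it changes AUC the least
    if base - aucs[least] > 0.005:    # every remaining variable matters
        break
    keep.pop(least)                   # drop it and start again

print("kept variables:", keep)
```

On this toy data the two signal columns (0 and 1) survive while the noise columns are eliminated.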


  7. What are the weights used for the linear blending of the models?

 We have described the method we used to determine the weights. Specific weights are meaningless unless we give the source code to reproduce all our models exactly.

@Mark Waddle

Market Makers wrote:

Using n-fold cross validation to generate evaluation sets, further capping within these limits could be evaluated to determine if this decreased the error.

  • The “further capping” technique is not explained.

Some algorithms (such as linear regression) will produce negative predictions, even on the training set. We know this is impossible, and we can improve the RMSE by 'manually' capping the negative predictions at zero. This demonstrates the flaws in the algorithm's attempt to minimise the RMSE.

The 'further capping' we mention is to test what happens if, rather than choosing zero as the cap, other values such as 0.1, 0.2 etc. are investigated. We can use the cross-validation sets to test which cap value results in the lowest RMSE. (In fact, cross-validation sets are not required to determine whether an algorithm is performing inefficiently at the extremes; the actual results on the training data can be used to see if capping improves the training RMSE.)
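A rough sketch of that cap search. The cap grid and the synthetic predictions are illustrative, not the team's actual values:

```python
# Sketch: a linear model can produce negative days-in-hospital predictions;
# flooring them at a cap (0, 0.1, 0.2, ...) can lower the RMSE. We search
# over a small grid of caps. Data is synthetic.
import numpy as np

rng = np.random.default_rng(2)
actual = np.clip(rng.poisson(0.3, size=1000), 0, 15).astype(float)
pred = actual + rng.normal(scale=0.5, size=1000)  # some predictions go negative

def rmse(a, p):
    return np.sqrt(np.mean((a - p) ** 2))

best_cap, best_rmse = None, np.inf
for cap in [0.0, 0.1, 0.2, 0.3]:
    score = rmse(actual, np.maximum(pred, cap))  # floor predictions at `cap`
    if score < best_rmse:
        best_cap, best_rmse = cap, score
    print(f"cap={cap:.1f}  RMSE={score:.4f}")

# Flooring negatives at zero can never hurt when actuals are non-negative
assert best_rmse <= rmse(actual, pred)
```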

Market Makers wrote:

The technique of blending algorithm dependent predictions with algorithm independent predictions was successful in a previous Kaggle competition.

  • This seems to hint that the team also used algorithm independent predictions. Which algorithm independent predictions were used? I did not find these described in the documentation.

When we say algorithm independent predictions, we are referring to the median models. For each record, the actual prediction could be from any of the algorithms in the median mix (whichever one gives the median score for that record).
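A minimal illustration of such a median model (toy numbers, not the team's predictions):

```python
# Sketch: an "algorithm independent" median model. For each record, the
# final prediction is the median across several algorithms, so the source
# algorithm can differ record by record.
import numpy as np

# rows = records; columns = predictions from, say, GBM, NN, bagged trees, LR
preds = np.array([
    [0.10, 0.30, 0.20, 0.90],
    [1.50, 1.20, 1.40, 1.30],
    [0.00, 0.05, 0.40, 0.10],
])

median_model = np.median(preds, axis=1)  # ~ [0.25, 1.35, 0.075]
print(median_model)
```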

@AndyWocky 


  • What is the admission risk score, defined on page 2 as “... An admission risk score was developed around the Primary Condition Group that was based on medical experience only. This score was a 1-5 ranking for each PCG, and split by age band.” This needs to be defined mathematically, or programmatically, such that it can be reconstructed either from their SQL data, or the raw contest data.

Please see response 2) to @Edward

  • The neural networks package cited (Tiberius) appears to have limited trial availability — this appears to contradict the contest rules in that the software should not be proprietary. Kaggle should comment on this directly.

Please see response 3) to @Edward. There is a link to the source code at the heart of the Tiberius algorithm.

  • There is no detail to support “… Multiple models were built on the two data sets using various parameter settings and variable subsets.” The competitors need to provide the precise models, model parameters, variable subsets and supporting data set for each model that was used to generate their winning submission.

If you try the holistic approach we described, then you will be able to generate a leaderboard score the same as ours. It might not be exactly the same, but it will be the same accuracy. The heart of the approach is the stochastic nature, and if we started again, we would not look to replicate everything exactly.

  • There is no detail to support “… Truncation of the predictions was found to be useful in certain models.” Did truncations vary by model, or data set? What is the truncation function? The competitors need to provide precise details of the truncation function, and the models (with corresponding data sets) to which it was applied.

Please see first response to @Mark Waddle

  • The term “multiple n-fold cross validation” is unclear in “…During the model generation process, multiple n-fold cross validation was used to essentially generate out-of-sample prediction sets for the training data.” What is multiple n-fold CV? For example: is it simply n-fold CV applied for each model? Are the folds the same for each model, and if not, how do they differ? What value of n is used?

Multiple n-fold cross validation is where you do n-fold cross validation not just once but multiple times; the cross-validation prediction is then just an average. Predominantly we used two values of n, 2 and 10. When n = 2 was used, we repeated multiple times until the cv error converged. The reason for using multiple 2-fold cv was mainly to overcome computer memory issues (the training data set is half the size of the complete data set) and to decrease processing time for each pass of the algorithm, rather than any specific mathematical benefit.
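A sketch of repeated 2-fold CV with averaged out-of-sample predictions. Ridge regression and the synthetic data are stand-ins for the actual learners and data:

```python
# Sketch: repeat 2-fold CV several times with different random splits and
# average the out-of-sample predictions per record. Each repeat gives every
# record exactly one out-of-sample prediction.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 4))
y = X @ np.array([1.0, -0.5, 0.2, 0.0]) + rng.normal(scale=0.1, size=300)

oos = np.zeros(len(y))
repeats = 10
for r in range(repeats):  # repeat 2-fold CV multiple times
    kf = KFold(n_splits=2, shuffle=True, random_state=r)
    for train_idx, test_idx in kf.split(X):
        model = Ridge().fit(X[train_idx], y[train_idx])
        oos[test_idx] += model.predict(X[test_idx])
oos /= repeats  # average out-of-sample prediction per record

print("out-of-sample RMSE:", np.sqrt(np.mean((y - oos) ** 2)))
```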

  • The linear regression ensembling needs to be described in reproducible detail. For example: was this a standard linear regression with an intercept? Were the folds for each model the same? Was any sampling performed during this fitting procedure? How were out-of-bound predictions altered (e.g., was a negative value simply rounded to 0)? What models were included in the regression and what were the final weights?

Yes, it was standard linear regression with an intercept. The key here was mainly experience in knowing not to use too many models in the mix and not to use highly correlated models; otherwise overfitting and model instability could occur.

Please see the response to @Mark Waddle and response 7) to @Edward.
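A minimal sketch of blending by standard linear regression with an intercept; the three synthetic "model predictions" are illustrative stand-ins for out-of-sample predictions from a few diverse models:

```python
# Sketch: fit a linear regression (with intercept) on a handful of model
# predictions against the actuals to get the blending weights.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
y = rng.poisson(0.3, size=500).astype(float)  # toy "days in hospital"

# Toy out-of-sample predictions from three models of varying accuracy
model_preds = np.column_stack([
    y + rng.normal(scale=0.6, size=500),
    y + rng.normal(scale=0.8, size=500),
    y + rng.normal(scale=1.0, size=500),
])

blend = LinearRegression().fit(model_preds, y)  # intercept fitted by default
blended = blend.predict(model_preds)

print("weights:", blend.coef_, "intercept:", blend.intercept_)
print("blend RMSE:", np.sqrt(np.mean((y - blended) ** 2)))
```

In-sample, the least-squares blend can do no worse than any single model it includes; the practical risk, as the reply notes, is overfitting when too many correlated models enter the mix.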

  • It appears that the “… alternative to weighting by model was to just take the median predictions…” was essentially median-bagging, and was yet another model used in the final ensemble. Please confirm.


Yes, median models were included in the final ensemble.

  • “The weightings were applied to the log scale version of the predicted values” — how were out-of-bound values handled? Were they rounded into range, or was a shrinking strategy used? Were out-of-bound values modified uniformly for all patients, or differently by subset?

Out of bound values were dealt with at the individual model level in the first instance. See response to @Mark Waddle

  • Complete details of the "...final solution ... ensemble of approximately 20 models” is needed: which models were used, what are their associated parameters, which data sets were they trained on and what is their corresponding weight in the final solution? 

Please see the response to your 3rd question.

  • In the last paragraph of the report, “We found that calibrating the final predictions so that the overall average predicted days in hospital matched the optimized constant value benchmark gave a small improvement” — full details are required, for example: did they simply mean-shift their entire submission targets? Was any scaling or patient-subset specific shifting performed? What was the final value used for translating predictions? What was the impact of the improvement?

The mean of the predictions should be around 0.209179, the optimised constant value benchmark for the leaderboard set. This can be achieved by performing a y=mx or y=mx+c transformation, to name but two. The former is easier to apply and should result in an improvement. When we applied this transformation there was an improvement but it was very minimal.
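The y = mx calibration can be sketched in a few lines. Only the 0.209179 target comes from the reply above; the predictions here are synthetic:

```python
# Sketch: rescale predictions multiplicatively (y = m*x) so their mean
# matches the optimised constant value benchmark for the leaderboard set.
import numpy as np

TARGET_MEAN = 0.209179  # optimised constant value benchmark (from the reply)

rng = np.random.default_rng(5)
pred = np.abs(rng.normal(0.25, 0.2, size=1000))  # toy raw predictions

m = TARGET_MEAN / pred.mean()  # single multiplicative factor
calibrated = m * pred

print("mean before:", pred.mean(), "after:", calibrated.mean())
```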


Hi all,

We are in the process of liaising with the judges. We'll report their decision as soon as we have everybody's feedback.

I've now received all judges' comments on both papers, with one exception. I am working on consolidating the comments now, and hope to have this complete in the next couple of days.

I have a quick question for Willem Mestrom.

You are using Provider ID, PCP, and Vendor ID for the MC2 dataset. But some Providers, PCPs, and Vendors appear only in the prediction dataset. How do you create the fi, gi for these categories? Are you just using average fi, gi values, or do you have a better way to estimate these parameters?

Thanks.

Hi thonda,

That is a good question. I didn't think of it so I'm not doing anything smart with it. The fi and gi are initialized with random data (uniform between -0.01 and +0.01). If there is no data in the learning set they will never be updated and will still have the original (random) data when the predictions are made. Probably it would be better to set them to the overall mean or perhaps the mean of just the ones with few observations if that is significantly different.

@John: Browsing through the topic I noticed I missed your final question. I don't know any rule of thumb to find a good value for the alpha parameter. Trial and error is not going to work, since you will be using the alpha parameter to prevent overfitting the leaderboard and improve the private score, so you don't get any feedback. An alpha of zero is probably going to give the best leaderboard score. I tried to find a good value based on a similar set of predictions for Y1 and simulated the leaderboard scoring and blending procedure.

Willem

Quick update: I've now received the final judge's comments. Hopefully I'll have it all compiled by tomorrow; Monday at the latest.

I have a question for Willem Mestrom.
In the 1st milestone solution you are using stochastic gradient descent. You gave a detailed example of the update in model CatVec1. My question is about it.
What is 'e' between nf and (\sum gi) in the update for \hat{f}_i? There should be only nf and the gradient (which is the sum), but what is 'e'?
Thank you, and congratulations on your results!

Hello everyone,
I am a student at college thinking of choosing this topic as a data mining project to work on just for my class. So I found out about this competition now and signed up to assess to the forums and such but I could not view winners' paper because I did not accept the rules which I really could not because "This competition is CLOSED TO NEW ENTRANTS" Could anybody share the papers by those winners if it's allowed and legal? Thanks ahead

Anyone can access the papers here: https://www.kaggle.com/wiki/HeritageMilestonePapers

I'll work on getting the links fixed.
