Congratulations to the winners of the first milestone prize! According to the rules, we should be able to reproduce the scores of the two winners, which combined will probably result in an even better score. To reproduce the scores in the Market Makers document I am missing some vital information, so I have some questions:
1) Thanks for the example software, but shouldn't we get the models and parameters in order to reproduce the results? Is this one of the models used in the ensembling, or just an example?
2) This paper suggests that they improved their results by assigning a score to each PCG based on medical knowledge. I think the actual values used should be published, as this could be considered additional data and is absolutely required to reproduce their results.
3) For the Neural Networks it looks like some proprietary software (Tiberius) has been used. Is this correct? What are the models used, the number of hidden layers, functions, number of neurons, parameters, etc.? None of this has been clarified.
4) For the other algorithms (GBM, Bagged trees and Linear regression), many models were also derived (20 is mentioned, but with some models consisting of even more models), but the models themselves are unclear to me. Note that many models were generated (up to 60? The exact number is unknown), so there are many ways to incorporate different models with different subsets, different parameters, and possibly different initialisations of weights. I cannot guess what these subsets or parameters are, so I am not able to reproduce the results. A good description of every model, with the subsets used, is necessary for me to reproduce the results.
5) Combinations of each PCG, Specialty and PG resulted in many fields, which were then reduced. Where is the model used for this? This is unclear. "Building classification models" seems to be a very vague description for this.
6) What are the weights used for the linear blending of the models?
Congratulations to the winners! The requirements for the winners' papers are not properly defined. For authors there is an obvious dilemma: professionalism versus preserving their competitive edge. I think the chance of repeating the winners' results (even if the code and submission strategy are published) is almost zero. Even purely in theory, a 'repeater' would need to reproduce a lot of submissions. So no one can check whether the papers satisfy the rules, and in practice that means there is no requirement! Whether there is sufficient information to reproduce the results is only a subjective judgment. Suggestion: the organizers and/or the community should formulate a list of questions that authors have to answer. This should be done before the next milestone prize. A general description in free format may also be included in the papers, but a required minimum should be defined.
Thank you for the papers and code to start with.
Before I give my responses to the documentation, I would like to ask a question of the Kaggle Admins.
Rules wrote: Documentation must be written in English and must be written so that individuals trained in computer science can replicate the winning results.
Rules wrote: 1. conditional winners will have 21 days from receipt of notification to document their methodology as described in Rule 12 above. Sponsor will deliver the Prediction Algorithm and documentation to the judges and also post the information on the Website for review and testing by other Entrants. 2. other Entrants will have the opportunity to submit comments/complaints relating to conditional winners' methodologies for 30 days after such conditional winners' methodologies are published on the Website.
After reading the above rules, I expect that the documentation must be specific to the point that it would allow a computer scientist to replicate the results within a reasonable time period, something like a week. Is my interpretation correct?
An algorithm that uses a kind of 'blending' based on the public score can be reproduced only after repeating all of the original submissions. So either the original submissions should be published or some procedure for repeating them should be defined: one needs 3 weeks to make 20 submissions with a quota of 1 submission/day. Not to mention that reproducing a submission without a detailed description (= code) is almost impossible. It means that "… individuals trained in computer science can replicate the winning results" + "… information on the Website for review and testing by other Entrants" = a meaningless construction.
sled wrote: An algorithm that uses a kind of 'blending' based on the public score can be reproduced only after repeating all of the original submissions. So either the original submissions should be published or some procedure for repeating them should be defined: one needs 3 weeks to make 20 submissions with a quota of 1 submission/day. Not to mention that reproducing a submission without a detailed description (= code) is almost impossible. It means that "… individuals trained in computer science can replicate the winning results" + "… information on the Website for review and testing by other Entrants" = a meaningless construction.
Well, the winning submission is one single submission, and as long as all the inputs are provided, it should not be hard to replicate. By the same token, SQL Server Express took years and \$ to build. As long as other Entrants can use it, I don't see any difficulty using it for testing. Mark Waddle raised an interesting question and I'd like to hear the answer to that question as well. Thank you!
Jeremy Howard (Kaggle) Kaggle Admin:
Mark Waddle: Yes, it should be in enough detail to allow a skilled person to replicate it. I wouldn't say however that it should be within a week. All the information should be provided, but actually using the information could take longer than that, since there is a lot of work in some of these entries!
Edward: All of your requests are appropriate. Note that regarding (1) it is not necessary for the code of models to be published, only the details of their parameters, architecture, etc.
John: Yes, the blending technique should also be documented.
I'll ask the teams to provide this additional information. Many thanks for all the feedback so far.
 Hi John, The paper I referred to for the blending technique is the first reference I could find but is indeed not very useful for implementation. A good description of the technique I used can be found in section 7 of this paper: http://www.netflixprize.com/assets/GrandPrize2009_BPC_BigChaos.pdf. Hope this helps. (The lambda parameter in this paper is the alpha parameter I describe in my paper).
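As a rough illustration of blending driven by leaderboard feedback, here is a minimal sketch. It is an assumption about the general shape of such a technique, not a transcription of the cited paper or of the winning code: it assumes a ridge-regularized least-squares blend, and the function names, the all-zeros submission used to estimate the size of the hidden targets, and the scaling of `lam` by the number of rows are all hypothetical. The only fact it relies on is the identity N * rmse^2 = |x|^2 - 2 x.y + |y|^2, which lets x.y be recovered from a reported score without seeing y.

```python
import math


def solve(a, b):
    """Solve the linear system a w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (m[i][n] - s) / m[i][i]
    return w


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def rmse(pred, truth):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))


def blend_weights_from_scores(preds, scores, zero_score, lam):
    """Ridge blend weights computed from reported RMSEs only, never from the targets.

    preds:      one prediction vector per model, on the scale the metric uses
    scores:     reported RMSE for each vector in preds
    zero_score: reported RMSE of an all-zeros submission, so |y|^2 = N * zero_score^2
    lam:        ridge strength (hypothetical parameterisation, scaled by N)
    """
    n = len(preds[0])
    yy = n * zero_score ** 2
    # x.y = (|x|^2 + |y|^2 - N * rmse^2) / 2
    xty = [(dot(p, p) + yy - n * s ** 2) / 2 for p, s in zip(preds, scores)]
    xtx = [[dot(p, q) for q in preds] for p in preds]
    for i in range(len(preds)):
        xtx[i][i] += lam * n
    return solve(xtx, xty)
```

With a log-scale metric the vectors passed in would be log(1 + prediction) rather than raw predictions, and a leaderboard that scores only part of the test set makes the identity approximate rather than exact.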
Congratulations to the winners. It is my understanding of the rules that the winners of the milestone prizes are required to provide their algorithm to their competitors, to allow all of us to improve on these milestone results. After reading the two papers I am quite sure that nobody will be able to reproduce the results just from the papers. Has anyone independently reproduced the milestone results just from the papers?
The Market Makers paper goes to great lengths to describe the (quite obvious) benefits of blending, but has only a very vague description of the 60 submodels created (and the 20 used). We only learn that there are basically 3 different model techniques (GBM, Forest Tree and Neural Network) used and that some form of ensemble models is used (how many, which size?). It does not describe how the weights for blending were reached. (Telling us that taking the median of the submodels would reach a top 10 position is nice, but does not fulfill the rule in my eyes.) Providing the sample code is definitely helpful, but the code provided is just 1 of the 60 submodels, with no blending at all.
Willem Mestrom definitely gives more detail about his models, although the description (even with the cited Wikipedia article) probably requires knowledge beyond the paper. Maybe some code (probably all) would help here.
I know that revealing the exact algorithms used to reach the milestone prize will hurt the leaders (because they lose their advantage, as now everybody should be able to produce their score), but this will be the only way that other participants can improve on these results.
Congratulations to the milestone winners! In addition to Edward's questions above, I have a few requests of my own for clarifications regarding Market Makers' documentation.
Market Makers wrote: Using n-fold cross validation to generate evaluation sets, further capping within these limits could be evaluated to determine if this decreased the error.
The "further capping" technique is not explained.
Market Makers wrote: The technique of blending algorithm dependent predictions with algorithm independent predictions was successful in a previous Kaggle competition.
This seems to hint that the team also used algorithm independent predictions. Which algorithm independent predictions were used? I did not find these described in the documentation.
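One possible reading of "further capping", offered only as a guess, is a search over candidate lower and upper caps that are scored on out-of-fold predictions. The minimal sketch below shows that reading. The candidate caps, the function names, and the use of a root mean squared logarithmic error are assumptions for illustration, not details taken from the report.

```python
import math


def rmsle(preds, actuals):
    """Root mean squared logarithmic error between predictions and actual values."""
    sq = [(math.log1p(p) - math.log1p(a)) ** 2 for p, a in zip(preds, actuals)]
    return math.sqrt(sum(sq) / len(sq))


def cap(preds, low, high):
    """Force every prediction into the interval [low, high]."""
    return [min(max(p, low), high) for p in preds]


def best_caps(oof_preds, actuals, lows, highs):
    """Return (error, low, high) for the cap pair with the lowest out-of-fold error."""
    results = [
        (rmsle(cap(oof_preds, lo, hi), actuals), lo, hi)
        for lo in lows
        for hi in highs
        if lo < hi
    ]
    return min(results)
```

Because the caps are chosen on predictions for rows the model did not train on, the comparison is less likely to reward overfitting than choosing caps on in-sample predictions.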
ruediger wrote: I know that revealing the exact algorithms used to reach the milestone prize will hurt the leaders (because they lose their advantage, as now everybody should be able to produce their score), but this will be the only way that other participants can improve on these results.
(emphasis added.) Gee, I was thinking that one possible path to success was by conjuring up some ideas on my own that other people WEREN'T doing. Silly, I know.
Jeremy Howard (Kaggle) wrote: Mark Waddle: Yes, it should be in enough detail to allow a skilled person to replicate it. I wouldn't say however that it should be within a week. All the information should be provided, but actually using the information could take longer than that, since there is a lot of work in some of these entries!
Jeremy: The rules only provide for rolling 30-day windows for competitors to comment on and evaluate winner solutions. Given this constraint, it should be clear that the details provided must be sufficient to reproduce results within a week or two. If this were not the case, how would competitors truly be able to verify the solutions against errors and omissions? Also, please don't confuse the effort required to generate the winning algorithm with the effort required to reproduce it. The latter can be much faster, perhaps only limited by the CPU time required to build models.
Jeremy Howard (Kaggle) wrote: John: Yes, the blending technique should also be documented. I'll ask the teams to provide this additional information. Many thanks for all the feedback so far.
What is the timeline for obtaining additional details? Does this reset the 30-day clock? Will these be provided as addenda to the reports?
Thanks, Andy
Like most of the other competitors posting to this forum thread, I wish to congratulate the winning teams on a job well done! Their scores are impressive. At the same time, I wish to express my frustration with the lack of reproducible detail in the reports. I do not fault the winning teams for this issue: documentation is a time-consuming task, there were no clear guidelines from Kaggle on the document format, and it is clearly preferable to minimize disclosure for competitive reasons. However, the rules do seem to be clear that skilled practitioners should be able to reproduce the winning results based on these descriptions, and it is evident from the posts in this thread that this is not possible. Here, I do fault the Kaggle contest administrators: it is obvious to novice and skilled data miners alike that these papers were insufficient, and the Kaggle administrators should've intervened prior to release to require expanded descriptions. I hope that the Kaggle team learns from this and improves the process for milestone 2.
Thanks, Andy
Here is my contribution to the "required details" for the Market Makers solution (apologies for duplicated concerns, I just want to keep everything together):
What is the admission risk score, defined on page 2 as "... An admission risk score was developed around the Primary Condition Group that was based on medical experience only. This score was a 1-5 ranking for each PCG, and split by age band."? This needs to be defined mathematically, or programmatically, such that it can be reconstructed either from their SQL data or from the raw contest data.
The neural networks package cited (Tiberius) appears to have limited trial availability. This appears to contradict the contest rules, in that the software should not be proprietary. Kaggle should comment on this directly.
There is no detail to support "… Multiple models were built on the two data sets using various parameter settings and variable subsets." The competitors need to provide the precise models, model parameters, variable subsets and supporting data set for each model that was used to generate their winning submission.
There is no detail to support "… Truncation of the predictions was found to be useful in certain models." Did truncations vary by model, or by data set? What is the truncation function? The competitors need to provide precise details of the truncation function, and the models (with corresponding data sets) to which it was applied.
The term "multiple n-fold cross validation" is unclear in "… During the model generation process, multiple n-fold cross validation was used to essentially generate out-of-sample prediction sets for the training data." What is multiple n-fold CV? For example: is it simply n-fold CV applied for each model? Are the folds the same for each model, and if not, how do they differ? What value of n is used?
The linear regression ensembling needs to be described in reproducible detail. For example: was this a standard linear regression with an intercept? Were the folds for each model the same? Was any sampling performed during this fitting procedure? How were out-of-bound predictions altered (e.g., was a negative value simply rounded to 0)? What models were included in the regression and what were the final weights?
It appears that the "… alternative to weighting by model was to just take the median predictions…" was essentially median-bagging, and was yet another model used in the final ensemble. Please confirm.
"The weightings were applied to the log scale version of the predicted values": how were out-of-bound values handled? Were they rounded into range, or was a shrinking strategy used? Were out-of-bound values modified uniformly for all patients, or differently by subset?
Complete details of the "... final solution ... ensemble of approximately 20 models" are needed: which models were used, what are their associated parameters, which data sets were they trained on, and what is their corresponding weight in the final solution?
In the last paragraph of the report, "We found that calibrating the final predictions so that the overall average predicted days in hospital matched the optimized constant value benchmark gave a small improvement": full details are required. For example: did they simply mean-shift their entire submission targets? Was any scaling or patient-subset specific shifting performed? What was the final value used for translating predictions? What was the size of the improvement?
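To make the questions above concrete, here is a minimal sketch of the kind of post-processing they concern: a weighted blend on the log scale, a median blend as the alternative, clipping of out-of-bound values, and a mean-shift calibration. Every specific in it is an assumption for illustration (the log(1 + x) transform, the optional intercept, the 0 to 15 bounds, an additive shift on the original scale, and all function names). The report confirms none of these choices, which is exactly the gap being pointed out.

```python
import math
import statistics


def blend_log_scale(model_preds, weights, intercept=0.0):
    """Weighted sum of log1p(predictions) across models, mapped back with expm1."""
    n = len(model_preds[0])
    out = []
    for i in range(n):
        z = intercept + sum(w * math.log1p(p[i]) for w, p in zip(weights, model_preds))
        out.append(math.expm1(z))
    return out


def blend_median(model_preds):
    """Per-patient median across models; no weights are needed."""
    return [statistics.median(col) for col in zip(*model_preds)]


def clip(preds, low=0.0, high=15.0):
    """Round out-of-bound predictions into the range [low, high]."""
    return [min(max(p, low), high) for p in preds]


def calibrate_mean(preds, target_mean):
    """Shift all predictions by one constant so their average equals target_mean."""
    shift = target_mean - sum(preds) / len(preds)
    return [p + shift for p in preds]
```

Even in this small sketch the order of operations matters: shifting after clipping can push values back out of range, and shifting on the log scale gives different results from shifting on the original scale. Those are the details a reproduction would need.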
This is my attempt to diagram Team Market Makers' solution. Comments and suggestions for improvement welcome. https://docs.google.com/drawings/pub?id=1c_PneY0NqrQHyo5SjjXqtZcgyn0Wwwp0Qa7C5C8i6J4&w=960&h=720 Andy