
sled wrote:

An algorithm based on public-score 'blending' can only be reproduced after repeating all of the original submissions. So either the original submissions should be published, or some procedure for repeating them should be defined: one needs 3 weeks to apply 20 submissions with a quota of 1 submission/day. And that is without even mentioning that reproducing a submission without a detailed description (i.e., code) is almost impossible.
It means that
“….individuals trained in computer science can replicate the winning results”
+”… information on the Website for review and testing by other Entrants”
= senseless construction

Well, the winning submission is one single submission, and as long as all the inputs are provided, it should not be hard to replicate. By the same token, SQL Server Express took years and $$$ to build; as long as other Entrants can use it, I don't see any difficulty using it for testing. Mark Waddle raised an interesting question and I'd like to hear an answer to that question as well. Thank you!

Mark Waddle: Yes, it should be in enough detail to allow a skilled person to replicate it. I wouldn't say however that it should be within a week. All the information should be provided, but actually using the information could take longer than that, since there is a lot of work in some of these entries!

Edward: All of your requests are appropriate. Note that regarding (1) it is not necessary for the code of the models to be published, only the details of their parameters, architecture, etc.

John: Yes, the blending technique should also be documented.

I'll ask the teams to provide this additional information. Many thanks for all the feedback so far.

Hi John,

The paper I referred to for the blending technique is the first reference I could find but is indeed not very useful for implementation. A good description of the technique I used can be found in section 7 of this paper: http://www.netflixprize.com/assets/GrandPrize2009_BPC_BigChaos.pdf. Hope this helps. (The lambda parameter in this paper is the alpha parameter I describe in my paper).

Congratulations to the winners.

It is my understanding of the rules that the winners of the milestone prizes are required to provide their algorithms to their competitors, to allow all of us to improve on these milestone results.

After reading the 2 papers, I am quite sure that nobody will be able to reproduce the results (just from the papers). Has anyone independently reproduced the milestone results just from the papers?

Market Makers' paper goes to great lengths to describe the (quite obvious) benefits of blending, but gives only a very vague description of the 60 sub-models created (and the 20 used). We only learn that there are basically 3 different model techniques (GBM, forest tree and neural network) used and that some form of ensemble models are used (how many, and of what size?). It does not describe how the weights for blending were reached. (Telling us that taking the median of the sub-models would reach a top-10 position is nice, but does not fulfill the rule in my eyes.) Providing the sample code is definitely helpful, but the code provided is just 1 of the 60 sub-models, with no blending at all.
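For what it's worth, the median-of-sub-models blend they mention is trivial to sketch. This is only an illustration of the general idea, not their actual pipeline:

```python
import numpy as np

def median_blend(predictions):
    """Median-blend a list of per-model prediction vectors.

    predictions: list of 1-D arrays, one per sub-model, all the same length.
    Returns the element-wise median across models: a simple ensemble
    that needs no fitted weights at all.
    """
    stacked = np.vstack(predictions)   # shape: (n_models, n_members)
    return np.median(stacked, axis=0)

# toy example with three hypothetical sub-models
p1 = np.array([0.2, 1.5, 0.0])
p2 = np.array([0.4, 1.1, 0.3])
p3 = np.array([0.3, 2.0, 0.1])
blend = median_blend([p1, p2, p3])     # -> [0.3, 1.5, 0.1]
```

The point of my complaint is exactly that the weighted blend they actually submitted is not specified to even this level of detail.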

Willem Mestrom definitely gives more detail about his models, although the description (even with the cited Wikipedia article) probably requires knowledge beyond the paper. Maybe some code (probably all of it) would help here.

I know that revealing the exact algorithms used to reach the milestone prize will hurt the leaders (because they lose their advantage, as then everybody should be able to reproduce their score), but this is the only way that other participants can improve on these results.

Congratulations to the milestone winners!

In addition to Edward's questions above, I have a few requests of my own for clarifications regarding Market Makers’ documentation.

Market Makers wrote:

Using n-fold cross validation to generate evaluation sets, further capping within these limits could be evaluated to determine if this decreased the error.

  • The “further capping” technique is not explained.

 

Market Makers wrote:

The technique of blending algorithm-dependent predictions with algorithm-independent predictions was successful in a previous Kaggle competition.

  • This seems to hint that the team also used algorithm independent predictions. Which algorithm independent predictions were used? I did not find these described in the documentation.

ruediger wrote:

I know that revealing the exact algorithms used to reach the milestone prize will hurt the leaders (because they lose their advantage, as then everybody should be able to reproduce their score), but this is the only way that other participants can improve on these results.

(emphasis added.)

Gee, I was thinking that one possible path to success was by conjuring up some ideas on my own that other people WEREN'T doing.

Silly, I know.

Jeremy Howard (Kaggle) wrote:

Mark Waddle: Yes, it should be in enough detail to allow a skilled person to replicate it. I wouldn't say however that it should be within a week. All the information should be provided, but actually using the information could take longer than that, since there is a lot of work in some of these entries!

Jeremy: The rules only provide for rolling 30-day windows for competitors to comment on & evaluate winner solutions. Given this constraint, it should be clear that the details provided must be sufficient to reproduce results within a week or two. If this were not the case, how would competitors truly be able to verify the solutions against errors and omissions? Also, please don't confuse the effort required to generate the winning algorithm with the effort required to reproduce it. The latter can be much faster, perhaps limited only by the CPU time required to build models.

John: Yes, the blending technique should also be documented.

I'll ask the teams to provide this additional information. Many thanks for all the feedback so far.

What is the timeline for obtaining additional details?  Does this reset the 30-day clock? Will these be provided as addenda to the reports?

Thanks,

Andy

Like most of the other competitors posting to this forum thread, I wish to congratulate the winning teams on a job well-done!  Their scores are impressive. 

At the same time, I wish to express my frustration with the lack of reproducible detail in the reports. I do not fault the winning teams for this issue: documentation is a time-consuming task, there were no clear guidelines from Kaggle on the document format, and it is clearly preferable to minimize disclosure for competitive reasons. However, the rules do seem to be clear that skilled practitioners should be able to reproduce the winning results based on these descriptions, and it is evident from the posts in this thread that this is not possible. Here I do fault the Kaggle contest administrators: it is obvious to novice and skilled data miners alike that these papers were insufficient, and the Kaggle administrators should have intervened prior to release to require expanded descriptions. I hope that the Kaggle team learns from this and improves the process for milestone 2.

Thanks,

Andy

Here is my contribution to the "required details" for the Market Makers solution (apologies for duplicated concerns, I just want to keep everything together):

  1. What is the admission risk score, defined on page 2 as “... An admission risk score was developed around the Primary Condition Group that was based on medical experience only.  This score was a 1-5 ranking for each PCG, and split by age band.”  This needs to be defined mathematically, or programmatically, such that it can be reconstructed either from their SQL data, or the raw contest data.
  2. The neural networks package cited (Tiberius) appears to have only limited trial availability — this appears to contradict the contest rules in that the software should not be proprietary. Kaggle should comment on this directly.
  3. There is no detail to support “… Multiple models were built on the two data sets using various parameter settings and variable subsets.”  The competitors need to provide the precise models, model parameters, variable subsets and supporting data set for each model that was used to generate their winning submission.
  4. There is no detail to support “… Truncation of the predictions was found to be useful in certain models.”  Did truncations vary by model, or data set?  What is the truncation function?  The competitors need to provide precise details of the truncation function, and the models (with corresponding data sets) to which it was applied.  
  5. The term “multiple n-fold cross validation” is unclear in “…During the model generation process, multiple n-fold cross validation was used to essentially generate out-of-sample prediction sets for the training data.”  What is multiple n-fold CV?  For example: is it simply n-fold CV applied for each model?  Are the folds the same for each model, and if not, how do they differ?  What value of n is used?
  6. The linear regression ensembling needs to be described in reproducible detail.  For example: was this a standard linear regression with an intercept?  Were the folds for each model the same?  Was any sampling performed during this fitting procedure?  How were out-of-bound predictions altered (e.g., was a negative value simply rounded to 0)?  What models were included in the regression and what were the final weights?
  7. It appears that the “… alternative to weighting by model was to just take the median predictions…” was essentially median-bagging, and was yet another model used in the final ensemble.  Please confirm.
  8. “The weightings were applied to the log scale version of the predicted values” — how were out-of-bound values handled?  Were they rounded into range, or was a shrinking strategy used?  Were out-of-bound values modified uniformly for all patients, or differently by subset?
  9. Complete details of the "... final solution ... ensemble of approximately 20 models” is needed: which models were used, what are their associated parameters, which data sets were they trained on and what is their corresponding weight in the final solution?
  10. In the last paragraph of the report, “We found that calibrating the final predictions so that the overall average predicted days in hospital matched the optimized constant value benchmark gave a small improvement” — full details are required, for example: did they simply mean-shift their entire submission targets?  Was any scaling or patient-subset specific shifting performed?  What was the final value used for translating predictions?  What was the impact of the improvement?
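To make questions 5 and 6 concrete, here is the generic out-of-fold stacking recipe I assume they mean. Everything here (standard n-fold splits, an intercept, rounding negative predictions to zero) is my assumption, not their documented procedure — which is exactly the problem:

```python
import numpy as np

def out_of_fold_predictions(model_fit, model_predict, X, y, n_folds=5, seed=0):
    """Generate an out-of-sample prediction for every training row by
    fitting the model on the other n-1 folds (standard n-fold CV)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_folds)
    oof = np.empty(len(y), dtype=float)
    for fold in folds:
        train = np.setdiff1d(idx, fold)
        params = model_fit(X[train], y[train])
        oof[fold] = model_predict(params, X[fold])
    return oof

def fit_blend(oof_matrix, y):
    """Least-squares blend with an intercept over the models' OOF columns."""
    A = np.column_stack([np.ones(len(y)), oof_matrix])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

def apply_blend(w, preds_matrix):
    blended = w[0] + preds_matrix @ w[1:]
    return np.clip(blended, 0.0, None)   # assumed: negatives rounded to 0
```

If their procedure deviates from this in any of the ways listed in questions 5 and 6 (shared vs. per-model folds, sampling, out-of-bound handling), the results will not be reproducible from the paper alone.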

This is my attempt to diagram Team Market Makers solution.  Comments and suggestions for improvement welcome.

https://docs.google.com/drawings/pub?id=1c_PneY0NqrQHyo5SjjXqtZcgyn0Wwwp0Qa7C5C8i6J4&w=960&h=720

Andy

Nice diagram.

I wonder, based on the emphasis the team put on what they learned through visualization, if there's a feedback loop missing somewhere in the diagram where cleansing would be done based on residuals or some other means.  I think that is one of the areas where more detail would be useful in the winning teams' documents if we're aiming for full reproducibility.  (Kudos to the winning teams, and I think the documents are great, but I agree with a lot of the thread members that the docs do fall short of full reproducibility.)

That is, once the 60 base models are produced, were residuals calculated and then used for cleansing and then the results used to build more accurate models?  For example were the missing 40 models used for cleansing only, to produce the remaining 20?  It doesn't seem clear to me.  I think this is separate from prediction bias / correction because it would retrain the models, where correction is correctly depicted as a last step.

---Chip

Regarding Willem Mestrom's paper, my first batch of questions concerns some basic formulation and notational conventions:

  1. MCN coding looks like count data, for example MC1 seems to tabulate counts across 10 factors ("columns")  with a combined total of 131 levels ("categories"). Is this correct?
  2. It is unclear how some variables are encoded in the models, for example age.  Is that dummy coded, or coded as an integer?  Which variables are used as numeric values, and which are coded as indicators?
  3. It is not explicitly stated: is the objective function to minimize for each model the root mean square error between ln(DIH+1) and pm?
  4. How is the number of phases selected?  Is this a heuristically or experimentally chosen parameter for each model?
  5. Are the learning rates optimized in-sample or out-of-sample?  Do the validation sets used for optimization vary by model?  
  6. What parameters are specified for the optimization routine, for example step size?  How are these selected or tuned?
  7. Are the extreme value adjustment parameters (c.f. 3.3 "Post Processing") optimized globally, or per model?  Before or after the model is trained?  Using which data subsets?
  8. Please clarify the ridge regression used for final model blending.
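For reference, question 3 assumes the objective below; this is my reading of the paper, not a confirmed detail:

```python
import numpy as np

def log_rmse(dih, p):
    """RMSE between ln(DIH + 1) and the model prediction p_m — i.e. the
    RMSLE-style objective the contest itself scores on, assuming p_m is
    already on the log scale."""
    return float(np.sqrt(np.mean((np.log1p(dih) - p) ** 2)))

# perfect log-scale predictions give zero error
dih = np.array([0, 2, 7])
assert log_rmse(dih, np.log1p(dih)) == 0.0
```

If the per-model objective is instead something else (e.g. squared error on the raw DIH scale), that alone would change how every parameter in appendix A should be interpreted, hence the question.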

The primary difficulty I have with this paper is understanding the notation.  It seems confusing, and perhaps imprecise.  For example, referring to 4.1 "CatVec1":

  • Is pm really an inner product of the sum vectors?  If so, there is a missing transpose operator.
  • The i and m subscripts are confusing.  How are we to interpret "i element of MC2_m", the elements of the summation in the pm equation?  Perhaps an explanation for a concrete value of i, e.g., the i corresponding to AgeAtFirstClaim=50-59, would be helpful.
  • The e equation looks like it should be subscripted, e_m.  Should the f,g equations also be subscripted?
  • Are the fi, gi update equations correct since the second terms sum over gi, fi (cross-error)?

I have more questions about the CatVec1 explanation, but perhaps they will be resolved by answers to the above.

Thanks,

Andy 

Willem Mestrom wrote:

 A good description of the technique I used can be found in section 7 of this paper: http://www.netflixprize.com/assets/GrandPrize2009_BPC_BigChaos.pdf. Hope this helps. (The lambda parameter in this paper is the alpha parameter I describe in my paper).

Willem,

thanks for the reference. It gives a lot of details. Any idea how to optimize the alpha parameter? Just trial and error, or is there a rule of thumb?
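In case it helps others: assuming alpha plays the role of the ridge penalty lambda in that blend (as Willem's note about the BigChaos paper suggests), the brute-force approach would be a grid search against a validation set. The log grid and the validation split here are arbitrary choices of mine:

```python
import numpy as np

def ridge_blend_weights(P, y, alpha):
    """Ridge-regression blend weights: solve (P'P + alpha*I) w = P'y.
    P holds one column of predictions per model."""
    k = P.shape[1]
    return np.linalg.solve(P.T @ P + alpha * np.eye(k), P.T @ y)

def grid_search_alpha(P_train, y_train, P_val, y_val, alphas):
    """Trial and error over a grid: keep the alpha with the lowest
    validation RMSE.  No rule of thumb assumed beyond 'try a log grid'."""
    best_alpha, best_rmse = None, np.inf
    for a in alphas:
        w = ridge_blend_weights(P_train, y_train, a)
        rmse = float(np.sqrt(np.mean((P_val @ w - y_val) ** 2)))
        if rmse < best_rmse:
            best_alpha, best_rmse = a, rmse
    return best_alpha, best_rmse
```

Whether the winners used anything like a held-out grid search, or tuned alpha against the public leaderboard, is exactly what I'd like to know.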

First I would like to make a general comment. I will try to explain what I did as far as I can, but I think it is not realistic to think you can reproduce all results in a matter of weeks (there are over 13,000 lines of code in my implementation; it could be done in a lot less, but still there is a lot of work). I do understand the request for code, but this is not required by the rules and, more importantly, not very useful. You need to understand what you're doing (if you win you need to explain it yourself!) and, as the rules say, when you submit something it should be your original work.

Now for Andy's questions:

  1. I find this hard to explain clearly but I'll give it another try: MC1_m is the set of unique categories associated with member m. So if member m has an age of 40, the set MC1_m would include the category "AgeAtFirstClaim=40-49". The counts are separate variables: count_m,i is the number of times category i occurs in the claims of member m. So for i = "placesvc=office", a count_m,i of 2 indicates member m has 2 claims with placesvc = office. The cardinality of the set MC1_m for member m is therefore equal to the number of non-zero values of count_m,i for member m over all 131 categories.
  2. The numeric values of age, length of stay, etc are never used in any model, all columns are treated as categorical data only.
  3. This is correct.
  4. As is very often the case with parameter optimisation, this is more of an art than a science. It is in fact a (fairly high-dimensional) optimization problem with a very expensive objective function. A lot of experimenting is the key to finding a good setting.
  5. Learning rates are optimized with an out-of-sample validation set. Many different validation sets were used, sometimes even multiple for a single model. I cannot give you all the details because I don't remember. It is a very interactive process with a lot of trial and error and manual interventions; you only stop when you're happy with the result. Because there is no way this process could ever be repeated exactly, the results (all parameter settings) are given in appendix A of the paper.
  6. For the Rosenbrock procedure there are 3 parameters: when a change is successful the step size is multiplied by 1.3, when a change is not successful the step size is multiplied by -0.5, and the initial step size is 0.1 times the current parameter value.
  7. These parameters are optimized, like all other parameters, per model, and all (or sometimes a selected subset) at the same time.
  8. Please see my answer to John's question.
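In code form, the step-size rule from point 6 reads as follows for a single parameter (a minimal sketch of the rule as stated, not the actual multi-parameter implementation):

```python
def rosenbrock_1d(objective, x0, n_iters=100):
    """One-parameter version of the described rule: start with
    step = 0.1 * x0; after an improvement multiply the step by 1.3,
    after a failure multiply it by -0.5 (shrink and reverse direction)."""
    x, step = x0, 0.1 * x0
    best = objective(x)
    for _ in range(n_iters):
        trial = objective(x + step)
        if trial < best:              # success: accept and grow the step
            x, best = x + step, trial
            step *= 1.3
        else:                         # failure: shrink and flip direction
            step *= -0.5
    return x, best

# toy objective with its minimum at x = 3
x, val = rosenbrock_1d(lambda t: (t - 3.0) ** 2, x0=1.0)
```

The real procedure cycles over many parameters at once, so this only illustrates the three constants (1.3, -0.5, 0.1) mentioned above.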

About the other points:

  •  Yes you are right, there should have been a transpose operator.
  • The summation is only over the members of the set MC2_m, so if member m has an age 50-59 then the set will include "AgeAtFirstClaim=50-59" and the summation will include this category. If the member has a different age the set will not include "AgeAtFirstClaim=50-59" and the summation will not include this category.
  • The e is indeed per member so e_m would be better. f and g are not per member.
  • As far as I can see the update rules are correct. If you take the derivative of p_m with respect to f_i it will be the summation of g. The update takes the current value (minus regularisation) and adds the error times the gradient times the learning rate.
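One SGD step of this description can be written out as follows (a sketch assuming k-dimensional factor vectors; the variable names are mine, not the paper's):

```python
import numpy as np

def catvec_update(f, g, cats_m, y_m, lr=0.01, reg=0.002):
    """One update step of a CatVec1-style model as described above:
    p_m = (sum of f rows over member m's categories) . (sum of g rows),
    d p_m / d f_i = sum of the g rows, and symmetrically for g_i.

    f, g: (n_categories, k) factor matrices; cats_m: category indices
    in MC1_m / MC2_m for member m; y_m: the target for member m."""
    sf = f[cats_m].sum(axis=0)     # summed f vectors for this member
    sg = g[cats_m].sum(axis=0)     # summed g vectors
    p_m = float(sf @ sg)           # prediction (transpose made explicit)
    e_m = y_m - p_m                # per-member error e_m
    # current value minus regularisation, plus error * gradient * learning rate
    f[cats_m] += lr * (e_m * sg - reg * f[cats_m])
    g[cats_m] += lr * (e_m * sf - reg * g[cats_m])
    return p_m
```

Note the cross terms: the f update uses the summed g vectors and vice versa, which is what the question about the "cross-error" in the update equations was pointing at.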

@Kaggle: Is the idea to provide an updated paper to include the above corrections?

Willem Mestrom wrote:

@Kaggle: Is the idea to provide an updated paper to include the above corrections?

You're welcome to attach updates as an attachment to this topic for the time being.

Willem Mestrom wrote:

First I would like to make a general comment. I will try to explain what I did as far as I can, but I think it is not realistic to think you can reproduce all results in a matter of weeks (there are over 13,000 lines of code in my implementation; it could be done in a lot less, but still there is a lot of work). I do understand the request for code, but this is not required by the rules and, more importantly, not very useful. You need to understand what you're doing (if you win you need to explain it yourself!) and, as the rules say, when you submit something it should be your original work.

Both milestone winners (and Netflix) tell us that a single model is not the key; blending different models is. So one way to improve on the milestone winners would be to blend their submissions (in case we are able to reproduce them), without even understanding their structure, and blend them with other models. I see no problem with the original-work clause, because the submissions are public data (as the winners are required to publish their algorithms so that others can replicate the results). Applying that new blend to different data (if required) is a bit more complicated, as the judges must then apply the old winners' models first (which they should be able to do, as they have confirmed the milestone win) on the new data as well, but this problem is not fundamentally different from determining the RMSLE of your submodels. So one could improve on the milestone results without understanding the underlying models.

Totally agree with ruediger about the blending stuff.

Also, I'm curious whether you guys have successfully reproduced the features (including the way the winners' features are calculated from the original data)?

Thanks

@Kaggle: Are we going to get any clarifications on this feedback process? I'm waiting for additional details from Team Market Makers, and I'm expecting Kaggle to solicit additional information from both winners and publish it to the site. What is the schedule for this?

Thanks,
Andy

On October 30, the judges, in their sole discretion, will decide whether or not the documentation is sufficient (taking account of the comments made on this forum). If they decide the documentation is not sufficient, they can compel the winners to address their concerns in the seven days following October 30. If the winners are asked to resubmit, participants have another 30 days from November 6 to raise any additional complaints.
The judging panel are experienced academic reviewers.
Thanks, Anthony.

I take it that the interpretation of the Judges so far .... and I will pointedly note that under the Contest Rules, the Judges have final authority over the interpretation of ALL the Contest Rules ... is that there is NOT a requirement that the most detailed "how we got our results, from A to Z" scripts, which were presumably submitted to Kaggle and the Judges for verification purposes, are also going to be handed over to all the other contestants. (If that were the opinion of the Judges, the surrender of the scripts would already have happened.)

Some contestants seem to believe that they will be handed a copy of runnable code that will allow them to get to the same level of performance as our front-runners, at the push of a button, with no effort.

That would be a VERY BAD IDEA, all around:

(1) It's bad for the Leaders, who have invested considerable time in creating their intellectual property, including the management infrastructure of knowing how to do these kinds of projects, especially for dispersed teams;

(2) It's bad for the other top-ranked competitors too, who have also invested much effort to get where they are, and get no residual value for that effort if every Johnny-come-lately is handed the keys to zoom up to their level with no effort;

(3) It's even bad for the middle-of-the-pack contestants. Sorry, running somebody else's script to replicate their results doesn't do anything for you (or me), except get you to where they were a month ago, and with no ideas on how to take the next step. To really LEARN will require intellectual effort, and studying and pondering the slightly generic descriptions of how these folks got where they did should get your mental juices going and maximize the learning. You may even inadvertently implement some of their ideas differently than they did, and perhaps that'll even be better. There is no totally free lunch, and what we have been given already is very useful;

(4) It's bad for the profession & the supply of data miners, which grows in depth and breadth because of the intense efforts needed in competitions of this caliber;

(5) It's bad for the Client, because having 1000 identical replicants as of Date-X is a genetic bottleneck of sorts that limits the odds that a really good but different solution will emerge; and

(6) It's bad for Kaggle for all of the reasons listed above.

My $0.02. I've wanted to put this to words for a while, sorry I didn't do it sooner.

