The papers written by the milestone winners are now available here. As described in section 13 of the rules, if you have any concerns about these papers, you have 30 days from their posting to provide your feedback.

Hi Edward and Willem. Congratulations on your success at Milestone 2! Here are some questions for you regarding your paper and its contained algorithms:

A. (2.3 Training) What is the tree ensemble maximising? In other words, what determines which column to branch on at each branch in the tree?
B. (2.3 Training) What form does the tree ensemble branching take for the 2 numerical columns? A binary split?
C. (2.3 Training) Why did you 'perform the ensemble of trees in DiH'? Wouldn't the log(1+DiH) metric be more sensible?
D. (2.3 Training) For each of the 2000 trees, were the 80 random columns and the 12.5% randomly selected rows held constant throughout the tree creation (unlike, say, Random Forest, where the random column choice varies per branch decision)?
E. (2.4 Post Processing) Is the 'corrected' value from the formula considered to be the eventual DiH estimate for that (member, year) classification through that tree?
F. (2.4 Post Processing) Are P1..P4, Pmin, Pmax the same across all 2000 trees?
G. (2.4 Post Processing) Please explain the genetic algorithm used to calculate P1..P4, Pmin, Pmax.
H. (2.4 Post Processing) What values were finally used for P1..P4, Pmin, Pmax?
I. (3.1 Data) Explain combination 3. Is a subtraction being performed? This makes sense if it is 'take year 1 column counts and subtract year 2 column counts', but how are the non-counted columns like sex, age, etc. handled?
J. (3.1 Data) Explain combination 4. The description appears to be consistent with the '1 Year Claim' method, not the '2 Year Claim' method. If the description is correct, and Y1data > Y2Pred is used for training, how can that work when only the '2 Year Claim' method is being used, and the text states that the test set is 25% of Y3 predictions?
K. (3.4 Ensemble of trees input) How was the weight for the tree data calculated by optimising the RMSE result? What eventual value did you use?
L. (3.x) What is the stochastic gradient descent optimising? (A linear sum? Of what exactly?)
M. (3.x) What update rules are used to calculate the next (iterative) values of the weights of the variables in the equation above? [In your Milestone 1 paper, you describe the summation equations, gradient, and update rules for CatVec1; please supply them for this model too.]
N. (3.5) Please explain the 'weight bias'. Is this a constant term (the same for all (member, year) data) in the equation for the model?
O. (3.10) How is the initial gradient (before correction) calculated?
P. (3.11) What were the eventual factors used for the blend of the 4 models? How were these factors calculated?

Thanks very much, Dave

Hello DaveC, Here are the answers to your questions: A) The Gini Index has been used, as explained on the following (and follow-up) webpages:
I) Yes, it is a subtraction being performed.
M) The update rule is presented in paragraph 3.3.
Thanks, Edward & Willem 
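[Editor's note: for readers unfamiliar with the Gini Index split criterion mentioned in answer A, here is a minimal illustrative sketch. It is not the winners' code, and Gini is shown here for a toy binary-label example; all data and names below are hypothetical.]

```python
def gini_impurity(labels):
    """Gini impurity of a set of class labels: 1 - sum over classes of p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def split_gini(column, labels, threshold):
    """Weighted Gini impurity after a binary split at `threshold`."""
    left = [y for x, y in zip(column, labels) if x <= threshold]
    right = [y for x, y in zip(column, labels) if x > threshold]
    n = len(labels)
    return (len(left) / n) * gini_impurity(left) + \
           (len(right) / n) * gini_impurity(right)

# The tree branches on whichever column/threshold minimises the
# post-split impurity (equivalently, maximises the impurity decrease).
ages = [23, 35, 41, 58, 62, 70]
labels = [0, 0, 0, 1, 1, 1]
best = min(set(ages[:-1]), key=lambda t: split_gini(ages, labels, t))
print(best)  # -> 41 (this threshold separates the labels perfectly)
```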

Hi Edward and Willem, Thanks very much for your reply. Just one clarification, please. In question K, you reply that the weight bias for the tree input is 0.2 for combinations 3 and 4. I notice that in the text of your paper you put the value as 0.02. Which is correct? Thanks, Dave

Hi team Market Makers. Just one question concerning your excellent Milestone 2 result and paper. When you are performing your ridge regression for blending your results, what value did you use for the quiz set variance (i.e., the first term in formula 49 of the 'GrandPrize_2009_BPC_BigChaos.pdf' paper, which Willem/Edward referenced and you also used)? How did you estimate this quiz set variance (for the 30% of the test data set), and what value did you end up with? Thanks, Dave Clark
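[Editor's note: for context on the ridge-regression blending being asked about, a generic ridge blend solves (P'P + lam*I) w = P'y over held-out predictions. The pure-Python sketch below, with made-up data, illustrates that generic form only; it is not the formula 49 estimator from the BigChaos paper, which additionally estimates these terms from quiz-set statistics such as the variance Dave mentions.]

```python
def solve(A, b):
    """Solve A x = b for a small dense system by Gaussian elimination
    with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def ridge_blend(preds, target, lam):
    """Blending weights w = argmin ||P w - y||^2 + lam * ||w||^2,
    where each row of `preds` is one model's held-out predictions."""
    k = len(preds)   # number of models
    n = len(target)  # number of held-out cases
    PtP = [[sum(preds[i][t] * preds[j][t] for t in range(n)) +
            (lam if i == j else 0.0) for j in range(k)] for i in range(k)]
    Pty = [sum(preds[i][t] * target[t] for t in range(n)) for i in range(k)]
    return solve(PtP, Pty)

# Two hypothetical models' held-out predictions and the true target.
p1 = [1.0, 2.0, 3.0, 4.0]
p2 = [1.5, 1.5, 3.5, 3.5]
y = [1.2, 1.9, 3.1, 3.9]
w = ridge_blend([p1, p2], y, lam=0.1)
blend = [w[0] * a + w[1] * b for a, b in zip(p1, p2)]
```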

First of all, congrats to both teams for winning once again. I know this is after the 30 days, but I was wondering, if/when Edward & Willem update their paper, whether they would consider, out of the goodness of their hearts, updating the appendix. I have been hibernating the past few months (so am a little rusty), but am pretty sure that both A.2 and A.3 are not written as intended, as A.2 appears to contain primary conditions and not procedures. I realize that you mentioned that you made changes to what Market Makers did, but from reading the text I was under the impression you were still using the four groups. Also, not complaining, but if you are updating the paper, I think there is a typo: I think you mean "SigClaimVec7" and not "SicClaimVec7" on page 16. I received great enjoyment out of reading both papers and hope to be able to return the favor in the future...

Hi all, As Chris noted in the previous comment, the 30-day question/comment period specified in the rules has ended. Feel free to continue using this thread for questions/discussion. Many thanks to everyone for your questions, and thanks (and congratulations) to the winners. David

In another thread called "2nd Milestone Winners" I posted the following, now 24 days ago: Anyone who studies the Milestone descriptions cannot help but be impressed by the cleverness, persistence, and comprehensiveness of the methods used. As time goes on, you see the best scores on the leaderboard shaving 0.001 or thereabouts off the previous best score. However, it occurs to me that these methods will not be practical in real-life situations because they are a composite of several algorithms, some of which take considerable time to compute. Is the situation going to arise where the method(s) that finally win according to the RMSLE objective function will NOT be the ones actually used in practice? To me the rules seem to have assumed that each entry will be based on ONE algorithm. Has this issue been discussed anywhere in the forum? So far no one has responded to this comment on that thread. Since then I have searched and read the forum to see if the issue of a single algorithm has been discussed, but found nothing. And I have reread the Rules a few times. It is noteworthy that the rules expect a single algorithm. In Rule 12: "Once an Entry is selected as eligible for a prize, the conditional winner must deliver the Prediction Algorithm's code and documentation ..." If this issue has been settled and I have missed it, would someone (Kaggle Admin?) tell me where and what the conclusion was.

@JC36: I don't see any problem; in fact, I think that what is causing you confusion is all the work that goes above and beyond the actual algorithm. All the prize-winning entries, as far as I can tell, embody a method where you can take the data set, perform specific manipulations of it, and arrive, deterministically, at the predictions which scored best on the target dataset. This should fit any useful definition of "algorithm". The difficult part is describing how the teams ARRIVED at these algorithms, relying on compute-intensive, non-deterministic (at least dependent on randomization) algorithms to CREATE the final algorithm. It does make the head hurt, but the resulting algorithm is largely independent of how it was created. For example, I tend to run two algorithms based on the "gbm" and "randomForest" packages in "R" and ensemble them. If I published the R code to do that (which is largely what one of the milestone-winning teams has done) and reran it multiple times on multiple machines, the results would be different. However, if I pick one of those runs, I can SAVE the resulting models and apply them repeatedly to new or existing data sets, using the same ensembling math, and thereby function as a repeatable, deterministic, "single" algorithm. In theory that algorithm could be decomposed into some very basic math, although it's easier to talk about in the rich language of the modeling used to create it. I hope that viewpoint helps. Basically, the large computational time tends to be used in creating an algorithm, and while the resulting algorithm could be computationally intensive itself, I haven't seen anything yet to indicate that it would be unusably so. Just my opinion.
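[Editor's note: ChipMonkey's point, that a randomized training run, once saved, yields a fully deterministic predictor, can be illustrated with a toy sketch. The "training" below is a made-up stand-in, not any team's actual method.]

```python
import random

def train(data, seed):
    """Stand-in for a randomized training run (e.g. fitting one tree of a
    random forest): the fitted 'model' depends on the seed via a
    bootstrap sample, just as bagging does."""
    rng = random.Random(seed)
    sample = [rng.choice(data) for _ in data]  # bootstrap resample
    mean = sum(y for _, y in sample) / len(sample)
    return {"mean": mean}  # this dict plays the role of the saved model

def predict(model, x):
    """Applying a saved model involves no randomness at all."""
    return model["mean"]

data = [(1, 2.0), (2, 2.5), (3, 3.0), (4, 3.5)]
m1 = train(data, seed=1)
m2 = train(data, seed=2)
# Different seeds may give different fitted models, but any one saved
# model gives identical predictions every time it is applied.
assert predict(m1, 5) == predict(m1, 5)
```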

JC36, thanks for the question (and apologies for the delay; I wasn't watching the other thread as closely), and thanks for your comments, ChipMonkey. I'd add that we're definitely fine with ensembles. For one thing, they're a good way to build a model. For another, I don't think it's possible to give a definition that cleanly marks a boundary between one algorithm and several combined.

Not to beat a dead horse, but a single random forest could in and of itself be considered an ensemble and not a single algo under some very strict definitions. I believe the generic name is something like "ensemble of trees", after all. Also, training time is usually much longer than prediction time. I have algos that take overnight and possibly days to train that will make predictions for new data in under 10 minutes. Further, viewing these as impractical is kind of like looking at Lance Armstrong's gram-sized reductions in clothing as impractical for the average rider, while ignoring all the other advances in understanding the sport and the human body that the top Tour contestants have made. The fewer restrictions placed upon contestants, in my mind, the more they can concentrate on what works, practical or not; something that appears impractical now might be more practical in the future. A retina display would have been impractical 10 years ago; now I can't live without it (slight exaggeration). Without people working on the cutting edge we would be living with mediocrity.

@Chris Raimondi said: "Not to beat a dead horse, but a single random forest could in and of itself be considered an ensemble and not a single algo under some very strict definitions." I don't think the horse will mind, and the forums have been quiet, so I rather enjoy the conversation. I still, respectfully, disagree, though... the creation of the random forest is not the "algorithm" in terms of the contest rules (and JC's question). The resulting forest is. And, like @DavidC, I cannot think of a definition of a practical algorithm that it would not fit. One would not argue that a quadratic equation is not an "algorithm" because it is an ensemble of three lesser functions (a square, a multiplication, and a constant). I could see an argument that the algorithm must be deterministic, but again we can't confuse the training with the prediction. Your overnight randomForest may generate slightly different output every time (a la "random"), but the 10-minute predictions will be identical given the same inputs. Personally, I'd argue that they're all algorithms anyway; determinism may be required under the "reproducible" rule, but that's up to Kaggle. And certainly it makes sense to disclose how the results were derived, to discuss the training of the models rather than simply document the result (i.e., publish a saved randomForest object), if only because it gives the rest of us something interesting to read. I completely agree, though, with everything else you said. I haven't gotten a retina display yet simply so that I can continue to live without it.

Thanks guys (ChipMonkey, DavidC and Chris Raimondi) for your comments. Let me say right away that, of course, I accept Kaggle Admin's ruling as to whether the milestone winners' methods comply with the rules. (I might use the same techniques myself if my methods get good enough.) Market Makers' paper says: "There were four underlying algorithms used in our models, all of which are freely available in the R language for statistical computing. Online references for each algorithm are given in the hyperlinks below." I have deleted the hyperlinks for clarity. Market Makers go on to use "ensembling", which blends the results of the different algorithms. Surely you guys wouldn't argue that a combination of four such different algorithms is AN algorithm?

JC36 wrote: "Thanks guys (ChipMonkey, DavidC and Chris Raimondi) for your comments... Surely you guys wouldn't argue that a combination of four such different algorithms is AN algorithm." Yes, I would. As mentioned before, many different things thought of as a single algo, or even a single equation, are actually a combination, or a linear blend. Is the Pythagorean theorem an algo? Or is it a linear ensemble of A^2 plus B^2? The fact that you are calling bagged trees, for example, AN ALGO shows the problem with this approach. The R package randomForest is simply a combination of CART trees; therefore, even a single random forest, under what you are stating, wouldn't count as a single algo, as it was someone putting together a bunch of CART trees in a clever manner. I understand what you are saying, and similar objections about practicality were raised during the Netflix competition. Google uses over 200 different signals (what we call features) and a combination of algos, but their overall method, as you would call it, is still referred to as "The Google Algorithm". See here, for example; the singular is used eight times, the plural never: http://bits.blogs.nytimes.com/2011/11/14/googlerevealstweakstoitssearchalgorithm/ I do not disagree that MM and W&E used a combination of algorithms; I just disagree that you can't call that combination an algorithm as well. I think you can combine four movies (in some cases) and still consider it A MOVIE, just as you can put up hundreds or thousands of orange pieces of cloth in Central Park and call it A PIECE OF ART. Should a cheeseburger be disqualified as the most delicious piece of food on the planet simply because it combines cheese and a hamburger? Can the United Kingdom not be considered A COUNTRY because it contains the countries of England, Scotland, et al.? Not trying to be a smart ass... ok, maybe a little bit :)

Wikipedia says... While there is no generally accepted formal definition of "algorithm," an informal definition could be "a set of rules that precisely defines a sequence of operations."

RandomForest = (tree1 + tree2) / 2

Is that an algorithm?

myNewAlgorithm = (RF + GBM + NN + LINREG) / 4

Is that a set of rules that precisely defines a sequence of operations? Is that an algorithm?
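[Editor's note: the averaging rule in the post above is itself a precise sequence of operations, which is easy to see once written as code. The four base predictors below are trivial toy stand-ins, not real RF/GBM/NN/LINREG models.]

```python
# Four toy stand-in predictors (hypothetical, for illustration only).
rf = lambda x: x + 1.0
gbm = lambda x: x + 2.0
nn = lambda x: x - 1.0
linreg = lambda x: 2.0 * x

def my_new_algorithm(x):
    """myNewAlgorithm = (RF + GBM + NN + LINREG) / 4: one well-defined,
    deterministic function, however many sub-models it contains."""
    return (rf(x) + gbm(x) + nn(x) + linreg(x)) / 4.0

print(my_new_algorithm(2.0))  # -> 3.0, i.e. (3 + 4 + 1 + 4) / 4
```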

JC36, can you please give us your definition of an algorithm? (At least a crude definition, a description?) You wrote: (1) "I still disagree that a combination of several algorithms should be considered 'AN algorithm'. It should be called 'a method'." (2) "Surely you guys wouldn't argue that a combination of four such different algorithms is AN algorithm."

Oleg, my definition of an algorithm is the same as the dictionary definitions I have seen, which all refer in the singular to "a procedure" or "a method". OK, I can see that some people could classify a procedure that embodies the use of four totally distinct algorithms (e.g., Market Makers' milestone 1 solution) as fitting this definition and therefore being "an algorithm". It is just that that is not my position.

Has the data prep and modelling code been released? I recall Phil had posted replicable code for a previous milestone. Is there any such posting for this one, so that we can test it out, or does it need to be gleaned from the papers themselves?