The papers written by the milestone winners are now available here. As described in section 13 of the rules, if you have any concerns about these papers, you have 30 days from their posting to provide your feedback.
Hi Edward and Willem. Congratulations on your success at Milestone 2! Here are some questions for you regarding your paper and its contained algorithms:

A. 2.3 Training: What is the tree ensemble maximising? In other words, what determines which column to branch on at each branch in the tree?
B. 2.3 Training: What form does the tree ensemble branching take for the 2 numerical columns? A binary split?
C. 2.3 Training: Why did you "perform the ensemble of trees in DiH"? Wouldn't the log(1+DiH) metric be more sensible?
D. 2.3 Training: For each of the 2000 trees, were the 80 random columns and the 12.5% randomly selected rows constant throughout the tree's creation (unlike, say, Random Forest, where the random column choices vary per branch decision)?
E. 2.4 Post Processing: Is the "corrected" value from the formula considered to be the eventual DiH estimate for that (member, year) classification through that tree?
F. 2.4 Post Processing: Are P1..P4, Pmin, Pmax the same across all 2000 trees?
G. 2.4 Post Processing: Please explain the genetic algorithm used to calculate P1..P4, Pmin, Pmax.
H. 2.4 Post Processing: What values were finally used for P1..P4, Pmin, Pmax?
I. 3.1 Data: Explain combination 3. Is a subtraction being performed? This makes sense if it is "take year 1 column counts and subtract year 2 column counts", but how are the non-count columns like sex, age, etc. handled?
J. 3.1 Data: Explain combination 4. The description appears to be consistent with the "1 Year Claim" method, not the "2 Year Claim" method. If the description is correct, and Y1data > Y2Pred is used for training, how can that work when only the "2 Year Claim" method is being used, and the text states that the test set is 25% of Y3 predictions?
K. 3.4 Ensemble of trees input: How was the weight for the tree data calculated by optimising the RMSE result? What eventual value did you use?
L. 3.x: What is the stochastic gradient descent optimising? (A linear sum? Of what exactly?)
M. 3.x: What update rules are used to calculate the next (iterative) values of the weights of the variables in the equation above? [In your Milestone 1 paper, you describe for CatVec1 the summation equations and the gradient and update rules; please supply them for this model too.]
N. 3.5: Please explain the "weight bias". Is this a constant term (the same for all (member, year) data) in the equation for the model?
O. 3.10: How is the initial gradient (before correction) calculated?
P. 3.11: What were the eventual factors used for the blend of the 4 models? How were these factors calculated?

Thanks very much,
Dave
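For readers following questions L and M: the winners' actual objective and update rules are for Edward and Willem to confirm, but as a generic, hypothetical illustration, stochastic gradient descent on a linear blend minimizing squared error uses the per-example gradient to step each weight. Everything below (function name, learning rate, data) is illustrative, not from the paper:

```python
import numpy as np

def sgd_linear_blend(X, y, lr=0.05, epochs=500, seed=0):
    """Generic SGD for y_hat = X @ w + b, minimizing mean squared error.

    For one example i, the gradient of 0.5 * err^2 (err = y_hat - y[i])
    is err * X[i] w.r.t. the weights and err w.r.t. the bias.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):        # visit examples in random order
            err = X[i] @ w + b - y[i]       # residual for this example
            w -= lr * err * X[i]            # weight update rule
            b -= lr * err                   # bias update rule
    return w, b
```

On exactly linear data (e.g. y = 2x + 1) this converges to the true weight and bias; a real model would of course use the paper's own features and blend terms.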


Hello DaveC, Here are the answers to your questions: A) The Gini Index has been used, as explained on the following (and follow-up) webpages:
I) Yes, it is a subtraction being performed.
M) The update rule is presented in paragraph 3.3.
Thanks, Edward & Willem
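To make answer A concrete: below is a minimal sketch (my own, not the authors' code) of how a Gini-based split criterion scores a candidate branch. A tree greedily picks, at each branch, the column and threshold whose split yields the largest impurity reduction; whether the paper applies this to classes directly or to a regression analogue is a question for the authors:

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a set of class labels: 1 - sum(p_k^2)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gain(parent, left, right):
    """Impurity reduction from splitting `parent` into `left`/`right`.
    The tree branches on whichever column/threshold maximizes this gain."""
    n = len(parent)
    weighted = (len(left) / n) * gini_impurity(left) \
             + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted
```

For example, a perfectly mixed two-class node has impurity 0.5, and a split that separates the classes cleanly recovers the full 0.5 as gain.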


@JC36: I don't see any problem; in fact, I think what is causing you confusion is all the work that goes above and beyond the actual algorithm. All the prize-winning entries, as far as I can tell, embody a method where you can take the data set, perform specific manipulations of it, and arrive, deterministically, at the predictions which scored best on the target dataset. This should fit any useful definition of "algorithm". The difficult part is describing how the teams ARRIVED at these algorithms, relying on compute-intensive, non-deterministic (or at least randomization-dependent) algorithms to CREATE the final algorithm. It does make the head hurt, but the resulting algorithm is largely independent of how it was created.

For example, I tend to run two algorithms based on the "gbm" and "randomForest" packages in "R" and ensemble them. If I published the R code to do that (which is largely what one of the milestone-winning teams has done) and reran it multiple times on multiple machines, the results would be different. However, if I pick one of those runs, I can SAVE the resulting models and apply them repeatedly to new or existing data sets, using the same ensembling math, and thereby function as a repeatable, deterministic, "single" algorithm. In theory that algorithm could be decomposed to some very basic math, although it's easier to talk about in the rich language of the modeling used to create it.

I hope that viewpoint helps. Basically, the large computational time tends to be used in creating an algorithm, and while the resulting algorithm could be computationally intensive itself, I haven't seen anything yet to indicate that it would be unusably so. Just my opinion.
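The save-and-replay point above can be sketched in a few lines. This hypothetical Python stand-in uses bootstrap-resampled least-squares fits in place of gbm/randomForest (the toy models, data, and function names are mine, not ChipMonkey's R code): training is seed-dependent, but once a run's models are saved, applying them is fully deterministic:

```python
import pickle
import numpy as np

def train_model(X, y, seed):
    """Toy stand-in for a stochastic learner: least squares on a
    bootstrap resample, so the fitted weights depend on the seed."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(X), size=len(X))   # bagging-style resample
    w, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    return w

def ensemble_predict(models, X):
    """Blend by simple averaging of the saved models' predictions."""
    return np.mean([X @ w for w in models], axis=0)

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

models = [train_model(X, y, seed) for seed in (0, 1)]  # "pick one run"
blob = pickle.dumps(models)                            # SAVE the models
preds = ensemble_predict(pickle.loads(blob), X)        # replay: deterministic
```

Rerunning the training with different seeds gives different models, but the pickled ensemble reproduces identical predictions every time it is applied, which is the "repeatable, deterministic, single algorithm" being described.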


Not to beat a dead horse, but a single random forest could in and of itself be considered an ensemble, and not a single algo, under some very strict definitions; I believe the generic name is something like "ensemble of trees", after all. Also, training time is usually much longer than prediction time. I have algos that take overnight, and possibly days, to train that will make predictions for new data in under 10 minutes. Further, viewing these as impractical is kind of like looking at Lance Armstrong's gram-sized reduction in clothing as impractical for the average rider, while ignoring all the other advances in understanding the sport and the human body that the top Tour contestants have made. The fewer restrictions placed upon contestants, in my mind, the more they can concentrate on what works, practical or not; something that appears impractical now might be more practical in the future. A retina display would have been impractical 10 years ago; now I can't live without it (slight exaggeration). Without people working on the cutting edge, we would be living with mediocrity.


Thanks guys (ChipMonkey, DavidC and Chris Raimondi) for your comments. Let me say right away that, of course, I accept Kaggle Admin's ruling as to whether the milestone winners' methods comply with the rules. (I might use the same techniques myself if my methods get good enough.) There were four underlying algorithms used in our models, all of which are freely available in the R language for statistical computing. Online references for each algorithm are given in the hyperlinks below.

I have deleted the hyperlinks for clarity. Market Makers go on to use "ensembling", which blends the results of the different algorithms.


JC36 wrote: Thanks guys (ChipMonkey, DavidC and Chris Raimondi) for your comments. Let me say right away that, of course, I accept Kaggle Admin's ruling as to whether the milestone winners' methods comply with the rules. (I might use the same techniques myself if my methods get good enough.) There were four underlying algorithms used in our models, all of which are freely available in the R language for statistical computing. Online references for each algorithm are given in the hyperlinks below. I have deleted the hyperlinks for clarity. Market Makers go on to use "ensembling", which blends the results of the different algorithms. Surely you guys wouldn't argue that a combination of four such different algorithms is AN algorithm?
Yes, I would. As mentioned before, many different things thought of as a single algo, or even a single equation, are actually a combination, or a linear blend. Is the Pythagorean theorem an algo? Or is it a linear ensemble of A^2 plus B^2? The fact that you are calling bagged trees, for example, AN ALGO shows the problem with this approach. The R package randomForest is simply a combination of CART trees; therefore, even a single random forest, under what you are stating, wouldn't count as a single algo, as it was someone putting together a bunch of CART trees in a clever manner. I understand what you are saying, and similar objections to practicality were raised during the Netflix competition. Google uses over 200 different signals (what we call features) and a combination of algos, but their overall method, as you would call it, is still referred to as "The Google Algorithm". See here for example, where the singular is used eight times and the plural never: http://bits.blogs.nytimes.com/2011/11/14/google-reveals-tweaks-to-its-search-algorithm/

I do not disagree that MM and W&E used a combination of algorithms; I just disagree that you can't call that combination an algorithm as well. I think you can combine four movies (in some cases) and still consider it A MOVIE, just as you can put up hundreds or thousands of orange pieces of cloth in Central Park and call it A PIECE OF ART. Should a cheeseburger be disqualified as the most delicious piece of food on the planet simply because it combines cheese and a hamburger? Can the United Kingdom not be considered A COUNTRY because it contains the countries of England, Scotland, et al.? Not trying to be a smart ass; ok, maybe a little bit :)

