The papers written by the milestone winners are now available here. As described in section 13 of the rules, if you have any concerns about these papers, you have 30 days from their posting to provide your feedback.

Hi Edward and Willem. Congratulations on your success at Milestone 2!

Here are some questions for you regarding your Paper and its contained Algorithms:

A. 2.3 Training - What is the tree ensemble maximising? Or in other words, what determines which column to branch on at each branch in the tree?

B. 2.3 Training - What form does the tree ensemble's branching take for the 2 numerical columns? A binary split?

C. 2.3 Training - Why did you 'perform the ensemble of trees in DiH'? Wouldn't the log(1+DiH) metric be more sensible?

D. 2.3 Training - For each of the 2000 trees, were the 80 random columns and the 12.5% randomly selected rows held constant throughout the tree's creation (unlike, say, Random Forest, where the random column choice varies per branch decision)?

E. 2.4 Post Processing - Is the 'corrected' value from the formula considered to be the eventual DiH estimate for that member,year classification through that tree?

F. 2.4 Post Processing - Are the P1..P4, Pmin, Pmax the same across all 2000 trees?

G. 2.4 Post Processing - Please explain the genetic algorithm used to calculate P1..P4, Pmin, Pmax.

H. 2.4 Post Processing - What values were finally used for P1..P4, Pmin, Pmax?

I. 3.1 Data - Explain combination 3. Is this a subtraction being performed? This makes sense if it is 'take year 1 column counts and subtract year 2 column counts', but how are the non-count columns like sex, age, etc. handled?

J. 3.1 Data - Explain combination 4. The description appears to be consistent with the '1 Year Claim' method, not the '2 Year Claim' method. If the description is correct, and Y1data -> Y2Pred is used for training, how can that work when only the '2 Year Claim' method is being used, and the text states that the test set is 25% of Y3 predictions?

K. 3.4 Ensemble of trees input - How was the weight for the tree data calculated by optimising the RMSE result? What eventual value did you use?

L. 3.x. What is the stochastic gradient descent optimising? (a linear sum? of what exactly?)

M. 3.x. What update rules are used to calculate the next (iterative) values of the weights of the variables in the equation above? [In your Milestone 1 paper, you describe for CatVec1 the summation equations and the gradient and update rules - please supply them for this model too]

N. 3.5. Please explain the 'weight bias'. Is this a constant term (same for all member,year data) in the equation for the model?

O. 3.10. How is the initial gradient (before correction) calculated?

P. 3.11. What were the eventual factors used for the blend of the 4 models? How were these factors calculated?

Thanks very much

Dave

Hello DaveC,

Here are the answers to your questions:

A) The Gini Index has been used, as explained on the following (and follow-up) webpages:
   http://people.revoledu.com/kardi/tutorial/DecisionTree/how-to-measure-impurity.htm
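   Roughly, in R (a simplified sketch of how the Gini index scores a candidate split, not our exact code):

    gini <- function(y) {                 # impurity of a label vector
      p <- table(y) / length(y)
      1 - sum(p^2)
    }
    split_gain <- function(y, left) {     # left: logical mask for the left branch
      n  <- length(y); nl <- sum(left)
      gini(y) - (nl / n) * gini(y[left]) - ((n - nl) / n) * gini(y[!left])
    }
    # e.g. split_gain(c(0, 0, 1, 1), c(TRUE, TRUE, FALSE, FALSE))   # -> 0.5

   The split with the largest impurity reduction determines which column to branch on.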


B) It is a binary split, made by calculating a random threshold. Each possible threshold has the same chance of being chosen.
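   A sketch of such a split in R (the uniform draw over the observed range is our simplification):

    random_split <- function(x) {         # x: one numerical column
      thr <- runif(1, min(x), max(x))     # every threshold equally likely
      x <= thr                            # TRUE -> left branch, FALSE -> right
    }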


C) The algorithm was built using DiH directly. Using the log(DiH+1) metric would probably have been more sensible.


D) The column and row selections were randomly chosen before each tree, so each tree has a different
   combination of rows and columns.
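   In R, the per-tree sampling looks roughly like this (a sketch, not our exact code):

    sample_tree_data <- function(data, n_cols = 80, row_frac = 0.125) {
      cols <- sample(ncol(data), n_cols)                        # drawn once per tree
      rows <- sample(nrow(data), round(row_frac * nrow(data)))
      data[rows, cols]                                          # fixed while the tree grows
    }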


E) Yes. The formula parameters are optimized using the 25% testset, and used to transform every DiH estimate.


F) Yes. Note that this can be calculated afterwards for the ensemble of trees, because there are no stopping
   criteria on the testset RMSLE (as there are in SGD).


G) A genetic algorithm is used without crossovers, so only mutations are used for reproduction.
   The mutations change the values of the parameters.
   A population of 10 groups of parameter values is used.
   The fitness value is equal to the resulting RMSLE value on the testset.
   The winner of each generation is used four times in the next generation, the second place three times,
   the third place two times, and the fourth place one time.
   The search is terminated if the fitness does not improve by more than 0.000001 points after 40 generations,
   with a maximum of 500 generations.
   See also http://en.wikipedia.org/wiki/Genetic_algorithm
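   Roughly, in R (a simplified sketch - the initialisation and mutation scale shown are illustrative, not our exact settings):

    run_ga <- function(fitness, n_params, pop_size = 10,
                       max_gen = 500, patience = 40, tol = 1e-6, mut_sd = 0.01) {
      pop   <- replicate(pop_size, rnorm(n_params), simplify = FALSE)
      best  <- Inf
      stale <- 0
      for (gen in seq_len(max_gen)) {
        fit <- vapply(pop, fitness, numeric(1))   # fitness = testset RMSLE
        ord <- order(fit)                         # ascending: ord[1] is the winner
        if (best - fit[ord[1]] > tol) { best <- fit[ord[1]]; stale <- 0 }
        else stale <- stale + 1
        if (stale >= patience) break              # 40 generations without improvement
        # winner copied 4x, runner-up 3x, third 2x, fourth 1x, then mutated
        parents <- rep(ord[1:4], times = 4:1)
        pop <- lapply(pop[parents], function(p) p + rnorm(length(p), sd = mut_sd))
      }
      pop[[which.min(vapply(pop, fitness, numeric(1)))]]
    }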


H) P1=   0.9199
   P2=  -0.0115
   P3=   1.2579
   P4=   0.6246
   Pmin= 0.0001
   Pmax= 1.3748

I) Yes, it is a subtraction being performed.
   This operation is only applied to the claims data (not to the separate sex, age, etc. data), but some columns containing combined categories that include age, sex, etc. are subtracted too, and the subtraction could lose its meaning for those columns (see also the answer to question J).


J) We will change the sentence to:
   For these models, the "One years claims" or "Two years claims" method is used, using only release3 data.
   The Y1 claim data for combination 4 is only used as an additional training set.
   The Y2data->Y3pred entry on the prediction line of combination 4 must be removed.
  


K) We = fixed column weight for the ensemble-of-trees column
        (this weight is multiplied with the DiH predictions of the ensemble-of-trees input column)
   Bias = fixed column with offset
        (this bias is added to each prediction)
   Combination 1: We = 0.0      Bias = 0.44
   Combination 2: We = 0.0      Bias = 0.44
   Combination 3: We = 0.76     Bias = 0.20
   Combination 4: We = 0.76     Bias = 0.20
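   In code form, that wiring looks roughly like this (a sketch; the function names are just illustrative, the exact pipeline is described in the paper):

    We   <- 0.76    # combination 3/4: weight on the ensemble-of-trees column
    Bias <- 0.20    # combination 3/4: constant offset
    tree_column <- function(tree_dih_pred) We * tree_dih_pred   # scaled input column
    with_bias   <- function(prediction)    prediction + Bias    # added to each prediction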


L) It is minimizing the RMSE between the prediction and the true DiH.

M) The update rule is presented in paragraph 3.3.
   The predictions are calculated as the inner product of the parameter vector (p) and the value vector for the member,year (v_m):
     prediction = sum_i(p_i * v_m,i)
     error = prediction - DiH
     gradient = -error
     (apply gradient correction here)
   For each parameter p_i the update rule has the same form; only the learning rate (η_i) differs per parameter:
     p_i <- (1 - η_i * λ) * (p_i + η_i * gradient)
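   As a runnable R sketch (simplified: eta is the vector of per-parameter learning rates, and the gradient correction of answer O is omitted):

    sgd_step <- function(p, v, dih, eta, lambda) {
      prediction <- sum(p * v)            # inner product of parameters and values
      gradient   <- -(prediction - dih)   # minus the prediction error
      (1 - eta * lambda) * (p + eta * gradient)
    }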


N) Yes, it is a constant term, the same for every member,year data row.


O) All initial gradient (power) values are 1.0 (so there is no initial correction for the gradient).
   During optimization, other values are tried.


P) These factors were calculated using the same genetic algorithm optimization as in question G).
    Combination 1: -0.836
    Combination 2:  0.734
    Combination 3:  0.517
    Combination 4:  0.585
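    As code, with the factors above (note they sum to exactly 1.0; the n x 4 matrix layout is just illustrative):

    w <- c(-0.836, 0.734, 0.517, 0.585)              # combinations 1..4
    blend <- function(preds) as.vector(preds %*% w)  # preds: n x 4 matrix of model predictions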

Thanks, Edward & Willem

Hi Edward and Willem

Thanks very much for your reply.

Just one clarification please...

In question K), you reply that the weight bias for the tree input is 0.2 for combinations 3 and 4. I notice that in the text of your paper you put the value as 0.02. Which is correct?

Thanks

Dave

Hello Dave,

It should be 0.2.

Thanks,
Edward & Willem

Hi team Market Makers.

Just one question concerning your excellent milestone 2 result and paper.

When you were performing your ridge regression for blending your results, what value did you use for the quiz-set variance (i.e., the first term in formula 49 of the 'GrandPrize_2009_BPC_BigChaos.pdf' paper, which Willem/Edward referenced and you also used)? How did you estimate this quiz-set variance (for the 30% of the test data set), and what value did you end up with?

Thanks

Dave Clark

Hi Dave,

We used 0.27272.

This is the all-zeros benchmark squared (i.e., the all-zeros RMSLE of about 0.5222, squared).

First of all congrats to both teams for winning once again.

I know this is after the 30 days, but I was wondering, if/when Edward & Willem update their paper, whether they would consider, out of the goodness of their hearts, updating the appendix.  I have been hibernating the past few months (so am a little rusty), but am pretty sure that both A.2 and A.3 are not written as intended, as:

A.2 Appears to contain primary conditions and not procedures
A.3 Appears to contain two primary conditions and no procedures at all

I realize that you mentioned that you made changes to what Market Makers did, but from reading the text - I was under the impression you still were using the four groups.

Also - not complaining, but if you are updating the paper - I think there is a typo - I think you mean "SigClaimVec7" and not "SicClaimVec7" on page 16.

I received great enjoyment out of reading both papers and hope to be able to return the favor in the future...

Hi all,

As Chris noted in the previous comment, the 30-day question/comment period specified in the rules has ended.

Feel free to continue using this thread for questions/discussion.

Many thanks to everyone for your questions, and thanks (and congratulations) to the winners.

David

In another thread called "2nd Milestone Winners" I posted the following, now 24 days ago:-


Anyone who studies the Milestone descriptions cannot but be impressed by the cleverness, the persistence and the comprehensiveness of the methods used. As time goes on you see the best scores on the leaderboard shaving 0.001 or thereabouts off the previous best score. However, it occurs to me that these methods will not be practical in real-life situations because they are a composite of several algorithms, some of which take considerable time to compute. Is the situation going to arise where the method(s) that finally win according to the RMSLE objective function will NOT be the ones actually used in practice? To me the rules seem to have assumed that each entry will be based on ONE algorithm. Has this issue been discussed anywhere in the forum?


So far no one has responded to this comment on that thread. Since then I have searched and read the forum to see if the issue of a single algorithm has been discussed but found nothing. And I have re-read the Rules a few times. It is noteworthy that the rules expect a single algorithm.
For example:
In the Introduction:-
"The winning team will create an algorithm..."
In Rule 5:-
A "Prediction Algorithm" is the algorithm used to produce the data in an Entry taken as a whole (i.e., its particular total configuration) but does not include individual components of the Prediction Algorithm or tools used for analysis or development of the Prediction Algorithm.

In Rule 12:-
Entries will be judged based on the degree of accuracy of their Prediction Algorithm’s predictions of DaysInHospital for Y4...

Once an Entry is selected as eligible for a prize, the conditional winner must deliver the Prediction Algorithm’s code and documentation ...

If this issue has been settled and I have missed it would someone (Kaggle Admin?) tell me where and what the conclusion was.

@JC36:  I don't see any problem; in fact, I think that what is causing you confusion is all the work that goes above and beyond the actual algorithm.

All the prize winning entries, as far as I can tell, embody a method where you can take the data set, perform specific manipulations of it, and arrive, deterministically, at the predictions which scored best on the target dataset.  This should fit any useful definition of "algorithm".

The difficult part is describing how the teams ARRIVED at these algorithms, relying on compute-intensive, non-deterministic (or at least randomization-dependent) algorithms to CREATE the final algorithm.  It does make the head hurt, but the resulting algorithm is largely independent of how it was created.

For example, I tend to run two algorithms based on the "gbm" and "randomForest" packages in "R" and ensemble them.  If I published the R code to do that (which is largely what one of the milestone winning teams has done) and reran it multiple times on multiple machines, the results would be different.  However, if I pick one of those runs, I can SAVE the resulting models and apply them repeatedly to new or existing data sets, using the same ensembling math, and thereby function as a repeatable, deterministic, "single" algorithm.  In theory that algorithm could be decomposed to some very basic math, although it's easier to talk about in the rich language of the modeling used to create it.
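Something like this toy sketch (synthetic stand-in data; the 50/50 blend is just for illustration):

    library(randomForest)
    library(gbm)

    set.seed(1)                                   # toy stand-in data
    train <- data.frame(DaysInHospital = rpois(200, 1),
                        x1 = rnorm(200), x2 = rnorm(200))

    rf <- randomForest(DaysInHospital ~ ., data = train)
    gb <- gbm(DaysInHospital ~ ., data = train,
              distribution = "gaussian", n.trees = 500)
    save(rf, gb, file = "models.RData")           # freeze one stochastic run

    load("models.RData")                          # later: same models, same output
    pred <- 0.5 * predict(rf, newdata = train) +
            0.5 * predict(gb, newdata = train, n.trees = 500)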

I hope that viewpoint helps.  Basically, the large computational time tends to be used in creating an algorithm, and while the resulting algorithm could be computationally intensive itself, I haven't seen anything yet to indicate that it would be unusably so.

Just my opinion.

JC36, thanks for the question (and apologies for the delay--I wasn't watching the other thread as closely), and thanks for your comments, ChipMonkey.

I'd add that we're definitely fine with ensembles -- for one thing, they're a good way to build a model. For another, I don't think it's possible to give a definition that cleanly marks a boundary between one algorithm and several combined.

Not to beat a dead horse, but a single random forest could in and of itself be considered an ensemble and not a single algo under some very strict definitions. I believe the generic name is something like "ensemble of trees", after all. Also, training time is usually much longer than prediction time: I have algos that take overnight, and possibly days, to train that will make predictions for new data in under 10 minutes.

Further, viewing these as impractical is kind of like looking at Lance Armstrong's gram-sized reduction in clothing as impractical for the average rider, while ignoring all the other advances in understanding the sport and the human body that the top tour contestants have made. The fewer restrictions placed upon contestants, in my mind, the more they can concentrate on what works - practical or not - and something that appears impractical now might be more practical in the future. A retina display would have been impractical 10 years ago - now I can't live without it (slight exaggeration) - and without people working on the cutting edge we would be living with mediocrity.

@Chris Ramondi said:

Not to beat a dead horse, but a single random forest could in and of itself be considered an ensemble and not a single algo under some very strict definitions.

I don't think the horse will mind, and the forums have been quiet, so I rather enjoy the conversation.  I still, respectfully, disagree, though... the creation of the random forest is not the "algorithm" in terms of the contest rules (and JC's question).  The resulting forest is.  And, like @DavidC, I cannot think of a definition of practical algorithm that it would not fit.  One would not argue that a quadratic equation is not an "algorithm" because it is an ensemble of three lesser functions (a square, a multiplication, and a constant).  I could see an argument that the algorithm must be deterministic, but again we can't confuse the training with the prediction.  Your overnight randomForest may generate slightly different output every time (a la "random"), but the 10 minute predictions will be identical given the same inputs.

Personally, I'd argue that they're all algorithms anyway; determinism may be required under the "reproducible" rule, but that's up to Kaggle.  And certainly it makes sense to disclose how the results were derived -- to discuss the training of the models rather than simply document the result (i.e. publish a saved randomForest object), if only because it gives the rest of us something interesting to read.

I completely agree, though, with everything else you said.  I haven't gotten a retina display yet simply so that I can continue to live without it.

Thanks guys (ChipMonkey, DavidC and Chris Raimondi) for your comments. Let me say right away that, of course, I accept Kaggle Admin's ruling as to whether the milestone winners' methods comply with the rules. (I might use the same techniques myself if my methods get good enough.)
However, I still disagree that a combination of several algorithms should be considered "AN algorithm". It should be called "a method". Have a look at Market Makers' paper describing their method for their winning round 1 milestone.
Here is what they write:-


There were four underlying algorithms used in our models, all of which are freely available in the R language for statistical computing. Online references for each algorithm are given in the hyperlinks below.

  1. Gradient Boosting Machines ...
  2. Neural Networks...
  3. Bagged Trees...
  4. Linear Models...

I have deleted the hyperlinks for clarity. Market Makers go on to use "ensembling" which blends the results of the different algorithms.
Surely you guys wouldn't argue that a combination of four such different algorithms is AN algorithm. If you do I would have to give away my geographical location and say you are a mob of "bush lawyers"!

JC36 wrote:

Surely you guys wouldn't argue that a combination of four such different algorithms is AN algorithm

Yes I would - as mentioned before, many different things thought of as a single algo - or even a single equation - are actually a combination, or a linear blend.

Is the Pythagorean theorem an algo?  Or is it a linear ensemble of A^2 plus B^2?

The fact that you are calling Bagged Trees - for example - AN ALGO shows the problem with this approach.  The R package randomForest is simply a combination of CART trees - therefore, even a single random forest, under what you are stating, wouldn't count as a single algo, as it is someone putting together a bunch of CART trees in a clever manner.

I understand what you are saying - similar objections to practicality were raised during the Netflix competition.  Google uses over 200 different signals (what we call features) and a combination of algos, but their overall method - as you would call it - is still referred to as "The Google Algorithm".  See here for example - the singular is used eight times, the plural never:

http://bits.blogs.nytimes.com/2011/11/14/google-reveals-tweaks-to-its-search-algorithm/

I do not disagree that MM and W&E used a combination of algorithms - I just disagree that you can't call that combination an algorithm as well.

I think you can combine four movies (in some cases) and still consider it A MOVIE, just as you can put up hundreds or thousands of orange pieces of cloth in Central Park and call it A PIECE OF ART.  Should a cheeseburger be disqualified as the most delicious piece of food on the planet simply because it combines cheese and a hamburger?  Can the United Kingdom not be considered A COUNTRY because it contains the countries of England, Scotland, et al.?

Not trying to be a smart ass - ok maybe a little bit :)

Wikipedia says...

While there is no generally accepted formal definition of "algorithm," an informal definition could be "a set of rules that precisely defines a sequence of operations."

RandomForest = (tree1 + tree2) / 2

Is that an algorithm?

myNewAlgorithm = (RF + GBM + NN + LINREG) / 4

Is that a set of rules that precisely defines a sequence of operations? Is that an algorithm?

JC36, can you please give us your definition of algorithm? (At least a crude definition, a description?)
It seems that your definition should be very different from what one would find in a typical computer science textbook or, say, in Wikipedia.
And, please, could you provide a proof (based on your definition) that a combination of several "different" algorithms cannot be an algorithm? That is your claim, right? Sorry if I misunderstood something, but here are two quotations from your comment:

(1) I still disagree that a combination of several algorithms should be considered "AN algorithm". It should be called "a method".

(2) Surely you guys wouldn't argue that a combination of four such different algorithms is AN algorithm.

Oleg, my definition of an algorithm is the same as the dictionary definitions I have seen, which are all singular, i.e. they refer to "a procedure" or "a method". OK, I can see that some people could classify a procedure that embodies the use of four totally distinct algorithms (e.g. Market Makers' milestone 1 solution) as fitting this definition and therefore being "an algorithm". It is just that that is not my position.

Has the data prep and modelling code been released? I recall Phil posted replicable code for a previous milestone. Has anything been posted for this one so that we can test it out, or does it need to be gleaned from the papers themselves?
