Please visit our data-mining blog!

esla's image Posts 4
Joined 19 May '12 Email user

Hi Everybody,

This summer, I am teaching a data-mining class. In the class, we cover basic techniques in data mining.

Students, as part of class homework, should submit their prediction for HHP.

We try to publish our understanding of the problem on our blog. Please visit us at:

http://machinelearningsummer.blogspot.com/

http://machinelearningsummer.blogspot.com/2012/05/course-description.html

 

Kind Regards,

Esla

 
esla's image Posts 4
Joined 19 May '12 Email user

One interesting observation we made is the following: students in the class came up with their own features and model parameters (linear regression with L2-norm regularization, i.e. ridge). Even though none of the students has made the top 100 yet, they generally have very different predictions for the test set, but their best scores are always in a similar range (+/- 0.01). Is there a good explanation for such behavior?

 
ChipMonkey's image Rank 84th
Posts 60
Thanks 13
Joined 20 Mar '11 Email user

If I understand correctly, using only linear regression is probably the biggest problem. Random Forests, Neural Networks, Gradient Boosting, and other techniques are going to get you better results once you've established a good baseline feature set.

Combining (ensembling) those results together, and across different feature sets, still seems to be the prevailing strategy among the best scorers (at least for those that have been talking about it).

Thanked by esla
 
DanB's image Rank 2nd
Posts 58
Thanks 46
Joined 6 Apr '11 Email user

Esla,

That is a very interesting phenomenon that the predictions are all different, but they generate similar scores. If they are using a validation dataset, it would be interesting to compare their scores on the validation dataset. Assuming they are not doing that (and k-fold validation is introducing too much complexity), it would be interesting to see how much their R^2 or RMSE differ on the training dataset.
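A minimal sketch of the comparison described above, assuming a single hypothetical feature and toy data (none of it comes from the HHP dataset): fit on part of the training rows, hold out the rest as a validation set, and report RMSE and R^2 on both.

```python
import math

def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x for one feature."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def rmse(preds, targets):
    """Root mean squared error."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets))

def r_squared(preds, targets):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(targets) / len(targets)
    ss_res = sum((t - p) ** 2 for p, t in zip(preds, targets))
    ss_tot = sum((t - mean_t) ** 2 for t in targets)
    return 1.0 - ss_res / ss_tot

# Hypothetical toy data: y = 2x + 1 plus a small fixed perturbation.
xs = list(range(10))
noise = [0.1, -0.1, -0.1, 0.1, 0.1, -0.1, -0.1, 0.1, 0.1, -0.1]
ys = [2 * x + 1 + e for x, e in zip(xs, noise)]

# Simple holdout split: even-indexed rows train, odd-indexed rows validate.
train_x, valid_x = xs[0::2], xs[1::2]
train_y, valid_y = ys[0::2], ys[1::2]

intercept, slope = fit_line(train_x, train_y)
train_pred = [intercept + slope * x for x in train_x]
valid_pred = [intercept + slope * x for x in valid_x]

train_rmse, valid_rmse = rmse(train_pred, train_y), rmse(valid_pred, valid_y)
train_r2, valid_r2 = r_squared(train_pred, train_y), r_squared(valid_pred, valid_y)
print(f"train: RMSE={train_rmse:.4f} R^2={train_r2:.4f}")
print(f"valid: RMSE={valid_rmse:.4f} R^2={valid_r2:.4f}")
```

Running the same split for every team would put their scores on a common footing before comparing them with the leaderboard.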

Another interesting experiment would be to have some of the student teams merge and "ensemble" their submissions. Since the scores are so similar, you could reasonably ensemble the best submissions from multiple teams using the simple of their respective submissions. If their underlying predictions are actually dissimilar, I think they'd be pleasantly surprised to learn how well this works.

Thanked by esla
 
esla's image Posts 4
Joined 19 May '12 Email user

Thanks ChipMonkey and DanB! I will definitely use your suggestions. At the moment, we are trying to understand the limits of linear regression (in general). Comparing R^2 is a nice suggestion! I agree it would be very interesting to see how RMSE changes when combining different linear regressions with different features. But I guess (as you suggested, DanB) we should use a held-out test set, since the training RMSE of a least-squares fit never gets worse when new features are added.

For now, we are trying not to use other algorithms (since I don't want to confuse everyone in the class with new concepts!). I hope we can cover other algorithms (e.g. the Random Forest suggestion by ChipMonkey) in the last two weeks.

Thanks again!
Esla

 
David J. Slate's image Rank 13th
Posts 65
Thanks 25
Joined 5 Aug '10 Email user

DanB's suggestion to try ensembling the predictions from multiple teams is quite sensible. However, I think he intended to say "using the simple mean of their respective submissions" rather than "using the simple of their respective submissions". Actually, besides the mean one could also try the median, or even other measures of central tendency.
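As a rough illustration of blending by mean or median, here is a small sketch. The three "submissions" and the target values are made-up numbers chosen so that the teams err in different directions; they are not real competition data.

```python
import math
import statistics

def rmse(preds, targets):
    """Root mean squared error."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets))

def blend(submissions, combine=statistics.mean):
    """Combine several submissions row by row with the given central-tendency function."""
    return [combine(row) for row in zip(*submissions)]

# Hypothetical true values and three hypothetical team submissions.
truth = [1.0, 2.0, 3.0, 4.0]
submissions = [
    [1.4, 1.8, 3.1, 3.7],  # team A
    [0.7, 2.3, 3.2, 3.9],  # team B
    [1.1, 1.6, 2.7, 4.3],  # team C
]

individual_scores = [rmse(s, truth) for s in submissions]
mean_blend = blend(submissions, statistics.mean)
median_blend = blend(submissions, statistics.median)

print("individual RMSE:", [round(s, 3) for s in individual_scores])
print("mean blend RMSE:", round(rmse(mean_blend, truth), 3))
print("median blend RMSE:", round(rmse(median_blend, truth), 3))
```

In this toy case both blends score better than any single submission because the individual errors partly cancel. How much is gained on real data depends on how dissimilar the submissions actually are.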

-- Dave Slate

Thanked by DanB , and esla
 
S.U.T.'s image Posts 43
Thanks 7
Joined 5 Sep '11 Email user

This effect was dubbed "Rashomon" in The Elements of Statistical Learning (free online). The nickname comes from the famous Japanese movie showing how a crime looks different to each person who witnessed it (the models), with none being the absolute truth.

It seems particularly relevant in the medical field: the example they give is predicting heart attacks, where there are about 10 predictors, and almost any group of about 6 predictors gives a very "different" but equally accurate model.
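A toy sketch of the same effect, assuming two hypothetical features that are each a differently perturbed copy of the target: a one-feature least-squares model built on either one scores almost the same, yet the two models disagree noticeably row by row. All numbers are invented for illustration.

```python
import math

def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x for one feature."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def rmse(preds, targets):
    """Root mean squared error."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets))

# Hypothetical target and two noisy views of it (same noise size, different pattern).
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
e1 = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5]
e2 = [0.5, 0.5, -0.5, -0.5, 0.5, 0.5, -0.5, -0.5]
x1 = [t + e for t, e in zip(y, e1)]
x2 = [t + e for t, e in zip(y, e2)]

# Each "student" fits a model on a different feature.
a1, b1 = fit_line(x1, y)
a2, b2 = fit_line(x2, y)
pred1 = [a1 + b1 * x for x in x1]
pred2 = [a2 + b2 * x for x in x2]

rmse1, rmse2 = rmse(pred1, y), rmse(pred2, y)
max_disagreement = max(abs(p - q) for p, q in zip(pred1, pred2))

print(f"model 1 RMSE={rmse1:.3f}, model 2 RMSE={rmse2:.3f}")
print(f"largest gap between the two models' predictions: {max_disagreement:.3f}")
```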

 
esla's image Posts 4
Joined 19 May '12 Email user

It is interesting that combining the weak models by their mean value helps a lot!

 
