It's a question. Chris didn't find a set with the same RMSE in Y3. I had more trouble finding a 30% sequence with a similar zero error in Y3; it was easier in Y2. I'm not sure whether the smaller bias associated with the test results is just dumb luck, because I haven't done enough runs to know.
Thanks 3 Joined 5 Apr '11
Posts 132 Thanks 55 Joined 9 Jul '10
You might be interested in this post by our new leader (Congrats Phil/Sali): http://anotherdataminingblog.blogspot.com/2011/05/learning-from-leaderboard-part-1.html See the graphic at the bottom - it is very relevant to this topic.
Posts 14 Thanks 1 Joined 4 Apr '11
B Yang wrote: boooeee wrote:
For example, my most recent (and best) score is from a fairly straightforward Random Forest model. The leaderboard score is 0.468123.
Care to share some info on your RF model? I'm a newbie at RF and I've given it a few tries, but so far it badly underperforms against my linear regression model. Combined with the sparse documentation and the long running time without progress feedback, that has me about ready to give up on RF.
Sure. Here is the Random Forest R statement I ran:

    rfmodel <- randomForest(rf.df, y3.m, nodesize = 500, ntree = 250)

rf.df is a dataframe of dummy and categorical variables at the member level that I derived from the Claims file. Dummy variables include: Year 2 PrimaryConditionGroup, Year 1 PrimaryConditionGroup, Year 3 Claims Truncation, Year 2 Claims Truncation, Year 2 Charlson Index, and Year 2 Days in Hospital (bucketed into 1 day, 2-4 days, and 5+ days). I combined age and gender into a single categorical variable (e.g. 80+M). Having agesex as a categorical variable worked out much better than setting up a dummy variable for each agesex combination.

y3.m is the log(DaysInHospital+1) for Year 3 by member.

I set the minimum nodesize to 500. The default is 5. I found that the model performed better if I upped the minimum nodesize to 500 (as a bonus, it runs much faster too). I set the number of trees to 250 because that seemed to be the point where the cross-validation error flatlined and adding new trees didn't appear to improve performance.

I am running 64-bit R on a 2.5-year-old iMac with RAM upgraded to 8GB.

As you can tell from my leaderboard score, I haven't hit upon the secret sauce yet, but the above does give me better results than linear regression.
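For anyone who wants to see the data preparation described above spelled out, here is a minimal sketch. It is written in plain Python (standard library only) rather than R, and it is not the poster's actual code: the column names (`dih_1`, `dih_2_4`, `dih_5_plus`), the helper names, and the three example members are all made up for illustration. It also assumes that zero days in hospital is the baseline for the bucketed dummies, and that scoring is an RMSE taken on the same log(DaysInHospital+1) scale as the target.

```python
import math

# Hypothetical column names; the post does not give the actual names in rf.df.
def days_dummies(days_in_hospital):
    """Dummy-code Days in Hospital into the post's buckets: 1, 2-4, 5+.

    Assumption: zero days is the baseline, so all three dummies are 0.
    """
    return {
        "dih_1": int(days_in_hospital == 1),
        "dih_2_4": int(2 <= days_in_hospital <= 4),
        "dih_5_plus": int(days_in_hospital >= 5),
    }

def agesex_level(age_band, sex):
    """Combine age band and sex into one categorical level, e.g. '80+M'."""
    return f"{age_band}{sex}"

def log_days(days_in_hospital):
    """Target transform from the post: log(DaysInHospital + 1)."""
    return math.log1p(days_in_hospital)

def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and the same length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

# Made-up members for illustration only: (age band, sex, Y2 days, Y3 days).
members = [("80+", "M", 3, 0), ("40-49", "F", 0, 1), ("60-69", "M", 7, 2)]

# One feature row per member: a single agesex level plus the bucketed dummies.
rows = [
    {"agesex": agesex_level(age, sex), **days_dummies(y2_days)}
    for age, sex, y2_days, _ in members
]
targets = [log_days(y3_days) for *_, y3_days in members]

# Baseline: predict the mean log target for everyone, then score on the log scale.
baseline = sum(targets) / len(targets)
print(rows[0])
print(round(rmse([baseline] * len(targets), targets), 4))
```

A constant-prediction baseline like the last two lines is a handy sanity check: any model, random forest or linear regression, should beat it on held-out members before its score is worth comparing with anything else.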