Submission versus Holdout scores

DaveC (Posts 14, Thanks 3, Joined 16 Feb '11)

Hi all.

When I train my models and then measure the training set scores, I get values like 0.430. When I then test the same model against a holdout set of data that wasn't involved in the training, I get scores like 0.449.

However, when I submit a set of year 4 estimates to Heritage, the resulting scores are about 0.466.

In other words, the submitted score is about 0.017 worse than I hoped for. Why is this? It can't be that I'm drastically overtraining my models, since the holdout set performs satisfactorily.
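For reference, the comparison above can be sketched in Python. This assumes the scores are root mean squared logarithmic errors on log(1 + days), which the thread itself does not state; the function name `rmsle` is made up for illustration, and only the three score values come from the post.

```python
import numpy as np

def rmsle(predicted, actual):
    """Root mean squared logarithmic error between two arrays of day counts."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(predicted) - np.log1p(actual)) ** 2)))

# Scores quoted in the post (assumed to be RMSLE values)
train_score, holdout_score, leaderboard_score = 0.430, 0.449, 0.466

# Gap between training and holdout: the usual optimism of a training score
holdout_gap = holdout_score - train_score
# Gap between holdout and submission: the part the post is asking about
leaderboard_gap = leaderboard_score - holdout_score

print(round(holdout_gap, 3), round(leaderboard_gap, 3))
```

Scoring the training set, the holdout set and any other internal split with the same function keeps the numbers comparable with each other, even if the site computes its score slightly differently.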

It seems to me that the year 4 DIH data must therefore be very different from the previous years' data.

What are your experiences? What differences do you generally see between your internal scoring and when you submit to the website?

How are you decreasing this difference? (I haven't found that correcting for the difference in 'mean' values helps)
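One possible reading of "correcting for the difference in mean values" is shifting the predictions so that their mean in log(1 + days) space matches a chosen target. The sketch below assumes that reading; `shift_log_mean` and `target_log_mean` are hypothetical names, and the target would have to be estimated somehow, since the year 4 values are hidden.

```python
import numpy as np

def shift_log_mean(predicted_days, target_log_mean):
    """Shift predictions in log(1 + days) space so their mean moves to target_log_mean."""
    log_pred = np.log1p(np.asarray(predicted_days, dtype=float))
    # Add the same constant to every prediction in log space
    shifted = log_pred + (target_log_mean - log_pred.mean())
    # Map back to days; clipping at zero can pull the mean slightly off target
    return np.clip(np.expm1(shifted), 0.0, None)

# Toy usage: nudge the log-space mean of three predictions up by 0.1
predictions = np.array([0.5, 1.0, 2.0])
target = np.log1p(predictions).mean() + 0.1
adjusted = shift_log_mean(predictions, target)
```

A constant shift only changes the overall level of the predictions, so it cannot help if the years differ in which members are at risk rather than in the average number of days.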

Cheers,

Dave
 
Karan Sarao (Posts 52, Thanks 2, Joined 14 Mar '11)

I have recently started submitting again, but what you are saying is in line with what I used to get with validation holdouts: typically a 0.015 deterioration between my own validation score and the leaderboard.

 
boooeee (Rank 49th, Posts 18, Thanks 2, Joined 4 Apr '11)

DaveC - My experience has been similar to yours. Here is an earlier thread on the topic: Link
 
José A. Guerrero (Rank 19th, Posts 145, Thanks 21, Joined 27 Jan '11)

0.015 is my experience too. Frustrating until you learn to live with it :-)

Y4 may be different from previous years (Y2 and Y3 also differ a lot from each other), so the problem isn't only overfitting.
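If year-to-year differences account for part of the gap, a holdout taken from the latest labelled year, rather than at random across years, may give an internal score closer to the leaderboard. This is a minimal sketch under the assumption that each row carries the year of its target; the function names and toy labels are hypothetical.

```python
import numpy as np

def random_holdout_split(n_rows, holdout_fraction=0.3, seed=0):
    """Random split: holdout rows come from the same mix of years as training rows."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    n_holdout = int(round(n_rows * holdout_fraction))
    return order[n_holdout:], order[:n_holdout]

def forward_year_split(target_years, holdout_year):
    """Year-based split: train on earlier target years, hold out one later year."""
    target_years = np.asarray(target_years)
    train_idx = np.flatnonzero(target_years < holdout_year)
    holdout_idx = np.flatnonzero(target_years == holdout_year)
    return train_idx, holdout_idx

# Toy labels: the year whose days-in-hospital value each row predicts
target_years = np.array([2, 2, 3, 3, 3, 2])
train_idx, holdout_idx = forward_year_split(target_years, holdout_year=3)
```

With only two labelled target years, this leaves a single year of training data for the check, so the resulting score is a rough indication of drift rather than a precise estimate.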

 
DaveC (Posts 14, Thanks 3, Joined 16 Feb '11)

OK, thanks for the replies. (At least it isn't just me suffering from this problem.)

I think it is a shame that the problem we're being asked to solve is only slightly related to accurately modelling the training data. It seems more progress can be made by accurately modelling the hidden 30% scoring set (and by extension presumably the other 70% too).

Cheers
Dave