When I train my models and measure their scores on the training set, I get values like 0.430. When I then evaluate the same models on a holdout set that wasn't involved in training, I get scores around 0.449.
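For reference, this is roughly how I compute my internal scores, in case part of the gap comes from a scoring mismatch on my end (a minimal sketch; `pred` and `actual` are hypothetical arrays of predicted and true days in hospital):

import numpy as np

def score(pred, actual):
    # Root mean squared error in log space:
    # sqrt( mean( (log(p + 1) - log(a + 1))^2 ) )
    return np.sqrt(np.mean((np.log1p(pred) - np.log1p(actual)) ** 2))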
However, when I submit a set of Year 4 estimates to the Heritage site, the resulting scores are about 0.466.
In other words, about 0.017 worse than I'd hoped for. Why is this? It can't be that I'm drastically overfitting my models, since the holdout set performs satisfactorily.
It seems to me that the Year 4 DIH (days in hospital) data must therefore be very different from the previous years' data.
What are your experiences? What differences do you generally see between your internal scores and the scores you get when you submit to the website?
How are you reducing this gap? (I haven't found that correcting for the difference in 'mean' values helps.)
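For what it's worth, this is the kind of mean correction I tried (a sketch only; `target_log_mean` is whatever one assumes the Year 4 mean of log(1 + DIH) to be, which is itself a guess):

import numpy as np

def shift_mean(pred, target_log_mean):
    # Shift predictions by a constant in log space so their mean
    # matches the assumed Year 4 mean, then map back to days.
    log_pred = np.log1p(pred)
    adjusted = np.expm1(log_pred + (target_log_mean - log_pred.mean()))
    return np.clip(adjusted, 0, 15)  # DaysInHospital is capped at 15

It moves my internal numbers around, but it hasn't noticeably closed the leaderboard gap for me.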