After I included the number of distinct providers/vendors per member in my models, performance on my validation and out-of-bag sets improved dramatically (delta RMSLE of -0.02). However, when I submitted these models, they performed much worse on the leaderboard data (delta RMSLE of +0.02).
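For clarity, the feature is a simple distinct count per member, along the lines of this pandas sketch (the file and column names `MemberID`, `ProviderID`, `Vendor` are placeholders, not necessarily the actual schema):

```python
import pandas as pd

# Hypothetical claims table; column names are placeholders.
claims = pd.read_csv("Claims.csv")

# Number of distinct providers and vendors each member saw.
member_counts = claims.groupby("MemberID").agg(
    n_providers=("ProviderID", "nunique"),
    n_vendors=("Vendor", "nunique"),
).reset_index()
```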
Normalizing these variables within each year did not reduce the discrepancy between validation and leaderboard performance.
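A per-year z-score is one way to do this normalization; a minimal sketch (again with placeholder names, assuming a `Year` column in the claims data):

```python
import pandas as pd

claims = pd.read_csv("Claims.csv")  # same hypothetical table as above

# Distinct counts per member and year, then standardized within each
# year so the scale is comparable across years.
per_year = claims.groupby(["MemberID", "Year"]).agg(
    n_providers=("ProviderID", "nunique"),
    n_vendors=("Vendor", "nunique"),
).reset_index()

for col in ["n_providers", "n_vendors"]:
    grp = per_year.groupby("Year")[col]
    per_year[f"{col}_norm"] = (per_year[col] - grp.transform("mean")) / grp.transform("std")
```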
Has anyone else observed similar behaviour with these predictors? Is there an explanation for this, or am I making a mistake?