This question isn't really about winning the competition; it's more about the actual goal of the competition, which presumably is to price health care premiums more accurately.
Just reading through the table schema, I notice that a lot of fields are truncated at the 95th percentile. We are talking about numbers such as 6 or 7 (7 becomes "6+" for DrugCount) or 162 or 163 for pay delay (163 becomes "162+" days), so storage cannot be the reason for this truncation ("162+" requires 4 bytes, 163 requires 1). Doesn't this censorship, which is all over the schema, have a major effect on the real-world utility of the forecasting data?
Also consider the four reasons a sick person might not make a claim "next year": 1, death; 2, loss of the financial ability to pay premiums, copays, etc.; 3, transfer to another provider; 4, health. Naively, I would expect the costs of these outcomes to differ, and so the likelihood of each outcome should affect the premium one charges a subscriber.
Yes, I understand it has no effect on the ability to forecast the out-of-sample data set if that is censored in the same way, but doesn't it have consequences for the real-world usage of the winning model? After all, the actual costs and actual liabilities of the provider don't get truncated in this manner.
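To make the concern concrete, here is a minimal sketch of what capping at the 95th percentile does to an average. The data are made up (an exponential distribution with a mean of 40 days standing in for pay delay), and `top_code` is a hypothetical helper of mine, not anything from the competition files. The point is only that a mean computed on the capped values has to come out below the mean of the uncapped values whenever the tail is cut off.

```python
import random

def top_code(values, quantile=0.95):
    """Cap every value at the empirical `quantile` of `values` (top-coding)."""
    ordered = sorted(values)
    cap = ordered[int(quantile * (len(ordered) - 1))]
    return [min(v, cap) for v in values], cap

random.seed(0)
# Hypothetical right-skewed "pay delay" data in days; NOT the competition data.
delays = [random.expovariate(1 / 40) for _ in range(10_000)]
capped, cap = top_code(delays)

true_mean = sum(delays) / len(delays)
capped_mean = sum(capped) / len(capped)
print(f"cap={cap:.1f}  true mean={true_mean:.1f}  capped mean={capped_mean:.1f}")
```

How large the gap is depends entirely on how heavy the real tail is, which is exactly the information the censored schema hides.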