I've moved on to new projects. As a last experiment, I want to see if incorporating my predictions can significantly improve your score. I've read about boosting, but haven't tried it yet. I want to send someone my predictions and see if they are useful to them. If this helped someone get in the money, I wouldn't want to ask for any part of it.
Overview of my algorithm:
My algorithm was fairly straightforward, and I think it was different from what most people here are using. I created variables from the data that I thought would be predictive. I then ran an OLS regression on the training data.
I used the fitted values from that regression as an index of predicted health usage. I then ran a very simple non-parametric estimator to map the index to the predictions that minimize RMSLE.
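For anyone who wants to try the two-stage idea, here is a rough sketch in Python. It is not my actual code: the data is synthetic, the function names are made up for illustration, and the "non-parametric estimator" here is assumed to be plain quantile binning of the index (the post doesn't pin down which estimator was used). Within each bin, the constant prediction that minimizes RMSLE is exp(mean(log(1 + y))) - 1, so that is what each bin predicts.

```python
import numpy as np

def rmsle(pred, actual):
    """Root mean squared logarithmic error."""
    return float(np.sqrt(np.mean((np.log1p(pred) - np.log1p(actual)) ** 2)))

def fit_ols(X, y):
    """Stage 1: ordinary least squares with an intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def ols_index(beta, X):
    """Fitted values, used only as a ranking index of predicted usage."""
    return np.column_stack([np.ones(len(X)), X]) @ beta

def fit_bin_map(index, y, n_bins=10):
    """Stage 2 (assumed form): quantile bins on the index.
    Each bin predicts expm1(mean(log1p(y))), the RMSLE-optimal constant."""
    edges = np.quantile(index, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.searchsorted(edges, index, side="right")
    overall = np.expm1(np.mean(np.log1p(y)))  # fallback for an empty bin
    values = np.array([
        np.expm1(np.mean(np.log1p(y[bins == b]))) if np.any(bins == b) else overall
        for b in range(n_bins)
    ])
    return edges, values

def apply_bin_map(edges, values, index):
    """Look up the bin prediction for each index value."""
    return values[np.searchsorted(edges, index, side="right")]

# Synthetic stand-in for the real features and days-in-hospital target.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
y = rng.poisson(np.exp(0.5 * X[:, 0] - 1.0)).astype(float)

beta = fit_ols(X, y)
index = ols_index(beta, X)
edges, values = fit_bin_map(index, y, n_bins=10)
pred = apply_bin_map(edges, values, index)

# Baseline: one RMSLE-optimal constant for everyone.
constant = np.full_like(y, np.expm1(np.mean(np.log1p(y))))
score_binned = rmsle(pred, y)
score_constant = rmsle(constant, y)
```

In-sample, the binned map can never score worse than the single constant, since the constant is one of the options available in every bin. On real data you would fit both stages on training data and pick the number of bins on a holdout set.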
What I was going to do next (in case anyone cares):
I'd like to include quite a few more variables (e.g. more dummy variables for specific vendors, more interaction terms), and I think I have a method to reduce the overfitting that comes with doing so. I would have included these variables in a multi-level estimation framework that shrinks imprecise estimates towards group means. I was going to use methods from Gelman and Hill's book. This incorporates "regression to the mean" to reduce overfitting. I was going to implement this in PyMC, but you could do it in R too. I thought this was a really good idea (and I thought it was the big advantage of using a regression in the first stage rather than random forests). I don't have time to follow it through, but hopefully the idea interests someone.
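To make the shrinkage idea concrete, here is a minimal sketch. It is not the PyMC model I had in mind, just the closed-form weighted average that a simple varying-intercept model reduces to: each group's estimate is a precision-weighted blend of its own mean and the overall mean. The function name, the vendor labels, and the two variance values are assumptions for illustration; in a real multi-level fit the variances would be estimated from the data.

```python
import numpy as np

def partial_pool_means(y, groups, sigma_y2, sigma_a2):
    """Shrink each group's mean toward the overall mean.

    sigma_y2: assumed within-group variance of the outcome
    sigma_a2: assumed between-group variance of the true group means
    Groups with few observations get a small weight on their own mean.
    """
    y = np.asarray(y, dtype=float)
    groups = np.asarray(groups)
    overall = y.mean()
    shrunk = {}
    for g in np.unique(groups):
        y_g = y[groups == g]
        n = len(y_g)
        # Precision weight on the group's own mean.
        w = (n / sigma_y2) / (n / sigma_y2 + 1.0 / sigma_a2)
        shrunk[str(g)] = w * y_g.mean() + (1.0 - w) * overall
    return shrunk

# Hypothetical example: vendor "A" has 2 claims with a high mean,
# vendor "B" has 200 claims with a low mean.
y = np.array([10.0, 12.0] + [1.0] * 200)
groups = np.array(["A"] * 2 + ["B"] * 200)

shrunk = partial_pool_means(y, groups, sigma_y2=4.0, sigma_a2=1.0)
overall_mean = float(y.mean())
```

The tiny vendor "A" gets pulled most of the way toward the overall mean, while the large vendor "B" barely moves. That is the "regression to the mean" effect: a dummy variable estimated from two observations is not allowed to claim a huge effect.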
How to take me up on the offer:
If Kaggle says I can make my predictions or my code publicly available, I'll do so. I cleaned the data in Stata and did the estimation in Python. If I'm only allowed to give it to one team, I'd like to see if it helps someone who already has a better algorithm than mine. Drop me a line though.
I'm out... have fun predicting.