PatternEngine's image Posts 6
Thanks 3
Joined 5 Apr '11 Email user

So, I figure we should have a thread for sharing thoughts/ideas about how we're getting good prediction results.

Of course, no one wants to give away the secret edge that's going to win them the prizes :-)  But there are clearly also going to be a range of 'standard' ideas that everyone will end up figuring out and using.  If we pool them here on the forum, we can all benefit and get on with working on cleverer/sneakier approaches.

To put my money where my mouth is, here are some things I've learned so far:  

Generating informative sets of features seems pretty important, straight off the bat.  I've found the following features to be informative.

Sex, Age, nDaysInHospital (previous year)

And from the claims data for the previous year:

total nClaims, nCharlsonIndex of each category, Counts of primary conditions, Counts of  procedures, Counts of placeSvc, Counts of speciality

There may also be a benefit from using the same features from two years before the target values, but the effect seems pretty small.

(I feel like there's more one could do with the Claims data, but there are issues with the large number of features)
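
For anyone who wants a starting point, here is a minimal sketch of the per-member count features listed above. It assumes each claim is a dict, and the keys `MemberID`, `PrimaryConditionGroup`, `PlaceSvc` and `Specialty` (and the sample values) are illustrative names, not something confirmed in this thread.

```python
from collections import Counter, defaultdict

def count_features(claims, fields=("PrimaryConditionGroup", "PlaceSvc", "Specialty")):
    """Return {member_id: {feature_name: count}} for one year of claims."""
    features = defaultdict(Counter)
    for claim in claims:
        row = features[claim["MemberID"]]
        row["nClaims"] += 1
        for field in fields:
            value = claim.get(field)
            if value:  # skip missing or empty values
                row[f"{field}={value}"] += 1
    return {member: dict(row) for member, row in features.items()}

# Made-up claims, only to show the shape of the output.
claims = [
    {"MemberID": 1, "PrimaryConditionGroup": "MSC2a3", "PlaceSvc": "Office", "Specialty": "Internal"},
    {"MemberID": 1, "PrimaryConditionGroup": "MSC2a3", "PlaceSvc": "Urgent Care", "Specialty": "Emergency"},
    {"MemberID": 2, "PrimaryConditionGroup": "PRGNCY", "PlaceSvc": "Office", "Specialty": "Obstetrics"},
]
features = count_features(claims)
```

Every distinct value becomes its own column, which is where the "large number of features" problem comes from; dropping rare values is one possible way to keep it manageable.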

 

Method-wise, I've started with simple linear regression (with stepwise feature selection).  I'm pretty sure this is too restrictive to be useful, but it's very handy for data exploration.  I'll be trying out some more interesting models in the near future.

I hope this is useful to people.  If you would like to reciprocate, that would be awesome  :-)  And if this thread gets going, I'm happy to keep contributing my thoughts to it, as I think we'll all benefit from it.

*braces for deluge of useful responses*

 

 

 
alexanderr's image Posts 42
Thanks 2
Joined 5 Apr '11 Email user

I got a score of 0.51 even though my MySQL program choked on the huge number of rows in each table, misplacing values and even inputting wrong ones. So a reasonable score can be obtained by luck! I think it is important that people make sure their approach is actually working before sharing faulty ideas. If I could do so reliably, I would gather the sort of data you have got. But I am not analysing any data for at least a few weeks, until I have got my hardware and software updated: Revolutionary R and 4 GB of memory for my laptop!

 
Zach's image Rank 31st
Posts 292
Thanks 64
Joined 2 Mar '11 Email user
How do you deal with nDaysInHospital from year 1, which we do not have? Are you just building your model using year 3's data, and then predicting for year 4?
 
PatternEngine's image Posts 6
Thanks 3
Joined 5 Apr '11 Email user
I've just been using nDaysInHospital from the previous year (i.e. y3 when predicting the y4 outcomes; y2 when predicting the y3 outcomes, for testing). While we're not given nDaysInHospital for year 1, you can probably get close by adding up the stay lengths for Y1 claims.
Thanked by Tapani
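
A rough sketch of the "add up the stay lengths" idea, assuming each Y1 claim carries a stay-length label. The key names (`MemberID`, `LengthOfStay`) and the label-to-days mapping below are hypothetical; bucketed labels are mapped to a guessed midpoint.

```python
from collections import defaultdict

# Hypothetical mapping from a stay-length label to days (an assumption;
# adjust it to whatever labels the claims file actually contains).
STAY_DAYS = {"": 0, "1 day": 1, "2 days": 2, "3 days": 3, "1-2 weeks": 10}

def approx_days_in_hospital(claims):
    """Sum per-claim stay lengths into one approximate total per member."""
    totals = defaultdict(int)
    for claim in claims:
        label = claim.get("LengthOfStay", "")
        totals[claim["MemberID"]] += STAY_DAYS.get(label, 0)  # unknown label counts as 0
    return dict(totals)

# Made-up Y1 claims.
y1_claims = [
    {"MemberID": 1, "LengthOfStay": "2 days"},
    {"MemberID": 1, "LengthOfStay": ""},
    {"MemberID": 1, "LengthOfStay": "1 day"},
    {"MemberID": 2, "LengthOfStay": ""},
]
y1_days = approx_days_in_hospital(y1_claims)
```

This is only an approximation: if several claims belong to the same stay, a plain sum may over-count.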
 
Jose H. Solorzano's image Posts 103
Thanks 47
Joined 21 Jul '10 Email user
Has anyone estimated the average of log(Y4+1) in the test data sample? I'm sure it can be estimated by making a few submissions, but I don't think it makes sense for every competitor to do this.
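
To make "estimated by making a few submissions" concrete, here is a sketch that assumes the score is the root mean squared error on log(prediction + 1) versus log(actual + 1). Under that assumption, an all-zeros submission and one other constant submission are enough to back out the mean of log(Y+1). The function names and the sample data are made up for illustration.

```python
import math

def rmsle(predicted, actual):
    """Root mean squared error between log(p + 1) and log(a + 1)."""
    squared = [(math.log(p + 1) - math.log(a + 1)) ** 2
               for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

def mean_log_from_scores(score_zero, constant, score_constant):
    """Recover mean(log(y + 1)) from the scores of two constant submissions.

    With k = log(constant + 1) and L = log(y + 1):
      score_zero**2     = mean(L**2)
      score_constant**2 = k**2 - 2*k*mean(L) + mean(L**2)
    """
    k = math.log(constant + 1)
    return (k ** 2 + score_zero ** 2 - score_constant ** 2) / (2 * k)

# Made-up outcomes standing in for the hidden targets.
days = [0, 0, 0, 0, 1, 0, 3, 0, 0, 15]
s0 = rmsle([0.0] * len(days), days)   # score of an all-zeros submission
s1 = rmsle([0.5] * len(days), days)   # score of an all-0.5 submission
estimate = mean_log_from_scores(s0, 0.5, s1)
true_mean = sum(math.log(d + 1) for d in days) / len(days)
best_constant = math.exp(estimate) - 1  # constant that minimises this error
```

If the leaderboard is scored on only part of the test set, the estimate applies to that part.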
 
Eric Jackson's image Posts 21
Thanks 9
Joined 9 Sep '10 Email user

Jose: see this thread:

 

http://www.heritagehealthprize.com/c/hhp/forums/t/523/interesting-submissions-with-scores

 

Thanked by Jose H. Solorzano
 
Karan Sarao's image Posts 52
Thanks 2
Joined 14 Mar '11 Email user
My submissions have a log+1 average of .21 and a days average of .26 (.466 approx). I am using about 75 variables; the most important (in descending order) are Gender, pay delay transformations, Pregnancy, age, dsfs transformations, CharlsonIndex etc. My take is that most of these variables are highly collinear with the number of claims. I need to find some interactions which will further improve the model fit.
 
DanB's image Rank 2nd
Posts 58
Thanks 46
Joined 6 Apr '11 Email user

That's interesting, Karan.  You've either found better transformations or you have different other covariates... because dsfs and paydelay never proved useful (i.e. improved fit) for me.

 
Karan Sarao's image Posts 52
Thanks 2
Joined 14 Mar '11 Email user
DanB, it did get my model down to .466 from .468, playing around with the ID's now, no joy there...
 
DanB's image Rank 2nd
Posts 58
Thanks 46
Joined 6 Apr '11 Email user
That's a good improvement, Karan. Since you mentioned pregnancy: I'd point out that the data has the condition code PRGNCY for some males as well as for rather elderly folks. Since the code probably means something different for a 60-year-old dude than it does for a 20-something woman, I made interaction terms for male*pregnancy and over50*pregnancy. That improved my score by a similar amount.
Thanked by Karan Sarao , and PatternEngine
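
A minimal sketch of the male*pregnancy and over50*pregnancy terms described in the post above. The field names (`Sex`, `Age`, `nPRGNCY`) and a numeric age are assumptions made for illustration; if age comes in buckets, the over-50 flag needs to be derived from the bucket label instead.

```python
def add_pregnancy_interactions(member):
    """Return a copy of the member's feature dict with two interaction terms added."""
    row = dict(member)
    pregnancy = 1 if row.get("nPRGNCY", 0) > 0 else 0   # any PRGNCY-coded claim
    male = 1 if row.get("Sex") == "M" else 0
    over50 = 1 if row.get("Age", 0) > 50 else 0
    row["male_x_pregnancy"] = male * pregnancy
    row["over50_x_pregnancy"] = over50 * pregnancy
    return row

# Made-up members.
young_woman = add_pregnancy_interactions({"Sex": "F", "Age": 25, "nPRGNCY": 3})
older_man = add_pregnancy_interactions({"Sex": "M", "Age": 60, "nPRGNCY": 1})
```

In a linear model these terms let the PRGNCY coefficient differ for the groups where the code likely means something else.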
 
Karan Sarao's image Posts 52
Thanks 2
Joined 14 Mar '11 Email user
Thanks for the tip, DanB. I have a more fundamental question about interactions like the one you mentioned. It makes perfect sense to try to capture these effects for GLM-based models. For me, GLM models flatlined at .469-.470, so I switched to non-parametric techniques like Random Forests. Since these are decision-tree based, an interaction like the one you mentioned should, if it is significant, be detected and incorporated automatically. Is this a safe assumption, or should I be actively creating interaction variables and checking them out? One reason I can think of in the pro camp for creating interaction variables: I now have 175 variables and am running RF in R with 12-13 variables per split (the sqrt of the no. of variables) and 250 trees, under memory and computation time constraints, so I might have to drastically increase the number of trees to detect all the interesting interactions out there. In that case, once an intelligent modeling dataset is created, it's only a matter of brute force, if RF can actually pick up interactions robustly. For DSFS and Pay Delay I have created basic Min/Max and Averages per MemberID over Y1 and Y2 separately and used them alone in a Logit model. ROC of .6, so they do have predictive power of their own. You can try it out!
 
Karan Sarao's image Posts 52
Thanks 2
Joined 14 Mar '11 Email user
Sorry for the big para above, but this text box just doesn't take para breaks, I don't know why!
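
For anyone wanting to try the per-member Min/Max/Average summaries of DSFS and Pay Delay mentioned above, here is a small sketch. It assumes the values have already been converted to numbers, and the keys `MemberID` and `PayDelay` are illustrative names. Run it once per year to keep Y1 and Y2 separate.

```python
from collections import defaultdict

def summarize(claims, field):
    """Return {member_id: {min, max, mean}} of a numeric claim field."""
    values = defaultdict(list)
    for claim in claims:
        value = claim.get(field)
        if value is not None:  # ignore claims where the field is missing
            values[claim["MemberID"]].append(value)
    return {
        member: {
            f"{field}_min": min(vals),
            f"{field}_max": max(vals),
            f"{field}_mean": sum(vals) / len(vals),
        }
        for member, vals in values.items()
    }

# Made-up Y1 claims.
y1 = [
    {"MemberID": 1, "PayDelay": 10},
    {"MemberID": 1, "PayDelay": 30},
    {"MemberID": 2, "PayDelay": 45},
    {"MemberID": 2, "PayDelay": None},
]
pay_delay_y1 = summarize(y1, "PayDelay")
```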
 
DanB's image Rank 2nd
Posts 58
Thanks 46
Joined 6 Apr '11 Email user

Karan:   I think you are correct that a RF should find important interactions if those interactions can be formed given the structure of your data. I'm still hand tuning a parametric model at this point.



Regarding dsfs and paydelay: I already have counts of claims in the model. I would expect dsfs to have explanatory power without controlling for claim count because it's a reasonable proxy for # of claims. If you have a lot of claims, you likely have some far apart as well as some that are close together. I just didn't find those had power controlling for everything else. Maybe we just have different controls.

I inserted paragraph breaks in this post using the link to the html editor, and typing the html control codes. There's probably an easier way. Maybe an admin will chime in on that.  :)

 
Karan Sarao's image Posts 52
Thanks 2
Joined 14 Mar '11 Email user

# of claims being collinear is something I have mentioned before, but I think breaking down the number of claims gives better insights.

 

Maybe you should test the effect of claims after controlling for DSFS, paydelay etc.  Also, I suspect that if Paydelay is large, that's because the payment is large, which in turn suggests a serious incident of whatever ailment the patient is suffering from.

 

Have you started using the PCP, Vendor ID's yet?

 
PatternEngine's image Posts 6
Thanks 3
Joined 5 Apr '11 Email user
I've started thinking about pcp, vendor etc but haven't gotten any benefit from them yet. In relation to Karan's earlier post, I also find that GLM doesn't get me much past 0.470. My best submissions so far have come from a neural net (although I've not yet tried RF). Looking at plots of the data, there is clearly nonlinear structure there, so this isn't super-surprising!
 
