Can I boost your score on my way out?

DanB's image Rank 2nd
Posts 58
Thanks 46
Joined 6 Apr '11 Email user

I've moved on to new projects.  As a last experiment, I want to see if I can significantly improve your score by incorporating my predictions.  I've read about boosting, but haven't tried it yet.  I want to send someone my predictions and see if they are useful to them.  If this helped someone get in the money, I wouldn't ask for any part of it.

-----

Overview of my algorithm:

My algorithm was fairly straightforward, and I think it was different from what most people here are using. I created variables from the data that I thought would be predictive.  I then ran an OLS regression on the training data:

DIH=beta*variables

I used the fitted values from that regression as an index of predicted health usage. I ran a very simple non-parametric estimator to map the index to predictions that minimizes rmsle.
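
A minimal sketch of this two-stage idea, under stated assumptions: the post does not say which non-parametric estimator was used, so the quantile-binned map below is a stand-in, and the function names and the bin count are hypothetical. It relies on one known fact: within a bin, the constant prediction that minimizes RMSLE is exp(mean(log(1 + y))) - 1.

```python
import numpy as np

def fit_ols(X, y):
    """Stage 1: least-squares fit of y on X plus a constant."""
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta

def ols_index(beta, X):
    """Fitted values from stage 1, used only as a ranking index."""
    X1 = np.column_stack([np.ones(len(X)), X])
    return X1 @ beta

def fit_bin_map(index, y, n_bins=10):
    """Stage 2 (assumed form): quantile bins of the index, one constant per bin.

    Each bin stores expm1(mean(log1p(y))), the RMSLE-minimizing constant.
    """
    edges = np.quantile(index, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.searchsorted(edges, index, side="right")
    # Fallback for empty bins: the RMSLE-optimal constant over all data.
    values = np.full(n_bins, np.expm1(np.mean(np.log1p(y))))
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            values[b] = np.expm1(np.mean(np.log1p(y[mask])))
    return edges, values

def predict_bin_map(edges, values, index):
    """Look up the stored constant for the bin each index value falls in."""
    return values[np.searchsorted(edges, index, side="right")]

def rmsle(pred, actual):
    """Root mean squared logarithmic error."""
    return float(np.sqrt(np.mean((np.log1p(pred) - np.log1p(actual)) ** 2)))
```

In practice the bin map would be fitted on training data and applied to the index computed for held-out members; fitting and scoring on the same rows overstates the gain.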

-------

What I was going to do next (In case anyone cares):

I'd like to include quite a few more variables (e.g. more dummy vars for specific vendors, more interaction terms), but I think I have a method to reduce overfitting when I do so.  I would have included these variables in a multi-level estimation framework that shrinks imprecise estimates towards group means.  I was going to use methods from Gelman and Hill's book.  This incorporates "regression to the mean" to reduce overfitting.  I was going to implement this in PyMC, but you could do it in R too.  I thought this was a really good idea (and I thought it was the big advantage of using a regression in the first stage rather than random forests.)  I don't have time to follow it through, but hopefully the idea interests someone.
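
To illustrate what "shrinking imprecise estimates towards group means" does, here is a rough sketch. It is not the multi-level model from Gelman and Hill, and it does not use PyMC; it is a simpler precision-weighted average, and `shrunken_group_means` and `prior_strength` are hypothetical names introduced for the example.

```python
from collections import defaultdict

def shrunken_group_means(groups, values, prior_strength=20.0):
    """Pull each group's mean toward the grand mean.

    The weight on a group's own mean is n / (n + prior_strength), so a
    group with few observations stays close to the grand mean.
    prior_strength is an assumed tuning constant, not a value from the post.
    """
    grand_mean = sum(values) / len(values)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for g, v in zip(groups, values):
        sums[g] += v
        counts[g] += 1
    shrunk = {}
    for g, n in counts.items():
        weight = n / (n + prior_strength)
        shrunk[g] = weight * (sums[g] / n) + (1.0 - weight) * grand_mean
    return shrunk
```

A vendor with two claims and an extreme raw mean ends up near the overall mean, while a vendor with hundreds of claims keeps roughly its own mean. That is the overfitting protection the paragraph above describes.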

-----

How to take me up on the offer:

If kaggle says I can make my predictions or my code publicly available, I'll do so.  I cleaned the data in stata and did estimation in python.  If I'm only allowed to give it to one team, I'd like to see if it helps someone that already has a better algorithm than me.  Drop me a line though.

I'm out... have fun predicting.

Thanked by Heuristic , Signipinnis , and Doron Rippel
 
B Yang's image Rank 2nd
Posts 195
Thanks 46
Joined 12 Nov '10 Email user

You have some strong linear models; I'd certainly like to know what your variables are.

 
DanB's image Rank 2nd
Posts 58
Thanks 46
Joined 6 Apr '11 Email user

B,

My linear model has quite a few vars. If I don't hear an objection from kaggle about it, I will clean up my source code, put together a better explanation, and make my work available as a zip file. I thought this was an interesting project, so let me know if you have any questions or want to chat about anything.

Right now, I only have time to type up a list of what is in the model:

counts of the number of charlson1 charlson2 charlson3 and charlson4 observations
whether the charlson1 charlson2 charlson3 and charlson4 observations in the previous year are positive.
the charlson index of the last claim in previous year
agemale dummies and agefemale dummies/intercepts (omitted group is ageMissing)
Number of days for each specialty
A count of observations and total number of days in in-patient hospital
a count for claims in each other place of service
counts of claims for each primaryConditionGroup
Counts for Inpatient claims in each primary condition group with at least 50,000 claims and 5,000 in-patient claims
claimstruncated

of claims for each procedure

lagged Days in hospital (imputed based on gender and age if not available)
dummy for lagged days in hospital == 0
Dummy for lagged days in hospital >=10
pregnant
pregnant*labTests
pregnant*whether baby was delivered (based on in-patient stay)
pregnant*tests*whether baby was delivered
preg*Male
preg*age==20-29
preg*age==30-39
preg*age==40-49
preg*age==50-59
preg*age==60+
preg*ageMissing
outpatient claims for primaryconditiongroup msc2a3
peds claims for primaryconditiongroup msc2a3
suplos
number of claims for primary condition group codes and procedure combinations with many observations: I converted primary condition group and procedure groups to numbers based on alphabetical order. The list is
pcg 3 procedures 2, 3, 5 and 13 (this makes 4 variables)
pcg 12, procedures 2,4, 5,8
pcg 20 procedure 4
pcg 23 procedures 2,3,4,7
pcg 27 procedures 2,3,4,5,7
pcg 28 procedures 2,3
pcg 38, procedures 2, 3, 5
pcg 42 procedure 11
pcg 44 procedure 3
Counts of claims for each specialty
count for place of service==in-patient
counts of claims with each pcp who had

Counts of claims for a group of vendors and pcp's that had enough observations to estimate this precisely. Those turned out to be (ignore the trailing Cou and Co characters)
pcp1303Cou
pcp2136Cou
pcp2448Cou
pcp2469Cou
pcp3394Cou
pcp4025Cou
pcp4313Cou
pcp4523Cou
pcp5300Cou
pcp9524Cou
pcp10164Co
pcp11148Co
pcp13281Co
pcp16757Co
pcp18175Co
pcp18880Co
pcp20090Co
pcp20893Co
pcp21146Co
pcp21579Co
pcp22193Co
pcp23056Co
pcp26051Co
pcp27467Co
pcp30569Co
pcp30870Co
pcp32724Co
pcp33193Co
pcp33303Co
pcp33843Co
pcp35565Co
pcp35832Co
pcp36452Co
pcp36955Co
pcp36990Co
pcp37301Co
pcp37759Co
pcp37796Co
pcp38110Co
pcp38583Co
pcp38762Co
pcp39372Co
pcp39946Co
pcp40607Co
pcp41370Co
pcp42381Co
pcp43790Co
pcp44164Co
pcp44537Co
pcp46162Co
pcp46795Co
pcp47414Co
pcp48905Co
pcp51763Co
pcp56126Co
pcp59950Co
pcp62284Co
pcp62871Co
pcp63771Co
pcp64709Co
pcp70119Co
pcp70171Co
pcp70222Co
pcp70553Co
pcp70686Co
pcp71040Co
pcp71847Co
pcp72000Co
pcp72351Co
pcp73550Co
pcp73982Co
pcp74354Co
pcp75037Co
pcp75876Co
pcp76634Co
pcp77134Co
pcp78718Co
pcp80381Co
pcp80533Co
pcp81146Co
pcp82373Co
pcp86472Co
pcp86510Co
pcp86658Co
pcp86723Co
pcp87960Co
pcp88511Co
pcp88661Co
pcp89127Co
pcp90868Co
pcp91972Co
pcp92411Co
pcp93075Co
pcp94201Co
pcp94891Co
pcp96614Co
pcp98627Co
pcp98900Co
pcp99068Co
pcp99196Co
vendor9717
vendor2610
vendor3194
vendor3556
vendor6476
vendor1110
vendor1224
vendor1403
vendor1526
vendor1648
vendor2400
vendor2518
vendor2536
vendor2862
vendor3066
vendor3274
vendor3698
vendor4254
vendor4725
vendor4913
vendor4962
vendor5054
vendor5597
vendor5606
vendor6178
vendor7063
vendor7850
vendor7912
vendor9722
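
A sketch of how count variables like the pcp and vendor ones above could be built, keeping only codes with enough claims. The post gives no cutoff for "enough observations", so `min_claims` is an assumption, as are the record layout (dicts with a `member_id` key) and the function name.

```python
from collections import Counter, defaultdict

def count_features(claims, key, min_claims):
    """Per-member claim counts for each frequent value of `key`.

    claims: list of dicts, each with 'member_id' and `key` (e.g. 'pcp').
    Values of `key` with fewer than min_claims claims overall are dropped.
    Returns {member_id: {"<key><value>": count}}.
    """
    totals = Counter(c[key] for c in claims)
    kept = {value for value, n in totals.items() if n >= min_claims}
    features = defaultdict(Counter)
    for c in claims:
        if c[key] in kept:
            features[c["member_id"]][f"{key}{c[key]}"] += 1
    return {member: dict(counts) for member, counts in features.items()}
```

Members whose claims all involve rare codes get no entry; when the design matrix is assembled, those cells would be filled with zeros.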

Thanked by Chris Raimondi , Sarkis , B Yang , Shashi Godbole , andywocky , and 3 others
 
B Yang's image Rank 2nd
Posts 195
Thanks 46
Joined 12 Nov '10 Email user

Wow, now I don't feel so bad for using 50 variables.

 
DanB's image Rank 2nd
Posts 58
Thanks 46
Joined 6 Apr '11 Email user

I don't know how many variables are on my list... but if I'd continued working on this, I'd aim to have a lot more.

The OLS regression takes under 10 seconds to run on my laptop.  So computation is a non-issue.  Maybe some of those vendor codes are overfitting, but I think the hierarchical model (or any shrinkage estimator) would reduce overfitting.    

Of course, it's easy to talk about "what I would have done" as I walk out the door ;)

 
JJJ's image
JJJ
Posts 43
Thanks 8
Joined 9 Apr '11 Email user

Hi DanB. Congrats on just shooting up the leaderboard. Are you still on the way out? (If yes, care to share what you did?)

 
Sarkis's image Posts 41
Thanks 5
Joined 5 Apr '11 Email user

Winners never quit. I like the fact that DanB is not using random variables, and plus, he is a fellow Pythonista.

 
Uri Blass's image Posts 253
Thanks 4
Joined 5 Aug '10 Email user

I do not understand part of Dan's variables:

1) Charlson1,2,3,4 (there are no 1,2,3,4, so I guess he means the 4 different options for the Charlson index)

2) age male dummies and age female dummies but not age missing?

Does it mean that he codes every age/sex combination as 0 or 1, except for people with missing age or missing gender, whose age and gender he does not use to predict the result?

3) a count of observations and total number of days in in-patient hospital (what is the meaning of total number of days? It could be length of stay, or it could be days in hospital in the previous year)

I guess that observations mean claims.

4)lagged days in hospital(does it mean days in hospital in previous year?)

5)What is the difference between Pregnantwhether baby was delivered and Pregnanttestswhether baby was delivered?

6)preg*ageMissing(does it mean preg and age missing)?

Thanked by JLDml
 
Mark Waddle's image Posts 32
Thanks 6
Joined 28 Mar '11 Email user

Hi Dan,

Thank you for sharing your work. What is the "lagged days in hospital"?

EDIT: I think I have figured it out to be the DIH for that year (as opposed to DIH for the next year). Thanks again!

Mark

 
DanB's image Rank 2nd
Posts 58
Thanks 46
Joined 6 Apr '11 Email user

Uri,

My apologies.  I cut and pasted parts of that list, so it didn't contain much explanation:

 

1) Charlson1,2,3,4 (there are no 1,2,3,4, so I guess he means the 4 different options for the Charlson index)

That's correct. 

2)age male dummies and age female dummies but not age missing?

I have a constant in the regression.  In this case, that represents the age missing group.  I could have included age missing and excluded the constant.

3) a count of observations and total number of days in in-patient hospital (what is the meaning of total number of days? It could be length of stay, or it could be days in hospital in the previous year)

The count is the number of claims.  The days in in-patient hospital is the sum of the days for individual claims. 

4)lagged days in hospital(does it mean days in hospital in previous year?)

Yes

5)What is the difference between Pregnantwhether baby was delivered and Pregnanttestswhether baby was delivered?

The latter is pregnant*# lab tests*delivered.  The first variable doesn't include the lab tests term.

6)preg*ageMissing(does it mean preg and age missing)?

Yes
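
On point 2, a small check (synthetic data, hypothetical group labels) that the two parametrizations are equivalent: a constant plus dummies for every group except the omitted one gives the same fitted values as all dummies with no constant. In the first form the constant is the mean of the omitted group.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic example: group 0 stands for "age missing", 1 and 2 for two age groups.
group = rng.integers(0, 3, size=200)
y = np.array([1.0, 2.0, 4.0])[group] + rng.normal(scale=0.1, size=200)

dummies = np.eye(3)[group]  # one 0/1 column per group

# Parametrization A: constant + dummies for groups 1 and 2 (group 0 omitted).
XA = np.column_stack([np.ones(len(y)), dummies[:, 1:]])
beta_a, *_ = np.linalg.lstsq(XA, y, rcond=None)

# Parametrization B: no constant, one dummy per group including group 0.
XB = dummies
beta_b, *_ = np.linalg.lstsq(XB, y, rcond=None)

# Same fit; only the meaning of the coefficients differs.
same_fit = np.allclose(XA @ beta_a, XB @ beta_b)
```

Here `beta_a[0]` matches `beta_b[0]` (the mean of the omitted group), and `beta_a[0] + beta_a[1]` matches `beta_b[1]`.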

 

Mark:

You are correct.  It is the days in hospital listed for the year before we want to predict.

Thanked by Sarkis , Signipinnis , Uri Blass , Mark Waddle , and JLDml
 
Uri Blass's image Posts 253
Thanks 4
Joined 5 Aug '10 Email user

Dan, another question about variables that refer to the previous year, like the Charlson index of the last claim in the previous year:

what do you do when there are no claims in the previous year?

 

Some of the people we need to predict have no claims in the previous year: we may need to predict days in hospital in year 3 for people who have claims in year 2 but not in year 1.

In that case the Charlson index of the last claim in the previous year is not defined, and I wonder if you try to estimate it (for example, based on the Charlson index of the first claim in year 2 instead of the Charlson index of the last claim in year 1).

Note also that we do not always know which claim is the last claim, and I wonder what you do when there is more than one candidate with different Charlson indexes (I did not check whether this is a problem).

 

 
DanB's image Rank 2nd
Posts 58
Thanks 46
Joined 6 Apr '11 Email user

Uri,

When I said previous year, that may have been confusing.  I meant the year of the claims (the year prior to what we are predicting).  If I am predicting Y3, I want to know whether the last claim in Y2 was "serious."   In this sense, perhaps I should have said current year.  That resolves the problem of lacking data, since we only estimate y[n] when there are claims in y[n-1].  

If there are multiple claims tied for last based on dsfs, I base this variable on the highest charlson index for those claims. I'm trying to capture health towards the end of the year.  If someone has a claim of severity 1 and a claim of severity 3, their condition is at least as bad as someone who has only a single claim of severity 3 in that month.
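
The tie-breaking rule can be sketched as follows. The input layout (a month number derived from dsfs plus a Charlson label per claim) and the exact label strings are assumptions made for the example.

```python
# Ordinal ranks for the four Charlson index categories (label strings assumed).
CHARLSON_RANK = {"0": 0, "1-2": 1, "3-4": 2, "5+": 3}

def last_claim_charlson(claims):
    """Charlson rank of the last claim in a member-year.

    claims: list of (dsfs_month, charlson_label) tuples for one member-year.
    If several claims tie for the latest dsfs_month, the highest rank wins.
    Returns None when there are no claims.
    """
    if not claims:
        return None
    last_month = max(month for month, _ in claims)
    return max(
        CHARLSON_RANK[label] for month, label in claims if month == last_month
    )
```

For example, a member with claims `(1, "5+")`, `(11, "0")` and `(11, "3-4")` gets rank 2: only the two month-11 claims count, and the more severe one is used.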

Thanked by Uri Blass
 
Uri Blass's image Posts 253
Thanks 4
Joined 5 Aug '10 Email user

Thanks for replying.

I am still not sure what "days for individual claims" means in your previous reply.

I guess that you simply use the LengthOfStay column, and that you use some estimate when there is no exact number, only information like 1-2 weeks.

I wonder what you do when SupLOS=1, which means that we have practically no information about length of stay.

I already have a function that counts the number of claims for every specific condition, like PrimaryConditionGroup=="AMI", which I used in my previous submissions.

I guess that another productive function would count the number of days, but I am not sure what the content of that function should be.

I plan to write a list of the variables that you use, based on your post, and I will start in this post

1) counts of the number of charlson1 charlson2 charlson3 and charlson4 observations means 4 variables

namely in my code:

right.b$num0
# number of cases in year 2 where the Charlson index is 0
right.b$num1to2
# number of cases in year 2 where the Charlson index is 1 to 2
right.b$num3to4
# number of cases in year 2 where the Charlson index is 3 to 4
right.b$num5plus
# number of cases in year 2 where the Charlson index is 5+

2) whether the charlson1 charlson2 charlson3 and charlson4 observations in the previous year are positive means 4 additional logical variables (0 if false, 1 if true)

namely
right.b$num0>0
right.b$num1to2>0
right.b$num3to4>0
right.b$num5plus>0

3) the charlson index of the last claim in previous year means 1 variable: the maximal Charlson index of the claims in the last month of the year (I still have no code for it)

4) agemale dummies and agefemale dummies/intercepts (omitted group is ageMissing) means 2 logical variables

right.b$Sex=="M"
right.b$Sex=="F"

(Initially I thought that you have a variable for every age/male group, but it seems that is not the case: then you would also have many age-missing groups, and you could not absorb the missing-age group into a constant.)

5) Number of days for each specialty

I guess that it is not the number of claims for each specialty, and that you use some estimate for the number of days.

You have 13 different variables for the different options of specialty, including the empty specialty


 

Thanked by JLDml
 
Uri Blass's image Posts 253
Thanks 4
Joined 5 Aug '10 Email user

another question
what do the blue words "of claims for each procedure" after claimstruncated mean?

I understand that it is about procedure group but
it is in a different line so I am not sure if it is related to claims truncated or to the next lines.

 
DanB's image Rank 2nd
Posts 58
Thanks 46
Joined 6 Apr '11 Email user

Uri,

Your code looks right to me.  

# of days is from length of stay.  I picked an integer whenever I was given a range.  For example, 1-2 weeks was 10 days.  
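
A sketch of that conversion, for illustration only. The single value taken from the post is 1-2 weeks = 10 days. The other range values, the label strings, and the function name are placeholder assumptions, and blank values return None because the thread does not say how missing or suppressed lengths of stay were handled.

```python
import re

# Only "1-2 weeks" -> 10 is stated in the post; the other entries are
# placeholder assumptions showing where further ranges would go.
RANGE_TO_DAYS = {
    "1-2 weeks": 10,  # stated in the post
    "2-4 weeks": 21,  # assumption
    "4-8 weeks": 42,  # assumption
}

def length_of_stay_days(label):
    """Convert a length-of-stay label to an integer number of days.

    Returns None for a blank label (handling of missing values is left open).
    Raises ValueError for labels this sketch does not recognize.
    """
    # Normalize case and spacing around the hyphen, e.g. "1- 2 weeks".
    text = re.sub(r"\s*-\s*", "-", label.strip().lower())
    if not text:
        return None
    if text in RANGE_TO_DAYS:
        return RANGE_TO_DAYS[text]
    match = re.fullmatch(r"(\d+) days?", text)
    if match:
        return int(match.group(1))
    raise ValueError(f"unrecognized length of stay: {label!r}")
```

Summing these per-claim values over a member's in-patient claims would give the "total number of days" variable discussed earlier in the thread.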

The stuff in blue is supposed to be 

# of claims for each procedure group.

 

Not sure what happened with the formatting to cut off some words and make it a big blue font. 

Thanked by Uri Blass
 