cbusch's image Posts 7
Joined 31 Aug '11 Email user

Could anyone comment on why the leaders' solution transformed age at first claim into multiple binary variables?

For a gbm-based linear regression, it seems more appropriate to turn it into a single scale variable, using the median age of each band.

This is the code excerpt from the leaders' solution.

UPDATE Members SET age_05 = CASE WHEN ageATfirstclaim = '0-9' THEN 1 ELSE 0 END
UPDATE Members SET age_15 = CASE WHEN ageATfirstclaim = '10-19' THEN 1 ELSE 0 END
UPDATE Members SET age_25 = CASE WHEN ageATfirstclaim = '20-29' THEN 1 ELSE 0 END
UPDATE Members SET age_35 = CASE WHEN ageATfirstclaim = '30-39' THEN 1 ELSE 0 END
UPDATE Members SET age_45 = CASE WHEN ageATfirstclaim = '40-49' THEN 1 ELSE 0 END
UPDATE Members SET age_55 = CASE WHEN ageATfirstclaim = '50-59' THEN 1 ELSE 0 END
UPDATE Members SET age_65 = CASE WHEN ageATfirstclaim = '60-69' THEN 1 ELSE 0 END
UPDATE Members SET age_75 = CASE WHEN ageATfirstclaim = '70-79' THEN 1 ELSE 0 END
UPDATE Members SET age_85 = CASE WHEN ageATfirstclaim = '80+' THEN 1 ELSE 0 END
UPDATE Members SET age_MISS = CASE WHEN ageATfirstclaim IS NULL THEN 1 ELSE 0 END
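
For comparison, here is a minimal sketch of the single scale variable I have in mind, written in Python rather than SQL. The values 5, 15, ..., 85 are band midpoints that mirror the age_XX column suffixes; they stand in for the true per-band medians, which I don't have. The value 85 for '80+' and the helper name age_to_scale are assumptions for illustration, not part of the leaders' solution.

```python
from typing import Optional

# Band label -> representative age. Midpoints mirror the age_XX column
# suffixes; 85 for the open-ended "80+" band is an assumption.
AGE_MIDPOINT = {
    "0-9": 5, "10-19": 15, "20-29": 25, "30-39": 35, "40-49": 45,
    "50-59": 55, "60-69": 65, "70-79": 75, "80+": 85,
}

def age_to_scale(band: Optional[str]) -> Optional[int]:
    """Return one numeric age per band, or None when the band is missing."""
    if band is None:
        return None
    return AGE_MIDPOINT[band]

print(age_to_scale("40-49"))  # 45
```

Missing ages stay missing here; how to fill them in is a separate decision.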


 
Chris Raimondi's image Rank 38th
Posts 194
Thanks 90
Joined 9 Jul '10 Email user

Keep in mind that when people are organizing data, they may write code that is friendly to multiple methods. I don't know if that is the case here, but I know I do things like that for that purpose.

 
Signipinnis's image Posts 94
Thanks 25
Joined 8 Apr '11 Email user

It's a diversion to lead us up a false trail.

The day after Milestone 3, they're letting loose with the real good stuff.

 
S.U.T.'s image Posts 43
Thanks 7
Joined 5 Sep '11 Email user

I'm no expert, but here's my take: this is the same as an unordered categorical representation of age, as opposed to the classic numerical variable (1 to 10). With categorical age, gbm can split each indicator only at "<1" versus ">0", whereas a numerical age could be split at "<1", "<2", "<3", ... "<9".

As such, any nodes below an age split on the categorical variable would have training information ONLY for members in the same age bracket.

In some situations, we can see how this would make sense: a 5-year-old's likelihood of, say, Infection would not be appropriately calculated with data on 45-year-olds. Then the Age_05 idea makes sense.

However, in other situations, you probably lose information: now let's consider a 75-year-old's likelihood of AMI. Here, you probably could learn something from the experience of 65-year-olds. Then splitting AgeNumerical at <6 would make sense.
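
To make the difference concrete, here is a rough Python sketch that lists which groups of age bands a single tree split can separate under each representation. The function names are made up for illustration, and this only looks at one split in isolation; a real gbm tree can still combine several indicator splits in sequence.

```python
BANDS = ["0-9", "10-19", "20-29", "30-39", "40-49",
         "50-59", "60-69", "70-79", "80+"]

def indicator_splits(bands):
    """One split per 0/1 column: a single band versus all the others."""
    return [({b}, set(bands) - {b}) for b in bands]

def ordinal_splits(bands):
    """One split per threshold: all younger bands versus all older bands."""
    return [(set(bands[:i]), set(bands[i:])) for i in range(1, len(bands))]

# Can one split isolate everyone aged 60 and over?
older = {"60-69", "70-79", "80+"}
print(any(right == older for _, right in ordinal_splits(BANDS)))  # True
print(any(older in (left, right)
          for left, right in indicator_splits(BANDS)))            # False
```

So with the indicator columns, the 65- and 75-year-olds only end up pooled in one node if the tree never splits on either of their columns.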

 
S.U.T.'s image Posts 43
Thanks 7
Joined 5 Sep '11 Email user

Thought about this some more...possibly better logic would be:

age_05 = (AgeAtFirstClaim = "0-9" OR AgeAtFirstClaim IS NULL)
...
...
age_85 = (AgeAtFirstClaim = "80+" OR AgeAtFirstClaim IS NULL)
age_MISS = (AgeAtFirstClaim IS NULL)

As those with a missing age DO in fact have an age, we're just not given it. So we assume it is equally likely to fall anywhere over 5-85, but note our uncertainty about the matter in the age_MISS var.
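
A minimal Python sketch of that logic, in case the pseudocode is unclear. The column names follow the leaders' SQL; the function name encode_age and the dictionary layout are my own assumptions.

```python
from typing import Dict, Optional

# Age band label -> indicator column name, as in the leaders' SQL.
BAND_TO_COLUMN = {
    "0-9": "age_05", "10-19": "age_15", "20-29": "age_25",
    "30-39": "age_35", "40-49": "age_45", "50-59": "age_55",
    "60-69": "age_65", "70-79": "age_75", "80+": "age_85",
}

def encode_age(band: Optional[str]) -> Dict[str, int]:
    """Set a band's indicator when the band matches OR the age is missing."""
    missing = band is None
    row = {col: int(missing or band == b) for b, col in BAND_TO_COLUMN.items()}
    # age_MISS still records that the age was never observed.
    row["age_MISS"] = int(missing)
    return row

print(encode_age("0-9")["age_05"], encode_age("0-9")["age_MISS"])  # 1 0
print(sum(encode_age(None).values()))                              # 10
```

A member with a known age gets exactly one indicator set, while a member with a missing age gets all nine band indicators plus age_MISS.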

 
Signipinnis's image Posts 94
Thanks 25
Joined 8 Apr '11 Email user

To be more serious, I've done the same, to be more consistent with std actuarial risk tables .. not that I'm using any of those, but actuaries tend to have a professionally tuned take on risk ... and assuming that some age*disease effects are distinctly non-linear, so maybe using age categories will allow sharper discontinuities and bends with some algorithms. Haven't explicitly tested that with this data yet, however.

 
