Cases missing Sex and Age code in release 3

Matt Fornari's image Rank 92nd
Posts 7
Joined 1 Jan '11 Email user

Just started working on this so sorry if this has been addressed already.

Why are so many cases missing Sex codes?

Around 15% are missing, and it seems like an important and easily discerned variable.

Is there a procedural reason they are missing?

Additionally, is there any reason behind cases missing Age codes?

Missing cases for both seem like significant predictors.

 
boooeee's image Rank 49th
Posts 18
Thanks 2
Joined 4 Apr '11 Email user
I'm curious how people are handling this in their modelling. Currently, I'm treating blank gender and ages as their own category (basically, each age band has three categories: male, female, and blank). But that's more laziness than anything else. Would it make sense to randomly assign gender when the field is blank (for training purposes)?
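
For anyone wanting to try the "blank as its own category" coding, here is a minimal sketch in base R. It assumes a `members` table with factor columns `Sex` and `AgeAtFirstClaim`; the toy values, the "U" label and the `AgeSex` column name are made up for illustration and are not from the contest data.

```r
# Toy stand-in for the members table (values are made up).
members <- data.frame(
  Sex = factor(c("M", NA, "F", NA)),
  AgeAtFirstClaim = factor(c("30-39", "30-39", NA, "80+"))
)

# addNA() makes NA an explicit factor level; relabel it "U" (unknown).
sex <- addNA(members$Sex)
levels(sex)[is.na(levels(sex))] <- "U"
age <- addNA(members$AgeAtFirstClaim)
levels(age)[is.na(levels(age))] <- "U"

# One combined category per (age band, sex) pair, including the blanks.
members$AgeSex <- interaction(age, sex, sep = ":", drop = TRUE)
print(members)
```

Whether the combined factor beats two separate main effects probably depends on the model, so it is worth comparing both.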
 
Chris Raimondi's image Rank 38th
Posts 194
Thanks 90
Joined 9 Jul '10 Email user
I'm doing the same thing. Except I call it "U" instead of blank :). I seriously doubt it would help to assign them randomly, but I haven't tried.
 
Matt Fornari's image Rank 92nd
Posts 7
Joined 1 Jan '11 Email user
I'm treating it as a separate factor level. It just seems weird that there would be so many cases missing, especially for sex. I wonder if this is due to some kind of obfuscation done by the contest administrators or to participation bias in a survey that provided those variables for the membership data. The participation bias idea seems unlikely, as you would expect a higher correlation between missing sex and age codes.
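
One way to check how strongly the two kinds of missingness go together is a simple cross-tabulation. This is only a sketch, assuming a `members` table with `Sex` and `AgeAtFirstClaim` columns; the toy rows below are made up and say nothing about the real data.

```r
# Toy stand-in for the members table (values are made up).
members <- data.frame(
  MemberID = 1:8,
  Sex = c("M", NA, "F", NA, "F", "M", NA, "F"),
  AgeAtFirstClaim = c("30-39", NA, "80+", "60-69", NA, "0-9", "70-79", "20-29"),
  stringsAsFactors = TRUE
)

sex_missing <- is.na(members$Sex)
age_missing <- is.na(members$AgeAtFirstClaim)

# 2x2 table of joint missingness.
joint <- table(SexMissing = sex_missing, AgeMissing = age_missing)
print(joint)

# Share of members with missing age, split by whether sex is missing.
p_age_given_sex_missing <- mean(age_missing[sex_missing])
p_age_given_sex_known <- mean(age_missing[!sex_missing])
```

If the two shares come out close on the real table, the codes are missing more or less independently, which would argue against a single survey non-response mechanism.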
 
Signipinnis's image Posts 94
Thanks 25
Joined 8 Apr '11 Email user

I am concerned that the oddly high number of missing values for sex and age is a symptom that our "member" data isn't truly from a single person, and that all the covered persons under a family plan have been lumped together (i.e., attributed to the "subscriber").

Which would be a major monkey wrench for those trying to apply subject knowledge to the analysis.

Official comment?

 
mentula's image Posts 4
Joined 5 Apr '11 Email user

Besides treating missing values as a distinct level, I am going to try imputing sex=female for a subset of missing based on Primary Condition and Specialty that skew heavily (> 90%) toward women among members with known sex.  It adds noise to my female level but I figure that the benefit of reducing the size of the sexless pool is worth it. 
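
A rough sketch of that imputation in base R, in case it helps anyone compare notes. It assumes a `claims` table with `MemberID`, `PrimaryConditionGroup` and `Specialty` columns and a `members` table with `Sex`; the toy rows, the group labels and the `SexImputed` column name are made up for illustration.

```r
# Toy data (all values are made up).
members <- data.frame(
  MemberID = 1:6,
  Sex = c("F", "F", "M", NA, NA, "F"),
  stringsAsFactors = FALSE
)
claims <- data.frame(
  MemberID = c(1, 2, 6, 4, 3, 5, 1),
  PrimaryConditionGroup = c("PRG", "PRG", "PRG", "PRG", "MSC", "MSC", "MSC"),
  Specialty = c("OB", "OB", "OB", "OB", "GP", "GP", "GP"),
  stringsAsFactors = FALSE
)

# Attach Sex to every claim, then measure the female share per
# (condition, specialty) group among claims from members with known sex.
cl <- merge(claims, members, by = "MemberID")
known <- cl[!is.na(cl$Sex), ]
known$is_female <- as.numeric(known$Sex == "F")
share <- aggregate(is_female ~ PrimaryConditionGroup + Specialty,
                   data = known, FUN = mean)

# Groups that skew heavily (> 90%) toward women.
skewed <- share[share$is_female > 0.9, c("PrimaryConditionGroup", "Specialty")]

# Members with missing sex and at least one claim in a skewed group.
hits <- merge(cl[is.na(cl$Sex), ], skewed,
              by = c("PrimaryConditionGroup", "Specialty"))

# Keep the original column and write the imputed values to a new one.
members$SexImputed <- members$Sex
members$SexImputed[is.na(members$Sex) & members$MemberID %in% hits$MemberID] <- "F"
print(members)
```

On real data a minimum group size would be sensible before trusting the 90% figure, since small groups hit 100% by chance.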

 
Chris Raimondi's image Rank 38th
Posts 194
Thanks 90
Joined 9 Jul '10 Email user
It could also indicate members that have been around longer. It is possible that their database didn't originally record this information and they started collecting it later. Just a theory - not based on any real analysis.
 
Allan Engelhardt's image Posts 77
Thanks 29
Joined 28 May '10 Email user

mentula wrote:

Besides treating missing values as a distinct level, I am going to try imputing sex=female for a subset of missing based on Primary Condition and Specialty that skew heavily (> 90%) toward women among members with known sex.  It adds noise to my female level but I figure that the benefit of reducing the size of the sexless pool is worth it. 

Try it, but the missing gender members have much higher average DaysInHospital so I don't think that is the best approach.

library("data.table")
X <- members[dih] # Join data
X[, list(Y2=mean(DaysInHospital.Y2, na.rm = TRUE),
Y3=mean(DaysInHospital.Y3, na.rm = TRUE)), by = Sex]
##       Sex        Y2        Y3
## [1,] <NA> 0.7940364 0.7415792
## [2,]    F 0.3173265 0.2727760
## [3,]    M 0.2467464 0.2062700

Let us know how you get on with your approach.

 
Allan Engelhardt's image Posts 77
Thanks 29
Joined 28 May '10 Email user

s/maemever/member/g

The stupid software on this forum won't let me edit the original post (it "helpfully" deletes half of it when I hit edit).

 
Jose H. Solorzano's image Posts 103
Thanks 47
Joined 21 Jul '10 Email user
From what I've seen, a blank Sex is predictive.
 
Zach's image Rank 31st
Posts 292
Thanks 64
Joined 2 Mar '11 Email user

Jose H. Solorzano wrote:

From what I've seen, a blank Sex is predictive.

 

What about a blank age?

 
Allan Engelhardt's image Posts 77
Thanks 29
Joined 28 May '10 Email user

Zach wrote:

What about a blank age?

Very (roughly as predictive as being >80, see below), but there are fewer of them.

> X[, list(Y2=mean(DaysInHospital.Y2, na.rm = TRUE),
Y3=mean(DaysInHospital.Y3, na.rm = TRUE)), by = AgeAtFirstClaim]
      AgeAtFirstClaim        Y2         Y3
 [1,]            <NA> 0.8445606 0.78466132
 [2,]             0-9 0.1266727 0.11669313
 [3,]           10-19 0.1099656 0.09750859
 [4,]           20-29 0.2989004 0.23772650
 [5,]           30-39 0.2282903 0.20640498
 [6,]           40-49 0.1585916 0.18281673
 [7,]           50-59 0.2306431 0.20167608
 [8,]           60-69 0.4063201 0.37350411
 [9,]           70-79 0.6637561 0.57651325
[10,]             80+ 0.9396067 0.71882022
> sum(is.na(X$AgeAtFirstClaim))
[1] 5359
> sum(is.na(X$Sex))
[1] 16299


 
Allan Engelhardt's image Posts 77
Thanks 29
Joined 28 May '10 Email user

Just because I had submissions to burn: Using the simplest model on Sex and AgeAtFirstClaim gives you a public score of 0.478118 which currently would place you at # 107 on the leaderboard.  I did:

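library("data.table")
# Presumably provides the prepared tables used below (members, target)
load("hhp-p030/HHPR3-data-table.RData")
# Recode missing Sex and AgeAtFirstClaim as an explicit "U" (unknown) level
levels(members$Sex) <- union(levels(members$Sex), "U")
members$Sex[is.na(members$Sex)] <- "U"
levels(members$AgeAtFirstClaim) <- union(levels(members$AgeAtFirstClaim), "U")
members$AgeAtFirstClaim[is.na(members$AgeAtFirstClaim)] <- "U"
# NOTE: X is not defined in this snippet. It needs to be the training table (member data
# joined to the outcome) with the recoded factors and a DaysInHospital column.
model <- lm(log1p(DaysInHospital) ~ Sex + AgeAtFirstClaim, data = X)
# Attach member attributes to the submission template, then predict on the original scale
S <- merge(target, members, by = "MemberID", sort = FALSE, all.x = TRUE)
S$DaysInHospital <- expm1(predict(model, newdata = S))
S <- S[, 1:3]  # keep only the first three columns
write.csv(S, file = "sex-age-mean.csv", quote = FALSE, row.names = FALSE)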

(Wish I had more time for this competition.)

 
mentula's image Posts 4
Joined 5 Apr '11 Email user
I've looked at these missing Sex cases more closely, and I believe they may all be due to suppression. Missing values turn up most often among members present in both Y2 and Y3 DIH data. The claim trails are longer for members with missing values at every level of status, and DIH are higher as well.

Status     % No Sex  Avg Clms  Avg Clms   Avg DIH   Avg DIH
                        (Sex)  (No Sex)     (Sex)  (No Sex)
Y2              8.3       7.8      20.4      0.51      1.15
Y3              7.0       6.1      17.7      0.34      0.73
Y4              7.2       6.4      17.6
Y2+Y3          25.2      15.7      24.2      0.91      1.42
Y3+Y4          16.9      14.9      33.7      0.27      0.99
Y2+Y4          12.1      12.0      28.7      0.10      0.41
Y2+Y3+Y4       24.4      29.9      57.6      0.61      1.69
Total          16.4      16.8      41.5      0.44      1.35

If the missing values are all due to suppression, imputing sex = female for high probability cases would not be helpful unless a four-level Sex variable (Male, Female Non-Suppressed, Female Suppressed but Imputed, Other Suppressed) seemed meaningful.

I'm wondering whether anyone else is uncomfortable with the idea of using predictors that only address data suppression. The thread about Claim Truncation also touches on this issue. Exploiting regularities in the data suppression methodology does not seem to address the stated goal of the competition but I don't see how I can avoid going down that path....
 
mentula's image Posts 4
Joined 5 Apr '11 Email user
Sorry, my Excel table got scrunched in the previous post.
 