Cases missing Sex and Age code in release 3

Chris Raimondi's image Rank 38th
Posts 194
Thanks 90
Joined 9 Jul '10 Email user

...seemed meaningful. I'm wondering whether anyone else is uncomfortable with the idea of using predictors that only address data suppression


If it works for you - I would use it.  Data is messy - whatever you can do to help make it better I would consider.  I am trying all sorts of things - if it works I keep it, if not - I move on.

does not seem to address the stated goal of the competition

It might SEEM that way, but without doing these types of things - the real good stuff will remain hidden in the data.  Also - for 500k/3m - everyone and their brother is going to try whatever they can - you don't have a choice if you want to remain competitive.

Sorry, my Excel table got scrunched in the previous post.

FWIW - quick reply screws stuff up - the regular reply doesn't squish everything together.

Someone had a good post on how to post code (I think it was Jeff) - but I couldn't find it when I was looking for it the other day.  It would be cool if we could sticky that or put the link somewhere near the reply button.

Thanked by mentula
 
Uri Blass's image Posts 253
Thanks 4
Joined 5 Aug '10 Email user

Allan Engelhardt wrote:

mentula wrote:

Besides treating missing values as a distinct level, I am going to try imputing sex=female for a subset of missing based on Primary Condition and Specialty that skew heavily (> 90%) toward women among members with known sex.  It adds noise to my female level but I figure that the benefit of reducing the size of the sexless pool is worth it. 

Try it, but the missing gender members have much higher average DaysInHospital so I don't think that is the best approach.

library("data.table")
## X <- ...   (the right-hand side, which involved dih, was lost when the post was quoted)
X[, list(Y2 = mean(DaysInHospital.Y2, na.rm = TRUE),
         Y3 = mean(DaysInHospital.Y3, na.rm = TRUE)), by = Sex]
##       Sex        Y2        Y3
## [1,] 0.7940364 0.7415792
## [2,]    F 0.3173265 0.2727760
## [3,]    M 0.2467464 0.2062700

Let us know how you get on with your approach.
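The imputation idea quoted above (set sex to female for missing-sex members in groups that skew more than 90% female among members with known sex) could be sketched roughly as follows. This is only an illustration on made-up data: the `Group` column stands in for whatever Primary Condition / Specialty combination is used, and `SexImputed` and the "U" level are hypothetical names, not anything from the competition files.

```r
# Hypothetical toy data; Group stands in for a Primary Condition / Specialty key.
members <- data.frame(
  MemberID = 1:8,
  Sex      = c("F", "F", "F", "",  "F", "M", "M", ""),
  Group    = c("A", "A", "A", "A", "B", "B", "B", "B"),
  stringsAsFactors = FALSE
)

# Share of women per group, computed only over members with known sex.
known        <- members[members$Sex != "", ]
female_share <- tapply(known$Sex == "F", known$Group, mean)

# Groups that skew heavily (> 90%) toward women.
skewed <- names(female_share)[female_share > 0.9]

# Impute "F" for missing-sex members in skewed groups; the rest
# keep a distinct "U" (unknown) level.
impute <- members$Sex == "" & members$Group %in% skewed
members$SexImputed <- members$Sex
members$SexImputed[impute] <- "F"
members$SexImputed[members$SexImputed == ""] <- "U"

members
```

Keeping the original `Sex` column next to `SexImputed` makes it easy to compare a model with and without the imputation, which matters given the point above that the missing-sex members have a much higher average DaysInHospital.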

 

I do not understand your code.

Can you explain what dih is? (It is in the original post but, for some reason, not in the version I am quoting here.)

Note that I also found that people without sex have more days in hospital, but I do not get the same numbers as you.

Here is my code:

members hospital.y2 hospital.y3 colnames(hospital.y2) colnames(hospital.y3) aggmembersaggmembersright.a right.b
mean(right.b$DaysInHospitalY3.x[right.b$Sex==""])
mean(right.b$DaysInHospitalY3.x[right.b$Sex=="F"])
mean(right.b$DaysInHospitalY3.x[right.b$Sex=="M"])
mean(right.a$DaysInHospitalY2.x[right.a$Sex==""])
mean(right.a$DaysInHospitalY2.x[right.a$Sex=="F"])
mean(right.a$DaysInHospitalY2.x[right.a$Sex=="M"])
 
[1] 0.8540239
[1] 0.3691459
[1] 0.2927755
[1] 0.9292074
[1] 0.3990862
[1] 0.3199843
 
Uri Blass's image Posts 253
Thanks 4
Joined 5 Aug '10 Email user

I can add that I dislike the fact that R gives me DaysInHospitalY2.x and DaysInHospitalY2.y when both have the same content.

 
Chris Raimondi's image Rank 38th
Posts 194
Thanks 90
Joined 9 Jul '10 Email user

Allan,

My numbers match Uri's - I only checked year 3 (right.b)

agg agg
  Sex DaysInHospital
1   F      0.3691459
2   M      0.2927755
3   U      0.8540239

I broke it down by sum too -

agg agg
  Sex DaysInHospital
1   F          11713
2   M           7481
3   U          12087

can you double check your numbers? - you are much better at R than I am - so I may have screwed up somewhere.

Uri - the reason the .x and .y suffixes are added is that certain functions, when they "join" two tables that each have a column with the same name, rename those columns so they don't clash - you might want to try installing the "plyr" / "reshape" packages and use the join function instead of merge.  That will probably solve the reordering problem you had with NAs that I saw in another thread.
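As a small illustration of where the .x and .y suffixes come from, here is a sketch using base R's merge on made-up tables (the data frames `a` and `b` are hypothetical; only the column names echo the thread). When both tables carry a same-named column that is not part of the join key, merge keeps both copies and suffixes them; adding that column to `by` keeps a single copy.

```r
# Hypothetical toy tables; values are invented for illustration.
a <- data.frame(MemberID = 1:3, DaysInHospitalY2 = c(0, 2, 1))
b <- data.frame(MemberID = 1:3, DaysInHospitalY2 = c(0, 2, 1),
                Sex = c("F", "M", ""), stringsAsFactors = FALSE)

# Joining on MemberID only: the shared non-key column appears twice,
# suffixed .x (from a) and .y (from b).
m1 <- merge(a, b, by = "MemberID")
names(m1)

# Joining on both shared columns: a single DaysInHospitalY2 column remains.
m2 <- merge(a, b, by = c("MemberID", "DaysInHospitalY2"))
names(m2)
```

Another option is to drop the duplicated column from one of the tables before merging.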

 

 

EDIT: grrr - That is the second time it happened - code seems to disappear after I edit sometimes...

Here it is hopefully ...

agg <- aggregate(DaysInHospital ~ Sex, right.b, mean)
agg
  Sex DaysInHospital
1   F      0.3691459
2   M      0.2927755
3   U      0.8540239

agg <- aggregate(DaysInHospital ~ Sex, right.b, sum)
agg
  Sex DaysInHospital
1   F          11713
2   M           7481
3   U          12087


Thanked by Uri Blass
 
Uri Blass's image Posts 253
Thanks 4
Joined 5 Aug '10 Email user

Thanks Chris. Reading my posts, I see that the forum simply did not copy my code correctly for some reason, but your code is simpler.

Edit: Now I also do not see your code, but I understood that you use aggregate to show all the numbers.

 
Allan Engelhardt's image Posts 77
Thanks 29
Joined 28 May '10 Email user

Chris Raimondi wrote:

Allan,

My numbers match Uri's - I only checked year 3 (right.b)

agg agg
  Sex DaysInHospital
1   F      0.3691459
2   M      0.2927755
3   U      0.8540239

You are right, I was wrong.  I joined too much at once.

> members[dih.Y2][,list(mean=mean(DaysInHospital), sum=sum(DaysInHospital)), by = Sex]
     Sex      mean   sum
[1,]   F 0.3990862 13626
[2,]   M 0.3199843  8949
[3,]   U 0.9292074 12942
> members[dih.Y3][,list(mean=mean(DaysInHospital), sum=sum(DaysInHospital)), by = Sex]
     Sex      mean   sum
[1,]   F 0.3691459 11713
[2,]   M 0.2927755  7481
[3,]   U 0.8540239 12087

Fortunately the conclusion is valid.

@Uri: I use the data.table notation extensively which is different from the data.frame you are used to.

 
Matt Fornari's image Rank 92nd
Posts 7
Joined 1 Jan '11 Email user

So what is the official response to these anomalies in the data?

No comment?

 
Toulouse's image Posts 25
Joined 18 Mar '11 Email user

Allan Engelhardt wrote:

Just because I had submissions to burn: Using the simplest model on Sex and AgeAtFirstClaim gives you a public score of 0.478118 which currently would place you at # 107 on the leaderboard.  I did:

library("data.table")
load("hhp-p030/HHPR3-data-table.RData")
levels(members$Sex) members$Sex[is.na(members$Sex)] levels(members$AgeAtFirstClaim) members$AgeAtFirstClaim[is.na(members$AgeAtFirstClaim)] model S S$DaysInHospital S
write.csv(S, file = "sex-age-mean.csv", quote = FALSE, row.names = FALSE)

(Wish I had more time for this competition.)

 

Hello Allan,

 

What do you mean by linear model ?

Is it possible to write this linear model with a simple formula like DaysInHospital = f (Sex, AgeAtFirstClaim) ???

 

Thanks !

 
Allan Engelhardt's image Posts 77
Thanks 29
Joined 28 May '10 Email user

vladvad wrote:

Hello Allan,

 

What do you mean by linear model ?

Is it possible to write this linear model with a simple formula like DaysInHospital = f (Sex, AgeAtFirstClaim) ???

Yes.  That is what I did with

model <- lm(log1p(DaysInHospital) ~ Sex + AgeAtFirstClaim, data = X)

The lm function fits a linear model. The first argument is the model formula: A ~ B + C means that A depends on the two variables B and C (not that it is the sum) so you find x, y, z that minimizes \[ A - x \times B - y \times C - z \]. In R, I can do arbitrary transformations of the variables directly in the formula, which is why I have log1p(DaysInHospital). (log1p(x) is like log(1+x), but calculated in a way that is more accurate for small x.)

The Introduction to R that comes with the software covers some of this, or there is always help("lm") etc. from the R prompt.
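To make the lm call above concrete, here is a minimal sketch on made-up data. The data frame `X` below is a hypothetical stand-in with invented values, not the competition data, and the back-transformation with expm1 is an added illustration rather than something shown in the post.

```r
# Hypothetical toy data; values are invented for illustration only.
X <- data.frame(
  Sex             = factor(c("F", "M", "U", "F", "M", "U", "F", "M")),
  AgeAtFirstClaim = factor(c("20-29", "20-29", "20-29", "70-79",
                             "70-79", "70-79", "20-29", "70-79")),
  DaysInHospital  = c(0, 0, 1, 2, 1, 5, 0, 3)
)

# Same formula as in the post: the response is transformed inside the formula.
model <- lm(log1p(DaysInHospital) ~ Sex + AgeAtFirstClaim, data = X)
coef(model)

# Predictions come back on the log1p scale; expm1 (the inverse of log1p)
# maps them back to days.
pred_log  <- predict(model, newdata = X)
pred_days <- expm1(pred_log)
pred_days
```

Because Sex and AgeAtFirstClaim are factors, lm estimates one coefficient per non-reference level rather than a single slope per variable.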

 
Allan Engelhardt's image Posts 77
Thanks 29
Joined 28 May '10 Email user

Allan Engelhardt wrote:

 that minimizes [the sum of squares of] ...

[Hands up everybody who thinks that Kaggle need to find a better forum software]
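
Putting the original explanation and the correction together, the quantity being minimized for A ~ B + C can be written out as below, with i indexing the rows of the data. This treats B and C as plain numeric variables for simplicity; for factors such as Sex and AgeAtFirstClaim, R expands each into indicator columns, so there is one coefficient per non-reference level instead of a single x or y.

```latex
\[
  \min_{x,\,y,\,z} \; \sum_{i} \left( A_i - x \times B_i - y \times C_i - z \right)^2
\]
```

In the model from the thread, A_i is log1p(DaysInHospital) for member i.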

 
Chris Raimondi's image Rank 38th
Posts 194
Thanks 90
Joined 9 Jul '10 Email user
Agreed - I like some of the features (thanks for example), but I feel like ease of communication is more important. Some of the web2.0 type stuff is too clever for its own good. I should NEVER get an error trying to copy and paste for example.
 
Jeff Moser's image
Jeff Moser
Kaggle Admin
Posts 356
Thanks 178
Joined 21 Aug '10 Email user
From Kaggle

Allan Engelhardt wrote:

[Hands up everybody who thinks that Kaggle need to find a better forum software]

Sorry about the forum issues. I'll plan on fixing the big ones soon. I've started a list at http://www.kaggle.com/forums/t/653/forum-suggestions 

Would love to get feedback on that topic. Thanks!

 