# Cases missing Sex and Age code in release 3

 Rank 38th Posts 194 Thanks 90 Joined 9 Jul '10

"...seemed meaningful. I'm wondering whether anyone else is uncomfortable with the idea of using predictors that only address data suppression"

If it works for you - I would use it. Data is messy - whatever you can do to help make it better I would consider. I am trying all sorts of things - if it works I keep it, if not - I move on.

"does not seem to address the stated goal of the competition"

It might SEEM that way, but without doing these types of things - the real good stuff will remain hidden in the data. Also - for 500k/3m - everyone and their brother is going to try whatever they can - you don't have a choice if you want to remain competitive.

"Sorry, my Excel table got scrunched in the previous post."

FWIW - quick reply screws stuff up - the regular reply doesn't squish everything together. Someone had a good post on how to post code (I think it was Jeff) - but I couldn't find it the other day when I was looking for it. It would be cool if we could sticky that or put the link somewhere near the reply button.

Thanked by mentula #16 / Posted 23 months ago
 Posts 253 Thanks 4 Joined 5 Aug '10

Allan Engelhardt wrote:

mentula wrote: Besides treating missing values as a distinct level, I am going to try imputing sex=female for a subset of missing based on Primary Condition and Specialty that skew heavily (> 90%) toward women among members with known sex. It adds noise to my female level but I figure that the benefit of reducing the size of the sexless pool is worth it.

Try it, but the missing gender members have much higher average DaysInHospital so I don't think that is the best approach.

    library("data.table")
    X
    X[, list(Y2=mean(DaysInHospital.Y2, na.rm = TRUE), Y3=mean(DaysInHospital.Y3, na.rm = TRUE)), by = Sex]
    ##       Sex        Y2        Y3
    ## [1,]     0.7940364 0.7415792
    ## [2,]    F 0.3173265 0.2727760
    ## [3,]    M 0.2467464 0.2062700

Let us know how you get on with your approach.

I do not understand your code. Can you explain what dih is? (It is in the original post but for some reason not in this post that I quote.)

Note that I also found that people without a recorded sex have more days in hospital, but I do not get the same numbers as you.

Here is my code:

    members hospital.y2 hospital.y3 colnames(hospital.y2) colnames(hospital.y3) aggmembersaggmembersright.a right.b
    mean(right.b$DaysInHospitalY3.x[right.b$Sex==""])
    mean(right.b$DaysInHospitalY3.x[right.b$Sex=="F"])
    mean(right.b$DaysInHospitalY3.x[right.b$Sex=="M"])
    mean(right.a$DaysInHospitalY2.x[right.a$Sex==""])
    mean(right.a$DaysInHospitalY2.x[right.a$Sex=="F"])
    mean(right.a$DaysInHospitalY2.x[right.a$Sex=="M"])

    [1] 0.8540239
    [1] 0.3691459
    [1] 0.2927755
    [1] 0.9292074
    [1] 0.3990862
    [1] 0.3199843

#17 / Posted 23 months ago
 Posts 253 Thanks 4 Joined 5 Aug '10 I can add that I dislike the fact that R gives me DaysInHospitalY2.x and DaysInHospitalY2.y when both have the same content. #18 / Posted 23 months ago
 Rank 38th Posts 194 Thanks 90 Joined 9 Jul '10

Allan,

My numbers match Uri's - I only checked year 3 (right.b)

    agg agg
      Sex DaysInHospital
    1   F      0.3691459
    2   M      0.2927755
    3   U      0.8540239

I broke it down by sum too -

    agg agg
      Sex DaysInHospital
    1   F          11713
    2   M           7481
    3   U          12087

Can you double check your numbers? - you are much better at R than I am - so I may have screwed up somewhere.

Uri - the .x and .y suffixes get added when a function like merge "joins" two tables that both have a column with the same name - it renames those columns to keep them apart. You might want to try installing the "plyr" / "reshape" packages and use the join function instead of merge. That will probably solve the reordering problem you had with NAs that I saw in another thread.

EDIT: grrr - That is the second time it happened - code seems to disappear after I edit sometimes... Here it is hopefully ...

    agg <- aggregate(DaysInHospital ~ Sex, right.b, mean)
    agg
      Sex DaysInHospital
    1   F      0.3691459
    2   M      0.2927755
    3   U      0.8540239

    agg <- aggregate(DaysInHospital ~ Sex, right.b, sum)
    agg
      Sex DaysInHospital
    1   F          11713
    2   M           7481
    3   U          12087

Thanked by Uri Blass #19 / Posted 23 months ago / Edited 23 months ago
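The .x/.y behaviour discussed in this post can be reproduced on a toy example. The sketch below is only an illustration: the data frames `members`, `days_y2` and `days_y3` and all their values are made up, and it is not the code that produced the numbers in the thread. It shows base R `merge()` adding its default suffixes, how the `suffixes` argument gives clearer names, and the same `aggregate()` pattern used above.

```r
# Hypothetical toy data; column names mimic the thread, values are invented.
members <- data.frame(MemberID = 1:4, Sex = c("F", "M", "", "F"),
                      stringsAsFactors = FALSE)
days_y2 <- data.frame(MemberID = 1:4, DaysInHospital = c(0, 1, 3, 0))
days_y3 <- data.frame(MemberID = 1:4, DaysInHospital = c(2, 0, 5, 1))

# Both inputs have a non-key column called DaysInHospital, so merge()
# keeps them apart with its default suffixes ".x" and ".y".
default_names <- names(merge(days_y2, days_y3, by = "MemberID"))
print(default_names)

# Explicit suffixes make the merged columns self-describing.
both   <- merge(days_y2, days_y3, by = "MemberID", suffixes = c(".Y2", ".Y3"))
joined <- merge(members, both, by = "MemberID")

# Label missing Sex as "U", then compute the per-group mean and sum.
joined$Sex[joined$Sex == ""] <- "U"
agg_mean <- aggregate(DaysInHospital.Y3 ~ Sex, joined, mean)
agg_sum  <- aggregate(DaysInHospital.Y3 ~ Sex, joined, sum)
print(agg_mean)
print(agg_sum)
```

If the same column really is present in both inputs with identical content, dropping it from one side before merging (or adding it to `by`) avoids the duplicate pair altogether.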
 Posts 253 Thanks 4 Joined 5 Aug '10 Thanks Chris. Reading my posts, I see that the forum simply did not copy my code correctly for some reason, but your code is simpler. Edit: Now I also do not see your code, but I understand that you use aggregate to show all the numbers. #20 / Posted 23 months ago
 Posts 77 Thanks 29 Joined 28 May '10

Chris Raimondi wrote: Allan, My numbers match Uri's - I only checked year 3 (right.b)

    agg agg
      Sex DaysInHospital
    1   F      0.3691459
    2   M      0.2927755
    3   U      0.8540239

You are right, I was wrong. I joined too much at once.

    > members[dih.Y2][,list(mean=mean(DaysInHospital), sum=sum(DaysInHospital)), by = Sex]
         Sex      mean   sum
    [1,]   F 0.3990862 13626
    [2,]   M 0.3199843  8949
    [3,]   U 0.9292074 12942
    > members[dih.Y3][,list(mean=mean(DaysInHospital), sum=sum(DaysInHospital)), by = Sex]
         Sex      mean   sum
    [1,]   F 0.3691459 11713
    [2,]   M 0.2927755  7481
    [3,]   U 0.8540239 12087

Fortunately the conclusion is valid.

@Uri: I use the data.table notation extensively which is different from the data.frame you are used to.

#21 / Posted 23 months ago
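For readers more used to data.frame, here is a rough side-by-side of the two notations on invented data. It is a sketch that assumes the data.table package is installed; the tables `members` and `dih` and their values are hypothetical and only mimic the names used in the post above.

```r
library(data.table)

# Hypothetical toy tables, both keyed on MemberID; values are invented.
members <- data.table(MemberID = 1:4, Sex = c("F", "M", "U", "F"),
                      key = "MemberID")
dih <- data.table(MemberID = 1:4, DaysInHospital = c(2, 0, 5, 1),
                  key = "MemberID")

# data.table notation: X[Y] joins the two keyed tables, then
# [, list(...), by = Sex] computes grouped summaries in one expression.
res <- members[dih][, list(mean = mean(DaysInHospital),
                           sum  = sum(DaysInHospital)), by = Sex]
print(res)

# Roughly equivalent data.frame notation: merge() followed by aggregate().
df   <- merge(as.data.frame(members), as.data.frame(dih), by = "MemberID")
base <- aggregate(DaysInHospital ~ Sex, df, mean)
print(base)
```

Both routes give the same group means; the data.table form simply chains the join and the aggregation.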
 Rank 92nd Posts 7 Joined 1 Jan '11 So what is the official response to these anomalies in the data? No comment? #22 / Posted 23 months ago
 Posts 25 Joined 18 Mar '11

Allan Engelhardt wrote: Just because I had submissions to burn: Using the simplest model on Sex and AgeAtFirstClaim gives you a public score of 0.478118 which currently would place you at # 107 on the leaderboard. I did:

    library("data.table")
    load("hhp-p030/HHPR3-data-table.RData")
    levels(members$Sex) members$Sex[is.na(members$Sex)] levels(members$AgeAtFirstClaim) members$AgeAtFirstClaim[is.na(members$AgeAtFirstClaim)] model S S$DaysInHospital S
    write.csv(S, file = "sex-age-mean.csv", quote = FALSE, row.names = FALSE)

(Wish I had more time for this competition.)

Hello Allan,

What do you mean by linear model? Is it possible to write this linear model with a simple formula like DaysInHospital = f (Sex, AgeAtFirstClaim)?

Thanks!

#23 / Posted 23 months ago
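The quoted code touches `levels(members$Sex)` and `members$Sex[is.na(members$Sex)]`, but the forum dropped the right-hand sides. One common way to turn missing values of a factor into their own level looks like the sketch below. The level name "U" and the toy vector `sex` are assumptions for illustration, not a reconstruction of the lost lines.

```r
# Toy factor with missing values; the values are invented.
sex <- factor(c("F", NA, "M", NA, "F"))

# A factor only accepts values that are among its levels, so the new
# level has to be added before the NAs can be replaced.
levels(sex) <- c(levels(sex), "U")
sex[is.na(sex)] <- "U"

print(table(sex))
```

After this step a model such as `lm` treats "U" like any other category instead of silently dropping those rows.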
 Posts 77 Thanks 29 Joined 28 May '10

vladvad wrote: Hello Allan, What do you mean by linear model? Is it possible to write this linear model with a simple formula like DaysInHospital = f (Sex, AgeAtFirstClaim)?

Yes. That is what I did with

    model <- lm(log1p(DaysInHospital) ~ Sex + AgeAtFirstClaim, data = X)

The lm function fits a linear model. The first argument is the model formula: A ~ B + C means that A depends on the two variables B and C (not that it is the sum), so you find x, y, z that minimize $A - x \times B - y \times C - z$. In R, I can do arbitrary transformations of the variables directly in the formula, which is why I have log1p(DaysInHospital). (log1p(x) is like log(1+x), but calculated in a way that is more accurate for small x.)

The Introduction to R that comes with the software covers some of this, or there is always help("lm") etc. from the R prompt.

#24 / Posted 23 months ago
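To make the formula concrete, here is a small self-contained run of the same `lm` call on simulated data. Everything about the data frame `X` below (sample size, the three age bands, the Poisson counts) is invented for illustration, and the `expm1` back-transform at the end is an assumption about how one might return to the original scale, not something stated in the thread.

```r
set.seed(1)
n <- 200

# Simulated stand-in for the real data: two categorical predictors
# and a count outcome. All values are made up.
X <- data.frame(
  Sex             = factor(sample(c("F", "M", "U"), n, replace = TRUE)),
  AgeAtFirstClaim = factor(sample(c("0-9", "40-49", "80+"), n, replace = TRUE)),
  DaysInHospital  = rpois(n, lambda = 0.5)
)

# Same formula as in the post: the response is transformed inside the formula.
model <- lm(log1p(DaysInHospital) ~ Sex + AgeAtFirstClaim, data = X)

# One intercept plus one coefficient per non-reference factor level.
print(coef(model))

# Predictions come back on the log1p scale; expm1() inverts log1p().
pred <- expm1(predict(model, newdata = X))
print(head(pred))
```

Because both predictors are factors, R expands each into indicator columns, so the fitted model has one coefficient per level (minus a reference level) rather than a single slope per variable.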
 Posts 77 Thanks 29 Joined 28 May '10 Allan Engelhardt wrote: that minimizes [the sum of squares of] ... [Hands up everybody who thinks that Kaggle needs to find better forum software] #25 / Posted 23 months ago
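With that correction applied, the quantity being minimized can be written out in full. The index $i$ over observations and the count $n$ are notation added here for illustration; $A$, $B$, $C$ and $x$, $y$, $z$ are as in the earlier post.

$$\min_{x,\,y,\,z} \; \sum_{i=1}^{n} \left( A_i - x \times B_i - y \times C_i - z \right)^2$$

This is ordinary least squares, which is what `lm` fits. When B or C is a factor, as Sex and AgeAtFirstClaim are here, R replaces it with one indicator column per non-reference level, so each level gets its own coefficient in place of the single $x$ or $y$.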
 Rank 38th Posts 194 Thanks 90 Joined 9 Jul '10 Agreed - I like some of the features (thanks for example), but I feel like ease of communication is more important. Some of the web2.0 type stuff is too clever for its own good. I should NEVER get an error trying to copy and paste for example. #26 / Posted 23 months ago
