# R questions

Rank 49th Posts 18 Thanks 2 Joined 4 Apr '11

> Zach wrote:
>
> > Uri Blass wrote:
> >
> > Gender and Age are strings in the file, so it is not clear how lm uses them. If lm simply translates them to integers, then it is not clear to me how you translate them.
>
> Gender is actually a factor, not a string. Factors are a rather unique data type used to represent categorical data. Age can also be represented as an ordered factor, but I suggest you find a way to convert it to a continuous variable, as that makes intuitive sense.

Just a word of caution on that approach. I would recommend keeping age as a categorical variable. While intuitively you would expect hospitalization rates to increase with age, there is one key exception to that rule: females in the "birthing years". I haven't tested it out, but I think there are enough data points in each age "bin" that you can get stable answers by keeping age as categorical.

Even without the birthing years issue, there is no reason to expect that hospitalization rates are linear with age. There may be sections of flatness and steepness as you progress in age. Better to let the data tell you what that is rather than artificially imposing a model on the age dependency.

#16 / Posted 24 months ago
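A minimal sketch of what keeping age categorical means in lm, using made-up data. The column names Age, Gender and Days and the bin labels are assumptions for illustration, not the contest schema.

```r
# Toy data; column names and values are invented for illustration.
set.seed(1)
members <- data.frame(
  Age    = factor(rep(c("0-9", "20-29", "40-49", "70-79"), each = 50)),
  Gender = factor(rep(c("F", "M"), times = 100))
)
members$Days <- rpois(200, lambda = 1)

# lm() expands each factor into 0/1 indicator columns (treatment contrasts),
# so every age bin gets its own coefficient and no linear shape is imposed.
fit <- lm(Days ~ Age + Gender, data = members)
names(coef(fit))
```

The first level of each factor is absorbed into the intercept, so four age bins produce three age coefficients. An interaction such as Age:Gender would be one way to let the age effect differ by gender, which is the "birthing years" concern raised above.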
Rank 31st Posts 292 Thanks 64 Joined 2 Mar '11

> boooeee wrote:
>
> Even without the birthing years issue, there is no reason to expect that hospitalization rates are linear with age

You can build non-linear models with continuous variables.

#17 / Posted 24 months ago
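As one possible illustration of that point, a polynomial term lets lm fit a curve in a continuous age variable. The data below is simulated and the cubic degree is an arbitrary choice, not something suggested in the thread.

```r
# Simulated data with a curved relationship between age and the response.
set.seed(1)
age  <- runif(300, 0, 90)
days <- 0.5 + 0.0004 * (age - 30)^2 + rnorm(300, sd = 0.2)

fit_linear <- lm(days ~ age)
fit_poly   <- lm(days ~ poly(age, 3))   # cubic polynomial in age

# Compare the two nested fits; the polynomial can follow curvature in the data.
anova(fit_linear, fit_poly)
```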
Rank 38th Posts 194 Thanks 90 Joined 9 Jul '10

> I haven't tested it out, but I think there are enough data points in each age "bin" that you can get stable answers by keeping age as categorical.

I have tested it, though perhaps not in the best way. I assigned a value of 0-8 to each age bin (so I could be lazy and use the first digit of the age) and assigned 9 to all unknowns. Using categories was better for me. I will probably try coding them as binary variables for other models, but right now I am using models that accept factors without conversion.

#18 / Posted 24 months ago
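A rough sketch of both encodings described above, assuming the age bins are labelled like "0-9", "10-19", ..., "80+" with an empty string for unknown (the label format is an assumption).

```r
# Age bin labels as they might appear in the members file (assumed format).
age <- c("0-9", "10-19", "80+", "", "30-39")

# First digit of the label gives 0-8; anything unparseable becomes 9 ("unknown").
age_code <- suppressWarnings(as.integer(substr(age, 1, 1)))
age_code[is.na(age_code)] <- 9L
age_code

# Binary (one-hot) coding for models that cannot take factors directly:
# one 0/1 column per observed age code.
age_factor <- factor(age_code)
dummies <- model.matrix(~ age_factor - 1)
dummies
```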
Rank 31st Posts 292 Thanks 64 Joined 2 Mar '11

Continuous variables are also easier to interpolate. Since there's a lot of missing data, this is an important consideration.

#19 / Posted 24 months ago
Posts 253 Thanks 4 Joined 5 Aug '10

Note that I do not understand Harry's code (what is locYXList?), and I do not understand how to solve a simple problem: generating a vector with the number of claims for every member in a specific year and merging it with the member list (in spite of knowing how to calculate the average number of claims in a specific year).

Here is my code to calculate the average number of claims in year 1:

    claims <- read.csv(file = "Claims.csv")
    qq1 <- claims[which(claims$Year == "Y1"), ]
    tt <- table(qq1$MemberID)
    mean(tt)

Now tt[1] is

    210
      8

I can get 8 directly as max(tt[1]), and I can even get a vector with c <- pmax(tt[]) or cc <- as.matrix(tt), but I did not find how to get a vector of the member IDs (the numbers that begin with 210). I would like to see how to merge the number of claims with the vector of members. I think it should be easy with the merge command, if only I could get the vector of member IDs and not only the vector of claim counts.

#20 / Posted 23 months ago
Rank 38th Posts 194 Thanks 90 Joined 9 Jul '10

Just so you know, you can use

    qq1 <- claims[claims$Year == "Y1", ]

instead of

    qq1 <- claims[which(claims$Year == "Y1"), ]

Both should do the same thing (as long as Year contains no NA values; with NAs, the logical form returns extra all-NA rows).

I am still not sure what you are trying to do, but my guess is that instead of a vector showing the number of claims:

    8 5 13 4 4 6 2 3 6 1 ...

you want a vector with the member IDs:

    210 3197 3889 4187 ...

If this is the case, you may find the str command useful. It shows how R stores the data, as tables are a little different.

    str(tt)
     'table' int [1:76037(1d)] 8 5 13 4 4 6 2 3 6 1 ...
     - attr(*, "dimnames")=List of 1
      ..$ : chr [1:76037] "210" "3197" "3889" "4187" ...

You can see that the member names are really an attribute, in this case "dimnames". You can also see that it is a list. Lists will cause you no end of problems in R until you get used to them. In this case you need to do the following.

Step 1: Get the attribute.

    aa1 <- dimnames(tt)
    str(aa1)
    List of 1
     $ : chr [1:76037] "210" "3197" "3889" "4187" ...

It is still a list, which you don't want in this case.

Step 2: Unlist it.

    aa2 <- unlist(aa1)
    str(aa2)
     Named chr [1:76037] "210" "3197" "3889" "4187" "9063" "11951" ...
     - attr(*, "names")= chr [1:76037] "" "" "" "" ...

It still has names and is stored as character (chr).

Step 3: Make it a numeric vector.

    aa3 <- as.numeric(as.vector(aa2))
    str(aa3)
     num [1:76037] 210 3197 3889 4187 9063 ...

I probably wouldn't use table for this purpose, partially due to all the conversions you need to do, but I don't know what you want to do. If you just want a count of claims in a "table" (actually a data frame), here is how I would do it:

    # Put a column of ones in the claims data frame, to be summed by MemberID
    qq1$cons <- 1
    # Make a data.frame with counts of claims by MemberID
    agg <- aggregate(cons ~ MemberID, qq1, sum)
    str(agg)
    'data.frame': 76037 obs. of 2 variables:
     $ MemberID: int  210 3197 3889 4187 9063 11951 14661 14701 14778 14855 ...
     $ cons    : num  8 5 13 4 4 6 2 3 6 1 ...

You can see there aren't any lists or attributes to worry about. Hope that helps.

Thanked by boooeee, Harry G., and Uri Blass

#21 / Posted 23 months ago
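To connect this back to the original question about merging counts onto the member list, here is a small sketch on toy data. The members data frame and the ClaimsY1 column name are made up for the example; only MemberID, Year and the aggregate approach come from the thread.

```r
# Toy stand-ins for the member list and the claims file.
members <- data.frame(MemberID = c(210, 3197, 3889, 4187))
claims  <- data.frame(
  MemberID = c(210, 210, 3889, 3889, 3889, 4187),
  Year     = c("Y1", "Y1", "Y1", "Y2", "Y1", "Y2")
)

# Count Y1 claims per member with the column-of-ones approach.
y1 <- claims[claims$Year == "Y1", ]
y1$cons <- 1
agg <- aggregate(cons ~ MemberID, data = y1, FUN = sum)
names(agg)[2] <- "ClaimsY1"

# all.x = TRUE keeps members with no Y1 claims; their count comes back as NA,
# which is then replaced by 0.
out <- merge(members, agg, by = "MemberID", all.x = TRUE)
out$ClaimsY1[is.na(out$ClaimsY1)] <- 0
out
```

If you prefer to stay with table, as.data.frame(tt) also returns the IDs and counts as two columns (named Var1 and Freq), which avoids the dimnames steps.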
Posts 3 Thanks 2 Joined 28 Mar '11

My apologies, Uri. I forgot to fix that part of the code. Ignore locYXList[[1]] and use t1 in its place.

Thanked by Uri Blass

#22 / Posted 23 months ago
Posts 253 Thanks 4 Joined 5 Aug '10

Thanks for all your replies.

Another question: how do I add to the member table a vector with the first year in which each person made a claim? (The vector should contain 1, 2 or 3, depending on the first year in which the person made a claim.) I still do not know much about R, and I am sure there should be a simple way to do it.

#23 / Posted 23 months ago
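One possible way to do this, sketched on toy data and assuming Year is coded as "Y1", "Y2", "Y3" as in the earlier posts. The YearNum and FirstYear column names are made up for the example.

```r
claims <- data.frame(
  MemberID = c(210, 210, 3889, 4187, 4187),
  Year     = c("Y2", "Y3", "Y1", "Y3", "Y2")
)
members <- data.frame(MemberID = c(210, 3197, 3889, 4187))

# "Y1" -> 1, "Y2" -> 2, "Y3" -> 3
claims$YearNum <- as.integer(sub("Y", "", claims$Year))

# Smallest year number per member is the first claim year.
first <- aggregate(YearNum ~ MemberID, data = claims, FUN = min)
names(first)[2] <- "FirstYear"

# Members with no claims at all (here 3197) get NA.
members <- merge(members, first, by = "MemberID", all.x = TRUE)
members
```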
Posts 253 Thanks 4 Joined 5 Aug '10

I already found a way to solve part of my problems with R.

Another question: how do I merge two tables, one of which has missing values, without destroying the order? It seems I can do it by first merging into a disordered table in which the missing values end up at the end (not what I want), then changing the missing values to a constant and merging again. This does not seem elegant to me. Is there a way in R to merge two data frames or objects without pushing all the missing values to the end?

#24 / Posted 23 months ago
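Two common ways to keep the original row order, sketched on toy data. The row_order helper column is a name made up for the example.

```r
members <- data.frame(MemberID = c(4187, 210, 3889, 3197))   # order to preserve
counts  <- data.frame(MemberID = c(210, 3889), Claims = c(8, 13))

# Option 1: look up each member's row in counts with match(); the order of
# members is untouched and members without a match get NA.
members$Claims <- counts$Claims[match(members$MemberID, counts$MemberID)]

# Option 2: tag the rows before merge(), then sort back afterwards.
m2 <- data.frame(MemberID = c(4187, 210, 3889, 3197))
m2$row_order <- seq_len(nrow(m2))
merged <- merge(m2, counts, by = "MemberID", all.x = TRUE)
merged <- merged[order(merged$row_order), ]
merged
```

Option 1 assumes each MemberID appears at most once in counts, since match() returns only the first match.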
Rank 13th Posts 65 Thanks 25 Joined 5 Aug '10

Uri,

It sounds like you're going through a "learning curve" in using R. I too am making my first extensive use of R for this contest (although I was successful in the Kaggle "R Package Recommendation Engine" challenge, I didn't actually use R for that competition).

I'm doing various things to try to master R's complexities and nuances, including searching the web (usually with Google) for answers to questions, searching the reference manual (http://cran.r-project.org/doc/manuals/fullrefman.pdf), looking through the list of available packages at http://cran.r-project.org/web/packages/, and lots of trial and error.

I've also acquired a few books on R: "A Beginner's Guide to R", by Alain F. Zuur, et al., "Software for Data Analysis - Programming with R", by John M. Chambers, and "R Cookbook", by Paul Teetor. Of those, I've found the most useful to be "R Cookbook", which I have in both paper and PDF form, the latter purchased directly from the publisher, O'Reilly.

I would imagine that more experienced R users may be able to recommend additional sources of info on R.

-- Dave Slate

#25 / Posted 23 months ago
Posts 253 Thanks 4 Joined 5 Aug '10

David, you are right. I do not know much about R. Fortunately, I learned enough in the last few days to improve and reach place 45, and I still expect to improve (I have not yet looked at most of the data).

I wonder what ranking I need for my rank to be shown next to my name on my posts (I see that Allan's rank 32 is shown next to his name in his posts).

I still do not know non-ugly ways to do things in R, but I now spend most of my time not on studying R but on trying to improve my predictions.

#26 / Posted 23 months ago
Posts 7 Thanks 8 Joined 3 Jun '11

You need to be 40th or better: http://www.kaggle.com/forums/t/647/forum-rank-feature

Cheers!

Thanked by Uri Blass

#27 / Posted 23 months ago
Posts 253 Thanks 4 Joined 5 Aug '10

Another question, about Harry G's post: how can I use the list for anything other than looking at all the information about the claims of a specific member?

y1MemberList[1] shows me all the claims of MemberID 42286978, but I see no way to use the specific numbers in R. I expected to get only the second line of y1MemberList[1] with something like y1MemberList[1][2], but unfortunately it does not work.

I read in http://cran.r-project.org/doc/manuals/R-lang.html#Indexing:

> For lists, one generally uses [[ to select any single element, whereas [ returns a list of the selected elements

so I tried y1MemberList[1][[2]] and got an error. I tried different things but nothing helped, so in practice I do not see how I can use lists in programming except for looking at the data of individual members.

I wonder if people use the list in some productive way other than viewing the information about specific members.

#28 / Posted 23 months ago
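A small sketch of the difference between [ and [[ that causes the error above. It assumes each element of y1MemberList is a data frame of one member's claims, which is a guess based on the description; the ProviderID and PayDelay columns here are only placeholders.

```r
# A toy stand-in for y1MemberList: one data frame of claims per member.
y1MemberList <- list(
  data.frame(ProviderID = c(11, 12, 11), PayDelay = c(30, 45, 20)),
  data.frame(ProviderID = 99, PayDelay = 10)
)

class(y1MemberList[1])     # "list": still a list, of length one
class(y1MemberList[[1]])   # "data.frame": the element itself

# Index into the element after extracting it with [[ ]].
y1MemberList[[1]][2, ]           # second claim row of the first member
y1MemberList[[1]]$PayDelay[2]    # a single value from that row
```

y1MemberList[1][[2]] fails because y1MemberList[1] is a list with only one element, so asking for its second element is out of bounds.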
Posts 3 Thanks 2 Joined 28 Mar '11

Here's an example of how you can use lists to create matrices.

    myList <- list()
    # populate the list with 5 vectors of 10 random numbers each
    for (i in 1:5) {
      myList[[i]] <- rnorm(10, 0, 1)
    }
    # create a matrix out of the list (one row per list element)
    myMatrix <- do.call(rbind, myList)

I use a list for the members so that I can have every MemberID coupled to its claims. It's just easier for me to handle the madness. I do all my processing on individual list elements and subsequently merge everything down into a matrix (imagine one line per MemberID after I'm done processing all the entries within a list). It's fast and gives me the feature matrix that I'm after.

Thanked by Uri Blass

#29 / Posted 23 months ago
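The "process each element, then merge down" workflow might look roughly like this. It is a sketch, not Harry's actual code: the per-member data frames and the two features (nClaims, nProviders) are invented for the example.

```r
# Toy per-member list, named by MemberID.
y1MemberList <- list(
  "210"  = data.frame(ProviderID = c(11, 12, 11)),
  "3197" = data.frame(ProviderID = 99)
)

# One small named vector of features per member...
rows <- lapply(y1MemberList, function(d) {
  c(nClaims = nrow(d), nProviders = length(unique(d$ProviderID)))
})

# ...stacked into a matrix with one row per member.
features <- do.call(rbind, rows)
features
```

A list like this can be built from a claims data frame with split(claims, claims$MemberID), which also names the elements by MemberID.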
Posts 253 Thanks 4 Joined 5 Aug '10

I see that you can use xi<-do.call(rbind, y1MemberList[i]) to get the details about member number i, but I still do not see how to use the list effectively to build a matrix for all the members.

For example, suppose that I want the matrix to contain the number of different ProviderIDs in year 1 for every member (I do not think this data is important; I give it only as an example).

Or suppose that I want to add the number of claims that a member made in each year in each possible PrimaryConditionGroup (for one year that is 46 numbers for every MemberID, most of them 0). I know how to do it with aggregate and merge for one PrimaryConditionGroup at a time, but that way I need to write special code for every possible condition group.

I wonder if there is a better way, and if the list that you generate can help with this task.

#30 / Posted 23 months ago
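One way to get all condition-group counts at once, without a list and without per-group code, is a two-way table. This is a sketch on toy data; the group labels and IDs are only illustrative values.

```r
claims <- data.frame(
  MemberID = c(210, 210, 210, 3889, 3889),
  Year     = c("Y1", "Y1", "Y1", "Y1", "Y2"),
  PrimaryConditionGroup = c("MSC2a3", "MSC2a3", "ARTHSPIN", "NEUMENT", "NEUMENT"),
  ProviderID = c(11, 12, 11, 99, 99)
)
y1 <- claims[claims$Year == "Y1", ]

# Two-way table: rows are members, columns are condition groups, cells are counts.
pcg <- table(y1$MemberID, y1$PrimaryConditionGroup)
pcg_df <- as.data.frame.matrix(pcg)
pcg_df$MemberID <- as.numeric(rownames(pcg_df))
pcg_df

# Number of distinct providers per member, in one aggregate() call.
prov <- aggregate(ProviderID ~ MemberID, data = y1,
                  FUN = function(p) length(unique(p)))
prov
```

Both results carry a MemberID column, so they can be attached to the member table with merge(..., all.x = TRUE) in the same way as a single count column.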