Uri Blass's image Posts 253
Thanks 4
Joined 5 Aug '10 Email user

1) What is the best tutorial for learning the parts of R that are relevant to this competition?

2) Is there a function in R to do a binary search? (Note that I found that I can use order to rearrange the rows so that one vector is in non-decreasing order.)

Members<-read.csv(file="Members.csv",head=TRUE,sep=",")
OrderMembers<-Members[order(Members$MemberID),]

Now the question is: if I want to find the position of MemberID 78832045 in this file by binary search, how do I do it in R?

I need to find 22222 in this example because OrderMembers$MemberID[22222]==78832045, but I want to do a binary search and use the fact that OrderMembers$MemberID is an increasing sequence.

 
Chris Raimondi's image Rank 38th
Posts 194
Thanks 90
Joined 9 Jul '10 Email user

Not sure if this is what you mean, but you can use "which" to find where something is - as in:

 

idx <- which(Members$MemberID=="12345678")

note the two equal signs - not one

and you can then use:

Members[idx,]

to show all rows with that MemberID
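
A small sketch of the same idea on toy data, in case it helps to see the output. The data frame here is made up for illustration and is not the real Members.csv:

```r
# Toy stand-in for Members.csv (values are made up for illustration)
Members <- data.frame(
  MemberID = c(30, 10, 20, 10),
  Sex      = c("M", "F", "M", "F")
)

# which() returns the positions of every element where the test is TRUE
idx <- which(Members$MemberID == 10)
idx

# Index the data frame by those positions to get the matching rows
Members[idx, ]

# match() is a related base function that returns only the first position
first <- match(10, Members$MemberID)
first
```

Both which and match scan the vector without assuming it is sorted.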

 

edited to add:

Oh and I have two books on R:

R in a nutshell

and

The R Book

I think R in a nutshell would be my first choice, but that might be because I read it first.

Also - the vignettes for various packages (not available for all) are sometimes very helpful.  I would certainly recommend reading all the vignettes for the "caret" package.

 
Zach's image Rank 31st
Posts 292
Thanks 64
Joined 2 Mar '11 Email user
The vignettes are here, towards the bottom: http://cran.r-project.org/web/packages/caret/index.html
 
Anthony Goldbloom (Kaggle)'s image
Anthony Goldbloom (Kaggle)
Competition Admin
Kaggle Admin
Posts 382
Thanks 72
Joined 20 Jan '10 Email user
From Kaggle

I've said this before, but I think Jeremy's tutorial is really excellent although it is not focussed on HHP. He is hoping to get the opportunity to do an HHP tutorial in the next few months.  

 
Uri Blass's image Posts 253
Thanks 4
Joined 5 Aug '10 Email user

Thanks

I did not know about the which command and I thought to use a special function for that purpose but it is not exactly what I asked.

My question is about finding it faster.

The which command can help me find a member with a specific MemberID, but it does not assume anything about the order of the vector.

I have

Members<-read.csv(file="Members.csv",head=TRUE,sep=",")
OrderMembers<-Members[order(Members$MemberID),]

After doing this I have OrderMembers$MemberID[i]<OrderMembers$MemberID[i+1] for every i, and I want to use that to find n such that OrderMembers$MemberID[n]==m by binary search, without looking at the whole vector. The idea is simply to start from the middle of the vector and halve the interval I am searching after every comparison, so in practice I need only about 20 steps to search a vector of length one million.
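
The halving idea described above could be sketched in R roughly as follows. This is only an illustration: the function name binary_search and the IDs are made up, and it assumes the vector is already sorted in increasing order.

```r
# Hand-written binary search over a sorted vector (hypothetical helper).
# Returns the index of `target` in `sorted`, or NA if it is not present.
binary_search <- function(sorted, target) {
  lo <- 1L
  hi <- length(sorted)
  while (lo <= hi) {
    mid <- (lo + hi) %/% 2L          # middle of the current interval
    if (sorted[mid] == target) {
      return(mid)
    } else if (sorted[mid] < target) {
      lo <- mid + 1L                 # discard the lower half
    } else {
      hi <- mid - 1L                 # discard the upper half
    }
  }
  NA_integer_
}

ids <- sort(c(78832045, 1002, 550013, 91234567, 42))  # made-up IDs
binary_search(ids, 78832045)

# Base R alternative: findInterval() also expects a sorted vector and
# returns the last position whose value is <= the target.
pos   <- findInterval(78832045, ids)
found <- pos > 0 && ids[pos] == 78832045
```

A loop like this runs one interpreted step at a time, so for many lookups a vectorised call such as findInterval(targets, ids) or match(targets, ids) is likely to be much faster in practice.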

 
Allan Engelhardt's image Posts 77
Thanks 29
Joined 28 May '10 Email user

Uri Blass wrote:

I did not know about the which command and I thought to use a special function for that purpose but it is not exactly what I asked.

My question is about finding it faster.[...]

library("data.table") does what you want.  I use it all the time.

 
Uri Blass's image Posts 253
Thanks 4
Joined 5 Aug '10 Email user

I do not understand how to use

library("data.table")

I get the error "there is no package called 'data.table'" if I simply type it.

 
Harry G.'s image Posts 3
Thanks 2
Joined 28 Mar '11 Email user

Here you go Uri.  This should help you out.

 

I wanted to create a list of all the claims per each member id.  So, I wrote a function that accepts a data.table and a vector of the unique member ID's as inputs.  Here is the function:

 

CreateListOfMemberClaims <- function(dt, ids) {
  ##
  # FUNCTION NAME: CreateListOfMemberClaims
  # INPUTS:  dt  = A data.table of the claims for a particular year, keyed by MemberID
  #          ids = The member IDs to look for
  #
  # OUTPUTS: memberList = list with one data.frame of claims per member ID
  ##
  memberList <- vector("list", length(ids))
  for (i in seq_along(ids)) {
    # J() performs a keyed (binary search) lookup in the data.table
    memberList[[i]] <- data.frame(dt[J(ids[i])])
  }
  return(memberList)
}
 
Here's how you use the function. Notice that I have converted only the data.frame for Y1 claims into a data.table. I called this new data.table qq for lack of a better name. Now you need to specify what the key of this table is. I chose MemberID as the key.
library(data.table)
qq                 <- data.table(claims[which(claims$Year == "Y1"), ])
setkey(qq, MemberID)
 
 
# Now use the function. You need the data.table and a vector of all the unique IDs
t1 <- which(claims$Year == "Y1")
uniqueVectorOfIDs <- unique(claims$MemberID[t1])
 
y1MemberList <- CreateListOfMemberClaims(qq, uniqueVectorOfIDs)
 
For all this to work, simply copy and paste the function from above into R, make sure you install the package data.table, and then copy and paste the rest of the code. To install the package you can type install.packages("data.table"). It's a two-step process: first you install a package, then you load it by typing either library(packageName) or require(packageName).
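
If installing a package is a hurdle, a similar per-member list can be built with base R's split. Note this is a swapped-in technique, not the data.table approach above, and the claims data frame and its PayDelay values here are made up for illustration:

```r
# Toy stand-in for the claims table (made-up values)
claims <- data.frame(
  MemberID = c(1, 2, 1, 3, 2),
  Year     = c("Y1", "Y1", "Y1", "Y2", "Y1"),
  PayDelay = c(10, 20, 30, 40, 50)
)

# Keep only the Y1 claims
y1 <- claims[claims$Year == "Y1", ]

# split() returns a list with one data.frame per MemberID,
# named by the ID (as a character string)
y1MemberList <- split(y1, y1$MemberID)

names(y1MemberList)
y1MemberList[["1"]]   # all Y1 claims for member 1
```

The list elements are looked up by name rather than by position, so y1MemberList[["1"]] means "member with ID 1", not "first element".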
 
Uri Blass's image Posts 253
Thanks 4
Joined 5 Aug '10 Email user

Harry, I do not understand this function, maybe because I am a relative beginner in R and did not use a good tutorial to learn it. But if you use the file claims.csv, I would expect to see the name of that file in your code, and I do not see it.

Note that I do not want to use the claims file at this point; I would prefer to understand how I can use the other files to make a simple prediction based only on age and gender.

Since I did not understand how to do a binary search in R, I tried without it, and the result is clearly disappointing (relative to C without binary search).

With C I could make a prediction based only on age and gender, even with a bad algorithm (one that does not do a binary search), in less than a minute.

With R it seems that something like this without binary search is going to take many hours. As a result I do not even like the idea of using binary search with R: if R is slower than C by a factor of 5 or 10 I can live with it, but not if it is slower by a factor of more than 1000.

 
Chris Raimondi's image Rank 38th
Posts 194
Thanks 90
Joined 9 Jul '10 Email user

The following code loads the needed files (not the claims), fits a simple model based on age and sex, then uses that to predict the next year's DaysInHospital.  It computes an error score (for what was just predicted), then makes another prediction for year 4 and saves the file.  This took just under 5 seconds (including loading the files) on my computer.

aa <- Sys.time()
#
# Import Files (have your files in a subdirectory called "hhp2" - or just delete the characters "hhp2/")
#

members.all <- read.csv(file="hhp2/Members.csv")

hospital.y2 <- read.csv(file="hhp2/DaysInHospital_Y2.csv")
hospital.y3 <- read.csv(file="hhp2/DaysInHospital_Y3.csv")
hospital.y4 <- read.csv(file="hhp2/Target.csv")
hospital.y2$logdays <- log1p(hospital.y2$DaysInHospital)
hospital.y3$logdays <- log1p(hospital.y3$DaysInHospital)
hospital.y4$logdays <- NA

#
# Make Right Hand Files
#

right.a <- merge(hospital.y2, members.all, by.x="MemberID",by.y="MemberID", all.x=TRUE, sort=FALSE)
right.b <- merge(hospital.y3, members.all, by.x="MemberID",by.y="MemberID", all.x=TRUE, sort=FALSE)
right.c <- merge(hospital.y4, members.all, by.x="MemberID",by.y="MemberID", all.x=TRUE, sort=FALSE)

# Compute a simple model

lm.1.log.based <- lm(logdays ~ AgeAtFirstClaim + Sex, data=right.a)

pred.b.log.based <- predict(lm.1.log.based, right.b)
pred.b.log.based <- expm1(pred.b.log.based)

# Compute RMSLE
err  <- function(obs, pred) sqrt( 1/length(obs) * sum((log(pred+1) - log(obs+1))^2))

err(pred.b.log.based, right.b$DaysInHospital)

# Predict for unknown year four

pred.c.log.based <- predict(lm.1.log.based, right.c)
pred.c.log.based <- expm1(pred.c.log.based)

# Place in target file and save

hospital.y4$DaysInHospital <- pred.c.log.based
hospital.y4 <- hospital.y4[,1:3]
write.csv(hospital.y4,file="my.submission.csv",quote=FALSE,row.names=FALSE)
ab <- Sys.time()
ab-aa






Thanked by Uri Blass , pim# , Heuristic , Wei-shou Hsu , Vtaylor , and jujung
 
Uri Blass's image Posts 253
Thanks 4
Joined 5 Aug '10 Email user

Thanks for the code. It works to generate a submission based on age and gender, but the problem is that I do not understand exactly what it does. There are many R functions that I simply did not know about, and I do not know where to learn them. I searched Google for log1p and expm1 and understood them. I also searched for merge and its by arguments and understood that it simply builds a new table that has all the ages and genders (I guess that merge uses a binary search to do this efficiently). I do not understand exactly what lm and predict do. From searching Google I found that they are about regression and linear models, but I would prefer to also see a mathematical formula to be sure exactly what they do.

Thanked by lenne20
 
Dmitry Vorobyev's image Posts 1
Joined 23 May '11 Email user

@Uri: As you correctly noticed, lm is the linear regression function. You can read about its mathematical coverage here:

http://en.wikipedia.org/wiki/Linear_regression

predict is another R function that uses the regression model in order to produce predicted values that are subsequently compared to real values to calculate the accuracy of the model. Refer to the R reference to know more or simply type help(function_name) in the console.

P.S. Not that I'm a huge R expert either :)

 
Uri Blass's image Posts 253
Thanks 4
Joined 5 Aug '10 Email user
Thanks for Chris's code. It clearly helped me understand how to use R with vectors rather than single numbers. I will probably not use the lm function, because I dislike using formulas when I am not sure exactly how they work. The problem is not what linear regression is but how it is calculated. Gender and Age are strings in the file, so it is not clear how lm uses them, and if lm simply translates them to integers then it is not clear to me how you translate them. I prefer to use my own functions, where I know exactly what I am doing.
 
Zach's image Rank 31st
Posts 292
Thanks 64
Joined 2 Mar '11 Email user

Use "?" and "??" to read the manual files for a given function. e.g.

?lm

??lm

The lm command in R is extremely well-tested, and it would probably take you many hours of effort to replicate. Furthermore, it has interfaces to other R functions, such as "summary" and "plot", that you may find useful in the future. If you're really suspicious of code written in R, you can usually type the name of the function with no parentheses to view its source code, e.g.:

lm

This is a good way to learn about programming conventions in R. I suggest you compare the predictions from R's lm command to predictions from your own regression function to assure yourself that they match.
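
One way such a comparison might look, as a rough sketch on made-up data. It assumes ordinary least squares, i.e. choosing the coefficients b that minimise sum((y - X b)^2), which leads to the normal equations (X'X) b = X'y. As far as I know lm itself uses a QR decomposition rather than solving these equations directly, so this only checks that the answers agree:

```r
# Made-up data: a numeric predictor and a two-level factor
d <- data.frame(
  y   = c(0.0, 0.7, 1.1, 0.0, 1.6, 0.7, 2.1, 1.4),
  age = c(5, 15, 25, 35, 45, 55, 65, 75),
  sex = factor(c("F", "M", "F", "M", "F", "M", "F", "M"))
)

# Fit with lm
fit <- lm(y ~ age + sex, data = d)

# Same fit by hand: build the design matrix and solve the normal equations
X    <- model.matrix(~ age + sex, data = d)   # columns: intercept, age, sexM (0/1)
beta <- solve(t(X) %*% X, t(X) %*% d$y)       # solves (X'X) beta = X'y
pred_by_hand <- as.vector(X %*% beta)

# Compare the two sets of predictions (equal up to rounding error)
all.equal(pred_by_hand, unname(predict(fit, d)))
```

Printing X shows exactly how the factor was turned into numbers before the regression was computed.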

 
Zach's image Rank 31st
Posts 292
Thanks 64
Joined 2 Mar '11 Email user

Uri Blass wrote:

Gender and Age are strings in the file, so it is not clear how lm uses them, and if lm simply translates them to integers then it is not clear to me how you translate them.

Gender is actually a factor, not a string. Factors are a rather unique data type used to represent categorical data.  Age can also be represented as an ordered factor, but I suggest you find a way to convert it to a continuous variable, as that makes intuitive sense.
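
A rough sketch of both points. The age labels below ("0-9", "10-19", "80+") are an assumption about how the buckets are written, so check them against your own file, and age_to_number is a hypothetical helper, not an existing R function:

```r
# A factor stores category labels as integer codes plus a table of levels
sex <- factor(c("M", "F", "F", "M"))
levels(sex)        # the distinct categories, sorted
as.integer(sex)    # the underlying integer codes

# In a model, a factor becomes 0/1 indicator columns, not its integer codes
mm <- model.matrix(~ sex)
mm

# Assumed age-bucket labels; an empty string stands for a missing age
age_raw <- c("0-9", "10-19", "80+", "", "40-49")

# Hypothetical helper: take the leading digits as the lower bound of the
# bucket and add 5 as a rough midpoint. Missing ages become NA.
age_to_number <- function(x) {
  lower <- as.numeric(sub("^([0-9]+).*$", "\\1", x))
  lower + 5
}

age_num <- age_to_number(age_raw)
age_num
```

The NA values still need handling (for example, replacing them with the mean age) before the numeric column is used in a model.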

 