Why so quiet - What is everyone up to?

I am starting to get annoyed at my CPU fan and hard drive noise.  I can sort of tell where in the code my CPU is based off of the CPU fan noise.  Also I think I am going to get an SSD drive.  [Would probably also help if it wasn't located three feet from my head]

Other than that - trying to find new features and better algos and trying to clean up and organize code.

Just really getting started with R Studio - I think I like it - should be much neater looking than the 16,000 line disorganized TextPad file I am working with now.

I want to try doing some multicore R stuff - so far I have just been manually launching multiple instances.  And then maybe even give the Amazon EC2 stuff a try.

How about the rest of you - any goals/objectives/frustrations you care to share?

For me, the release of the Labs and Rx data took some wind out of my sails, thus the quietness since around then.

I'm spending time performance tuning my algorithms just now, though, when I'm spending time on it at all.  I made the mistake of using this competition to learn data mining, so I've been playing with R and RapidMiner for part of my time (many thanks for the R tips everyone's throwing out, by the way), but I'm spending a lot of time in perl, Excel, and MySQL too which are causing me some performance issues.

Also, it's summer, and the laptop screen kind of washes out in the sunshine, which I'm trying to spend more time in.

:-)

I am curious about what R Studio is.  A user interface like R Commander?

@R. Kaan - I'm just learning R, but have found some nice things about R Studio as a development environment - it is not a menu-driven environment like R Commander. And it looks great on a Mac! See http://www.rstudio.org/

I'm plodding along learning R (thanks Chris R. for getting us started with some great examples). We've still not incorporated all of the Release 2 data yet, so hopefully there's more movement up the leaderboard after doing that and bringing in Release 3 data.

I like 'R Studio' screenshots and it seems to be a really nice IDE for R. I'm very tempted to download and give it a try, however, since I don't use R in my day job, I'll save it for a later time.

That being said, I've started watching some of the R related videos that I come across, like this one: http://google-opensource.blogspot.com/2011/06/visualization-meetup-at-googleplex.html

I am frustrated that, because of my lack of knowledge about R, I need to write the same code hundreds of times. (To be more precise, I do not write it hundreds of times; I copy, paste and replace hundreds of times, only to get a numeric matrix with the information that I would like to have.)

I asked about it at the end of page 2 of the R questions thread but unfortunately did not get a reply, and I can understand if people who want to win first place do not want to help me in the competition.

I would like to get many vectors: the number of AMI claims for every member in every year, the number of APPCHOL claims for every member in every year, and the same for every other possibility. As it stands I need to copy, paste and replace my code for every combination of claim type and year, and then check that I did not make mistakes while copying and pasting.

Note that there are 46 options for PrimaryConditionGroup, including the empty option, and 3 options for the year, so I need to copy and paste 46*3 times for PrimaryConditionGroup alone, and I would like to do the same for other columns.

I am sure that there should be a better way to get this information with R.
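For what it's worth, one possible way around the copy and paste is base R's table(), which counts every combination in a single call. This is only a sketch: the column names MemberID, Year and PrimaryConditionGroup are assumed to match the claims file, the "NONE" label is an arbitrary choice, and the toy values are invented.

```r
# Toy claims table; the values are invented for illustration.
claims <- data.frame(
  MemberID = c(1, 1, 1, 2, 2, 3),
  Year = c("Y1", "Y1", "Y2", "Y1", "Y1", "Y2"),
  PrimaryConditionGroup = c("AMI", "AMI", "APPCHOL", "AMI", "", "APPCHOL"),
  stringsAsFactors = FALSE
)

# Give the empty condition group an explicit label (assumed name "NONE").
claims$PrimaryConditionGroup[claims$PrimaryConditionGroup == ""] <- "NONE"

# One call counts claims for every member x condition x year combination.
counts <- table(claims$MemberID, claims$PrimaryConditionGroup, claims$Year)

# Slice out one year as a member-by-condition table.
counts.y1 <- counts[, , "Y1"]
print(counts.y1)
```

Each column of counts.y1 is then one of the vectors described above (for example the AMI counts per member in Y1), so no per-condition copy and paste is needed.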

Chris Raimondi wrote:

I want to try doing some multicore R stuff - so far I have just been manually launching multiple instances.  And then maybe even give the Amazon EC2 stuff a try.

Chris, if you'd like some advice on writing parallel code in R and executing it on Amazon EC2 clusters, I'd be happy to give you some pointers.
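As a rough idea of what replacing "manually launching multiple instances" might look like, here is a minimal sketch using the parallel package that ships with R (2.14.0 and later). The function slow.square is a hypothetical stand-in for one independent chunk of work, and the choice of two workers is arbitrary.

```r
library(parallel)  # part of the standard R distribution since 2.14.0

# Hypothetical stand-in for one expensive, independent unit of work.
slow.square <- function(i) {
  Sys.sleep(0.1)
  i^2
}

# Start two local worker processes, farm out the work, then shut them down.
cl <- makeCluster(2)
results <- parLapply(cl, 1:8, slow.square)
stopCluster(cl)

print(unlist(results))
```

parLapply() returns the results in the same order as the inputs, so the rest of a script can treat it like an ordinary lapply() call.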

Zach:

Thanks - I am going to give it a shot again in a week and might take you up on that....

Uri:

I am by no means an R expert and some of this stuff isn't easy for me - I saw your question, but your data appears to be organized differently than mine.  I have my data split up by year - matched to the hospital tables.  It appears you are organizing it into one master "table" for lack of a better word.  I could do it in theory, but it would take me a while.  It took me quite some time to make my code.

I would suggest you start by cleaning the data - get rid of the spaces and replace the empty strings:

for example (assuming the claims are in a data.frame called claims.all):

claims.all$Specialty <- gsub(" ", "", claims.all$Specialty)
claims.all$PlaceSvc <- gsub(" ", "", claims.all$PlaceSvc)

Those are just two examples - you will have to spend a decent amount of time on this step before going to the next. There are a lot of decisions to be made here - and some of them I probably should have made differently.
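To cover the "replace the empty strings" half of that advice as well, something along these lines might do. It is only a sketch: cleanColumn is a hypothetical helper, "Unknown" is an arbitrary label, and the toy claims.all values are invented.

```r
# Toy stand-in for claims.all; the values are invented for illustration.
claims.all <- data.frame(
  Specialty = c("General Practice", "", "Internal"),
  PlaceSvc = c("Office", "Urgent Care", ""),
  stringsAsFactors = FALSE
)

# Hypothetical helper: strip spaces, then label empty strings explicitly.
cleanColumn <- function(x, empty.label = "Unknown") {
  x <- gsub(" ", "", x, fixed = TRUE)
  x[x == ""] <- empty.label
  x
}

# Apply the same cleaning to each column by name.
for (col in c("Specialty", "PlaceSvc")) {
  claims.all[[col]] <- cleanColumn(claims.all[[col]])
}
print(claims.all)
```

Looping over column names keeps the decisions in one place, so changing the empty-string label later only touches one line.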

Allan wrote an excellent post on this (some of it is outdated as it deals with release one of the data, but 95% of it I think still works):

http://www.cybaea.net/Blogs/Data/Getting-started-with-HHP.html

Then - use the following function to break it up by year.

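getCleanClaims <- function(x="Y1", y=hospital.y2) {
  # keep the claims for year x ...
  sand <- claims.all[claims.all$Year == x, ]
  # ... whose member also appears in the hospital table y
  all.in <- sand$MemberID %in% y$MemberID
  sand <- sand[all.in, ]
  }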

Read in the hospital files:

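hospital.y2 <- read.csv(file="hhp2/DaysInHospital_Y2.csv")
hospital.y3 <- read.csv(file="hhp2/DaysInHospital_Y3.csv")
hospital.y4 <- read.csv(file="hhp2/Target.csv")
# assumption: logdays is log1p(DaysInHospital)
hospital.y2$logdays <- log1p(hospital.y2$DaysInHospital)
hospital.y3$logdays <- log1p(hospital.y3$DaysInHospital)
hospital.y4$logdays <- NA  # assumed placeholder: the Y4 outcome is what we predict
hospital.y2$bindays <- ifelse(hospital.y2$DaysInHospital > 0, 1, 0)
hospital.y3$bindays <- ifelse(hospital.y3$DaysInHospital > 0, 1, 0)
hospital.y4$bindays <- NA  # assumed placeholder, as above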

Use the function from above:

clean.1 <- getCleanClaims("Y1", hospital.y2)
clean.2 <- getCleanClaims("Y2", hospital.y3)
clean.3 <- getCleanClaims("Y3", hospital.y4)

Now you have three separate data frames - one for each year - and you know the members in each are matched up with the corresponding hospital file.

Then it appears you are trying to get counts by PrimaryCondition - I don't know/remember how much cleaning I had to do to that - so if you don't clean up the empty strings and such - you will run into problems.  But once they are cleaned - you can use something like:

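makeTab <- function(x, y) {
  # cross-tabulate members (rows) against categories (columns)
  temp <- table(x, y)
  class(temp) <- "matrix"
  temp <- as.data.frame(temp, stringsAsFactors = FALSE)
  # assumption: move the member IDs from the row names into the first column
  temp <- cbind(rownames(temp), temp, stringsAsFactors = FALSE)
  temp[,1] <- as.integer(temp[,1])
  colnames(temp) <- c("MemberID", colnames(temp)[-1])
  temp
  }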
# There is probably a better way, but I couldn't figure out how

#
# Assuming you have the members file in a data.frame called "members.all"
#

right.a <- merge(hospital.y2, members.all, by.x="MemberID", by.y="MemberID", all.x=TRUE, sort=FALSE)
right.b <- merge(hospital.y3, members.all, by.x="MemberID", by.y="MemberID", all.x=TRUE, sort=FALSE)
right.c <- merge(hospital.y4, members.all, by.x="MemberID", by.y="MemberID", all.x=TRUE, sort=FALSE)

#
# Now make another short function to make it shorter...
#

mergeIt <- function(x, y=temp) { merge(x, y, by.x="MemberID", by.y="MemberID", all.x=TRUE, sort=FALSE) }

#
# Expand out the Conditions as Columns Primary ConditionGroup
#

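temp <- makeTab(clean.1$MemberID, clean.1$PrimaryConditionGroup)
right.a <- mergeIt(right.a)
temp <- makeTab(clean.2$MemberID, clean.2$PrimaryConditionGroup)
right.b <- mergeIt(right.b)
temp <- makeTab(clean.3$MemberID, clean.3$PrimaryConditionGroup)
right.c <- mergeIt(right.c)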

Also - get:

"R in a Nutshell"

http://www.amazon.com/Nutshell-Desktop-Quick-Reference-OReilly/dp/059680170X

You probably also will want to look at the packages plyr and reshape.

I have spent a whole bunch of time cleaning the data - and will need to spend a whole bunch more time.
It is the boring part of this, but necessary.

Chris Raimondi wrote:
I can sort of tell where in the code my CPU is based off of the CPU fan noise.

I used to claim I could read the filenames off discs by holding the media against the sun and studying the resulting light pattern.

Thanks Chris.

Your post helped me.
I had simply assumed that functions work on numbers unless I tell R otherwise, and did not realize that a function can take strings.

In practice I do not even need function(x="Y1"); function(x) is enough, since it is clear from the body of the function that x is a string.

After understanding it I can clearly make my code shorter by using the right functions.

My code is at the bottom of this post.

I have another question.
Is there an elegant way in R to translate a string into a variable name?

I have a function that gets a string named y; suppose that y="AMI".

I would like my function to run the following command:

agg <- aggregate(numAMI ~ MemberID, right.a, sum)

I found the following solution, which builds the whole command as a string, but I think it is ugly:

numy<-paste("num",y,sep="")
text1<-paste("agg<-aggregate(",numy,sep="")
text2<-"~MemberID, right.a, sum)"
text1<-paste(text1,text2,sep="")
eval(parse(text=text1))

I would like to have simply

agg <- aggregate(numy ~ MemberID, right.a, sum)

The main problem is that numy is not numAMI but "numAMI" and I do not know how to generate the string "numAMI" without the quotation marks.
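One possible alternative to eval(parse(...)), offered as a sketch rather than the definitive answer: build only the formula from the string with as.formula() and hand that to aggregate(). The toy right.a below is invented; the column name numAMI follows the example above.

```r
# Toy stand-in for right.a; the values are invented for illustration.
right.a <- data.frame(
  MemberID = c(1, 1, 2, 3, 3),
  numAMI = c(1, 2, 0, 4, 1)
)

y <- "AMI"
numy <- paste("num", y, sep = "")

# Build the formula object from the string, then pass it to aggregate().
f <- as.formula(paste(numy, "~ MemberID"))
agg <- aggregate(f, data = right.a, FUN = sum)
print(agg)
```

When only the column itself is needed, right.a[[numy]] looks it up by its string name without any formula at all.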

Here is the code that I wrote thanks to your post. (I still need to call the function for every PrimaryConditionGroup, and a more elegant solution would loop over all the possible primary condition groups, but it is clearly better than what I had earlier.)

countprimary1<- function(y)
{
temp<-qq1[qq1$PrimaryConditionGroup==y,]
temp$numspecial<-1
agg <- aggregate(numspecial ~ MemberID, temp, sum)
no<-merge(hospital.y2,agg,by.x="MemberID",by.y="MemberID", all.x=TRUE, sort=FALSE)
no$numspecial[is.na(no$numspecial)]<-0
agg <- aggregate(numspecial ~ MemberID, no, sum)
names(agg)<-sub("special",y,names(agg))
merge(right.a, agg, by.x="MemberID",by.y="MemberID", all.x=TRUE, sort=FALSE)
}

countprimary2<- function(y)
{
temp<-qq2[qq2$PrimaryConditionGroup==y,]
temp$numspecial<-1
agg <- aggregate(numspecial ~ MemberID, temp, sum)
no<-merge(hospital.y3,agg,by.x="MemberID",by.y="MemberID", all.x=TRUE, sort=FALSE)
no$numspecial[is.na(no$numspecial)]<-0
agg <- aggregate(numspecial ~ MemberID, no, sum)
names(agg)<-sub("special",y,names(agg))
merge(right.b, agg, by.x="MemberID",by.y="MemberID", all.x=TRUE, sort=FALSE)
}

countprimary3<- function(y)
{
temp<-qq3[qq3$PrimaryConditionGroup==y,]
temp$numspecial<-1
agg <- aggregate(numspecial ~ MemberID, temp, sum)
no<-merge(hospital.y4,agg,by.x="MemberID",by.y="MemberID", all.x=TRUE, sort=FALSE)
no$numspecial[is.na(no$numspecial)]<-0
agg <- aggregate(numspecial ~ MemberID, no, sum)
names(agg)<-sub("special",y,names(agg))
merge(right.c, agg, by.x="MemberID",by.y="MemberID", all.x=TRUE, sort=FALSE)
}

countprimary<-function(y)
{
right.a<<-countprimary1(y)
right.b<<-countprimary2(y)
right.c<<-countprimary3(y)
}
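The three near-identical functions could perhaps be folded into one by passing the tables in as arguments, and the loop over condition groups then comes almost for free. The sketch below is a hypothetical rewrite, not the code above: countPrimaryFor is an invented name, it counts with table() instead of aggregate() and merge(), it assumes one row per member in the table being extended, and the toy data is made up.

```r
# Hypothetical generalisation: the claims table and the table to extend
# are arguments instead of being hard-coded per year.
countPrimaryFor <- function(y, claims, right) {
  hits <- claims$MemberID[claims$PrimaryConditionGroup == y]
  # Using the member IDs as factor levels keeps members with zero claims
  # and returns the counts in the same order as the rows of 'right'.
  n <- table(factor(hits, levels = right$MemberID))
  right[[paste("num", y, sep = "")]] <- as.vector(n)
  right
}

# Toy stand-ins for qq1 and right.a; the values are invented.
qq1 <- data.frame(
  MemberID = c(10, 10, 20, 30),
  PrimaryConditionGroup = c("AMI", "AMI", "APPCHOL", "AMI"),
  stringsAsFactors = FALSE
)
right.a <- data.frame(MemberID = c(10, 20, 30, 40))

# Loop over every condition group that occurs in the claims.
for (y in unique(qq1$PrimaryConditionGroup)) {
  right.a <- countPrimaryFor(y, qq1, right.a)
}
print(right.a)
```

The same function would be called with qq2/right.b and qq3/right.c for the other years, which also avoids the <<- assignments.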

How much time have folks been dedicating to this? I finally just had a full weekend to spend on it for the first time since late May and was able to do some actual analysis and build some simple models. With work and such though it's been very difficult to spend more than 5-10 hours per week on it.

After setting up the data in SQL server on my laptop though I was finally able to do some actual experimenting over the weekend. Has anyone else found that the Y3 hospitalization data seems to have a much different mean than the 30% of Y4? I built a simple autoregressive model, i.e. using just claimsTruncated and the prior year's daysInHospital/claimsTruncated. Using Y3/Y2 as training data for this gave RMSE of .4699 but on Y4/Y3 the public score was above .48. Thought that was a curious observation.

I've got similar results.

My main algorithm, which gave me 20th place on the ranking, gives 0.452 on a randomly chosen 33% holdout from the training data, but 0.4636 in reality.

Some advice on how to fix this would be much appreciated.

I can add that I now understand the need to clean the data.

I could easily get rid of spaces in my functions, but that is not enough: I cannot build a variable name from every string without some further replacement.

"-" is an example of a character that cannot be part of a variable name, so I get an error.

In spite of that, I plan to use a more significant part of the data in my next submission, which I hope to make no later than next week.
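For the invalid-character problem described above, base R's make.names() may be enough. A small sketch, with made-up labels (the hyphenated one is hypothetical, not taken from the data):

```r
# Hypothetical raw labels, including a hyphen, a space and an empty string.
raw <- c("AMI", "GROUP-A", "TWO WORDS", "")

# make.names() replaces invalid characters with "." and fixes empty names.
safe <- make.names(raw, unique = TRUE)
print(safe)

# Use the cleaned labels to build column names such as numAMI.
col.names <- paste("num", safe, sep = "")
print(col.names)
```

Alternatively, columns can keep awkward names if they are always accessed as df[["some-name"]] rather than df$some-name.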

Kwaak wrote:

I've got similar results.

My main algorithm, which gave me 20th place on the ranking, gives 0.452 on a randomly chosen 33% holdout from the training data, but 0.4636 in reality.

Some advice on how to fix this would be much appreciated.

The average of DIH in Y3 is lower than in Y4. What I'd do is either treat the training score as an improvement over the naive baseline (which is different for Y3 and Y4), or find a "random" validation subset of the Y3 data that has a similar average to the Y4 data. That (log1p-based) average is 0.209179 according to a separate thread.
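To make the "improvement over the naive baseline" idea concrete, here is a small sketch. It assumes the error is RMSE on log1p values, as the log1p-based average above suggests; the days-in-hospital vectors are invented, and rmsle and baseline are hypothetical helper names.

```r
# Assumed error measure: RMSE between log1p(prediction) and log1p(actual).
rmsle <- function(pred, actual) {
  sqrt(mean((log1p(pred) - log1p(actual))^2))
}

# Invented days-in-hospital values for two hypothetical validation sets.
dih.a <- c(0, 0, 0, 0, 1, 0, 2, 0, 0, 0)
dih.b <- c(0, 0, 1, 0, 3, 0, 0, 5, 0, 0)

# Naive baseline: predict the same constant for everyone. The constant
# that minimises this error is expm1(mean(log1p(actual))).
baseline <- function(actual) {
  const <- expm1(mean(log1p(actual)))
  rmsle(rep(const, length(actual)), actual)
}

print(c(a = baseline(dih.a), b = baseline(dih.b)))
```

Comparing a model's score with baseline() on the same set, rather than comparing raw scores across sets, removes most of the effect of the two sets having different averages.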
