I've been fooling around with the caret package and have a few questions someone might be able to answer. Is it possible to use train to explore the tuning parameters without fitting a full final model at the end? Similarly, is it possible to run rfe without fitting the full model at the beginning, and instead give it a list of predictor importances?
#31 / Posted 22 months ago
The box I usually run my models on is down for maintenance, so I've spent this last week writing some fairly (read: very) useless things in R. Here's an example:

Say I have this constant urge to go outside and enjoy some sunshine whenever my computer is chugging along on a randomForest/neural net/other awesome technique, and it may be minutes/hours/days before my job is done. Still, I'd like to know when it finishes so I can run back to my computer in a hurry. I recently stumbled upon a Twitter client for R called...twitteR, which has some neat functions that allow you to tweet from an R console. I can now set up my job like so:

    library(twitteR)
    tweet("Starting Job!")
    a <- Sys.time()
    # SOME REALLY COMPUTATIONALLY EXPENSIVE STUFF GOES HERE!
    b <- Sys.time()
    # format() keeps the time units, which paste() alone would drop
    tweet(paste("Job finished in ", format(b - a), "!", sep = ""))

If Twitter is too new-fangled, you can always check out the sendmailR package and send yourself an e-mail when it's done.

One thing I'm currently working on (having no computer and plenty of free time) is writing an R package that hooks up with Twilio, a pretty interesting service that allows developers to make phone calls and send text messages through a web service API. That way, I can get push notifications to my phone via SMS telling me that my 250k tree forest or my quadruple for-loop data-processing function is done.
#32 / Posted 22 months ago
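The start/stop timing pattern above can be wrapped in a small helper so the notifier is swappable. This is only a sketch in base R: `run_and_notify` and its arguments are hypothetical names, not part of twitteR or sendmailR, and the notifier defaults to `message()` so it runs without any account set up.

```r
# Hypothetical helper (not from any package): run a job, then report the
# elapsed time through whatever notifier function is supplied.
run_and_notify <- function(job, notify = message) {
  start <- Sys.time()
  result <- job()
  elapsed <- Sys.time() - start   # a difftime object
  # format() keeps the units ("secs", "mins", ...) that paste() alone drops
  notify(paste0("Job finished in ", format(elapsed, digits = 3), "!"))
  invisible(result)
}

# Usage with a stand-in job; the notifier here just collects the message,
# but it could be swapped for a tweet or e-mail function.
sent <- character(0)
out <- run_and_notify(function() sum(sqrt(1:1e6)),
                      notify = function(msg) sent <<- c(sent, msg))
```

Passing the job as a function keeps the timing code in one place, so switching from Twitter to e-mail or SMS only means changing the `notify` argument.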
I fear I am hijacking this thread for R help, but here goes. Thanks to Chris for answering the question I asked last time. The answer helped, but it brought up a new question. The predict function now looks for my Y2 data to predict with, but I want to use my Y3 data. The columns are not named the same, so how do I fudge it to use the model from before? I came up with a kludge, but I would prefer something elegant and fast.

Also, when running a random forest, what is the expected run time? I am running it on my 70,000 x 50 data table, with 5 trees, and R goes into not-responding mode... This seems similar to what was described on the Nabble forum: http://r.789695.n4.nabble.com/Large-dataset-randomForest-td830768.html
#33 / Posted 22 months ago
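One common way to handle the mismatched column names, sketched here with made-up data and `lm` standing in for the real model: give the new data the names the model was trained on before calling `predict`. The data frames and column names below are hypothetical, and the sketch assumes the Y3 columns are in the same order as the Y2 ones.

```r
# Toy stand-ins for the Y2 (training) and Y3 (scoring) tables; the column
# names here are made up for illustration.
set.seed(1)
y2 <- data.frame(claims_Y2 = rnorm(50), age_Y2 = rnorm(50))
y2$days <- 2 * y2$claims_Y2 - y2$age_Y2 + rnorm(50, sd = 0.1)
y3 <- data.frame(claims_Y3 = rnorm(10), age_Y3 = rnorm(10))

fit <- lm(days ~ claims_Y2 + age_Y2, data = y2)

# Same columns in the same order, so copy the training names across.
predictors <- c("claims_Y2", "age_Y2")
y3_renamed <- setNames(y3, predictors)
preds <- predict(fit, newdata = y3_renamed)
```

An alternative that avoids the renaming step is to strip the year suffix from both tables before fitting, so the model never sees year-specific names.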
Chris Raimondi wrote: "Also DO NOT use the formula interface (this is a random forest thing and not a caret specific thing). I read this somewhere and was skeptical, but on my other computer I would run out of memory using the formula method on a larger data set, but be fine without. I don't get how there is that much of an overhead for using a formula, but whatever - it works."

Where can I find an example of not using the formula interface? I got the formula interface in the caret package working, but always got this error with the non-formula interface:

    model = train(x, y, method="rf", metric="RMSE", ntree=400,
                  trControl=tc1, tuneGrid=tg)

    Fitting: mtry=3
    Aggregating results
    Selecting tuning parameters
    Fitting model on full training set
    Error in preProcess.default(trainX, method = pp$options, thresh = pp$thresh,  :
      all columns of x must be numeric
#34 / Posted 22 months ago
"Where can I find an example of not using the formula interface?"

I was talking about using randomForest itself. If you use caret and have factors, you sometimes have to use the formula interface (which can be nice, as you can use it for models that you wouldn't otherwise be able to - I assume it is turning the factors into dummy variables in the background).

I do not know if the non-formula interface trick actually saves you anything in caret. It is possible that, due to what is going on in the background, you don't get an advantage - or maybe you always get the advantage - I don't know, sorry.

I am POSITIVE though: if you train using randomForest directly, it does help with memory issues - at least it did for me. I tend to use the formula interface with caret.

If you use:

    getAnywhere(train.default)

you can see some of the code for the train function, including where some of the logic takes place for which models require the formula interface to use factors. I believe that you shouldn't need it for "rf" though, so I'm not sure why you are getting that message. I believe I have had issues with unusual column names before, but I can't remember for sure.
#35 / Posted 22 months ago
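For the "all columns of x must be numeric" message, one possible workaround is to expand factors into dummy columns yourself with base R's `model.matrix` and pass the resulting matrix as `x`. The sketch below uses a made-up data frame and only builds the inputs; whether this removes the error in the setup above has not been verified here.

```r
# Toy data with one factor column; names are made up for illustration.
df <- data.frame(age = c(34, 51, 29, 62),
                 sex = factor(c("F", "M", "M", "F")),
                 y   = c(1.2, 0.4, 0.0, 2.1))

# model.matrix() expands factors into 0/1 dummy columns; "- 1" drops the
# intercept column so only predictors remain.
x <- model.matrix(~ . - 1, data = df[, c("age", "sex")])
y <- df$y

# x is now an all-numeric matrix that could be passed as
# randomForest(x, y, ...) or train(x, y, ...) instead of a formula.
```

Note that randomForest called directly also accepts a data frame with factor columns as `x`, so the dummy expansion is mainly relevant when some other step insists on numeric input.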
darc wrote: "I've been fooling around with the caret package and had a few questions someone might be able to answer. I was wondering if it is possible to use train to explore the tuning parameters but not fit a full final model at the end? Similarly I was wondering if it is possible to run rfe and not fit the full model at the beginning, but instead give it a list of predictor importance?"

rfe HAS to fit a full model at the beginning, so it knows which variables to eliminate. I'm not sure you can tell caret not to fit a final model. Look through the parameters on the help page. The final object returned by caret contains both the final model and the optimal tuning parameters ($finalModel and something else), which you can use to refit the model if you like.
#36 / Posted 22 months ago

I've just started playing around in this comp, so I thought I'd revive the thread and contribute one or two R functions, using what hopefully are fairly obvious names for the data objects. A submission function, a training function for random forest parameters, a quick variable importance function, and a couple of faster alternatives to table and tapply that may work for you depending on what you are doing.

    library(randomForest)

    # Write a submission file. Predictions are back-transformed with exp(x) - 1;
    # daysY4 must exist in the workspace.
    submit.csv <- function(model, newdata, filename, ...) {
      preds <- predict(model, newdata = newdata, ...)
      preds <- exp(preds) - 1
      if (any(is.na(preds))) stop("missing values in predictions")
      if (any(is.infinite(preds))) stop("infinite values in predictions")
      if (any(preds < 0)) stop("negative values in predictions")
      preds <- data.frame(MemberID = daysY4$MemberID,
                          ClaimsTruncated = daysY4$ClaimsTruncated,
                          DaysInHospital = round(preds, 6))
      if (nrow(preds) != 70942) stop("incorrect number of rows")
      write.csv(preds, filename, row.names = FALSE, quote = FALSE)
    }

    # Grid search over random forest parameters, scored by test-set RMSE.
    # Uses daysY2MM/TargetY2 for training and daysY3MM/TargetY3 for testing.
    rf.train <- function(sampsize = 10000, mtry = 10, ntree = 100, nodesize = 50, reps = 3) {
      # arguments should be numeric vectors of any length
      argm <- expand.grid(sampsize = sampsize, mtry = mtry, ntree = ntree, nodesize = nodesize)
      mt <- matrix(NA, ncol = reps, nrow = nrow(argm))
      for (i in 1:nrow(argm)) {
        for (j in 1:reps) {
          rfFita <- randomForest(daysY2MM, TargetY2, ntree = argm[i, 3], mtry = argm[i, 2],
                                 replace = FALSE, sampsize = argm[i, 1], nodesize = argm[i, 4],
                                 maxnodes = NULL, xtest = daysY3MM, ytest = TargetY3,
                                 importance = FALSE, localImp = FALSE, keep.forest = TRUE)
          # test-set RMSE after the last tree
          mt[i, j] <- sqrt(rfFita$test$mse)[argm[i, 3]]
        }
      }
      cbind(argm, round(mt, 5))
    }

    # Quick variable importance: absolute t statistic from a one-variable lm.
    qvarimp <- function(prdata, target, sort = TRUE, ...) {
      # prdata should be a numeric matrix with no missing values
      # or zero variance variables
      # target should have no missing values
      impFunc <- function(x, y) abs(coef(summary(lm(y ~ x)))[2, "t value"])
      ret <- data.frame(tstat = round(apply(prdata, 2, impFunc, y = target), 1))
      # drop = FALSE keeps the data frame (and its row names) when sorting
      if (sort) ret <- ret[sort.list(ret$tstat, decreasing = TRUE), , drop = FALSE]
      ret
    }

    # Faster tapply for the common case.
    qtapply <- function(X, INDEX, FUN, ..., simplify = TRUE) {
      # INDEX should be a factor
      FUN <- match.fun(FUN)
      ans <- lapply(split(X, INDEX), FUN, ...)
      if (simplify) ans <- unlist(ans)
      ans
    }

    # Faster table for a single factor.
    qtable <- function(fac, names = FALSE) {
      # fac should be a factor
      pd <- nlevels(fac)
      y <- tabulate(fac, pd)
      if (names) names(y) <- levels(fac)
      y
    }
#37 / Posted 20 months ago
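A quick way to check that qtapply and qtable agree with their base R counterparts is to run both on the same toy data. The two functions are copied from the post above so the sketch runs on its own; the data and the `same_means` / `same_counts` names are made up for illustration.

```r
# Copies of the two helpers from the post above
qtapply <- function(X, INDEX, FUN, ..., simplify = TRUE) {
  FUN <- match.fun(FUN)
  ans <- lapply(split(X, INDEX), FUN, ...)
  if (simplify) ans <- unlist(ans)
  ans
}
qtable <- function(fac, names = FALSE) {
  y <- tabulate(fac, nlevels(fac))
  if (names) names(y) <- levels(fac)
  y
}

# Toy data: 1000 values in three groups
set.seed(42)
grp <- factor(sample(c("a", "b", "c"), 1000, replace = TRUE))
val <- runif(1000)

# Compare against tapply() and table(), ignoring names and array attributes
same_means  <- isTRUE(all.equal(unname(qtapply(val, grp, mean)),
                                as.vector(tapply(val, grp, mean))))
same_counts <- identical(qtable(grp), as.vector(table(grp)))
```

Any speed difference is easiest to judge by wrapping each call in `system.time()` on data of realistic size, since the gain will depend on the number of groups and rows.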
One thing that is a pain in R is dealing with NAs. Most of the time I find it necessary to get rid of the NAs. Here are two functions I use: one replaces all NAs with zeros, the other replaces them with the mean:

    replaceNAWithMean <- function(x) {
      x[is.na(x)] <- mean(x, na.rm = TRUE)
      x
    }

    replaceNAWithZero <- function(x) {
      x[is.na(x)] <- 0
      x
    }
#38 / Posted 12 months ago
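The functions above work on a single vector. A minimal sketch of applying one of them to every numeric column of a data frame follows; the data frame and its column names are made up for illustration.

```r
# Copy of the helper from the post above
replaceNAWithMean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Toy data frame with NAs in the numeric columns and one character column
df <- data.frame(a = c(1, NA, 3), b = c(NA, 10, 20), id = c("x", "y", "z"),
                 stringsAsFactors = FALSE)

# Apply the replacement only to numeric columns, leaving the others alone
num_cols <- vapply(df, is.numeric, logical(1))
df[num_cols] <- lapply(df[num_cols], replaceNAWithMean)
```

One caveat: a column that is entirely NA has a mean of NaN, so it would come out filled with NaN instead of a usable value.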
Chris Raimondi wrote: "One thing that is a pain in R is dealing with NAs - most of the time I find it necessary to get rid of the NAs - here are two functions I use - one replaces all NAs with Zeros - the other replaces it with the mean:"

There are some nice functions in the timeSeries library as well:

    library(timeSeries)
    ?substituteNA
#39 / Posted 12 months ago
There are a couple of useful answers on Stack Overflow for replacing NAs with 0s: http://stackoverflow.com/questions/8161836/how-do-i-replace-na-values-with-zeros-in-r

This function is very fast: http://stackoverflow.com/questions/7235657/fastest-way-to-replace-nas-in-a-large-data-table
#40 / Posted 10 months ago
