Matt Fornari (Rank 92nd, Posts 7, Joined 1 Jan '11)

I've been fooling around with the caret package and had a few questions someone might be able to answer.  I was wondering if it is possible to use train to explore the tuning parameters but not fit a full final model at the end?  Similarly I was wondering if it is possible to run rfe and not fit the full model at the beginning, but instead give it a list of predictor importance?

 
thomson (Posts 2, Thanks 1, Joined 31 May '11)

The box I usually run my models on is down for maintenance, so I've spent this last week writing some fairly (read: very) useless things in R. Here's an example:

Say I have this constant urge to go outside and enjoy some sunshine whenever my computer is chugging along on a randomForest/neural net/other awesome technique, and it may be minutes/hours/days before my job is done. Still, I'd like to know when it finishes so I can run back to my computer in a hurry. I recently stumbled upon a Twitter client for R called...twitteR, that has some neat functions that allow you to tweet from an R console. I can now set up my job like so:

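library(twitteR)  # assumes a Twitter session has already been authenticated
tweet("Starting Job!")
a <- Sys.time()
# SOME REALLY COMPUTATIONALLY EXPENSIVE STUFF GOES HERE!
b <- Sys.time()
# format() keeps the units (secs/mins/hours) that the time difference is reported in
tweet(paste("Job finished in ", format(b-a), "!", sep=""))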

If Twitter is too new-fangled, you can always check out the sendmailR package and send yourself an e-mail when it's done.
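The same start/finish pattern can be wrapped so that the notification channel is swappable. This is only a sketch: `run_with_notification` and `collect` are hypothetical names, and the notifier defaults to base R's `message()` so that it runs without any Twitter or mail setup. A function that tweets or sends an e-mail could be passed as `notify` instead.

```r
# Hypothetical helper: run a job and report its elapsed time through any
# notifier function (message() by default, so nothing external is needed).
run_with_notification <- function(job, notify = message) {
  notify("Starting job!")
  start <- Sys.time()
  result <- job()                      # the expensive work goes here
  elapsed <- difftime(Sys.time(), start, units = "secs")
  notify(sprintf("Job finished in %.1f seconds!", as.numeric(elapsed)))
  invisible(result)
}

# Example: collect the notifications in a vector instead of printing them
sent <- character(0)
collect <- function(text) sent <<- c(sent, text)
total <- run_with_notification(function() sum(sqrt(1:1e5)), notify = collect)
```

Fixing the units to seconds avoids the ambiguity of an automatically chosen unit when the number is pasted into a message.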

One thing I'm currently working on (having no computer and plenty of free time) is writing an R package that hooks up with Twilio, a pretty interesting service that allows developers to make phone calls and send text messages through a web service API. That way, I can get push notifications to my phone via SMS telling me that my 250k tree forest or my quadruple for-loop data-processing function is done.

Thanked by Zach
 
ProTester (Posts 8, Joined 6 May '11)

I fear I am hijacking this thread for R help, but here it goes.

Thanks to Chris for answering the question I asked last time. The answer helped, but it brought up a new question. So I now have the predict function looking for my Y2 data to predict with. However, I want to use my Y3 data. The columns are not named the same, so how do I fudge it to use the model from before? I came up with a kludge, but I would prefer something elegant and fast.
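One possible approach to the column-name mismatch, sketched on toy data: copy the Y3 table and rename its columns to the names the model was trained on before calling predict. The data frames, column names, and the `lm` model here are all made up for illustration; the real tables and model in this thread are not shown.

```r
# Toy stand-ins for the year-2 and year-3 tables (hypothetical column names)
y2 <- data.frame(claims_y2 = c(1, 3, 5, 7, 9),
                 age_y2    = c(30, 45, 50, 62, 41),
                 days      = c(2, 4, 6, 9, 10))
y3 <- data.frame(claims_y3 = c(2, 4), age_y3 = c(35, 45))

fit <- lm(days ~ claims_y2 + age_y2, data = y2)

# Map each Y3 column onto the name the model was trained with
name_map <- c(claims_y3 = "claims_y2", age_y3 = "age_y2")
y3_as_y2 <- y3
names(y3_as_y2) <- unname(name_map[names(y3)])

preds <- predict(fit, newdata = y3_as_y2)
```

If the two tables have their columns in the same order, assigning the training names directly with `names(y3_as_y2) <- names(...)` would be shorter, but an explicit map is harder to get silently wrong.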

Also, when running a random forest, what run time should I expect? I am running it on my 70,000 x 50 data table, with 5 trees, and R goes into not-responding mode... This seems similar to what was described on the Nabble forum: http://r.789695.n4.nabble.com/Large-dataset-randomForest-td830768.html

 
B Yang (Rank 2nd, Posts 195, Thanks 46, Joined 12 Nov '10)

Chris Raimondi wrote:

Also DO NOT use the formula interface (this is a random forest thing and not a caret specific thing).  I read this somewhere and was skeptical, but on my other computer I would run out of memory using the formula method on larger data sets, but be fine without.  I don't get how there is that much of an overhead for using a formula, but whatever - it works.

Where can I find an example of not using the formula interface? I got the formula interface in the caret package working, but always got this error with the non-formula interface:

model = train(x,y,method="rf", metric="RMSE",ntree=400,
               trControl=tc1,tuneGrid=tg)
Fitting: mtry=3
Aggregating results
Selecting tuning parameters
Fitting model on full training set
Error in preProcess.default(trainX, method = pp$options, thresh = pp$thresh,  :
  all columns of x must be numeric

 
Chris Raimondi (Rank 38th, Posts 194, Thanks 90, Joined 9 Jul '10)

Where can I find an example of not using the formula interface?

I was talking about using randomForest itself.  If you use caret - and have factors - you sometimes have to use the formula interface (which can be nice - as you can use it for models that you wouldn't otherwise be able to - I assume it is turning it into dummy variables in the background).
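For the "all columns of x must be numeric" error, one workaround is to expand the factors into dummy columns yourself before passing `x` to the non-formula interface. The sketch below uses base R's `model.matrix()` on made-up data; whether this matches what caret does in the background is an assumption, not something confirmed in this thread.

```r
# Toy predictors with one factor column (hypothetical data)
x <- data.frame(age = c(25, 40, 33, 58),
                sex = factor(c("F", "M", "M", "F")))

# model.matrix() expands factors into 0/1 dummy columns; the first column
# is the intercept, which is dropped here
x_num <- model.matrix(~ ., data = x)[, -1, drop = FALSE]
```

The result is a plain numeric matrix, so it can be passed wherever an all-numeric `x` is required.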

I do not know if the non-formula interface trick actually saves you anything in caret.  It is possible that due to what is going on in the background - you don't get an advantage - or maybe you always get the advantage - I don't know - sorry.

I am POSITIVE though - if you train using randomForest directly - it does help with memory issues - at least it did for me.  I tend to use the formula interface with caret.

If you use:

 getAnywhere(train.default)

You can see some of the code for the train function - you can see where some of the logic takes place for which models require the formula interface to use factors.  I believe that you shouldn't need it for "rf" though - so not sure why you are getting that message.  I believe I have had issues with unusual column names before, but not sure - can't remember.

 
Zach (Rank 31st, Posts 292, Thanks 64, Joined 2 Mar '11)

darc wrote:

I've been fooling around with the caret package and had a few questions someone might be able to answer.  I was wondering if it is possible to use train to explore the tuning parameters but not fit a full final model at the end?  Similarly I was wondering if it is possible to run rfe and not fit the full model at the beginning, but instead give it a list of predictor importance?

rfe HAS to fit a full model at the beginning, so it knows which variables to eliminate.  I'm not sure you can tell caret to not fit a final model.  Look around on the parameters for it on the help page.  The final object returned by caret contains both the final model and the optimal tuning parameters ($finalModel and something else) which you can use to refit the model if you like.

 
Alec Stephenson (Posts 82, Thanks 50, Joined 1 Sep '10)

I've just started playing around in this comp, so I thought I'd revive the thread and contribute one or two R functions, using what hopefully are fairly obvious names for the data objects. A submission function, a training function for random forest parameters, a quick variable importance function, and a couple of faster alternatives to table and tapply that may work for you depending on what you are doing. 

submit.csv <- function(model, newdata, filename, ...)
{
  preds <- predict(model,newdata=newdata, ...)
  preds <- exp(preds)-1
  if(any(is.na(preds))) stop("missing values in predictions")
  if(any(is.infinite(preds))) stop("infinite values in predictions")
  if(any(preds < 0)) stop("negative values in predictions")
  preds <- data.frame(MemberID = daysY4$MemberID,
                      ClaimsTruncated = daysY4$ClaimsTruncated,
                      DaysInHospital = round(preds,6))
  if(nrow(preds) != 70942) stop("incorrect number of rows")
  write.csv(preds, filename, row.names=FALSE, quote=FALSE)
}

rf.train <- function(sampsize=10000,mtry=10,ntree=100,nodesize=50,reps=3)
{
  #arguments should be numeric vectors of any length
  argm <- expand.grid(sampsize=sampsize,mtry=mtry,ntree=ntree,nodesize=nodesize)
  mt <- matrix(NA,ncol=reps,nrow=nrow(argm))
  for(i in 1:nrow(argm)) {
    for(j in 1:reps) {
      rfFita <- randomForest(daysY2MM,TargetY2,ntree=argm[i,3],mtry=argm[i,2],
        replace=FALSE,sampsize=argm[i,1],nodesize=argm[i,4],maxnodes=NULL,
        xtest=daysY3MM, ytest=TargetY3,
        importance=FALSE,localImp=FALSE,keep.forest=TRUE)
      mt[i,j] <- sqrt(rfFita$test$mse)[argm[i,3]]
    }
  }
  cbind(argm,round(mt,5))
}

qvarimp <- function(prdata, target, sort=TRUE, ...)
{
  # prdata should be numeric matrix with no missing values
  # or zero variance variables
  # target should have no missing values
  impFunc <- function(x, y) abs(coef(summary(lm(y ~ x)))[2, "t value"])
  ret <- data.frame(tstat = round(apply(prdata, 2, impFunc, y = target),1))
  # drop=FALSE keeps the one-column data frame (and its row names) when sorting
  if(sort) ret <- ret[sort.list(ret$tstat,decreasing=TRUE), , drop=FALSE]
  ret
}
 
qtapply <- function (X, INDEX, FUN, ..., simplify = TRUE)
{
  #INDEX should be a factor
  FUN <- match.fun(FUN)
  ans <- lapply(split(X, INDEX), FUN, ...)
  if(simplify) ans <- unlist(ans)
  ans
}

qtable <- function(fac, names = FALSE)
{
  #fac should be a factor
  pd <- nlevels(fac)
  y <- tabulate(fac, pd)
  if(names) names(y) <- levels(fac)
  y
}
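A quick way to check that the two shortcut functions agree with their base R counterparts is to run both on simulated data. The sketch below repeats the definitions of qtapply and qtable from above so that it is self-contained; `grp` and `val` are made-up inputs.

```r
# Definitions repeated from the post above so the example runs on its own
qtapply <- function(X, INDEX, FUN, ..., simplify = TRUE) {
  FUN <- match.fun(FUN)
  ans <- lapply(split(X, INDEX), FUN, ...)
  if (simplify) ans <- unlist(ans)
  ans
}
qtable <- function(fac, names = FALSE) {
  pd <- nlevels(fac)
  y <- tabulate(fac, pd)
  if (names) names(y) <- levels(fac)
  y
}

# Simulated grouping factor and values
set.seed(1)
grp <- factor(sample(c("a", "b", "c"), 1000, replace = TRUE))
val <- rnorm(1000)

counts_fast <- qtable(grp, names = TRUE)   # plain named integer vector
counts_base <- table(grp)                  # full table object
means_fast  <- qtapply(val, grp, mean)     # named numeric vector
means_base  <- tapply(val, grp, mean)      # 1-d array
```

The shortcuts return plain vectors rather than table or array objects, which is where any speed gain would come from; timing them on your own data with `system.time()` is the only way to know whether it matters.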
 
 
Chris Raimondi (Rank 38th, Posts 194, Thanks 90, Joined 9 Jul '10)

One thing that is a pain in R is dealing with NAs - most of the time I find it necessary to get rid of the NAs - here are two functions I use - one replaces all NAs with Zeros - the other replaces it with the mean:

 

replaceNAWithMean <- function(x) {
  x[is.na(x)] <- mean(x,na.rm=TRUE)
  x
}

replaceNAWithZero <- function(x) {
  x[is.na(x)] <- 0
  x
}
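These functions work on a single vector, so to clean a whole data frame they need to be applied column by column. A minimal sketch, assuming only the numeric columns should be filled; `df` and `num_cols` are made-up names, and replaceNAWithMean is repeated from above so the example is self-contained.

```r
# Definition repeated from the post above
replaceNAWithMean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Toy data frame with NAs in the numeric columns only
df <- data.frame(a = c(1, NA, 3), b = c(NA, 10, 20), id = c("x", "y", "z"),
                 stringsAsFactors = FALSE)

# Apply the replacement to numeric columns and leave the rest untouched
num_cols <- vapply(df, is.numeric, logical(1))
df[num_cols] <- lapply(df[num_cols], replaceNAWithMean)
```

Restricting the call to numeric columns avoids warnings from taking the mean of character or factor columns.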

 
James Petterson (Rank 31st, Posts 26, Thanks 15, Joined 23 Dec '10)

Chris Raimondi wrote:
One thing that is a pain in R is dealing with NAs - most of the time I find it necessary to get rid of the NAs - here are two functions I use - one replaces all NAs with Zeros - the other replaces it with the mean:

There are some nice functions in the timeSeries library as well:

library(timeSeries)
?substituteNA

 
Zach (Rank 31st, Posts 292, Thanks 64, Joined 2 Mar '11)

There are a couple of useful answers on Stack Overflow for replacing NAs with 0s:
http://stackoverflow.com/questions/8161836/how-do-i-replace-na-values-with-zeros-in-r

This function is very fast:
http://stackoverflow.com/questions/7235657/fastest-way-to-replace-nas-in-a-large-data-table

 