I've been fooling around with the caret package and have a few questions someone might be able to answer. Is it possible to use train to explore the tuning parameters without fitting a full final model at the end? Similarly, is it possible to run rfe without fitting the full model at the beginning, and instead give it a list of predictor importances?
Posts 7 Joined 1 Jan '11 Email user |
Thanks 1 Joined 31 May '11 Email user |
The box I usually run my models on is down for maintenance, so I've spent this last week writing some fairly (read: very) useless things in R. Here's an example: say I have this constant urge to go outside and enjoy some sunshine whenever my computer is chugging along on a randomForest/neural net/other awesome technique, and it may be minutes/hours/days before my job is done. Still, I'd like to know when it finishes so I can run back to my computer in a hurry. I recently stumbled upon a twitter client for R called...
tweet("Starting Job!")
a <- Sys.time()
# SOME REALLY COMPUTATIONALLY EXPENSIVE STUFF GOES HERE!
b <- Sys.time()
# format() keeps the units (secs/mins/hours) that the time difference was given in
tweet(paste("Job finished in ", format(b - a), "!", sep=""))
If Twitter is too new-fangled, you can always check out the
One thing I'm currently working on (having no computer and plenty of free time) is writing an R package that hooks up with Twilio, a pretty interesting service that allows developers to make phone calls and send text messages through a web service API. That way, I can get push notifications to my phone via SMS telling me that my 250k tree forest or my quadruple for-loop data-processing function is done.
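For anyone who wants the time-it-and-notify pattern without committing to a particular service, here is a minimal base R sketch. The run_and_notify name and its notify argument are made up for illustration; notify can be any function that takes one string, so message could be swapped for tweet or whatever SMS sender you end up with.

```r
# Hypothetical helper: run an expensive expression, then send a notification
# with the elapsed time. `notify` is any function that accepts one string.
run_and_notify <- function(expr, notify = message) {
  notify("Starting job!")
  start <- Sys.time()
  # `expr` is evaluated lazily, so the work happens here, after the timer starts
  result <- tryCatch(expr, error = function(e) {
    notify(paste0("Job FAILED: ", conditionMessage(e)))
    stop(e)
  })
  elapsed <- Sys.time() - start
  notify(paste0("Job finished in ", format(elapsed, digits = 3), "!"))
  invisible(result)
}

# Usage: collect the notifications in a vector instead of sending them anywhere
sent <- character(0)
collect <- function(txt) sent <<- c(sent, txt)
total <- run_and_notify(sum(sqrt(1:1e6)), notify = collect)
```

Reporting failures as well as successes seemed worth the extra lines, since a job that dies after ten minutes would otherwise leave you sitting in the sun for hours.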
Joined 6 May '11 Email user |
I fear I am hijacking this thread for R help, but here goes. Thanks to Chris for answering the question I asked last time. The answer helped, but it brought up a new question. I now have the predict function looking for my y2 data to predict with, but I want to use my Y3 data. The columns are not named the same, so how do I fudge it to use the model from before? I came up with a kludge, but I would prefer something elegant and fast. Also, when running a random forest, how long should I expect it to take? I am running it on my 70,000 x 50 data table with 5 trees, and R goes into not-responding mode... This seems similar to what was described on the Nabble forum: http://r.789695.n4.nabble.com/Large-dataset-randomForest-td830768.html
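On the column-name question: predict looks predictors up by name, so one option is to rename a copy of the new data to the names the model was trained on. The sketch below uses made-up data and column names, with lm as a stand-in so it runs in base R. It assumes the new columns are in the same order, and mean the same thing, as the training columns.

```r
set.seed(1)
# Made-up training data standing in for the original table
train <- data.frame(a = rnorm(50), b = rnorm(50))
train$y <- 2 * train$a - train$b + rnorm(50, sd = 0.1)
fit <- lm(y ~ a + b, data = train)

# New data with the same columns in the same order, but different names
y3 <- data.frame(col1 = rnorm(10), col2 = rnorm(10))

# Rename a copy so the original table is left untouched
predictors <- c("a", "b")
y3_renamed <- setNames(y3, predictors)
preds <- predict(fit, newdata = y3_renamed)
```

If the column order differs between the two tables, an explicit old-name to new-name lookup would be safer than relying on position.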
Posts 195 Thanks 46 Joined 12 Nov '10 Email user |
Chris Raimondi wrote: Also DO NOT use the formula interface (this is a random forest thing and not a caret-specific thing). I read this somewhere and was skeptical, but on my other computer I would run out of memory using the formula method on larger data sets, but be fine without. I don't get how there is that much of an overhead for using a formula, but whatever - it works.
Where can I find an example of not using the formula interface? I got the formula interface in the caret package working, but always got this error with the non-formula interface: model = train(x,y,method="rf", metric="RMSE",ntree=400, trControl=tc1,tuneGrid=tg)
Posts 194 Thanks 90 Joined 9 Jul '10 Email user |
I was talking about using randomForest itself. If you use caret and have factors, you sometimes have to use the formula interface (which can be nice, as you can use it for models that you wouldn't otherwise be able to - I assume it is turning the factors into dummy variables in the background). I do not know if the non-formula interface trick actually saves you anything in caret. It is possible that, due to what is going on in the background, you don't get an advantage, or maybe you always get the advantage - I don't know, sorry. I am POSITIVE though: if you train using randomForest directly, it does help with memory issues - at least it did for me. I tend to use the formula interface with caret. If you use: getAnywhere(train.default) you can see some of the code for the train function, including some of the logic for which models require the formula interface to use factors. I believe that you shouldn't need it for "rf" though, so I'm not sure why you are getting that message. I believe I have had issues with unusual column names before, but I'm not sure - I can't remember.
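In case it helps, this is roughly what the non-formula call looks like when calling randomForest directly. It is only a sketch on the built-in iris data, the ntree value is arbitrary, and the fitting step is skipped if the randomForest package is not installed.

```r
# Predictors as a data frame (factor columns are allowed), response as a vector
x <- iris[, c("Sepal.Width", "Petal.Length", "Petal.Width", "Species")]
y <- iris$Sepal.Length

if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(42)
  # Non-formula interface: pass x and y directly
  rf_xy <- randomForest::randomForest(x = x, y = y, ntree = 50)

  # Formula interface, for comparison: same model family, different entry point
  rf_formula <- randomForest::randomForest(Sepal.Length ~ ., data = iris, ntree = 50)

  print(rf_xy)
}
```

Whether the x/y form saves memory on a given data set is something to measure rather than assume; the example only shows the two call shapes side by side.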
Posts 292 Thanks 64 Joined 2 Mar '11 Email user |
darc wrote: I've been fooling around with the caret package and have a few questions someone might be able to answer. Is it possible to use train to explore the tuning parameters without fitting a full final model at the end? Similarly, is it possible to run rfe without fitting the full model at the beginning, and instead give it a list of predictor importances?
rfe HAS to fit a full model at the beginning, so it knows which variables to eliminate. I'm not sure you can tell caret to not fit a final model. Look through the parameters on the help page. The final object returned by caret contains both the final model and the optimal tuning parameters ($finalModel and something else), which you can use to refit the model if you like.
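For reference, the pieces of the returned object can be inspected as in the sketch below. The chosen tuning parameters live in $bestTune, next to $finalModel, and $results holds the resampled performance for every candidate. This assumes both caret and randomForest are installed (it is skipped otherwise) and uses a recent caret, where the tuneGrid column is named mtry; older releases expected .mtry, as far as I remember. The data, grid, and ntree value are arbitrary.

```r
# Toy regression problem on the built-in iris data
x <- iris[, c("Sepal.Width", "Petal.Length", "Petal.Width")]
y <- iris$Sepal.Length
tune_grid <- data.frame(mtry = 1:3)   # candidate values to explore

if (requireNamespace("caret", quietly = TRUE) &&
    requireNamespace("randomForest", quietly = TRUE)) {
  suppressPackageStartupMessages(library(caret))
  set.seed(42)
  fit <- train(
    x, y,
    method    = "rf",
    tuneGrid  = tune_grid,
    trControl = trainControl(method = "cv", number = 3),
    ntree     = 50
  )

  print(fit$results)            # resampled performance for each candidate mtry
  print(fit$bestTune)           # the tuning parameters train() selected
  print(class(fit$finalModel))  # the model refit on all of the training data
}
```

If only the tuning results matter, $results and $bestTune can be kept and the fitted model discarded afterwards; the final fit still runs, though.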
Thanks 50 Joined 1 Sep '10 Email user |
I've just started playing around in this comp, so I thought I'd revive the thread and contribute one or two R functions, using what hopefully are fairly obvious names for the data objects: a submission function, a training function for random forest parameters, a quick variable importance function, and a couple of faster alternatives to table and tapply that may work for you depending on what you are doing.
submit.csv <- function(model, newdata, filename, ...)
rf.train <- function(sampsize=10000,mtry=10,ntree=100,nodesize=50,reps=3)
qvarimp <- function(prdata, target, sort=TRUE, ...)
qtapply <- function (X, INDEX, FUN, ..., simplify = TRUE)
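As one illustration of the "faster alternative to tapply" idea (this is not the qtapply listed above, just a sketch of the general approach), base R's rowsum computes grouped sums, and a grouped mean follows by dividing by the group counts. The name group_mean is made up, and the sketch assumes a numeric vector with no NAs.

```r
# Hypothetical helper: grouped mean via rowsum(), for a numeric vector x
# and a grouping vector g of the same length (no NA handling).
group_mean <- function(x, g) {
  sums   <- rowsum(x, g)                    # one-column matrix of group sums
  counts <- rowsum(rep(1, length(x)), g)    # group sizes, same row order
  setNames(sums[, 1] / counts[, 1], rownames(sums))
}

set.seed(1)
x <- runif(1e5)
g <- sample(letters[1:5], 1e5, replace = TRUE)

fast <- group_mean(x, g)
slow <- tapply(x, g, mean)

# Compare the two results (ignoring attributes)
all.equal(as.vector(fast), as.vector(slow))
```

This only covers sums and means; an arbitrary FUN, as in the qtapply signature, needs a different trick.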
Posts 194 Thanks 90 Joined 9 Jul '10 Email user |
One thing that is a pain in R is dealing with NAs - most of the time I find it necessary to get rid of the NAs - here are two functions I use - one replaces all NAs with Zeros - the other replaces it with the mean:
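A minimal base R sketch of the two replacements described (zeros and column means) might look like the following. These are not Chris's functions; the names na_to_zero and na_to_mean are made up, and only numeric columns are touched.

```r
# Replace NAs with 0 in every numeric column of a data frame
na_to_zero <- function(df) {
  num <- vapply(df, is.numeric, logical(1))
  df[num] <- lapply(df[num], function(col) {
    col[is.na(col)] <- 0
    col
  })
  df
}

# Replace NAs with the column mean (computed from the non-missing values)
na_to_mean <- function(df) {
  num <- vapply(df, is.numeric, logical(1))
  df[num] <- lapply(df[num], function(col) {
    col[is.na(col)] <- mean(col, na.rm = TRUE)
    col
  })
  df
}

# Usage on a small made-up data frame; the factor column is left alone
d <- data.frame(a = c(1, NA, 3), b = c(NA, 4, 6), f = factor(c("x", NA, "y")))
d0 <- na_to_zero(d)
dm <- na_to_mean(d)
```

Note that a column that is entirely NA ends up as NaN under na_to_mean, so it may be worth checking for those first.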
Posts 26 Thanks 15 Joined 23 Dec '10 Email user |
Chris Raimondi wrote:
One thing that is a pain in R is dealing with NAs - most of the time I find it necessary to get rid of the NAs - here are two functions I use - one replaces all NAs with Zeros - the other replaces it with the mean:
There are some nice functions in the timeSeries library as well:
Posts 292 Thanks 64 Joined 2 Mar '11 Email user |
There are a couple of useful answers on Stack Overflow for replacing NAs with 0s: This function is very fast: