# Contribute an R Function

 0 votes Chris Raimondi wrote: I don't get this either - but I have had luck choosing a different repository. I would think this stuff is all automated, but I have found that if I try a few more times I can usually find it (this doesn't appear to work in cases where the webpage indicates it is no longer available - only when it appears that it should be available). Sometimes I have to try three or four different ones, but it has worked about 15 times for me. Chris - the reason certain Windows binaries are not available is... Packages that do not compile out of the box or do not pass "R CMD check" with "OK" or at least a "WARNING" will *not* be published. #21 | Posted 5 years ago Competition 1st | Overall 678th Posts 378 | Votes 178 Joined 22 Jun '10 | Email User
 0 votes Wow that caret package is pretty neat, one package to rule them all. #22 | Posted 5 years ago Competition 61st Posts 7 Joined 1 Jan '11 | Email User
 0 votes darc wrote: Wow that caret package is pretty neat, one package to rule them all. It really is.  If I ever meet Max Kuhn, I'm gonna buy him a beer. #23 | Posted 5 years ago Competition 9th | Overall 365th Posts 574 | Votes 580 Joined 2 Mar '11 | Email User
 2 votes Here's the structure I use, where X is your model matrix and Y is the target. mtry can be any number; you can let caret optimize it, but that takes a long time, so I usually start with something small like 5 or 10. predictionBounds is also very useful, particularly in this competition.

```r
train(X, Y, metric = "RMSE", method = 'parRF',
      tuneGrid = expand.grid(.mtry = 4), ntree = 500,
      trControl = trainControl(method = "boot", number = 1,
                               predictionBounds = c(0, 15)))
```

One idea for selecting mtry is to use the method 'rf' with option 'oob' and a small number of trees (say 100) and let caret pick an mtry value, which you then use to construct a larger forest. Also, parRF doesn't work with the method 'oob'. =( #24 | Posted 5 years ago Competition 9th | Overall 365th Posts 574 | Votes 580 Joined 2 Mar '11 | Email User
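The two-step mtry idea above might be sketched like this (hypothetical X and Y; the `.mtry` grid column follows the caret version in use in this thread - newer versions spell it `mtry`):

```r
library(caret)
library(randomForest)

# Step 1: cheap mtry search with a small forest, scored by OOB error
# (no resampling loop, so this runs quickly)
oobFit <- train(X, Y, method = "rf", ntree = 100,
                tuneGrid = expand.grid(.mtry = c(3, 5, 7, 10)),
                trControl = trainControl(method = "oob"))

# Step 2: grow the big forest with the winning mtry
bestMtry <- oobFit$bestTune[1, 1]
finalFit <- randomForest(X, Y, mtry = bestMtry, ntree = 2000)
```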
 0 votes Is there a discussion forum for R and R users somewhere? #25 | Posted 5 years ago Competition 2nd | Overall 486th Posts 279 | Votes 121 Joined 12 Nov '10 | Email User
 0 votes Here's my version of a parallel random forest function:

```r
multiRF <- function(x, ...) {
  foreach(i = x, .combine = combine, .packages = 'randomForest',
          .export = c('X', 'Y'), .inorder = FALSE) %dopar% {
    randomForest(X, Y, mtry = i, ...)
  }
}

multiRF(c(rep(3, 10), rep(4, 10), rep(5, 10)), ntree = 500)
```

I discuss it in more detail on my blog. #29 | Posted 5 years ago Competition 9th | Overall 365th Posts 574 | Votes 580 Joined 2 Mar '11 | Email User
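multiRF assumes a foreach backend has already been registered; a minimal setup (sketched here with doParallel, though doSNOW or doMC would also work) might be:

```r
library(doParallel)

cl <- makeCluster(4)      # number of workers; match your core count
registerDoParallel(cl)

# ... define X and Y, then call multiRF(...) here ...

stopCluster(cl)           # release the workers when done
```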
 1 vote @ProTester: There is a nice little Python package (written up in the Journal of Statistical Software) called pyper that lets you call R from Python. Tada! #30 | Posted 5 years ago Posts 1 | Votes 1 Joined 15 Nov '10 | Email User
 0 votes I've been fooling around with the caret package and had a few questions someone might be able to answer. I was wondering if it is possible to use train to explore the tuning parameters but not fit a full final model at the end? Similarly I was wondering if it is possible to run rfe and not fit the full model at the beginning, but instead give it a list of predictor importance? #31 | Posted 5 years ago Competition 61st Posts 7 Joined 1 Jan '11 | Email User
 1 vote The box I usually run my models on is down for maintenance, so I've spent this last week writing some fairly (read: very) useless things in R. Here's an example: Say I have this constant urge to go outside and enjoy some sunshine whenever my computer is chugging along on a randomForest/neural net/other awesome technique, and it may be minutes/hours/days before my job is done. Still, I'd like to know when it finishes so I can run back to my computer in a hurry. I recently stumbled upon a twitter client for R called...twitteR, that has some neat functions that allow you to tweet from an R console. I can now set up my job like so:

```r
tweet("Starting job!")
a <- Sys.time()
# some really computationally expensive stuff goes here!
b <- Sys.time()
tweet(paste("Job finished in ", b - a, "!", sep = ""))
```

If Twitter is too new-fangled, you can always check out the sendmailR package and send yourself an e-mail when it's done. One thing I'm currently working on (having no computer and plenty of free time) is writing an R package that hooks up with Twilio, a pretty interesting service that allows developers to make phone calls and send text messages through a web service API. That way, I can get push notifications to my phone via SMS telling me that my 250k-tree forest or my quadruple for-loop data-processing function is done. #32 | Posted 5 years ago Posts 2 | Votes 1 Joined 31 May '11 | Email User
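For the sendmailR route, a rough sketch (the addresses and SMTP server below are placeholders, not a tested configuration):

```r
library(sendmailR)

start <- Sys.time()
# ... long-running job goes here ...
elapsed <- round(difftime(Sys.time(), start, units = "mins"), 1)

sendmail(from = "<me@example.com>",
         to = "<me@example.com>",
         subject = "R job finished",
         msg = paste("Job finished in", elapsed, "minutes"),
         control = list(smtpServer = "smtp.example.com"))
```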
 0 votes I fear I am hijacking this thread for R help, but here it goes. Thanks to Chris for answering the question I asked last time. The answer helped, but it brought up a new question. So I now have the predict function looking for my Y2 data to predict with. However, I want to use my Y3 data. The columns are not named the same, so how do I fudge it to use the model from before? I came up with a kludge, but I would prefer something elegant and fast. Also, when running a random forest, what run time should I expect? I am running it on my 70,000 x 50 data table, with 5 trees, and R goes into not-responding mode... This seems similar to what was described on the Nabble forum: http://r.789695.n4.nabble.com/Large-dataset-randomForest-td830768.html #33 | Posted 5 years ago Posts 14 Joined 6 May '11 | Email User
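One plain workaround for the column-name mismatch - assuming the Y3 frame holds the same variables in the same order as the Y2 frame the model was trained on (the object names below are guesses at the poster's data):

```r
# Copy the new data and give it the training data's column names,
# so predict() finds the variables it expects
newdata <- Y3
colnames(newdata) <- colnames(Y2)
preds <- predict(model, newdata = newdata)
```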
 0 votes Chris Raimondi wrote: Also DO NOT use the formula interface (this is a random forest thing and not a caret specific thing). I read this somewhere and was skeptical, but on my other computer I would run out of memory using the formula method on larger data sets, but be fine without. I don't get how there is that much overhead for using a formula, but whatever - it works. Where can I find an example of not using the formula interface? I got the formula interface in the caret package working, but always got this error with the non-formula interface:

```r
model = train(x, y, method = "rf", metric = "RMSE", ntree = 400,
              trControl = tc1, tuneGrid = tg)
```

```
Fitting: mtry=3
Aggregating results
Selecting tuning parameters
Fitting model on full training set
Error in preProcess.default(trainX, method = pp$options, thresh = pp$thresh,  :
  all columns of x must be numeric
```

#34 | Posted 5 years ago Competition 2nd | Overall 486th Posts 279 | Votes 121 Joined 12 Nov '10 | Email User
 0 votes Where can I find an example of not using the formula interface? I was talking about using randomForest itself. If you use caret - and have factors - you sometimes have to use the formula interface (which can be nice - as you can use it for models that you wouldn't otherwise be able to - I assume it is turning them into dummy variables in the background). I do not know if the non-formula interface trick actually saves you anything in caret. It is possible that due to what is going on in the background you don't get an advantage - or maybe you always get the advantage - I don't know, sorry. I am POSITIVE though - if you train using randomForest directly - it does help with memory issues - at least it did for me. I tend to use the formula interface with caret. If you use getAnywhere(train.default) you can see some of the code for the train function - you can see where some of the logic takes place for which models require the formula interface to use factors. I believe you shouldn't need it for "rf" though - so not sure why you are getting that message. I believe I have had issues with unusual column names before, but I can't remember for sure. #35 | Posted 5 years ago Competition 20th Posts 194 | Votes 92 Joined 9 Jul '10 | Email User
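For reference, the two randomForest calling styles being compared - a sketch with hypothetical objects (df a data frame containing the target, X a numeric predictor matrix, Y the response vector):

```r
library(randomForest)

# Formula interface: convenient, but building the model frame makes
# extra copies of the data, which can exhaust memory on large sets
fit1 <- randomForest(Target ~ ., data = df)

# Non-formula (x/y) interface: passes the predictors straight through,
# avoiding the model-frame overhead
fit2 <- randomForest(x = X, y = Y)
```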
 0 votes darc wrote: I've been fooling around with the caret package and had a few questions someone might be able to answer.  I was wondering if it is possible to use train to explore the tuning parameters but not fit a full final model at the end?  Similarly I was wondering if it is possible to run rfe and not fit the full model at the beginning, but instead give it a list of predictor importance? rfe HAS to fit a full model at the beginning, so it knows which variables to eliminate. I'm not sure you can tell caret not to fit a final model. Look around on the parameters for it on the help page. The final object returned by caret contains both the final model and the optimal tuning parameters ($finalModel and something else) which you can use to refit the model if you like. #36 | Posted 5 years ago Competition 9th | Overall 365th Posts 574 | Votes 580 Joined 2 Mar '11 | Email User
 0 votes I've just started playing around in this comp, so I thought I'd revive the thread and contribute one or two R functions, using what hopefully are fairly obvious names for the data objects. A submission function, a training function for random forest parameters, a quick variable importance function, and a couple of faster alternatives to table and tapply that may work for you depending on what you are doing.

```r
submit.csv <- function(model, newdata, filename, ...) {
  preds <- predict(model, newdata = newdata, ...)
  preds <- exp(preds) - 1
  if (any(is.na(preds))) stop("missing values in predictions")
  if (any(is.infinite(preds))) stop("infinite values in predictions")
  if (any(preds < 0)) stop("negative values in predictions")
  preds <- data.frame(MemberID = daysY4$MemberID,
                      ClaimsTruncated = daysY4$ClaimsTruncated,
                      DaysInHospital = round(preds, 6))
  if (nrow(preds) != 70942) stop("incorrect number of rows")
  write.csv(preds, filename, row.names = FALSE, quote = FALSE)
}

rf.train <- function(sampsize = 10000, mtry = 10, ntree = 100,
                     nodesize = 50, reps = 3) {
  # arguments should be numeric vectors of any length
  argm <- expand.grid(sampsize = sampsize, mtry = mtry,
                      ntree = ntree, nodesize = nodesize)
  mt <- matrix(NA, ncol = reps, nrow = nrow(argm))
  for (i in 1:nrow(argm)) {
    for (j in 1:reps) {
      rfFita <- randomForest(daysY2MM, TargetY2,
                             ntree = argm[i, 3], mtry = argm[i, 2],
                             replace = FALSE, sampsize = argm[i, 1],
                             nodesize = argm[i, 4], maxnodes = NULL,
                             xtest = daysY3MM, ytest = TargetY3,
                             importance = FALSE, localImp = FALSE,
                             keep.forest = TRUE)
      mt[i, j] <- sqrt(rfFita$test$mse)[argm[i, 3]]
    }
  }
  cbind(argm, round(mt, 5))
}

qvarimp <- function(prdata, target, sort = TRUE, ...) {
  # prdata should be a numeric matrix with no missing values
  # or zero-variance variables; target should have no missing values
  impFunc <- function(x, y) abs(coef(summary(lm(y ~ x)))[2, "t value"])
  ret <- data.frame(tstat = round(apply(prdata, 2, impFunc, y = target), 1))
  if (sort) ret <- ret[sort.list(ret$tstat, dec = TRUE), ]
  ret
}

qtapply <- function(X, INDEX, FUN, ..., simplify = TRUE) {
  # INDEX should be a factor
  FUN <- match.fun(FUN)
  ans <- lapply(split(X, INDEX), FUN, ...)
  if (simplify) ans <- unlist(ans)
  ans
}

qtable <- function(fac, names = FALSE) {
  # fac should be a factor
  pd <- nlevels(fac)
  y <- tabulate(fac, pd)
  if (names) names(y) <- levels(fac)
  y
}
```

#37 | Posted 4 years ago Posts 82 | Votes 59 Joined 1 Sep '10 | Email User
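A quick base-R check of the two fast helpers against made-up data (the definitions are repeated here so the snippet runs standalone):

```r
qtapply <- function(X, INDEX, FUN, ..., simplify = TRUE) {
  FUN <- match.fun(FUN)
  ans <- lapply(split(X, INDEX), FUN, ...)
  if (simplify) ans <- unlist(ans)
  ans
}

qtable <- function(fac, names = FALSE) {
  y <- tabulate(fac, nlevels(fac))
  if (names) names(y) <- levels(fac)
  y
}

fac <- factor(c("a", "b", "a", "c", "a"))
x <- 1:5

qtable(fac, names = TRUE)   # a = 3, b = 1, c = 1 (same as table(fac))
qtapply(x, fac, sum)        # a = 9, b = 2, c = 4 (same as tapply(x, fac, sum))
```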
 0 votes One thing that is a pain in R is dealing with NAs - most of the time I find it necessary to get rid of them. Here are two functions I use - one replaces all NAs with the mean, the other with zero:

```r
replaceNAWithMean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

replaceNAWithZero <- function(x) {
  x[is.na(x)] <- 0
  x
}
```

#38 | Posted 4 years ago Competition 20th Posts 194 | Votes 92 Joined 9 Jul '10 | Email User
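These work on one vector at a time; to sweep a whole data frame, lapply plus empty-bracket assignment keeps the data.frame shape (small made-up frame below):

```r
replaceNAWithMean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

df <- data.frame(a = c(1, NA, 3), b = c(NA, 2, 2))
df[] <- lapply(df, replaceNAWithMean)  # df[] preserves the data.frame class
df
# a: 1 2 3   b: 2 2 2
```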
 0 votes Chris Raimondi wrote: One thing that is a pain in R is dealing with NAs - most of the time I find it necessary to get rid of the NAs - here are two functions I use - one replaces all NAs with the mean - the other with zero: There are some nice functions in the timeSeries library as well:

```r
library(timeSeries)
?substituteNA
```

#39 | Posted 4 years ago Competition 9th | Overall 624th Posts 51 | Votes 83 Joined 23 Dec '10 | Email User
 0 votes There are a couple of useful answers on Stack Overflow for replacing NAs with 0s: http://stackoverflow.com/questions/8161836/how-do-i-replace-na-values-with-zeros-in-r This function is very fast: http://stackoverflow.com/questions/7235657/fastest-way-to-replace-nas-in-a-large-data-table #40 | Posted 4 years ago Competition 9th | Overall 365th Posts 574 | Votes 580 Joined 2 Mar '11 | Email User