
Chris Raimondi wrote:

I don't get this either, but I have had luck choosing a different repository.  I would think this stuff is all automated, but I have found that if I try a few more times I can usually find it (this doesn't appear to work in cases where the webpage indicates it is no longer available - only when it appears that it should be available).  Sometimes I have to try three or four different ones, but it has probably worked about 15 times for me.

Chris - the reason certain Windows binaries are not available is...

Packages that do not compile out of the box or do not pass "R CMD check" with "OK" or at least a "WARNING" will *not* be published.




See:

http://cran.r-project.org/bin/windows/contrib/r-release/ReadMe

Wow that caret package is pretty neat, one package to rule them all.

darc wrote:
Wow that caret package is pretty neat, one package to rule them all.

It really is.  If I ever meet Max Kuhn, I'm gonna buy him a beer.

Here's the structure I use, where X is your model matrix and Y is the target. mtry can be anything from 1 up to the number of predictors. You can let caret optimize it, but that takes a long time, so I usually start with something small like 5 or 10.

predictionBounds is also very useful, particularly in this competition.

train(X, Y, metric = "RMSE", method = 'parRF',
      tuneGrid = expand.grid(.mtry = 4),
      ntree = 500,
      trControl = trainControl(
        method = "boot",
        number = 1,
        predictionBounds = c(0, 15)))

One idea for selecting mtry is to use the method 'rf' with option 'oob' and a small number of trees (say 100) and let caret pick an mtry value, which you then use to construct a larger forest.
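A sketch of that two-step idea (the toy X and Y here stand in for the real data):

```r
library(caret)        # also needs randomForest installed

# toy data standing in for the real model matrix and target
set.seed(1)
X <- data.frame(a = rnorm(200), b = rnorm(200), c = rnorm(200))
Y <- X$a + 0.5 * X$b + rnorm(200)

# OOB resampling fits each candidate mtry only once, so it's cheap
smallFit <- train(X, Y, method = "rf", ntree = 100,
                  trControl = trainControl(method = "oob"))
bestMtry <- smallFit$bestTune$mtry

# reuse bestMtry in a big parRF/randomForest run afterwards
```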

Also parRF doesn't work with the method 'oob'. =(

Is there a discussion forum for R and R users somewhere?

http://r.789695.n4.nabble.com/

http://stats.stackexchange.com/
http://stackoverflow.com/questions/tagged/r

Here's my version of a parallel random forest function:

# Combine several forests grown in parallel into one big forest.
# X and Y are the (global) model matrix and target; each worker grows
# one forest with the given mtry, and randomForest's combine merges them.
multiRF <- function(mtry.values, ...) {
        foreach(i = mtry.values, .combine = combine,
                .packages = 'randomForest',
                .export = c('X', 'Y'), .inorder = FALSE) %dopar% {
                randomForest(X, Y, mtry = i, ...)
        }
}
# 30 forests of 500 trees each -> one 15,000-tree forest
multiRF(c(rep(3, 10), rep(4, 10), rep(5, 10)), ntree = 500)
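Note that %dopar% only runs in parallel once a foreach backend is registered; with doParallel (one backend among several) that looks roughly like:

```r
library(doParallel)

cl <- makeCluster(2)      # or detectCores() - 1
registerDoParallel(cl)

# ... build X and Y, then call multiRF(...) here ...

stopCluster(cl)
```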

I discuss it in more detail on my blog.

@ProTester:

There is a nice little Python package (written up in the Journal of Statistical Software) called pyper that lets you call R from Python.

Tada!

I've been fooling around with the caret package and have a few questions someone might be able to answer.  Is it possible to use train to explore the tuning parameters but not fit a full final model at the end?  Similarly, is it possible to run rfe without fitting the full model at the beginning, instead giving it a list of predictor importances?

The box I usually run my models on is down for maintenance, so I've spent this last week writing some fairly (read: very) useless things in R. Here's an example:

Say I have this constant urge to go outside and enjoy some sunshine whenever my computer is chugging along on a randomForest/neural net/other awesome technique, and it may be minutes/hours/days before my job is done. Still, I'd like to know when it finishes so I can run back to my computer in a hurry. I recently stumbled upon a twitter client for R called...twitteR, that has some neat functions that allow you to tweet from an R console. I can now set up my job like so:

tweet("Starting Job!")
a <- Sys.time()
# some really computationally expensive stuff goes here!
b <- Sys.time()
tweet(paste("Job finished in ", b - a, "!", sep = ""))

If Twitter is too new-fangled, you can always check out the sendmailR package and send yourself an e-mail when it's done.
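A rough sketch of the e-mail route; the addresses, server, and the notify_done helper are all placeholders you'd adapt:

```r
library(sendmailR)

# hypothetical helper: mail yourself the elapsed time of a job
notify_done <- function(elapsed,
                        from = "<r@example.com>",
                        to = "<me@example.com>",
                        server = "smtp.example.com") {
  subject <- sprintf("Job finished in %.1f %s",
                     as.numeric(elapsed), attr(elapsed, "units"))
  sendmail(from, to, subject, msg = subject,
           control = list(smtpServer = server))
}

a <- Sys.time()
# ...expensive model fitting...
# notify_done(Sys.time() - a)   # uncomment once the server details are real
```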

One thing I'm currently working on (having no computer and plenty of free time) is writing an R package that hooks up with Twilio, a pretty interesting service that allows developers to make phone calls and text messages through a web service API. That way, I can get push notifications to my phone via SMS telling me that my 250k tree forest or my quadruple for-loop data-processing function is done.

I fear I am hijacking this thread for R help, but here goes.

Thanks to Chris for answering the question I asked last time. The answer helped, but it brought up a new question. I now have the predict function looking for my y2 data to predict with. However, I want to use my Y3 data. The columns are not named the same, so how do I fudge it to use the model from before? I came up with a kludge, but I would prefer something elegant and fast.

Also, when running a random forest, what run time should I expect? I am running it on my 70,000 x 50 data table with 5 trees, and R goes into "not responding" mode... This seems similar to what was described on the Nabble forum: http://r.789695.n4.nabble.com/Large-dataset-randomForest-td830768.html

Chris Raimondi wrote:

Also DO NOT use the formula interface (this is a random forest thing and not a caret specific thing).  I read this somewhere and was skeptical, but on my other computer I would run out of memory using the formula method on larger data sets, but be fine without.  I don't get how there is that much overhead for using a formula, but whatever - it works.

Where can I find an example of not using the formula interface? I got the formula interface in the caret package working, but always got this error with the non-formula interface:

model = train(x,y,method="rf", metric="RMSE",ntree=400,

               trControl=tc1,tuneGrid=tg)
Fitting: mtry=3
Aggregating results
Selecting tuning parameters
Fitting model on full training set
Error in preProcess.default(trainX, method = pp$options, thresh = pp$thresh,  :
  all columns of x must be numeric

Where can I find an example of not using the formula interface?

I was talking about using randomForest itself.  If you use caret - and have factors - you sometimes have to use the formula interface (which can be nice - as you can use it for models that you wouldn't otherwise be able to - I assume it is turning it into dummy variables in the background).

I do not know if the non-formula interface trick actually saves you anything in caret.  It is possible that, due to what is going on in the background, you don't get an advantage - or maybe you always get the advantage - I don't know, sorry.

I am POSITIVE though - if you train using randomForest directly - it does help with memory issues - at least it did for me.  I tend to use the formula interface with caret.
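For anyone who hasn't seen the two randomForest interfaces side by side (toy data):

```r
library(randomForest)

set.seed(1)
d <- data.frame(y = rnorm(100), a = rnorm(100), b = rnorm(100))

# formula interface: convenient, but the model.frame machinery copies the data
fit1 <- randomForest(y ~ a + b, data = d, ntree = 50)

# non-formula interface: pass predictors and response directly,
# which is lighter on memory for big data sets
fit2 <- randomForest(x = d[, c("a", "b")], y = d$y, ntree = 50)
```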

If you use:

 getAnywhere(train.default)

You can see some of the code for the train function - you can see where some of the logic takes place for which models require the formula interface to use factors.  I believe you shouldn't need it for "rf" though - so not sure why you are getting that message.  I believe I have had issues with unusual column names before, but I'm not sure - can't remember.

darc wrote:

I've been fooling around with the caret package and have a few questions someone might be able to answer.  Is it possible to use train to explore the tuning parameters but not fit a full final model at the end?  Similarly, is it possible to run rfe without fitting the full model at the beginning, instead giving it a list of predictor importances?

rfe HAS to fit a full model at the beginning, so it knows which variables to eliminate.  I'm not sure you can tell caret to not fit a final model.  Look around on the parameters for it on the help page.  The final object returned by caret contains both the final model and the optimal tuning parameters ($finalModel and something else) which you can use to refit the model if you like.
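Pulling those two pieces out of a train object looks like this (toy data):

```r
library(caret)        # also needs randomForest installed

set.seed(1)
X <- data.frame(a = rnorm(100), b = rnorm(100))
Y <- X$a + rnorm(100)

fit <- train(X, Y, method = "rf", ntree = 50,
             trControl = trainControl(method = "cv", number = 3))

fit$bestTune     # the winning tuning parameter(s), here mtry
fit$finalModel   # the model refit on all the data with those parameters
```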

I've just started playing around in this comp, so I thought I'd revive the thread and contribute one or two R functions, using what hopefully are fairly obvious names for the data objects. A submission function, a training function for random forest parameters, a quick variable importance function, and a couple of faster alternatives to table and tapply that may work for you depending on what you are doing. 

submit.csv <- function(model, newdata, filename, ...)
{
  preds <- predict(model, newdata = newdata, ...)
  preds <- exp(preds) - 1
  if (any(is.na(preds))) stop("missing values in predictions")
  if (any(is.infinite(preds))) stop("infinite values in predictions")
  if (any(preds < 0)) stop("negative values in predictions")
  preds <- data.frame(MemberID = daysY4$MemberID,
                      ClaimsTruncated = daysY4$ClaimsTruncated,
                      DaysInHospital = round(preds, 6))
  if (nrow(preds) != 70942) stop("incorrect number of rows")
  write.csv(preds, filename, row.names = FALSE, quote = FALSE)
}
rf.train <- function(sampsize = 10000, mtry = 10, ntree = 100, nodesize = 50, reps = 3)
{
  # arguments should be numeric vectors of any length
  argm <- expand.grid(sampsize = sampsize, mtry = mtry,
                      ntree = ntree, nodesize = nodesize)
  mt <- matrix(NA, ncol = reps, nrow = nrow(argm))
  for (i in 1:nrow(argm)) {
    for (j in 1:reps) {
      rfFita <- randomForest(daysY2MM, TargetY2,
                             ntree = argm[i, 3], mtry = argm[i, 2],
                             replace = FALSE, sampsize = argm[i, 1],
                             nodesize = argm[i, 4], maxnodes = NULL,
                             xtest = daysY3MM, ytest = TargetY3,
                             importance = FALSE, localImp = FALSE,
                             keep.forest = TRUE)
      mt[i, j] <- sqrt(rfFita$test$mse)[argm[i, 3]]
    }
  }
  cbind(argm, round(mt, 5))
}
qvarimp <- function(prdata, target, sort = TRUE, ...)
{
  # prdata should be a numeric matrix with no missing values
  # or zero-variance variables
  # target should have no missing values
  impFunc <- function(x, y) abs(coef(summary(lm(y ~ x)))[2, "t value"])
  ret <- data.frame(tstat = round(apply(prdata, 2, impFunc, y = target), 1))
  # drop = FALSE keeps the data frame (and the variable names) after sorting
  if (sort) ret <- ret[sort.list(ret$tstat, dec = TRUE), , drop = FALSE]
  ret
}
 
qtapply <- function(X, INDEX, FUN, ..., simplify = TRUE)
{
  # INDEX should be a factor
  FUN <- match.fun(FUN)
  ans <- lapply(split(X, INDEX), FUN, ...)
  if (simplify) ans <- unlist(ans)
  ans
}

qtable <- function(fac, names = FALSE)
{
  # fac should be a factor
  pd <- nlevels(fac)
  y <- tabulate(fac, pd)
  if (names) names(y) <- levels(fac)
  y
}
 

One thing that is a pain in R is dealing with NAs - most of the time I find it necessary to get rid of them. Here are two functions I use: one replaces all NAs with zeros, the other replaces them with the mean:


replaceNAWithMean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

replaceNAWithZero <- function(x) {
  x[is.na(x)] <- 0
  x
}
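Since both helpers work on a single vector, applying them to every column of a data frame goes through lapply (helper repeated so the snippet stands alone):

```r
replaceNAWithMean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

d <- data.frame(a = c(1, NA, 3), b = c(NA, 2, 4))
d[] <- lapply(d, replaceNAWithMean)   # d[] keeps the data.frame shape
```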

Chris Raimondi wrote:
One thing that is a pain in R is dealing with NAs - most of the time I find it necessary to get rid of them. Here are two functions I use: one replaces all NAs with zeros, the other replaces them with the mean:

There are some nice functions in the timeSeries library as well:

library(timeSeries)
?substituteNA

There are a couple of useful answers on stackoverflow for replacing NAs with 0s:
http://stackoverflow.com/questions/8161836/how-do-i-replace-na-values-with-zeros-in-r

This function is very fast:
http://stackoverflow.com/questions/7235657/fastest-way-to-replace-nas-in-a-large-data-table
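The trick behind that second link is data.table's set(), which replaces values by reference instead of copying the whole table; roughly:

```r
library(data.table)

DT <- data.table(a = c(1, NA, 3), b = c(NA, 2, 4))
for (j in seq_along(DT)) {
  set(DT, i = which(is.na(DT[[j]])), j = j, value = 0)   # in-place, no copies
}
```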
