Mathematician in Data Science: One of the best random forest implementations

I was looking for ways to run a parallel version of random forest. Of course, we can do it with the widely known "randomForest", "caret" and "doParallel" packages. But my attempt to apply them to the Digit Recognizer problem on Kaggle was not satisfactory: when I tried the whole data set, my R session crashed, and when I took less than a quarter of it, it ran for too long. Here are my code and my notes, assuming that the data set is already loaded as a data frame "df":

library(caret); library(e1071)
library(kernlab); library(doParallel)
library(randomForest)

partial_1=df[1:10000, ]

# All 8 of my cores are detected. I will use 7 for the calculations.

cl=makeCluster(detectCores()-1)
registerDoParallel(cl)
t=Sys.time()
rf_1=train(label~., data=partial_1, method="rf" )
Sys.time()-t
stopCluster(cl)

# Time difference of 3.000682 hours
# Accuracy 0.9463709
# Accuracy was used to select the optimal model using the largest value.
# The final value used for the model was mtry = 39.
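
A rough count of the work done by the caret call above may help explain the 3 hours. This is only a sketch: it assumes caret's defaults for "train" (25 bootstrap resamples and a tuning grid of 3 values for "mtry"), which the code above does not set explicitly, and the variable names are mine.

```r
# Assumed caret defaults (not set explicitly in the post):
# bootstrap resampling with 25 repetitions and 3 candidate mtry values.
n_resamples   <- 25
n_mtry_values <- 3
n_trees       <- 500  # default number of trees in randomForest

# One forest per (resample, mtry) pair, plus one final refit on all rows.
forest_fits <- n_resamples * n_mtry_values + 1
total_trees <- forest_fits * n_trees

forest_fits  # 76
total_trees  # 38000
```

Under these assumptions, most of the running time goes to tuning "mtry" rather than to growing a single forest.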

I watched how much RAM was used, and it was a bit under 9 GB. I googled whether there were any improvements on this, and it turned out there are, in newer packages: better memory usage and optimized algorithms. In particular, the package "ranger" is available from CRAN, R's own package repository. It has a built-in parallel option. I tried it on the whole data set with the same options as before: 500 trees (the default for both "randomForest" and "ranger"), "mtry=39" (from the caret model above) and 7 cores ("num.threads=7"), and it was a breeze. Computation time was less than 1 min 15 sec, and the accuracy was higher than the 94.6% above as well. RAM usage was a little above 3 GB, which was pleasant, too. The default for "verbose" is TRUE, so if you enjoy watching the computation progress, you can. The formula was written beforehand as "frml". You can see my commands below.

library(ranger)
# Build the formula "label ~ <all other columns>"; the label is the first column of df
frml=as.formula(paste0("label~", paste(names(df)[-1], collapse='+')))
t=Sys.time()
ranger(frml, data=df, mtry=39, num.threads=7, verbose=FALSE, classification=TRUE)
Sys.time()-t
# Time difference of 1.240249 mins
# prediction error: 3.20 %
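
The formula built from the column names can be checked on a small data frame before using it on the real data. The sketch below uses a toy data frame with made-up column names (an assumption, not the real data) and compares the pasted formula with the one produced by base R's "reformulate", which does the same job without string pasting.

```r
# Toy data frame: the first column is the label, the rest are predictors.
# Column names here are hypothetical stand-ins for the real ones.
toy <- data.frame(label = c(1, 0), x1 = c(5, 3), x2 = c(0, 7), x3 = c(2, 2))

# Same construction as in the post: paste all names except the first.
frml_paste <- as.formula(paste0("label~", paste(names(toy)[-1], collapse = "+")))

# Alternative from base R (stats): build the formula from a character vector.
frml_reform <- reformulate(names(toy)[-1], response = "label")

# Both print as: label ~ x1 + x2 + x3
print(frml_paste)
print(frml_reform)
same_formula <- identical(deparse(frml_paste), deparse(frml_reform))
```

Either form can be passed to "ranger" as its formula argument.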

After this I explored different values of "mtry", since it was so easy, and changed the number of trees to 5000. My final choice for "mtry" was 32. RAM usage peaked between 9 and a bit above 10 GB. Each model took around 10-11 minutes to fit.
There are 2 options for prediction with the package. First, you can do it in one move with the "ranger" function. Second, you can set the option "write.forest" in the "ranger" function, and then use the "predict" function as usual. My last steps follow, assuming that the required test data frame is available under the name "testdf":

t=Sys.time()
rf_mod=ranger(frml, data=df, mtry=32, num.threads=7,
              num.trees=5000, verbose=FALSE,
              write.forest=TRUE, classification=TRUE)
Sys.time()-t
# Time difference of 10.48414 mins
# prediction error: 3.06 %
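
The exploration of "mtry" values mentioned earlier could look roughly like the sketch below. It is not the code that was actually run: it uses the built-in "iris" data as a small stand-in, a hypothetical grid of values, and fewer trees so that it finishes in seconds. It compares models by the out-of-bag prediction error that "ranger" stores in the fitted object.

```r
library(ranger)

# iris has 4 predictors, so mtry can range over 1..4.
# Hypothetical grid; the post does not list the values that were tried.
mtry_grid <- 1:4

# Fit one forest per mtry value and collect the out-of-bag prediction error.
oob_error <- sapply(mtry_grid, function(m) {
  fit <- ranger(Species ~ ., data = iris, mtry = m,
                num.trees = 200, seed = 1)
  fit$prediction.error
})

# Tabulate the results and pick the value with the lowest OOB error.
results <- data.frame(mtry = mtry_grid, oob_error = oob_error)
best_mtry <- results$mtry[which.min(results$oob_error)]
print(results)
```

For the digit data, the same loop would use "frml", "df" and a grid around 39 instead of the toy values.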

Computing the final predictions and writing the file for submission to kaggle.com as the required table in CSV format:

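# Predict labels for the test set; the predicted classes are stored in $predictions
Label=predict(rf_mod, testdf)
Label=Label$predictions
# Assumes testdf has an "id" column
submission=data.frame(id=testdf$id, outcome=Label)
write.csv(submission,"submission.csv", row.names=FALSE)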
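
Before uploading, it may be worth reading the file back to confirm that it has one row per test case and no missing predictions. A minimal sketch, using made-up predictions and a temporary file instead of the real "submission.csv":

```r
# Toy predictions standing in for Label$predictions (hypothetical values).
Label <- c(2, 0, 9, 4)
submission <- data.frame(id = seq_along(Label), outcome = Label)

# Write to a temporary file so that nothing in the working directory is touched.
out_file <- tempfile(fileext = ".csv")
write.csv(submission, out_file, row.names = FALSE)

# Read it back and check the shape and completeness of the table.
check <- read.csv(out_file)
stopifnot(nrow(check) == length(Label), !anyNA(check$outcome))
print(head(check))
```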

Monday, June 27, 2016
