library(caret); library(e1071)
library(kernlab); library(doParallel)
library(randomForest)
partial_1=df[1:10000, ]   # first 10,000 rows of the training data
# All 8 of my cores are detected; I will use 7 for the computation.
cl=makeCluster(detectCores()-1)
registerDoParallel(cl)
t=Sys.time()
rf_1=train(label~., data=partial_1, method="rf")
Sys.time()-t
stopCluster(cl)
# Time difference of 3.000682 hours
# Accuracy 0.9463709
# Accuracy was used to select the optimal model using the largest value.
# The final value used for the model was mtry = 39.
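As a side note, the reported accuracy and the chosen tuning value can also be read back from the caret object itself; a minimal sketch, using the rf_1 object from above:
rf_1$results     # resampling accuracy (and kappa) for each mtry value caret tried
rf_1$bestTune    # the winning tuning value, mtry = 39 here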
I watched the RAM usage, and it stayed a bit under 9 GB. I googled whether there were any improvements on this, and it turned out there are, in newer packages: better memory usage and optimized algorithms. In particular, the "ranger" package is available on CRAN and has a built-in parallel option. I tried it on the whole data set with the same settings as before: 500 trees (the default for both "randomForest" and "ranger"), "mtry=39" (from the caret model above) and 7 cores ("num.threads=7"), and it was a breeze. Computation time was under 1 min 15 sec, and accuracy was again above 94.6% (the out-of-bag prediction error of 3.20% reported below corresponds to roughly 96.8% accuracy). RAM usage was a little above 3 GB, which was pleasant too. The default for "verbose" is TRUE, so if you want to enjoy watching the computations, you can. The formula was written beforehand as "frml". You can see my commands below.
library(ranger)
t=Sys.time()
# Build the formula "label ~ <all other columns>" (label is the first column of df)
frml=as.formula(paste0("label~", paste(names(df)[-1], collapse="+")))
ranger(frml, data=df, mtry=39, num.threads=7, verbose=FALSE, classification=TRUE)
Sys.time()-t
# Time difference of 1.240249 mins
# prediction error: 3.20 %
After this I explored different values of "mtry", since it was now so easy, and increased the number of trees to 5000. My final choice for "mtry" was 32. RAM usage peaked between 9 GB and a bit above 10 GB, and each model took around 10-11 minutes.
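For reference, here is a minimal sketch of such a sweep; the candidate "mtry" values are illustrative rather than my exact grid, and the out-of-bag error is read from the object that ranger returns:
# try a few mtry values and compare out-of-bag prediction error
for (m in c(28, 32, 36, 39)) {
  fit=ranger(frml, data=df, mtry=m, num.trees=500,
             num.threads=7, verbose=FALSE, classification=TRUE)
  cat("mtry =", m, " OOB error =", fit$prediction.error, "\n")
}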
There are two options for prediction with the package. First, you can do everything in one move with the "ranger" function itself. Second, you can set the "write.forest" option in the "ranger" call and then use the "predict" function as usual. My last steps follow, assuming the required test data frame is available under the name "testdf":
t=Sys.time()
rf_mod=ranger(frml, data=df, mtry=32, num.threads=7,
              num.trees=5000, verbose=FALSE,
              write.forest=TRUE, classification=TRUE)
Sys.time()-t
# Time difference of 10.48414 mins
# prediction error: 3.06 %
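Before predicting on the test set, the out-of-bag error and the per-class confusion matrix can be read straight from the fitted object, for example:
rf_mod$prediction.error   # 0.0306, matching the output above
rf_mod$confusion.matrix   # per-class breakdown of the out-of-bag errors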
Computing the final predictions and writing the file for the kaggle.com submission in the required CSV format:
Label=predict(rf_mod, testdf)   # predict() on a ranger forest returns an object, not a plain vector
Label=Label$predictions         # extract the predicted classes
submission=data.frame(id=testdf$id, outcome=Label)
write.csv(submission, "submission.csv", row.names=FALSE)
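A quick sanity check of the submission before uploading never hurts; this uses only base R on the objects built above:
nrow(submission)==nrow(testdf)   # one prediction per test row
table(submission$outcome)        # class counts should look plausible
head(submission)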