library(caret); library(e1071)
library(kernlab); library(doParallel)
library(randomForest)
partial_1=df[1:10000, ]   # first 10,000 rows of the training data
# All 8 of my cores are detected; I will use 7 for the computation.
cl=makeCluster(detectCores()-1)
registerDoParallel(cl)
t=Sys.time()
rf_1=train(label~., data=partial_1, method="rf")
Sys.time()-t
stopCluster(cl)
# Time difference of 3.000682 hours
# Accuracy 0.9463709
# Accuracy was used to select the optimal model using the largest value.
# The final value used for the model was mtry = 39.
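As a side note, the reported accuracy and the chosen tuning value can also be read back from the caret object itself; a minimal sketch, using the rf_1 object from above:
rf_1$results     # resampling accuracy (and kappa) for each mtry value caret tried
rf_1$bestTune    # the winning tuning value, mtry = 39 here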
I watched the RAM usage, and it stayed a bit under 9 GB. I googled whether there were any improvements on this, and it turned out there are, in newer packages: better memory usage and optimized algorithms. In particular, the "ranger" package is available on CRAN and has a built-in parallel option. I tried it on the whole data set with the same settings as before: 500 trees (the default for both "randomForest" and "ranger"), "mtry=39" (from the caret model above) and 7 cores ("num.threads=7"), and it was a breeze. Computation time was under 1 min 15 sec, and accuracy was again above 94.6% (the out-of-bag prediction error of 3.20% reported below corresponds to roughly 96.8% accuracy). RAM usage was a little above 3 GB, which was pleasant too. The default for "verbose" is TRUE, so if you want to enjoy watching the computations, you can. The formula was written beforehand as "frml". You can see my commands below.
library(ranger)
t=Sys.time()
# Build the formula "label ~ <all other columns>" (label is the first column of df)
frml=as.formula(paste0("label~", paste(names(df)[-1], collapse="+")))
ranger(frml, data=df, mtry=39, num.threads=7, verbose=FALSE, classification=TRUE)
Sys.time()-t
# Time difference of 1.240249 mins
# prediction error: 3.20 %
After this I explored different values of "mtry", since it was now so easy, and increased the number of trees to 5000. My final choice for "mtry" was 32. RAM usage peaked between 9 GB and a bit above 10 GB, and each model took around 10-11 minutes.
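For reference, here is a minimal sketch of such a sweep; the candidate "mtry" values are illustrative rather than my exact grid, and the out-of-bag error is read from the object that ranger returns:
# try a few mtry values and compare out-of-bag prediction error
for (m in c(28, 32, 36, 39)) {
  fit=ranger(frml, data=df, mtry=m, num.trees=500,
             num.threads=7, verbose=FALSE, classification=TRUE)
  cat("mtry =", m, " OOB error =", fit$prediction.error, "\n")
}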
There are two options for prediction with the package. First, you can do everything in one move with the "ranger" function itself. Second, you can set the "write.forest" option in the "ranger" call and then use the "predict" function as usual. My last steps follow, assuming the required test data frame is available under the name "testdf":
t=Sys.time()
rf_mod=ranger(frml, data=df, mtry=32, num.threads=7,
              num.trees=5000, verbose=FALSE,
              write.forest=TRUE, classification=TRUE)
Sys.time()-t
# Time difference of 10.48414 mins
# prediction error: 3.06 %
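Before predicting on the test set, the out-of-bag error and the per-class confusion matrix can be read straight from the fitted object, for example:
rf_mod$prediction.error   # 0.0306, matching the output above
rf_mod$confusion.matrix   # per-class breakdown of the out-of-bag errors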
Computing the final predictions and writing the file for the kaggle.com submission in the required CSV format:
Label=predict(rf_mod, testdf)   # predict() on a ranger forest returns an object, not a plain vector
Label=Label$predictions         # extract the predicted classes
submission=data.frame(id=testdf$id, outcome=Label)
write.csv(submission, "submission.csv", row.names=FALSE)
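A quick sanity check of the submission before uploading never hurts; this uses only base R on the objects built above:
nrow(submission)==nrow(testdf)   # one prediction per test row
table(submission$outcome)        # class counts should look plausible
head(submission)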