Wednesday, November 30, 2016

MNIST set with Neural Networks using H2O

I am continuing my ML work with the MNIST set, currently hosted on Kaggle as the Digit Recognizer competition, which I started in one of my previous posts. This time I decided to try a neural network. I first looked at the “nnet” and “neuralnet” packages, but they could not handle such a big set. Memory and run time were not the only problems: there are also default restrictions, such as a single hidden layer and a limit on the number of nodes. The number of nodes can be increased, but then I ran out of memory.

I googled whether there was anything new for neural networks in R and found two frameworks, MXNET and H2O. I decided to try H2O first because it looked simpler.

Neither package can be installed with the usual R command “install.packages”. For H2O you can find installation instructions on the company web site. You may get a message that a Java installation is required. The MXNET installation instructions are more complicated, and you may need to adapt what you find online. I described my experience with it in my blog post here.

Now let us start predicting! First we load the data set, check its dimensions and prepare the target variable.

setwd("/home/mya/Kaggle/DigitRecognizer")
dt = read.csv("train.csv", stringsAsFactors = FALSE)
dim(dt)
## [1] 42000   785
## For classification the label (first column) must be a factor
dt[,1] = as.factor(dt[,1])
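
The label conversion can be illustrated on a toy data frame instead of the real train.csv. This is only a sketch: the name `toy` and its values are made up for illustration. It shows that the first column becomes a factor with one level per distinct digit while the pixel columns stay numeric, which is what lets H2O treat the task as classification rather than regression.

```r
# Toy stand-in for train.csv (hypothetical values): a label column
# followed by pixel columns.
toy <- data.frame(label  = c(3, 0, 3, 7),
                  pixel0 = c(0, 0, 12, 0),
                  pixel1 = c(255, 8, 0, 64))

toy[, 1] <- as.factor(toy[, 1])   # same conversion as applied to dt above

print(is.factor(toy$label))       # the label is now categorical
print(levels(toy$label))          # one level per distinct digit
print(sapply(toy, class))         # pixel columns stay numeric
```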

To work with H2O we must not only load its library but also initialize it and convert our data to H2O format.

library(h2o)
localH2O = h2o.init(max_mem_size = '16g', # use 16GB of RAM of 32GB available
                    nthreads = 7) # use 7 CPUs
## 
## H2O is not running yet, starting it now...
## 
## Note:  In case of errors look at the following log files:
##     /tmp/Rtmp8Uek1a/h2o_mya_started_from_r.out
##     /tmp/Rtmp8Uek1a/h2o_mya_started_from_r.err
## 
## 
## Starting H2O JVM and connecting: .. Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         1 seconds 643 milliseconds 
##     H2O cluster version:        3.10.0.6 
##     H2O cluster version age:    3 months and 4 days  
##     H2O cluster name:           H2O_started_from_R_mya_oon021 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   14.22 GB 
##     H2O cluster total cores:    8 
##     H2O cluster allowed cores:  7 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     R Version:                  R version 3.3.2 (2016-10-31)
train_h2o = as.h2o(dt)
## Now we are ready to build a model. 
s <- proc.time()
## train model
model =
  h2o.deeplearning(x = 2:785,  # column numbers for predictors
                   y = 1,   # column number for the label
                   training_frame = train_h2o, # data in H2O format
                   activation = "RectifierWithDropout", # rectifier (ReLU) activation with dropout
                   loss = "CrossEntropy", # loss function
                   input_dropout_ratio = 0.1, # fraction of inputs dropped
                   hidden_dropout_ratios = c(0.2,0.2), # fraction of nodes dropped in each hidden layer
                   balance_classes = TRUE, # balance class counts by over/under-sampling
                   hidden = c(300,100), # two hidden layers with 300 and 100 nodes
                   quiet_mode = TRUE, # to reduce printed output
                   nesterov_accelerated_gradient = TRUE, # use Nesterov accelerated gradient
                   epochs = 200) # max. number of passes over the training data
proc.time() - s
##     user   system  elapsed 
##    4.176    0.120  558.636

It took a little over 9 minutes (about 559 seconds of elapsed time).
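
For timing a block of code, base R also offers `system.time()`, which returns the elapsed time directly and avoids having to remember the subtraction order for two `proc.time()` values. A minimal sketch, with a cheap placeholder computation standing in for the model training (the variable names are illustrative only):

```r
# Option 1: wrap the expression in system.time().
timing <- system.time({
  x <- sum(sqrt(seq_len(1e6)))   # placeholder for the expensive call
})
elapsed_sec <- timing[["elapsed"]]
cat(sprintf("elapsed: %.3f s (%.2f min)\n", elapsed_sec, elapsed_sec / 60))

# Option 2: manual timing; subtract the start time from the end time
# so that the result is positive.
start <- proc.time()
y <- sum(sqrt(seq_len(1e6)))
elapsed_manual <- (proc.time() - start)[["elapsed"]]
cat(sprintf("elapsed (manual): %.3f s\n", elapsed_manual))
```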

h2o.confusionMatrix(model)
## Confusion Matrix: vertical: actual; across: predicted
##           0    1   2    3   4    5   6    7   8   9  Error           Rate
## 0      1077    0   0    0   0    1   2    6   0   0 0.0083 =    9 / 1,086
## 1         0 1021   4    1   0    0   0    5   1   0 0.0107 =   11 / 1,032
## 2         0    0 959    0   0    0   2   11   0   0 0.0134 =     13 / 972
## 3         0    0   6  998   0    2   0   14   0   2 0.0235 =   24 / 1,022
## 4         0    0   3    0 986    0   1    5   0   3 0.0120 =     12 / 998
## 5         0    0   0    6   0  997   3    9   1   1 0.0197 =   20 / 1,017
## 6         1    0   3    0   0    1 981   21   0   0 0.0258 =   26 / 1,007
## 7         0    1   3    1   1    0   0  995   0   0 0.0060 =    6 / 1,001
## 8         0    2   2    1   0    4   0    9 952   1 0.0196 =     19 / 971
## 9         0    0   0    2   2    0   0   10   1 976 0.0151 =     15 / 991
## Totals 1078 1024 980 1009 989 1005 989 1085 955 983 0.0154 = 155 / 10,097
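
The error rates in the matrix can be recomputed by hand from its diagonal and row totals. The sketch below copies those numbers from the printed output above. Note that no validation frame was passed to the model, so this matrix appears to describe training data (and only 10,097 rows of it, presumably a scoring sample), which means the resulting accuracy is likely optimistic compared to a score on unseen data.

```r
# Values copied from the confusion matrix printed above.
digits  <- 0:9
correct <- c(1077, 1021, 959, 998, 986, 997, 981, 995, 952, 976)      # diagonal
actual  <- c(1086, 1032, 972, 1022, 998, 1017, 1007, 1001, 971, 991)  # row totals

# Per-class error: share of each actual digit that was misclassified.
per_class_error <- (actual - correct) / actual
# Overall error: all misclassified rows over all scored rows.
overall_error   <- 1 - sum(correct) / sum(actual)

print(round(setNames(per_class_error, digits), 4))
cat(sprintf("overall error: %.4f, accuracy: %.4f\n",
            overall_error, 1 - overall_error))
```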

Because I’m working with a Kaggle set, I’m supposed to submit my predictions for the test set on their site. For this I need to load the test set, convert it to H2O format as well, make predictions and convert the results back to R format. Afterwards I will shut down the H2O instance and write a submission file.

I’m making this markdown file with RStudio, which means that I first need to go back to the directory where all my data are stored.

setwd("/home/mya/Kaggle/DigitRecognizer")
test=read.csv("test.csv", stringsAsFactors=F)
test_h2o = as.h2o(test)
## classify test set
h2o_y_test <- h2o.predict(model, test_h2o)
## convert H2O format into data frame and save as csv
df_y_test = as.data.frame(h2o_y_test)
df_y_test = data.frame(ImageId = seq_len(nrow(df_y_test)),
                   Label = df_y_test$predict)
## shut down virtual H2O cluster
h2o.shutdown(prompt = FALSE)
## [1] TRUE
write.csv(df_y_test, file = "H20_submission.csv", row.names=F)
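
The submission layout (an ImageId column and a Label column) can be checked on toy data before uploading. The sketch below assumes the predicted labels arrive as a factor, as they typically do after `as.data.frame()` on a categorical column. The toy predictions and the temporary file are made up for illustration. It also shows a common pitfall: calling `as.integer()` directly on a factor returns level codes, not the digits.

```r
# Toy predictions as a factor (hypothetical values).
pred <- factor(c("2", "0", "9", "0", "3"), levels = as.character(0:9))

# Convert via character to recover the actual digits.
submission <- data.frame(ImageId = seq_along(pred),
                         Label   = as.integer(as.character(pred)))

# Pitfall: this yields the internal level codes (digit + 1 here), not the digits.
codes <- as.integer(pred)

# Write to a temporary file and read it back to check the layout.
out_file <- tempfile(fileext = ".csv")
write.csv(submission, file = out_file, row.names = FALSE)
check <- read.csv(out_file)
print(check)
```

Writing the factor column directly with `write.csv`, as in the post, also produces the digit labels; the explicit conversion only matters when the labels are needed as numbers.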

My submission scored 0.95900. Not as good as Random Forests.