Sunday, June 25, 2017

Neural Networks as a Corporation Chain of Command

Neural networks are considered complicated and they are always explained using neurons and a brain function. But we do not need to learn how to brain works to understand Neural networks structure and how they operate.

Let us start with logistic regression. Recall that  a logistic regression divides 2 sets by a line (or a hyperplane if we have higher dimensions)

The logistic regression yields values form 0 to 1, and we can consider the process as making an evaluation. In the process we get data and we calculate our evaluation by a formula.

 For example we may have the following assignment: to compute if we have enough goods in storage to last for a week of sales.  This is quite a common problem, and say some clerks report their numbers to their manager to figure it out. The manager collects information, processes it and makes an evaluation.

Note that this  is how a logistic regression functions.




Usually computing if an amount of goods is sufficient is not the only problem. In addition we need to know, for example, if our storage is full to optimal capacity (75% -85% or something like this). Therefore we need to evaluate another statistic. 
And of course these people should report to their supervisor who will make another evaluation:
So we get a whole hierarchy of evaluations and at the end they report to CEO. We can compare it with a neural network structure:
We can observe a lot of in common with a corporation chain of command. As we see middle managers are hidden layers which do the balk of the job.  We have the similar information flow and processing which is analogous to forward propagation and backward propagation. 

What is left now is to explain that  dealing with sigmoid function at each node is too costly so it mostly reserved for CEO level. 

Wednesday, January 11, 2017

Machine Prediction for Human Resources Analytics

A Simple Analysis of Human Resources Data Set and Some Machine Learning Methods Maiia Bakhova
Content
  1. Introduction
  2. Data Load
  3. Creating Numeric Variables from Character Columns
  4. Choosing data for work
  5. Machine Learning
    1. Decision Tree with Result Explanation
    2. Random Forest Method
    3. Linear and Quadratic Discriminant Analysis
    4. Nearest Neighbors
    5. Support Vector Machines
    6. Logistic regression
    7. Conclusion

Introduction

Here I show analysis, feature engineering and decision tree construction with the result explanation for a csv data set.

I work on kaggle data set "Human Resources Analytics", which can be found at url

https://www.kaggle.com/ludobenistant/hr-analytics.

There we see that a question for the data set is the following:

Why are our best and most experienced employees leaving prematurely?

There you can find a data file in csv format and download it. Each record (a row) represents an employee. Fields in the data set include:
  • Employee satisfaction level
  • Last evaluation
  • Number of projects
  • Average monthly hours
  • Time spent at the company
  • Whether they have had a work accident
  • Whether they have had a promotion in the last 5 years
  • Department
  • Salary
  • Whether the employee has left

Data Load

I start with reading data set and looking at its properties: dimensions, column names and types of data. Let us assume that the data file lies in our working directory.

dt=read.csv("HR_comma_sep.csv", stringsAsFactors=F)
options(wide=120)
str(dt)
## 'data.frame': 14999 obs. of  10 variables:
##  $ satisfaction_level   : num  0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
##  $ last_evaluation      : num  0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
##  $ number_project       : int  2 5 7 5 2 2 6 5 5 2 ...
##  $ average_montly_hours : int  157 262 272 223 159 153 247 259 224 142 ...
##  $ time_spend_company   : int  3 6 4 5 3 3 4 5 5 3 ...
##  $ Work_accident        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ left                 : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ promotion_last_5years: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sales                : chr  "sales" "sales" "sales" "sales" ...
##  $ salary               : chr  "low" "medium" "medium" "low" ...
# Checking for missing values:
sapply(dt, function(x) sum(is.na(x)))
##    satisfaction_level       last_evaluation        number_project 
##                     0                     0                     0 
##  average_montly_hours    time_spend_company         Work_accident 
##                     0                     0                     0 
##                  left promotion_last_5years                 sales 
##                     0                     0                     0 
##                salary 
##                     0
mean(dt$left)
## [1] 0.238082539

As we see all data columns (variables) except for the last 2 ones have numeric and integer types. Our target variable is marked as "left" and it is 7th. In addition, here are no missing values, which is convenient. We learnt that a rate of leaving the company 23.8% in accordance to this data set.

We can check pairwise plots, correlations and densities for first 8 columns.

library(psych)
pairs.panels(dt[,1:8], hist.col="lightblue",
             cex.axis=1.5, cex.cor=0.5)
This picture shows that the greatest absolute correlations of the "left" column are "satisfaction_level", "time_spend_company" and "Work_accident".

Creating Numeric Variables from Character Columns

Now let us look at what the values in last 2 columns.

unique(dt$salary)
## [1] "low"    "medium" "high"
unique(dt$sales)
##  [1] "sales"       "accounting"  "hr"          "technical"   "support"    
##  [6] "management"  "IT"          "product_mng" "marketing"   "RandD"
In "salary" column we see 3 levels for salary values and it looks like "sales" column represents 10 departments. Let us see how uniformly data records are distributed between salary levels and departments:
table(dt$salary)
## 
##   high    low medium 
##   1237   7316   6446
table(dt$sales)
## 
##  accounting          hr          IT  management   marketing product_mng 
##         767         739        1227         630         858         902 
##       RandD       sales     support   technical 
##         787        4140        2229        2720
It is not very uniform, but at least here are a few hundred records for every salary type and for each department. I would like to replace "salary" column with three other columns: "high_salary", "low_salary", "medium_salary". Each will have values 0 and 1, denoting "no" and "yes", respectfully. In a package called "dummies" we find a function which produces such columns from a text variable.
library(dummies)
## dummies-1.5.6 provided by Decision Patterns
dumdt=dummy(dt$salary)
dumdt=as.data.frame(dumdt)
names(dumdt)
## [1] "HumanResoursesAnalytics.Rhtmlhigh"  
## [2] "HumanResoursesAnalytics.Rhtmllow"   
## [3] "HumanResoursesAnalytics.Rhtmlmedium"
Remark: I got these names because I was using RStudio to create this text. The names may be different if you work with a command line interface.

The names are not good because they are too long. I will replace them.

names(dumdt)=c("high_salary", "low_salary", "medium_salary")
I will attach my new variables to existing data frame and remove now redundant "salary".
dt=cbind(dt,dumdt)
dt$salary=NULL
str(dt)
## 'data.frame': 14999 obs. of  12 variables:
##  $ satisfaction_level   : num  0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
##  $ last_evaluation      : num  0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
##  $ number_project       : int  2 5 7 5 2 2 6 5 5 2 ...
##  $ average_montly_hours : int  157 262 272 223 159 153 247 259 224 142 ...
##  $ time_spend_company   : int  3 6 4 5 3 3 4 5 5 3 ...
##  $ Work_accident        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ left                 : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ promotion_last_5years: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sales                : chr  "sales" "sales" "sales" "sales" ...
##  $ high_salary          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ low_salary           : int  1 0 0 1 1 1 1 1 1 1 ...
##  $ medium_salary        : int  0 1 1 0 0 0 0 0 0 0 ...
I will go through the similar steps with "sales" column. It has 10 values, so I'll create new columns, rename them and remove the "sales" column.
dumdt=dummy(dt$sales)
dumdt=as.data.frame(dumdt)
names(dumdt)
##  [1] "HumanResoursesAnalytics.Rhtmlaccounting" 
##  [2] "HumanResoursesAnalytics.Rhtmlhr"         
##  [3] "HumanResoursesAnalytics.RhtmlIT"         
##  [4] "HumanResoursesAnalytics.Rhtmlmanagement" 
##  [5] "HumanResoursesAnalytics.Rhtmlmarketing"  
##  [6] "HumanResoursesAnalytics.Rhtmlproduct_mng"
##  [7] "HumanResoursesAnalytics.RhtmlRandD"      
##  [8] "HumanResoursesAnalytics.Rhtmlsales"      
##  [9] "HumanResoursesAnalytics.Rhtmlsupport"    
## [10] "HumanResoursesAnalytics.Rhtmltechnical"
names(dumdt)=c("accounting","hr","IT","management","marketing","product_mng","RandD",
               "sales","support", "technical")
dt=cbind(dt,dumdt)
dt$sales=NULL
dim(dt)
## [1] 14999    21
# Removing the object we do not need anymore
rm(dumdt)
We got 13 new columns. We can check how they are correlated. In this case all our variables are categorical, and plotting them is not helpful. We will check numerical correlations.
options(wide=110)
cor(dt[,c(7,9:14)])
##                        left   high_salary     low_salary  medium_salary
## left           1.0000000000 -0.1209294638  0.13472197414 -0.06883296809
## high_salary   -0.1209294638  1.0000000000 -0.29256037558 -0.26027352121
## low_salary     0.1347219741 -0.2925603756  1.00000000000 -0.84714420898
## medium_salary -0.0688329681 -0.2602735212 -0.84714420898  1.00000000000
## accounting     0.0152011507  0.0118213193 -0.00975882741  0.00328479594
## hr             0.0282487481 -0.0178579572 -0.01568984972  0.02576547182
## IT            -0.0109248273 -0.0160889786  0.00511558982  0.00377497434
##                   accounting            hr             IT
## left           0.01520115067  0.0282487481 -0.01092482732
## high_salary    0.01182131932 -0.0178579572 -0.01608897862
## low_salary    -0.00975882741 -0.0156898497  0.00511558982
## medium_salary  0.00328479594  0.0257654718  0.00377497434
## accounting     1.00000000000 -0.0528478314 -0.06929286031
## hr            -0.05284783141  1.0000000000 -0.06794949459
## IT            -0.06929286031 -0.0679494946  1.00000000000
cor(dt[,c(7,15:21)])
##                        left    management       marketing   product_mng
## left         1.000000000000 -0.0460353907 -0.000859304044 -0.0110291521
## management  -0.046035390706  1.0000000000 -0.051577535319 -0.0529659689
## marketing   -0.000859304044 -0.0515775353  1.000000000000 -0.0623079556
## product_mng -0.011029152078 -0.0529659689 -0.062307955590  1.0000000000
## RandD       -0.046595651167 -0.0492738809 -0.057964667577 -0.0595250386
## sales        0.009923407034 -0.1292892024 -0.152092863576 -0.1561871042
## support      0.010700118013 -0.0874815739 -0.102911324656 -0.1056816303
## technical    0.020076104934 -0.0985507548 -0.115932856081 -0.1190536929
##                     RandD          sales       support     technical
## left        -0.0465956512  0.00992340703  0.0107001180  0.0200761049
## management  -0.0492738809 -0.12928920242 -0.0874815739 -0.0985507548
## marketing   -0.0579646676 -0.15209286358 -0.1029113247 -0.1159328561
## product_mng -0.0595250386 -0.15618710420 -0.1056816303 -0.1190536929
## RandD        1.0000000000 -0.14529980143 -0.0983149024 -0.1107548413
## sales       -0.1452998014  1.00000000000 -0.2579674078 -0.2906084287
## support     -0.0983149024 -0.25796740777  1.0000000000 -0.1966357767
## technical   -0.1107548413 -0.29060842875 -0.1966357767  1.0000000000
As we see our new variables may have lower correlations with our target variable "left" than between themselves. It happens because 1) a person should have some salary, so if it is not high or low it must be medium, 2) a person should belong to one of the departments as well. Hence variables in these groups are correlated.

I will remove a variable in each group which yields the lowest correlation with my target variable: "medium_salary" and "marketing".

dt$medium_salary =NULL
dt$marketing =NULL

Choosing data for work

As we remember we are supposed to answer a specific question: Why are our best and most experienced employees leaving prematurely? Regretfully we are not told if our data set already contains the employees, and we need to check it out. Let us compute what are "last_evaluation" and "time_spend_company" ranges.
range(dt$last_evaluation)
## [1] 0.36 1.00
range(dt$time_spend_company)
## [1]  2 10
As we see evaluations may be rather low. Therefore our data set has other employees which we do not need for our analysis.

I will plot a histogram for "last_evaluation" column" and mark the median for its values as a blue vertical line.

hist(dt$last_evaluation, col="salmon",
     xlab="Last Evaluation Value",
 main="Histogram of Last Evaluation Values")
abline(v=median(dt$last_evaluation),
    col=4,lwd=3)
median(dt$last_evaluation)
## [1] 0.72
Doing going through similar steps for time spent with company:
hist(dt$time_spend_company, col="khaki",
     xlab="Last Evaluation Value",
 main="Histogram of Time Spent with Company")
abline(v=median(dt$time_spend_company),
    col=4,lwd=3)
median(dt$time_spend_company)
## [1] 3
We can consider as best and most experienced employees people who have evaluation above 0.72 and spent more than 3 years with the company as the "best and most experienced employees". We can compute what is a rate of leaving for such employees.
dt=dt[dt$last_evaluation>0.72 & dt$time_spend_company>3,]
mean(dt$left)
## [1] 0.523085747
The rate of leaving was 23.8%, and now it is more than doubled. It is a cause for concern.

Let us now repeat the same pairwise plots with correlations we did at first.

library(psych)
pairs.panels(dt[,1:8], hist.col="lightblue",
             cex.axis=1.5, cex.cor=0.5)
plot of chunk unnamed-chunk-15
As we see the data behavior is a bit different now. Now columns "average_montly_hours", "number_project" and "satisfaction_level" yield highest correlations with "left" variable.

My data frame is ready for prediction.

Machine Learning

Because we need to explain why employees are leaving then a simple decision tree algorithm is the best.

Decision Tree with Result Explanation

The option called "maxdepth" defines a maximal depth of a tree.
library(rpart)
library(rpart.plot )
cart_mod=rpart(left~., data=dt, maxdepth=5)
rpart.plot(cart_mod)
So, what do we see on our tree graph? That majority of valuable employees work overtime: 70%. People with normal time load tend to stay, with rate of 94%. Among those who work too many hours several can stay if they have no more than 3.5 projects. There are a few enthusiasts who do not leave with too much work and their satisfaction is above 0.71. But most of overworked people are leaving, and they are the 54% of all valuable workers. Even high salary does not help. As we see the longer overworked people stay at the company, the more they are inclined to leave.

I want to see how reliable is my result and I will split the data set into 2 sets: for training and for testing. We will compute a decision tree model for a train set and check what accuracy we get predicting with the model on test set. To make sure that our accuracy number is not accidental it makes sense to repeat it a few times. In practice it is done at least 10 times (and called cross-validation), but I will limit it to 4. I will set up random seeds to make sure I have no repetitions.

par(mfrow=c(2,2))
seed=c(89,132,765,4)
accuracy=numeric(length=4)
for (i in 1:4){
    set.seed(seed[i])
    indeces=sample(1:dim(dt)[1], round(.7*dim(dt)[1]))
    train=dt[indeces,]
    test=dt[-indeces,]
    cart_mod=rpart(left~., data=train, maxdepth=5)
    rpart.plot(cart_mod, digits=3)
# In this case our predictions are returned as probabilities, 
#  and we need 0 and 1.
    predictionsAs0and1=sapply(predict(cart_mod,test[,-7]),
                              function(nu) ifelse(nu>.5, 1, 0))
          accuracy[i]=
sum(test$left==predictionsAs0and1)/length(test$left)
}
accuracy
## [1] 0.959847036 0.958891013 0.956022945 0.958891013
As you see our accuracy here fluctuates somewhere between 95.5% and 96%. The decision trees change slightly as well, but not much.

If we are not limiting our predictions with methods which are to explain to a layman then we can get better accuracy.

Random Forest Method

For random forest method we construct many trees and default option is 500 of them. For every tree and at each tree level we pick up only a subset of variables. We need to choose a size for such subset and in the package below it is called "mtry". As before to check method reliability I will examine its work on different subsets of the data. In addition I want to see what variables are important for the method, so I create a data frame for "mtry" number, corresponding accuracy and for 3 most important variables.
library(randomForest)
# For classification I'm to declare the target variable as factor.
dt$left=as.factor(dt$left)
resultTable=data.frame(N=1:4, accuracy=0,
                        importance1=character(length=4L),
                        importance2=character(length=4L),
                        importance3=character(length=4L), stringsAsFactors=FALSE)
for (i in 1:4) {
    set.seed(seed[i])
    indeces=sample(1:dim(dt)[1], round(.7*dim(dt)[1]))
    train=dt[indeces,]
    test=dt[-indeces,]
   rf_model=randomForest(left~., data=train, mtry=6,importance=TRUE)
   resultTable[i,"accuracy"]=
   sum(test$left==predict(rf_model,test[,-7]))/length(test$left)
   resultTable[i,3:5]=
   rownames(rf_model$importance)[1:3]
}
resultTable
##   N    accuracy        importance1     importance2    importance3
## 1 1 0.985659656 satisfaction_level last_evaluation number_project
## 2 2 0.984703633 satisfaction_level last_evaluation number_project
## 3 3 0.991395793 satisfaction_level last_evaluation number_project
## 4 4 0.990439771 satisfaction_level last_evaluation number_project
We can get accuracy 98.5%-99% choosing "mtry=6". The most significant variable is "satisfaction_level", then "last_evaluation" and "number_project". As we saw previously the fist variable is responsible for majority of leaving workers.

Linear and Quadratic Discriminant Analysis

These methods require some specific properties of data. Our variables are supposed to have a normal distribution, and by the look of their histograms they do not. Thus the methods is not likely not perform well and we see as a result.

Linear discriminant analysis

library(MASS)
lda_model=lda(left~., data=train)
pred=predict(lda_model, test)$class
sum(test$left==pred)/length(test$left)
## [1] 0.86998088
Quadratic discriminant analysis in addition suggests that variable correlations are consistent, which clearly does not help.
qda_model=qda(left~., data=train)
pred=predict(qda_model, test)$class
sum(test$left==pred)/length(test$left)
## [1] 0.863288719

Nearest Neighbors

With nearest neighbors we check for each test value if it is close to some values in train set if we consider each record as a point in multidimensional space. For this we need values for each variable to be approximately of the same scale. We can look at our plot and see that it is not the case. Majority of our data values are 0 and 1, but some are not. Let us see it in more detail:
 sapply(dt[,1:6], range)
##      satisfaction_level last_evaluation number_project
## [1,]               0.09            0.73              2
## [2,]               1.00            1.00              7
##      average_montly_hours time_spend_company Work_accident
## [1,]                   96                  4             0
## [2,]                  310                 10             1
I will shift and scale variables which have large numbers.
dt$number_project=(dt$number_project-2)/5
dt$average_montly_hours=(dt$average_montly_hours-90)/200
dt$time_spend_company=dt$time_spend_company/10
Number of neighbors in the method is "k".
library(class)
knn_accuracy=numeric(length=19)
for (i in 1:19){
   pred=knn(train[, -7], test[, -7], train$left, k=i)
   knn_accuracy[i]=sum(test$left==pred)/length(test$left)
}
knn_accuracy
##  [1] 0.913001912 0.885277247 0.875717017 0.874760994 0.875717017
##  [6] 0.874760994 0.871892925 0.871892925 0.869980880 0.872848948
## [11] 0.871892925 0.869024857 0.870936902 0.864244742 0.861376673
## [16] 0.859464627 0.858508604 0.859464627 0.852772467
The method does not yield good accuracy for any number of neighbors even in comparison with the decision tree method and it does not make sense to investigate how reliable is the resulting number.

Support Vector Machines

We can try another method which is based on measuring distances: support vector machines. For this method we are looking for linear borders between geometrical clusters of our data.
library(e1071)
svm_model=svm(left~., data=dt, kernel="linear", type="C")
sum(test$left==predict(svm_model, test[,-7]))/length(test$left)
## [1] 0.526768642
It is the worst. It means that our data do not form two clusters which we can mark as "left" and "stayed". You can try other kernels which represent different kind of borders and verify that it is no help. I am not going to check out the resulting accuracy as well.

Logistic Regression

Now let us use logistic regression. It is not likely to produce a good result, because for it to work properly we need a number of assumptions. Still we can check it out.
seed=c(89,132,765,4)
accuracy=numeric(length=4)
for (i in 1:4){
    set.seed(seed[i])
    indeces=sample(1:dim(dt)[1], round(.7*dim(dt)[1]))
    train=dt[indeces,]
    test=dt[-indeces,]
    log_reg_model=glm(left~., data=train, family="binomial")
## Values for logistic regression predictions are  
##  not limited to 0 and 1.
    predictionsAs0and1=sapply(predict(log_reg_model,test[,-7]), function(nu) ifelse(nu>0, 1, 0))
    accuracy[i]=sum(test$left==predictionsAs0and1)/length(test$left)
}
accuracy
## [1] 0.860420650 0.869980880 0.855640535 0.875717017
It is not one of the best as well. So our variables are not very good fit for linear additive model. Although sometimes 85% is all you can get from data, but here we've seen better accuracy.

Conclusion

Random Forest method yields the best accuracy, although it does not explain much. The Decision Tree is much more useful for staff policy recommendations. Other method mostly tell us what our data are not: we can not divide records into 2 well defined clusters which we can name "left" and "stayed", variables are not mutlinormally distributed, and linear additive model does not work well predicting desired outcome.

Remark: I would like to note that I used simple variations of the methods and did not explore all options.

As you see I have a lot of repetitions in my code. There is a way to avoid it: look at "caret" package!

Wednesday, November 30, 2016

MNIST set with Neural Networks using H2O

I continue doing my ML work with MNIST set, currently presented at kaggle as Digit Recognizer competition, which I’ve started in one of my previous posts. This time I’ve decided to try neural network method. At first I had taken a look at “nnet” and “neuralnet” packages, but they could not handle such big set. Not only memory and timing had been a problem, but there are default restrictions, like only one hidden layer and a number of nodes. The number of nodes may be increased, but I got memory overload.

I googled if there is anything new for NN with R. Found two frameworks, MXNET and H2O. Decided to try H2O first, because it looked simpler.

Both packages cannot be installed using usual R command “install.packages”. For H2O you can find installation instructions on the company web site. You may get a message that JAVA installation is required. MXNET installation instructions are more complicated and you may need to adapt what you google. I posted my story with it in my blog post here.

Now let us start predicting! At first we load the data set, check dimenstions and prepare target variable.

setwd("/home/mya/Kaggle/DigitRecognizer")
dt=read.csv("train.csv", stringsAsFactors=F)
## For classification you need your output as  a factor
dim(dt)
## [1] 42000   785
dt[,1] = as.factor(dt[,1]) # for classification

To work with H2O we should not only load its library, but to initialize it as well and convert our data to H2O format.

library(h2o)
localH2O = h2o.init(max_mem_size = '16g', # use 16GB of RAM of 32GB available
                    nthreads = 7) # use 7 CPUs
## 
## H2O is not running yet, starting it now...
## 
## Note:  In case of errors look at the following log files:
##     /tmp/Rtmp8Uek1a/h2o_mya_started_from_r.out
##     /tmp/Rtmp8Uek1a/h2o_mya_started_from_r.err
## 
## 
## Starting H2O JVM and connecting: .. Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         1 seconds 643 milliseconds 
##     H2O cluster version:        3.10.0.6 
##     H2O cluster version age:    3 months and 4 days  
##     H2O cluster name:           H2O_started_from_R_mya_oon021 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   14.22 GB 
##     H2O cluster total cores:    8 
##     H2O cluster allowed cores:  7 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     R Version:                  R version 3.3.2 (2016-10-31)
train_h2o = as.h2o(dt)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
## Now we are ready to build a model. 
s <- proc.time()
## train model
model =
  h2o.deeplearning(x = 2:785,  # column numbers for predictors
                   y = 1,   # a column number for label
                   training_frame = train_h2o, # data in H2O format
                   activation = "RectifierWithDropout", # activation function
                   loss = "CrossEntropy", #loss function
                   input_dropout_ratio = 0.1, # % of inputs dropout
                   hidden_dropout_ratios = c(0.2,0.2), # % for nodes dropout
                   balance_classes = TRUE, # for classificaton
                   hidden = c(300,100), # two layers 300 x 100 nodes
                   quiet_mode=T, # to reduce printed output
                   nesterov_accelerated_gradient = T, # use it for speed
                   epochs = 200) # no. of forward and backward propagations
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=                                                                |   1%
  |                                                                       
  |=                                                                |   2%
  |                                                                       
  |==                                                               |   3%
  |                                                                       
  |==                                                               |   4%
  |                                                                       
  |===                                                              |   4%
  |                                                                       
  |===                                                              |   5%
  |                                                                       
  |====                                                             |   6%
  |                                                                       
  |====                                                             |   7%
  |                                                                       
  |=====                                                            |   7%
  |                                                                       
  |=====                                                            |   8%
  |                                                                       
  |======                                                           |   9%
  |                                                                       
  |======                                                           |  10%
  |                                                                       
  |=======                                                          |  10%
  |                                                                       
  |=======                                                          |  11%
  |                                                                       
  |========                                                         |  12%
  |                                                                       
  |========                                                         |  13%
  |                                                                       
  |=========                                                        |  13%
  |                                                                       
  |=========                                                        |  14%
  |                                                                       
  |==========                                                       |  15%
  |                                                                       
  |==========                                                       |  16%
  |                                                                       
  |===========                                                      |  17%
  |                                                                       
  |=================================================================| 100%
s - proc.time()
##     user   system  elapsed 
##   -4.176   -0.120 -558.636

It took about 10 minutes.

h2o.confusionMatrix(model)
## Confusion Matrix: vertical: actual; across: predicted
##           0    1   2    3   4    5   6    7   8   9  Error           Rate
## 0      1077    0   0    0   0    1   2    6   0   0 0.0083 =    9 / 1,086
## 1         0 1021   4    1   0    0   0    5   1   0 0.0107 =   11 / 1,032
## 2         0    0 959    0   0    0   2   11   0   0 0.0134 =     13 / 972
## 3         0    0   6  998   0    2   0   14   0   2 0.0235 =   24 / 1,022
## 4         0    0   3    0 986    0   1    5   0   3 0.0120 =     12 / 998
## 5         0    0   0    6   0  997   3    9   1   1 0.0197 =   20 / 1,017
## 6         1    0   3    0   0    1 981   21   0   0 0.0258 =   26 / 1,007
## 7         0    1   3    1   1    0   0  995   0   0 0.0060 =    6 / 1,001
## 8         0    2   2    1   0    4   0    9 952   1 0.0196 =     19 / 971
## 9         0    0   0    2   2    0   0   10   1 976 0.0151 =     15 / 991
## Totals 1078 1024 980 1009 989 1005 989 1085 955 983 0.0154 = 155 / 10,097

Because I’m working with a kaggle set I’m supposed to submit my prediction for test set on their site. For this I need to load it, convert it to H2O format as well, make predictions and convert results back to R format. Aftewards I will shut down H2O instance and write a submission file.

I’m making this markdown file with RStudio, and it means that at first I need to go back to the directory where all my data are stored.

setwd("/home/mya/Kaggle/DigitRecognizer")
test=read.csv("test.csv", stringsAsFactors=F)
test_h2o = as.h2o(test)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
## classify test set
h2o_y_test <- h2o.predict(model, test_h2o)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |========================================                         |  62%
  |                                                                       
  |=================================================================| 100%
## convert H2O format into data frame and  save as csv
df_y_test = as.data.frame(h2o_y_test)
df_y_test = data.frame(ImageId = seq(1,length(df_y_test$predict)), 
                   Label = df_y_test$predict)
## shut down virutal H2O cluster
h2o.shutdown(prompt = F)
## [1] TRUE
write.csv(df_y_test, file = "H20_submission.csv", row.names=F)

My submission scored 0.95900. Not as good as Random Forests.