Mathematician in Data Science: 2017

Neural Networks are considered as complicated and they are always explained using neurons and a brain function. But we do not need to learn how brain works to understand Neural Networks structure and how they operate.

Let us start with a logistic regression. Recall that for a logistic regression model we represent our data as point coordinates. The model allows us to split all the points into 2 classes. Visually the model divides 2 classes of data records by a line (or a plane, or a hyperplane if we have higher dimensions) .

Numerically a logistic regression yields values from 0 to 1 for each data point. For every record we compute a probability for it to belong to one class or another and we can consider the process as making an evaluation. A value close to 1 means one class, while close to 0 means another. In the process we load data and calculate our evaluation by a formula.

For example we may have the following assignment: to compute if we have enough goods in warehouses to last for a week of sales. This is quite a common problem, and say inventory clerks report amounts of each item to a manager. The manager collects information, processes it and makes an evaluation: do we have enough of every product, "Yes" or "No".

Here we can use a logistic regression model which produces probabilities for 2 classes: "Yes" and "No". If you feel that our manager should supply a bit more information with it, for example, how much should be restocked if any, then in Neural Networks it could passed, too.

Usually one evaluation is not the only required indicator for a chain of stores to operate. In addition we need to know, for example, if our warehouses are full to optimal capacity (75% - 85% or something like this). Therefore we need to estimate this as well. In our organization we will have another manager gathering data and assessing if we have enough reserve storage space.

Then let us say that these people should report to their supervisor who will make next evaluation about store preparedness level:

So we get a whole hierarchy of managers, submitting their assessments and at the end the company CEO obtains a summary report. The CEO looks at the whole picture, issues adjusted guidelines for further work and sends them back to middle management. We can compare it with a neural network structure:

Now we can observe a lot of in common with a corporation chain of command. As we see middle managers are corporation hidden layers which do the balk of the job, as it happens in real life. They got reports from their subordinates and guidelines form their bosses, and they optimize their work based on the provided information. In neural networks we have the similar information flow and processing which is analogous to forward propagation (from lowest workers to CEO) and backward propagation (instructions from CEO to middle management, adapted for each lower manager by her/his boss).

I discovered that a lot of people express a desire for the NN results to be more explainable by its inner functioning. I believe that my analogy is helpful for this, too. As we know, a company success is defined by its top management. They select a company structure and define an inner company policy, which middle management adapts and directs further by usual means. We know that an industry leader could originate due to a chance, a lucky initial idea. But its following performance is a learned behavior. People use experience of other company managers combining it with their own trial & error process.

Introduction

Here I work on kaggle data set "Human Resources Analytics", which can be found by url
https://www.kaggle.com/ludobenistant/hr-analytics.
We're supposed to predict which valuable employees will leave next. Fields in the data set include:

Employee satisfaction level
Last evaluation
Number of projects
Average monthly hours
Time spent at the company
Whether they have had a work accident
Whether they have had a promotion in the last 5 years
Department
Salary
Whether the employee has left

Data Load

I start with loading data set and looking at its properties: dimensions, column names and types of data.

dt=read.csv("HR_comma_sep.csv", stringsAsFactors=F)
str(dt)

## 'data.frame': 14999 obs. of  10 variables:
##  $ satisfaction_level   : num  0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
##  $ last_evaluation      : num  0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
##  $ number_project       : int  2 5 7 5 2 2 6 5 5 2 ...
##  $ average_montly_hours : int  157 262 272 223 159 153 247 259 224 142 ...
##  $ time_spend_company   : int  3 6 4 5 3 3 4 5 5 3 ...
##  $ Work_accident        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ left                 : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ promotion_last_5years: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sales                : chr  "sales" "sales" "sales" "sales" ...
##  $ salary               : chr  "low" "medium" "medium" "low" ...

sapply(dt, function(x) sum(is.na(x)))

##    satisfaction_level       last_evaluation        number_project 
##                     0                     0                     0 
##  average_montly_hours    time_spend_company         Work_accident 
##                     0                     0                     0 
##                  left promotion_last_5years                 sales 
##                     0                     0                     0 
##                salary 
##                     0

As we see all data except for the last 2 columns are numeric. In addition, here are no missing values.
We can check pairwise plots, correlations and densities.

library(ggplot2)
library(GGally)
ggpairs(data=dt, columns=1:8,
    mapping = aes(color = "navy"),
    axisLabels="show")

Now let us look at what the values in last 2 columns.

unique(dt$salary)

## [1] "low"    "medium" "high"

unique(dt$sales)

##  [1] "sales"       "accounting"  "hr"          "technical"   "support"    
##  [6] "management"  "IT"          "product_mng" "marketing"   "RandD"

Here are 3 classes for salary values and it looks like the column "sales" represents departments. Let us see how uniform is the distribution of the data:

table(dt$sales)

## 
##  accounting          hr          IT  management   marketing product_mng 
##         767         739        1227         630         858         902 
##       RandD       sales     support   technical 
##         787        4140        2229        2720

table(dt$salary)

## 
##   high    low medium 
##   1237   7316   6446

It is not very uniform, but at least each class is not too small. I would like to replace the columns with dummy variable columns. As first salary values:

library(dummies)
dumdt=dummy(dt$salary)
dumdt=as.data.frame(dumdt)
names(dumdt)

## [1] "MachinePredictionForNorthropGrumman.Rhtmlhigh"  
## [2] "MachinePredictionForNorthropGrumman.Rhtmllow"   
## [3] "MachinePredictionForNorthropGrumman.Rhtmlmedium"

Firstly, these names are not good because they are too long. Secondly, I do not need all of them, because they are correlated: having a low salary means not having high or medium. I will remove the low salary column and I will rename the rest of them.

dumdt=dumdt[,-2]
names(dumdt)=c("high_salary", "medium_salary")

Now I will attach my new variables to existing data frame.

dt=cbind(dt,dumdt)
dt$salary=NULL
str(dt)

## 'data.frame': 14999 obs. of  11 variables:
##  $ satisfaction_level   : num  0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
##  $ last_evaluation      : num  0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
##  $ number_project       : int  2 5 7 5 2 2 6 5 5 2 ...
##  $ average_montly_hours : int  157 262 272 223 159 153 247 259 224 142 ...
##  $ time_spend_company   : int  3 6 4 5 3 3 4 5 5 3 ...
##  $ Work_accident        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ left                 : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ promotion_last_5years: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sales                : chr  "sales" "sales" "sales" "sales" ...
##  $ high_salary          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ medium_salary        : int  0 1 1 0 0 0 0 0 0 0 ...

dumdt=dummy(dt$sales)
dumdt=as.data.frame(dumdt)
names(dumdt)

##  [1] "MachinePredictionForNorthropGrumman.Rhtmlaccounting" 
##  [2] "MachinePredictionForNorthropGrumman.Rhtmlhr"         
##  [3] "MachinePredictionForNorthropGrumman.RhtmlIT"         
##  [4] "MachinePredictionForNorthropGrumman.Rhtmlmanagement" 
##  [5] "MachinePredictionForNorthropGrumman.Rhtmlmarketing"  
##  [6] "MachinePredictionForNorthropGrumman.Rhtmlproduct_mng"
##  [7] "MachinePredictionForNorthropGrumman.RhtmlRandD"      
##  [8] "MachinePredictionForNorthropGrumman.Rhtmlsales"      
##  [9] "MachinePredictionForNorthropGrumman.Rhtmlsupport"    
## [10] "MachinePredictionForNorthropGrumman.Rhtmltechnical"

Clearly I need to go through the similar steps.

dumdt=dumdt[,-10]
names(dumdt)=c("accounting","hr","IT","management","marketing","product_mng","RandD","sales","support")
dt=cbind(dt,dumdt)
dt$sales=NULL
# Look at the new data frame:
str(dt)

## 'data.frame': 14999 obs. of  19 variables:
##  $ satisfaction_level   : num  0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
##  $ last_evaluation      : num  0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
##  $ number_project       : int  2 5 7 5 2 2 6 5 5 2 ...
##  $ average_montly_hours : int  157 262 272 223 159 153 247 259 224 142 ...
##  $ time_spend_company   : int  3 6 4 5 3 3 4 5 5 3 ...
##  $ Work_accident        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ left                 : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ promotion_last_5years: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ high_salary          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ medium_salary        : int  0 1 1 0 0 0 0 0 0 0 ...
##  $ accounting           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr                   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ IT                   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ management           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ marketing            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ product_mng          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ RandD                : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sales                : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ support              : int  0 0 0 0 0 0 0 0 0 0 ...

rm(dumdt)

Machine Learning

Now my data is ready for prediction. I split the data set into 2 sets: for training and for testing.

indeces=sample(1:dim(dt)[1], round(.7*dim(dt)[1]))
train=dt[indeces,]
test=dt[-indeces,]

str(train)

## 'data.frame': 10499 obs. of  19 variables:
##  $ satisfaction_level   : num  0.71 0.73 0.74 0.41 0.1 0.88 0.71 0.76 0.85 0.93 ...
##  $ last_evaluation      : num  0.5 0.6 0.55 0.49 0.94 1 0.92 0.8 0.65 0.65 ...
##  $ number_project       : int  4 3 6 2 6 5 3 4 4 4 ...
##  $ average_montly_hours : int  253 137 130 130 255 219 202 226 280 212 ...
##  $ time_spend_company   : int  3 3 2 3 4 6 4 5 3 4 ...
##  $ Work_accident        : int  0 0 0 0 0 1 0 0 1 0 ...
##  $ left                 : int  0 0 0 1 1 1 0 0 0 0 ...
##  $ promotion_last_5years: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ high_salary          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ medium_salary        : int  1 1 1 0 0 0 0 0 1 1 ...
##  $ accounting           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr                   : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ IT                   : int  0 0 0 0 0 0 0 0 1 1 ...
##  $ management           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ marketing            : int  0 0 0 1 0 0 0 0 0 0 ...
##  $ product_mng          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ RandD                : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ sales                : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ support              : int  0 0 1 0 0 0 0 1 0 0 ...

And now we can try a few predicting algorithms.

Logistic Regression

glm_mod=glm(left~., data=train, family=binomial)
options(width=120)
 summary(glm_mod)

## 
## Call:
## glm(formula = left ~ ., family = binomial, data = train)
## 
## Deviance Residuals: 
##       Min         1Q     Median         3Q        Max  
## -2.252360  -0.669234  -0.404160  -0.118092   3.005301  
## 
## Coefficients:
##                           Estimate   Std. Error   z value   Pr(>|z|)    
## (Intercept)            0.513671571  0.152904509   3.35943 0.00078104 ***
## satisfaction_level    -4.102569262  0.116375474 -35.25287 < 2.22e-16 ***
## last_evaluation        0.636893223  0.177888069   3.58030 0.00034320 ***
## number_project        -0.315622781  0.025421761 -12.41546 < 2.22e-16 ***
## average_montly_hours   0.004809019  0.000613495   7.83872 4.5515e-15 ***
## time_spend_company     0.270089407  0.018619140  14.50601 < 2.22e-16 ***
## Work_accident         -1.489204860  0.104868336 -14.20071 < 2.22e-16 ***
## promotion_last_5years -1.268547473  0.277706609  -4.56794 4.9254e-06 ***
## high_salary           -1.903687581  0.153808902 -12.37697 < 2.22e-16 ***
## medium_salary         -0.485601493  0.054435288  -8.92071 < 2.22e-16 ***
## accounting            -0.216824268  0.130317943  -1.66381 0.09615045 .  
## hr                     0.177414039  0.129105528   1.37418 0.16938628    
## IT                    -0.271998822  0.110571123  -2.45994 0.01389585 *  
## management            -0.523601663  0.162647148  -3.21925 0.00128527 ** 
## marketing             -0.098831483  0.127918381  -0.77261 0.43975108    
## product_mng           -0.272539608  0.124254440  -2.19340 0.02827862 *  
## RandD                 -0.599234520  0.144357350  -4.15105 3.3095e-05 ***
## sales                 -0.129873866  0.078066780  -1.66363 0.09618735 .  
## support               -0.048540168  0.089959935  -0.53958 0.58948988    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 11544.393  on 10498  degrees of freedom
## Residual deviance:  9035.048  on 10480  degrees of freedom
## AIC: 9073.048
## 
## Number of Fisher Scoring iterations: 5

In this model being in a specific department is not a reliable predictor except for being in management or R and D. Let us see what is suitable threshold for decision making.

fittedValues=fitted(glm_mod)
# How well is predicted people who stayed
hist(fittedValues[train$left==0])

# And the ones who left.
hist(fittedValues[train$left==1])

Even on train data we do not get good predictions! Nevertheless we can check it on our test data.

glm_predictions=predict(glm_mod,test[,-match("left", names(test))])
# How well is predicted people who stayed, colored red
hist(glm_predictions[test$left==0], breaks =15, col=2 )
# And the ones who stayed, colored blue
hist(glm_predictions[test$left==1],breaks =15,  col=rgb(0,0.7,0.8,1/2),add=T)

Taking into account that to define whose who left is more important it looks like a threshold may be about -1.

Mathematician in Data Science

Sunday, June 25, 2017

Neural Networks as a Corporation Chain of Command

Wednesday, January 11, 2017

Machine Prediction for Human Resources Analytics

Introduction

Data Load

Machine Learning

Logistic Regression