Thursday, April 28, 2016

Improving performance of random forests for a particular value of outcome by selecting better features

Summary

Choosing features to improve the performance of a particular algorithm is a difficult problem. The standard tool is PCA, which is hard to understand (although it can be used out of the box), is not easy to interpret, and requires centering and scaling of features. In addition, it does not help improve prediction performance for a particular outcome value (say, when that value's accuracy is lower than the others' or when it is especially important). My method makes it possible to use features without preprocessing, so the resulting prediction is easy to explain. It can also be used to improve prediction accuracy for a specified outcome value. It is based on a comparison of feature densities and has a good visual interpretation, which does not require thorough knowledge of linear algebra or calculus.

Application Example

Here is a worked-out example of choosing features for the random forests algorithm, with R code. It is supplemented with choosing additional features to improve prediction for a particular value of the outcome. The method for comparing densities is described in detail in my post Computing Ratio of Areas.
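
Before the worked example, here is a minimal, self-contained sketch of the criterion on simulated data (the variables x and y below are illustrative and are not part of the data used later): two densities are evaluated on a common grid, and the ratio of the area between the curves to the area under one of them is compared with the threshold used later in the post.

set.seed(1)
x <- rnorm(500)             # feature values for one outcome value
y <- rnorm(500, mean = 2)   # feature values for another outcome value
rng <- range(c(x, y))
dx <- density(x, kernel = "gaussian", n = 1000, from = rng[1], to = rng[2])
dy <- density(y, kernel = "gaussian", n = 1000, from = rng[1], to = rng[2])
# On a common uniform grid the grid step cancels in the ratio of areas:
ratio <- sum(abs(dx$y - dy$y)) / sum(dx$y)
ratio > 0.75   # TRUE here, so this feature would be picked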

For my computations I will use data for Human Activity Recognition. A short description of the problem follows:

Human Activity Recognition (HAR) is a supervised classification problem whose training data are obtained via an experiment in which human subjects perform different activities. It focuses on data collected from multiple accelerometers and possibly other sensors placed on a human body. There are many potential applications for HAR, such as elderly monitoring, life-log systems for monitoring energy expenditure and supporting weight-loss programs, and digital assistants for weight-lifting exercises.

Data

The training data for this project are available here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

The test data are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har.

In this project we use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly (marked as “A”) and incorrectly in 4 different ways:

A - exactly according to the specification

B - throwing the elbows to the front

C - lifting the dumbbell only halfway

D - lowering the dumbbell only halfway

E - throwing the hips to the front

The outcome column with these letters is called “classe”. Read more, with pictures, at http://groupware.les.inf.puc-rio.br/har#ixzz3jfosRia6

The goal of the project is to predict the manner in which participants did the exercise. In particular, we need to figure out which features to use to reach that goal.

I will load only the “training” file, because the “testing” file contains no marks for exercise correctness.

training=read.csv("pml-training.csv", stringsAsFactors=F)

Now we can look at the “training” file more closely. It is easy to establish that the first columns contain names of subjects, days and times of recording, and other identifiers which do not represent physical movements. For example, the very first column, “X”, simply enumerates the rows, and the last column, “classe”, contains the letters which mark the performance quality. We can look at the dimensions of the data table, the first 10 column names, the first 20 values of the first column, and the values of the last column, “classe”.

dim(training)
## [1] 19622   160
names(training)[1:10]
##  [1] "X"                    "user_name"            "raw_timestamp_part_1"
##  [4] "raw_timestamp_part_2" "cvtd_timestamp"       "new_window"          
##  [7] "num_window"           "roll_belt"            "pitch_belt"          
## [10] "yaw_belt"
training$X[1:20]
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
unique(training$classe)
## [1] "A" "B" "C" "D" "E"

We see users’ names and so on. These columns should be removed before building a prediction model.

Cleaning data

The next step is less obvious: by checking the “training” data I discovered a lot of features which have mostly undefined values. Some of them are read as logical with nonexistent values and some as characters. We can check the number of columns with undefined values and the number of numeric features, and see that fewer than 40% of the columns are useful.

sum(colSums(is.na(training)) !=0)
## [1] 67
sum(sapply(training, is.numeric))
## [1] 123

For my work I will remove the non-numeric features from the training set, together with the first 7 columns. In addition, some of the sparse columns consist mostly of undefined values (NA), and we should get rid of them as well. The new data set with numeric features will be called “trai”. Note that the outcome column “classe” is removed by the first command as one of the character columns, so I need to put it back.

prVec=sapply(training, is.numeric)          # keep only numeric columns
prVec[1:7]=F                                # drop the first 7 identifier columns
trai=training[ , prVec]
trai <- trai[, colSums(is.na(trai)) == 0]   # drop columns containing NA values
trai$classe=as.factor(training$classe)      # put the outcome back as a factor
dim(trai)
## [1] 19622    53

We are left with 53 columns; the last one is our outcome. Now let us split the data for cross-validation. I will load the required packages, set a seed for random sampling, and then put 70% of the data frame into a training set and the rest into a validation set.

library(caret); library(e1071)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(11)
inTrain = createDataPartition(y=trai$classe, p=0.7, list=F)
dataFrame=trai[inTrain,]
valdTrai=trai[-inTrain, ]

Choosing features

We need to find ways to distinguish the types of performance, labeled by the letters, from each other.

There are 52 features in our data frame. That is still a lot for my laptop: an unparallelized run of the “random forests” algorithm took 2 hours. Granted, my laptop is not a powerful one, but what if our prediction model is supposed to work across different gadgets? It makes sense to make it less dependent on resources and to reduce the number of features. Let us visualize the data and figure out which features differ the most between outcome values.

library(ggplot2)
library(Rmisc)
## Loading required package: plyr
p1=qplot(dataFrame[, 1], color=classe, data=dataFrame, geom="density", xlab="First feature")
p2=qplot(dataFrame[, 2], color=classe, data=dataFrame, geom="density", xlab="Second feature")
p3=qplot(dataFrame[, 3], color=classe, data=dataFrame, geom="density", xlab="Third feature")
p4=qplot(dataFrame[, 4], color=classe, data=dataFrame, geom="density", xlab="Fourth feature")
p5=qplot(dataFrame[, 5], color=classe, data=dataFrame, geom="density", xlab="Fifth feature")
p6=qplot(dataFrame[, 6], color=classe, data=dataFrame, geom="density", xlab="Sixth feature")
multiplot(p1,p2,p3,p4,p5,p6, cols=2)

## removing objects I do not need anymore
rm("p1","p2","p3","p4","p5","p6")

We can see that the data behave in complicated ways. Some features are bimodal or even multimodal. These characteristics could be caused by participants’ different sizes or training levels, but we do not have enough information to check that. We can pick out the features which diverge the most on the density graphs by considering differences of densities. The “density” function is the same one R uses for graphing. My rule for including a feature in the final prediction data frame is the following: \[ \frac{\textrm{area between two density curves}}{\textrm{area under one of the curves}} > 0.75 \] When this holds for at least one pair of outcome values, the feature is picked for prediction. The method itself is described in detail in my post: Computing Ratio of Areas. The threshold 0.75 was chosen by trial and error.

My code for it follows.

nmsPred = dim(dataFrame)[2]            # number of columns, outcome included
classe = unique(dataFrame[, "classe"]) # the five outcome values
# One density column per outcome value, evaluated on a grid of 1000 points:
densities=data.frame(rep(0,1000),rep(0,1000),rep(0,1000),rep(0,1000),rep(0,1000))
names(densities)=classe
prefPred = rep(FALSE, nmsPred)
prefPred[53] = TRUE                    # to keep the "classe" variable
set.seed(11)
for (k in 1:(nmsPred-1)) {
  # Evaluate the density of feature k for each outcome value on a common grid
  lower.limit=min(dataFrame[ , k])
  upper.limit=max(dataFrame[ , k])
  for (lett in classe) {
    den <- density(dataFrame[dataFrame$classe==lett, k],
                   kernel="gaussian", n=1000,
                   from = lower.limit, to = upper.limit)
    densities[, lett] <- den$y
  }
  # Keep the feature if, for at least one pair of outcome values, the ratio of
  # the area between the curves to the area under one curve exceeds 0.75
  ind=FALSE
  for (l1 in 1:length(classe)) {
    for (l2 in classe[-l1]) {
      ind0 <- ((sum(abs(densities[,l1]-densities[,l2]))/sum(densities[,l1]))>.75)
      ind = (ind | ind0)
    }
  }
  prefPred[k] <- prefPred[k] | ind
}
## Gathering the features together:
workdf=dataFrame[ , prefPred]
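
Before fitting a model it is worth checking how many columns passed the rule; the random forest summary below shows that 19 predictors were kept, plus the outcome.

# A quick check of the selection: how many and which columns were kept
sum(prefPred)
names(dataFrame)[prefPred]
dim(workdf)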

Random Forests Algorithm Results and Cross-Validation

I will use the “random forests” option from the “caret” package. Since it uses random numbers, I need to set the random number seed. Then we can look at the model and its accuracy:

library(caret); library(e1071)
library(randomForest)
library(kernlab); library(doParallel)
set.seed(11)
cl <- makeCluster(detectCores())
registerDoParallel(cl)
modrf=train(classe~., data=workdf, method="rf", allowParallel=T)
stopCluster(cl)
registerDoSEQ()
# Looking at the resulting model
modrf
## Random Forest 
## 
## 13737 samples
##    19 predictors
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 13737, 13737, 13737, 13737, 13737, 13737, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9855892  0.9817652
##   10    0.9853784  0.9814997
##   19    0.9728235  0.9656130
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2.

Then we compare predictions on the validation set:

# Forming the validation subset:
valddf=valdTrai[, prefPred]
last=ncol(valddf)   # index of the outcome column "classe"
# Computing predictions and checking accuracies:
predictRFonVald=predict(modrf, valddf[, -last])
confusionMatrix(valdTrai$classe, predictRFonVald)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1671    2    0    0    1
##          B    6 1130    3    0    0
##          C    0   14 1006    5    1
##          D    0    0   13  950    1
##          E    0    0    4    1 1077
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9913          
##                  95% CI : (0.9886, 0.9935)
##     No Information Rate : 0.285           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.989           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9964   0.9860   0.9805   0.9937   0.9972
## Specificity            0.9993   0.9981   0.9959   0.9972   0.9990
## Pos Pred Value         0.9982   0.9921   0.9805   0.9855   0.9954
## Neg Pred Value         0.9986   0.9966   0.9959   0.9988   0.9994
## Prevalence             0.2850   0.1947   0.1743   0.1624   0.1835
## Detection Rate         0.2839   0.1920   0.1709   0.1614   0.1830
## Detection Prevalence   0.2845   0.1935   0.1743   0.1638   0.1839
## Balanced Accuracy      0.9979   0.9921   0.9882   0.9954   0.9981

As we see from the model summary and the table above, the resulting accuracy is around 99%, which is quite good. At the same time, the lowest accuracy is for the outcome value C: it is 98.8%, and we may want to increase it to make the accuracy more uniform across outcome values. Let us try to add some features to improve it, using the method above but applied to only one outcome value. The threshold in this case is chosen to be 0.65.

for (k in 1:(nmsPred-1)) {
  # Evaluate the density of feature k for each outcome value, as before
  lower.limit=min(dataFrame[ , k])
  upper.limit=max(dataFrame[ , k])
  for (lett in classe) {
    den <- density(dataFrame[dataFrame$classe==lett, k],
                   kernel="gaussian", n=1000,
                   from = lower.limit, to = upper.limit)
    densities[, lett] <- den$y
  }
  # This time compare only against the 3rd outcome value, "C",
  # and use the lower threshold 0.65
  ind=FALSE
  for (l1 in classe[-3]) {
    ind0 <- ((sum(abs(densities[,3]-densities[,l1]))/sum(densities[,3]))>.65)
    ind = (ind | ind0)
  }
  prefPred[k] <- prefPred[k] | ind
}
## Gathering the features together:
workdf=dataFrame[ , prefPred]

Running the random forests algorithm again:

set.seed(11)
cl <- makeCluster(detectCores())
registerDoParallel(cl)
modrf=train(classe~., data=workdf, method="rf", allowParallel=T)
stopCluster(cl)
registerDoSEQ()
# Looking at the resulting model
modrf
## Random Forest 
## 
## 13737 samples
##    23 predictors
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 13737, 13737, 13737, 13737, 13737, 13737, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa      Accuracy SD  Kappa SD   
##    2    0.9845870  0.9804951  0.001692352  0.002145187
##   12    0.9855843  0.9817603  0.001927464  0.002441153
##   23    0.9732220  0.9661184  0.004619732  0.005841083
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 12.

Comparing results on our validation set again, we can see that the accuracy for the value C has improved.

# Forming the validation subset:
valddf=valdTrai[, prefPred]
last=ncol(valddf)   # index of the outcome column "classe"
# Computing predictions and checking accuracies:
predictRFonVald=predict(modrf, valddf[, -last])
confusionMatrix(valdTrai$classe, predictRFonVald)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1671    2    0    0    1
##          B    7 1126    4    2    0
##          C    0   12 1010    4    0
##          D    0    0    5  957    2
##          E    0    0    6    0 1076
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9924          
##                  95% CI : (0.9898, 0.9944)
##     No Information Rate : 0.2851          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9903          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9958   0.9877   0.9854   0.9938   0.9972
## Specificity            0.9993   0.9973   0.9967   0.9986   0.9988
## Pos Pred Value         0.9982   0.9886   0.9844   0.9927   0.9945
## Neg Pred Value         0.9983   0.9971   0.9969   0.9988   0.9994
## Prevalence             0.2851   0.1937   0.1742   0.1636   0.1833
## Detection Rate         0.2839   0.1913   0.1716   0.1626   0.1828
## Detection Prevalence   0.2845   0.1935   0.1743   0.1638   0.1839
## Balanced Accuracy      0.9976   0.9925   0.9910   0.9962   0.9980

Accuracy for the other outcome values changed slightly as well, but not by much.

Principal Component Analysis

Principal component analysis is a standard method for reducing the number of features. We can compare it with the method used above by checking how many principal components would be required to capture at least 99% of the variance, matching the accuracy level we obtained.

prep=prcomp(dataFrame[-53], scale=T)
v99<-summary(prep)$importance[3,]>.988
summary(prep)$importance[3, v99]
##    PC35    PC36    PC37    PC38    PC39    PC40    PC41    PC42    PC43 
## 0.98948 0.99102 0.99223 0.99329 0.99434 0.99512 0.99582 0.99649 0.99709 
##    PC44    PC45    PC46    PC47    PC48    PC49    PC50    PC51    PC52 
## 0.99762 0.99814 0.99865 0.99905 0.99943 0.99967 0.99984 0.99996 1.00000

As we see, to reach that level the principal component method advises using 36 or 37 principal components (PC36 or PC37), which are preprocessed features computed from the original 52. In addition, it does not tell us how to improve prediction performance for one particular outcome value if we wanted to.
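
As a cross-check, the “caret” package can report the same count directly: its preProcess function centers and scales the data and then keeps as many principal components as are needed to reach the requested proportion of variance.

# Number of components needed to capture 99% of the variance
pcaCheck <- preProcess(dataFrame[ , -53], method = "pca", thresh = 0.99)
pcaCheck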

Other Applications

The method can also be used to evaluate the consistency of feature differences during boosting or cross-validation. In addition, it could be used to optimize and speed up existing tree-based algorithms. A sketch of the first idea follows.
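
For example, one could check how consistently a feature passes the rule across cross-validation folds. The helper passes_rule below is a hypothetical repackaging of the selection loop above, and the 5 folds come from caret's createFolds; this is a sketch, not code used in the analysis.

# Sketch: how consistently does the k-th feature pass the 0.75 rule across folds?
passes_rule <- function(df, k, threshold = 0.75) {
  lims <- range(df[ , k])
  classes <- levels(df$classe)
  # One density per outcome value on a common grid, as in the selection loop
  dens <- sapply(classes, function(lett)
    density(df[df$classe == lett, k], kernel = "gaussian",
            n = 1000, from = lims[1], to = lims[2])$y)
  for (l1 in classes) for (l2 in setdiff(classes, l1))
    if (sum(abs(dens[ , l1] - dens[ , l2])) / sum(dens[ , l1]) > threshold)
      return(TRUE)
  FALSE
}

set.seed(11)
folds <- createFolds(dataFrame$classe, k = 5)
# Proportion of folds in which, say, the first feature passes the rule:
mean(sapply(folds, function(idx) passes_rule(dataFrame[idx, ], 1)))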

4 comments:

  1. Hello,

    Nice article; I have a few comments :)
    1. It is a nice criterion to compare the areas determined by density curves.
    But the link "Computing Ratio of Areas" doesn't work, so it is hard to say anything else.
    Did you compare your method with other solutions? For example the "nearZeroVar" function in R, or feature selection based on the correlation between features ("cor" and "findCorrelation" in R)?

    2. About PCA:
    I cannot agree that PCA is a conceptually tough method.
    PCA has a rather simple interpretation: it can be seen as a set of rotations around the feature axes which
    leads to a K-dimensional hyperplane that minimizes the features' projection errors onto this hyperplane.
    As a result one can remove N-K features, using our assumption about the projection error value as a parameter.
    Also the method behind PCA (Singular Value Decomposition) is a well-proven and well-working algebraic tool.

    So I will say just the opposite: for the time being your "Computing Ratio of Areas" is the method that is more difficult to accept as a reliable way of improving data analysis performance. Especially if the link is missing :)

    3. Usually increasing the dimensionality of the considered data leads to higher variance and thus to a higher chance of overfitting.

    4. The kernel issue: during creation of a new feature you have to assume some kernel. How does the freedom of kernel choice
    influence the final analysis performance?

    I will appreciate your answers & comments.
    & sorry if I missed something in your text.

    Kind regards,

    Bogdan
    (dataanalytics4all.com)

    Replies
    1. Thanks for your comment!
      1. The link was changed by the Google blog site, and now it is:
      http://myabakhova.blogspot.com/2016_02_01_archive.html
      I did not know that this could happen, and I apologize.
      It is my blog post published on February 1st, 2016.

      I never said anything about methods other than PCA. The command "nearZeroVar" is still a good tool for getting rid of variables which are practically constant. On the other hand, if you center and scale them they may turn out to be useful.
      Correlation measures only linear dependency. For tree-based algorithms it is not very helpful.
      Say you generate your data in two ways, first with the same distribution, and second with distinct means:

      a) x=rnorm(500); y=rnorm(500)
      df=data.frame(value=c(x, y), is.x=c(rep(1,500),rep(0,500)))

      b) x=rnorm(500, mean=-2); y=rnorm(500, mean=2)
      df=data.frame(value=c(x, y), is.x=c(rep(1,500),rep(0,500)))

      Then in both cases the correlation of the features is very low. But in the first case they are not discernible, while in the second we have a very obvious rule: if a value is positive, then y; otherwise, x. My method makes it possible to capture the kind of difference between features which is essential for tree-based methods.

      2. The PCA method has an interpretation, but it does involve some geometric skills which come with corresponding training. And finding the rotations requires computing eigenvalues. Unless, of course, you use it out of the box.
      In addition, PCA is based on the assumption that we work with distributions which have means and variances. That may not be the case, and if so, many statistical methods fail to work. Even the Central Limit Theorem fails in such a case.

      3. It depends. There are always trade-offs.

      4. I have only done some preliminary numerical investigation, which shows that a smoother kernel yields better results with my data. I have no explanation for this.

      I still believe that my method has its merits for tree-based algorithms. But it is at an early stage of development. Given as much time and effort as has been spent on PCA, that could change.

      Currently I am in the process of changing Linux distros on my laptop, so my replies could be late. I apologize in advance for this.
      Best regards,
      Mya.

    2. Dear Mya,

      thank you for your answers, they are explanatory.

      All the best,

      Bogdan
      (dataanalytics4all.com)

    3. Hello Bogdan,
      I thought a bit more about the method and, in particular, about the choice of kernel used to approximate histograms as density curves. I realized that the way we choose a kernel for the R `density` function acts as a regularization for the method: it smooths out the effect of outliers. In particular, a smoother approximation of the density curve means more regularization.
