Thursday, April 28, 2016

Improving the performance of random forests for a particular outcome value by adding chosen features

Summary

Choosing features to improve the performance of a particular algorithm is a difficult problem. The standard tool is PCA, which, although it can be used out of the box, is hard to understand, is not easy to interpret, and requires centering and scaling of features. In addition, it does not allow improving prediction performance for a particular outcome value (if its accuracy is lower than for the others, or if it is of particular importance). My method uses features without preprocessing, so the resulting prediction is easy to explain. Moreover, it can be used to improve prediction accuracy for a specified outcome value. It is based on comparing feature densities and has a clear visual interpretation, which does not require thorough knowledge of linear algebra or calculus.

Application Example

Here is a worked-out example of choosing features for the random forests algorithm, with R code. It is supplemented with choosing additional features to improve prediction for a particular outcome value. The method for comparing densities is described in detail in: Computing Ratio of Areas.
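To give a feel for the density-comparison idea before the worked example, here is a minimal sketch (not the exact "Computing Ratio of Areas" code): estimate the densities of a feature for two classes on a common grid and measure how much they overlap with a trapezoidal rule. The function name and arguments are illustrative, not from the post.

```r
# Sketch: overlap of two class densities, as a number between 0 and 1.
# A small overlap means the feature separates the two classes well.
density_overlap <- function(x1, x2, n = 512) {
  rng <- range(c(x1, x2))
  d1 <- density(x1, from = rng[1], to = rng[2], n = n)
  d2 <- density(x2, from = rng[1], to = rng[2], n = n)
  lo <- pmin(d1$y, d2$y)                         # pointwise minimum of the two curves
  dx <- diff(d1$x)                               # common grid spacing
  sum(dx * (head(lo, -1) + tail(lo, -1)) / 2)    # trapezoidal area under the minimum
}

set.seed(11)
density_overlap(rnorm(1000, 0), rnorm(1000, 5))  # well-separated: close to 0
density_overlap(rnorm(1000, 0), rnorm(1000, 0))  # same distribution: close to 1
```

A feature whose per-class densities barely overlap is a good candidate for telling those classes apart.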

For my computations I will use the Human Activity Recognition data. A short description of the problem follows:

Human Activity Recognition - HAR - is a supervised classification problem, whose training data is obtained via an experiment having human subjects perform different activities. It focuses on the data collection from multiple accelerometers and possibly other sensors placed on a human body. There are many potential applications for HAR, like: elderly monitoring, life log systems for monitoring energy expenditure and for supporting weight-loss programs, and digital assistants for weight lifting exercises.

Data

The training data for this project are available here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

The test data are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har.

In this project we use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly (marked as “A”) and incorrectly in 4 different ways:

A - exactly according to the specification

B - throwing the elbows to the front

C - lifting the dumbbell only halfway

D - lowering the dumbbell only halfway

E - throwing the hips to the front

The outcome column with these letters is called “classe”. Read more, with pictures: http://groupware.les.inf.puc-rio.br/har#ixzz3jfosRia6

The goal of the project is to predict the manner in which participants did the exercise. In particular, we are to figure out which features to use to reach our goal.

I will load only the “training” file, because the “testing” file contains no labels for exercise correctness.

training=read.csv("pml-training.csv", stringsAsFactors=F)

Now we can look at the “training” file more closely. It is easy to establish that the first columns contain subject names, days and times of recording, and other identifiers which do not represent physical movements. For example, the very first column, “X”, simply enumerates the rows, and the last column, “classe”, contains the letters marking performance quality. We can look at the dimensions of the data table, the first 10 column names, the first 20 values of the first column, and the unique values of the last column, “classe”.

dim(training)
## [1] 19622   160
names(training)[1:10]
##  [1] "X"                    "user_name"            "raw_timestamp_part_1"
##  [4] "raw_timestamp_part_2" "cvtd_timestamp"       "new_window"          
##  [7] "num_window"           "roll_belt"            "pitch_belt"          
## [10] "yaw_belt"
training$X[1:20]
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
unique(training$classe)
## [1] "A" "B" "C" "D" "E"

We see users’ names and so on. These columns should be removed before fitting a predictive model.

Cleaning data

The next step is less obvious: by inspecting the “training” data I discovered many features with mostly undefined values. Some of them are read as logical with nonexistent values and some as character. We can count the columns with undefined values and the numeric features in the training set, and we will see that fewer than 40% of the columns turn out to be useful.

sum(colSums(is.na(training)) !=0)
## [1] 67
sum(sapply(training, is.numeric))
## [1] 123

For my work I will remove the non-numeric features from the training set, together with the first 7 columns. In addition, some of the sparse columns consist mostly of undefined values (NA), and we should get rid of them as well. The new data set with numeric features will be called “trai”. Note that the outcome column “classe” is removed by the first command as one of the character columns, so I need to put it back.

prVec=sapply(training, is.numeric)
prVec[1:7]=F
trai=training[ , prVec]
trai <- trai[, colSums(is.na(trai)) == 0] 
trai$classe=as.factor(training$classe)
dim(trai)
## [1] 19622    53

We are left with 53 columns, the last of which is our outcome. Now let us split the data for cross-validation. I will load the required packages, set a seed for random sampling, and then put 70% of the data frame into a training set and the rest into a validation set.

library(caret); library(e1071)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(11)
inTrain = createDataPartition(y=trai$classe, p=0.7, list=F)
dataFrame=trai[inTrain,]
valdTrai=trai[-inTrain, ]
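For readers without the caret package at hand, what `createDataPartition` does here can be approximated in base R: sample 70% of the row indices within each class, so that class proportions are preserved. This is a sketch of the idea, not the post's code; `strat_split` is an illustrative name.

```r
set.seed(11)
# Stratified 70/30 split: sample row indices within each level of y
strat_split <- function(y, p = 0.7) {
  unlist(lapply(split(seq_along(y), y), function(ix) {
    sample(ix, floor(p * length(ix)))
  }), use.names = FALSE)
}

# Toy outcome with unequal class sizes
y <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))
idx <- strat_split(y, 0.7)
table(y[idx])            # per-class counts: 70% of each class
length(idx) / length(y)  # overall fraction: 0.7
```

Because the sampling happens within each class, rare outcome values keep their share in both the training and validation sets.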

Choosing features

We need to find ways to distinguish the types of performance, labeled by letters, from each other.

There are 52 features in our data frame. That is still a lot for my laptop: an unparallelized “random forests” run took 2 hours. Granted, my laptop is not a powerful one, but what if our prediction model is supposed to work across different gadgets? It makes sense to make it less dependent on resources and to reduce the number of features. Let us visualize the feature densities and figure out which features differ most across classes.

library(ggplot2)
library(Rmisc)
## Loading required package: plyr
p1=qplot(dataFrame[, 1], color=classe, data=dataFrame, geom="density", xlab="First feature")
p2=qplot(dataFrame[, 2], color=classe, data=dataFrame, geom="density", xlab="Second feature")
p3=qplot(dataFrame[, 3], color=classe, data=dataFrame, geom="density", xlab="Third feature")
p4=qplot(dataFrame[, 4], color=classe, data=dataFrame, geom="density", xlab="Fourth feature")
p5=qplot(dataFrame[, 5], color=classe, data=dataFrame, geom="density", xlab="Fifth feature")
p6=qplot(dataFrame[, 6], color=classe, data=dataFrame, geom="density", xlab="Sixth feature")
multiplot(p1,p2,p3,p4,p5,p6, cols=2)
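Once a few informative features are picked from plots like these, they can be plugged into a random forest and checked against the validation set. The sketch below illustrates that step on the built-in `iris` data as a self-contained stand-in (the actual HAR feature subset would come from the density comparison above); the `chosen` columns are illustrative, not the post's final selection.

```r
library(randomForest)  # assumed installed: install.packages("randomForest")
set.seed(11)

# iris as a small stand-in classification problem, split 70/30
idx   <- sample(nrow(iris), 0.7 * nrow(iris))
train <- iris[idx, ]
valid <- iris[-idx, ]

chosen <- c("Petal.Length", "Petal.Width")  # hypothetical reduced feature subset
fit  <- randomForest(x = train[, chosen], y = train$Species, ntree = 100)
pred <- predict(fit, valid[, chosen])
mean(pred == valid$Species)  # out-of-sample accuracy on the reduced features
```

Training on 2 features instead of all 4 is of course far cheaper, and if the accuracy on the validation set stays acceptable, the reduced model is the one to keep.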