Mathematician in Data Science: October 2016

Introduction

I’m going to apply a method for checking variable dependency which was introduced in my previous post. Because the “dependency” I get with this rule is not true dependency as defined in Probability then I will call variables practically dependent at a confidence level \(\alpha\), where \(\alpha\) is a confidence level of bootstrapped confidence intervals.

I will modify the idea slightly: I won’t compute means with interval lengths, because it is sufficient to verify that confidence intervals for \(\textbf{Pr} (A\text{ and } B)\) and \(\textbf{Pr} (A) \textbf{Pr} (B)\) do not intersect. For this I only need the confidence interval endpoints. In addition I’ve noted that if a variable has only two values, then it is enough to check for practical dependency of only one value, because relative frequency values for such variable are complementary.

I have tried “boot” package mentioned in the previous post and discovered that it is not convenient for a really big data. It generates a huge matrix and then calculates a statistic for each column. Such approach requires a lot of memory. It is more prudent to generate a vector, calculate the statistic and then generate next vector, replacing the previous.

Data Description, Load and Initial Investigation

I’m going to use data from KDD cup 1998, from here. There is a training data set in text format, a data dictionary and some other files.

I will load the data set, which is already in my working directory. Then we can look at our data set and compare it with the data dictionary, as usual.

dt=read.csv("cup98lrn.txt", header=T)
dim(dt)

## [1] 95412   481

options("width"=90)
head(dt[ 1:9,names(dt)[1:11] ])

##   ODATEDW OSOURCE TCODE STATE   ZIP MAILCODE PVASTATE  DOB NOEXCH RECINHSE RECP3
## 1    8901     GRI     0    IL 61081                   3712      0               
## 2    9401     BOA     1    CA 91326                   5202      0               
## 3    9001     AMH     1    NC 27017                      0      0               
## 4    8701     BRY     0    CA 95953                   2801      0               
## 5    8601             0    FL 33176                   2001      0        X     X
## 6    9401     CWR     0    AL 35603                      0      0

Now let us take a look at our target variables “TARGET_B” and “TARGET_D”. “TARGET_B” shows if there was any donation at all, and “TARGET_D” stands for the donation value.

library(data.table)
dt=data.table(dt)
c(dt[ ,class(TARGET_B)], dt[,class(TARGET_D)])

## [1] "integer" "numeric"

options("digits"=5)
c(dt[,as.integer(unique(TARGET_B))], dt[,mean(TARGET_B)], dt[,length(unique(TARGET_D))],dt[,mean(TARGET_D)])

## [1]  0.000000  1.000000  0.050759 71.000000  0.793073

Both have numeric values. Outcomes of “TARGET_B” yes/no are marked numerically as 1/0. We see that about 5% of all respondents donated.

I will consider only the last variable, because it is easier to apply my rule to a variable with 2 values.

Using “nearZeroVar” Function

At first I will employ a standard approach for eliminating non-informative and/or almost constant variables using “nearZeroVar” function. Since our target variable has about 5% of positive responses, then the function should have suitable options.

library(caret)
varToDrop=nearZeroVar(dt, names = TRUE,freqCut = 99/1, uniqueCut = 1)
options("width"=90)
print(paste0("It yields ", length(varToDrop), " variables"))

## [1] "It yields 25 variables"

varToDrop

##  [1] "NOEXCH"   "RECPGVG"  "MDMAUD"   "CHILD03"  "PUBPHOTO" "MAJOR"    "HOMEE"   
##  [8] "PLATES"   "ETH12"    "ADATE_2"  "ADATE_3"  "ADATE_4"  "ADATE_5"  "ADATE_6" 
## [15] "ADATE_13" "ADATE_14" "ADATE_15" "ADATE_20" "ADATE_23" "ADATE_24" "MAXADATE"
## [22] "RFA_2R"   "MDMAUD_R" "MDMAUD_F" "MDMAUD_A"

So there are 25 variables which do not vary much. But what if the last ones are useful for our prediction in some way?

For example, consider “ADATE_14” variable, which stands for “Date the 95NK promotion was mailed” in the data dictionary. If we make a plot, then we see that here is some kind of dependency:

boxplot(TARGET_D~addNA(ADATE_14), data=dt)

When we compute means separately for values of the target variable we see definite differences, as it is shown below.

dt[, .N, by=ADATE_14]

##    ADATE_14     N
## 1:     9506 76381
## 2:       NA 18867
## 3:     9504   164

c(dt[ADATE_14==9506, mean(TARGET_B)],dt[(is.na(ADATE_14)|ADATE_14!=9506), mean(TARGET_B)])

## [1] 0.053521 0.039672

An obvious deficiency of “nearZeroVar” function is that it considers a variable without its relation to a target variable. Very small fluctuations can be useful if they are in sinc with our outcome, especially when success rate is very low. But looking at each graph or checking all corresponding means to detect dependency evidence could be difficult with almost 500 variables.

Checking the Practical Dependency Condition.

Let us investigate, using my idea in the post Measuring Dependence of Variables with Confidence Intervals. We can easily calculate frequency of “TARGET_B” variable when it takes value “1”, because its another value is 0 and therefore the frequency is a sum of values divided by number of rows. With a different variable values we can make dummy variables and use the same calculation. To verify that computed confidence intervals do not intersect we compare upper bound of one interval with a lower bound of another one.

I wrote a function which checks for practical dependency of two variables when they have only values 0 and 1. It works for pair of variables which are passed as vectors “v1”, “v2”, a confidence level equaled “level” and a given number of bootstrap runs.

areDependent<-function(v1, v2, level,runN=100000L) {
num=length(v1)
if (num==length(v2) & level>0.5 & level <1) {
   require(data.table)
   probs=c(.5*(1-level), .5*(1+level))
   work_means=data.table(x=numeric(length=runN), 
                         y=numeric(length=runN),
                         product_means=numeric(length=runN))
   work_table=data.table(x=v1,y=v2)
   work_table[, product:=x*y]
   for (i in 1:runN) {
     set.seed(i)
     work_means[i,]=work_table[sample.int(num, size = num, 
                    replace =TRUE), .(sum(x),sum(y),sum(product))]  
   }
   work_means[, c("x", "y","product_means"):=
                            list(x/num, y/num,product_means/num)]
   product_ci=work_means[,quantile(product_means, probs=probs)]
   other_ci=(work_means[,quantile(x*y, probs=probs)])
   check=((max(product_ci)<min(other_ci))|
            (max(other_ci)<min(product_ci)))
   }
   else  {
    print("your inputs are wrong")
    check=NA
   }
return(check)
}

I found 3 variables which “nearZeroVar” function indicated for discarding and which at level 90% satisfy the practical dependency condition: “ADATE_14”, “ADATE_15” and “ADATE_23”. Corresponding correlation coefficients with the target variable are included.

t=Sys.time()
dt[, .N, by="ADATE_14"]

##    ADATE_14     N
## 1:     9506 76381
## 2:       NA 18867
## 3:     9504   164

ADATE_14=dt[,ADATE_14]
ADATE_14[(is.na(ADATE_14)|ADATE_14!=9506)]=0
ADATE_14[ADATE_14==9506]=1
t=Sys.time()
areDependent(v1=dt[,TARGET_B],v2=ADATE_14, level=.9)

## [1] TRUE

cor(ADATE_14, dt[,TARGET_B])

## [1] 0.025211

dt[, .N, by="ADATE_15"]

##    ADATE_15     N
## 1:     9504 29935
## 2:       NA 65477

ADATE_15=dt[,ADATE_15]
ADATE_15[is.na(ADATE_15)]=0
ADATE_15[ADATE_15==9504]=1
areDependent(v1=dt[,TARGET_B],v2=ADATE_15, level=.9)

## [1] TRUE

cor(ADATE_15, dt[,TARGET_B])

## [1] 0.02115

dt[, .N, by="ADATE_23"]

##    ADATE_23     N
## 1:     9407 38877
## 2:       NA 56270
## 3:     9406   243
## 4:     9312    22

ADATE_23=dt[,ADATE_23]
ADATE_23[(is.na(ADATE_23)|ADATE_23!=9407)]=0
ADATE_23[ADATE_23==9407]=1
areDependent(v1=dt[,TARGET_B],v2=ADATE_23, level=.9)

## [1] TRUE

cor(ADATE_23, dt[,TARGET_B])

## [1] 0.01736

Sys.time()-t

## Time difference of 34.121 mins

They all turned out to be practically dependable at level 90% as well. You can see their pairwise correlations at the end of the post together with other variables.

As you see my implementation takes a lot of time. I tried “apply” functions and learn that they use the same random seed for a whole run. I attempted to use “foreach” and could not make “data.table” to cooperate with my preferable randomization. I would like to to have a distinct random seed each time. Any help here will be appreciated.

Variables which are not practically dependent from the target variable at level 90%.

I’ve checked out variables with 2 values which were not picked by “nearZeroVar” function and found that at level 90% some of them are not practically dependent from the target variable.

VETERANS=dt[,as.numeric(factor(VETERANS))-1]
t=Sys.time()
areDependent(v1=dt[,TARGET_B],v2=VETERANS, level=.9)

## [1] FALSE

cor(VETERANS, dt[,TARGET_B])

## [1] 0.007773

STEREO=dt[,as.numeric(factor(STEREO))-1]
areDependent(v1=dt[,TARGET_B],v2=STEREO, level=.9)

## [1] FALSE

cor(STEREO, dt[,TARGET_B])

## [1] 0.00064342

Sys.time()-t

## Time difference of 22.97 mins

Training Data on Choosen Variables

Now a reality check: applying logistic regression and classification trees to our variables.

summary(glm(dt[,TARGET_B]~
        VETERANS+STEREO+ADATE_14+ADATE_15+ADATE_23, 
        family = "binomial"))$coef

##              Estimate Std. Error   z value   Pr(>|z|)
## (Intercept) -3.200962   0.037834 -84.60469 0.0000e+00
## VETERANS     0.106498   0.047004   2.26571 2.3469e-02
## STEREO      -0.014937   0.044577  -0.33508 7.3757e-01
## ADATE_14     0.242234   0.043761   5.53537 3.1057e-08
## ADATE_15     0.114418   0.033995   3.36573 7.6341e-04
## ADATE_23     0.058708   0.032658   1.79769 7.2227e-02

options("digits"=4)
cor(data.frame(TARGET_B=dt[, TARGET_B],VETERANS,STEREO,ADATE_14,
               ADATE_15, ADATE_23))

##           TARGET_B VETERANS     STEREO ADATE_14  ADATE_15 ADATE_23
## TARGET_B 1.0000000 0.007773  0.0006434 0.025211  0.021150  0.01736
## VETERANS 0.0077730 1.000000  0.2400606 0.017284  0.006364  0.00983
## STEREO   0.0006434 0.240061  1.0000000 0.008152 -0.003251 -0.02461
## ADATE_14 0.0252114 0.017284  0.0081520 1.000000  0.334455  0.33028
## ADATE_15 0.0211499 0.006364 -0.0032513 0.334455  1.000000  0.36428
## ADATE_23 0.0173601 0.009830 -0.0246071 0.330283  0.364283  1.00000

As you see variables chosen by the method are not worse and sometimes even better then the ones picked up by “nearZeroVar” function. From another hand calculating correlation coefficients explains a lot for a regression.

library(rpart)
rpart_mod=rpart(dt[, TARGET_B] ~ 
                VETERANS+STEREO+ADATE_14+ADATE_15+ADATE_23,
                control = rpart.control(cp = 1e-05)) 
rpart_mod$variable.importance

## ADATE_14 ADATE_15 VETERANS ADATE_23   STEREO 
##   2.9220   0.8425   0.5788   0.4700   0.4382

For trees initial choice of variables to split is crucial for the rest of algorithm training. If we limit our data set to rows where we have 1 for the variable ADATE_14 (I remind you that it means value 9506 for column ADATE_14 in our data table) then practial dependency for variable ADATE_23 at level 90% is lost.

areDependent(v1=dt[ADATE_14==9506,TARGET_B],v2=ADATE_23[ADATE_14==1], 
                                                             level=.9)

## [1] FALSE

Conclusion.

At the moment the considered method does not appear to offer much of improvement in comparison with ones which are used already. Calculations take a lot of time and you can always explicitly train your data to understand what variable is better for your choice of training. The method can have some merits for theoretical analysis.

Disclaimer

There are conditions on using this data, one of which requires to notify Ismail Parsa (iparsa@epsilon.com) and Ken Howes (khowes@epsilon.com). I tried to do this and discovered that provided e-mails do not work. Any information on what to do in such case will be appreciated.

Mathematician in Data Science

Friday, October 21, 2016

Application: Detection of Practical Dependency of Variables with Confidence Intervals