Introduction
I’m going to apply a method for checking variable dependency which was introduced in my previous post. Because the “dependency” I get with this rule is not true dependence as defined in probability theory, I will call variables practically dependent at a confidence level \(\alpha\), where \(\alpha\) is the confidence level of the bootstrapped confidence intervals.
I will modify the idea slightly: I won’t compute means with interval lengths, because it is sufficient to verify that the confidence intervals for \(\textbf{Pr} (A\text{ and } B)\) and \(\textbf{Pr} (A) \textbf{Pr} (B)\) do not intersect. For this I only need the confidence interval endpoints. In addition, I’ve noted that if a variable has only two values, it is enough to check practical dependency for only one of them, because the relative frequencies of the two values are complementary.
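The non-intersection check reduces to comparing endpoints. Here is a minimal sketch in R; `intervals_disjoint` is a hypothetical helper name used only for illustration, and the endpoints are made-up numbers rather than results from the data:

```r
# Hypothetical helper: TRUE when two intervals, each given as c(lower, upper),
# have no point in common.
intervals_disjoint <- function(ci1, ci2) {
  max(ci1) < min(ci2) || max(ci2) < min(ci1)
}

# Made-up endpoints, for illustration only.
intervals_disjoint(c(0.040, 0.042), c(0.043, 0.045))  # TRUE
intervals_disjoint(c(0.040, 0.044), c(0.043, 0.045))  # FALSE
```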
I have tried the “boot” package mentioned in the previous post and discovered that it is not convenient for really big data. It generates a huge matrix and then calculates a statistic for each column. Such an approach requires a lot of memory. It is more prudent to generate a vector, calculate the statistic, and then generate the next vector, replacing the previous one.
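The one-vector-at-a-time idea can be sketched in base R as follows. This is only an illustration under simple assumptions: `boot_means` is a hypothetical name, the statistic is a plain mean, and the input is simulated rather than taken from the data set used below.

```r
# Sketch: bootstrap the mean of a 0/1 vector one resample at a time,
# keeping only the resulting statistic from each run.
boot_means <- function(v, runN = 1000L) {
  n <- length(v)
  out <- numeric(runN)              # one number per run is stored
  for (i in seq_len(runN)) {
    idx <- sample.int(n, size = n, replace = TRUE)
    out[i] <- mean(v[idx])          # idx is overwritten on the next iteration
  }
  out
}

set.seed(1)                                   # arbitrary seed, for reproducibility
v <- rbinom(10000, size = 1, prob = 0.05)     # simulated data, not the KDD set
quantile(boot_means(v), probs = c(0.05, 0.95))
```

Memory use here grows with the number of runs only through the vector of statistics, not through the resamples themselves.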
Data Description, Load and Initial Investigation
I’m going to use data from KDD cup 1998, from here. There is a training data set in text format, a data dictionary and some other files.
I will load the data set, which is already in my working directory. Then we can look at our data set and compare it with the data dictionary, as usual.
dt=read.csv("cup98lrn.txt", header=T)
dim(dt)
## [1] 95412 481
options("width"=90)
head(dt[ 1:9,names(dt)[1:11] ])
## ODATEDW OSOURCE TCODE STATE ZIP MAILCODE PVASTATE DOB NOEXCH RECINHSE RECP3
## 1 8901 GRI 0 IL 61081 3712 0
## 2 9401 BOA 1 CA 91326 5202 0
## 3 9001 AMH 1 NC 27017 0 0
## 4 8701 BRY 0 CA 95953 2801 0
## 5 8601 0 FL 33176 2001 0 X X
## 6 9401 CWR 0 AL 35603 0 0
Now let us take a look at our target variables “TARGET_B” and “TARGET_D”. “TARGET_B” shows if there was any donation at all, and “TARGET_D” stands for the donation value.
library(data.table)
dt=data.table(dt)
c(dt[ ,class(TARGET_B)], dt[,class(TARGET_D)])
## [1] "integer" "numeric"
options("digits"=5)
c(dt[,as.integer(unique(TARGET_B))], dt[,mean(TARGET_B)], dt[,length(unique(TARGET_D))],dt[,mean(TARGET_D)])
## [1] 0.000000 1.000000 0.050759 71.000000 0.793073
Both have numeric values. Outcomes of “TARGET_B” yes/no are marked numerically as 1/0. We see that about 5% of all respondents donated.
I will consider only “TARGET_B”, because it is easier to apply my rule to a variable with 2 values.
Using “nearZeroVar” Function
At first I will employ a standard approach for eliminating non-informative and/or almost constant variables, using the “nearZeroVar” function. Since our target variable has only about 5% positive responses, the function’s options should be set accordingly.
library(caret)
varToDrop=nearZeroVar(dt, names = TRUE,freqCut = 99/1, uniqueCut = 1)
options("width"=90)
print(paste0("It yields ", length(varToDrop), " variables"))
## [1] "It yields 25 variables"
varToDrop
## [1] "NOEXCH" "RECPGVG" "MDMAUD" "CHILD03" "PUBPHOTO" "MAJOR" "HOMEE"
## [8] "PLATES" "ETH12" "ADATE_2" "ADATE_3" "ADATE_4" "ADATE_5" "ADATE_6"
## [15] "ADATE_13" "ADATE_14" "ADATE_15" "ADATE_20" "ADATE_23" "ADATE_24" "MAXADATE"
## [22] "RFA_2R" "MDMAUD_R" "MDMAUD_F" "MDMAUD_A"
So there are 25 variables which do not vary much. But what if some of them are useful for our prediction in some way?
For example, consider the “ADATE_14” variable, which stands for “Date the 95NK promotion was mailed” in the data dictionary. If we make a plot, we see that there is some kind of dependency:
boxplot(TARGET_D~addNA(ADATE_14), data=dt)
When we compute the mean of the target variable separately for values of “ADATE_14”, we see a definite difference, as shown below.
dt[, .N, by=ADATE_14]
## ADATE_14 N
## 1: 9506 76381
## 2: NA 18867
## 3: 9504 164
c(dt[ADATE_14==9506, mean(TARGET_B)],dt[(is.na(ADATE_14)|ADATE_14!=9506), mean(TARGET_B)])
## [1] 0.053521 0.039672
An obvious deficiency of the “nearZeroVar” function is that it considers a variable without its relation to a target variable. Very small fluctuations can be useful if they are in sync with our outcome, especially when the success rate is very low. But looking at each graph or checking all the corresponding means to detect evidence of dependency could be difficult with almost 500 variables.
Checking the Practical Dependency Condition.
Let us investigate, using my idea in the post Measuring Dependence of Variables with Confidence Intervals. We can easily calculate the relative frequency of the value “1” for the “TARGET_B” variable: because its other value is 0, the frequency is the sum of the values divided by the number of rows. For a variable with different values we can make dummy variables and use the same calculation. To verify that the computed confidence intervals do not intersect, we compare the upper bound of one interval with the lower bound of the other.
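A small sketch of the frequency and dummy-variable calculation on toy data may help. The vectors `target` and `x` are made up for illustration and are not columns of the data set.

```r
# Toy data, made up for illustration.
target <- c(1, 0, 0, 1, 0, 0, 0, 1)
x      <- c("a", "b", "a", "a", "c", "b", "a", "a")

# Relative frequency of target == 1 is the mean of a 0/1 vector.
mean(target)                      # 0.375

# Dummy variable for one value of x.
x_is_a <- as.integer(x == "a")

# The product of two 0/1 vectors marks rows where both events occur,
# so its mean estimates Pr(A and B).
mean(target * x_is_a)             # 0.375
mean(target) * mean(x_is_a)       # 0.375 * 0.625 = 0.234375
```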
I wrote a function which checks for practical dependency of two variables when they have only the values 0 and 1. It takes a pair of variables passed as vectors “v1” and “v2”, a confidence level “level” and a number of bootstrap runs “runN”.
areDependent<-function(v1, v2, level, runN=100000L) {
  # v1, v2: 0/1 vectors of equal length; level: confidence level in (0.5, 1)
  num=length(v1)
  if (num==length(v2) && level>0.5 && level<1) {
    require(data.table)
    # lower and upper quantile levels for the bootstrap intervals
    probs=c(.5*(1-level), .5*(1+level))
    work_means=data.table(x=numeric(length=runN),
                          y=numeric(length=runN),
                          product_means=numeric(length=runN))
    work_table=data.table(x=v1,y=v2)
    work_table[, product:=x*y]
    for (i in 1:runN) {
      set.seed(i)  # distinct seed for each resample
      work_means[i,]=work_table[sample.int(num, size = num,
                       replace =TRUE), .(sum(x),sum(y),sum(product))]
    }
    # turn sums into relative frequencies
    work_means[, c("x", "y","product_means"):=
                 list(x/num, y/num,product_means/num)]
    # interval for Pr(A and B)
    product_ci=work_means[,quantile(product_means, probs=probs)]
    # interval for Pr(A)Pr(B)
    other_ci=(work_means[,quantile(x*y, probs=probs)])
    # TRUE when the two intervals do not intersect
    check=((max(product_ci)<min(other_ci))|
           (max(other_ci)<min(product_ci)))
  }
  else {
    print("your inputs are wrong")
    check=NA
  }
  return(check)
}
I found 3 variables which the “nearZeroVar” function marked for discarding and which satisfy the practical dependency condition at level 90%: “ADATE_14”, “ADATE_15” and “ADATE_23”. The corresponding correlation coefficients with the target variable are included.
t=Sys.time()
dt[, .N, by="ADATE_14"]
## ADATE_14 N
## 1: 9506 76381
## 2: NA 18867
## 3: 9504 164
ADATE_14=dt[,ADATE_14]
ADATE_14[(is.na(ADATE_14)|ADATE_14!=9506)]=0
ADATE_14[ADATE_14==9506]=1
t=Sys.time()
areDependent(v1=dt[,TARGET_B],v2=ADATE_14, level=.9)
## [1] TRUE
cor(ADATE_14, dt[,TARGET_B])
## [1] 0.025211
dt[, .N, by="ADATE_15"]
## ADATE_15 N
## 1: 9504 29935
## 2: NA 65477
ADATE_15=dt[,ADATE_15]
ADATE_15[is.na(ADATE_15)]=0
ADATE_15[ADATE_15==9504]=1
areDependent(v1=dt[,TARGET_B],v2=ADATE_15, level=.9)
## [1] TRUE
cor(ADATE_15, dt[,TARGET_B])
## [1] 0.02115
dt[, .N, by="ADATE_23"]
## ADATE_23 N
## 1: 9407 38877
## 2: NA 56270
## 3: 9406 243
## 4: 9312 22
ADATE_23=dt[,ADATE_23]
ADATE_23[(is.na(ADATE_23)|ADATE_23!=9407)]=0
ADATE_23[ADATE_23==9407]=1
areDependent(v1=dt[,TARGET_B],v2=ADATE_23, level=.9)
## [1] TRUE
cor(ADATE_23, dt[,TARGET_B])
## [1] 0.01736
Sys.time()-t
## Time difference of 34.121 mins
They all turned out to be practically dependent on the target variable at level 90%. You can see their pairwise correlations at the end of the post together with other variables.
As you see, my implementation takes a lot of time. I tried the “apply” functions and learned that they use the same random seed for a whole run. I attempted to use “foreach” and could not make “data.table” cooperate with my preferred randomization. I would like to have a distinct random seed each time. Any help here will be appreciated.
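One possible direction, offered as an unbenchmarked sketch rather than a tested replacement: call `set.seed(i)` inside the function passed to `vapply`, so that every run gets its own seed, and work with plain vectors instead of assigning rows of a data table. The name `boot_sums` is hypothetical, and the inputs below are simulated.

```r
# Hypothetical helper: bootstrap means of x, y and x*y for two 0/1 vectors,
# with a distinct seed for every run.
boot_sums <- function(v1, v2, runN = 1000L) {
  num <- length(v1)
  stopifnot(num == length(v2))
  prod12 <- v1 * v2
  res <- vapply(seq_len(runN), function(i) {
    set.seed(i)                                   # the seed depends only on i
    idx <- sample.int(num, size = num, replace = TRUE)
    as.numeric(c(sum(v1[idx]), sum(v2[idx]), sum(prod12[idx])))
  }, numeric(3))
  out <- t(res) / num                             # runN x 3 matrix of means
  colnames(out) <- c("x", "y", "product")
  out
}

# Simulated inputs, for illustration only.
set.seed(42)
a <- rbinom(5000, 1, 0.05)
b <- rbinom(5000, 1, 0.80)
m <- boot_sums(a, b, runN = 200L)
quantile(m[, "product"], probs = c(0.05, 0.95))       # interval for Pr(A and B)
quantile(m[, "x"] * m[, "y"], probs = c(0.05, 0.95))  # interval for Pr(A) Pr(B)
```

Because each run depends only on its index, the same inner function could in principle be handed to a parallel backend without changing the seeds.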
Variables which are not practically dependent on the target variable at level 90%.
I’ve checked variables with 2 values which were not flagged by the “nearZeroVar” function and found that at level 90% some of them are not practically dependent on the target variable.
VETERANS=dt[,as.numeric(factor(VETERANS))-1]
t=Sys.time()
areDependent(v1=dt[,TARGET_B],v2=VETERANS, level=.9)
## [1] FALSE
cor(VETERANS, dt[,TARGET_B])
## [1] 0.007773
STEREO=dt[,as.numeric(factor(STEREO))-1]
areDependent(v1=dt[,TARGET_B],v2=STEREO, level=.9)
## [1] FALSE
cor(STEREO, dt[,TARGET_B])
## [1] 0.00064342
Sys.time()-t
## Time difference of 22.97 mins
Training Data on Chosen Variables
Now a reality check: applying logistic regression and classification trees to our variables.
summary(glm(dt[,TARGET_B]~
VETERANS+STEREO+ADATE_14+ADATE_15+ADATE_23,
family = "binomial"))$coef
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.200962 0.037834 -84.60469 0.0000e+00
## VETERANS 0.106498 0.047004 2.26571 2.3469e-02
## STEREO -0.014937 0.044577 -0.33508 7.3757e-01
## ADATE_14 0.242234 0.043761 5.53537 3.1057e-08
## ADATE_15 0.114418 0.033995 3.36573 7.6341e-04
## ADATE_23 0.058708 0.032658 1.79769 7.2227e-02
options("digits"=4)
cor(data.frame(TARGET_B=dt[, TARGET_B],VETERANS,STEREO,ADATE_14,
ADATE_15, ADATE_23))
## TARGET_B VETERANS STEREO ADATE_14 ADATE_15 ADATE_23
## TARGET_B 1.0000000 0.007773 0.0006434 0.025211 0.021150 0.01736
## VETERANS 0.0077730 1.000000 0.2400606 0.017284 0.006364 0.00983
## STEREO 0.0006434 0.240061 1.0000000 0.008152 -0.003251 -0.02461
## ADATE_14 0.0252114 0.017284 0.0081520 1.000000 0.334455 0.33028
## ADATE_15 0.0211499 0.006364 -0.0032513 0.334455 1.000000 0.36428
## ADATE_23 0.0173601 0.009830 -0.0246071 0.330283 0.364283 1.00000
As you see, the variables chosen by the method are not worse, and sometimes even better, than the ones kept by the “nearZeroVar” function. On the other hand, calculating correlation coefficients explains a lot for a regression.
library(rpart)
rpart_mod=rpart(dt[, TARGET_B] ~
VETERANS+STEREO+ADATE_14+ADATE_15+ADATE_23,
control = rpart.control(cp = 1e-05))
rpart_mod$variable.importance
## ADATE_14 ADATE_15 VETERANS ADATE_23 STEREO
## 2.9220 0.8425 0.5788 0.4700 0.4382
For trees, the initial choice of variables to split on is crucial for the rest of the algorithm’s training. If we limit our data set to rows where we have 1 for the variable ADATE_14 (recall that this means value 9506 in column ADATE_14 of our data table), then practical dependency for variable ADATE_23 at level 90% is lost.
areDependent(v1=dt[ADATE_14==9506,TARGET_B],v2=ADATE_23[ADATE_14==1],
level=.9)
## [1] FALSE
Conclusion.
At the moment the method considered here does not appear to offer much improvement over the ones already in use. Calculations take a lot of time, and you can always train a model explicitly to see which variables work better for your chosen kind of training. The method may have some merit for theoretical analysis.
Disclaimer
There are conditions on using this data, one of which requires notifying Ismail Parsa (iparsa@epsilon.com) and Ken Howes (khowes@epsilon.com). I tried to do this and discovered that the provided e-mail addresses do not work. Any information on what to do in such a case will be appreciated.