In this post I will sometimes use the term “variable” for a “feature” (“predictor”) or an “outcome” (“predicted value”).
The question of variable dependencies in a particular data set is quite important, because answering it can help to reduce the number of predictors used in a model. It can also tell us which feature is not helpful for model construction, although such a feature can still be used to engineer another predictor. For example, sometimes it is better to compute speed than to use distance values. In addition, some standard algorithms assume independence of features, and it is useful to know how close to reality such an assumption is.
The standard way to check dependencies between variables is to compute their covariance matrix. But it captures only linear dependencies: if the dependencies are not linear, the covariance matrix may not pick them up. There are numerous well-known examples, so I will not repeat them here.
Let us take a different approach. The definition of independent events is the following equality: \begin{equation} \textbf{Pr} (A \text{ and } B) = \textbf{Pr} (A) \textbf{Pr} (B). \end{equation} Hence for dependent events we should have an inequality. A simple measure of such disparity is the absolute value of the difference between the left-hand side and the right-hand side: \begin{equation} \big|\textbf{Pr} (A\text{ and } B) - \textbf{Pr} (A) \textbf{Pr} (B)\big|. \end{equation} Since in Data Science we work with probability estimates, exact equality in the first formula is unlikely anyway. The question is: how far from zero must the difference in the second formula be for us to believe that the variables under consideration are dependent?
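As a minimal sketch of this disparity measure, the snippet below estimates the probabilities by relative frequencies in a list of paired observations. The function name `independence_gap` and the toy data are illustrative assumptions, not part of any library.

```python
def independence_gap(pairs, a, b):
    """Return |Fr(A and B) - Fr(A) * Fr(B)| for the events
    A: first component equals a, B: second component equals b."""
    n = len(pairs)
    fr_a = sum(1 for u, _ in pairs if u == a) / n
    fr_b = sum(1 for _, v in pairs if v == b) / n
    fr_ab = sum(1 for u, v in pairs if u == a and v == b) / n
    return abs(fr_ab - fr_a * fr_b)


# Hypothetical toy data: every combination appears equally often,
# so the two components look independent.
independent = [(0, 0), (0, 1), (1, 0), (1, 1)]
# Hypothetical toy data: the second component always copies the first.
dependent = [(0, 0), (0, 0), (1, 1), (1, 1)]

print(independence_gap(independent, 0, 0))  # 0.0
print(independence_gap(dependent, 0, 0))    # 0.25
```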
Well, in Data Science we can estimate bounds for a particular value with confidence intervals computed from the given data. For example, in R this can be done with the package “boot”, and in Python with “scikits.bootstrap”. Thus confidence intervals for the values of \(\textbf{Pr} (A \text{ and } B)\), \(\textbf{Pr} (A)\) and \(\textbf{Pr} (B)\) can be estimated at a desired confidence level. What is left to work out is a confidence interval for the product \(\textbf{Pr} (A) \textbf{Pr} (B)\).
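To make the idea concrete without relying on the API of either package, here is a rough sketch of a percentile bootstrap written with the Python standard library only. The choice of the percentile method, the helper names `bootstrap_ci` and `frequency`, the fixed seed and the toy data are all my assumptions for illustration.

```python
import random


def bootstrap_ci(values, stat, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for stat(values)."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    n = len(values)
    # Resample with replacement and recompute the statistic each time.
    estimates = sorted(
        stat(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    alpha = (1.0 - level) / 2.0
    low = estimates[int(alpha * n_resamples)]
    high = estimates[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return low, high


def frequency(indicators):
    """Relative frequency of an event given 0/1 indicators."""
    return sum(indicators) / len(indicators)


# Hypothetical data: the event occurred in 30 of 100 observations.
flags = [1] * 30 + [0] * 70
low, high = bootstrap_ci(flags, frequency)
print(low, high)  # an interval around 0.30
```

The interval length `high - low` is the quantity that enters the dependence criterion later in the post.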
To estimate bounds for the product we can use a standard approach from Numerical Analysis, where it serves to compute the accumulated error of a calculation caused by truncation errors. Assuming that we have values \(x\) and \(y\) computed with truncation errors \(\Delta x\) and \(\Delta y\) respectively (which are taken to be positive by convention), we see that their product will be affected in the following way: \begin{equation} (x\pm\Delta x)(y\pm\Delta y)=xy \pm x\Delta y \pm y \Delta x \pm \Delta x\Delta y. \end{equation} Therefore our deviation from the real value of the product will be \begin{equation} xy-(x\pm\Delta x)(y\pm\Delta y)=\pm x\Delta y \pm y \Delta x \pm \Delta x\Delta y. \end{equation} An upper bound for the absolute value of this deviation is \(|x|\Delta y + |y| \Delta x + \Delta x\Delta y\). In Numerical Analysis the last term \(\Delta x\Delta y\) is usually dropped as insignificant, because it must be less than a usual truncation error. But in our case this term may not be negligible.
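The bound above is easy to sketch in code. The function name `product_error_bound`, the `keep_cross_term` switch and the sample numbers are hypothetical and chosen only for illustration.

```python
def product_error_bound(x, y, dx, dy, keep_cross_term=True):
    """Upper bound |x|*dy + |y|*dx (+ dx*dy) on the error of the product x*y,
    where dx and dy are non-negative error bounds for x and y."""
    bound = abs(x) * dy + abs(y) * dx
    if keep_cross_term:
        # The term usually dropped in Numerical Analysis; kept here because
        # confidence intervals are not necessarily tiny.
        bound += dx * dy
    return bound


# Hypothetical values: x = 0.5 known within 0.1, y = 0.4 known within 0.05.
bound = product_error_bound(0.5, 0.4, 0.1, 0.05)

# Compare with the worst deviation over all sign combinations.
worst = max(
    abs(0.5 * 0.4 - (0.5 + sx * 0.1) * (0.4 + sy * 0.05))
    for sx in (-1, 1)
    for sy in (-1, 1)
)
print(bound, worst)
```

For these numbers the worst case is reached when both errors have the same sign, and it matches the bound up to rounding.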
By the way, if you do not like my reference to Numerical Analysis and prefer to do everything with standard deviations, as is done in Statistics, then the above expression does approximate the standard deviation of a product of independent random variables when \(\Delta x\) and \(\Delta y\) are the respective standard deviations, provided that the deviations are small. The formula for that standard deviation is complicated and calls for its own analysis to estimate its possible computational error and accuracy. So for me it is not clear which is better, especially since the necessary requirement that the standard deviation exists does not appear realistic. In addition, the classical procedure for computing confidence intervals is based on the assumption that the distribution of the investigated variable is known, which can be a stretch as well.
Let us assume, for simplicity of the argument, that all our confidence intervals are symmetric around their estimated values. To work with data we need to switch from probabilities to relative frequencies. I would like to mention here that frequencies can benefit from some boosting as well. Our usual case in Data Science is the event that a particular variable takes some value. Assume that we would like to know whether the event that an outcome \(O\) equals \(a\) depends on the event that a predictor \(P\) takes the value \(x\). I denote the relative frequency of the outcome being equal to \(a\) and the predictor being equal to \(x\) as \(\textbf{Fr} (O=a \text{ and }P=x)\). Say that we have established the length of its confidence interval at the 95% level. I will denote this length as \(\Delta (a\text{ and } x)\), for the sake of brevity. So our confidence interval is \[ \big[\textbf{Fr} (O=a \text{ and }P=x)-\frac{1}{2}\Delta (a\text{ and } x),\ \ \textbf{Fr} (O=a \text{ and }P=x)+\frac{1}{2}\Delta (a\text{ and } x) \big]. \] In the same way we denote the separate frequencies for the outcome being equal to \(a\) as \(\textbf{Fr} (O=a)\) and for the predictor being equal to \(x\) as \(\textbf{Fr} (P=x)\), and their respective confidence interval lengths as \(\Delta (a)\) and \(\Delta (x)\).
Therefore, if the difference \begin{equation} \big|\textbf{Fr} (O=a \text{ and }P=x)-\textbf{Fr} (O=a)\textbf{Fr} (P=x)\big| \end{equation} is greater than \(\frac{1}{2}\textbf{Fr} (O=a)\Delta (x)+ \frac{1}{2}\textbf{Fr} (P=x)\Delta (a)+ \frac{1}{4} \Delta (x)\Delta (a)\), then we can say that our data indicates, with confidence of at least 95%, that these events might be dependent.
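Written as code, the criterion might look like the sketch below. The function name `might_be_dependent` and the example numbers are made up for illustration; the arguments `delta_a` and `delta_x` stand for the full interval lengths \(\Delta (a)\) and \(\Delta (x)\), which is why the factors \(\frac{1}{2}\) and \(\frac{1}{4}\) appear.

```python
def might_be_dependent(fr_joint, fr_a, fr_x, delta_a, delta_x):
    """Return True when |Fr(O=a and P=x) - Fr(O=a) * Fr(P=x)| exceeds
    the product-error threshold built from the interval lengths."""
    gap = abs(fr_joint - fr_a * fr_x)
    threshold = (
        0.5 * fr_a * delta_x
        + 0.5 * fr_x * delta_a
        + 0.25 * delta_x * delta_a
    )
    return gap > threshold


# Hypothetical frequencies: Fr(O=a) = 0.4, Fr(P=x) = 0.5, both intervals of
# length 0.1, so the product is 0.2 and the threshold is 0.0475.
print(might_be_dependent(0.30, 0.4, 0.5, 0.1, 0.1))  # True: gap is about 0.10
print(might_be_dependent(0.21, 0.4, 0.5, 0.1, 0.1))  # False: gap is about 0.01
```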
To test whether two variables are dependent, we need to check the above condition for each pair of their values.
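A rough end-to-end sketch of that check follows: it loops over every pair of observed values, estimates the interval lengths with a simple percentile bootstrap, and lists the pairs for which the condition holds. Everything here (the helper names, the percentile bootstrap, the seed, the toy data) is an assumption for illustration, and how to combine the per-pair results into one verdict is left open.

```python
import random
from itertools import product


def ci_length(indicators, n_resamples=1000, level=0.95, seed=0):
    """Length of a percentile bootstrap interval for a relative frequency."""
    rng = random.Random(seed)
    n = len(indicators)
    freqs = sorted(
        sum(rng.choices(indicators, k=n)) / n for _ in range(n_resamples)
    )
    alpha = (1.0 - level) / 2.0
    low = freqs[int(alpha * n_resamples)]
    high = freqs[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return high - low


def flagged_pairs(rows):
    """Return the (a, x) value pairs for which the criterion is met.
    Each row is an (outcome, predictor) observation."""
    n = len(rows)
    outcomes = sorted({o for o, _ in rows})
    predictors = sorted({p for _, p in rows})
    flagged = []
    for a, x in product(outcomes, predictors):
        ind_a = [int(o == a) for o, _ in rows]
        ind_x = [int(p == x) for _, p in rows]
        fr_a = sum(ind_a) / n
        fr_x = sum(ind_x) / n
        fr_joint = sum(1 for o, p in rows if o == a and p == x) / n
        d_a, d_x = ci_length(ind_a), ci_length(ind_x)
        threshold = 0.5 * fr_a * d_x + 0.5 * fr_x * d_a + 0.25 * d_x * d_a
        if abs(fr_joint - fr_a * fr_x) > threshold:
            flagged.append((a, x))
    return flagged


# Hypothetical data sets with 200 rows each.
dependent_rows = [(0, 0)] * 100 + [(1, 1)] * 100          # outcome copies predictor
independent_rows = [(0, 0), (0, 1), (1, 0), (1, 1)] * 50  # all combinations equally often

print(flagged_pairs(dependent_rows))
print(flagged_pairs(independent_rows))
```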
Clearly the approach can be extended to checking dependence among other features, to asymmetric confidence intervals and to more variables. In addition, we can choose different levels for the confidence intervals and investigate how this dependence criterion changes.