Thursday, February 25, 2016

Computing Ratio of Areas

There were questions about computing the formula of area ratio in my previous post which I featured on “Data Science Central” and linkedin.com. My reply turned out to be long, so I posted it here.

I will create simulated data X and Y. Data X will be a vector of normally distributed random numbers with mean 0 and deviation 1, and data Y will be a combination of two normal distributions, one with mean -2 and deviation 0.7, and another with mean 1.5 and deviation 0.4. Let us consider a pair of density curves for X and Y. Here is my R code for this.

X=rnorm(10000)
Y=c(rnorm(5000, mean=-2, sd=0.7), rnorm(5000, mean=1.5, sd=0.4))
dX=density(X, from=min(X,Y), to=max(X,Y), n=1000)
dY=density(Y, from=min(X,Y), to=max(X,Y), n=1000)
par(mar = c(2,2,2,2))
plot(dX, col="orange", 
     main="X density curve is orange and Y density curve is blue",         cex.main=1, xlab="", ylim=c(0,0.43))
lines(dY, col=4)

Each curve is a smoothed out histogram.

par(mfrow = c(1, 2), mar = c(2,2,2,2))
hist(X, freq=F, col=7, ylim=c(0,0.43), cex.main=1)
lines(dX, col="orange")
hist(Y, freq=F, col=rgb(0,0.77,0.8,1/4), cex.main=1, 
     ylim=c(0,0.43))
lines(dY, col="blue")

Now areas under density curves over a given interval can be estimated by summation of corresponding histogram rectangles on this interval. For example we can consider area under Y density curve on interval from -2 to 1, and find the corresponding (blue) rectangles.

par(mar = c(2,2,2,2))
hist(Y, freq=F, col=c(rep(0,6),rep(rgb(0,0.77,0.8,1/4), 6)),
     ylim=c(0,0.43), breaks=20)
lines(dY, col="blue")

If our histogram bins are sufficiently narrow and the number of data is still not small, then we will have a very close estimate.

We can use the same method for calculating areas between curves. We need to make sure that the histogram bins are identical. Let us combine the density curves and histograms on one picture:

par(mar = c(2,2,2,2))
hist(X, freq=F, col=7, breaks=50, ylim=c(0,.6), 
     main="Histograms and density curves",
     xlab="")
lines(dX, col="orange")
hist(Y, freq=F, col=rgb(0,0.77,0.8,1/4), breaks=50, add=T)
lines(dY, col="blue")

So to find the areas between the curves we can use sum of yellow and blue rectangles.

Going back to the formula which I used: \[ \frac{\textrm{an area between two density curves}}{\textrm{an area under one of the curves}} > 0.75 \] Note that here is a ratio on the left, and we can cancel if here is a common multiplier on the top and on the bottom. An area of a rectangle is a product of its height and base width. All our rectangles have the same base width, hence it is enough to compute only sums of their heights in both cases.

And at the end the computations of areas under curves, using the computed densities:

d1=sum(abs(dX$y-dY$y))/sum(dX$y)
d2=sum(abs(dX$y-dY$y))/sum(dY$y)
d1
## [1] 1.153001
d2
## [1] 1.153068
abs(d1-d2)
## [1] 6.681302e-05

As we see our computational accuracy here appears below 0.005.

Conclusion

We need to check out if the method actually captures differences in data distributions which are crucial for random forests algorithm. What is essential for this is that an area under a complete (ideal) density curve must be 1, signifying the whole amount of data. Therefore the curves cannot have a similar shape and lie one above another. If they have a similar shape, they should almost agree. It is like a water mattress: since its volume is constant, then pressing down in one place makes other places go up. We cannot have anything like the picture below:

par(mar = c(2,2,2,2))
plot(x=dY$x, y=2*dY$y, col="orange", type="l",
     main="Not realistic density curves", cex.main=1, 
     xlab="", ylab="")
lines(dY, col="blue")