Mathematician in Data Science: February 2016

There were questions about computing the formula of area ratio in my previous post which I featured on “Data Science Central” and linkedin.com. My reply turned out to be long, so I posted it here.

I will create simulated data X and Y. Data X will be a vector of normally distributed random numbers with mean 0 and deviation 1, and data Y will be a combination of two normal distributions, one with mean -2 and deviation 0.7, and another with mean 1.5 and deviation 0.4. Let us consider a pair of density curves for X and Y. Here is my R code for this.

X=rnorm(10000)
Y=c(rnorm(5000, mean=-2, sd=0.7), rnorm(5000, mean=1.5, sd=0.4))
dX=density(X, from=min(X,Y), to=max(X,Y), n=1000)
dY=density(Y, from=min(X,Y), to=max(X,Y), n=1000)
par(mar = c(2,2,2,2))
plot(dX, col="orange", 
     main="X density curve is orange and Y density curve is blue",         cex.main=1, xlab="", ylim=c(0,0.43))
lines(dY, col=4)

Each curve is a smoothed out histogram.

par(mfrow = c(1, 2), mar = c(2,2,2,2))
hist(X, freq=F, col=7, ylim=c(0,0.43), cex.main=1)
lines(dX, col="orange")
hist(Y, freq=F, col=rgb(0,0.77,0.8,1/4), cex.main=1, 
     ylim=c(0,0.43))
lines(dY, col="blue")

Now areas under density curves over a given interval can be estimated by summation of corresponding histogram rectangles on this interval. For example we can consider area under Y density curve on interval from -2 to 1, and find the corresponding (blue) rectangles.

par(mar = c(2,2,2,2))
hist(Y, freq=F, col=c(rep(0,6),rep(rgb(0,0.77,0.8,1/4), 6)),
     ylim=c(0,0.43), breaks=20)
lines(dY, col="blue")

If our histogram bins are sufficiently narrow and the number of data is still not small, then we will have a very close estimate.

We can use the same method for calculating areas between curves. We need to make sure that the histogram bins are identical. Let us combine the density curves and histograms on one picture:

par(mar = c(2,2,2,2))
hist(X, freq=F, col=7, breaks=50, ylim=c(0,.6), 
     main="Histograms and density curves",
     xlab="")
lines(dX, col="orange")
hist(Y, freq=F, col=rgb(0,0.77,0.8,1/4), breaks=50, add=T)
lines(dY, col="blue")

So to find the areas between the curves we can use sum of yellow and blue rectangles.

Going back to the formula which I used: \[ \frac{\textrm{an area between two density curves}}{\textrm{an area under one of the curves}} > 0.75 \] Note that here is a ratio on the left, and we can cancel if here is a common multiplier on the top and on the bottom. An area of a rectangle is a product of its height and base width. All our rectangles have the same base width, hence it is enough to compute only sums of their heights in both cases.

And at the end the computations of areas under curves, using the computed densities:

d1=sum(abs(dX$y-dY$y))/sum(dX$y)
d2=sum(abs(dX$y-dY$y))/sum(dY$y)
d1

## [1] 1.153001

d2

## [1] 1.153068

abs(d1-d2)

## [1] 6.681302e-05

As we see our computational accuracy here appears below 0.005.

Conclusion

We need to check out if the method actually captures differences in data distributions which are crucial for random forests algorithm. What is essential for this is that an area under a complete (ideal) density curve must be 1, signifying the whole amount of data. Therefore the curves cannot have a similar shape and lie one above another. If they have a similar shape, they should almost agree. It is like a water mattress: since its volume is constant, then pressing down in one place makes other places go up. We cannot have anything like the picture below:

par(mar = c(2,2,2,2))
plot(x=dY$x, y=2*dY$y, col="orange", type="l",
     main="Not realistic density curves", cex.main=1, 
     xlab="", ylab="")
lines(dY, col="blue")

Thus for true density curves there must be places where they alternate being up and down unless they are coincide.

Computational Errors in Area Calculation

At this point a fairly natural question appears: if an area under a density curve is always 1, then why divide by it? Cannot we just compute sum of rectangles and be done?
The problem is that an area of each rectangle is small. If your computations are done with a fixed number of decimal places, then a rounding error can accumulate to a fairly large computational error by summing a very big number of values. For example, if number of bins is \(1000\) and each rectangular is computed with precision \(10^{-5}\), then our final accuracy will be \(10^{-2}\). It can affect our choice of predictors.

Computation of area under a curve is called “numerical integration” in math and have been well researched for the last couple of centuries. There are a few other ways to increase accuracy: using not rectangles but trapezoids (trapezoidal rule), or approximating a curve with spline interpolation (Simpson’s rule and others). There are good approximating pictures and formulas:

Pictures of areas under curves, approximations and formulas

Many of them became obsolete because they were created at times when people calculated manually and could not work with a huge number of subintervals, which we call bins. The interesting thing is that because we have a restriction on size of bins, then the old tricks of getting more precise numbers out of given amount of interval partition could be useful again.

Theoretically with a lot of features and desired high precision Monte Carlo method is the best for quick calculations. In addition, it will look as a familiar bootstrapping and can appear more natural in data science world. But the accuracy of the method is based on possibility of choosing any number, in particular between endpoints, and not to be limited to a given set of numbers. Therefore the computational accuracy cannot be better than with rectangles. Although maybe a combination of spline interpolation and Monte Carlo method could be effective.

Mathematician in Data Science

Thursday, February 25, 2016

Computing Ratio of Areas

Conclusion

Computational Errors in Area Calculation