Thursday, February 25, 2016

Computing Ratio of Areas

There were questions about computing the formula of area ratio in my previous post which I featured on “Data Science Central” and linkedin.com. My reply turned out to be long, so I posted it here.

I will create simulated data X and Y. Data X will be a vector of normally distributed random numbers with mean 0 and deviation 1, and data Y will be a combination of two normal distributions, one with mean -2 and deviation 0.7, and another with mean 1.5 and deviation 0.4. Let us consider a pair of density curves for X and Y. Here is my R code for this.

X=rnorm(10000)
Y=c(rnorm(5000, mean=-2, sd=0.7), rnorm(5000, mean=1.5, sd=0.4))
dX=density(X, from=min(X,Y), to=max(X,Y), n=1000)
dY=density(Y, from=min(X,Y), to=max(X,Y), n=1000)
par(mar = c(2,2,2,2))
plot(dX, col="orange", 
     main="X density curve is orange and Y density curve is blue",         cex.main=1, xlab="", ylim=c(0,0.43))
lines(dY, col=4)

Each curve is a smoothed out histogram.

par(mfrow = c(1, 2), mar = c(2,2,2,2))
hist(X, freq=F, col=7, ylim=c(0,0.43), cex.main=1)
lines(dX, col="orange")
hist(Y, freq=F, col=rgb(0,0.77,0.8,1/4), cex.main=1, 
     ylim=c(0,0.43))
lines(dY, col="blue")

Now areas under density curves over a given interval can be estimated by summation of corresponding histogram rectangles on this interval. For example we can consider area under Y density curve on interval from -2 to 1, and find the corresponding (blue) rectangles.

par(mar = c(2,2,2,2))
hist(Y, freq=F, col=c(rep(0,6),rep(rgb(0,0.77,0.8,1/4), 6)),
     ylim=c(0,0.43), breaks=20)
lines(dY, col="blue")

If our histogram bins are sufficiently narrow and the number of data is still not small, then we will have a very close estimate.

We can use the same method for calculating areas between curves. We need to make sure that the histogram bins are identical. Let us combine the density curves and histograms on one picture:

par(mar = c(2,2,2,2))
hist(X, freq=F, col=7, breaks=50, ylim=c(0,.6), 
     main="Histograms and density curves",
     xlab="")
lines(dX, col="orange")
hist(Y, freq=F, col=rgb(0,0.77,0.8,1/4), breaks=50, add=T)
lines(dY, col="blue")