
Wednesday, March 9, 2016

Biased Estimators vs Unbiased Ones

Let us consider the numerical difference between a biased estimator and an unbiased one. I will take the simple case of the standard deviation of a sample. Say we have some measurements $\{x_1, x_2, \dots, x_N\}$ of a particular outcome. We can compute their average, $\bar{x}$, and we want to estimate their deviation as well. The formula for the unbiased one is the following:
$$s=\sqrt{\frac{\sum_{i=1}^{N}(\bar{x}-x_i)^2}{N-1}}$$

At the same time, if we knew the true mean $\mu$ of the outcome, we could compute the deviation in the following way:
$$s=\sqrt{\frac{\sum_{i=1}^{N}(\mu-x_i)^2}{N}}$$
So it looks natural to write the deviation using $\bar{x}$ instead of $\mu$ as
$$\sqrt{\frac{\sum_{i=1}^{N}(\bar{x}-x_i)^2}{N}}$$
But we know from statistics that the last formula is biased.
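
To make the two formulas concrete, here is a minimal sketch that evaluates both on a small sample. The sample values are hypothetical and were picked only so that the arithmetic is easy to follow by hand; the check against Python's `statistics` module is there to confirm that `stdev` divides by $N-1$ and `pstdev` divides by $N$.

```python
import math
import statistics

# Hypothetical sample, chosen so the arithmetic is easy to follow by hand.
xs = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(xs)
x_bar = sum(xs) / n

# Sum of squared deviations from the sample mean.
ss = sum((x_bar - x) ** 2 for x in xs)

s_unbiased = math.sqrt(ss / (n - 1))  # divides by N - 1
s_biased = math.sqrt(ss / n)          # divides by N

# The standard library implements the same two formulas.
assert math.isclose(s_unbiased, statistics.stdev(xs))
assert math.isclose(s_biased, statistics.pstdev(xs))

print(s_unbiased, s_biased)
```

The value computed with $N$ in the denominator is always the smaller of the two, since the same sum of squares is divided by a larger number.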

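Let us consider how different these values are numerically. I will denote the sum of squares on the top by $S=\sum_{i=1}^{N}(\bar{x}-x_i)^2$, to simplify the look of my calculations. Thus
$$s=\sqrt{\frac{\sum_{i=1}^{N}(\bar{x}-x_i)^2}{N-1}}=\sqrt{\frac{S}{N-1}},$$

and I want to compute
$$\sqrt{\frac{S}{N-1}}-\sqrt{\frac{S}{N}}=$$
We can take out $\sqrt{S}$ as a common factor:
$$\sqrt{S}\left(\frac{1}{\sqrt{N-1}}-\frac{1}{\sqrt{N}}\right)=$$
and then bring the fractions to a common denominator and combine them.
$$\sqrt{S}\left(\frac{\sqrt{N}}{\sqrt{N(N-1)}}-\frac{\sqrt{N-1}}{\sqrt{N(N-1)}}\right)=\sqrt{S}\left(\frac{\sqrt{N}-\sqrt{N-1}}{\sqrt{N(N-1)}}\right)=$$
For a difference of square roots there is a special trick in math, based on the formula
$$(a-b)(a+b)=a^2-b^2$$
As you see, if $a$ and $b$ are square roots, then by multiplying by their sum we can get rid of them. Usually we cannot multiply expressions at will, but with a fraction we are allowed to multiply top and bottom by the same expression.
$$\sqrt{S}\left(\frac{(\sqrt{N}-\sqrt{N-1})(\sqrt{N}+\sqrt{N-1})}{\sqrt{N(N-1)}\,(\sqrt{N}+\sqrt{N-1})}\right)=$$
Using the above-mentioned trick simplifies our top (but not the bottom):
$$\sqrt{S}\left(\frac{N-(N-1)}{\sqrt{N(N-1)}\,(\sqrt{N}+\sqrt{N-1})}\right)=$$
$$\sqrt{S}\left(\frac{1}{\sqrt{N(N-1)}\,(\sqrt{N}+\sqrt{N-1})}\right)=$$
$$\frac{\sqrt{S}}{\sqrt{N(N-1)}\,(\sqrt{N}+\sqrt{N-1})}$$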
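
The algebra can be sanity-checked numerically. The sketch below compares the direct difference of the two square roots with the closed form obtained at the end of the derivation; the function names and the test value of $S$ are my own assumptions, used only for illustration.

```python
import math

def difference_direct(ss, n):
    """sqrt(S / (N - 1)) - sqrt(S / N), computed as written."""
    return math.sqrt(ss / (n - 1)) - math.sqrt(ss / n)

def difference_closed_form(ss, n):
    """sqrt(S) / (sqrt(N (N - 1)) * (sqrt(N) + sqrt(N - 1)))."""
    return math.sqrt(ss) / (
        math.sqrt(n * (n - 1)) * (math.sqrt(n) + math.sqrt(n - 1))
    )

# Arbitrary positive value standing in for the sum of squares S.
ss = 32.0
for n in (2, 10, 101, 10_001):
    print(n, difference_direct(ss, n), difference_closed_form(ss, n))
```

The two columns should agree up to floating-point rounding. For large $N$ the closed form is the safer one to evaluate, because the direct version subtracts two nearly equal numbers.
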
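This is not very convenient. Let us consider it in relation to the sample deviation.
$$\frac{\text{the difference between unbiased and biased values}}{\text{sample deviation}}=$$
$$\frac{\dfrac{\sqrt{S}}{\sqrt{N(N-1)}\,(\sqrt{N}+\sqrt{N-1})}}{\sqrt{\dfrac{S}{N-1}}}=$$
Here we can cancel $\sqrt{S}$, flip fractions and then cancel $\sqrt{N-1}$. I hope you’re still with me.
$$\frac{\dfrac{1}{\sqrt{N(N-1)}\,(\sqrt{N}+\sqrt{N-1})}}{\dfrac{1}{\sqrt{N-1}}}=\frac{\sqrt{N-1}}{\sqrt{N(N-1)}\,(\sqrt{N}+\sqrt{N-1})}=$$
$$\frac{1}{\sqrt{N}\,(\sqrt{N}+\sqrt{N-1})}$$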
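
Notice that $S$ has dropped out: the relative difference depends only on $N$, not on the data. A minimal sketch to illustrate this, using two hypothetical samples of the same size (the sample values and function names are assumptions made for the example):

```python
import math
import statistics

def relative_difference(n):
    """Closed form 1 / (sqrt(N) * (sqrt(N) + sqrt(N - 1))); depends on N only."""
    return 1.0 / (math.sqrt(n) * (math.sqrt(n) + math.sqrt(n - 1)))

def relative_difference_from_data(xs):
    """(unbiased - biased) / unbiased, computed directly from a sample."""
    s_unbiased = statistics.stdev(xs)   # divides by N - 1
    s_biased = statistics.pstdev(xs)    # divides by N
    return (s_unbiased - s_biased) / s_unbiased

# Two unrelated hypothetical samples with the same N = 8.
sample_a = [2, 4, 4, 4, 5, 5, 7, 9]
sample_b = [10.0, -3.5, 0.25, 8.0, 1.0, 1.0, 42.0, 7.5]

print(relative_difference_from_data(sample_a))
print(relative_difference_from_data(sample_b))
print(relative_difference(len(sample_a)))
```

All three printed values should coincide up to rounding, even though the two samples have very different spreads.
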
The last expression is better, but it can still use some doctoring. I can replace every $N$ by $N-1$, thus reducing the fraction's denominator and getting an upper bound for the whole expression:
$$\frac{1}{\sqrt{N}\,(\sqrt{N}+\sqrt{N-1})}<\frac{1}{\sqrt{N-1}\,(\sqrt{N-1}+\sqrt{N-1})}=\frac{1}{2(N-1)}$$
Finally we have a handy formula. If $N=101$, then the relative error of the sample deviation computed with the biased formula is at most $\frac{1}{2(100)}\cdot 100\%=0.5\%$. When $N=10{,}001$, it is at most $0.005\%$. When $N=1{,}000{,}001$, which is closer to what happens in Big Data, it is at most $0.5\cdot 10^{-4}\%$.
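
As a rough check of these numbers, the sketch below prints the exact relative difference next to the bound $\frac{1}{2(N-1)}$ for the three sample sizes mentioned. The function names are assumptions made for the example.

```python
import math

def exact_relative_difference(n):
    """1 / (sqrt(N) * (sqrt(N) + sqrt(N - 1)))."""
    return 1.0 / (math.sqrt(n) * (math.sqrt(n) + math.sqrt(n - 1)))

def upper_bound(n):
    """The simpler bound 1 / (2 (N - 1))."""
    return 1.0 / (2 * (n - 1))

for n in (101, 10_001, 1_000_001):
    exact = exact_relative_difference(n)
    bound = upper_bound(n)
    # Both values are printed as percentages.
    print(f"N={n:>9,}  exact={100 * exact:.6f}%  bound={100 * bound:.6f}%")
```

The exact value stays slightly below the bound in every row, and the gap between them shrinks as $N$ grows.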

Conclusion.

In Big Data, the difference between the biased and unbiased estimators of the same value becomes very small and may even be negligible.
