Let us look at the numerical difference between a biased estimator and an unbiased one. I will take a simple case: the standard deviation of a sample. Say we have some measurements $\{x_1, x_2, \dots, x_N\}$ of a particular outcome. We can compute their average, $\bar{x}$, and we want to estimate the deviation as well. The formula usually called unbiased (strictly speaking, it is $s^2$ that is an unbiased estimator of the variance) is the following:
$$s = \sqrt{\frac{\sum_{i=1}^{N} (\bar{x} - x_i)^2}{N-1}}$$
At the same time, if we knew the true mean $\mu$ of the outcome, we could compute the deviation in the following way:
$$s = \sqrt{\frac{\sum_{i=1}^{N} (\mu - x_i)^2}{N}}$$
So it looks natural to write the deviation using $\bar{x}$ instead of $\mu$, as
$$\sqrt{\frac{\sum_{i=1}^{N} (\bar{x} - x_i)^2}{N}}$$
But we know from statistics that the last formula is biased.
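To make the two formulas concrete, here is a minimal sketch that evaluates both on a small sample. The measurements in `x` are made up for illustration and are not from any real data set. The check against `statistics.stdev` and `statistics.pstdev` relies on the Python standard library dividing by $N-1$ and by $N$ respectively.

```python
import math
import statistics

# Hypothetical measurements, made up for this illustration.
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
N = len(x)
x_bar = sum(x) / N

# Sum of squared deviations from the sample mean.
S = sum((x_bar - xi) ** 2 for xi in x)

s_unbiased = math.sqrt(S / (N - 1))  # divides by N - 1
s_biased = math.sqrt(S / N)          # divides by N

# The standard library implements the same two formulas.
assert math.isclose(s_unbiased, statistics.stdev(x))
assert math.isclose(s_biased, statistics.pstdev(x))

print(s_unbiased, s_biased)
```

The value computed with $N$ in the denominator is always the smaller of the two, since the same sum is divided by a larger number.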
Let us see how different these values are numerically. I will denote the sum of squares in the numerator by $S = \sum_{i=1}^{N} (\bar{x} - x_i)^2$ to simplify the look of my calculations. Thus
$$s = \sqrt{\frac{\sum_{i=1}^{N} (\bar{x} - x_i)^2}{N-1}} = \sqrt{\frac{S}{N-1}},$$
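and I want to compute
$$\sqrt{\frac{S}{N-1}} - \sqrt{\frac{S}{N}} =$$
We can take out $\sqrt{S}$ as a common factor:
$$\sqrt{S}\left(\frac{1}{\sqrt{N-1}} - \frac{1}{\sqrt{N}}\right) =$$
and then bring the fractions to a common denominator and combine them:
$$\sqrt{S}\left(\frac{\sqrt{N}}{\sqrt{N(N-1)}} - \frac{\sqrt{N-1}}{\sqrt{N(N-1)}}\right) = \sqrt{S}\left(\frac{\sqrt{N} - \sqrt{N-1}}{\sqrt{N(N-1)}}\right) =$$
For a difference of square roots there is a special trick in math, based on the formula $(a-b)(a+b) = a^2 - b^2$.
As you see, if $a$ and $b$ are square roots, then by multiplying their difference by their sum we can get rid of the roots. Usually we cannot multiply expressions at will, but with a fraction we are allowed to multiply top and bottom by the same expression:
$$\sqrt{S}\left(\frac{(\sqrt{N} - \sqrt{N-1})(\sqrt{N} + \sqrt{N-1})}{\sqrt{N(N-1)}\,(\sqrt{N} + \sqrt{N-1})}\right) =$$
Using the above-mentioned trick simplifies our top (but not the bottom):
$$\sqrt{S}\left(\frac{N - (N-1)}{\sqrt{N(N-1)}\,(\sqrt{N} + \sqrt{N-1})}\right) =$$
$$\sqrt{S}\left(\frac{1}{\sqrt{N(N-1)}\,(\sqrt{N} + \sqrt{N-1})}\right) =$$
$$\frac{\sqrt{S}}{\sqrt{N(N-1)}\,(\sqrt{N} + \sqrt{N-1})}$$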
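The algebra above can be sanity-checked numerically. The sketch below compares the direct difference of the two square roots with the closed form just derived. The function name `deviation_gap` and the values of `S` and `N` are arbitrary choices for illustration.

```python
import math

def deviation_gap(S, N):
    """Closed form: sqrt(S) / (sqrt(N(N-1)) * (sqrt(N) + sqrt(N-1)))."""
    return math.sqrt(S) / (
        math.sqrt(N * (N - 1)) * (math.sqrt(N) + math.sqrt(N - 1))
    )

# Arbitrary illustrative values for the sum of squares and the sample size.
S, N = 32.0, 8

# Direct computation: unbiased deviation minus biased deviation.
direct = math.sqrt(S / (N - 1)) - math.sqrt(S / N)

assert math.isclose(direct, deviation_gap(S, N))
print(direct, deviation_gap(S, N))
```

A side benefit of the closed form is that it avoids subtracting two nearly equal numbers, which can matter in floating point when $N$ is very large.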
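This is not very convenient. Let us consider it in relation to the sample deviation:
$$\frac{\text{difference between unbiased and biased values}}{\text{sample deviation}} =$$
$$\frac{\dfrac{\sqrt{S}}{\sqrt{N(N-1)}\,(\sqrt{N} + \sqrt{N-1})}}{\sqrt{\dfrac{S}{N-1}}} =$$
Here we can cancel $\sqrt{S}$, flip the fraction in the denominator and then cancel $\sqrt{N-1}$. I hope you’re still with me.
$$\frac{\dfrac{1}{\sqrt{N(N-1)}\,(\sqrt{N} + \sqrt{N-1})}}{\dfrac{1}{\sqrt{N-1}}} = \frac{\sqrt{N-1}}{\sqrt{N(N-1)}\,(\sqrt{N} + \sqrt{N-1})} =$$
$$\frac{1}{\sqrt{N}\,(\sqrt{N} + \sqrt{N-1})}$$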
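Notice that $S$ has dropped out, so the relative difference depends only on $N$. As a quick check, the ratio can also be written directly as $1 - \sqrt{(N-1)/N}$, and the sketch below confirms that both expressions agree for a few sample sizes. The function name `relative_gap` and the chosen values of `N` are only for illustration.

```python
import math

def relative_gap(N):
    """Closed form of (unbiased - biased) / unbiased for sample size N."""
    return 1.0 / (math.sqrt(N) * (math.sqrt(N) + math.sqrt(N - 1)))

for N in (2, 10, 101, 10_001):
    # Dividing sqrt(S/(N-1)) - sqrt(S/N) by sqrt(S/(N-1)) gives this directly.
    direct = 1.0 - math.sqrt((N - 1) / N)
    assert math.isclose(direct, relative_gap(N), rel_tol=1e-9)
    print(N, relative_gap(N))
```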
The last expression is better, but can still use some doctoring. I can replace every $\sqrt{N}$ by $\sqrt{N-1}$, thus reducing the denominator of the fraction and getting an upper bound for the whole expression:
$$\frac{1}{\sqrt{N}\,(\sqrt{N} + \sqrt{N-1})} < \frac{1}{\sqrt{N-1}\,(\sqrt{N-1} + \sqrt{N-1})} = \frac{1}{2(N-1)}$$
Finally we have a handy formula. If $N = 101$, then the relative error of the sample deviation computed with the biased formula is at most $\frac{1}{2 \cdot 100} \cdot 100\% = 0.5\%$. When $N = 10{,}001$, it is at most $0.005\%$. When $N = 1{,}000{,}001$, which is closer to what happens in Big Data, it is at most $0.5 \cdot 10^{-4}\%$, that is, $0.00005\%$.
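The bound is easy to tabulate. The following sketch prints, for the three sample sizes mentioned, the exact relative difference next to the bound $\frac{1}{2(N-1)}$, both in percent. The helper names `relative_gap` and `upper_bound` are my own and only serve this illustration.

```python
import math

def relative_gap(N):
    """Exact relative difference: 1 / (sqrt(N) * (sqrt(N) + sqrt(N-1)))."""
    return 1.0 / (math.sqrt(N) * (math.sqrt(N) + math.sqrt(N - 1)))

def upper_bound(N):
    """Upper bound from the text: 1 / (2 * (N - 1))."""
    return 1.0 / (2 * (N - 1))

for N in (101, 10_001, 1_000_001):
    exact = relative_gap(N)
    bound = upper_bound(N)
    assert exact < bound  # the bound always sits above the exact value
    print(f"N = {N}: exact {exact * 100:.6f}%, bound {bound * 100:.6f}%")
```

For large $N$ the exact value and the bound are nearly identical, so the bound loses very little.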
Conclusion.
In Big Data the difference between the biased and unbiased estimators of the standard deviation becomes very small, less than $\frac{1}{2(N-1)}$ in relative terms, and can often be neglected.