Wednesday, March 9, 2016

Biased Estimators vs Unbiased Ones

Let us look at the numerical difference between a biased estimator and an unbiased one. I will take the simple case of the standard deviation of a sample. Say we have some measurements \(\{ x_1, x_2, \ldots, x_N\}\) of a particular outcome. We can compute their average, \({\bar x}\), and we want to estimate their deviation as well. The formula for the unbiased one is the following (strictly speaking, it is its square, the sample variance, that is unbiased): \[ s=\sqrt{\frac{\sum_{i=1}^{N}({\bar x}-x_i)^2}{N-1}} \] At the same time, if we knew the true mean \(\mu\) of the outcome, we could compute the deviation the following way: \[ \hat\sigma=\sqrt{\frac{\sum_{i=1}^{N}(\mu-x_i)^2}{N}} \] So it looks natural to write the deviation using \({\bar x}\) instead of \(\mu\) as \[ \sqrt{\frac{\sum_{i=1}^{N}({\bar x}-x_i)^2}{N}} \] But we know from statistics that the last formula is biased: on average it underestimates the deviation.
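
To make the two formulas concrete, here is a minimal Python sketch that evaluates both on a small data set. The measurements are hypothetical numbers chosen only for illustration, and the variable names (`sum_sq`, `s_unbiased`, `s_biased`) are my own. The cross-check uses `statistics.stdev` (divides by \(N-1\)) and `statistics.pstdev` (divides by \(N\)) from the standard library.

```python
import math
import statistics

# Hypothetical measurements, chosen only for illustration.
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(x)
x_bar = statistics.mean(x)

# Sum of squared deviations from the sample mean.
sum_sq = sum((x_bar - xi) ** 2 for xi in x)

s_unbiased = math.sqrt(sum_sq / (n - 1))  # divides by N - 1
s_biased = math.sqrt(sum_sq / n)          # divides by N

# Cross-check against the standard library implementations.
assert math.isclose(s_unbiased, statistics.stdev(x))
assert math.isclose(s_biased, statistics.pstdev(x))

print(f"unbiased: {s_unbiased:.4f}, biased: {s_biased:.4f}")
```

For any data set with at least two distinct values, the biased value comes out smaller, because the same sum is divided by a larger number.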

Let us consider how different these values are numerically. I will denote the sum of squares on the top by \(S=\sum_{i=1}^{N}({\bar x}-x_i)^2\), to simplify the look of my calculations. Thus \[ s=\sqrt{\frac{\sum_{i=1}^{N}({\bar x}-x_i)^2}{N-1}}= \sqrt{\frac{S}{N-1}}, \] and I want to compute \[ \sqrt{\frac{S}{N-1}}-\sqrt{\frac{S}{N}}= \] We can take out \(\sqrt{S}\) as a common factor: \[ \sqrt{S}\left(\frac{1}{\sqrt{N-1}}-\frac{1}{\sqrt{N}}\right)= \] and then bring the fractions to a common denominator and combine them. \[ \sqrt{S}\left(\frac{\sqrt{N}}{\sqrt{N(N-1)}}- \frac{\sqrt{N-1}}{\sqrt{N(N-1)}}\right)= \sqrt{S}\left(\frac{\sqrt{N}-\sqrt{N-1}}{\sqrt{N(N-1)}}\right)= \] For a difference of square roots there is a standard trick in math, based on the formula \[ (a-b)(a+b)=a^2-b^2 \] As you see, if \(a\) and \(b\) are square roots, then by multiplying their difference by their sum we can get rid of the roots. Usually we cannot multiply expressions at will, but with a fraction we are allowed to multiply top and bottom by the same expression. \[ \sqrt{S}\left( \frac{\left(\sqrt{N}-\sqrt{N-1}\right)\left(\sqrt{N}+\sqrt{N-1}\right)} {\sqrt{N(N-1)}\left(\sqrt{N}+\sqrt{N-1}\right)} \right)= \] Using the above-mentioned trick simplifies our top (but not the bottom): \[ \sqrt{S}\left( \frac{N-(N-1)} {\sqrt{N(N-1)}\left(\sqrt{N}+\sqrt{N-1}\right)} \right)= \] \[ \sqrt{S}\left( \frac{1} {\sqrt{N(N-1)}\left(\sqrt{N}+\sqrt{N-1}\right)} \right)= \] \[ \frac{\sqrt{S}} {\sqrt{N(N-1)}\left(\sqrt{N}+\sqrt{N-1}\right)} \] This is not very convenient, because it still depends on \(S\). Let us consider it relative to the sample deviation \(s\). \[ \frac{\mbox{The difference between unbiased and biased values}} {\mbox{sample deviation}}= \] \[ \frac{\frac{\sqrt{S}} {\sqrt{N(N-1)}\left(\sqrt{N}+\sqrt{N-1}\right)}}{\sqrt{\frac{S}{N-1}}}= \] Here we can cancel \(\sqrt{S}\), flip the fraction in the denominator and then cancel \(\sqrt{N-1}\). I hope you’re still with me.
\[ \frac{\frac{1} {\sqrt{N(N-1)}\left(\sqrt{N}+\sqrt{N-1}\right)}} {\frac{1}{\sqrt{N-1}}}= \frac{\sqrt{N-1}}{\sqrt{N(N-1)}\left(\sqrt{N}+\sqrt{N-1}\right)}= \] \[ \frac{1}{\sqrt{N}\left(\sqrt{N}+\sqrt{N-1}\right)} \] The last expression is better, but can still use some doctoring. I can replace all \(\sqrt{N}\) by \(\sqrt{N-1}\), thus reducing the denominator of the fraction and getting an upper bound for the whole expression: \[ \frac{1}{\sqrt{N}\left(\sqrt{N}+\sqrt{N-1}\right)}< \frac{1}{\sqrt{N-1}\left(\sqrt{N-1}+\sqrt{N-1}\right)}= \frac{1}{2(N-1)} \] Finally we got a handy formula. If \(N=101\), then the relative error of the sample deviation computed with the biased formula is less than \(\frac{1}{2(100)}\cdot 100\%=0.5\%\). When \(N=10,001\), it is less than \(0.005\%\). When \(N=1,000,001\), which is closer to what happens in Big Data, the relative error is less than \(0.5\cdot 10^{-4}\%\), that is, \(0.5\cdot 10^{-6}\) as a plain fraction.
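
The bound is easy to check numerically. The sketch below is only an illustration with function names of my own choosing (`relative_gap`, `upper_bound`). It evaluates the exact relative difference \(\frac{1}{\sqrt{N}(\sqrt{N}+\sqrt{N-1})}\), compares it with the direct form \(1-\sqrt{(N-1)/N}\), and confirms that it stays below \(\frac{1}{2(N-1)}\). All values are printed as plain fractions, not percentages.

```python
import math


def relative_gap(n):
    """Exact (s_unbiased - s_biased) / s_unbiased; it does not depend on S."""
    return 1.0 / (math.sqrt(n) * (math.sqrt(n) + math.sqrt(n - 1)))


def upper_bound(n):
    """The bound 1 / (2 (N - 1)) derived above."""
    return 1.0 / (2 * (n - 1))


for n in (101, 10_001, 1_000_001):
    gap = relative_gap(n)
    # The same quantity computed directly from the two square roots.
    direct = 1.0 - math.sqrt((n - 1) / n)
    assert math.isclose(gap, direct, rel_tol=1e-6)
    assert gap < upper_bound(n)
    print(f"N={n}: relative gap={gap:.3e}, bound={upper_bound(n):.3e}")
```

The rationalized form used in `relative_gap` is also the numerically safer one: for large \(N\) the direct form subtracts two nearly equal numbers, which loses precision.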

Conclusion.

In Big Data the difference between the biased and unbiased estimators of the same value becomes very small, less than \(\frac{1}{2(N-1)}\) of the sample deviation in the case considered here, and may even be negligible.