## Wednesday, March 9, 2016

### Biased Estimators vs Unbiased Ones

Let us look at the numerical difference between a biased estimator and an unbiased one. I will take a simple case: the standard deviation of a sample. Say we have some measurements $$\{ x_1, x_2, \ldots, x_N\}$$ of a particular outcome. We can compute their average, $${\bar x}$$, and we want to estimate their deviation as well. The formula usually called unbiased (its square is an unbiased estimator of the variance) is the following: $s=\sqrt{\frac{\sum_{i=1}^{N}({\bar x}-x_i)^2}{N-1}}$ At the same time, if we knew the true mean $$\mu$$ of the outcome, we could compute the deviation this way: $\sqrt{\frac{\sum_{i=1}^{N}(\mu-x_i)^2}{N}}$ So it looks natural to write the deviation using $${\bar x}$$ instead of $$\mu$$ as $\sqrt{\frac{\sum_{i=1}^{N}({\bar x}-x_i)^2}{N}}$ But we know from statistics that the last formula is biased.
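
To make the two formulas concrete, here is a minimal Python sketch that computes both from the same sum of squares. The function name `both_deviations` and the generated data (20 draws from a normal distribution) are my own assumptions for illustration; the cross-check uses `statistics.stdev` (divides by $$N-1$$) and `statistics.pstdev` (divides by $$N$$) from the standard library.

```python
import math
import random
import statistics


def both_deviations(x):
    """Return (deviation with N-1, deviation with N) around the sample mean."""
    n = len(x)
    x_bar = statistics.fmean(x)
    # Sum of squared distances from the sample mean (the numerator in both formulas)
    s_sq = sum((x_bar - xi) ** 2 for xi in x)
    return math.sqrt(s_sq / (n - 1)), math.sqrt(s_sq / n)


if __name__ == "__main__":
    random.seed(0)  # fixed seed so the run is repeatable
    # Hypothetical measurements: an assumption made only for this example
    data = [random.gauss(10.0, 2.0) for _ in range(20)]

    with_n_minus_1, with_n = both_deviations(data)

    # Cross-check against the standard library
    assert math.isclose(with_n_minus_1, statistics.stdev(data), rel_tol=1e-9)
    assert math.isclose(with_n, statistics.pstdev(data), rel_tol=1e-9)

    print("dividing by N-1:", with_n_minus_1)
    print("dividing by N:  ", with_n)
```

Because both values share the same numerator, the version that divides by $$N$$ is always the smaller of the two.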

Let us see how different these values are numerically. I will denote the sum of squares on the top by $$S=\sum_{i=1}^{N}({\bar x}-x_i)^2$$, to simplify the look of my calculations. Thus $s=\sqrt{\frac{\sum_{i=1}^{N}({\bar x}-x_i)^2}{N-1}}= \sqrt{\frac{S}{N-1}},$ and I want to compute $\sqrt{\frac{S}{N-1}}-\sqrt{\frac{S}{N}}=$ We can take out $$\sqrt{S}$$ as a common factor: $\sqrt{S}\left(\frac{1}{\sqrt{N-1}}-\frac{1}{\sqrt{N}}\right)=$ and then bring the fractions to a common denominator and combine them: $\sqrt{S}\left(\frac{\sqrt{N}}{\sqrt{N(N-1)}}- \frac{\sqrt{N-1}}{\sqrt{N(N-1)}}\right)= \sqrt{S}\left(\frac{\sqrt{N}-\sqrt{N-1}}{\sqrt{N(N-1)}}\right)=$

For a difference of square roots there is a special trick in math, based on the formula $(a-b)(a+b)=a^2-b^2$ As you see, if $$a$$ and $$b$$ are square roots, then by multiplying their difference by their sum we can get rid of the roots. Usually we cannot multiply expressions at will, but with a fraction we are allowed to multiply top and bottom by the same expression: $\sqrt{S}\left( \frac{\left(\sqrt{N}-\sqrt{N-1}\right)\left(\sqrt{N}+\sqrt{N-1}\right)} {\sqrt{N(N-1)}\left(\sqrt{N}+\sqrt{N-1}\right)} \right)=$ Using the above-mentioned trick simplifies our top (but not the bottom): $\sqrt{S}\left( \frac{N-(N-1)} {\sqrt{N(N-1)}\left(\sqrt{N}+\sqrt{N-1}\right)} \right)=$ $\sqrt{S}\left( \frac{1} {\sqrt{N(N-1)}\left(\sqrt{N}+\sqrt{N-1}\right)} \right)=$ $\frac{\sqrt{S}} {\sqrt{N(N-1)}\left(\sqrt{N}+\sqrt{N-1}\right)}$

This is not very convenient, because it still depends on $$S$$. Let us consider it in relation to the sample deviation $$s$$: $\frac{\mbox{The difference between unbiased and biased values}} {\mbox{sample deviation}}=$ $\frac{\frac{\sqrt{S}} {\sqrt{N(N-1)}\left(\sqrt{N}+\sqrt{N-1}\right)}}{\sqrt{\frac{S}{N-1}}}=$ Here we can cancel $$\sqrt{S}$$, flip the fractions and then cancel $$\sqrt{N-1}$$. I hope you’re still with me.
$\frac{\frac{1} {\sqrt{N(N-1)}\left(\sqrt{N}+\sqrt{N-1}\right)}} {\frac{1}{\sqrt{N-1}}}= \frac{\sqrt{N-1}}{\sqrt{N(N-1)}\left(\sqrt{N}+\sqrt{N-1}\right)}=$ $\frac{1}{\sqrt{N}\left(\sqrt{N}+\sqrt{N-1}\right)}$ The last expression is better, but can still use some doctoring. I can replace every $$\sqrt{N}$$ by $$\sqrt{N-1}$$, which makes the denominator smaller and so gives an upper bound for the whole expression: $\frac{1}{\sqrt{N}\left(\sqrt{N}+\sqrt{N-1}\right)}< \frac{1}{\sqrt{N-1}\left(\sqrt{N-1}+\sqrt{N-1}\right)}= \frac{1}{2(N-1)}$ Finally we have a handy formula. If $$N=101$$, then the relative error of the sample deviation computed with the biased formula is below $$\frac{1}{2(100)}\cdot 100\%=0.5\%$$. When $$N=10{,}001$$, the bound is $$0.005\%$$. When $$N=1{,}000{,}001$$, which is closer to what happens in Big Data, the bound is $$0.5\cdot 10^{-4}\%$$.
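
The derivation can be checked numerically with a short Python sketch. The function names here are hypothetical, chosen only for this example. Since $$\sqrt{S}$$ cancels, the relative difference depends on $$N$$ alone, so no data is needed: the code compares the direct ratio, the simplified expression, and the bound $$\frac{1}{2(N-1)}$$.

```python
import math


def relative_difference(n):
    """(sqrt(S/(n-1)) - sqrt(S/n)) / sqrt(S/(n-1)); S cancels out."""
    return 1.0 - math.sqrt((n - 1) / n)


def simplified_form(n):
    """The simplified expression 1 / (sqrt(n) * (sqrt(n) + sqrt(n-1)))."""
    return 1.0 / (math.sqrt(n) * (math.sqrt(n) + math.sqrt(n - 1)))


def upper_bound(n):
    """The handy bound 1 / (2 (n - 1))."""
    return 1.0 / (2 * (n - 1))


if __name__ == "__main__":
    for n in (101, 10_001, 1_000_001):
        # Multiply by 100 to report each quantity as a percentage
        print(
            n,
            f"direct: {100 * relative_difference(n):.3e}%",
            f"simplified: {100 * simplified_form(n):.3e}%",
            f"bound: {100 * upper_bound(n):.3e}%",
        )
```

The first two columns should agree up to floating-point rounding, and both should stay just below the bound.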

### Conclusion.

In Big Data, where $$N$$ is very large, the difference between the biased and unbiased estimators of the same value becomes very small and may even be negligible.