Mathematician in Data Science: March 2016

Let us consider the numerical difference between a biased estimator and an unbiased one. I will take a simple case: the standard deviation of a sample. Say we have some measurements \(\{ x_1, x_2, \ldots, x_N\}\) of a particular outcome. We can compute their average, \({\bar x}\), and we want to estimate their deviation as well. The formula for the unbiased estimate (strictly speaking, it is the variance \(s^2\) that is unbiased) is the following: \[ s=\sqrt{\frac{\sum_{i=1}^{N}({\bar x}-x_i)^2}{N-1}} \] At the same time, if we knew the true mean \(\mu\) of the outcome, we could compute the deviation the following way: \[ \sqrt{\frac{\sum_{i=1}^{N}(\mu-x_i)^2}{N}} \] So it looks natural to write the deviation using \({\bar x}\) instead of \(\mu\) as \[ \sqrt{\frac{\sum_{i=1}^{N}({\bar x}-x_i)^2}{N}} \] But we know from statistics that the last formula is biased.
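
To make the two formulas concrete, here is a minimal Python sketch that evaluates both on a small sample. The function name `deviation_estimates` and the sample values are made up for illustration; they are not part of the original discussion.

```python
import math
import statistics


def deviation_estimates(xs):
    """Return (unbiased, biased) deviation estimates for a list of numbers.

    'unbiased' divides the sum of squares by N - 1,
    'biased' divides the same sum of squares by N.
    """
    n = len(xs)
    mean = sum(xs) / n
    # Sum of squared deviations from the sample mean.
    sum_sq = sum((mean - x) ** 2 for x in xs)
    unbiased = math.sqrt(sum_sq / (n - 1))
    biased = math.sqrt(sum_sq / n)
    return unbiased, biased


# Hypothetical measurements, chosen only so the arithmetic is easy to follow.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
unbiased, biased = deviation_estimates(sample)
print("divide by N-1:", unbiased)
print("divide by N:  ", biased)

# The standard library provides both versions as well.
print("statistics.stdev: ", statistics.stdev(sample))
print("statistics.pstdev:", statistics.pstdev(sample))
```

The standard-library functions `statistics.stdev` and `statistics.pstdev` correspond to the \(N-1\) and \(N\) denominators respectively, so they can serve as a cross-check for the hand-written version.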

Let us see how much these values differ numerically. I will denote the sum of squares in the numerator by \(S=\sum_{i=1}^{N}({\bar x}-x_i)^2\) to simplify the calculations. Thus \[ s=\sqrt{\frac{\sum_{i=1}^{N}({\bar x}-x_i)^2}{N-1}}= \sqrt{\frac{S}{N-1}}, \] and I want to compute \[ \sqrt{\frac{S}{N-1}}-\sqrt{\frac{S}{N}}= \] We can take out \(\sqrt{S}\) as a common factor: \[ \sqrt{S}\left(\frac{1}{\sqrt{N-1}}-\frac{1}{\sqrt{N}}\right)= \] and then bring the fractions to a common denominator and combine them: \[ \sqrt{S}\left(\frac{\sqrt{N}}{\sqrt{N(N-1)}}- \frac{\sqrt{N-1}}{\sqrt{N(N-1)}}\right)= \sqrt{S}\left(\frac{\sqrt{N}-\sqrt{N-1}}{\sqrt{N(N-1)}}\right)= \]

For a difference of square roots there is a special trick in math, based on the formula \[ (a-b)(a+b)=a^2-b^2 \] As you see, if \(a\) and \(b\) are square roots, then multiplying their difference by their sum gets rid of the roots. Usually we cannot multiply expressions at will, but with a fraction we are allowed to multiply the top and the bottom by the same nonzero expression: \[ \sqrt{S}\left( \frac{\left(\sqrt{N}-\sqrt{N-1}\right)\left(\sqrt{N}+\sqrt{N-1}\right)} {\sqrt{N(N-1)}\left(\sqrt{N}+\sqrt{N-1}\right)} \right)= \] Using the above-mentioned trick simplifies our top (but not the bottom): \[ \sqrt{S}\left( \frac{N-(N-1)} {\sqrt{N(N-1)}\left(\sqrt{N}+\sqrt{N-1}\right)} \right)= \] \[ \sqrt{S}\left( \frac{1} {\sqrt{N(N-1)}\left(\sqrt{N}+\sqrt{N-1}\right)} \right)= \] \[ \frac{\sqrt{S}} {\sqrt{N(N-1)}\left(\sqrt{N}+\sqrt{N-1}\right)} \]

This is not very convenient. Let us consider it relative to the sample deviation \(s\): \[ \frac{\mbox{The difference between unbiased and biased values}} {\mbox{sample deviation}}= \] \[ \frac{\frac{\sqrt{S}} {\sqrt{N(N-1)}\left(\sqrt{N}+\sqrt{N-1}\right)}}{\sqrt{\frac{S}{N-1}}}= \] Here we can cancel \(\sqrt{S}\), flip the fractions and then cancel \(\sqrt{N-1}\). I hope you’re still with me. 
\[ \frac{\frac{1} {\sqrt{N(N-1)}\left(\sqrt{N}+\sqrt{N-1}\right)}} {\frac{1}{\sqrt{N-1}}}= \frac{\sqrt{N-1}}{\sqrt{N(N-1)}\left(\sqrt{N}+\sqrt{N-1}\right)}= \] \[ \frac{1}{\sqrt{N}\left(\sqrt{N}+\sqrt{N-1}\right)} \] The last expression is better, but can still use some doctoring. I can replace every \(\sqrt{N}\) by \(\sqrt{N-1}\), thus reducing the denominator and getting an upper bound for the whole expression: \[ \frac{1}{\sqrt{N}\left(\sqrt{N}+\sqrt{N-1}\right)}< \frac{1}{\sqrt{N-1}\left(\sqrt{N-1}+\sqrt{N-1}\right)}= \frac{1}{2(N-1)} \]

Finally we have a handy formula. If \(N=101\), then the relative error of the sample deviation computed with the biased formula is below \(\frac{1}{2(100)}\cdot 100\%=0.5\%\). When \(N=10,001\), it is below \(0.005\%\). When \(N=1,000,001\), which is closer to what happens in Big Data, the relative error is below \(0.5\cdot 10^{-4}\%\), that is, \(0.5\cdot 10^{-6}\) as a fraction.
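
The relative difference and its bound are easy to check numerically. The following Python sketch evaluates the exact expression \(\frac{1}{\sqrt{N}(\sqrt{N}+\sqrt{N-1})}\) next to the bound \(\frac{1}{2(N-1)}\) for the three sample sizes used here. The function names are hypothetical and chosen only for this illustration.

```python
import math


def relative_difference(n):
    """Exact value of (unbiased - biased) / unbiased; it depends only on n."""
    return 1.0 / (math.sqrt(n) * (math.sqrt(n) + math.sqrt(n - 1)))


def upper_bound(n):
    """The simpler bound 1 / (2 (n - 1)) obtained by replacing sqrt(n) with sqrt(n - 1)."""
    return 1.0 / (2 * (n - 1))


for n in (101, 10_001, 1_000_001):
    rel = relative_difference(n)
    bound = upper_bound(n)
    # Multiply by 100 to express the fractions as percentages.
    print(f"N = {n:>9}: exact = {100 * rel:.3e} %, bound = {100 * bound:.3e} %")
```

The exact value always sits slightly below the bound, and the two get closer as \(N\) grows, so the bound loses very little for large samples.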

Conclusion.

In Big Data the difference between the biased and unbiased estimators of the same value becomes very small: for the sample deviation it is less than \(\frac{1}{2(N-1)}\) of the estimate itself, and is usually negligible.


Wednesday, March 9, 2016

Biased Estimators vs Unbiased Ones
