# Sample mean and sample covariance

The sample mean and the sample covariance are statistics computed from a collection of data that is regarded as a random sample.

## Sample mean and covariance

Given a random sample $\mathbf{x}_{1},\ldots,\mathbf{x}_{N}$ from an $n$-dimensional random variable $\mathbf{X}$ (i.e., realizations of $N$ independent random variables with the same distribution as $\mathbf{X}$), the sample mean is

$\mathbf{\bar{x}}=\frac{1}{N}\sum_{k=1}^{N}\mathbf{x}_{k}.$

In coordinates, writing the vectors as columns,

$\mathbf{x}_{k}=\left[\begin{array}{c}x_{1k}\\ \vdots\\ x_{nk}\end{array}\right],\quad\mathbf{\bar{x}}=\left[\begin{array}{c}\bar{x}_{1}\\ \vdots\\ \bar{x}_{n}\end{array}\right],$

the entries of the sample mean are

$\bar{x}_{i}=\frac{1}{N}\sum_{k=1}^{N}x_{ik},\quad i=1,\ldots,n.$

The sample covariance of $\mathbf{x}_{1},\ldots,\mathbf{x}_{N}$ is the $n$ by $n$ matrix $\mathbf{Q}=\left[q_{ij}\right]$ with the entries given by

$q_{ij}=\frac{1}{N-1}\sum_{k=1}^{N}\left(x_{ik}-\bar{x}_{i}\right)\left(x_{jk}-\bar{x}_{j}\right).$
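The two definitions above can be sketched in a few lines of plain Python. This is only an illustration: the function names `sample_mean` and `sample_covariance` and the data set are hypothetical, and each observation $\mathbf{x}_{k}$ is stored as a list of $n$ numbers.

```python
def sample_mean(samples):
    """Return the component-wise mean of a list of equal-length vectors."""
    N = len(samples)
    n = len(samples[0])
    return [sum(x[i] for x in samples) / N for i in range(n)]


def sample_covariance(samples):
    """Return the n-by-n sample covariance matrix, using the 1/(N-1) factor."""
    N = len(samples)
    if N < 2:
        raise ValueError("at least two observations are required")
    n = len(samples[0])
    mean = sample_mean(samples)
    # Centered data: d[k][i] = x_ik - xbar_i
    d = [[x[i] - mean[i] for i in range(n)] for x in samples]
    return [[sum(row[i] * row[j] for row in d) / (N - 1) for j in range(n)]
            for i in range(n)]


# Hypothetical data: N = 4 observations of an n = 2 dimensional variable.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
mean = sample_mean(data)        # entries xbar_1, xbar_2
cov = sample_covariance(data)   # entries q_ij, symmetric in i and j
```

For this data the mean is `[2.5, 5.0]`, and the diagonal entries of `cov` are the sample variances $5/3$ and $20/3$ of the two coordinates.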

The sample mean and the sample covariance matrix are unbiased estimates of the mean and the covariance matrix of the random variable $\mathbf{X}$. The sample covariance matrix has $N-1$ in the denominator rather than $N$ essentially because the population mean $E(\mathbf{X})$ is not known and is replaced by the sample mean $\mathbf{\bar{x}}$. If the population mean $E(\mathbf{X})$ is known, the analogous unbiased estimate

$q_{ij}=\frac{1}{N}\sum_{k=1}^{N}\left(x_{ik}-E(X_{i})\right)\left(x_{jk}-E(X_{j})\right)$

using the population mean does indeed have $N$ in the denominator. This is an example of why in probability and statistics it is essential to distinguish between upper case letters (random variables) and lower case letters (realizations of the random variables).
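The origin of $N-1$ can be sketched in the scalar case $n=1$. The symbols $\mu$ and $\sigma^{2}$ are notation introduced for this sketch only: they denote the mean and the variance of $X$, and $\bar{X}$ is the sample mean of the independent random variables $X_{1},\ldots,X_{N}$. Expanding about $\mu$ gives

$\sum_{k=1}^{N}\left(X_{k}-\bar{X}\right)^{2}=\sum_{k=1}^{N}\left(X_{k}-\mu\right)^{2}-N\left(\bar{X}-\mu\right)^{2}.$

The first sum has expectation $N\sigma^{2}$. Since $\bar{X}$ has variance $\sigma^{2}/N$, the second term has expectation $\sigma^{2}$. Hence

$E\left[\sum_{k=1}^{N}\left(X_{k}-\bar{X}\right)^{2}\right]=(N-1)\sigma^{2},$

so dividing by $N-1$ rather than $N$ gives an unbiased estimate of $\sigma^{2}$.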

The maximum likelihood estimate of the covariance

$q_{ij}=\frac{1}{N}\sum_{k=1}^{N}\left(x_{ik}-\bar{x}_{i}\right)\left(x_{jk}-\bar{x}_{j}\right)$

for the Gaussian distribution case has $N$ in the denominator as well. The difference between the two estimates diminishes for large $N$.
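The size of that difference can be checked numerically. The sketch below assumes a hypothetical helper `covariance` with a flag that selects the divisor, plus made-up data. It shows that every entry of the $1/N$ version equals $(N-1)/N$ times the corresponding entry of the $1/(N-1)$ version, a factor that tends to 1 as $N$ grows.

```python
def covariance(samples, unbiased=True):
    """Covariance about the sample mean, dividing by N-1 (unbiased) or by N."""
    N = len(samples)
    n = len(samples[0])
    mean = [sum(x[i] for x in samples) / N for i in range(n)]
    divisor = N - 1 if unbiased else N
    return [[sum((x[i] - mean[i]) * (x[j] - mean[j]) for x in samples) / divisor
             for j in range(n)] for i in range(n)]


# Hypothetical data: N = 4 observations in n = 2 dimensions.
data = [[1.0, 2.0], [2.0, 1.0], [4.0, 3.0], [5.0, 6.0]]
N = len(data)

q_n_minus_1 = covariance(data, unbiased=True)   # divides by N - 1
q_n = covariance(data, unbiased=False)          # divides by N

# Entry-wise ratio of the two estimates; each entry should equal (N - 1) / N.
ratios = [[q_n[i][j] / q_n_minus_1[i][j] for j in range(2)] for i in range(2)]
```

With $N=4$ every ratio is $3/4$. The same computation with $N=1000$ observations would give $0.999$.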

## Weighted samples

In a weighted sample, each vector $\mathbf{x}_{k}$ is assigned a weight $w_{k}\geq 0$. Without loss of generality, assume that the weights are normalized:

$\sum_{k=1}^{N}w_{k}=1.$

(If they are not, divide the weights by their sum.) Then the weighted mean $\mathbf{\bar{x}}$ and the weighted covariance matrix $\mathbf{Q}=\left[q_{ij}\right]$ are given by

$\mathbf{\bar{x}}=\sum_{k=1}^{N}w_{k}\mathbf{x}_{k}$

and

$q_{ij}=\frac{\sum_{k=1}^{N}w_{k}\left(x_{ik}-\bar{x}_{i}\right)\left(x_{jk}-\bar{x}_{j}\right)}{1-\sum_{k=1}^{N}w_{k}^{2}}.$

If all weights are the same, $w_{k}=1/N$, the weighted mean and covariance reduce to the sample mean and covariance above.
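A minimal sketch of the weighted formulas, again in plain Python, is given below. The function name `weighted_mean_cov` and the data are hypothetical. The weights are normalized inside the function, so unnormalized weights may be passed in.

```python
def weighted_mean_cov(samples, weights):
    """Return (mean, cov) for a weighted sample; weights are normalized to sum to 1."""
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    w = [wk / total for wk in weights]
    n = len(samples[0])
    # Weighted mean: xbar_i = sum_k w_k x_ik
    mean = [sum(wk * x[i] for wk, x in zip(w, samples)) for i in range(n)]
    # Denominator 1 - sum_k w_k^2; it vanishes if all weight sits on one sample.
    denom = 1.0 - sum(wk * wk for wk in w)
    if denom <= 0:
        raise ValueError("covariance undefined when a single sample carries all weight")
    cov = [[sum(wk * (x[i] - mean[i]) * (x[j] - mean[j])
                for wk, x in zip(w, samples)) / denom
            for j in range(n)] for i in range(n)]
    return mean, cov


# Hypothetical data with equal (unnormalized) weights.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
w_mean, w_cov = weighted_mean_cov(data, [1.0, 1.0, 1.0, 1.0])
```

With equal weights the denominator is $1-1/N=(N-1)/N$, so `w_cov` has the same entries as the unweighted sample covariance with the $1/(N-1)$ factor: here the diagonal is $5/3$ and $20/3$.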