A Gaussian Mixture Model (GMM) is a special kind of mixture model. In a GMM, parameters are estimated from training data using the iterative Expectation-Maximization (EM) algorithm, or by Maximum A Posteriori (MAP) estimation from a prior model. This post discusses how to estimate the parameters with EM.

In a Mixture of Gaussians, data are sampled from a weighted sum of Gaussian densities:  $latex p(x)=\sum_{k=1}^{K} \pi_k \cdot N(x|\mu_k, \Sigma_k) &s=1$

where $latex \sum_{k=1}^{K} \pi_k = 1 &s=1$. The parameters are $latex \theta=\{\pi_k,\mu_k, \Sigma_k\}_{k=1}^{K} &s=1$, and the data are $latex X=\{x_j\}_{j=1}^{N} &s=1$.
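
For concreteness, here is a minimal NumPy/SciPy sketch (mine, just for illustration, not code from the references below) that evaluates this mixture density at a point; the names gmm_density, pis, mus, sigmas are all illustrative:

import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, pis, mus, sigmas):
    # p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)
    return sum(pi * multivariate_normal.pdf(x, mean=mu, cov=sigma)
               for pi, mu, sigma in zip(pis, mus, sigmas))

# Toy example with two 2-D components (illustrative values only).
pis = [0.4, 0.6]
mus = [np.zeros(2), np.array([3.0, 3.0])]
sigmas = [np.eye(2), 2.0 * np.eye(2)]
print(gmm_density(np.array([1.0, 1.0]), pis, mus, sigmas))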

The Expectation-Maximization (EM) algorithm can be used to find a (local) optimum of the maximum likelihood estimate.

The log likelihood of the data is $latex L(X|\theta) = \sum_{j} \log p(x_j|\theta) &s=1$
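
A numerically safer way to evaluate $latex L(X|\theta) &s=1$ is through a log-sum-exp; a minimal sketch (the function and argument names are mine, matching the toy setup above):

import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def log_likelihood(X, pis, mus, sigmas):
    # log p(x_j) = logsumexp_k [ log pi_k + log N(x_j | mu_k, Sigma_k) ]
    log_joint = np.array([[np.log(pi) + multivariate_normal.logpdf(x, mean=mu, cov=sigma)
                           for pi, mu, sigma in zip(pis, mus, sigmas)]
                          for x in X])
    return logsumexp(log_joint, axis=1).sum()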

In EM, we assume the existence of latent variables $latex z_{j} \in \{1, \cdots, K\} &s=1$, which indicate which component distribution $latex x_j &s=1$ comes from. We use $latex \phi_{jk} &s=1$ to denote the probability that $latex x_j&s=1$ comes from the $latex k^{th} &s=1$ component:  $latex \phi_{jk} = p(z_j = k|x_j,\theta) &s=1$. Therefore we have

$latex \begin{array}{cll} L(X|\theta) & = & \sum_j \log p(x_j|\theta) \\
& = & \sum_j \log \sum_k p(x_j,z_j=k|\theta) \\
& = & \sum_j \log { \sum_k p(z_j=k|x_j, \theta^{(i)}) \cdot \frac {p(x_j,z_j=k|\theta)} {p(z_j=k|x_j, \theta^{(i)})} } \\
& \ge & \sum_j \sum_k p(z_j=k|x_j, \theta^{(i)}) \cdot \log \frac {p(x_j,z_j=k|\theta)} {p(z_j=k|x_j, \theta^{(i)})} \end{array}&s=1$

The last inequality is Jensen's inequality, applied to the concave $latex \log &s=1$ function.
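Since the bound is built from the posteriors $latex p(z_j=k|x_j, \theta^{(i)}) &s=1$, i.e. the $latex \phi_{jk} &s=1$ defined above, here is a small NumPy/SciPy sketch of how they can be computed; the function name responsibilities and all argument names are mine:

import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pis, mus, sigmas):
    # joint[j, k] = pi_k * N(x_j | mu_k, Sigma_k); normalizing each row gives phi_jk.
    joint = np.array([[pi * multivariate_normal.pdf(x, mean=mu, cov=sigma)
                       for pi, mu, sigma in zip(pis, mus, sigmas)]
                      for x in X])
    return joint / joint.sum(axis=1, keepdims=True)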

In the E-step, we compute the responsibilities $latex \phi_{jk}^{(i)} = p(z_j=k|x_j,\theta^{(i)}) &s=1$ and, with them, the Q function, i.e. the part of the bound that depends on $latex \theta &s=1$ (the dropped term $latex -\sum_j \sum_k \phi_{jk}^{(i)} \log \phi_{jk}^{(i)} &s=1$ is constant w.r.t. $latex \theta &s=1$):

$latex \begin{array}{cll} Q(\theta,\theta^{(i)}) & = & \sum_j \sum_k p(z_j=k|x_j, \theta^{(i)}) \cdot \log p(x_j,z_j=k|\theta) \\
& = &\sum_j \sum_k p(z_j=k|x_j, \theta^{(i)}) \cdot (\log \pi_k + \log N(x_j|\mu_k,\Sigma_k)) \end{array}&s=1$.
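
The Q function can also be evaluated directly from the responsibilities; a minimal sketch (the name q_function is mine, and phi is the N-by-K array of $latex \phi_{jk} &s=1$ from the responsibilities sketch above):

import numpy as np
from scipy.stats import multivariate_normal

def q_function(X, phi, pis, mus, sigmas):
    # Q = sum_j sum_k phi_jk * ( log pi_k + log N(x_j | mu_k, Sigma_k) )
    log_joint = np.array([[np.log(pi) + multivariate_normal.logpdf(x, mean=mu, cov=sigma)
                           for pi, mu, sigma in zip(pis, mus, sigmas)]
                          for x in X])
    return float(np.sum(phi * log_joint))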

Each EM iteration thus consists of two steps (remember: the M-step maximizes the Q function w.r.t. the parameters; we can find either a closed-form solution or use gradient descent, etc.). A NumPy sketch of one full iteration is given after the two steps below.

(1) E-step: compute the responsibilities $latex \phi_{jk}^{(i)}=p(z_j=k|x_j,\theta^{(i)})=\frac{p(z_j=k,x_j|\theta^{(i)})} {\sum_{k'}{p(z_j=k',x_j|\theta^{(i)})}} &s=1$, where $latex p(z_j=k,x_j|\theta^{(i)}) = \pi_k^{(i)} N(x_j|\mu_k^{(i)}, \Sigma_k^{(i)}) &s=1$

(2) M-step: maximize $latex Q(\theta,\theta^{(i)}) &s=1$ w.r.t. $latex \theta &s=1$. This optimization may require a Lagrange multiplier. For example, when updating $latex \pi_k &s=1$, because of the constraint $latex \sum_k \pi_k = 1 &s=1$, we maximize the following function:

$latex J(\theta,\theta^{(i)}) = Q(\theta,\theta^{(i)}) + \lambda (\sum_k \pi_k -1) &s=1$

By setting the derivative w.r.t. $latex \pi_k &s=1$ to zero we get $latex \pi_k = -\frac{1}{\lambda}\sum_j \phi_{jk}^{(i)} &s=1$; summing over $latex k &s=1$ and using the constraint gives $latex \lambda = -N &s=1$, so the closed-form update rule is $latex \pi_k = \frac{1}{N}\sum_j \phi_{jk}^{(i)} &s=1$.
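
Putting the two steps together, here is a minimal NumPy sketch of one EM iteration with the standard closed-form GMM updates (the function em_step and all names are mine; it is an illustration, not the exact code from the references):

import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pis, mus, sigmas):
    N, K = len(X), len(pis)
    # E-step: phi[j, k] = p(z_j = k | x_j, theta^{(i)}).
    joint = np.array([[pi * multivariate_normal.pdf(x, mean=mu, cov=sigma)
                       for pi, mu, sigma in zip(pis, mus, sigmas)]
                      for x in X])
    phi = joint / joint.sum(axis=1, keepdims=True)
    # M-step: closed-form maximizers of Q(theta, theta^{(i)}).
    Nk = phi.sum(axis=0)                    # "effective" number of points per component
    new_pis = Nk / N                        # pi_k = (1/N) * sum_j phi_jk
    new_mus = [phi[:, k] @ X / Nk[k] for k in range(K)]
    new_sigmas = []
    for k in range(K):
        diff = X - new_mus[k]               # rows are x_j - mu_k
        new_sigmas.append((phi[:, k, None] * diff).T @ diff / Nk[k])
    return new_pis, new_mus, new_sigmas

Iterating em_step from some initialization and monitoring the log likelihood (e.g. with the log_likelihood sketch above) until it stops increasing gives the usual EM fit.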

Ref: Gaussian Mixture Models; A guide to EM and Mixture Models

I will never use the equation system in wp again…