Example of a generative Gaussian mixture
We can now implement this model in Python using a simple bidimensional dataset, created with the make_blobs() function provided by Scikit-Learn:
import numpy as np
from sklearn.datasets import make_blobs

# Set the seed for reproducibility (see the note at the end of this section)
np.random.seed(1000)

nb_samples = 1000
nb_unlabeled = 750

# Create the dataset and mark 750 randomly chosen points as unlabeled (class -1)
X, Y = make_blobs(n_samples=nb_samples, n_features=2, centers=2, cluster_std=2.5, random_state=100)
unlabeled_idx = np.random.choice(np.arange(0, nb_samples, 1), replace=False, size=nb_unlabeled)
Y[unlabeled_idx] = -1
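A quick sanity check (these print statements are only illustrative and not part of the original listing) confirms how the dataset has been split:
# Number of unlabeled points and per-class counts of the remaining labeled ones
print('Unlabeled samples: {}'.format(np.sum(Y == -1)))
print('Labeled samples per class: {}'.format(np.bincount(Y[Y != -1])))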
We have created 1,000 samples belonging to 2 classes; 750 points have then been randomly selected to become our unlabeled dataset (their class has been set to -1). We can now initialize two Gaussian distributions by defining their means, covariance matrices, and relative weights. One possibility is to use random values:
import numpy as np
# First Gaussian
m1 = np.random.uniform(-7.5, 10.0, size=2)
c1 = np.random.uniform(5.0, 15.0, size=(2, 2))
c1 = np.dot(c1, c1.T)
q1 = 0.5
# Second Gaussian
m2 = np.random.uniform(-7.5, 10.0, size=2)
c2 = np.random.uniform(5.0, 15.0, size=(2, 2))
c2 = np.dot(c2, c2.T)
q2 = 0.5
However, as the covariance matrices must be positive semi-definite, it's necessary to alter the random values (multiplying each matrix by its own transpose guarantees this property) or to set hard-coded initial parameters. In this case, we pick the following values:
import numpy as np
# First Gaussian
m1 = np.array([-3.0, -4.5])
c1 = np.array([[25.0, 5.0],
               [5.0, 35.0]])
q1 = 0.5
# Second Gaussian
m2 = np.array([5.0, 10.0])
c2 = np.array([[25.0, -10.0],
               [-10.0, 25.0]])
q2 = 0.5
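In both cases, we can verify that the covariance matrices are valid by checking that all their eigenvalues are non-negative (a quick sanity check, not part of the original listing):
# A symmetric matrix is positive semi-definite iff all its eigenvalues are >= 0
print(np.linalg.eigvalsh(c1))
print(np.linalg.eigvalsh(c2))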
The resulting plot is shown in the following graph, where the small diamonds represent the unlabeled points and the bigger dots, the samples belonging to the known classes:
Initial configuration of the Gaussian mixture
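A figure like this can be reproduced with Matplotlib. The following is only a sketch that reuses the variables defined in the previous snippets; the marker styles, colors, and the 2.5σ ellipse scaling are assumptions, and a single ellipse is drawn per Gaussian instead of the several concentric ones shown in the figure:
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse

def gaussian_ellipse(mean, cov, color, n_std=2.5):
    # Build an ellipse whose axes follow the covariance eigenvectors
    eigvals, eigvecs = np.linalg.eigh(cov)
    angle = np.degrees(np.arctan2(eigvecs[1, 1], eigvecs[0, 1]))
    width = 2.0 * n_std * np.sqrt(eigvals[1])
    height = 2.0 * n_std * np.sqrt(eigvals[0])
    return Ellipse(xy=mean, width=width, height=height, angle=angle,
                   fill=False, edgecolor=color)

fig, ax = plt.subplots(figsize=(10, 7))

# Unlabeled points as small diamonds, labeled ones as bigger dots
ax.scatter(X[Y == -1, 0], X[Y == -1, 1], marker='d', s=20, color='gray')
ax.scatter(X[Y == 0, 0], X[Y == 0, 1], marker='o', s=80, color='blue')
ax.scatter(X[Y == 1, 0], X[Y == 1, 1], marker='o', s=80, color='red')

# Draw the two Gaussians as ellipses
ax.add_patch(gaussian_ellipse(m1, c1, 'blue'))
ax.add_patch(gaussian_ellipse(m2, c2, 'red'))

plt.show()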
The two Gaussians are represented by the concentric ellipses. We can now execute the training procedure. For simplicity, we repeat the update for a fixed number of iterations. The reader can easily modify the code to introduce a convergence threshold (a possible sketch is shown right after the loop):
from scipy.stats import multivariate_normal

nb_iterations = 5

for i in range(nb_iterations):
    Pij = np.zeros((nb_samples, 2))

    # E-step: compute the membership probabilities for every sample
    for j in range(nb_samples):
        if Y[j] == -1:
            p1 = multivariate_normal.pdf(X[j], m1, c1, allow_singular=True) * q1
            p2 = multivariate_normal.pdf(X[j], m2, c2, allow_singular=True) * q2
            Pij[j] = np.array([p1, p2]) / (p1 + p2)
        else:
            Pij[j, :] = [1.0, 0.0] if Y[j] == 0 else [0.0, 1.0]

    # M-step: update means, weights, and covariance matrices
    n = np.sum(Pij, axis=0)

    m1 = np.dot(Pij[:, 0], X) / n[0]
    m2 = np.dot(Pij[:, 1], X) / n[1]

    q1 = n[0] / float(nb_samples)
    q2 = n[1] / float(nb_samples)

    c1 = np.zeros((2, 2))
    c2 = np.zeros((2, 2))

    for t in range(nb_samples):
        c1 += Pij[t, 0] * np.outer(X[t] - m1, X[t] - m1)
        c2 += Pij[t, 1] * np.outer(X[t] - m2, X[t] - m2)

    c1 /= n[0]
    c2 /= n[1]
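As a sketch of the threshold-based stopping criterion mentioned above, the fixed number of iterations can be replaced with a convergence test on the total log-likelihood. The following is only a hedged, vectorized variant of the same updates (the threshold value, the maximum number of iterations, and the vectorized form are assumptions, not part of the original example):
# Hedged sketch: same EM updates, stopped when the log-likelihood improvement
# falls below a threshold (threshold and max_iterations are arbitrary choices)
threshold = 1e-4
max_iterations = 100
previous_ll = -np.inf

for _ in range(max_iterations):
    # E-step (vectorized): responsibilities for all samples
    p1 = multivariate_normal.pdf(X, m1, c1, allow_singular=True) * q1
    p2 = multivariate_normal.pdf(X, m2, c2, allow_singular=True) * q2
    Pij = np.vstack([p1, p2]).T / (p1 + p2)[:, np.newaxis]

    # Clamp the labeled samples to their known classes
    Pij[Y == 0] = [1.0, 0.0]
    Pij[Y == 1] = [0.0, 1.0]

    # M-step: means, weights, and covariance matrices
    n = np.sum(Pij, axis=0)
    m1 = np.dot(Pij[:, 0], X) / n[0]
    m2 = np.dot(Pij[:, 1], X) / n[1]
    q1, q2 = n[0] / float(nb_samples), n[1] / float(nb_samples)

    d1 = X - m1
    d2 = X - m2
    c1 = np.dot(Pij[:, 0] * d1.T, d1) / n[0]
    c2 = np.dot(Pij[:, 1] * d2.T, d2) / n[1]

    # Stop when the total log-likelihood no longer improves significantly
    current_ll = np.sum(np.log(q1 * multivariate_normal.pdf(X, m1, c1, allow_singular=True) +
                               q2 * multivariate_normal.pdf(X, m2, c2, allow_singular=True)))
    if abs(current_ll - previous_ll) < threshold:
        break
    previous_ll = current_ll
As the log-likelihood is non-decreasing under EM, this loop is expected to stop after only a few iterations on this dataset.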
At the beginning of each cycle, we initialize the Pij matrix that stores the p(yi|xj,θ,w) values. Then, for each sample, we compute p(yi|xj,θ,w), taking into account whether it's labeled or not: the known labels are clamped to 0 or 1, while for the unlabeled points the Gaussian probabilities are computed with the SciPy function multivariate_normal.pdf() and normalized. When the whole Pij matrix has been populated, we can update the parameters (means and covariance matrices) of both Gaussians, together with the relative weights. The algorithm is very fast; after five iterations, we reach the stable state represented in the following graph:
The two Gaussians have perfectly mapped the space by setting their parameters so as to cover the high-density regions. We can check some of the unlabeled points, as follows:
print(np.round(X[Y==-1][0:10], 3))
[[ 1.67   7.204]
 [-1.347 -5.672]
 [-2.395 10.952]
 [-0.261  6.526]
 [ 1.053  8.961]
 [-0.579 -7.431]
 [ 0.956  9.739]
 [-5.889  5.227]
 [-2.761  8.615]
 [-1.777  4.717]]
It's easy to locate them in the previous plot. The corresponding classes can be obtained through the last Pij matrix:
print(np.round(Pij[Y==-1][0:10], 3))
[[ 0.002  0.998]
 [ 1.     0.   ]
 [ 0.     1.   ]
 [ 0.003  0.997]
 [ 0.     1.   ]
 [ 1.     0.   ]
 [ 0.     1.   ]
 [ 0.007  0.993]
 [ 0.     1.   ]
 [ 0.02   0.98 ]]
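Each row contains the responsibilities of the two Gaussians for the corresponding point. If explicit hard labels are needed, they can be obtained with np.argmax (a minimal sketch; Y_pred is a hypothetical name, not used in the original example):
# Hard assignments: pick the Gaussian with the highest responsibility
# (Y_pred is a hypothetical variable, not part of the original example)
Y_pred = np.argmax(Pij, axis=1)
print(Y_pred[Y == -1][0:10])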
These responsibilities immediately verify that the unlabeled points have been correctly labeled and assigned to the right cluster. This algorithm is very fast and produces excellent results in terms of density estimation. In Chapter 5, EM Algorithm and Applications, we are going to discuss a general version of this algorithm, explaining the complete training procedure based on the EM algorithm.
In all the examples that involve random numbers, the seed is set equal to 1,000 (np.random.seed(1000)). Other values or subsequent experiments without resetting it can yield slightly different results.