How to do it...
Let's see how to perform a clustering data analysis using the k-means algorithm:
- The full code for this recipe is given in the kmeans.py file that has already been provided to you. Now let's take a look at how it's built. Create a new Python file, and import the following packages:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
- Now let's load the input data and define the number of clusters. We will use the data_multivar.txt file that has already been provided to you:
input_file = 'data_multivar.txt'

# Load the comma-separated data
x = []
with open(input_file, 'r') as f:
    for line in f.readlines():
        data = [float(i) for i in line.split(',')]
        x.append(data)

data = np.array(x)
num_clusters = 4
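As an aside, if data_multivar.txt is a plain text file containing only comma-separated numeric values (an assumption based on the parsing loop above), the same loading step can be written more compactly with NumPy. This is an optional sketch, not part of the provided kmeans.py file:

# Optional alternative: load the comma-separated file in one call
# (assumes every line holds only numeric values separated by commas)
data = np.loadtxt('data_multivar.txt', delimiter=',')
num_clusters = 4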
- We need to see what the input data looks like. Let's go ahead and add the following lines of code to the Python file:
plt.figure()
plt.scatter(data[:,0], data[:,1], marker='o',
            facecolors='none', edgecolors='k', s=30)
x_min, x_max = min(data[:, 0]) - 1, max(data[:, 0]) + 1
y_min, y_max = min(data[:, 1]) - 1, max(data[:, 1]) + 1
plt.title('Input data')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
If you run this code, you will get the following output: a scatter plot of the unlabeled input points.
- We are now ready to train the model. Let's initialize the kmeans object and train it:
kmeans = KMeans(init='k-means++', n_clusters=num_clusters, n_init=10)
kmeans.fit(data)
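Before moving on to the plots, it can be useful to inspect what the fitted model has learned. The snippet below is a small optional sketch that relies on standard attributes of a fitted scikit-learn KMeans estimator (labels_, cluster_centers_, and inertia_); it is not part of the provided kmeans.py file:

# Optional: inspect the fitted model
print('Cluster labels of the first five points:', kmeans.labels_[:5])
print('Centroid coordinates:\n', kmeans.cluster_centers_)
print('Sum of squared distances to closest centroid (inertia):', kmeans.inertia_)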
- Now that the model is trained, we need to visualize the boundaries. Let's go ahead and add the following lines of code to the Python file:
# Step size of the mesh
step_size = 0.01

# Define the grid over which to plot the boundaries
x_min, x_max = min(data[:, 0]) - 1, max(data[:, 0]) + 1
y_min, y_max = min(data[:, 1]) - 1, max(data[:, 1]) + 1
x_values, y_values = np.meshgrid(np.arange(x_min, x_max, step_size),
                                 np.arange(y_min, y_max, step_size))

# Predict labels for all points in the mesh
predicted_labels = kmeans.predict(np.c_[x_values.ravel(), y_values.ravel()])
- We just evaluated the model across a grid of points. Let's plot these results to view the boundaries:
# Plot the results
predicted_labels = predicted_labels.reshape(x_values.shape)
plt.figure()
plt.clf()
plt.imshow(predicted_labels, interpolation='nearest',
           extent=(x_values.min(), x_values.max(),
                   y_values.min(), y_values.max()),
           cmap=plt.cm.Paired,
           aspect='auto', origin='lower')
plt.scatter(data[:,0], data[:,1], marker='o',
            facecolors='none', edgecolors='k', s=30)
- Now let's overlay centroids on top of it:
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:,0], centroids[:,1], marker='o', s=200,
            linewidths=3, color='k', zorder=10, facecolors='black')
x_min, x_max = min(data[:, 0]) - 1, max(data[:, 0]) + 1
y_min, y_max = min(data[:, 1]) - 1, max(data[:, 1]) + 1
plt.title('Centroids and boundaries obtained using KMeans')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.show()
If you run this code, you should see the following output: the colored cluster regions, the input points overlaid on them, and the four centroids marked as black circles.
The four centroids and the boundaries between their clusters are clearly highlighted.
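Once trained, the model can also assign previously unseen points to one of the four clusters. The following is a minimal sketch; the sample coordinates are made up purely for illustration and are not taken from the dataset:

# Optional: assign new, unseen points to the learned clusters
new_samples = np.array([[2.0, 3.5], [6.0, 1.0]])  # hypothetical points
print(kmeans.predict(new_samples))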