Minimizing the inertia
One of the biggest drawbacks of K-means and similar algorithms is that the number of clusters must be specified in advance. Sometimes this piece of information is imposed by external constraints (for example, in the breast cancer example, there are only two possible diagnoses), but in many cases (when an exploratory analysis is needed), the data scientist has to check different configurations and evaluate them. The simplest way to evaluate K-means performance and choose an appropriate number of clusters is to compare the final inertias obtained with different values of K.
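As a reminder, the inertia is a measure of within-cluster cohesion: it is the sum of the squared distances between every sample and the centroid of the cluster it has been assigned to. Using a standard formulation (the symbols below are chosen here for illustration), for a dataset with M samples and K centroids:

$$S = \sum_{i=1}^{M} \min_{j=1,\ldots,K} \left\lVert \bar{x}_i - \bar{\mu}_j \right\rVert^{2}$$

The smaller the inertia, the more cohesive the clusters.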
Let's start with a simple example based on 12 very compact Gaussian blobs generated with the scikit-learn function make_blobs():
from sklearn.datasets import make_blobs

# 2,000 bidimensional samples grouped into 12 very compact blobs
X, Y = make_blobs(n_samples=2000, n_features=2, centers=12,
                  cluster_std=0.05, center_box=[-5, 5], random_state=100)
The blobs are represented in the following screenshot:
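The scatter plot can be reproduced with a few lines of matplotlib (a minimal sketch, assuming a standard matplotlib installation; the styling of the original figure may differ):

import matplotlib.pyplot as plt

# Scatter plot of the dataset, colored by the ground-truth blob label Y
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(X[:, 0], X[:, 1], c=Y, s=10, cmap='tab20')
ax.set_xlabel('x')
ax.set_ylabel('y')
plt.show()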
Let's now compute the inertia (available as an instance variable inertia_ in a trained KMeans model) for K ∈ [2, 20], as follows:
from sklearn.cluster import KMeans

inertias = []

for i in range(2, 21):
    km = KMeans(n_clusters=i, max_iter=1000, random_state=1000)
    km.fit(X)
    inertias.append(km.inertia_)
The resulting plot is as follows:
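The inertia plot itself can be obtained with a short snippet like the following (a sketch, assuming matplotlib; the original figure's styling may differ):

import matplotlib.pyplot as plt

# Inertia as a function of the number of clusters K
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(range(2, 21), inertias, 'o-')
ax.set_xlabel('Number of clusters (K)')
ax.set_ylabel('Inertia')
ax.grid(True)
plt.show()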
The previous plot shows a common behavior. When the number of clusters is very small, the density is proportionally low, hence the cohesion is low and, as a result, the inertia is high. Increasing the number of clusters forces the model to create more cohesive groups and the inertia starts to decrease abruptly. If we keep increasing K toward M, the decrease becomes slower and slower, and the inertia asymptotically approaches the value corresponding to the configuration K=M (each sample is its own cluster and the inertia is zero). The generic heuristic rule (when there are no external constraints) is to pick the number of clusters corresponding to the point that separates the high-variation region from the almost flat one (the elbow of the curve). In this way, we are sure that all clusters have reached their maximum cohesion without internal fragmentation. Of course, in this case, if we had selected K=15, nine blobs would each have been assigned to a single cluster, while the other three would have been split into two parts. Obviously, as we are splitting high-density regions, the inertia remains low, but the principle of maximum separation is no longer followed.
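The elbow can also be located programmatically. A simple heuristic (not used in the original analysis, shown here only as a sketch) is to inspect the second-order differences of the inertia curve, which peak where the slope changes most abruptly:

import numpy as np

# Second-order differences of the inertia curve; the largest value marks
# the sharpest change of slope, which is a rough estimate of the elbow
second_diff = np.diff(np.array(inertias), n=2)
elbow_k = int(np.argmax(second_diff)) + 3  # +3 because K starts at 2 and diff drops two points
print('Approximate elbow at K = {}'.format(elbow_k))

This kind of estimate should always be confirmed visually, because real inertia curves are often noisy and can exhibit several local slope changes.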
We can now repeat the experiment with the Breast Cancer Wisconsin dataset with K ∈ [2, 50], as follows:
from sklearn.cluster import KMeans

inertias = []

for i in range(2, 51):
    km = KMeans(n_clusters=i, max_iter=1000, random_state=1000)
    km.fit(cdf)
    inertias.append(km.inertia_)
The resulting plot is shown in the following screenshot:
In this case, the ground truth suggests that we should cluster into two groups corresponding to the diagnoses. However, the plot shows a drastic descent that ends at K=8 and continues with a lower slope until about K=40. During the preliminary analysis, we saw that the bidimensional projection is made up of many isolated blobs that share the same diagnosis. Therefore, we could decide to employ, for example, K=8 and to analyze the features corresponding to each cluster. Even though this is not a classification task, the ground truth can be used as the main reference; a correct exploratory analysis, however, should also try to understand the composition of the substructures in order to provide further details for the technicians (for example, physicians).
Let's now perform a K-means clustering with eight clusters on the Breast Cancer Wisconsin dataset in order to describe the structure of two sample groups, as follows:
import pandas as pd
from sklearn.cluster import KMeans
km = KMeans(n_clusters=8, max_iter=1000, random_state=1000)
Y_pred = km.fit_predict(cdf)
df_km = pd.DataFrame(Y_pred, columns=['prediction'], index=cdf.index)
kmdff = pd.concat([dff, df_km], axis=1)
The resulting plot is shown in the following screenshot:
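A comparable scatter plot can be drawn, for example, with seaborn, assuming that kmdff still contains the x and y coordinates of the bidimensional projection computed during the preliminary analysis (this is a sketch, not the exact code used to produce the figure):

import seaborn as sns
import matplotlib.pyplot as plt

# Bidimensional projection colored by the K-means cluster assignment
fig, ax = plt.subplots(figsize=(12, 8))
sns.scatterplot(x='x', y='y', hue='prediction', data=kmdff,
                palette='tab10', s=25, ax=ax)
plt.show()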
Let's now consider the subcluster located in the bottom part of the plot (-25 < x < 30 and -60 < y < -40), as follows:
sdff = dff[(dff.x > -25.0) & (dff.x < 30.0) & (dff.y > -60.0) & (dff.y < -40.0)]
print(sdff[['perimeter_mean', 'area_mean', 'smoothness_mean',
            'concavity_mean', 'symmetry_mean']].describe())
A print-friendly version of the statistical table is shown in the following screenshot:
From the ground truth, we know that all these samples are malignant, but we can try to determine a rule. The ratio area_mean/perimeter_mean is about 9.23, and the standard deviations are very small compared to the corresponding means. This means that these samples represent large tumors within a very narrow range of values. Moreover, both concavity_mean and symmetry_mean are larger than the overall average values. Hence (without any presumption of scientific rigor), we can conclude that samples assigned to these clusters represent very aggressive tumors that have reached an advanced stage.
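The ratio quoted above can be verified directly on the selected slice (a quick check, based on the sdff subset defined previously):

# Ratio between area_mean and perimeter_mean for the selected (malignant) subcluster
ratio = sdff['area_mean'] / sdff['perimeter_mean']
print('Mean ratio: {:.2f}'.format(ratio.mean()))
print('Standard deviation of the ratio: {:.4f}'.format(ratio.std()))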
To make a comparison with the benign samples, let's now consider the area delimited by x > -10 and 20 < y < 50, as follows:
sdff = dff[(dff.x > -10.0) & (dff.y > 20.0) & (dff.y < 50.0)]
print(sdff[['perimeter_mean', 'area_mean', 'smoothness_mean',
            'concavity_mean', 'symmetry_mean']].describe())
The result is shown in the following screenshot:
In this case, the ratio area_mean/perimeter_mean is about 4.89, but area_mean has a larger standard deviation (indeed, its max value is about 410). The concavity_mean is extremely small with respect to the previous one (even with approximately the same standard deviation), while the symmetry_mean is almost equivalent. From this brief analysis, we can deduce that symmetry_mean is not a discriminant feature, while a ratio area_mean/perimeter_mean less than 5.42 (considering the max values), together with a concavity_mean less than or equal to 0.04, should guarantee a benign result. As concavity_mean can reach a very large maximum value (larger than the one associated with malignant samples), it's also necessary to consider the other features in order to decide whether its value should be treated as an alarm. However, we can conclude that all samples belonging to these clusters are benign, with a negligible error probability. I'd like to repeat that this is more an exercise than a real analysis and, in such situations, the main task of the data scientist is to collect contextual pieces of information that can support the conclusions. Even in the presence of the ground truth, this validation process is always mandatory, because the complexity of the underlying causes can lead to completely wrong statements and rules.
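The rule sketched above can also be expressed in code and applied to the whole projected dataset. The following snippet is only an illustrative sketch (the thresholds come from the analysis above, and the rule itself has no clinical validity):

# Hypothetical rule derived from the exploratory analysis:
# small area/perimeter ratio and low concavity -> presumably benign
rule = ((dff['area_mean'] / dff['perimeter_mean'] < 5.42) &
        (dff['concavity_mean'] <= 0.04))
print('Samples satisfying the rule: {}'.format(int(rule.sum())))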