Analysis of the Breast Cancer Wisconsin dataset
In this chapter, we are going to use the well-known Breast Cancer Wisconsin dataset to perform a cluster analysis. Originally, the dataset was proposed in order to train classifiers; however, it can also be very helpful for a non-trivial cluster analysis. It contains 569 records made up of 32 attributes (including the diagnosis and an identification number). All the attributes are strictly related to biological and morphological properties of the tumors, but our goal is to validate generic hypotheses considering the ground truth (benign or malignant) and the statistical properties of the dataset. Before moving on, it's important to clarify some points. The dataset is high-dimensional and the clusters are non-convex, so we cannot expect a perfect segmentation. Moreover, our goal is not to use a clustering algorithm to reproduce the results of a classifier; therefore, the ground truth must be taken into account only as a generic indication of a potential grouping. The goal of this example is to show how to perform a brief preliminary analysis, select the optimal number of clusters, and validate the final results.
Once downloaded (as explained in the technical requirements section), the CSV file must be placed in a folder that we generically indicate as <data_folder>. The first step is loading the dataset and performing a global statistical analysis through the function describe() exposed by a pandas DataFrame, as follows:
import numpy as np
import pandas as pd
# Placeholder path: <data_folder> must be replaced with the actual location of the CSV file
bc_dataset_path = '<data_folder>/wdbc.data'
# Column names (the file has no header row)
bc_dataset_columns = ['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst',
'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst',
'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst']
# Load the dataset, using the id column as the index and replacing missing values with 0.0
df = pd.read_csv(bc_dataset_path, index_col=0, names=bc_dataset_columns).fillna(0.0)
print(df.describe())
I strongly suggest using a Jupyter Notebook (in this case, the command can simply be df.describe()), where all the commands yield inline outputs. For practical reasons, the following screenshot shows only the first part of the tabular output (containing eight attributes):
Of course, I invite the reader to check the values for all attributes, even if we are focusing our attention only on a subset. In particular, we need to observe the different scales existing among the first eight attributes. The standard deviations range from about 0.01 to 350, which means that many vectors could be extremely similar only because of one or two attributes. On the other hand, normalizing the values with variance scaling would give all the attributes the same weight (for example, area_mean is bounded between 143.5 and 2,501, while smoothness_mean is bounded between 0.05 and 0.16; forcing them to have the same variance could distort the biological impact of the factors and, as we don't have any specific indication, we are not authorized to make such a choice). Clearly, some attributes will have a higher weight in the clustering process, and we accept their greater influence as a context-related condition.
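As a quick check (a minimal sketch, not part of the original listing), the spread of the first eight attributes can be inspected directly from the DataFrame:
# Standard deviation of the first eight attributes, to confirm the scale differences
print(df[['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
'smoothness_mean', 'compactness_mean', 'concavity_mean',
'concave points_mean']].std())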
Let's now start a preliminary analysis by considering the pair plot of perimeter_mean, area_mean, smoothness_mean, concavity_mean, and symmetry_mean, which is shown in the following screenshot.
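A possible way to generate such a pair plot is with seaborn (a minimal sketch, assuming seaborn and matplotlib are installed; the exact code used for the original figure is not reported here):
import seaborn as sns
import matplotlib.pyplot as plt
# Pair plot of a subset of the mean attributes, split by the ground-truth diagnosis
sns.pairplot(df,
             vars=['perimeter_mean', 'area_mean', 'smoothness_mean',
                   'concavity_mean', 'symmetry_mean'],
             hue='diagnosis')
plt.show()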
Each non-diagonal cell plots one attribute as a function of another one, while the diagonal plots show the distribution of each attribute split into two components (in this case, the diagnosis). Hence, for example, the second plot in the top row is the diagram of perimeter_mean as a function of area_mean, and so on. A rapid analysis highlights some interesting elements:
- area_mean and perimeter_mean have a clear correlation and determine a sharp separation. When area_mean is greater than about 1,000, the perimeter obviously also increases and the diagnosis switches abruptly from benign to malignant. Hence, these two attributes are determinant for the final result and one of them is likely to be redundant (a quick numerical check is sketched after this list).
- Other plots (for example, perimeter_mean/area_mean versus smoothness_mean, area_mean versus symmetry_mean, concavity_mean versus smoothness_mean, and concavity_mean versus symmetry_mean) show a horizontal separation (which becomes vertical if the axes are inverted). This means that, for almost all values assumed by the independent variable (x axis), there's a threshold separating the values of the other variable into two sets (benign and malignant).
- Some plots (for example, perimeter_mean/area_mean versus concavity_mean, and concavity_mean versus symmetry_mean) show a slightly negatively sloped diagonal separation. This means that when the independent variable is small, the diagnosis remains constant for almost all values of the dependent variable, while, on the other hand, when the independent variable becomes larger and larger, the diagnosis switches proportionally to the opposite value. For example, for small perimeter_mean values, concavity_mean can reach its maximum without affecting the diagnosis (which is benign), while perimeter_mean > 150 always yields a malignant diagnosis, independently of concavity_mean.
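As a quick confirmation of the first observation, the correlation between area_mean and perimeter_mean can be checked directly (a minimal sketch, not part of the original example):
# Pearson correlation between area_mean and perimeter_mean (expected to be close to 1)
print(df[['area_mean', 'perimeter_mean']].corr())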
Of course, we cannot easily draw conclusions from such a split analysis (because we need to consider all the interactions), but this activity will be helpful in order to provide each cluster with a semantic label. At this point, it's helpful to visualize the dataset (without the non-structural attributes) on a bidimensional plane through a t-Distributed Stochastic Neighbor Embedding (t-SNE) transformation (for further details, please check Visualizing Data using t-SNE, van der Maaten L., Hinton G., Journal of Machine Learning Research 9, 2008). This can be done as follows:
import pandas as pd
from sklearn.manifold import TSNE
# Drop the ground-truth label, keeping only the structural attributes
cdf = df.drop(['diagnosis'], axis=1)
# Project the 30-dimensional dataset onto a bidimensional plane
tsne = TSNE(n_components=2, perplexity=10, random_state=1000)
data_tsne = tsne.fit_transform(cdf)
# Store the projection and concatenate it with the original DataFrame
df_tsne = pd.DataFrame(data_tsne, columns=['x', 'y'], index=cdf.index)
dff = pd.concat([df, df_tsne], axis=1)
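One possible way to visualize the projection, colored by the ground-truth diagnosis, is the following sketch (assuming seaborn and matplotlib are available; the original plotting code is not shown here):
import seaborn as sns
import matplotlib.pyplot as plt
# Scatter plot of the t-SNE projection, colored by the ground-truth diagnosis
fig, ax = plt.subplots(figsize=(10, 8))
sns.scatterplot(x='x', y='y', hue='diagnosis', data=dff, ax=ax)
plt.show()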
The resulting plot is shown in the following screenshot:
The diagram is highly non-linear (don't forget that this is a projection from ℜ³⁰ to ℜ²), but the majority of malignant samples are in the half-plane y < 0. Unfortunately, a moderate percentage of benign samples are also in this region, hence we don't expect a perfect separation using K=2 (in this case, it's very difficult to understand the real geometry, but t-SNE minimizes the Kullback-Leibler divergence between the bidimensional distribution and the original high-dimensional one). Let's now perform an initial clustering with K=2. We are going to create an instance of the KMeans scikit-learn class with n_clusters=2 and max_iter=1000 (the random_state will always be set equal to 1000 whenever possible).
The remaining parameters are the default ones (K-means++ initialization with 10 attempts), as follows:
import pandas as pd
from sklearn.cluster import KMeans
# K-means with K=2 (k-means++ initialization and 10 restarts are the defaults)
km = KMeans(n_clusters=2, max_iter=1000, random_state=1000)
Y_pred = km.fit_predict(cdf)
# Concatenate the predictions with the projected dataset for later visualization
df_km = pd.DataFrame(Y_pred, columns=['prediction'], index=cdf.index)
kmdff = pd.concat([dff, df_km], axis=1)
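As before, the projection can be visualized, this time colored by the K-means prediction (a minimal sketch, assuming seaborn and matplotlib are available):
import seaborn as sns
import matplotlib.pyplot as plt
# Scatter plot of the t-SNE projection, colored by the predicted cluster
fig, ax = plt.subplots(figsize=(10, 8))
sns.scatterplot(x='x', y='y', hue='prediction', data=kmdff, ax=ax)
plt.show()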
The resulting plot is shown in the following screenshot:
Not surprisingly, the result is rather accurate for y < -20, but the algorithm is unable to also include the boundary points (y ≈ 0) in the main malignant cluster. This is mainly due to the non-convexity of the original sets, and it's very difficult to solve the problem using K-means. Moreover, in the projection, most of the malignant samples with y ≈ 0 are mixed with benign ones, so the probability of error is also high with other methods based on proximity. The only chance of correctly separating those samples derives from the original distribution. In fact, if the points belonging to the same category could be captured by disjoint balls in ℜ³⁰, K-means could also succeed. Unfortunately, in this case, the mixed set seems very cohesive, hence we cannot expect to improve the performance without a transformation. However, for our purposes, this result allows us to apply the main evaluation metrics and then to move from K=2 to larger values. With K>2, we are going to analyze some of the clusters, comparing their structure with the pair plot.
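As a preview of that evaluation, the prediction can be compared with the ground truth using, for example, the adjusted Rand index and the homogeneity and completeness scores (a minimal sketch based on scikit-learn; the choice of these particular metrics here is only illustrative):
from sklearn.metrics import adjusted_rand_score, homogeneity_score, completeness_score
# Map the diagnosis (B/M) to integer labels for comparison with the predictions
Y_true = (df['diagnosis'] == 'M').astype(int).values
print('Adjusted Rand index: {}'.format(adjusted_rand_score(Y_true, Y_pred)))
print('Homogeneity: {}'.format(homogeneity_score(Y_true, Y_pred)))
print('Completeness: {}'.format(completeness_score(Y_true, Y_pred)))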