Hands-On Unsupervised Learning with Python

Introduction to clustering

As we explained in Chapter 1, Getting Started with Unsupervised Learning, the main goal of a cluster analysis is to group the elements of a dataset according to a similarity measure or a proximity criterion. In the first part of this chapter, we are going to focus on the former approach, while in the second part and in the next chapter, we will analyze more generic methods that exploit other geometric features of the dataset.

Let's take a data generating process pdata(x) and draw N samples from it:

X = {x1, x2, ..., xN}, with each sample xi ∈ ℝⁿ

It's possible to assume that the probability space of pdata(x) is partitionable into (potentially infinite) configurations containing K (for K = 1, 2, ...) regions, so that pdata(x; k) represents the probability of a sample belonging to cluster k. In this way, we are stating that every possible clustering structure already exists when pdata(x) is determined. Further assumptions can be made about the clustering probability distributions so that they better approximate pdata(x) (as we're going to see in Chapter 5, Soft Clustering and Gaussian Mixture Models). However, as we are trying to split the probability space (and the corresponding samples) into cohesive groups, we can adopt two possible strategies:
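
To make the idea of a pre-existing clustering structure more concrete, the following is a minimal sketch in which pdata(x) is assumed to be a mixture of three bidimensional Gaussians (an illustrative assumption, not a dataset used elsewhere in the book). Drawing N samples doesn't create the clusters; it only reveals the regions already defined by the process:

```python
import numpy as np

# Hypothetical p_data(x): a mixture of K = 3 bidimensional Gaussian regions.
# The clustering structure already exists in p_data; sampling only exposes it.
np.random.seed(1000)

K = 3
N = 300
means = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 6.0]])
weights = np.array([0.4, 0.35, 0.25])

# Draw N samples: first pick a region k with probability p(k),
# then draw x from the corresponding Gaussian p_data(x; k)
ks = np.random.choice(K, size=N, p=weights)
X = np.array([np.random.multivariate_normal(means[k], np.eye(2) * 0.5)
              for k in ks])

print(X.shape)
# (300, 2)
```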

  • Hard clustering: In this case, each sample xp ∈ X is assigned to a single cluster Ki and Ki ∩ Kj = ∅ for i ≠ j. The majority of the algorithms we are going to discuss belong to this category. In this case, the problem can be expressed as a parameterized function that assigns a cluster to each input sample, for example, ki = f(xp; θ) with ki ∈ {1, 2, ..., K} (a minimal sketch contrasting both strategies follows this list).
  • Soft clustering: This approach, often subdivided into probabilistic and fuzzy clustering, determines, for every sample xp ∈ X, the degree of membership to each of the predetermined clusters. Hence, if there are K clusters, we have a probability vector p(xp) = [p1(xp), p2(xp), ..., pK(xp)], where pi(xp) represents the probability of being assigned to the cluster i. In this case, the clusters are not disjoint and, generally, a sample will belong to all clusters with a membership degree that is equivalent to a probability (this concept is peculiar to fuzzy clustering).
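
The following is a minimal sketch contrasting the two strategies with scikit-learn on a synthetic dataset (the dataset and the choice of KMeans for the hard assignments and GaussianMixture for the soft ones are illustrative assumptions, anticipating algorithms discussed later):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Synthetic dataset with three compact blobs
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=1000)

# Hard clustering: each sample receives exactly one label k
km = KMeans(n_clusters=3, random_state=1000)
hard_labels = km.fit_predict(X)
print(hard_labels[:5])
# One integer label per sample

# Soft (probabilistic) clustering: each sample receives a membership vector p(x)
gm = GaussianMixture(n_components=3, random_state=1000)
gm.fit(X)
soft_memberships = gm.predict_proba(X)
print(soft_memberships[0])
# Three probabilities that sum to 1
```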

For our purposes, in this chapter we simply assume that the dataset X is drawn from a data generating process whose space, given a metric function, is splittable into compact regions separated from each other. In fact, our main objective is to find K clusters that satisfy the double property of maximum cohesion and maximum separation. This concept will become clearer when discussing the K-means algorithm. For the moment, it's possible to imagine clusters as blobs whose density is much higher than the density observable in the space separating two or more of them, as shown in the following diagram:

Bidimensional clustering structure obeying the rule of maximum cohesion and maximum separation. Nk represents the number of samples belonging to cluster k, while Nout(r) is the number of samples that lie outside the balls centered at each cluster center with a maximum radius r

In the preceding diagram, we are assuming that the majority of samples will be captured by one of the balls, whose radius is the maximum distance between a sample and the corresponding center. However, as we don't want to impose any restriction on the growth of a ball (that is, it can contain any number of samples), it's preferable not to consider the radius and to evaluate the separating region by sampling small subregions (of the whole space) and collecting their densities.

In a perfect scenario, the clusters span some subregions whose density is D, while the separating region is characterized by a density d << D. The discussion about geometric properties can become extremely complex and, in many cases, extremely theoretical. Hence, we consider only the distance between the closest points belonging to different clusters. If this value is much larger than the maximum distance between a sample and its cluster center for all clusters, we can be sure that the separation is effective and it's easy to distinguish between clusters and separating regions. Moreover, when working with a distance metric (for example, in K-means), another important requirement we need to consider is the convexity of the clusters. A generic set C is convex if, ∀ x1, x2 ∈ C, all the points belonging to the segment connecting x1 and x2 also belong to C. The following diagram shows a comparison between convex and non-convex (concave) clusters:
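
As a rough numerical check of these two quantities, the following sketch computes the maximum sample-to-centroid distance for each cluster and the minimum distance between points belonging to different clusters (the synthetic blobs and the use of K-means, which is formally introduced later in the chapter, are illustrative assumptions):

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Three well-separated synthetic blobs (illustrative dataset)
X, _ = make_blobs(n_samples=300,
                  centers=[[-5.0, -5.0], [0.0, 5.0], [5.0, -5.0]],
                  cluster_std=0.8, random_state=1000)

km = KMeans(n_clusters=3, random_state=1000)
labels = km.fit_predict(X)

# Maximum cohesion: largest distance between a sample and its own centroid
radii = [np.max(np.linalg.norm(X[labels == k] - km.cluster_centers_[k], axis=1))
         for k in range(3)]

# Maximum separation: smallest distance between points of different clusters
gaps = [np.min(cdist(X[labels == i], X[labels == j]))
        for i in range(3) for j in range(i + 1, 3)]

print('Maximum intra-cluster radius: {:.2f}'.format(np.max(radii)))
print('Minimum inter-cluster gap: {:.2f}'.format(np.min(gaps)))
# With this configuration, the gap is typically larger than the radius,
# so the separating region is easy to identify
```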

Example of a convex cluster (left) and a concave one (right)

Algorithms such as K-means are unfortunately unable to manage non-convex clusters because of the symmetry of the distance functions. In our exploration, we are going to show this limitation and how other methods overcome it.
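
As a quick, hedged preview of this limitation (the detailed examples come later), the following sketch applies K-means to the concave half-moon dataset generated by scikit-learn's make_moons function; the centroid-based assignment produces an almost linear boundary that cannot follow the crescent shapes:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans

# Two interleaving half-moons: concave clusters
X, y_true = make_moons(n_samples=500, noise=0.05, random_state=1000)

km = KMeans(n_clusters=2, random_state=1000)
y_pred = km.fit_predict(X)

# Agreement with the true structure (cluster labels are arbitrary, so take the
# best of the two possible permutations); it stays noticeably below 1.0
accuracy = max(np.mean(y_pred == y_true), np.mean(y_pred == 1 - y_true))
print('Best-permutation accuracy: {:.2f}'.format(accuracy))
```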