K-means++
Finding the optimal initial configuration is equivalent to minimizing the inertia; however, Arthur and Vassilvitskii (in K-means++: The Advantages of Careful Seeding, Arthur D., Vassilvitskii S., Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 2007) have proposed an alternative initialization method (called K-means++), which can dramatically improve the convergence speed by choosing initial centroids with a higher probability of being close to the final ones. The complete proof is quite complex and can be found in the aforementioned paper. In this context, we provide the final results directly, together with some important consequences.
Let's consider the function D(•) defined as:

D(x) = min ||x − μj|| for j ∈ (1, k)
D(•) represents the shortest distance between a sample x ∈ X and the centroids μ1, ..., μk already selected. Once the function has been computed, it's possible to determine a probability distribution G(x) as follows:

G(x) = D(x)² / Σ D(x)², where the sum is extended to all samples x ∈ X
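To make the two definitions concrete, here is a minimal NumPy sketch of a single selection step; the function name, signature, and variable names are purely illustrative and not taken from any library:

```python
import numpy as np

def sample_next_centroid(X, centroids, rng):
    """Sample one additional centroid according to G(x).

    X: array of shape (n_samples, n_features)
    centroids: array of shape (k, n_features) with the centroids already selected
    rng: a NumPy random Generator used for the sampling
    """
    # D(x): distance from each sample to its closest already-selected centroid
    distances = np.linalg.norm(X[:, np.newaxis, :] - centroids[np.newaxis, :, :], axis=2)
    D = distances.min(axis=1)

    # G(x): probability proportional to the squared shortest distance
    G = D ** 2 / np.sum(D ** 2)

    # Draw the index of the next centroid from G(x)
    idx = rng.choice(len(X), p=G)
    return X[idx]
```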
The first centroid μ1 is sampled from a uniform distribution. At this point, it's possible to compute D(•) for all samples x ∈ X and, therefore, the distribution G(x). It's straightforward to see that, when sampling from G(x), the probability of selecting a point lying in a dense region far from the existing centroids is much larger than with uniform sampling, while the probability of picking a centroid in a separating region remains low. Hence, we continue by sampling μ2 from G(x). The procedure is repeated until all K centroids have been determined. Of course, as this is a probabilistic approach, we have no guarantee that the final configuration is optimal. However, K-means++ seeding is O(log K)-competitive. In fact, if Sopt is the theoretical optimal value for S, the authors proved that the following inequality holds:

E[S] ≤ 8(ln K + 2)·Sopt
As a better choice decreases S, the previous formula sets an upper bound for the expected value E[S] that grows roughly in proportion to log K. For example, for K=10 we get E[S] ≤ 8(ln 10 + 2)·Sopt ≈ 34.4·Sopt, while for K=3 we get E[S] ≤ 8(ln 3 + 2)·Sopt ≈ 24.8·Sopt. This result reveals two important elements. The first is that K-means++ performs better when K is not extremely large; the second, probably also the most important, is that a single K-means++ initialization is not enough to guarantee an optimal configuration. For this reason, common implementations (for example, scikit-learn) perform a variable number of initializations and select the one whose final inertia is the smallest.
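For reference, this is how the same behavior can be obtained in scikit-learn, where init='k-means++' selects the seeding strategy and n_init controls the number of complete runs; the toy dataset and the parameter values are arbitrary choices for this example:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy dataset with three well-separated blobs (for illustration only)
X, _ = make_blobs(n_samples=500, centers=3, random_state=1000)

# K-means++ seeding repeated 10 times; the run with the lowest final
# inertia is kept as the result
km = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=1000)
km.fit(X)

print(km.inertia_)          # final inertia of the best run
print(km.cluster_centers_)  # centroids of the best run
```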