Mastering Java for Data Science
上QQ阅读APP看书,第一时间看更新

Clustering

Typically, when people talk about unsupervised learning, they talk about cluster analysis or clustering. A cluster analysis algorithm takes a set of data points and tries to categorize them into groups such that similar items belong to the same group, and different items do not. There are many ways where it can be used, for example, in customer segmentation or text categorization.

Customer segmentation is an example of clustering. Given some description of customers, we try to put them into groups such that the customers in one group have similar profiles and behave in a similar way. This information can be used to understand what do the people in these groups want, and this can be used to target them with better advertisements and other promotional messages.

Another example is text categorization. Given a collection of texts, we would like to find common topics among these texts and arrange the texts according to these topics. For example, given a set of complaints in an e-commerce store, we may want to put ones that talk about similar things together, and this should help the users of the system navigate through the complaints easier.

Examples of cluster analysis algorithms are hierarchical clustering, k-means, density-based spatial clustering of applications with noise (DBSCAN), and many others. We will talk about clustering in detail in the first part of Chapter 5, Unsupervised Learning - Clustering and Dimensionality Reduction.