Selecting the algorithm
The selection of the algorithm is arguably the most important decision that an ML application engineer will need to make, and the one that will take the most research. Sometimes, it is even required to combine an ML algorithm with traditional computer science algorithms to make a problem more tractable—an example of this is a recommender system that we consider later.
A good first step to start homing in on the best algorithm to solve a given problem is to determine whether a supervised or unsupervised approach is required. We introduced both earlier in the chapter. As a rule of thumb, when you are in possession of a labeled dataset and wish to categorize or predict a previously unseen sample, this will use a supervised algorithm. When you wish to understand an unlabeled dataset better by clustering it into different groups, possibly to then classify new samples against, you will use an unsupervised learning algorithm. A deeper understanding of the advantages and pitfalls of each algorithm and a thorough exploration of your data will provide enough information to select an algorithm. To help you get started, we cover a range of supervised learning algorithms in Chapter 3, Supervised Learning, and unsupervised learning algorithms in Chapter 4, Unsupervised Learning.
Some problems can lend themselves to a deft application of both ML techniques and traditional computer science. One such problem is recommender systems, which are now widespread in online retailers such as Amazon and Netflix. This problem asks, given a dataset of each users set of purchased items, predict a set of N items that the user is most likely to purchase next. This is exemplified in Amazons people who buy X also buy Y system.
The basic idea of the solution is that, if two users purchase very similar items, then any items not in the intersection of their purchased items are good candidates for their future purchases. First, transform the dataset so that it maps pairs of items to a score that expresses their co-occurrence. This can be computed by taking the number of times that the same customer has purchased both items, divided by the number of times a customer has purchased either one or the other, to give a number between 0 and 1. This now provides a labeled dataset to train a supervised algorithm such as a binary classifier to predict the score for a previously unseen pair. Combining this with a sorting algorithm can produce, given a single item, a list of items in a sorted rank of purchasability.