Effective Amazon Machine Learning
上QQ阅读APP看书,第一时间看更新

Building the simplest predictive analytics algorithm

Predictive analytics can be very simple. We introduce a very simple example of a predictive model in the context of binary classification based on a simple threshold.

Imagine that a truck transporting small oranges and large grapefruits runs off the road; all the boxes of fruits open up, and all the fruits end up mixed together. Equipped with a simple weighing scale and a way to roll the fruits out of the truck, you want to be able to separate them automatically based on their weights. You have some information on the average weights of small oranges (96g) and large grapefruits (166g).

According to the USDA, the average weight of a medium-sized orange is 131 grams, while a larger orange weighs approximately 184 grams, and a smaller one around 96 grams. 

  • Large grapefruit (approx 4-1/2'' dia) 166g
  • Medium grapefruit (approx 4'' dia) 128g
  • Small grapefruit (approx 3-1/2'' dia) 100g

Your predictive model is the following:

  • You arbitrarily set a threshold of 130g
  • You weigh each fruit
  • If the fruit weighs more than 130g, it's a grapefruit; otherwise it's an orange

There! You have a robust reliable, predictive model that can be applied to all your mixed up fruits to separate them. Note that in this case, you've set the threshold with an educated guess. There was no machine learning involved.

In machine learning, the models learn by themselves. Instead of setting the threshold yourself, you let your program evolve and calculate the weight separation threshold of fruits by itself.

For that, you would set aside a certain number of oranges and grapefruits. This is called the training dataset. It's important that this training dataset has roughly the same number of oranges and grapefruits.

And you let the machine decide the threshold value by itself. A possible algorithm could be along these lines:

  1. Set original weight estimation at w_0 = 100g to initialize and a counter k = 0
  2. For each new fruit in the training dataset, adjust the weight estimation according to the following:
        For each new fruit_weight:
w(k+1) = (k*w(k) + fruit_weight)/ (k+1)
k = k+1

Assuming that your training dataset is representative of all the remaining fruits and that you have enough fruits, the threshold would converge under certain conditions to the best average between all the fruit weights. A value which you use to separate all the other fruits depending on whether they weight more or less than the threshold you estimated. The following plot shows the convergence of this crude algorithm to estimate the average weight of the fruits:

This problem is a typical binary classification model. If we had not two but three types of fruits (lemons, oranges, and grapefruit), we would have a multiclass classification problem.

In this example, we only have one predictor: the weight of the fruit. We could add another predictor such as the diameter. This would result in what is called a multivariate classification problem.

In practice, machine learning uses more complex algorithms such as the SGD, the linear algorithm used by Amazon ML. Other classic prediction algorithms include Support Vector Machines, Bayes classifiers, Random forests and so on. Each algorithm has its strength and set of assumptions on the dataset.