
Defining the machine learning problem

Following Tom Mitchell, the problem must first be framed as a well-defined machine learning problem. Three questions need to be answered at this stage:

Do we have the right problem?
Do we have the right data?
Do we have the right success criteria?

The problem should be one whose solution is valuable to the business. Sufficient historical data should be available for training. The objective should be measurable, so that we know at any point in time how much of it has been achieved.

For example, if we want to identify fraudulent transactions among a set of online transactions, detecting them is clearly valuable to the business. We need a sufficiently large set of online transactions, including enough examples from each category of fraud. We also need a mechanism to verify whether each transaction predicted as fraudulent or nonfraudulent was classified correctly, so that the accuracy of the predictions can be measured.
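
One way to make the success criterion concrete is to compare predictions with labels that were verified afterwards, for instance by a fraud review team. The following is a minimal sketch, not a prescribed method: the function name `fraud_metrics` and the sample labels are hypothetical, and precision and recall are added here as illustrative measures alongside plain accuracy.

```python
def fraud_metrics(actual, predicted):
    """Compare verified labels with predictions (True means fraudulent)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # fraud caught
    fp = sum(1 for a, p in pairs if not a and p)      # false alarm
    fn = sum(1 for a, p in pairs if a and not p)      # fraud missed
    tn = sum(1 for a, p in pairs if not a and not p)  # correctly cleared
    total = len(pairs)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical verified outcomes for eight transactions (illustration only).
actual = [True, False, False, True, False, False, True, False]
predicted = [True, False, True, False, False, False, True, False]
metrics = fraud_metrics(actual, predicted)
print(metrics)
```

Because fraudulent transactions are usually a small share of all transactions, accuracy alone can look high even when most fraud is missed, which is why the sketch reports precision and recall as well.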

To give readers an idea of how much data is sufficient for machine learning: as a rough rule of thumb, a dataset of at least 100 items is fine for a start, and 1,000 is better. The more data we have covering the realistic scenarios of the problem domain, the better for the learning algorithm.
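
The rule of thumb above can be turned into a quick sanity check before any training starts. This is a rough sketch under stated assumptions: `check_dataset_size` and its parameters are hypothetical names, the thresholds of 100 and 1,000 are the figures quoted above rather than hard limits, and the per-category count is an extra check reflecting the earlier point that each fraud category needs enough examples.

```python
from collections import Counter


def check_dataset_size(labels, minimum=100, comfortable=1000):
    """Classify a labeled dataset's size against rule-of-thumb thresholds."""
    total = len(labels)
    if total < minimum:
        level = "insufficient"
    elif total < comfortable:
        level = "starter"
    else:
        level = "comfortable"
    # Per-category counts reveal categories with very few examples.
    return level, Counter(labels)


# Hypothetical labels: 240 legitimate transactions and two fraud categories.
labels = ["legitimate"] * 240 + ["card_theft"] * 8 + ["account_takeover"] * 2
level, counts = check_dataset_size(labels)
print(level, dict(counts))
```

In this made-up dataset the total clears the lower threshold, yet one fraud category has only two examples, so the overall count by itself would overstate how ready the data is.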