Preparing the data
Data preparation is key to the success of the learning solution. Data is the essential input to machine learning, and it must be prepared properly to ensure that the desired results and objectives are achieved.
Data engineers usually spend around 80 to 90 percent of their overall time in the data preparation phase getting the data right, as this is the most fundamental and critical task for the success of a machine learning program.
The following actions need to be performed in order to prepare the data:
- Identify all sources of data: We need to identify all data sources that can help solve the problem at hand and collect the data from multiple sources—files, databases, emails, mobile devices, the internet, and so on (a short sketch of loading data follows this item).
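As a minimal sketch of this collection step, the following uses pandas to pull records from a flat file and a SQLite database and combine them into one dataset; the file path, database name, and table name are all hypothetical:

```python
import sqlite3

import pandas as pd

# Load records from a flat file (the path is a hypothetical example).
csv_df = pd.read_csv("customers.csv")

# Load records from a database (SQLite here; the table name is hypothetical).
with sqlite3.connect("sales.db") as conn:
    db_df = pd.read_sql("SELECT * FROM orders", conn)

# Combine both sources into a single dataset for exploration.
data = pd.concat([csv_df, db_df], ignore_index=True)
print(data.shape)
```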
- Explore the data: This step involves understanding the nature of the data, as follows:
- Integrate data from different systems and explore it.
- Understand the characteristics and nature of the data.
- Examine the correlations between data attributes.
- Identify outliers; they help reveal problems with the data.
- Apply basic statistical measures, such as the mean, median, mode, range, and standard deviation, to arrive at the skewness of the data. This will help with understanding the nature and spread of the data.
- If the data is skewed, or the range falls outside the expected boundaries, we know that the data has a problem and we need to revisit its source.
- Visualizing the data with graphs will also help with understanding its spread and quality (a short sketch of these checks follows this list).
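As a minimal sketch of this exploration step, the following assumes a pandas DataFrame with a single numeric column named value (both the data and the column name are hypothetical); it computes the summary statistics mentioned above and flags outliers with the common 1.5 × IQR rule:

```python
import pandas as pd

# Hypothetical data; in practice this would come from your collected sources.
df = pd.DataFrame({"value": [12, 15, 14, 10, 8, 12, 300, 11, 13, 9]})

col = df["value"]
print("mean:  ", col.mean())
print("median:", col.median())
print("mode:  ", col.mode().tolist())
print("range: ", col.max() - col.min())
print("std:   ", col.std())
print("skew:  ", col.skew())  # a large positive value means a long right tail

# Flag outliers with the 1.5 * IQR rule.
q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1
outliers = df[(col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)]
print("outliers:\n", outliers)  # the value 300 stands out
```

Here, the extreme value 300 inflates both the standard deviation and the skewness, which is exactly the kind of signal that should prompt a revisit of the data source.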
- Preprocess the data: The goal of this step is to get the data into a format that can be used in the next steps (a short sketch follows this list):
- Data cleansing:
- Addressing missing values. It is important to define a strategy for replacing them; a common one is to impute missing values with the mean or median of the column.
- Addressing duplicate values, invalid data, inconsistent data, outliers, and so on.
- Feature selection: Choosing the data features that are the most appropriate for the problem at hand. Removing redundant or irrelevant features simplifies the process.
- Feature transformation: This phase maps the data from one format to another to help with the next steps of machine learning. It involves normalizing the data, reducing dimensionality, and combining several features into one or creating new features. For example, say we have the date and time as attributes. It would be more meaningful to transform them into the day of the week, the day of the month, and the year, which would provide more meaningful insight.
- Creating Cartesian products of one variable with another. For example, if we have two variables, such as subject (maths, physics, and commerce) and gender (girls and boys), the features formed by a Cartesian product of these two variables might contain useful information, resulting in features such as maths_girls, physics_girls, commerce_girls, maths_boys, physics_boys, and commerce_boys.
- Binning numeric variables into categories. For example, hip/shoulder size values can be binned into categories such as small, medium, large, and extra large.
- Creating domain-specific features, for example, combining the subjects maths, physics, and chemistry into a maths group, and combining physics, chemistry, and biology into a biology group.
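As a minimal sketch of these preprocessing steps, the following works on a small hypothetical pandas DataFrame (all column names and bin boundaries are assumptions); it imputes a missing value with the median, decomposes a timestamp, crosses subject with gender, and bins a numeric size:

```python
import pandas as pd

# Hypothetical raw data; the column names are assumptions for illustration.
df = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2023-01-02 10:00", "2023-06-15 18:30", "2023-11-20 08:45"]
    ),
    "subject": ["maths", "physics", "commerce"],
    "gender": ["girls", "boys", "girls"],
    "size_cm": [34.0, None, 52.0],
})

# Data cleansing: impute the missing size with the column median.
df["size_cm"] = df["size_cm"].fillna(df["size_cm"].median())

# Feature transformation: decompose the timestamp into more meaningful parts.
df["day_of_week"] = df["timestamp"].dt.day_name()
df["day_of_month"] = df["timestamp"].dt.day
df["year"] = df["timestamp"].dt.year

# Cartesian product of subject and gender (maths_girls, physics_boys, ...).
df["subject_gender"] = df["subject"] + "_" + df["gender"]

# Binning: map the numeric size onto ordered categories (bins are assumed).
df["size_category"] = pd.cut(
    df["size_cm"],
    bins=[0, 38, 44, 50, float("inf")],
    labels=["small", "medium", "large", "extra large"],
)
print(df)
```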
- Divide the data into training and test sets: Once the data is transformed, we need to select a training set and a test set. An algorithm is trained on the training dataset and then evaluated against the test dataset. This split may be as simple as a random split of the data (66 percent for training, 34 percent for testing), or it may involve more complicated sampling methods.
The 66 percent/34 percent split is just a guide. If you have 1 million data points, a 90 percent/10 percent split should be enough. With 100 million data points, you can even go down to a 99 percent/1 percent split.
A trained model is not exposed to the test dataset during training and any predictions made on that dataset are designed to be indicative of the performance of the model in general. As such, we need to make sure the selection of datasets is representative of the problem that we are solving.
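As a minimal sketch of such a split, the following uses scikit-learn's train_test_split on a hypothetical feature matrix and label vector; test_size=0.34 mirrors the 66/34 guideline above, and stratify helps keep the test set representative of the class balance:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix X (100 samples, 2 features) and labels y.
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

# Random 66/34 split; stratify preserves the class proportions in both
# sets, and random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.34, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)  # (66, 2) (34, 2)
```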