
Handling categorical data
So far, we have only been working with numerical values. However, it is not uncommon that real-world datasets contain one or more categorical feature columns. In this section, we will make use of simple yet effective examples to see how we deal with this type of data in numerical computing libraries.
Nominal and ordinal features
When we are talking about categorical data, we have to further distinguish between nominal and ordinal features. Ordinal features can be understood as categorical values that can be sorted or ordered. For example, t-shirt size would be an ordinal feature, because we can define an order XL > L > M. In contrast, nominal features don't imply any order and, to continue with the previous example, we could think of t-shirt color as a nominal feature since it typically doesn't make sense to say that, for example, red is larger than blue.
Creating an example dataset
Before we explore different techniques to handle such categorical data, let's create a new DataFrame to illustrate the problem:
>>> import pandas as pd
>>> df = pd.DataFrame([
...     ['green', 'M', 10.1, 'class1'],
...     ['red', 'L', 13.5, 'class2'],
...     ['blue', 'XL', 15.3, 'class1']])
>>> df.columns = ['color', 'size', 'price', 'classlabel']
>>> df
   color size  price classlabel
0  green    M   10.1     class1
1    red    L   13.5     class2
2   blue   XL   15.3     class1
As we can see in the preceding output, the newly created DataFrame contains a nominal feature (color), an ordinal feature (size), and a numerical feature (price) column. The class labels (assuming that we created a dataset for a supervised learning task) are stored in the last column. The learning algorithms for classification that we discuss in this book do not use ordinal information in class labels.
Mapping ordinal features
To make sure that the learning algorithm interprets the ordinal features correctly, we need to convert the categorical string values into integers. Unfortunately, there is no convenient function that can automatically derive the correct order of the labels of our size feature, so we have to define the mapping manually. In the following simple example, let's assume that we know the numerical difference between the feature values, for example, XL = L + 1 = M + 2:
>>> size_mapping = {
...     'XL': 3,
...     'L': 2,
...     'M': 1}
>>> df['size'] = df['size'].map(size_mapping)
>>> df
   color  size  price classlabel
0  green     1   10.1     class1
1    red     2   13.5     class2
2   blue     3   15.3     class1
If we want to transform the integer values back to the original string representation at a later stage, we can define a reverse-mapping dictionary, inv_size_mapping = {v: k for k, v in size_mapping.items()}, and pass it to the pandas map method on the transformed feature column, just as we did with the size_mapping dictionary previously:
>>> inv_size_mapping = {v: k for k, v in size_mapping.items()}
>>> df['size'].map(inv_size_mapping)
0     M
1     L
2    XL
Name: size, dtype: object
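As a complementary sketch, the same hand-defined order can also be stored in a pandas ordered categorical type, which keeps the forward and reverse mapping in one object. The names size_dtype, sizes_cat, size_codes, and recovered are illustrative and not part of the original example, and the order still has to be supplied manually:

```python
import pandas as pd

# Stand-alone copy of the original (string-valued) size column
sizes = pd.Series(['M', 'L', 'XL'], name='size')

# pandas cannot infer the order; we list the categories from smallest to largest
size_dtype = pd.CategoricalDtype(categories=['M', 'L', 'XL'], ordered=True)
sizes_cat = sizes.astype(size_dtype)

# Category codes start at 0, so we add 1 to match the size_mapping values (M=1, L=2, XL=3)
size_codes = sizes_cat.cat.codes + 1

# Reverse direction: turn the integer codes back into the string labels
recovered = pd.Series(
    pd.Categorical.from_codes((size_codes - 1).to_numpy(), dtype=size_dtype))

print(size_codes.tolist())
print(recovered.tolist())
```

Whether this is preferable to a plain dictionary is a matter of taste; the dictionary makes the chosen integer values explicit, while the categorical type keeps the order attached to the column.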
Encoding class labels
Many machine learning libraries require that class labels are encoded as integer values. Although most estimators for classification in scikit-learn convert class labels to integers internally, it is considered good practice to provide class labels as integer arrays to avoid technical glitches. To encode the class labels, we can use an approach similar to the mapping of ordinal features discussed previously. We need to remember that class labels are not ordinal, and it doesn't matter which integer number we assign to a particular string label. Thus, we can simply enumerate the class labels, starting at 0:
>>> import numpy as np
>>> class_mapping = {label: idx for idx, label in
...                  enumerate(np.unique(df['classlabel']))}
>>> class_mapping
{'class1': 0, 'class2': 1}
Next, we can use the mapping dictionary to transform the class labels into integers:
>>> df['classlabel'] = df['classlabel'].map(class_mapping)
>>> df
   color  size  price  classlabel
0  green     1   10.1           0
1    red     2   13.5           1
2   blue     3   15.3           0
We can reverse the key-value pairs in the mapping dictionary as follows to map the converted class labels back to the original string representation:
>>> inv_class_mapping = {v: k for k, v in class_mapping.items()}
>>> df['classlabel'] = df['classlabel'].map(inv_class_mapping)
>>> df
   color  size  price classlabel
0  green     1   10.1     class1
1    red     2   13.5     class2
2   blue     3   15.3     class1
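pandas also offers a factorize function that performs a similar enumeration in a single call. The sketch below is an illustrative alternative rather than the approach used in this section; it assumes sort=True so that the integer codes follow the sorted label order, as with np.unique above. The variable names are hypothetical:

```python
import pandas as pd

# Stand-alone copy of the class label column
labels = pd.Series(['class1', 'class2', 'class1'], name='classlabel')

# codes holds one integer per row; uniques holds the distinct labels in sorted order
codes, uniques = pd.factorize(labels, sort=True)

# Indexing the unique labels with the codes restores the original strings
restored = uniques.take(codes)

print(codes.tolist())
print(list(uniques))
print(list(restored))
```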
Alternatively, there is a convenient LabelEncoder class directly implemented in scikit-learn to achieve this:
>>> from sklearn.preprocessing import LabelEncoder
>>> class_le = LabelEncoder()
>>> y = class_le.fit_transform(df['classlabel'].values)
>>> y
array([0, 1, 0])
Note that the fit_transform method is just a shortcut for calling fit and transform separately, and we can use the inverse_transform method to transform the integer class labels back into their original string representation:
>>> class_le.inverse_transform(y)
array(['class1', 'class2', 'class1'], dtype=object)
The class_le.inverse_transform(y) call may raise a DeprecationWarning due to an implementation detail in scikit-learn. It was already addressed in a pull request (https://github.com/scikit-learn/scikit-learn/pull/9816), and the patch will be released with the next version of scikit-learn (i.e., v. 0.20.0).
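To make the relationship between fit, transform, and inverse_transform concrete, here is a minimal stand-alone sketch that calls the two steps separately and inspects the learned label order through the classes_ attribute. The array labels and the variable names are illustrative:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(['class1', 'class2', 'class1'])

le = LabelEncoder()
le.fit(labels)             # learns the sorted set of unique labels
y = le.transform(labels)   # maps each label to its index in le.classes_

# The position of a label in classes_ is the integer it is encoded as
print(le.classes_)
print(y)

# Round trip back to the string labels
restored = le.inverse_transform(y)
print(restored)
```

Calling fit once and transform later is the pattern to use when the same encoding has to be applied to new data, for example a test set.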
Performing one-hot encoding on nominal features
In the previous section, we used a simple dictionary-mapping approach to convert the ordinal size feature into integers. Since scikit-learn's estimators for classification treat class labels as categorical data that does not imply any order (nominal), we used the convenient LabelEncoder to encode the string labels into integers. It may appear that we could use a similar approach to transform the nominal color column of our dataset, as follows:
>>> X = df[['color', 'size', 'price']].values
>>> color_le = LabelEncoder()
>>> X[:, 0] = color_le.fit_transform(X[:, 0])
>>> X
array([[1, 1, 10.1],
       [2, 2, 13.5],
       [0, 3, 15.3]], dtype=object)
After executing the preceding code, the first column of the NumPy array X now holds the new color values, which are encoded as follows:
blue = 0
green = 1
red = 2
If we stop at this point and feed the array to our classifier, we will make one of the most common mistakes in dealing with categorical data. Can you spot the problem? Although the color values don't come in any particular order, a learning algorithm will now assume that green is larger than blue, and red is larger than green. Although this assumption is incorrect, the algorithm could still produce useful results. However, those results would not be optimal.
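One way to see the artificial order is to compare distances between colors under the two encodings. The following NumPy sketch is an illustration added here, not part of the original example: with integer codes, red ends up twice as far from blue as green is, whereas with one binary indicator per color (the encoding described in the next paragraph) every pair of distinct colors is equally far apart.

```python
import numpy as np

colors = ['blue', 'green', 'red']

# Integer codes as produced by the label encoding above: blue=0, green=1, red=2
int_codes = np.array([0.0, 1.0, 2.0])
# Pairwise absolute differences between the codes
int_dist = np.abs(int_codes[:, None] - int_codes[None, :])

# One indicator vector per color: each row of the identity matrix marks one color
indicators = np.eye(len(colors))
# Pairwise Euclidean distances between the indicator vectors
ind_dist = np.linalg.norm(
    indicators[:, None, :] - indicators[None, :, :], axis=-1)

print(int_dist)   # blue-red distance is 2, blue-green distance is 1
print(ind_dist)   # every off-diagonal distance is sqrt(2)
```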
A common workaround for this problem is to use a technique called one-hot encoding. The idea behind this approach is to create a new dummy feature for each unique value in the nominal feature column. Here, we would convert the color feature into three new features: blue, green, and red. Binary values can then be used to indicate the particular color of a sample; for example, a blue sample can be encoded as blue=1, green=0, red=0. To perform this transformation, we can use the OneHotEncoder that is implemented in the sklearn.preprocessing module:
>>> from sklearn.preprocessing import OneHotEncoder
>>> ohe = OneHotEncoder(categorical_features=[0])
>>> ohe.fit_transform(X).toarray()
array([[  0. ,   1. ,   0. ,   1. ,  10.1],
       [  0. ,   0. ,   1. ,   2. ,  13.5],
       [  1. ,   0. ,   0. ,   3. ,  15.3]])
When we initialized the OneHotEncoder, we defined the column position of the variable that we want to transform via the categorical_features parameter (note that color is the first column in the feature matrix X). By default, the OneHotEncoder returns a sparse matrix when we use the transform method, and we converted the sparse matrix representation into a regular (dense) NumPy array for the purpose of visualization via the toarray method. Sparse matrices are a more efficient way of storing large datasets that contain a lot of zeros, and many scikit-learn functions support them. To omit the toarray step, we could alternatively initialize the encoder as OneHotEncoder(..., sparse=False) to return a regular NumPy array.
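The categorical_features parameter shown above belongs to the scikit-learn version this section was written for; it was deprecated in v0.20 and removed in later releases. A rough equivalent for newer versions, sketched under the assumption that sklearn.compose.ColumnTransformer is available (v0.20 or later), selects the column to encode through the transformer and passes the remaining columns through unchanged. The step name 'onehot' and the variable names are illustrative, and the color column is kept as strings because newer encoders accept string categories directly:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Feature matrix with the original color strings and the mapped size values
X = np.array([['green', 1, 10.1],
              ['red', 2, 13.5],
              ['blue', 3, 15.3]], dtype=object)

ct = ColumnTransformer(
    transformers=[('onehot', OneHotEncoder(), [0])],  # encode column 0 only
    remainder='passthrough',   # keep size and price as they are
    sparse_threshold=0)        # always return a dense array

# The passthrough columns have object dtype, so we cast the result to float
X_ohe = ct.fit_transform(X).astype(float)
print(X_ohe)
```

The encoded columns come first (blue, green, red, in sorted order), followed by the passed-through size and price columns, which mirrors the layout of the output above.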
An even more convenient way to create those dummy features via one-hot encoding is to use the get_dummies function implemented in pandas. Applied to a DataFrame, the get_dummies function will only convert string columns and leave all other columns unchanged:
>>> pd.get_dummies(df[['price', 'color', 'size']])
   price  size  color_blue  color_green  color_red
0   10.1     1           0            1          0
1   13.5     2           0            0          1
2   15.3     3           1            0          0
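In pandas 2.0 and later, get_dummies returns boolean (True/False) dummy columns by default instead of 0/1 integers. The following sketch, which rebuilds a small stand-alone DataFrame for illustration, requests integer output through the dtype parameter and names the column to encode explicitly through the columns parameter:

```python
import pandas as pd

# Stand-alone copy of the example data with the size column already mapped to integers
df_example = pd.DataFrame({
    'price': [10.1, 13.5, 15.3],
    'color': ['green', 'red', 'blue'],
    'size': [1, 2, 3]})

# columns= restricts the encoding to the listed columns;
# dtype=int yields 0/1 values regardless of the pandas version's default
dummies = pd.get_dummies(df_example, columns=['color'], dtype=int)
print(dummies)
```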
When we use one-hot encoded datasets, we have to keep in mind that the encoding introduces multicollinearity, which can be an issue for certain methods (for instance, methods that require matrix inversion). If features are highly correlated, matrices are computationally difficult to invert, which can lead to numerically unstable estimates. To reduce the correlation among variables, we can simply remove one feature column from the one-hot encoded array. Note that we do not lose any important information by removing a feature column; for example, if we remove the column color_blue, the feature information is still preserved, since observing color_green=0 and color_red=0 implies that the observation must be blue.
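A small NumPy sketch can illustrate the redundancy. It assumes, as an addition to the text, a model with an intercept (a column of ones): the three dummy columns always sum to that column, so the full design matrix is rank deficient, and dropping one dummy column restores full column rank. The four sample rows are made up for the illustration:

```python
import numpy as np

# One-hot columns in the order blue, green, red for four hypothetical samples
onehot = np.array([[0, 1, 0],
                   [0, 0, 1],
                   [1, 0, 0],
                   [0, 1, 0]], dtype=float)
intercept = np.ones((onehot.shape[0], 1))

# Intercept plus all three dummies: 4 columns, but blue + green + red == intercept
full = np.hstack([intercept, onehot])
# Intercept plus the dummies with the first (blue) column removed: 3 columns
reduced = np.hstack([intercept, onehot[:, 1:]])

print(full.shape[1], np.linalg.matrix_rank(full))        # more columns than rank
print(reduced.shape[1], np.linalg.matrix_rank(reduced))  # columns equal rank
```

When the rank is lower than the number of columns, the matrix X^T X that appears in, for example, the least-squares solution is singular and cannot be inverted.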
If we use the get_dummies function, we can drop the first column by passing a True argument to the drop_first parameter, as shown in the following code example:
>>> pd.get_dummies(df[['price', 'color', 'size']],
...                drop_first=True)
   price  size  color_green  color_red
0   10.1     1            1          0
1   13.5     2            0          1
2   15.3     3            0          0
In the scikit-learn version used here, the OneHotEncoder does not have a parameter for column removal, but we can simply slice the one-hot encoded NumPy array as shown in the following code snippet:
>>> ohe = OneHotEncoder(categorical_features=[0])
>>> ohe.fit_transform(X).toarray()[:, 1:]
array([[  1. ,   0. ,   1. ,  10.1],
       [  0. ,   1. ,   2. ,  13.5],
       [  0. ,   0. ,   3. ,  15.3]])
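Newer scikit-learn releases (v0.21 and later) add a drop parameter to OneHotEncoder that removes a column during encoding, so the manual slicing is no longer needed there. The sketch below assumes such a version and encodes only the color column, given as strings; the variable names are illustrative:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# The color column as a 2D array of strings (one feature, three samples)
colors = np.array([['green'], ['red'], ['blue']])

# drop='first' removes the first category in sorted order (here: blue)
ohe_drop = OneHotEncoder(drop='first')
encoded = ohe_drop.fit_transform(colors).toarray()

# Remaining columns correspond to green and red; a blue sample is all zeros
print(encoded)
```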