Python: Advanced Predictive Analytics

Creating dummy variables

Creating dummy variables means creating a separate variable for each category of a categorical variable. Although a categorical variable can contain plenty of information and might show a causal relationship with the output variable, it can't be used in predictive models such as linear and logistic regression without some processing.

In our dataset, sex is a categorical variable with two categories, male and female. We can create two dummy variables out of it as follows:

dummy_sex=pd.get_dummies(data['sex'],prefix='sex')

The result of this statement is as follows:

Fig. 2.17: Dummy variable for the sex variable in the Titanic dataset

This process is called dummifying the variable. It creates two new variables that take a value of either 1 or 0, depending on the sex of the passenger. If the sex was female, sex_female would be 1 and sex_male would be 0. If the sex was male, sex_male would be 1 and sex_female would be 0. In general, all but one dummy variable in a row will have a value of 0; the dummy variable that corresponds to the value in the original column (for that row) will have a value of 1.
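The behaviour described above can be checked on a small example. The following is a minimal sketch, not the book's own code: the three-row data frame is a hypothetical stand-in for the Titanic data, and dtype=int is passed as an assumption so that the output is 0/1 integers, since recent pandas versions may return True/False by default.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data; only the 'sex' column matters here.
data = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'sex': ['male', 'female', 'female'],
})

# dtype=int requests 0/1 integers instead of booleans.
dummy_sex = pd.get_dummies(data['sex'], prefix='sex', dtype=int)

# Summing across the dummy columns gives 1 for every row,
# because exactly one dummy is set per row.
row_sums = dummy_sex.sum(axis=1)

print(dummy_sex)
print(row_sums.tolist())
```

The new columns are named by combining the prefix with each category, so here they are sex_female and sex_male.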

These two new variables can be joined to the source data frame so that they can be used in the models. The method to do that is illustrated as follows:

column_name=data.columns.values.tolist()
column_name.remove('sex')
data[column_name].join(dummy_sex)

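The column names are converted to a list, and sex is removed from that list before the two dummy variables are joined to the dataset, as it would not make sense to keep the sex variable alongside its two dummy variables. Note that join returns a new data frame rather than modifying data in place, so the result has to be assigned to a variable if it is to be used later.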
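The joining step can be sketched end to end as follows. This is an illustrative example under stated assumptions: the small data frame and the names data_new and data_alt are hypothetical, and the drop-based variant is shown only as an equivalent alternative, not as the book's method.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data.
data = pd.DataFrame({
    'age': [22, 38, 26],
    'sex': ['male', 'female', 'female'],
})
dummy_sex = pd.get_dummies(data['sex'], prefix='sex', dtype=int)

# Same steps as in the text: list the columns, remove 'sex', then join.
column_name = data.columns.values.tolist()
column_name.remove('sex')
data_new = data[column_name].join(dummy_sex)  # keep the result in a new variable

# Equivalent alternative: drop the original column, then join the dummies.
data_alt = data.drop(columns='sex').join(dummy_sex)

print(data_new)
```

Both variants join on the row index, so they rely on dummy_sex having the same index as data, which holds when it is derived directly from data['sex'].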