Creating dummy variables
Creating dummy variables is a method of creating a separate variable for each category of a categorical variable. Although a categorical variable contains plenty of information and might show a causal relationship with the output variable, it can't be used in predictive models such as linear and logistic regression without some processing.
In our dataset, sex is a categorical variable with two categories, male and female. We can create two dummy variables from it as follows:
dummy_sex = pd.get_dummies(data['sex'], prefix='sex')
The result of this statement is as follows:
Fig. 2.17: Dummy variable for the sex variable in the Titanic dataset
This process is called dummifying the variable. It creates two new variables that take a value of either 1 or 0, depending on the sex of the passenger. If the sex was female, sex_female would be 1 and sex_male would be 0. If the sex was male, sex_male would be 1 and sex_female would be 0. In general, all but one dummy variable in a row will have a value of 0. The variable derived from the value (for that row) in the original column will have a value of 1.
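The one-dummy-per-row behaviour can be checked on a small example. The following is a minimal sketch, assuming a hypothetical toy data frame (toy) in place of the Titanic data. The dtype=int argument is used because recent pandas versions return True/False rather than 1/0 by default.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic 'sex' column
toy = pd.DataFrame({'sex': ['male', 'female', 'female', 'male']})

# dtype=int requests 1/0 values; without it, recent pandas versions return booleans
toy_dummies = pd.get_dummies(toy['sex'], prefix='sex', dtype=int)
print(toy_dummies)

# Summing across the dummy columns shows how many dummies are set in each row
row_sums = toy_dummies.sum(axis=1)
print(row_sums.tolist())
```

If the dummies behave as described, every entry of row_sums should be 1, since each passenger falls into exactly one category.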
These two new variables can be joined to the source data frame so that they can be used in the models. The method to do that is illustrated as follows:
column_name = data.columns.values.tolist()
column_name.remove('sex')
data[column_name].join(dummy_sex)
The column names are converted to a list, and sex is removed from that list before the two dummy variables are joined to the dataset, as it would not make sense to keep the original sex variable alongside its two dummy variables.
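The join step can be tried end to end on a small example. This is a minimal sketch, assuming a hypothetical toy data frame whose age and sex columns only mimic the Titanic data. The second variant, using DataFrame.drop, is not from the text; it is shown as one possible shorter way to reach the same result.

```python
import pandas as pd

# Hypothetical toy frame; 'age' and 'sex' merely mimic Titanic-style columns
toy = pd.DataFrame({'age': [22, 38, 26], 'sex': ['male', 'female', 'female']})
toy_dummies = pd.get_dummies(toy['sex'], prefix='sex', dtype=int)

# Approach described in the text: list the columns, remove 'sex', then join the dummies
column_name = toy.columns.values.tolist()
column_name.remove('sex')
joined = toy[column_name].join(toy_dummies)

# Alternative (an assumption, not from the text): drop the column directly
joined_alt = toy.drop(columns='sex').join(toy_dummies)

print(joined.columns.tolist())
print(joined.equals(joined_alt))
```

Note that join returns a new data frame rather than modifying the original, so the result has to be assigned to a name if it is to be used in later modelling steps.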