Python:Advanced Predictive Analytics
上QQ阅读APP看书,第一时间看更新

Grouping the data – aggregation, filtering, and transformation

In this section, you will learn how to aggregate data over categorical variables. This is a very common practice when the data consists of categorical variables. This analysis enables us to conduct a category-wise analysis and take further decisions regarding the modelling.

To illustrate the concepts of grouping and aggregating data better, let's create a simple dummy data frame that has a rich mix of both numerical and categorical variables. Let's use whatever we have explored till now about random numbers to create this data frame, as shown in the following snippet:

import numpy as np
import pandas as pd
a=['Male','Female']
b=['Rich','Poor','Middle Class']
gender=[]
seb=[]
for i in range(1,101):
    gender.append(np.random.choice(a))
    seb.append(np.random.choice(b))
height=30*np.random.randn(100)+155
weight=20*np.random.randn(100)+60
age=10*np.random.randn(100)+35
income=1500*np.random.randn(100)+15000

df=pd.DataFrame({'Gender':gender,'Height':height,'Weight':weight,'Age':age,'Income':income,'Socio-Eco':seb})
df.head()

The output data frame df looks something as follows:

Fig. 3.22: The resulting dummy data frame df containing 6 columns

As we can see from the preceding code snippet, the shape of the data frame is 100x6.

Grouping can be done over a categorical variable using the groupby function. The column name of the categorical variable needs to be specified for this. Suppose that we wish to group the data frame based on the Gender variable. This can be done by writing the following:

df.groupby('Gender')

If you run the preceding snippet on your IDE, you will get the following output indicating that a groupby object has been created:

Fig. 3.23: Prompt showing that the groupby object has been created

The groupby function doesn't split the original data frame into several groups, instead it creates a groupby object that has two attributes, which are name and group.

These attributes can be accessed by following the name of the groupby object with '.', followed by the name of the attribute. For example, to access the group attribute, one can write the following:

grouped = df.groupby('Gender')
grouped.groups

The following is the output:

Fig. 3.24: Two groups based on gender

The numbers indicate the row numbers that belong to that particular group.

One important feature of these attributes is that they are iterable, and the same operation can be applied to each group just by looping. This comes in very handy when the number of groups are large and one needs results of the operation separately for each group.

Let's perform a simple operation to illustrate this. Let's try to print the name and groups in the groupby object that we just created. This can be done as follows:

grouped=df.groupby('Gender')
for names,groups in grouped:
    print names
    print groups

This prints the name of the group followed by the entire data for this group. The output looks something like the following:

Fig. 3.25.1: Name and the data in the group with gender female

Here is the second group as a part of the output:

Fig. 3.25.2: Name and the data in the group with gender male

A single group can be selected by writing the following:

grouped.get_group('Female')

This would generate only the first of the two groups, as shown in the preceding screenshot.

A data frame can be grouped over more than one categorical variable as well. As in this case, the data frame can be grouped over both Gender and Soci-Eco by writing something like the following:

grouped=df.groupby(['Gender','Socio-Eco'])

This should create six groups from a combination of two categories of Gender and three categories of the Socio-Eco variable. This can be checked by checking the length of the groupby object as follows:

len(grouped)

It indeed returns six. To look at how these groups look, let's run the same iteration on the group attributes as we did earlier:

grouped=df.groupby(['Gender','Socio-Eco'])
for names,groups in grouped:
    print names
    print groups

The code gives six groups' names and their entire data as the output. There would be six of such groups in total.

The first group looks like the following:

Fig. 3.26.1: Name and the data in the group with gender female and Socio_Eco Middle Class

The second group looks like the following:

Fig. 3.26.2: Name and the data in the group with gender female and Socio_Eco Middle Class

Aggregation

There are various aggregations that are possible on a data frame, such as sum, mean, describe, size, and so on. The aggregation basically means applying a function to all the groups all at once and getting a result from that particular group.

Let's see the sum function. We just need to write the following code snippet to see how it works:

grouped=df.groupby(['Gender','Socio-Eco'])
grouped.sum()

We gets the following table as the result:

Fig. 3.27: Sum of each column for different groups

To get the number of rows in each group (or calculate the size of each group), we can write something similar to the following code snippet:

grouped=df.groupby(['Gender','Socio-Eco'])
grouped.size()

This results in a table, as shown in the following screenshot:

Fig. 3.28: Size of each group

One can use the describe function to get the summary statistics for each group separately. The syntax is exactly the same as it is for the earlier two functions:

grouped=df.groupby(['Gender','Socio-Eco'])
grouped.describe()

This output looks similar to the following table:

Fig. 3.29: All the summary statistics of each column for different groups

The groupby objects behave similar to an inpidual data frame, in the sense that one can select columns from these groupby objects just as we do from the data frames:

grouped=df.groupby(['Gender','Socio-Eco'])
grouped_income=grouped['Income']

One can apply different functions to different columns. The aggregate method used to do this is shown in the following snippet. With the following snippet, one can calculate sum of Income, mean of Age, and standard deviation of Height, as shown:

grouped=df.groupby(['Gender','Socio-Eco'])
grouped.aggregate({'Income':np.sum,'Age':np.mean,'Height':np.std})

The output of the preceding snippet looks similar to the following table:

Fig. 3.30: Selected summary statistics of selected columns for different groups

We can also define a function using the lambda method of defining a calculation in Python. Suppose you don't want the mean of age but the ratio of mean and standard deviation for height. You can define the formula for this ratio using the lambda method, illustrated as follows:

grouped=df.groupby(['Gender','Socio-Eco'])
grouped.aggregate({'Age':np.mean,'Height':lambda x:np.mean(x)/np.std(x)})

Rather than applying different functions to different columns, one can apply several functions to all the columns at the same time, as shown:

grouped.aggregate([np.sum, np.mean, np.std])

The output of the code snippet contains the result of all the three functions applied on all the columns of the groupby object, as seen in the following screenshot:

Fig. 3.31: More than one selected summary statistics of selected columns for different groups

Filtering

One important operation that can be applied on the groupby objects is filter. We can filter elements based on the properties of groups. Suppose we want to choose elements from the Age column that are a part of the group wherein the sum of Age is greater than 700. This filtering can be done by writing the following snippet:

grouped['Age'].filter(lambda x:x.sum()>700)

The output contains the row numbers that are part of the group where the sum of Age is greater than 700. The output is, as follows:

Fig. 3.32: The rows left after filtering it for elements, which are part of groups, where the sum of ages is greater than 700

Transformation

One can use the transform method to mathematically transform all the elements in a numerical column. Suppose, we wish to calculate the standard normal values for all the elements in the numerical columns of our data frame; this can be done in a manner as shown:

zscore = lambda x: (x - x.mean()) / x.std()
grouped.transform(zscore)

The output contains standard normal values for all the numerical columns in the data frame, as shown in the following screenshot:

Fig. 3.33: Result of applying a lambda defined function on the columns of groups

The transform method comes in handy in a lot of situations. For example, it can be used to fill the missing values with the mean of the non-missing values, as shown:

f = lambda x: x.fillna(x.mean()
grouped.transform(f)
Miscellaneous operations

In many situations, one needs to select the nth row of each group of a groupby object, most often the first and the last row. This can be easily done once the groupby object is created. Let's see how:

  • The first row of each group can be selected by writing the following code snippet:
    grouped.head(1)
  • While the last row of each group can be selected by writing the following code snippet:
    grouped.tail(1)

The result of the former, is as shown:

Fig. 3.34: First few rows of the grouped element

In general, we can use the nth function to get the nth row from a group, as illustrated:

grouped=df.groupby('Gender')
grouped.nth(1)

This gives the following result:

Fig. 3.35: First rows of each group

One can use any number (of course, less than the number of rows in each group) as the argument for the nth function.

It is always a good practice to sort the data frame for the relevant columns before creating the groupby object from the data frame. Suppose, you want to look at the youngest male and female members of this data frame.

This can be done by sorting the data frame, creating a groupby object, and then taking the first element of each group:

df1=df.sort(['Age','Income'])
grouped=df1.groupby('Gender')
grouped.head(1)

The output has two rows containing the details of the two youngest members from the two groups:

Fig. 3.36: Sorting by the age column before grouping by gender and then selecting the first row from each group can give you the oldest/youngest guy in the group

The oldest members can be identified in the same way by typing grouped.tail(1).