Identifying the Right Attributes
Given a structured marketing dataset, the first thing you should do is build intuition for the data and generate insights. At this stage, you can also decide whether a given attribute is needed for the analysis at all. The insights generated should agree with your intuition, leaving no doubts about the quality of the data, its interpretation, or its applicability to the business problems we are interested in. If some values don't make intuitive sense, we must dig deeper into the data, remove outliers, and understand why the attribute takes those values. This is important in order to avoid building an inaccurate model, building a model on the wrong data, or using resources inefficiently.
Before we start with the model creation, we should summarize the attributes in our data and objectively compare them with our business expectations. To quantify business expectations, we generally have target metrics whose relationships we want to analyze with the attributes in our data. These metrics may depend on domain knowledge and business acumen and are known as Key Performance Indicators (KPIs).
Assuming that the data is stored in pandas DataFrames, this analysis can be performed using pandas itself, as pandas supports many functions and collections useful for generating insights and summaries. Most of the time, the result of these operations are other pandas DataFrames or Series, so it is also possible to chain multiple functions on top of each other in one go. Some of the most commonly used functions are as follows:
- info(): This function returns the index, datatypes, and memory usage of the DataFrame.
- describe(): This function returns descriptive summary statistics for the numerical columns in the DataFrame.
- columns: This attribute (not a function) returns all of the column names of the DataFrame as an index.
- head(n): This function returns the first n rows of the DataFrame. n is 5 by default.
- tail(n): This function returns the last n rows of the DataFrame. n is 5 by default.
- groupby(col)[cols].agg_func(): This groups the rows by their values in col and applies agg_func to the cols columns within each group.
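As a quick sketch, the DataFrame-level functions above can be tried on a small hypothetical dataset (the column names and values here are illustrative, not from sales.csv):

```python
import pandas as pd

# Hypothetical toy data for illustration only
df = pd.DataFrame({
    'country': ['US', 'US', 'UK', 'UK', 'UK'],
    'revenue': [100.0, 150.0, 80.0, 120.0, 90.0],
})

df.info()             # index, dtypes, and memory usage
print(df.describe())  # summary statistics for the numerical column
print(df.columns)     # Index(['country', 'revenue'], dtype='object')
print(df.head(2))     # first two rows
print(df.tail(2))     # last two rows

# Group rows by country and sum the revenue within each group
totals = df.groupby('country')['revenue'].sum()
print(totals)
```

Because groupby returns another pandas object, further methods can be chained onto the result in one expression.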
Other than these functions, we also have functions meant for pandas Series. When applied to DataFrames, the functions are applied to every column of the DataFrame as an individual series:
- unique(): This function returns the list of unique values in the Series.
- count(): This function returns the total number of non-null, non-NA values.
- min(): This function returns the minimum value in the Series.
- max(): This function returns the maximum value in the Series.
- mean(): This function returns the mean of all non-null, non-NA values in the Series.
- median(): This function returns the median of all non-null, non-NA values in the Series.
- mode(): This function returns the most frequently occurring value in the Series.
- quantile(x): This function returns the value at the xth quantile in the Series (where 0 <= x <= 1). x can also be passed as an array of fractions to get values at multiple quantiles.
Note
The value_counts(dropna=False) function is also commonly used; it shows the unique values in a categorical Series together with their counts. It operates on pandas Series; only newer versions of pandas (1.1 and later) also provide a DataFrame.value_counts() method.
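The Series-level functions above, along with value_counts(), can be sketched on a small hypothetical Series (the values are illustrative):

```python
import pandas as pd

# Hypothetical Series of order quantities, with one missing value
s = pd.Series([3, 1, 2, 2, None, 4])

print(s.unique())                # distinct values, including NaN
print(s.count())                 # non-null, non-NA values only
print(s.min(), s.max())          # smallest and largest values
print(s.mean())                  # mean over the non-null values
print(s.median())                # median over the non-null values
print(s.mode())                  # most frequent value(s)
print(s.quantile([0.25, 0.75]))  # values at multiple quantiles at once

# value_counts() tabulates frequencies; dropna=False keeps NaN in the table
print(s.value_counts(dropna=False))
```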
Once we have identified the attributes that we are interested in analyzing, we can see how each of them varies across the dataset. If the number of unique values in a field is small, we can treat it as categorical: we can group our data directly by its values and read off some naive relationships between the groups. If the number of unique values is large, we treat the attribute as continuous and analyze it in a more subjective way, typically through visualizations.
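This categorical-versus-continuous split can be sketched in code. The column names, values, and the cut-off of 5 unique values below are all illustrative assumptions; in practice, the threshold depends on your dataset:

```python
import pandas as pd

# Hypothetical DataFrame standing in for a real marketing dataset
df = pd.DataFrame({
    'country': ['US', 'UK', 'US', 'FR', 'UK', 'US'],
    'revenue': [10.5, 20.1, 13.2, 8.9, 15.0, 11.1],
})

threshold = 5  # assumed cut-off for "small" number of unique values

# Columns with few unique values are treated as categorical,
# the rest as continuous
categorical = [c for c in df.columns if df[c].nunique() <= threshold]
continuous = [c for c in df.columns if df[c].nunique() > threshold]
```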
Exercise 6: Exploring the Attributes in Sales Data
The sales of some products in your company in the past few years have been increasing and the company wants to build a marketing strategy for them. Read the sales.csv file provided within the Lesson02 folder on GitHub (this is the cleaned data we prepared in the last chapter) and identify the KPIs for this analysis. Also, generate some insights with the data by using the correct attributes for each analysis:
- Import pandas into the console and read the sales.csv file into a pandas DataFrame named sales, as shown:
import pandas as pd
sales = pd.read_csv('sales.csv')
sales.head()
Your output will look as follows:
Figure 2.1: The first five rows of sales.csv
- Now look only at the columns and their contents using the following code to ascertain their relevance in further analysis:
sales.columns
This gives you the following output:
Figure 2.2: The columns in sales.csv
Also, use the info function:
sales.info()
This gives you the following output:
Figure 2.3: Information about the sales DataFrame
- Identify the categorical fields and their distribution using the unique() function, as shown:
First check the Year column:
sales['Year'].unique()
You will get the following output:
Figure 2.4: The number of years the data is spread over
Then check the Product line column:
sales['Product line'].unique()
You will get the following output:
Figure 2.5: The different product lines the data covers
Then check the Product type column:
sales['Product type'].unique()
You will get the following output:
Figure 2.6: The different types of products the data covers
Check the Product column:
sales['Product'].unique()
You will get the following output:
Figure 2.7: Different products covered in the dataset
Check the Order method type column:
sales['Order method type'].unique()
You will get the following output:
Figure 2.8: Different ways in which people making purchases have ordered
Finally, check the Retailer country column:
sales['Retailer country'].unique()
You will get the following output:
Figure 2.9: The countries in which products have been sold
- Now that we have analyzed the categorical values, let's get a quick summary of the numerical fields, using the describe function to make sure that they are relevant for further analysis:
sales.describe()
This gives the following output:
Figure 2.10: Description of the numerical columns in sales.csv
As all the values show considerable variation in the data, we will keep all of them for further analysis.
- Now that we have shortlisted the categorical fields that we are interested in, let's analyze their spread in the data and see if we need to do any filtering. Do this first for the Year column:
sales['Year'].value_counts()
This gives the following output:
Figure 2.11: Frequency table of the Year column
Repeat this for the Product line column:
sales['Product line'].value_counts()
This gives the following output:
Figure 2.12: Frequency table of the Product line column
Then check for the Product type column:
sales['Product type'].value_counts()
This gives the following output:
Figure 2.13: Frequency table of the Product type column
Check for the Order method type column:
sales['Order method type'].value_counts()
This gives the following output:
Figure 2.14: Frequency table of the Order method type column
Finally, check for the Retailer country column:
sales['Retailer country'].value_counts()
You should get the following output:
Figure 2.15: Frequency table of the Retailer country column
As all values occur reasonably frequently, and there are no unexplained values in the data so far, we can proceed without filtering the data for now.
- Now that we have understood the spread in the categorical fields, we should also dig deeper into the spread of the numerical fields in the data and check whether we need to filter some values. We will do this by checking the quantiles of each numerical field:
sales.quantile([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0], numeric_only=True)
This gives the following output:
Figure 2.16: Spread of the numerical columns in sales.csv
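As a sketch of how quantiles can flag values worth filtering, consider the hypothetical data below; the 99th-percentile cut-off is an assumed rule of thumb, not a fixed recipe:

```python
import pandas as pd

# Hypothetical revenue values with one extreme outlier
df = pd.DataFrame({'Revenue': [100, 120, 110, 95, 105, 10000]})

# Inspect the spread at several quantiles; a large jump near the top
# suggests outliers
print(df['Revenue'].quantile([0.1, 0.5, 0.9, 1.0]))

# One common (assumed) filtering rule: drop rows above the 99th percentile
cap = df['Revenue'].quantile(0.99)
filtered = df[df['Revenue'] <= cap]
```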
- Let's use qualitative reasoning to create some quick insights for country-wide statistics now. We will first take the sum across different countries by selecting attributes that give an idea of overall values, such as revenue, product cost, quantity, gross profit, and so on, instead of unit values such as Unit cost or Unit sale price. Use the following code:
sales.groupby('Retailer country')[['Revenue','Planned revenue','Product cost','Quantity','Gross profit']].sum()
You should get the following output:
Figure 2.17: Total revenue, cost, quantities sold, and profit in each country in the past four years
From the preceding figure, we can infer that Denmark made the least sales and the US made the most sales in the past four years. Most countries generated revenue of around 20,000,000 USD and almost reached their planned revenue targets.
- Let's now take the mean across different countries by using attributes that give an idea of the individual value of the product, such as unit sale price, unit cost, quantity, and so on. Use the following code:
sales.groupby('Retailer country')[['Revenue','Planned revenue','Product cost','Quantity','Unit cost','Unit price','Gross profit','Unit sale price']].mean()
You should get the following output:
Figure 2.18: The average revenue, cost, quantity, and so on for each country
From the preceding figure, you will observe that the US, China, the UK, Finland, Japan, and some other countries made the highest revenue on average. Also, the average cost of the product is about 43 USD across all countries.
- Let's look at what countries were affected the worst when sales dipped. Were there some countries for which sales never dipped? Use the following code to group data by Retailer country:
sales.dropna().groupby('Retailer country')[['Revenue','Planned revenue','Product cost','Quantity','Unit cost','Unit price','Gross profit','Unit sale price']].min()
You should get the following output:
Figure 2.19: The lowest price, quantity, cost prices, and so on for each country
From the preceding figure, you can infer that almost every product has at some point made a loss in most countries. Brazil, Spain, and Canada are some good exceptions.
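The logic behind this insight can be sketched on hypothetical data: a negative minimum Gross profit for a country means at least one transaction there was sold at a loss. The country names and values below are illustrative only:

```python
import pandas as pd

# Hypothetical per-transaction data; negative Gross profit marks a loss
df = pd.DataFrame({
    'Retailer country': ['US', 'US', 'Brazil', 'Brazil', 'UK'],
    'Gross profit': [50.0, -5.0, 30.0, 12.0, -2.0],
})

# Minimum profit per country: a negative minimum means at least one loss
min_profit = df.groupby('Retailer country')['Gross profit'].min()
loss_countries = min_profit[min_profit < 0].index.tolist()
```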
- Similarly, let's now generate statistics with respect to other categorical variables, such as Year, Product line, Product type, and Order method type. Use the following code for the Year variable:
sales.groupby('Year')[['Revenue','Planned revenue','Product cost','Quantity','Unit cost','Unit price','Gross profit','Unit sale price']].sum()
This gives the following output:
Figure 2.20: Total revenue, cost, quantities, and so on sold every year
From the preceding figure, it appears that revenue, profit, and quantity dipped in the year 2007. However, we have seen previously that more than 90% of the data is from before 2007, so the 2007 figures are based on far fewer records, and this dip should not alarm us. Apart from that, the data shows considerable progress every year.
- Use the following code for the Product line variable:
sales.groupby('Product line')[['Revenue','Planned revenue','Product cost','Quantity','Unit cost','Unit price','Gross profit','Unit sale price']].sum()
You should get the following output:
Figure 2.21: Total revenue, cost, quantities, and so on, generated by each product division
The preceding figure indicates that the sale of Camping Equipment is the bread and butter of the company.
- Use the following code for the Product type variable:
sales.groupby('Product type')[['Revenue','Planned revenue','Product cost','Quantity','Unit cost','Unit price','Gross profit','Unit sale price']].sum()
You should get the following output:
Figure 2.22: Total revenue, cost, quantities, and so on generated by each product type
You will observe that Sleeping Bags are a major source of revenue for the company: their unit cost is the highest of all types, even though the number of Sleeping Bags sold is the second lowest.
- Use the following code for the Order method type variable:
sales.groupby('Order method type')[['Revenue','Planned revenue','Product cost','Quantity','Unit cost','Unit price','Gross profit','Unit sale price']].mean()
You should get the following output:
Figure 2.23: Average revenue, cost, quantities, and so on generated by each method of ordering
Observe that most sales were generated through the internet (more than all the other sources combined).
- Finally, you need to identify the KPIs. Looking at the previous insights and stats generated, it would make sense to target Revenue as one of the KPIs for further analysis.
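Once Revenue is chosen as a KPI, one optional follow-up (a sketch, not part of the exercise) is to check how the other numerical attributes correlate with it. The toy columns below stand in for the real ones in sales.csv:

```python
import pandas as pd

# Hypothetical numerical columns; checking how attributes move with the KPI
df = pd.DataFrame({
    'Revenue':      [100.0, 200.0, 300.0, 400.0],
    'Quantity':     [10, 20, 30, 40],
    'Gross profit': [20.0, 45.0, 62.0, 90.0],
})

# Pearson correlation of every numerical column with the Revenue KPI;
# values close to 1 indicate attributes that move together with Revenue
corr_with_kpi = df.corr()['Revenue'].sort_values(ascending=False)
print(corr_with_kpi)
```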
Congratulations! You have successfully explored the attributes in a dataset and identified the KPIs for further analysis. In the next section, we will learn how to generate targeted insights from the prepared data.