Data Science for Marketing Analytics

Visualizing Data

An important aspect of exploring data is to be able to represent the data visually. When data is represented visually, the underlying numbers and distribution become very easy to understand and differences become easy to spot.

Plots in Python are very similar to those in any other traditional marketing analytics tool, so we can apply what we already know about plots directly. pandas has built-in support for visualizing the data in a DataFrame or Series through the plot function. You choose the type of plot via the kind parameter of the plot function. Some of the most commonly used values, as used on sales.csv, are as follows:

kde or density for density plots
bar or barh for bar plots
box for boxplot
area for area plots
scatter for scatter plots
hexbin for hexagonal bin plots
pie for pie plots

You can choose which columns go on the x and y axes by passing the column names of the DataFrame as the x and y parameters of the plot function.
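As a minimal sketch of the kind, x, and y parameters, the snippet below builds a tiny DataFrame and draws a scatter plot and a bar plot from it. The values are invented for illustration and are not taken from sales.csv; only the column names follow the ones used in this chapter. The name demo is our own.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display; not needed in Jupyter
import pandas as pd

# Made-up numbers for illustration only; not taken from sales.csv
demo = pd.DataFrame({
    "Year": [2004, 2005, 2006, 2007],
    "Revenue": [120.0, 150.0, 170.0, 160.0],
    "Planned revenue": [125.0, 155.0, 180.0, 165.0],
})

# kind selects the plot type; x and y name the columns to put on each axis
scatter_ax = demo.plot(kind="scatter", x="Planned revenue", y="Revenue")

# The same DataFrame as a bar plot: one bar per row, labeled by Year
bar_ax = demo.plot(kind="bar", x="Year", y="Revenue")
```

Each call returns a Matplotlib Axes object, which is why the results are kept in variables here.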

Exercise 9: Visualizing Data With pandas

Using the sales DataFrame created in the previous exercise, we will create visualizations to explore the distribution of the Revenue KPI. We will look at how the different order method types influence revenue, and how revenue compares to planned revenue, quantity, and gross profit year on year.

Import the module that we will be needing, that is, pandas.
import pandas as pd
Load the sales.csv file into a DataFrame named sales and have a look at the first few rows, as follows:
sales = pd.read_csv("sales.csv")
sales.head()
You will get the following output:

Figure 2.48: Output of sales.head()
Now take the Revenue field and plot its distribution by passing kde as the kind parameter, as follows:
sales['Revenue'].plot(kind = 'kde')
You will get the following density plot:

Figure 2.49: Distribution of revenue in sales.csv
Next, group the Revenue by Order method type, sum it within each group, and make a bar plot:
sales.groupby('Order method type')[['Revenue']].sum().plot(kind = 'bar', y = 'Revenue')
This gives the following output:

Figure 2.50: Revenue generated through each Order method type in sales.csv
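Let's now group the rows by year and create boxplots of several columns to compare them on a relative scale. Note the double brackets, which recent versions of pandas require when selecting several columns from a grouped DataFrame:
sales.groupby('Year')[['Revenue', 'Planned revenue', 'Quantity', 'Gross profit']].plot(kind = 'box')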
You should get the following plots:

Figure 2.51: Boxplots for Revenue, Planned Revenue, Quantity, and Gross Profit for 2004 to 2007

Figure 2.52: Boxplots for Revenue, Planned Revenue, Quantity, and Gross Profit for 2005

Figure 2.53: Boxplots for Revenue, Planned Revenue, Quantity, and Gross Profit for 2006 and 2007
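Calling plot on a grouped DataFrame produces one plot per group. As a rough sketch of the same idea with more control, the grouping can be written as an explicit loop so that each plot gets its own title. The rows below are made up for illustration (the real values are in sales.csv), and the names sales_demo, columns, and axes_by_year are our own.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display; not needed in Jupyter
import pandas as pd

# Made-up rows for illustration only; the real values live in sales.csv
sales_demo = pd.DataFrame({
    "Year": [2004, 2004, 2004, 2005, 2005, 2005],
    "Revenue": [100.0, 140.0, 90.0, 130.0, 160.0, 150.0],
    "Planned revenue": [110.0, 150.0, 95.0, 135.0, 170.0, 155.0],
    "Quantity": [10, 14, 9, 13, 16, 15],
    "Gross profit": [40.0, 55.0, 30.0, 50.0, 65.0, 60.0],
})
columns = ["Revenue", "Planned revenue", "Quantity", "Gross profit"]

axes_by_year = {}
for year, group in sales_demo.groupby("Year"):
    # One boxplot figure per year, titled with the year it shows
    axes_by_year[year] = group[columns].plot(kind="box", title=f"Year {year}")
```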

These plots get the message across, but we don't have a lot of control over them because we are letting pandas figure things out for us. There are other ways to plot the data that give us more freedom in how we express it. Let's look at them in this section.

Visualization through Seaborn

An important kind of plot that we missed before is the histogram. We can still pass hist as the kind parameter of the plot function, but instead of relying on the pandas defaults, we can use another library, seaborn, which is widely used in Python. It provides a high-level API for generating high-quality plots used in many domains, including statistics.

You can switch the plotting style from the regular pandas/Matplotlib defaults to seaborn's through seaborn's set function. Seaborn also provides a distplot function, which plots the distribution of the pandas Series passed to it and chooses the bins automatically, so we no longer need to worry about binning. (As of seaborn 0.11, distplot is deprecated in favor of histplot and displot, so newer versions emit a warning when it is called.) To generate a histogram through seaborn, we can pass the kde parameter as False to leave out the density curve:

import seaborn as sns

sns.set()

sns.distplot(sales['Gross profit'].dropna(), kde = False)

This gives the following output:

Figure 2.54: Histogram for Gross Profit through Seaborn

However, the real power of seaborn lies in its more advanced features, such as the pairplot function.

Note

You can have a look at some of the things you can do directly with seaborn at https://elitedatascience.com/python-seaborn-tutorial.

Visualization with Matplotlib

Matplotlib is the most widely used plotting library in Python. Originally developed to bring MATLAB-style plotting to open source Python, Matplotlib provides low-level features that can be added to plots made with other visualization libraries, such as pandas and seaborn, because their plotting functions are built on top of it.

To start using Matplotlib, we first import the matplotlib.pyplot module as plt. This plt module is the basis for generating figures in Matplotlib. Every time we want to change the plot we are looking at, we call functions defined on plt to modify it.

Note

You can have a look at some of the things you can do directly with Matplotlib at https://matplotlib.org/api/_as_gen/matplotlib.pyplot.html.
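To illustrate that pandas plots are built on Matplotlib, the following sketch checks that the object returned by a pandas plot call is a Matplotlib Axes, and then customizes it with Matplotlib methods. The numbers are made up, and the names revenue and ax are our own.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display; not needed in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Made-up yearly totals for illustration only
revenue = pd.Series(
    [120.0, 150.0, 170.0, 160.0],
    index=[2004, 2005, 2006, 2007],
    name="Revenue",
)

ax = revenue.plot(kind="line")  # pandas hands back a Matplotlib Axes object
print(isinstance(ax, plt.Axes))

# Because it is a Matplotlib Axes, Matplotlib methods apply to it directly
ax.set_title("Revenue by year (made-up data)")
ax.axhline(revenue.mean(), linestyle="--", color="gray")  # reference line at the mean
```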

The following is an example Matplotlib figure that illustrates the different parts of a plot:

Figure 2.55: Breaking down parts of a Matplotlib plot

Some of the functions we can call on this plt object for these options are as follows:

Figure 2.56: Functions that can be used on plt
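The table in Figure 2.56 is not reproduced here. As a hedged sketch, the snippet below uses a handful of pyplot functions that exist in Matplotlib (figure, plot, title, xlabel, ylabel, xticks, legend, and gca); they were chosen for illustration and are not necessarily the ones listed in the figure. The data is made up.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display; not needed in Jupyter
import matplotlib.pyplot as plt

# Made-up values for illustration only
years = [2004, 2005, 2006, 2007]
revenue = [120.0, 150.0, 170.0, 160.0]
planned = [125.0, 155.0, 180.0, 165.0]

plt.figure(figsize=(6, 4))                       # start a new figure of a given size
plt.plot(years, revenue, label="Revenue")        # first line
plt.plot(years, planned, label="Planned revenue")  # second line on the same axes
plt.title("Revenue vs. planned revenue")         # title above the plot
plt.xlabel("Year")                               # x-axis label
plt.ylabel("Amount")                             # y-axis label
plt.xticks(years)                                # one tick per year
plt.legend()                                     # legend built from the label arguments
current_ax = plt.gca()                           # the Axes that the calls above modified
```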

Note

A tutorial for Matplotlib is available at https://realpython.com/python-matplotlib-guide/.

Activity 2: Analyzing Advertisements

In this activity, we will wrap up our learning from the chapter and practice exploring the data, generating insights, and creating visualizations. Your company has recorded, in Advertising.csv, its advertisement views through different media along with the sales made on the same day. Read the file, have a look at the dataset, explore some of the features, analyze the relationships, and visualize some of the insights in the data to get a clearer understanding of it:

Open the Jupyter Notebook and load pandas and the visualization libraries that you will need.
Load the data into a pandas DataFrame named ads and look at the first few rows. Your DataFrame should look as follows:

Figure 2.57: The first few rows of Advertising.csv
Understand the distribution of the dataset using the describe function and filter out any irrelevant data.
Have a closer look at the spread of the features using the quantile function and generate relevant insights. You will get the following output:

Figure 2.58: The deciles of ads
Look at the histograms of individual features to understand the values better. You should get the following outputs:

Figure 2.59: Histogram of the TV feature

Figure 2.60: Histogram of the newspaper feature

Figure 2.61: Histogram of the radio feature

Figure 2.62: Histogram of the sales feature
Now identify the right attributes for analysis and the KPIs.
Create focused, more specific insights pertaining to the KPIs and create visualizations to explain relationships in the data. Understand the scope of the data being used and set expectations for further analysis.
Note
The solution for this activity can be found on page 329.
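The describe, quantile, and histogram steps above can be sketched as follows. This is not the book's solution: the DataFrame is filled with made-up numbers in place of Advertising.csv, and only the column names (TV, newspaper, radio, sales) follow the figure captions. The names summary, deciles, and tv_hist_ax are our own.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display; not needed in Jupyter
import pandas as pd

# Made-up numbers standing in for Advertising.csv; illustration only
ads = pd.DataFrame({
    "TV": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0],
    "newspaper": [5.0, 7.0, 3.0, 9.0, 4.0, 8.0, 6.0, 2.0, 10.0, 1.0],
    "radio": [12.0, 15.0, 11.0, 18.0, 14.0, 20.0, 16.0, 13.0, 19.0, 17.0],
    "sales": [3.0, 4.5, 5.0, 6.5, 7.0, 8.5, 9.0, 10.5, 11.0, 12.5],
})

# Count, mean, standard deviation, min, quartiles, and max for each column
summary = ads.describe()

# Deciles: the 10th, 20th, ..., 100th percentile of each column
deciles = ads.quantile([i / 10 for i in range(1, 11)])

# Histogram of a single feature (pandas uses 10 bins by default)
tv_hist_ax = ads["TV"].plot(kind="hist")
```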