Chapter 5. Linear Regression with Python
If you have mastered the content of the last two chapters, implementing predictive models will be a cakewalk. Remember the 80-20 split between data cleaning and wrangling on the one hand and modelling on the other? Why, then, dedicate a full chapter to a single model? The reason is not about running a predictive model; it is about understanding the mathematics (algorithms) behind the ready-made methods that we will use to implement these models. It is also about interpreting the swathe of results these models produce and making sense of them in context. Thus, it is of utmost importance to understand both the mathematics behind the algorithms and the result parameters of these models.
From this chapter onwards, we will deal with one predictive modelling algorithm in each chapter. In this chapter, we will discuss a technique called linear regression. It is the most basic and generic technique for creating a predictive model from a historical dataset that contains an output variable.
The aim of this chapter is to thoroughly understand the mathematics behind linear regression and the results it generates by illustrating its implementation on various datasets. The broad agenda of this chapter is as follows:
- The maths behind linear regression: How does the model work? How is the equation of the model created based on the dataset? What are the assumptions for this calculation?
- Implementing linear regression with Python: There are a couple of ready-made methods to implement linear regression in Python. Instead of using these, one can write one's own Python code for the entire calculation with custom inputs. However, as linear regression is a regularly used algorithm, the ready-made methods are the common choice in practice; an implementation from scratch is generally used to illustrate the maths behind the algorithm.
- Making sense of result parameters: The model reports many result parameters, such as slope, coefficients, p-values, and so on. It is very important to understand what each parameter means and the range in which its value should lie for the model to be considered efficient.
- Model validation: Any predictive model needs to be validated. One common method of validation is splitting the available dataset into training and testing datasets, as discussed in the previous chapter. The training dataset is used to develop the model, while the testing dataset is used to compare the results predicted by the model with the actual values.
- Handling issues related to linear regression: Issues such as multi-collinearity, categorical variables, and non-linear relationships come up while implementing a linear regression; these need to be taken care of to ensure an efficient model.
Before we kick-start the chapter, let's discuss what a model means and entails. A mathematical/statistical/predictive model is nothing but a mathematical equation consisting of input variables, which yields an output when values of the input variables are provided. For example, let us, for a moment, assume that the price (P) of a house is linearly dependent upon its size (S), amenities (A), and availability of transport (T). The equation will look like this:
P = a1*S + a2*A + a3*T
This is called the model, and a1, a2, and a3 are called the variable coefficients. The variable P is the predicted output, while S, A, and T are the input variables. Here, S, A, and T are known but a1, a2, and a3 are not. These parameters are estimated using the historical input and output data. Once the values of these parameters are found, the equation (model) becomes ready for testing. Now, S, A, and T can be numerical, binary, categorical, and so on, and P can also be numerical, binary, or categorical. It is this need to tackle various types of variables that gives rise to a large number of models.
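To make the idea of estimating coefficients from historical data concrete, here is a minimal sketch in plain Python. It assumes a model of the form P = a1*S + a2*A + a3*T with no intercept term, and it uses ordinary least squares (solved through the normal equations) as the estimation method, which is one common choice rather than the only one. The dataset is invented purely for illustration: the prices are generated from known coefficients so that the estimate can be checked. The names `fit_coefficients`, `predict`, `TRUE_COEFFS`, `rows`, and `prices` are hypothetical and not part of any library.

```python
# Hypothetical illustration: the data below is invented, and the prices are
# generated without noise from known coefficients (a1=3.0, a2=2.0, a3=1.5).
TRUE_COEFFS = (3.0, 2.0, 1.5)

# Each row holds the inputs (S, A, T) for one house.
rows = [
    (10, 2, 1),
    (15, 3, 0),
    (8, 1, 1),
    (20, 4, 1),
    (12, 2, 0),
    (17, 3, 1),
]
prices = [sum(c * x for c, x in zip(TRUE_COEFFS, row)) for row in rows]


def fit_coefficients(rows, prices):
    """Estimate the coefficients of P = a1*S + a2*A + a3*T by least squares.

    Solves the normal equations (X^T X) a = X^T P by Gauss-Jordan elimination.
    """
    n = len(rows[0])
    # Build X^T X (n x n) and X^T P (length n).
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    xtp = [sum(r[i] * p for r, p in zip(rows, prices)) for i in range(n)]

    # Augmented matrix [X^T X | X^T P].
    aug = [xtx[i] + [xtp[i]] for i in range(n)]
    for col in range(n):
        # Partial pivoting: move the row with the largest entry into place.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    # The matrix is now diagonal, so each coefficient is a simple ratio.
    return [aug[i][n] / aug[i][i] for i in range(n)]


def predict(coeffs, inputs):
    """Apply the fitted model to one set of inputs (S, A, T)."""
    return sum(c * x for c, x in zip(coeffs, inputs))


coeffs = fit_coefficients(rows, prices)
print("Estimated a1, a2, a3:", [round(c, 4) for c in coeffs])
print("Predicted price for (S=14, A=2, T=1):", round(predict(coeffs, (14, 2, 1)), 4))
```

Because the invented prices contain no noise, the estimated coefficients match the ones used to generate the data up to floating-point error. With real historical data the fit would only be approximate, which is why the result parameters and validation steps listed in the agenda matter.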