Multiple Linear Regression – Introduction

Multiple Linear Regression (MLR) is an extension of simple linear regression, fitted by the Ordinary Least Squares (OLS) method, that builds a model with several independent variables and uses it to predict the outcome. For example, in the last regression blog we saw the effect of price on sales of soaps; here we will see how sales is affected by price together with other variables such as discount, and how to predict with more than one variable.

In the previous blog you saw how regression is done, the important terminologies and their interpretation, how to build a model, and how to use it for prediction. All of that was univariate linear regression: one dependent and one independent variable. In this blog you will see how to do regression when there is more than one independent variable, i.e. Multiple Linear Regression.

In the last example we looked at sales of soaps, but in real life that is not how sales happen in the retail FMCG space. Companies send sales representatives to retailers, and the retailers' decisions depend on many factors such as the price offered, the commission or discount given on the sales value, rent or shelf-space fees, insurance against defective goods, inventory, and more. We treat these as variables which may affect the dependent variable to a large extent, to a small extent, or not at all. For a better understanding of the subject we will work with one dependent and two independent variables.

Click here for the working file.

Joe the sales representative gives the detail of the 15 deals done by him with retailers of his area to his boss. The data is present in the working file.

Dependent Variable – Soap Cartons, the number of boxes of soaps sold.

Independent Variables – (i) Price per box and (ii) Discount on selling price given to retailers for that deal.

How to do Multiple Linear Regression in Excel

There are many ways of doing regression, as described in detail in the previous blog; we shall use Excel's Data Analysis tool as it takes less time. Go to Data – Data Analysis – Regression. Choose the dependent variable for the Y-Range and both independent variables for the X-Range. For detailed steps, refer to the previous blog.
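For readers who want to see the same computation outside Excel, ordinary least squares with two independent variables takes only a few lines of NumPy. The numbers below are illustrative stand-ins for Joe's 15 deals (the real figures are in the working file), generated around a model of the same shape as the one fitted later in this post:

```python
import numpy as np

# Illustrative stand-in data for Joe's 15 deals (not the working-file
# numbers): X1 = price per box, X2 = discount as a fraction of price.
rng = np.random.default_rng(0)
price = np.array([140, 145, 150, 152, 155, 158, 160, 162,
                  165, 168, 170, 172, 175, 178, 180], dtype=float)
discount = np.array([0.10, 0.12, 0.15, 0.18, 0.20, 0.22, 0.25, 0.20,
                     0.28, 0.30, 0.25, 0.32, 0.35, 0.30, 0.38])
sales = 366.65 - 1.27 * price + 100.19 * discount + rng.normal(0, 5, 15)

# Design matrix with an intercept column, then ordinary least squares --
# the same computation Excel's Data Analysis Regression performs.
X = np.column_stack([np.ones_like(price), price, discount])
coef, *_ = np.linalg.lstsq(X, sales, rcond=None)
intercept, b_price, b_discount = coef
print(intercept, b_price, b_discount)
```

This is only a sketch of the mechanics; in Excel the same coefficients appear in the regression output's coefficients table.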

After we get the results, we need to check some values in the output in the following manner –

  1. Standard Error value.
  2. Residuals – remove the outliers and re-run the regression.
  3. Significance F statistic.
  4. p-values – if any value is greater than the significance level, remove that independent variable and re-run the regression.
  5. R² value.

Standard Error

Let’s understand standard error a little better before checking the points above. The previous blog briefly explained how to find an outlier using the Standard Error. The standard error measures how well the regression model represents the data: it is effectively the average distance of the observations from the line of best fit. Another way to picture it is to draw a line parallel to the line of best fit at a distance equal to the Standard Error. The smaller the Standard Error, the closer that parallel line is to the line of best fit, the better the model represents the data, and the better the prediction will be.
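The quantity Excel reports as "Standard Error" is the square root of the sum of squared residuals divided by the degrees of freedom, n − k − 1 (observations minus independent variables minus one). A minimal sketch with hypothetical actual and predicted values, not the working-file data:

```python
import numpy as np

# Hypothetical actual vs. predicted values for a 15-deal data set with
# k = 2 independent variables (illustrative, not from the working file).
actual = np.array([210., 195., 180., 175., 168., 160., 155., 158.,
                   140., 135., 142., 128., 120., 125., 150.])
predicted = np.array([205., 198., 182., 172., 165., 163., 152., 156.,
                      143., 137., 139., 131., 122., 123., 128.])

residuals = actual - predicted
n, k = len(actual), 2                        # observations, predictors
sse = np.sum(residuals ** 2)                 # sum of squared errors
standard_error = np.sqrt(sse / (n - k - 1))  # Excel's "Standard Error"
print(round(standard_error, 2))
```

The division by n − k − 1 rather than n is why adding more independent variables does not automatically reduce the Standard Error.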

Checking the Outlier

To check for outliers, we divide the absolute residual value by the standard error. The absolute value is taken only to remove the negative signs from the residuals. Residuals are the differences between the actual and predicted values of the dependent variable. Since we used a 95% confidence level for the regression, observations whose ratio exceeds about 2 are flagged as potential outliers.

We see that the 15th value is greater than 2, but not by a very large margin. It is up to the manager to decide whether to keep it or to delete it and run the regression again. For now, let’s move ahead to the remaining checks and decide at the end whether to keep or remove it.
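The outlier screen described above can be sketched in a few lines. The residuals below are illustrative (not the working-file values), chosen so that only the 15th observation exceeds the threshold of 2:

```python
import numpy as np

# Illustrative residuals for 15 observations; the standard error is
# taken as given from the regression output.
residuals = np.array([5., -3., -2., 3., 3., -3., 3., 2.,
                      -3., -2., 3., -3., -2., 2., 22.])
standard_error = 7.08

# Standardized residual = |residual| / standard error; values beyond
# about 2 (at a 95% confidence level) are flagged as potential outliers.
standardized = np.abs(residuals) / standard_error
outliers = np.where(standardized > 2)[0] + 1   # 1-based observation numbers
print(outliers)   # flags observation 15
```

After flagging, the analyst decides whether to drop the observation and re-run the regression, as done in this post.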

Significance F

The Significance F value of 2.24123816694146E-10 (2.24 × 10⁻¹⁰) looks very good. It can be interpreted as: if we ran the regression 10¹⁰ times, the observed relationship would appear by mere chance only about 2 times.

P Value

The p-values of the independent variables are also less than 0.05, which means that both independent variables have a significant relationship with the dependent variable.

R square

R² and adjusted R² values of about 97% are on the higher side, which shows that the model should be highly accurate and a much better alternative than using the average of past sales for future prediction.
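Adjusted R² penalises each extra independent variable, which is why it is the better figure to compare across the three regressions later in this post. A small sketch of the formula, using an illustrative R² of 97.53% (an assumption, since the output's raw R² is not quoted in the text) together with n = 15 observations and k = 2 predictors:

```python
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)
n, k = 15, 2      # observations, independent variables
r2 = 0.9753       # illustrative R^2 (assumed, not from the output)

adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(adj_r2, 4))   # -> 0.9712, i.e. 97.12%
```

With these assumed inputs the formula lands on 97.12%, the adjusted R² reported for Regression_1 in the summary table below, which shows how close R² and adjusted R² stay when only two predictors are used.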

All seems good except the Standard Error of 12.55, which is very high. A high value questions the validity of the regression model. To improve the model and reduce the error, let’s treat the 15th observation as an outlier and re-run the regression.

In the second regression output we see that the 13th value looks like an outlier, while everything else is much like Regression_1. The Standard Error has reduced to about 10, but it is still high. So let’s remove the 13th value and run the regression again.

In Regression_3 we see that the Standard Error has reduced to 7.45 and there are no outliers. The other values are also good. Let’s summarise the results for comparison –

                  Regression_1     Regression_2     Regression_3
Standard Error    12.55            9.99             7.45
Significance F    2.24 × 10⁻¹⁰     1.89 × 10⁻¹⁰     1.34 × 10⁻¹⁰
p-value (X1)      0.0001           1.89 × 10⁻⁵      2.18 × 10⁻⁶
p-value (X2)      0.046            0.038            0.028
Adjusted R²       97.12%           97.98%           98.72%

We see that in all aspects, Regression_3 is better than the others. So the final regression model for the data will be –

Y = 366.65-1.27*X1+100.19*X2

Y – Number of soap cartons sold

X1 – Price per box

X2 – Discount on sales price (as a fraction, e.g. 30% = 0.30)

To understand the interpretation of the terminologies used in regression, read here.

In Regression_3, the Standard Error comes down to 7.45, which many would still not consider a very good fit. But note that we used only 13–15 data points for the regression. We really need more past data to build a near-perfect regression model; in real life, hundreds or thousands of observations go into the regression process. With a larger data set and more relevant independent variables, the Standard Error would most likely have come down further. We will see regression with larger data sets in future blogs.

Let’s predict with this model. What will be the size of the deal when the price per box is 160 and a discount of 30% on the selling price is offered to the retailer?

X1 – 160

X2 – 30

Putting the values into the regression equation, we get the value of Y as –

Y = 366.65-1.27*160+100.19*30%

Y = 193.54.

So, it can be predicted that when the price per box is 160 and a 30% discount on the selling price is offered, about 194 soap cartons can be sold. The sales manager can predict more values by changing X1 and X2 and decide on his price offering to achieve the required number of sales. Without regression, the manager might have relied on the average of past sales; with regression, a lot of prediction error is removed and the efficiency of decision-making increases.
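The final model lends itself to a small prediction function for "what-if" runs like the one above. With the rounded coefficients quoted in the text the result is about 193.5 (tiny differences from 193.54 come only from coefficient rounding), which still rounds to 194 cartons:

```python
def predict_cartons(price_per_box: float, discount: float) -> float:
    """Predict soap cartons sold using the Regression_3 model.

    discount is a fraction of the selling price (30% -> 0.30).
    Coefficients are the rounded values quoted in the post.
    """
    return 366.65 - 1.27 * price_per_box + 100.19 * discount

# The deal discussed above: price 160 per box, 30% discount.
y = predict_cartons(160, 0.30)
print(round(y))   # about 194 cartons
```

The manager can loop this over a grid of prices and discounts to find the combination that hits a sales target.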

There are a few issues in the interpretation of multiple linear regression, which will be discussed in further blogs.