Uncategorized

REGRESSION – 4 WAYS OF DOING LINEAR REGRESSION IN EXCEL AND RESULT INTERPRETATION

Regression or linear regression or multiple linear regression are one of the most important statistical modeling. If you are looking to understand the meaning of it then you are in the right place.

Regression equations and models
equations and models

You might have heard about regression, read it in college might have understood the theory. But in this blog, you will get to know what regression is all about in professional world, how to do Regression Analysis in 4 different ways through a real-life example, what all the different terms mean and how to interpret it with the real-life example.

All the big companies and new age companies use analytics to make data-based decision making. Regression Analysis is one of the most frequently used statistical method. The more you get used to implement the regression in real life practical scenario, the better are the chances for you to prove yourself in Job and interview in those companies.

Regression Analysis shows how the variables are in relationship numerically. Preliminary understanding of the subject you can refer to my Blog.

Regression practical applications

What do we do with regression? Here are few prominent things that you can do with regression –

  • Establish relationship between different variables from your data, or to know if there is any relationship with the variable in your data and how they affect the relationship. For example, you can know if the no of hours your exercise and the amount of food you eat does really helps to decrease your weight. If these variables really affect your weight, then which variable to what extent affects your body weight.

  • Predict future results with your present and future data. If you want to know what your weight will be in near future with the other variables like exercise hours, amount of food in-taken etc. then you can apply regression techniques.

Practical Examples from various fields

Apart from the above, most important is that you will get insight on each variable, and how they affect each other. Based upon that you can decide how to work on that. Here are few examples on the approach of regression to different problems on the industries –

Sales Management

  • A sales person can use regression to understand i) different factors that affects his sales numbers. ii) How these factors affect the sales. Iii) Which factors affects the most and which the least. iv) Which factors are redundant. v) Future prediction or forecasting of sales. And many more.

Marketing Management

  • A marketing person can i) establish relationship between the marketing budget and objectives. ii) when to do promotion. iii) What factors to consider to attract the target audience. iv) What should be the discount and price of a product. v) Forecast the old and new product demand. And many more.

Operations Management

  • Operations manager can i) optimize the whole process of work in the workplace. ii) Control cost after knowing different processes where cost can be reduced. iii) Decide on the number of people required for maximum productivity. And many more.

HR or Human Resource Management

  • HR manager can i) predict if a person will join the organization or not. ii) make performance measuring model for employees and track performance. iii) Training effectiveness measurement of employees. iv) Employee attrition. And many more.

Regression helps to build a model and reduce the error from the data and give you possible answers with reduced uncertainty.

Types of Regression

There are different types of regression and let’s see the important ones which are mostly used before delving into it further. So, the most used Regression are –

  • Linear Regression or Ordinary Least Square (OLS), further can be divided into
    • Simple linear regression
    • Multiple linear regression
  • Non-Linear Regression for example Logistic Regression

There are many more, but we will focus on these two as these are widely used and most of the analysis can be taken care of by these two. Let us start with Linear Regression.

We will see the simplest of the regression – Simple Linear Regression or Univariate Linear Regression. One Dependent Variable and one Independent Variable is used to find out the relationship in between them. The model developed can be used to predict the dependent variable also. But this analysis has very less significance in the real-life world problem as dependent variables are affected by more than one variable. However, this is the starting step to understand the world of regression and variables.

First step should be to check if the data are linear or not. The best way is to check is through the scatter plot discussed ahead.

Case for Regression

Let us take example of sales of soap and its relationship to the per unit price in the given working data sheet. We generally have the notion that price and quantity have inverse relation i.e. one increases, other decreases. We will see the same by using regression, and shall also see how much does the price impact the sales.

Please click here to access the working file

Joe is a sales executive in a FMCG company and he gets the target to sell soaps in lot to his distributors. The data shows the no of soaps sold in 15 days. The prices were different for different days with premonition to achieve the sales target, given by Joe’s boss. Now the Joe can use this data to predict the sales of soap for next few days using these old data. So, our objective is to determine a regression model to establish the relationship of the price and sales, and predict the future sales with respect to the price. Sales is considered as the Dependent Variable, and Price per unit as Independent Variable. Let’s see how it is done in different ways.

There are different ways to Regression in MS-Excel, some of the ways are shown below –

  1. Scatter Plot
  2. Manual Way
  3. Data Analysis
  4. Solver Technique

SCATTER PLOT METHOD –

This is the simplest way of getting a regression equation. This gives you visual representation of the data, if it is linear or not. and after that you derive the equation from the Excel menus.

First Step

Keep no of soaps sold in the Y-axis and Price per unit in the X-axis. After selection go to Insert and select the scatter as shown in the figure.

select the scatter as shown in the figure.
First Step

Second Step

You will get the scatter plot on the screen. In the Design, go to Add Chart Element, then to Trendline and then finally to More Trendline Options.

In the Design, go to Add Chart Element, then to Trendline and then finally to More Trendline Options.
Second Step

Third Step

You will see Format Trendline window on the right side of the sheet. Select the two options Display Equation on Chat and Display R-Squared Value on Chart. You will see that an equation and R2 values displaying on the scatter plot along with the dotted trend line. The trend line which is most suited is Linear type, chosen automatically by excel. Even looking after the scatter plot, one can make it out that the relationship is linear and a straight line is the best suited trend line or goodness of fit line. One can choose other trendlines such as exponential, logarithmic, polynomial etc. as shown and see that those lines do not fit the scatter plot.

The other way to check is by changing the trend lines, the R2 value also reduces. Usually higher the R2 value, the better is the model. R2 is explained at later part of this article.

The Equation of the Regression model is
Third Step

Regression Equation

The Equation of the Regression model is

Y = -7.2308*X + 51.215

We will explain all the terms after we finish with different ways of doing regression.

MANUAL METHOD –

This is similar to the process which is used in school, put the data in the formulas and get the results. This is one of the ways to learn the basic of each and every element in regression. This process is explained in detail and in lengthy way to grasp the basic.

Find the average of X and Y as shown
Manual Way of Getting Regression Equation in Excel

First Step

Find the average of X and Y as shown

Second Step

Find the Deviation from average of both X and Y. i.e. difference between each value of X and Y from the corresponding average value.

Third Step

Find the Variance of both X and Y. i.e. square of Deviation.

Fourth Step

Find the Covariance. i.e. Product of Deviation of X and Y.

Fifth Step

Find the value of m from the formula

m = Sum of Covariance of X and Y / Sum of variance of X

The value of m comes as -7.23

Sixth Step – Similarly find the value of C

C = Yaverage – (m * Xaverage)

The value of the C comes as 51.21

Regression Equation

So, the regression equation comes out as –

Y = -7.23*X + 51.21

Some further explanation

Deviation from average is nothing but the difference between the actual individual value and the average of that group of values. It is error value from the average. As per the word, it shows how much the individual values are deviating from the average. Out of the fifteen deals, the average no of soaps sold per deal was 16.26, but actual no of soaps sold in each case were different from the mean and that differences is Deviation. All the individual deviation values will give improper idea of how the numbers are dispersed or spread, so we need average of these Deviations to know how dispersed the data is to know if the central tendency really represents the whole data.

The deviation values can be positive as well as negative as individual values are sometimes higher and sometimes lower than the mean. The sum of total deviation will always be 0. That means that we cannot find out average deviation values to make an estimate of variance.

As sum of deviation is 0, we cannot find the average of deviation, hence we use other way to find it out. We square the individual Deviation, and make the difference values positive. This squared value of Deviation from Mean is called Variance. Variance can be used to find Standard Deviation, but we will restrict to go to Standard Deviation here.

After Variance is found, we find the Covariance as given in the fourth step. Sum of Covariance tells how the two variables varies together. Covariance only indicated the direction of linear relationship between the variables. In this case, the Covariance of -78.33 means that the no of soaps sold and Price per unit varies opposite to each other, i.e. one increases, other decreases. After knowing the Covariance, we find the Slope (m) and Constant (C) by the formulas shown in Fifth and Sixth Step.

The values for the linear equation come out as Y = -7.23*X + 51.21

DATA ANALYSIS METHOD

First Step

Go to DATAData AnalysisRegression

Go to DATA – Data Analysis – Regression
First Step

Second Step

Select no of soaps sold in the Y-range and Price per unit in the X-Range.

Choose levels if while selecting the range you have also included the headings from the X range and Y range.

Select the area where you want the regression result to appear, it can be in the same sheet or in different work sheet. We have chosen different worksheet. You may choose other components such as residual values to appear in the regression result.

Choose no of soaps sold in the Y-range and Price per unit in the X-Range.
Second Step

Third Step

You will get this result in different work sheet.

The equation of the regression model can be seen from the Co-efficient marked in red rectangle
Third Step

The regression model is made from the Co-efficients marked in red rectangle –

Y = – 7.23*X + 51.21

We will explain all the terms marked in red, after we finish with different ways of doing regression.

SOLVER TECHNIQUE –

This is bit of complicated technique, although it will give the result for simple regression like considered in this example of soaps, but it is more advisable to use when there are a lot of variables (usually more than 10 to 16). However, just to make you more familiar with how this technique is used, it is shown as below. More details is in the next blog related to Multiple Linear Regression.

We know that liner regression equations are in the form of Y = m*X +C

where Y is the dependent variable,

X the independent variable, and

C is the constant or intercept value of the Y axis. Or

No of soaps sold = (m * Price per unit) + Constant

Earlier we used to find the value of m and the constant, but in this method, we will assume the values beforehand. These assumed values are used in the above equation to find the predicted value of the no of soaps sold. Here we assume the values as 1 for both. Then find the difference between the actual soaps sold and the predicted values of soaps sold

Here we assume the values as 1 for both. Then find the difference between the actual soaps sold and the predicted values of soaps sold
Pre-Step_1

Square the value and then use solver to minimize the sum of the square value to get the actual values of m and C

Square the value and then use solver to minimize the sum of the square value to get the actual values of m and C
Pre_Step_2

First Step

After finishing the above steps, go to Data – Solver. You will get the below screen where you need to select in the following manner –

After finishing the above steps, go to Data – Solver. You will get the below screen where you need to select in the following manner
First Step
  • In set objective box, select the cell containing value of the Sum of the square of the difference between the actual and predicted value. Highlighted in Red.
  • Tick the Min i.e. minimize. Done to minimize the value of the above value selected.
  • Click Ok.

Second Step

You will see that the values of Constant, Beta, predicted value of soaps sold, Difference between the predicted and Actual and the Sum of the difference of the square, changes. This means that the model is optimized.

You will see that the values of Constant, Beta, predicted value of soaps sold, Difference between the predicted and Actual and the Sum of the difference of the square, changes. This means that the model has been optimized.
Second Step

Regression Equation

From the final optimized values of Constant and m, the equation formed is

Y = 51.215 – 7.23 * X

We will explain it in more detail in blog of Multiple Linear Regression.

Interpretation of the Regression

In all the above steps we get the same equation for the regression model

Y = 51.215 – 7.23 * X

This equation can be interpreted as

No of soaps sold = (m * Price per unit) + Constant

Y – No of soaps sold, which is the dependent variable

X – Price per unit of the soap, which is the Independent variable

M – Slope = -7.23 which is negative. This shows that the price and no of soaps sold are inversely related. Or we can say that as price increases, the no of soaps sales, decreases. To be more precise we can say that as we increase the price by 1 unit, the no of sales will decrease by about 7 units. C – Constant = 51.215. it is actually the intercept on the Y-Axis, or the no of units of soaps sold when price is 0. When you extend the trend line in the graph, it will cut the y axis at 51.215. This means when the price is 0, about 51 units of soap will be sold.

This shows that the price and no of soaps sold are inversely related. Or we can say that as price increases, the no of soaps sales, decreases. To be more precise we can say that as we increase the price by 1 unit, the no of sales will decrease by about 7 units.

It does not make sense that when we will distribute the soap for free, only about 51 units will be taken. But this is actually the value based on the mathematical calculation done by the software on the data collected. It can also be seen as the demand of that particular soap in the market; however, this cannot be true most of the time.

Important terminologies and their interpretations –

R2

We mentioned about it in the Scatter plot section. It gives the information if the regression model is reliable or not. It tells how strong is the relationship. To understand this, we need to go to the Manual Method of regression. Suppose we do not have the price and only have the data of no of soaps sold. How will you predict the future sales? In that case mostly Mean is considered. Or that average no of soaps sold in the future every day shall be about 16. However, we know that in next days’ deal, the probability of selling exactly 16 no of soaps is very low. The variance of Y or the no of soaps sold is 766.93

To check if average can be used in place of Regression, we use the regression equation Y = 51.215 – 7.23 * X for the given value of price. We call it the predicted value of sales of no of soaps and indicate it as Yhat.

Deviance and Residual

Error from average is called Deviance, which is squared to get the Variance.

Y-Yavg = Deviance; Deviance2 = Variance

Variance can be called as Error before using Regression or using Average as predictor.

And Error from predicted value is called Residual, we square the residual values.

Y-Ypredicted = Residual; Residual2

Residual2 can be called as error after using regression.

both Variance and Residual gives squared distance between the values and the actual values. Or both gives squared distance between the actual values and the other calculated values of Y.
The vertical distance between the data points in red and the trend line is the error which should be less for a better regression model

Practical inference of Variance and Residual

So, both Variance and Residual gives squared distance between the values and the actual values. Or both gives squared distance between the actual values and the other calculated values of Y. We see that the value of residual here is 200.52. Which means that the residual error is much lesser than Variance error. Hence, we can say that instead of using Average as predictor, Regression model is a better predictor, and it reduces the error which is measured by R2 or Proportional Reduction in Error.

R2 = (Error Before Regression – Error after Regression) / Error before regression

OR

R2 = (Variance of Y – Residual square of Y) / Variance of Y

So R2 tells us how much the error has been reduced using Regression over Mean or average. In this example the R2 value of 0.7385 tells us that the error in the predicted values of no of soaps sold is reduced by 73.85% and regression model is better than using average of no of soaps sold. Or we can say in other words that the predicted values of Y or the no of Soaps sold will be 73.85% more accurate using regression equation than using Average or mean of the past values of soaps sold.

Usually, higher the R2 value better is the regression model. But R2 is a relative value and does not give an absolute value of error. So sometimes it is not considered as a reliable source to understand if the model is a good fit or not. We have one more parameter that is more preferred to check if the model is a good fit or not and that is called standard error.

Standard Error

It tells if the regression model is valid or not. if you remember Standard Deviation, then it gives a position of the data from its central tendency values. Similarly, standard error in regression gives an average distance that the observed values are away from the regression line or the trend line. Unlike R2, SE gives exact value hence is more preferred value to consider if the regression model is good or not. The lower the value, better the regression is, as it gives exact average value of the error.

In this example, the SE value of 3.927 means that all the no values of the no of soaps sold are at an average distance of 3.927 away from the trend line. More easy way to imagine is think of an imaginary line drawn at a distance of 3.927 away from the trend line. That imaginary line is same as the trend line, but at a distance of 3.927 which contains all the actual values of the no of soaps sold.

That imaginary line is same as the trend line, but at a distance of 3.927 which contains all the actual values of the no of soaps sold.
Trend Line, Sales of Soap, Predicted sales of soap

How to use Standard Error to remove the outlier values will be discussed in another blog. Just an introduction on how to do it – Take the absolute value of residuals (Residuals are the difference between the actual value and the predicted value, as discussed in above sections) of each value and divide each by standard error. If any of the value is significantly larger than 2, then that value should be considered as an outlier and should be removed from the data set and the regression should be done again.

If any of the value is significantly larger than 2, then that value should be considered as an outlier and should be removed from the data set and the regression should be done again.
Use of Standard Error

Significant f

Significant f Should be less than 0.05 in case Confidence Interval is 95%. If these criteria are fulfilled then it means that the regression model is OK to use. Other way of looking into it is that Significant f gives the probability of failure of the regression model. In this example the value of 4.03*10-5 means that if this model is run 105 times, this model will fail only 4 times.

p Value

p value of Independent Variables – Should be less than 0.05 in case Confidence Interval is 95%. It gives significance level of each independent variable to the dependent variable. If the value is less than the significance interval (i.e. less than 0.05) then there is a significance correlation between the Independent and the dependent variable, and changes in the independent variable are associated with the changes in the dependent variable.

Adjusted R2

Adjusted R2 is more significant in multinomial linear regression. In univariate linear regression, the R2 and adjusted R2 are nearly or exactly equal. In case of multivariate linear regression as one increases the number of independent variables, the value of R2 also increases irrespective of relationship they have with the dependent variable. This process may increase the R2 value and someone will be under the impression that as it is high so the model is reliable. But on the contrary it does not give the measure of goodness of fit statistic. Adjusted R2 provides an adjustment to the R2, such that an independent variable that has a correlation to Y increases adjusted R-squared and any variable without a strong correlation will make adjusted R-squared decrease. That is the desired property of a goodness-of-fit statistic.

How to Predict using Regression model?

Now one of the most important part comes how you can predict the future outcome with regression. For this, you just need to put in the values of X to get the values of Y in the regression equation. Suppose, Joe wants to know how many units of soap he will be able to sell if the price per unit is kept at 5.7? Put the value of X as 5.7 in the Regression equation Y = 51.215 – 7.23 * X

So the no of soap sales will be 9.99. Or we can say that sales of 10 units of soaps can happen if price per unit is 5.7. But there are different variables apart from price that can impacts sales, not taken into consideration here. This is in the next blog of Multiple Linear Regression.

Before moving to another blog regarding multiple linear regression, one must note the below steps to follow while doing a regression analysis.

Check the following –

  1. Residuals – remove the outliers and re-run the regression.
  2. Significant f statistics.
  3. p values – if any value is less than significance interval, remove that independent variable and re run the regression.
  4. Then check R2

Next blog

In the next blog we will learn about Multiple Linear Regression with a real-life example. We will see how the relations is establish within one dependent variable and many independent variables, measured and interpreted.

One of the ways of looking at multiple linear regression is to look it as a part of simple linear regression. MLR is SLR with only one independent variable and one dependent variable, better R2, standard error and other important terms of measurement of model of regression.