Thursday, 5 July 2018

Regression Unplugged II


I assume by now we have a good idea of how regression works and what we are trying to do in the least squares method to calculate the intercept and slope. We discussed simple linear regression involving 2 variables, and there can be many aspects to looking at a prediction variable Y.

In the above example, let's assume we were predicting the number of users hitting the blog with respect to the day of the month, so we can predict the monthly usage and pitch to our sponsor. Now, there may be more than one factor affecting the prediction.

Y = No of users on blog  (Dependent variable / Target variable)
x1 = Day of month (Independent variable)

Y=  b0 + b1x1 + E

This is the more robust regression equation, where b0 represents the intercept, b1 the slope and E is the error. It is very similar to the one we calculated earlier (for the simple calculation case we did not consider the error).

For more than one variable affecting Y, the equation changes to

Y = b0 + b1x1 + b2x2 + b3x3 + b4x4 + b5x5 + E

x2 = No of new additions to blog
x3 = SEO rating of the blog
x4 = Positive comments on the blog
x5 = Followers ..
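To make the equation concrete, here is a minimal Python sketch. The coefficient values below are made up purely for illustration; real values would come from a least squares fit on actual blog data.

```python
# Hypothetical coefficients (b0 .. b5), for illustration only --
# real values would come from fitting least squares on actual data.
b = [120.0, 3.5, 8.0, 2.2, 1.5, 0.4]  # b[0] is the intercept b0

def predict_users(x1, x2, x3, x4, x5):
    """Y = b0 + b1*x1 + b2*x2 + b3*x3 + b4*x4 + b5*x5 (error term ignored)."""
    return b[0] + b[1]*x1 + b[2]*x2 + b[3]*x3 + b[4]*x4 + b[5]*x5

# Day 15 of the month, 2 new posts, SEO rating 7,
# 30 positive comments, 500 followers.
print(predict_users(15, 2, 7, 30, 500))
```

Each independent variable contributes its own term to the prediction; the coefficients b1..b5 scale how strongly each factor moves Y.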

Measure of your  Regression Algorithm

We know how to calculate the prediction of a dependent variable based on the independent variables. Now, how do we know how well our prediction worked? There are certain metrics to measure that:

1. Concept of R square and Adjusted R square:

The above diagram from Udemy will help us understand how R square is calculated and why this is the most important metric in regression.

R square is said to be the measure of success of a regression model: how much of the variability of Y with respect to x is explained by the model. The R square value lies between 0 and 1, and there is no universal threshold for what counts as a good R square for model application.

An R square value of 1 means the model explains all the variability of Y with x: every observed Y falls exactly on the fitted line.

Total variability = ∑ (y − y mean)²
Explained variability = ∑ (y' − y mean)²
Error = ∑ (y − y')² = ∑ e²

R-square = Explained variability / Total variability

(for our data set it was 0.97 , check Regression I)
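As a rough sketch in Python (the y and predicted y' values below, called y_hat here, are made up for illustration), the sums of squares and R square can be computed like this. The form 1 − Error/Total is used, which for a least squares fit with an intercept equals Explained/Total:

```python
# Toy data: actual y and model predictions y_hat (illustrative numbers only).
y     = [10, 12, 15, 18, 20]
y_hat = [10.5, 12.2, 14.6, 17.8, 19.9]

y_mean = sum(y) / len(y)
ss_total = sum((yi - y_mean) ** 2 for yi in y)                 # total variability
ss_error = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))     # sum of squared errors

r_squared = 1 - ss_error / ss_total
print(round(r_squared, 4))
```

The closer the predictions y_hat sit to the actual y values, the smaller the error sum and the closer R square gets to 1.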

Adjusted R square also indicates how well terms fit a curve or line, but adjusts for the number of terms in the model.

If you add more and more useless variables (for the blog, suppose you add the price of your laptop as independent variable x6) to a model, adjusted R square will decrease. If you add more useful variables, adjusted R square will increase.

Adjusted R square is also used as a measure to compare models.
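The standard adjustment formula penalises extra predictors, as in this small sketch (the R square and sample size below are illustrative):

```python
def adjusted_r2(r2, n, k):
    """Adjusted R square: n = number of observations,
    k = number of independent variables in the model."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Same raw R square of 0.97 on 30 observations, but the model with
# more predictors gets a lower adjusted R square.
print(adjusted_r2(0.97, 30, 2))   # few predictors
print(adjusted_r2(0.97, 30, 10))  # many predictors -> penalised harder
```

This is why adjusted R square works for comparing models: a useless extra variable barely raises R square but increases k, dragging the adjusted value down.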

2. P value

The third table in the previous blog gives the most important metrics. Now we'll discuss the t test and p value of X, how to read them and what they mean.

The p value is the probability of finding the observed results when the null hypothesis (H0) of a study question is true.
In our case the null hypothesis we have considered is b0 = 0, which means the line crosses the y axis at the origin. A p value of (nearly) 0 implies the observed data would be extremely unlikely if the hypothesis were true, so we reject it and stay with our predictions.

A small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, so you reject the null hypothesis. A large p-value (> 0.05) indicates weak evidence against the null hypothesis, so you fail to reject the null hypothesis.
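As a rough sketch of how a coefficient's p value comes about (the slope and standard error below are made-up numbers, and the normal distribution is used as a large-sample approximation to the t distribution, using only Python's standard library):

```python
from statistics import NormalDist

def approx_p_value(coef, se):
    """Two-sided p value for H0: coefficient = 0, using the normal
    approximation to the t distribution (reasonable for large samples)."""
    t = coef / se                      # the t statistic
    return 2 * (1 - NormalDist().cdf(abs(t)))

# Illustrative numbers: a slope estimate of 2.5 with standard error 0.8.
p = approx_p_value(2.5, 0.8)
print(p < 0.05)  # strong evidence against H0, so reject it
```

A large coefficient relative to its standard error gives a large t statistic, and hence a small p value; exact small-sample p values would use the t distribution with the appropriate degrees of freedom instead.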

We will have another write up for p values separately.

3. Things to watch out for while performing regression:


Non Linearity

y = b0 + b1x1 + b2x2

If the relation between y and x1, or y and x2, is not linear (as shown above), then linear regression is not appropriate. Alternatively, we can try logarithmic and exponential transformations to convert a non-linear relation into a linear one.

Missing Important Variable

Sometimes we fail to include an important variable and instead include a variable correlated to the one we excluded. This messes up our predictability and we fail to explain the results. The best rule is to try and include everything, then use backward elimination to exclude the less relevant ones.

Outliers Removing / Log transformations

The yellow point in our set is the outlier, which will influence our prediction, so we need to treat it.
Certain techniques are available to treat outliers; we'll discuss them in separate articles.
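One common first-pass technique (not necessarily the one we'll settle on later) is flagging points more than a couple of standard deviations from the mean; a minimal sketch with made-up daily visit counts:

```python
from statistics import mean, stdev

# Illustrative daily visit counts -- the 980 looks suspicious.
visits = [110, 115, 108, 120, 117, 112, 980]

mu, sigma = mean(visits), stdev(visits)
# Flag anything more than 2 standard deviations from the mean.
outliers = [v for v in visits if abs(v - mu) / sigma > 2]
print(outliers)  # → [980]
```

Note that an extreme outlier inflates both the mean and the standard deviation, which is one reason more robust methods (e.g. based on the median) are often preferred for real data.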

Auto correlation

It is generally observed in time series data and can't be avoided; we have to detect patterns and understand the nature of the autocorrelation. Instead of plain linear regression, we use alternative approaches to predict, like autoregressive or moving average models.

Multicollinearity

If there are 2 variables which are directly related, e.g. a = 5b + 6, then we have a fixed relation between a and b, and we can't use both in a model without the results getting impacted.

Some basic handling tricks are:
- Drop one of them
- Combine both
- Use both, but be careful
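A quick way to spot such a pair is to check their correlation; a minimal sketch using the a = 5b + 6 example from above:

```python
from statistics import mean

def pearson_corr(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs)
           * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

b_vals = [1, 2, 3, 4, 5]
a_vals = [5 * b + 6 for b in b_vals]  # a = 5b + 6, a fixed linear relation

print(round(pearson_corr(a_vals, b_vals), 4))  # → 1.0
```

A correlation of (nearly) ±1 between two independent variables is a red flag that one of the handling tricks above should be applied before fitting.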

So far we have discussed basic regression and the factors influencing the prediction. Going forward we'll pull real examples and try to solve some problems.

Let me know in comments if you find this useful.

