## Regression Unplugged II

**(Continued from Regression Unplugged I)**

By now we have a good idea of how regression works and what the least-squares method does to estimate the coefficients (intercept and slope). We discussed simple linear regression, which involves two variables, but there can be many factors influencing a prediction variable Y.

In the earlier example, assume we were predicting the number of users hitting the blog against the day of the month, so we could forecast monthly usage and pitch it to our sponsor. In practice, more than one factor may affect the prediction.

Y = No of users in blog (Dependent variable / Target Variable)

x1 = Day of month (Independent variable)

Y= b0 + b1x1 + E

This is a more robust regression equation, where b0 represents the intercept, b1 the slope and E the error, very similar to the one we calculated earlier (for that simple calculation we did not consider the error).

When more than one variable affects the prediction, the equation extends to:

Y = b0 + b1x1 + b2x2 + b3x3 + b4x4 + b5x5 + E

x2 = No of new additions to blog

x3 = SEO rating of the blog

x4 = Positive comments on the blog

x5 = Followers ..
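To make the multiple-regression equation concrete, here is a minimal sketch that estimates b0 through b4 with ordinary least squares in numpy. All the numbers below are made up purely for illustration; they are not real blog data.

```python
import numpy as np

# Hypothetical daily blog data (all numbers invented for illustration):
# x1 = day of month, x2 = new posts, x3 = SEO rating, x4 = positive comments
X = np.array([
    [1,  2, 6.0, 10],
    [5,  1, 6.2, 14],
    [10, 3, 6.5, 20],
    [15, 2, 6.8, 23],
    [20, 4, 7.0, 30],
    [25, 3, 7.1, 33],
], dtype=float)
y = np.array([120, 150, 210, 240, 300, 330], dtype=float)  # users per day

# Prepend a column of ones so b0 (the intercept) is estimated too:
# Y = b0 + b1*x1 + b2*x2 + b3*x3 + b4*x4 + E
X_design = np.column_stack([np.ones(len(X)), X])

# Least-squares solution for the coefficient vector [b0, b1, b2, b3, b4]
coeffs, residuals, rank, _ = np.linalg.lstsq(X_design, y, rcond=None)

y_hat = X_design @ coeffs  # fitted values
print(coeffs)
```

The same fit could be done with scikit-learn's `LinearRegression`; the plain numpy version just makes the design matrix and the coefficient vector explicit.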

**Measure of your Regression Algorithm**

We know how to predict a dependent variable from the independents. Now, how do we know how good our prediction is? There are certain metrics to measure that:

**1. Concept of R square and Adjusted R square:**

R square is said to be the measure of success of a regression model: it tells us how much of the variability of Y with respect to X is explained by the model, and it is arguably the most important metric in regression.

The R square value lies between 0 and 1, and there is no universal threshold for what counts as a good R square; it depends on the domain and the application.

An R square of 1 means the model explains all the variability of Y with X: every observation falls exactly on the fitted line.

**Total variability = ∑ (y − y mean)²**

**Explained variability = ∑ (y' − y mean)²**

**Error = ∑ e² = ∑ (y − y')²**

**R-square = Explained variability / Total variability**

**(for our data set it was 0.97, see Regression Unplugged I)**
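The R-square formulas above can be checked by hand on a tiny example. The actual values `y` and predictions `y_hat` below are made-up numbers, chosen only so the sums are easy to follow; since these predictions are hand-picked rather than a least-squares fit, we compute R square as 1 − SSE/SST.

```python
import numpy as np

# Illustrative actual values y and model predictions y_hat (made-up numbers)
y     = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_hat = np.array([3.2, 4.8, 7.1, 8.9, 11.0])

y_mean = y.mean()
sst = np.sum((y - y_mean) ** 2)      # total variability
ssr = np.sum((y_hat - y_mean) ** 2)  # explained variability
sse = np.sum((y - y_hat) ** 2)       # error (residual) variability

# For an OLS fit with an intercept, ssr/sst and 1 - sse/sst coincide;
# for arbitrary predictions like these, use the residual form:
r_square = 1 - sse / sst
print(round(r_square, 4))  # → 0.9975
```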

**Adjusted R square:** Adjusted R square also indicates how well the terms fit a curve or line, but it adjusts for the number of terms in the model.

If you add more and more useless variables to a model (for the blog, suppose you add the price of your laptop as an independent variable x5), adjusted R square will decrease. If you add more useful variables, adjusted R square will increase.

Adjusted R square is therefore also used as a measure to compare models.
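The adjustment uses the standard formula, which penalises each extra predictor. A small sketch, using our R square of 0.97 with an assumed sample size of 30 observations:

```python
# Adjusted R-square penalises extra predictors:
#   adj_R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1)
# where n = number of observations, k = number of independent variables
def adjusted_r_square(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Same R-square of 0.97, but more predictors => lower adjusted R-square
print(round(adjusted_r_square(0.97, 30, 1), 4))  # → 0.9689
print(round(adjusted_r_square(0.97, 30, 4), 4))  # → 0.9652
```

This is why adjusted R square is the fairer number when comparing models with different numbers of variables.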

**2. P value**

The third table in the previous blog gives the most important metrics; here we will discuss the t-test and the p value of X, how to read them and what they mean.

The p value is the probability of finding the observed results when the null hypothesis (H0) of a study question is true.

In our case the null hypothesis we considered is b0 = 0, which means the line crosses the y axis at the origin. A p value of (nearly) 0 implies that the null hypothesis is very unlikely to be true, so we reject it and stay with our predictions.

A small p-value (typically ≤ 0.05) indicates
strong evidence against the null hypothesis, so you reject the null hypothesis.
A large p-value (> 0.05) indicates weak evidence against the null
hypothesis, so you fail to reject the null hypothesis.
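One simple way to make "probability of the observed result under the null" concrete is a permutation test. This is not the t-test that regression summaries report, just an intuition-building sketch on made-up data: we shuffle y many times to simulate "no relation" and count how often a correlation as strong as the observed one appears by chance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up sample: days (x) and users (y) with a clear upward trend
x = np.arange(1, 21, dtype=float)
y = 100 + 5 * x + rng.normal(0, 5, size=20)

observed = abs(np.corrcoef(x, y)[0, 1])

# Null hypothesis: no relation between x and y. Shuffling y breaks any
# real relation, so shuffled correlations show what chance alone produces.
n_perm = 2000
count = 0
for _ in range(n_perm):
    if abs(np.corrcoef(x, rng.permutation(y))[0, 1]) >= observed:
        count += 1

p_value = count / n_perm
print(p_value)  # well below 0.05 -> reject the null
```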

We will have another write up for p values separately.

**3. Things to watch out for while performing regression:**

**Linearity:**

y = b0 + b1x1 + b2x2

If the relation between y and x1, or y and x2, is not linear, then linear regression will not fit well. Alternatively, we can try logarithmic and exponential transformations to convert a non-linear relation into a linear one.
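As a sketch of such a transformation: if y grows exponentially with x (here a synthetic, noise-free example), taking the log of y makes the relation linear, so an ordinary straight-line fit recovers the parameters.

```python
import numpy as np

# Exponential relation y = a * exp(b*x): not linear in x,
# but log(y) = log(a) + b*x IS linear, so we fit log(y) against x.
x = np.linspace(1, 10, 10)
y = 2.0 * np.exp(0.3 * x)  # synthetic data with a = 2.0, b = 0.3

# Fit a straight line to the transformed data
slope, intercept = np.polyfit(x, np.log(y), 1)

print(round(slope, 3), round(np.exp(intercept), 3))  # recovers b and a
```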

**Missing Important Variable**

Sometimes we fail to include an important variable and instead include one merely correlated with it (correlated with the one we excluded). This hurts our predictive power and we fail to explain the results. The best rule is to start by including everything and then use backward elimination to exclude the less relevant variables.

**Outlier Removal / Log transformations**

An outlier, a point sitting far away from the rest of our set, will influence our prediction, and we need to treat it. Certain techniques are available to treat outliers; we will discuss them in separate articles.
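One common detection technique (a sketch on made-up user counts, not the treatment discussion promised above) is the interquartile-range rule: flag points more than 1.5 IQR outside the quartiles.

```python
import numpy as np

# Daily user counts with one extreme value mixed in (made-up numbers)
users = np.array([120, 130, 125, 140, 135, 128, 900, 132])

q1, q3 = np.percentile(users, [25, 75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # the usual 1.5*IQR fences

outliers = users[(users < low) | (users > high)]
cleaned  = users[(users >= low) & (users <= high)]
print(outliers)  # → [900]
```

Whether to drop, cap, or model such points separately depends on why they occurred; detection is only the first step.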

**Autocorrelation**

It is generally observed in time-series data and cannot be avoided; we have to detect the patterns and understand the nature of the autocorrelation. In such cases we instead use alternative prediction approaches, such as autoregressive or moving-average models.
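A quick way to see autocorrelation is to correlate a series with a lagged copy of itself. The sketch below generates a synthetic AR(1) series (each value depends on the previous one) and measures its lag-1 autocorrelation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Build an autocorrelated series: each value depends on the previous one
n = 200
e = np.zeros(n)
noise = rng.normal(0, 1, n)
for t in range(1, n):
    e[t] = 0.8 * e[t - 1] + noise[t]

# Lag-1 autocorrelation: correlation of the series with itself shifted by one
lag1 = np.corrcoef(e[:-1], e[1:])[0, 1]
print(round(lag1, 2))  # estimate lands near the true 0.8
```

A lag-1 value this far from zero in regression residuals is a sign that an autoregressive or moving-average model is the better tool.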

**No Multicollinearity**

If there are two variables that are directly related, e.g. a = 5b + 6, then we have a fixed relation between a and b, and we cannot use both in a model without the results being impacted.

Some basic handling tricks are:

- Drop one of them
- Combine both
- Use both, but be careful
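A correlation matrix is the simplest first check for this. The sketch below reuses the a = 5b + 6 example with random made-up values for b, plus an unrelated variable for contrast:

```python
import numpy as np

rng = np.random.default_rng(2)

b = rng.normal(10, 2, 50)   # made-up predictor values
a = 5 * b + 6               # a is an exact linear function of b: a = 5b + 6
c = rng.normal(0, 1, 50)    # an unrelated variable, for comparison

# Pairwise correlation matrix of the candidate predictors
corr = np.corrcoef([a, b, c])
print(round(corr[0, 1], 4))  # → 1.0, a and b are perfectly collinear
```

A correlation of exactly 1 between a and b tells us to drop one of them (or combine them) before fitting, per the tricks listed above. For subtler cases, variance inflation factors give a more thorough diagnosis.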

So far we have discussed basic regression and the factors influencing the prediction. Going forward we will pull in real examples and try to solve some problems.

Let me know in the comments if you find this useful.
