Logistic Regression
It's been a while since my last post. I was going through a lot of World Cup prediction models and ended up building a logistic regression model that predicted the outcome correctly around 75% of the time. Data collection was a challenge and I included a lot of manually created/transformed variables; I will publish in detail how I trained and scored my model.
Moving on in our regular learning series, we have logistic regression. As always, my focus will be on understanding the mathematics behind it and how we use it in practice.
Introduction:
In the linear regression model, the dependent
variable y is considered continuous, whereas in logistic regression
it is categorical.
Linear regression uses the general linear equation
Y = b0 + b1X1 + E
where Y is a continuous dependent variable and the independent variables Xi are continuous.
The graph above shows a basic comparison between the linear and logistic forms of regression: linear regression provides continuous values for Y, whereas logistic regression provides a probability for Y with the help of the sigmoid function, and the output is always categorical, 1 or 0 (simple logistic regression).
The math behind logistic regression, and how we predict with it
Now we all know that we need to predict the y values in regression as:
Y = b0 + b1x + E ____ Eq 1
For a distribution that is nonlinear, we use the sigmoid function to calculate the y values, which states
P = 1 / (1 + e^(-Y)) ____ Eq 2
Solve it for Y and you'll get
Y = ln(p / (1 - p))
p / (1 - p) is called the odds.
What are odds?
Odds: the probability of the event occurring divided by the probability of the event not occurring, i.e. P / (1 - P).
Odds ratio: if P1 is for the current event and P0 is for the previous event, (P1 / (1 - P1)) / (P0 / (1 - P0)).
Substituting in Eq 1, we get
ln(p / (1 - p)) = b0 + b1x ____ Eq 3
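The relationship between Eq 2 and Eq 3 is easy to verify numerically. A minimal Python sketch (the function names sigmoid and logit are my own, not from this post):

```python
import math

def sigmoid(y):
    # Eq 2: squashes any real-valued y into a probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-y))

def logit(p):
    # Eq 3: the log-odds ln(p / (1 - p)), the inverse of the sigmoid
    return math.log(p / (1.0 - p))

print(sigmoid(0.0))          # 0.5
print(logit(sigmoid(1.2)))   # recovers 1.2 (up to floating point)
```

Because the two functions undo each other, modelling the log-odds as a linear function of x (Eq 3) is the same as modelling the probability with the sigmoid (Eq 2).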
The plan is to compute the probability and then decide an optimum cutoff (based on our understanding of the problem), above which we'll define the target to be 1 and below which we'll call it 0 (only 2 classes for now).
Something similar is explained, with a lot more clarity, in the graphs from the Udemy course.
For every x value we calculate the probability using the sigmoid function above, and based on that we decide to mark it 1 or 0. Logistic regression does not use OLS (ordinary least squares) for parameter estimation; instead, it uses maximum likelihood estimation (MLE).
To explain further, consider a scenario where a football team scores some goals in the first half, and we want to see what the probability is of that team winning.
Goals   Win
1       0
2       1
2       0
3       0
4       1
3       1
1       0
2       1
1       0
2       1
2       0
3       1
4       1
5       1
0       0
1       0
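As a cross-check on the MLE idea, the fit can be sketched in a few lines of Python using plain gradient ascent on the Bernoulli log-likelihood. This is my own toy fitter, not the R call used for this post, so the coefficients it produces need not match the R summary exactly:

```python
import math

# First-half goals and match outcome (1 = win) from the table above
goals = [1, 2, 2, 3, 4, 3, 1, 2, 1, 2, 2, 3, 4, 5, 0, 1]
win   = [0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Maximum likelihood estimation via gradient ascent: the gradient of the
# Bernoulli log-likelihood w.r.t. each coefficient is sum((y - p) * x).
b0, b1, lr = 0.0, 0.0, 0.02
for _ in range(50000):
    residuals = [y - sigmoid(b0 + b1 * x) for x, y in zip(goals, win)]
    b0 += lr * sum(residuals)
    b1 += lr * sum(r * x for r, x in zip(residuals, goals))

print(b0, b1)  # b1 comes out positive: more first-half goals -> higher P(win)
```

No OLS anywhere: we directly climb the likelihood surface until the coefficients stop moving.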

FROM R:
The summary of the logistic regression in R gives:

             Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.08333    0.19754   0.422  0.67953
LR$Goals.     0.25926    0.07603   3.410  0.00423 **
The output indicates that goals scored in the first half holds good significance. The intercept (b0/c) is 0.08333 and the Goals coefficient (b1/m) is 0.25926.
Going forward, these coefficients are entered into the logistic regression equation to estimate the odds (equivalently, the probability) of winning:
From Eq 2:
P = 1 / (1 + e^(-Y))
or P = 1 / (1 + e^(-(b0 + b1x)))
Now that we know b0 and b1 from our regression coefficients, we can calculate the probabilities.
Goals in first half   1     2     3     4     5
P(winning)            0.54  0.60  0.66  0.71  0.76
And we decide on an optimum cutoff of 0.60, so a team that scores 2 or more goals in the first half has a very good chance of winning and we declare it the winner. Of course there are numerous other factors involved, but this is one way of looking at it.
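Plugging the printed coefficients into the sigmoid and applying the 0.60 cutoff can be sketched as follows. Note that probabilities computed directly this way may differ slightly from the rounded table values in the post:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

b0, b1 = 0.08333, 0.25926   # coefficients from the R summary
cutoff = 0.60               # chosen from our understanding of the problem

for goals in range(6):
    p = sigmoid(b0 + b1 * goals)
    label = 1 if p >= cutoff else 0
    print(goals, round(p, 2), label)  # teams with 2+ first-half goals get label 1
```

The cutoff itself is a modelling decision: moving it up or down trades false positives against false negatives, which is exactly what the confusion matrix and ROC curve below measure.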
Performance of Logistic Regression Model
AIC (Akaike Information Criterion) and Deviance –
The analogue of adjusted R² in linear regression is, in logistic regression, the AIC.
AIC is a measure of fit that penalizes the model for the number of model coefficients; therefore, we always prefer the model with the minimum AIC value.
It measures the flexibility of the model. Like adjusted R² in multiple linear regression, it tries to prevent you from including irrelevant predictor variables. A model with a lower AIC is better than a model with a higher AIC.
Summary of Model

             Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.08333    0.19754   0.422  0.67953
LR$Goals.     0.25926    0.07603   3.410  0.00423 **

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 0.1560847)
Null deviance: 4.0000 on 15 degrees of freedom
Residual deviance: 2.1852 on 14 degrees of freedom
(400 observations deleted due to missingness)
AIC: 19.552
Deviance
Deviance is a measure of the goodness of fit of a generalized linear model: the higher the deviance, the poorer the model fit.
The summary of the model says:
Null deviance: 4.0000 on 15 degrees of freedom
Residual deviance: 2.1852 on 14 degrees of freedom
The null deviance shows how well the response is predicted by a model with nothing but an intercept; the lower the value, the better the model. The residual deviance shows how well the response is predicted once the independent variables are added; again, the lower the value, the better the model. A lower residual deviance indicates that the model improved when it included more variables.
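For a fitted logistic model these quantities are simple to compute by hand: the residual deviance is −2 times the maximized log-likelihood (for 0/1 outcomes the saturated model's log-likelihood is 0), and AIC adds a penalty of 2 per coefficient. A sketch using the goals/win data and the printed coefficients as illustrative inputs (the resulting numbers need not match the R summary above):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def deviance_and_aic(b0, b1, xs, ys):
    # Residual deviance = -2 * Bernoulli log-likelihood of the fitted model
    ll = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        ll += y * math.log(p) + (1 - y) * math.log(1 - p)
    deviance = -2.0 * ll
    aic = deviance + 2 * 2   # penalty 2 * k, with k = 2 coefficients (b0, b1)
    return deviance, aic

goals = [1, 2, 2, 3, 4, 3, 1, 2, 1, 2, 2, 3, 4, 5, 0, 1]
win   = [0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0]
dev, aic = deviance_and_aic(0.08333, 0.25926, goals, win)
print(round(dev, 3), round(aic, 3))
```

Adding an irrelevant predictor can only lower the deviance a little, but it always adds 2 to the AIC penalty, which is how AIC discourages overfitting.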
Confusion Matrix
It's just a comparison of the actual and predicted values of Y, and it allows us to measure accuracy and avoid overfitting. We'll have an additional write-up on the confusion matrix, but in brief the accuracy is calculated as:
Accuracy = (True Positives + True Negatives) / (True Positives + True Negatives + False Positives + False Negatives)
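In code, the four counts and the accuracy formula above look like this (a generic sketch, not tied to the football model):

```python
def confusion_counts(actual, predicted):
    # Each (actual, predicted) pair of 0/1 labels falls into exactly one cell
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(actual, predicted):
    tp, tn, fp, fn = confusion_counts(actual, predicted)
    return (tp + tn) / (tp + tn + fp + fn)

print(accuracy([1, 0, 1, 0], [1, 0, 0, 0]))  # 0.75: one win was missed
```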
ROC Curve (Receiver Operating Characteristic)
The best definition I could find: the Receiver Operating Characteristic (ROC) curve summarizes the model's performance by evaluating the trade-off between the true positive rate (sensitivity) and the false positive rate (1 - specificity).
The area under the curve (AUC), also referred to as the index of accuracy (A) or the concordance index, is a widely used performance metric for the ROC curve: the higher the area under the curve, the better the prediction power of the model. It will be explained in detail in another article, as it is a very important measure for classification models.
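The AUC has a handy equivalent definition: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. That gives a tiny pairwise sketch (ties count half):

```python
def auc(actual, scores):
    # Compare every positive's score against every negative's score
    pos = [s for a, s in zip(actual, scores) if a == 1]
    neg = [s for a, s in zip(actual, scores) if a == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

Unlike accuracy, this needs no cutoff at all: it evaluates the raw predicted probabilities across every possible threshold at once.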
Application:
FIFA World Cup / any sports event winner.
Courtesy of Wikipedia, another example might be to predict whether an Indian voter will vote for the BJP, Trinamool Congress, Left Front, or Congress, based on age, income, sex, race, state of residence, and votes in previous elections.