Saturday, 6 April 2019
The page is updated by running the model 28 hours before each match, scoring with the latest data.
Feel free to reach out with any questions about the model.
You can check the daily predictions before each match on Twitter: https://twitter.com/jishturohit
The algorithm has been changed to incorporate the latest patterns found in IPL data and trained with recent data; let's see how the predictions go.
Thursday, 4 April 2019
IPL 2019 Prediction Model : Data Points
April 04, 2019 Rohit Jishtu
Cricket: the game of uncertainties, and the force that binds a country as big as India. Cricket has a new definition these days; it is a more entertaining game than in the past because of the induction of a new format called T20 cricket.
All over the world, business folks are trying to cash in on this new popularity, and the Indian Premier League is one of the diamonds of these new business models.
IPL is more than cricket. It's money, it's flashy, it's surprise cricket, it's entertainment, and more than that, it's business. There are people earning more than players, franchises and sponsors, and there has already been a lot of debate about this.
I have been following the IPL for many years, and it is as unpredictable as a game can get, just as unpredictable as a soccer match. But can mathematics find who will win a match? Is there a pattern the IPL follows, and who will win today?
I have been collecting data points and, based on historic results, trying to predict the winners and scores; let's see how close it can get.
The following criteria are a few among thousands that could predict whether a team is going to win a match or not.
The model is in the development phase and data collection has been a challenge; I hope to add more attributes going forward and train my predictor.
Column Name | Description
Team 1 | Name of the team whose victory is to be predicted
No of players changed | Count of players changed from the previous game
Recent Form | 50% last 5 matches, 20% close matches, 20% last cup, 10% controversies
No of bowlers in top 5 list | Bowlers in the tournament's top-5 list
No of batters in top 5 list | Batters in the tournament's top-5 list
AUS/NZ Players | Players from the Australia/New Zealand region
SA Players | Players from the South Africa region
ASIA Players | Players from Asia
UK Players | Players from the UK
Spinners | Number of spinners
Fast Bowlers | Pacers / medium pacers
Wicket Rating | Slow, fast, flat or bouncy
Avg Score | Average score on this pitch
Avg Wickets Taken | Out of 20, how many wickets fall
Captain Form | 30% team wins, 40% own form, 20% last cups, 10% controversies/pressure
Key players | Superstar / single-handed match-winning players
Close matches win | Last-3-balls wins (40%) + 180+ chases won (30%) + 150 defends (30%)
Regular player | Count of players who play 80% of matches but are not playing or are injured

The same set of attributes is recorded for Team 2 (the opposition), followed by the target column:

Team 1 is Winner | Y/N
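The weighted "Recent Form" attribute above can be computed mechanically. A minimal sketch in Python (the blog's code is in R; the function name and inputs here are hypothetical illustrations of the 50/20/20/10 weighting):

```python
def recent_form_score(last5_win_rate, close_match_win_rate,
                      last_cup_win_rate, controversy_penalty):
    """Weighted recent-form score: 50% last 5 matches, 20% close matches,
    20% last cup, 10% controversies (a higher penalty lowers the form)."""
    return (0.5 * last5_win_rate
            + 0.2 * close_match_win_rate
            + 0.2 * last_cup_win_rate
            + 0.1 * (1.0 - controversy_penalty))

# A team that won 4 of its last 5, half its close matches, 60% of the
# last cup, and had a small controversy penalty:
print(round(recent_form_score(0.8, 0.5, 0.6, 0.2), 2))  # 0.7
```

Each of the other percentage-weighted attributes (Captain Form, Close matches win) can be scored the same way with its own weights.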

Algorithms / Betting / Classification / cricket / Dream11 / gameplay / IPL / Machine Learning / Mathematics / Prediction / Sports
Thursday, 13 September 2018
Classification III : Bayes' Theorem
September 13, 2018 Rohit Jishtu
Bayes' Theorem:
This is arguably one of the most important concepts in probability and in making predictions from an existing data set, and it is the reason I started butterfly predictions.
Bayes' Theorem gives the probability of an event based on prior knowledge of conditions that might be related to the event.
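As a quick numerical illustration of the theorem, P(A|B) = P(B|A) * P(A) / P(B), here is a tiny Python sketch (the cricket numbers are made up for illustration, not from the post):

```python
def bayes(p_b_given_a, p_a, p_b):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Suppose a team wins 40% of its matches (P(A)), bats first in 70% of
# those wins (P(B|A)), and bats first 50% of the time overall (P(B)).
# Then P(win | batted first) = 0.7 * 0.4 / 0.5 = 0.56.
print(round(bayes(0.7, 0.4, 0.5), 2))  # 0.56
```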
Saturday, 18 August 2018
Recommendation System III
August 18, 2018 Rohit Jishtu
R Code for Market Basket Analysis: Explained
Recommendation using Market Basket Analysis (association rules) is well explained in the article above; to understand it further, we will walk through the R code and see how the algorithm is implemented.
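The R walkthrough is in the article; the two core quantities behind association rules, support and confidence, can be sketched in a few lines of Python (the transactions here are made up for illustration):

```python
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter", "bread"},
    {"milk", "eggs"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Of the baskets with the antecedent, how many also have the consequent."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "milk"}))       # 2 of 4 baskets -> 0.5
print(confidence({"bread"}, {"milk"}))  # 2 of 3 bread baskets
```

Algorithms like apriori (used by R's arules package) enumerate itemsets whose support and confidence exceed chosen thresholds.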
Tuesday, 7 August 2018
Prediction 1 Cricket : Ind Vs Eng Test 3
August 07, 2018 Rohit Jishtu
Starting with our first prediction algorithm and the attributes explained earlier, we will try this on the recent India vs England series and test its accuracy against real-time data.
Batsman Runs
I'll be publishing a run range for every batsman; let's see how accurate it is for this match.
My top picks for India vs England, Test 3:
Batsmen: Predicted Runs
1. K L Rahul: 50-100
2. J Bairstow: 65-100
3. A Cook: 45-100
4. A Rahane: 43-100
Bowlers: Predicted Wickets
1. I Sharma: 3-5
2. J Anderson: 4-6
3. B Stokes: 2-5
4. J Bumrah: 3-5
Looking forward to sharing more in this space.
Saturday, 4 August 2018
Recommendation System II
August 04, 2018 Rohit Jishtu
Recommendation System
We have all seen the famous product sites recommending products to users. There must be some set of rules that drives them to the right product, and if they recommend the right product, the probability of the user acquiring or using it increases. It works well for Netflix, Amazon, YouTube and so on.
Sunday, 29 July 2018
Classification Series : Part I
July 29, 2018 Rohit Jishtu
Decision Trees
Almost all newcomers to ML are familiar with the idea that machine learning divides itself into three broad categories: supervised, unsupervised and reinforcement learning.
Supervised: prediction of an output variable based on an available training set, which is assumed to be true. The algorithm finds a mathematical relation between the inputs and the output in the training set, and the same equation is applied to new data.
Example: predicting the winner of a match between two teams, based on a historic set that includes all the times they played in similar conditions and what the results were.
Unsupervised: an algorithm explores data without being given any output direction; it finds groups that may be useful and that exhibit the same properties.
Example: Netflix recommending movies based on the content you have watched, or on what people with the same characteristics as you have watched.
Reinforcement: a wonderful definition from a McKinsey article states that an algorithm learns to perform a task simply by trying to maximize the rewards it receives for its actions.
An interesting example is gameplay in chess. To determine the best move, a player needs to think about various factors, and the number of possibilities is so large that an exhaustive search is not feasible. If we were to build a machine to play such a game using traditional techniques, we would need to specify a huge number of rules to cover all these possibilities.
Reinforcement learning completely bypasses this problem: we do not need to manually specify any rules. The learning agent simply learns by playing the game.
For the last few articles we have mostly covered supervised learning, simple linear predictions and the math behind them. With this series we will start on classification models, which are interesting and produce results for categorical attributes rather than a definite number.
Introduction and Maths
We will start the series with the decision tree, the most important concept in classification models.
Here is our sample problem for tree-based models:
Problem Statement: We have a set of people with their education, age and job status (Yes/No), and based on our historic data we want to predict whether they will subscribe to our channel, butterfly predictions.
Education | Age | Job | Content
U Grad | 30 | no | Subscribe
U Grad | 25 | no | Subscribe
Grad | 35 | no | Unsubscribe
Grad | 45 | no | Subscribe
U Grad | 20 | no | Subscribe
Grad | 32 | Yes | Unsubscribe
Grad | 45 | Yes | Unsubscribe
U Grad | 19 | no | Unsubscribe
Grad | 42 | no | Subscribe
U Grad | 40 | Yes | Subscribe

Solution: find the root node. How do we know the best node to start our decision tree with? This is where the concept of entropy comes into the picture.
Entropy can be defined, in simple words, as the degree of randomness. If there are two classes, "P" and "N", we'd like our split to result in one group of mostly "P" cases and another of mostly "N" cases, so that one group has P(P) near 1; entropy is the purity measure that captures this.
If there are two classes (P, N) in the prediction variable of the data set, then
P (+) = Subscribe
N (-) = Unsubscribe
Entropy(class) = H = -(P/(P+N)) log2(P/(P+N)) - (N/(P+N)) log2(N/(P+N)) ________ Eq 1
The lowest entropy is what we are looking for in a root node split: P(P) = 1 and P(N) = 0 gives H = 0.
Solving for the set in our reference table above:
count(Subscribe) = 6 (Content field)
count(Unsubscribe) = 4 (Content field)
and the formula in Eq 1 becomes
H = -0.6 log2(0.6) - 0.4 log2(0.4)
H(class) ≈ 0.97 bits
Now, for information gain, the formula is
IG(attribute) = H(class) - H(attribute) ________ Eq 2
where H(attribute) is the weighted average of the entropy -(P/(P+N)) log2(P/(P+N)) - (N/(P+N)) log2(N/(P+N)), with P and N counted within each split of that attribute.
In this way you calculate IG(Education), IG(Age) and IG(Job).
The split should maximize information gain: the attribute with the maximum gain among Education, Age and Job becomes the root node, and the same rule applies at every node below.
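The entropy and information-gain calculations above can be checked mechanically. A minimal Python sketch over the sample table (the original post uses R; this is just to verify Eq 1 and Eq 2):

```python
from math import log2
from collections import Counter

rows = [
    ("U Grad", 30, "no",  "Subscribe"),
    ("U Grad", 25, "no",  "Subscribe"),
    ("Grad",   35, "no",  "Unsubscribe"),
    ("Grad",   45, "no",  "Subscribe"),
    ("U Grad", 20, "no",  "Subscribe"),
    ("Grad",   32, "Yes", "Unsubscribe"),
    ("Grad",   45, "Yes", "Unsubscribe"),
    ("U Grad", 19, "no",  "Unsubscribe"),
    ("Grad",   42, "no",  "Subscribe"),
    ("U Grad", 40, "Yes", "Subscribe"),
]

def entropy(labels):
    """Eq 1: H = -sum p_i * log2(p_i) over the class proportions."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

labels = [r[3] for r in rows]
h_class = entropy(labels)

def info_gain(col):
    """Eq 2: H(class) minus the weighted entropy within each split."""
    n = len(rows)
    groups = {}
    for r in rows:
        groups.setdefault(r[col], []).append(r[3])
    h_attr = sum(len(g) / n * entropy(g) for g in groups.values())
    return h_class - h_attr

print(round(h_class, 3))       # 0.971, matching Eq 1 above
print(round(info_gain(2), 3))  # IG(Job)
```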
Keep repeating the same process for all the columns and eventually we will get the following tree plotted.
The numbers at the top of each node in the tree correspond to the branch numbers in the textual representation of the tree as generated by the default print() method.
The splits are mostly self-explanatory. The first split is based on Job: if the user's Job is "no", there is a 20% chance that they will unsubscribe, and so on. From our sample, people who have a job, are older than 20 and are undergraduates subscribe to the blog at a good rate of 40%, which is a clear hint at our target audience.
A very simple R code can help us build the above tree.
Control parameters (courtesy of the R documentation):
minbucket: the minimum number of observations in any terminal (leaf) node. If only one of minbucket or minsplit is specified, the code sets minsplit to minbucket*3 or minbucket to minsplit/3, as appropriate.
cp (complexity parameter): any split that does not decrease the overall lack of fit by a factor of cp is not attempted. For instance, with anova splitting, this means that the overall R-squared must increase by cp at each step. The main role of this parameter is to save computing time by pruning off splits that are obviously not worthwhile. Essentially, the user informs the program that any split which does not improve the fit by cp will likely be pruned off by cross-validation, and hence the program need not pursue it.
Summary
Left/right son: "son" tells you the number of the next node below that split. The "obs" numbers are how many of the training observations fall on each side.
Primary splits: the leading variables that could have been used in a split.
Surrogate splits: used when there are missing predictor values. Another split that approximates the original split's results can be used if its values are not missing.
We will discuss how to check the effectiveness of R models in further articles as well.
Applications
Most predictions that require categorical answers fit decision trees well, with good accuracy: sports predictions, categorization of books, predicting employee performance at joining, and many more.
This is just the start of a series of algorithms in this section; let me know how you feel in the comments section.
Sunday, 22 July 2018
Logistic Regression
July 22, 2018 Rohit Jishtu
It's been a while since the last post. I was going through a lot of World Cup prediction models and ended up creating a logistic regression model that predicted correctly around 75% of the time. Data collection was a challenge and I included a lot of manually created/transformed variables; I will publish in detail how I trained and scored my model.
Moving on in our regular learning series, we have logistic regression. As always, my focus will be on understanding the mathematics behind it and how we can use it.
Introduction:
In the linear regression model, the dependent variable y is continuous, whereas in logistic regression it is categorical.
Linear regression uses the general linear equation
Y = b0 + b1*x1 + E
where Y is a continuous dependent variable and the independent variables Xi are continuous.
The graph above shows a basic comparison between the linear and logistic forms of regression: linear regression provides a continuous value for Y, whereas logistic regression provides the probability of a Y value with the help of the sigmoid function, and the output is always categorical, 1 or 0 (for simple logistic regression).
The math behind it, and how we predict in logistic regression
We know that in regression we predict the y values as:
Y = b0 + b1*x + E ________ Eq 1
For a distribution which is non-linear we use the sigmoid function, which states
P = 1 / (1 + e^(-Y)) ________ Eq 2
Solve it for Y and you get
Y = ln(p / (1 - p))
p / (1 - p) is called the odds.
What are odds?
Odds: probability of the event occurring / probability of the event not occurring = P / (1 - P)
Odds ratio: if P1 is for the current event and P0 is for the previous event, (P1 / (1 - P1)) / (P0 / (1 - P0))
Substituting into Eq 1, we get
ln(p / (1 - p)) = b0 + b1*x ________ Eq 3
The plan is to compute the probability and then decide an optimum cutoff (based on understanding of the problem), above which we'll define the target to be 1 and below which we'll call it 0 (only two classes for now).
Something similar is explained, with a lot more clarity, in the graphs by Udemy.
For every x value we calculate the probability using the sigmoid function above and, based on that, decide to mark it 1 or 0. Logistic regression does not use OLS (ordinary least squares) for parameter estimation; instead, it uses maximum likelihood estimation (MLE).
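That Eq 3 is just Eq 2 solved for Y can be checked numerically: the logit (log-odds) undoes the sigmoid. A short Python sketch:

```python
from math import exp, log

def sigmoid(y):
    """Eq 2: P = 1 / (1 + e^(-y))."""
    return 1.0 / (1.0 + exp(-y))

def logit(p):
    """Eq 3's left-hand side: y = ln(p / (1 - p)), the log-odds."""
    return log(p / (1.0 - p))

p = sigmoid(0.5)
print(round(p, 4))            # probability for y = 0.5
print(round(logit(p), 10))    # recovers 0.5: logit inverts the sigmoid
```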
To explain further, we have a scenario where a football team scores in the first half, and we want to see the probability of that team winning.
Goals | 1 | 2 | 2 | 3 | 4 | 3 | 1 | 2 | 1 | 2 | 2 | 3 | 4 | 5 | 0 | 1
Win   | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0

FROM R:
The summary of the logistic regression in R gives:

            Estimate  Std. Error  t value  Pr(>|t|)
(Intercept) -0.08333   0.19754    -0.422    0.67953
LR$Goals.    0.25926   0.07603     3.410    0.00423 **

The output indicates that goals in the first half hold good significance. The intercept (b0/c) is -0.08333 and the Goals coefficient (b1/m) is 0.25926.
Going forward, these coefficients are entered in the logistic regression equation to estimate the odds (equivalently, the probability) of winning the match.
From Eq 2,
P = 1 / (1 + e^(-Y))
or P = 1 / (1 + e^(-(b0 + b1*x)))
Now that we know b0 and b1 from our regression coefficients, we can calculate the probabilities.
Goals in first half | 1    | 2    | 3    | 4    | 5
P(winning)          | 0.54 | 0.6  | 0.66 | 0.71 | 0.76
We decide on an optimum value of 0.60 and assume that if a team scores two goals more than the other team, it has a very good chance of winning, and we declare it the winner. Of course, there are numerous other factors involved, but this is one way of looking at it.
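The coefficients and the probability table can be reproduced from the Goals/Win data. A Python sketch of what the post appears to do (the R summary shows a gaussian family, i.e. an ordinary least-squares fit on the 0/1 outcome, whose fitted value is then pushed through Eq 2's sigmoid; the resulting probabilities land within a rounding step of the table above):

```python
from math import exp

goals = [1, 2, 2, 3, 4, 3, 1, 2, 1, 2, 2, 3, 4, 5, 0, 1]
win   = [0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0]

# Ordinary least-squares slope and intercept (note the intercept comes
# out negative, matching the corrected R summary).
n = len(goals)
mx, my = sum(goals) / n, sum(win) / n
b1 = (sum((x - mx) * (y - my) for x, y in zip(goals, win))
      / sum((x - mx) ** 2 for x in goals))
b0 = my - b1 * mx
print(round(b0, 5), round(b1, 5))   # -0.08333 0.25926

# Push the fitted value through Eq 2's sigmoid to get P(winning).
for g in range(1, 6):
    p = 1.0 / (1.0 + exp(-(b0 + b1 * g)))
    print(g, round(p, 2))
```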
Performance of the Logistic Regression Model
AIC (Akaike Information Criterion) and deviance:
The analogue of adjusted R² in linear regression is AIC in logistic regression.
AIC is a measure of fit that penalizes the model for the number of model coefficients; therefore, we always prefer the model with the minimum AIC value.
It measures the flexibility of the models. It is analogous to adjusted R² in multiple linear regression, where it tries to prevent you from including irrelevant predictor variables. A model with a lower AIC is better than a model with a higher AIC.
Summary of the model:

            Estimate  Std. Error  t value  Pr(>|t|)
(Intercept) -0.08333   0.19754    -0.422    0.67953
LR$Goals.    0.25926   0.07603     3.410    0.00423 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 0.1560847)
Null deviance: 4.0000 on 15 degrees of freedom
Residual deviance: 2.1852 on 14 degrees of freedom
(400 observations deleted due to missingness)
AIC: 19.552
Deviance
Deviance is a measure of the goodness of fit of a generalized linear model: the higher the deviance, the poorer the model fit.
The summary of the model says:
Null deviance: 4.0000 on 15 degrees of freedom
Residual deviance: 2.1852 on 14 degrees of freedom
The null deviance indicates the response predicted by a model with nothing but an intercept; the lower the value, the better the model.
The residual deviance indicates the response predicted by a model after adding independent variables; the lower the value, the better the model.
The lower residual deviance points out that the model becomes better when it includes more variables (in general).
Confusion Matrix
A confusion matrix is simply a comparison of the actual and predicted values of Y, and it allows us to measure accuracy and avoid overfitting. We'll have an additional write-up on the confusion matrix, but in brief, accuracy is calculated as
Accuracy = (True Positives + True Negatives) / (True Positives + True Negatives + False Positives + False Negatives)
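A tiny worked example of the accuracy formula (the actual/predicted vectors are made up for illustration):

```python
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# Count the four cells of the confusion matrix.
tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(tp, tn, fp, fn, accuracy)  # 3 3 1 1 0.75
```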
ROC Curve (Receiver Operating Characteristic)
The best definition I could find: the receiver operating characteristic (ROC) summarizes the model's performance by evaluating the trade-offs between the true positive rate (sensitivity) and the false positive rate (1 - specificity).
The area under the curve (AUC), also referred to as the index of accuracy (A) or concordance index, is an excellent performance metric for the ROC curve: the higher the area under the curve, the better the prediction power of the model.
It will be explained in detail in another article, as it is a very important measure for classification models.
Application:
FIFA World Cup / any sports event winner.
Courtesy of Wikipedia, another example might be to predict whether an Indian voter will vote BJP, Trinamool Congress, Left Front or Congress, based on age, income, sex, race, state of residence and votes in previous elections.
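AUC also has a handy probabilistic reading: it is the chance that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative one. A small Python sketch of that interpretation (the labels and scores are illustrative):

```python
def auc(labels, scores):
    """AUC as P(score of a random positive > score of a random negative),
    counting ties as half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores))  # 3 of 4 positive/negative pairs ranked correctly: 0.75
```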