Saturday, 6 April 2019

IPL Results




This page is updated by running the model 2-8 hours before each match with the latest data.
Feel free to reach out with any queries about the model.



You can check the daily predictions before each match on Twitter: https://twitter.com/jishturohit

The algorithm has been changed to incorporate the latest patterns found in the IPL and retrained with recent data; let's see how the predictions go.


Thursday, 4 April 2019

IPL 2019 Prediction Model : Data Points





Cricket: the game of uncertainties and the force which binds a country as big as India. Cricket has a new definition these days; it is a more entertaining game than in the past because of the induction of the new format called T20 cricket.

All over the world, business folks are trying to cash in on this new popularity, and the Indian Premier League is one of the diamonds of these new business models.

The IPL is more than cricket: it's money, it's flashy, it's surprise cricket, it's entertainment, and more than that it's business. There are people earning more than the players, franchises and sponsors, and there has already been a lot of debate about this.

I have been following the IPL for many years, and it is as unpredictable as a game can get, just like a soccer match. But can mathematics find who will win a match? Is there a pattern the IPL follows, and who will win today?
I have been collecting data points and, based on historic results, trying to predict the winners and scores; let's see how close it can get.

The following criteria are a few among the thousands that could predict whether a team is going to win a match.
The model is in the development phase and data collection has been a challenge; I hope to add more attributes going forward and retrain my predictor.

Column Name                 | Description
Team 1                      | Name of the team whose victory is to be predicted
No of players changed       | Count of players changed from the previous game
Recent Form                 | 50% last 5 matches, 20% close matches, 20% last cup, 10% controversies
No of bowlers in top 5 list | Count of the team's bowlers in the tournament's top 5
No of batters in top 5 list | Count of the team's batters in the tournament's top 5
AUS/NZ Players              | Players from the Australia/New Zealand region
SA Players                  | Players from the South Africa region
ASIA Players                | Players from Asia
UK Players                  | Players from the UK
Spinners                    | Number of spinners
Fast Bowlers                | Pacers / medium pacers
Wicket Rating               | Slow, fast, flat or bouncy
Avg Score                   | Average score on the pitch
Avg Wickets Taken           | How many of the 20 wickets fall on average
Captain Form                | 30% team wins, 40% own form, 20% last cups, 10% controversies/pressure
Key players                 | Superstar / single-handedly match-winning players
Close matches win           | Last-3-balls wins (40%) + 180+ chases won (30%) + sub-150 totals defended (30%)
Regular player              | Number of players who play 80% of matches, i.e. not rotated out or injured
Team 2 (Opposition)         | The same attributes as above, computed for the opposing team
Team 1 is Winner            | Y/N (the target to predict)
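
To illustrate how these data points could feed a model, here is a minimal sketch in R that trains a decision tree on a file with these columns. The file name and column names are illustrative assumptions; the actual training pipeline is not published in this post.

library(rpart)

# Assumed CSV with one row per match and the columns listed above
# (names are illustrative, not the real feature names)
matches <- read.csv("ipl_matches.csv")
matches$Team1Wins <- factor(matches$Team1Wins)   # the Y/N target

# Fit a classification tree on a handful of the attributes
fit <- rpart(Team1Wins ~ RecentForm + CaptainForm + KeyPlayers + AvgScore,
             data = matches, method = "class")

# Score an upcoming match: class probabilities for Team 1 winning
predict(fit, newdata = matches[1, ], type = "prob")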

Thursday, 13 September 2018

Classification III : Bayes' Theorem



Bayes' Theorem:

This is arguably one of the most important concepts in probability and in making predictions from an existing data set, and it is the reason I started butterfly predictions.
Bayes' theorem gives the probability of an event based on prior knowledge of conditions that might be related to the event.
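In symbols, for events A and B with P(B) > 0: P(A|B) = P(B|A) * P(A) / P(B), i.e. the probability of A given that B has occurred.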


Tuesday, 7 August 2018

Prediction 1- Cricket : Ind Vs Eng Test 3





Starting with our first prediction algorithm and the attributes explained earlier, we will try it on the recent India vs England series and test its accuracy against real match data.


Batsman Runs 

I'll be publishing a runs range for every batsman; let's see how accurate it is for this match.


My Top Picks for India vs England Test 3

Batsmen- Predicted Runs

1. K L Rahul - 50-100
2. J Bairstow - 65-100
3. A Cook - 45-100
4. A Rahane - 43-100

Bowlers -Predicted Wickets

1. I Sharma - 3-5
2. J Anderson - 4-6
3. B Stokes - 2-5
4. J Bumrah - 3-5

Results 



Looking forward to sharing more in this space.


Saturday, 4 August 2018

Recommendation System II


Recommendation System

We have all seen the famous product sites recommending products to users. There must be some set of rules which drives them to the right product, and if they recommend the right product, the probability of the user acquiring or using it increases. It works well for Netflix, Amazon, YouTube and so on.

Sunday, 29 July 2018

Classification Series : Part I



Decision Trees 

Almost everyone, even newbies in ML, is familiar with the idea that machine learning divides itself into three broader categories: supervised, unsupervised and reinforcement learning.

Supervised: Prediction of an output variable based on an available training set, which is assumed to be true to begin with. The algorithm finds a mathematical relation between all inputs and the output in the training set, and the same equation is applied to new data.
Example: predicting the winner of a match between two teams, based on a historic set that includes all the times they played in similar conditions and what the results were.

Unsupervised: An algorithm explores data without being given any output direction; it finds groups which may be useful and which exhibit the same properties.
Example: Netflix's movie recommendations, based on the content you have watched or on what people with the same characteristics as you have watched.

Reinforcement: A wonderful definition from a McKinsey article states that an algorithm learns to perform a task simply by trying to maximize the rewards it receives for its actions.

An interesting example is game play in chess. In order to determine the best move, a player needs to think about various factors. The number of possibilities is so large that it is not possible to perform an exhaustive search. If we were to build a machine to play such a game using traditional techniques, we would need to specify a huge number of rules to cover all these possibilities.

Reinforcement learning completely bypasses this problem. We do not need to manually specify any rules. The learning agent simply learns by playing the game.

For the last few articles we have mostly been covering supervised learning, simple linear predictions and the maths behind them. With this series we will start on classification models, which are interesting and produce results for categorical attributes rather than a definite number.

Introduction and Maths

We will start the series with decision trees, one of the most important concepts in classification models.
Here is our sample Problem for Tree Based Models:

Problem Statement: We have a set of people with their education, age and job status (Yes/No), and we want to predict, based on our historic data, whether they will subscribe to our channel, butterfly predictions.

                                                    
Education | Age | Job | Content
U Grad    | 30  | no  | Subscribe
U Grad    | 25  | no  | Subscribe
Grad      | 35  | no  | Unsubscribe
Grad      | 45  | no  | Subscribe
U Grad    | 20  | no  | Subscribe
Grad      | 32  | Yes | Unsubscribe
Grad      | 45  | Yes | Unsubscribe
U Grad    | 19  | no  | Unsubscribe
Grad      | 42  | no  | Subscribe
U Grad    | 40  | Yes | Subscribe




Solution: Find the root node. How do we know the best attribute to start our decision tree with? This is where the concept of entropy comes into the picture.
Entropy can be defined, in simple words, as the degree of randomness. If there are two classes, "P" and "N", we would like a split to result in one group of mostly "P" cases and another of mostly "N" cases, so that each group is nearly pure; entropy is the purity measure we introduce for this.

If there are two classes (P, N) in the prediction variable of the data set, with

P (+) = Subscribe
N (-) = Unsubscribe

Entropy(class) = H = -(P/(P+N)) log2(P/(P+N)) - (N/(P+N)) log2(N/(P+N))   ...Eq 1

The lowest entropy is what we are looking for: a perfectly pure node, where P(P) = 1 and P(N) = 0, gives H = 0.

Solving for our reference set (the table above):

P (subscribe) = 6 (Content field)
N (unsubscribe) = 4 (Content field)

the formula in Eq 1 becomes

H = -0.6 log2(0.6) - 0.4 log2(0.4)
H(class) ≈ 0.97

Now for information gain, the formula is

IG(attribute) = H(class) - H(attribute)   ...Eq 2

where H(attribute) is the weighted average of the entropies (Eq 1) of the subsets produced by splitting the data on that attribute, each subset weighted by the fraction of rows that falls into it.

This way you need to calculate the information gain for each attribute:

IG(Education), IG(Age), IG(Job)

The split should maximize information gain: the attribute with the maximum gain among Education, Age and Job becomes the root node, and the same rule applies at every node below it.

Keep repeating the same process on the remaining columns and eventually we get the tree plotted below.
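
Here is a minimal sketch of the entropy and information-gain calculation in base R, using the subscribe/unsubscribe table above (the function and variable names are mine, not from the original post):

# Entropy (Eq 1): degree of randomness of a class vector
entropy <- function(y) {
  p <- table(y) / length(y)
  -sum(p * log2(p))
}

# Information gain (Eq 2): H(class) minus the weighted average
# entropy of the subsets created by splitting on one attribute
info_gain <- function(df, attribute, target = "Content") {
  subsets <- split(df[[target]], df[[attribute]])
  h_attr <- sum(sapply(subsets, function(s) length(s) / nrow(df) * entropy(s)))
  entropy(df[[target]]) - h_attr
}

people <- data.frame(
  Education = c("U Grad", "U Grad", "Grad", "Grad", "U Grad",
                "Grad", "Grad", "U Grad", "Grad", "U Grad"),
  Age       = c(30, 25, 35, 45, 20, 32, 45, 19, 42, 40),
  Job       = c("no", "no", "no", "no", "no", "Yes", "Yes", "no", "no", "Yes"),
  Content   = c("Subscribe", "Subscribe", "Unsubscribe", "Subscribe", "Subscribe",
                "Unsubscribe", "Unsubscribe", "Unsubscribe", "Subscribe", "Subscribe")
)

entropy(people$Content)   # ~0.971, matching H(class) above

# Age is continuous, so it would first be binned or given a threshold;
# the categorical attributes can be compared directly:
sapply(c("Education", "Job"), function(a) info_gain(people, a))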

The numbers at the top of each node in the tree correspond to the branch numbers in the textual representation of the tree as generated by the default print() method.

Each tree node has four values around it: the north side shows the majority value of the target (in our case Content), the west side shows the percentage of values meeting the left condition, the east side the percentage meeting the right condition, and the south side the percentage of all observations in that node, hence the root node has the maximum percentage.

The splits are mostly self-explanatory. The first split is based on Job: if a user's Job is "no", there is a 20% chance that they will unsubscribe, and so on. From our sample, people who have a job, are older than 20 and are undergraduates will subscribe to the blog with a good share of 40%. So this is clearly a good hint at our target audience.




R Code

A very simple piece of R code can help us achieve the above.
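
The original snippet was posted as an image; here is a minimal sketch of what it might have looked like, using rpart on the people data frame from the sketch above (the rpart.plot package and the control settings are my assumptions, chosen so the tiny 10-row table actually splits):

library(rpart)
library(rpart.plot)

# Fit a classification tree; low minsplit/cp so the 10-row sample splits
fit <- rpart(Content ~ Education + Age + Job, data = people,
             method = "class",
             control = rpart.control(minsplit = 2, cp = 0))

print(fit)        # textual tree with branch numbers, as described above
rpart.plot(fit)   # plot of the tree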



Control: courtesy of the R documentation.

minsplit: The minimum number of observations that must exist in a node in order for a split to be attempted.

minbucket: The minimum number of observations in any terminal (leaf) node.
If only one of minbucket or minsplit is specified, the code either sets minsplit to minbucket*3 or minbucket to minsplit/3, as appropriate.

cp (complexity parameter): Any split that does not decrease the overall lack of fit by a factor of cp is not attempted. For instance, with anova splitting, this means that the overall R-squared must increase by cp at each step. The main role of this parameter is to save computing time by pruning off splits that are obviously not worthwhile. Essentially, the user informs the program that any split which does not improve the fit by cp will likely be pruned off by cross-validation, and hence the program need not pursue it.

Summary 



Left/Right son: "son" tells you the number of the next node below that split. The "obs" numbers show how many of the training observations fall on each side.

Primary Splits : Those are the leading variables that could have been used in a split.

Surrogate splits: These are used when there are missing predictor values. Another split that approximates the original split's results can be used if the original variable's values are missing.

We will also discuss how to check the effectiveness of R models in further articles.

Applications

Most predictions that require categorical answers fit decision trees well, with good accuracy:
sports predictions, categorization of books, prediction of employee performance at the time of joining, and many more.
This is just the start of a series of algorithms in this section; let me know how you feel in the comments section.

Further tutorials on classification

Classification Tree


Sunday, 22 July 2018

Logistic Regression



It's been a while since the last post. I was going through a lot of World Cup prediction models and ended up creating a logistic regression set that predicted correctly around 75% of the time. Data collection was a challenge and I included a lot of manually created/transformed variables; I will publish in detail how I trained and scored my model.

Moving on in our regular learning series, we have logistic regression. As always, my focus will be on understanding the mathematics behind it and how we can use it in real life.

Introduction:

In the linear regression model, the dependent variable y is considered continuous, whereas in logistic regression it is categorical.
Linear regression uses the general linear equation

Y = b0 + b1x1 + E

where Y is a continuous dependent variable and the independent variables xi are continuous.

The graph above shows a basic comparison between the linear and logistic forms of regression: linear regression provides continuous values for Y, whereas logistic regression provides a probability for Y with the help of the sigmoid function, and the output is always categorical, 1 or 0 (for simple logistic regression).

The math behind logistic regression and how we predict with it

Now we all know that we need to predict the y values in regression as:

Y = b0 + b1x + E   ...Eq 1

For a distribution which is non-linear we use the sigmoid function to calculate the y values:

P = 1 / (1 + e^(-Y))   ...Eq 2

Solve Eq 2 for Y (take the reciprocal, rearrange and take the natural log) and you get

Y = ln(p / (1 - p))

p / (1 - p) is called the odds.

What are odds?

Odds: probability of the event occurring / probability of the event not occurring, i.e. P / (1 - P).
Odds ratio: if P1 is for the current event and P0 for the previous event, (P1 / (1 - P1)) / (P0 / (1 - P0)).



Substituting in Eq 1, we get

ln(p / (1 - p)) = b0 + b1x   ...Eq 3

The plan is to get the probability and then decide an optimum cutoff (based on an understanding of the problem), above which we'll define the target to be 1 and below which we'll call it 0 (only two classes as of now).

Something similar is explained with a lot more clarity in the graphs from Udemy.
For every x value we calculate the probability using the sigmoid function above, and based on that we decide to mark it 1 or 0. Logistic regression does not use OLS (ordinary least squares) for parameter estimation; instead, it uses maximum likelihood estimation (MLE).





To explain further, we have a scenario where a football team scores some goals in the first half, and we want to see what the probability of that team winning is.


Goals | 1 | 2 | 2 | 3 | 4 | 3 | 1 | 2 | 1 | 2 | 2 | 3 | 4 | 5 | 0 | 1
Win   | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0
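
Here is a minimal sketch of fitting this in R, assuming a data frame LR with the 16 observations above. Note that a conventional logistic fit uses family = binomial; the summary quoted below reports a gaussian family, so its coefficients come from R's default (linear) fit instead:

LR <- data.frame(
  Goals = c(1, 2, 2, 3, 4, 3, 1, 2, 1, 2, 2, 3, 4, 5, 0, 1),
  Win   = c(0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0)
)

# Logistic regression: binomial family with the logit link
fit <- glm(Win ~ Goals, data = LR, family = binomial)
summary(fit)

# Predicted probability of winning for 1-5 first-half goals
predict(fit, newdata = data.frame(Goals = 1:5), type = "response")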

FROM R: 

Summary of Logistic regression in R gives:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept) -0.08333    0.19754  -0.422  0.67953  
LR$Goals.    0.25926    0.07603   3.410  0.00423 **



The output indicates that goals scored in the first half hold good significance.
The intercept (b0/c) is -0.08333 and the Goals coefficient (b1/m) is 0.25926.
Going forward, these coefficients are entered into the logistic regression equation to estimate the odds (equivalently, the probability) of winning:

From Eq 2,

P = 1 / (1 + e^(-Y))
or P = 1 / (1 + e^(-(b0 + b1x)))

Now we know b0 and b1 from our regression coefficients, we can calculate the probabilities.

Goals in first half | 1    | 2    | 3    | 4    | 5
P(winning)          | 0.54 | 0.60 | 0.66 | 0.71 | 0.76

And we decide on an optimum cutoff of 0.60: if a team scores 2 goals more than the other team, it has a very good chance of winning and we declare it the winner. Of course, there are numerous other factors involved, but this is one way of looking at it.

Performance of Logistic Regression Model

AIC (Akaike Information Criterion) and Deviance

The analogue of adjusted R² from linear regression is AIC in logistic regression.
AIC is a measure of fit that penalizes the model for the number of model coefficients; therefore, we always prefer the model with the minimum AIC value.

It measures the flexibility of the model. It is analogous to adjusted R² in multiple linear regression in that it tries to prevent you from including irrelevant predictor variables: a model with a lower AIC is better than a model with a higher AIC.

Summary of Model  
Estimate Std. Error t value Pr(>|t|)   
(Intercept) -0.08333    0.19754  -0.422  0.67953   
LR$Goals.    0.25926    0.07603   3.410  0.00423 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
(Dispersion parameter for gaussian family taken to be 0.1560847)
Null deviance: 4.0000  on 15  degrees of freedom
Residual deviance: 2.1852  on 14  degrees of freedom
  (400 observations deleted due to missingness)
AIC: 19.552

Deviance

It is a measure of goodness of fit of a generalized linear model: the higher the deviance, the poorer the model fit.

The summary of the model says:
Null deviance: 4.0000 on 15 degrees of freedom
Residual deviance: 2.18 on 14 degrees of freedom

Null deviance indicates how well the response is predicted by a model with nothing but an intercept; the lower the value, the better the model.

Residual deviance indicates how well the response is predicted by the model after adding the independent variables; the lower the value, the better the model.

A lower residual deviance shows that the model has become better by including more variables (in general).

Confusion Matrix

It is just a comparison of the actual and predicted values of Y, and it allows us to measure accuracy and avoid overfitting. We'll have an additional write-up on the confusion matrix, but in brief, accuracy is calculated as follows.



Accuracy = (True Positives + True Negatives) / (True Positives + True Negatives + False Positives + False Negatives)
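
As a quick illustration, here is a minimal sketch computing the confusion matrix and accuracy for the model fitted above (the 0.5 probability cutoff is my assumption):

# Predicted probabilities and 0/1 predictions at a 0.5 cutoff
prob <- predict(fit, type = "response")
pred <- ifelse(prob > 0.5, 1, 0)

# Confusion matrix: actual vs predicted
cm <- table(Actual = LR$Win, Predicted = pred)
cm

# Accuracy = (TP + TN) / total
sum(diag(cm)) / sum(cm)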
 

ROC Curve (Receiver operating characteristic)

The best definition I could find is: the Receiver Operating Characteristic (ROC) summarizes the model's performance by evaluating the trade-offs between the true positive rate (sensitivity) and the false positive rate (1 - specificity).

The area under the curve (AUC), referred to as the index of accuracy (A) or concordance index, is a perfect performance metric for the ROC curve: the higher the area under the curve, the better the prediction power of the model.


It will be explained in detail in another article, as it is a very important measure for classification models.
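
In the meantime, here is a minimal sketch of plotting the ROC curve and computing the AUC for the model above, using the pROC package (the package choice is my assumption; fit, prob and LR come from the earlier sketches):

library(pROC)

roc_obj <- roc(LR$Win, prob)   # actual outcomes vs predicted probabilities
auc(roc_obj)                   # area under the ROC curve
plot(roc_obj)                  # the ROC curve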

Application:


FIFA World Cup / any sports event winner.


Courtesy of Wikipedia, another example might be to predict whether an Indian voter will vote for the BJP, Trinamool Congress, Left Front or Congress, based on age, income, sex, race, state of residence and votes in previous elections.

Let me know what you think in the comments 👇