Classification Series : Part I
Decision Trees
Almost all (newbies in ML) are well familiar that machine learning divides itself in three broader categories supervised, unsupervised and reinforced learning.
Supervised: Exploration of output variable or rather prediction variable based on available training set, which must be true in first assumption. Algorithm find mathematical relations between all inputs and output from training set and same equation is applied to ‘new data’.
Predicting winner of match between 2 teams, based on historic set which includes all the time they played in similar conditions and what could be the result.
Unsupervised: A algorithm explores data without been given any output direction, algorithm finds groups which may be useful and exhibiting same properties.
Movies recommendation of Netflix based on the content you have watched or people in same characteristic as you have watched.
Reinforced: A wonderful definition from McKinsey article states that, an algorithm learns to perform a task simply by trying to maximize rewards it receives for its actions.
An interesting example is game play in Chess In order to determine the best move, the players need to think about various factors. The number of possibilities is so large that it is not possible to perform a search. If we were to build a machine to play such a game using traditional techniques, we need to specify huge number of rules to cover all these possibilities.
Reinforcement learning completely bypasses this problem. We do not need to manually specify any rules. The learning agent simply learns by playing the game.
For a last few articles we have been covering mostly about supervised, the simple linear predictions and math’s behind it, from this series we will start on classifications models, which are interesting and provides result for categorical attributes not a definite number.
Introduction and Maths
We will start the series with Decision Tree, most important concept of classification model.
Here is our sample Problem for Tree Based Models:
Problem Statement: We have a set of people their education, profession and Job status (yes /No ) , we want to predict based on our historic data whether they will subscribe to our channel , butterfly predictions.
Education

Age

Job

Content

U Grad

30

no

Subscribe

U Grad

25

no

Subscribe

Grad

35

no

Unsubscribe

Grad

45

no

Subscribe

U Grad

20

no

Subscribe

Grad

32

Yes

Unsubscribe

Grad

45

Yes

Unsubscribe

U Grad

19

no

Unsubscribe

Grad

42

no

Subscribe

U Grad

40

Yes

Subscribe

Solution : Find the root node, how do we know the best node to start our decision tree, the concept of entropy comes into picture.
Entropy can be defined as the degree of randomness in simple words. If there are two classes, “P” and “N”, we’d like our split to result in a group of mostly “P” cases and another of mostly “N” cases, so one group has P nearly 1, introduce purity measure entropy
If there are two classes (P,N) in prediction variable of the data set then
P(+)= Subscribe
N() = Unsubscribe
Entropy (class) =H= P/ P+N log_{2 }(P/P+N) N/P+N log_{2 }(n/P+N)________Eq1
Lowest entropy is what we are looking for root node, which means P(P)= 1 and P(N)=0 , H=0
Solving for above set in our reference (above table)
P(subscribe)=6 (content field)
P(unsubscribe)=4 (content field)
Formula in Eq1 becomes
H=0.6(log_{2}0.6)0.4(log_{2}0.4)
H (class) =97%
Now for information gain, the formula is
IG(attributes) = H(class) H(attribute)
IG= H (class) P/ P+N log_{2 }(P/P+N) N/P+N log_{2 }(n/P+N)________________Eq2
P(class attribute 1) = attribute 1 occurrence (for that split)
N (class attribute 2)= attribute 2 occurrence (for that split)
This way you need to calculate IG for
IG (Education) , IG(Age), IG(Job)
And for the split Information gain should be maximum, the maximum gain amongst the 3 attributes Education age and job will be root node and so on for every node.
Keep repeating the same process for all the columns and eventually we will get the following tree plotted.
Keep repeating the same process for all the columns and eventually we will get the following tree plotted.
The numbers at the top of each node in the tree correspond to the branch numbers in the textual representation of the trees as generated by the default print() method
The splits are mostly selfexplanatory, The first split is based on Job and If Job is no of user , then 20% chances that that guy will unsubscribe and so on ,from our sample people who have a job and age is more than 20 and are undergraduates will subscribe to the blog with a good percentage of 40%. So clearly a good hint at our target audience.
A very simple R code can help us achieve the above set.
Control : Courtesy R documentation
Minbucket :The minimum number of observations in any terminal <leaf> node.
If only one of minbucket or minsplit is specified, the code either sets minsplit to minbucket*3 or minbucket to minsplit/3, as appropriate.
CP (complexity parameter). Any split that does not decrease the overall lack of fit by a factor of cp is not attempted. For instance, with anova splitting, this means that the overall Rsquared must increase by cp at each step. The main role of this parameter is to save computing time by pruning off splits that are obviously not worthwhile. Essentially,the user informs the program that any split which does not improve the fit by cp will likely be pruned off by crossvalidation, and that hence the program need not pursue it.
Summary
Left/Right Son :"son" tells you the number of the next node below that split. The "obs" numbers are how many of the training data are on each side.
Primary Splits : Those are the leading variables that could have been used in a split.
Surrogate Splits : They are used when there are missing predictor values. Another split, that approximates the original split's results, can be used it its values are not missing.
We will be discussing on how to check the effectiveness of R models in further articles as well.
Applications
Major of predictions which require categorical answers, decision tree fits them well with good accuracy.
Sports Predictions, Categorization of Books , Prediction of employee performance while joining and so many more .
This is just a start of a series of algo in this section and let me know how you feel in comments section.
Thanks for the wellwritten post and I will follow your updates regularly.
ReplyDeleteMachine Learning course in Chennai
Machine Learning Training in Chennai
RPA Training in Chennai
RPA course in Chennai
Blue Prism Training in Chennai
Blue Prism Training Chennai
Automation Anywhere Training in Chennai
UiPath Training in Chennai
Machine Learning Institute in Chennai