Sunday, 29 July 2018

Classification Series : Part I



Decision Trees 

Most newcomers to machine learning are familiar with its three broad categories: supervised, unsupervised and reinforcement learning.

Supervised: The output (prediction) variable is learned from an available training set, whose labels are assumed to be correct. The algorithm finds a mathematical relationship between the inputs and the output in the training set, and the same relationship is then applied to new data.
Example: predicting the winner of a match between two teams, based on a historic set of all the times they played in similar conditions and what the results were.

Unsupervised: The algorithm explores the data without being given any output labels, and finds groups of records that share similar properties and may be useful.
Example: Netflix movie recommendations based on the content you have watched, or on what people with similar characteristics have watched.

Reinforcement: A wonderful definition from a McKinsey article states that an algorithm learns to perform a task simply by trying to maximize the rewards it receives for its actions.

An interesting example is playing chess. To determine the best move, a player needs to think about many factors, and the number of possibilities is so large that an exhaustive search is not feasible. If we were to build a machine to play such a game using traditional techniques, we would need to specify a huge number of rules to cover all these possibilities.

Reinforcement learning completely bypasses this problem. We do not need to manually specify any rules. The learning agent simply learns by playing the game.

For the last few articles we have mostly covered supervised learning, simple linear predictions and the maths behind them. With this series we start on classification models, which are interesting because they predict a categorical attribute rather than a definite number.

Introduction and Maths

We will start the series with the Decision Tree, one of the most important classification models.
Here is our sample problem for tree-based models:

Problem Statement: We have a set of people with their education, age and job status (Yes/No). Based on this historic data, we want to predict whether they will subscribe to our channel, Butterfly Predictions.

                                                    
Education   Age   Job   Content
U Grad      30    No    Subscribe
U Grad      25    No    Subscribe
Grad        35    No    Unsubscribe
Grad        45    No    Subscribe
U Grad      20    No    Subscribe
Grad        32    Yes   Unsubscribe
Grad        45    Yes   Unsubscribe
U Grad      19    No    Unsubscribe
Grad        42    No    Subscribe
U Grad      40    Yes   Subscribe
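If you want to follow along in R, the table can be typed in as a small data frame. This is only a sketch; the name channel and the exact level spellings are assumptions made here for illustration.

# Sample data from the table above; names and level spellings are assumed.
channel <- data.frame(
  Education = c("U Grad", "U Grad", "Grad", "Grad", "U Grad",
                "Grad", "Grad", "U Grad", "Grad", "U Grad"),
  Age       = c(30, 25, 35, 45, 20, 32, 45, 19, 42, 40),
  Job       = c("No", "No", "No", "No", "No", "Yes", "Yes", "No", "No", "Yes"),
  Content   = c("Subscribe", "Subscribe", "Unsubscribe", "Subscribe", "Subscribe",
                "Unsubscribe", "Unsubscribe", "Unsubscribe", "Subscribe", "Subscribe"),
  stringsAsFactors = TRUE
)

str(channel)   # 10 observations of 4 variables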




Solution: First, find the root node. How do we know which attribute is best to start our decision tree with? This is where the concept of entropy comes into the picture.
Entropy can, in simple words, be defined as the degree of randomness. If there are two classes, “P” and “N”, we would like a split to produce one group of mostly “P” cases and another of mostly “N” cases, so that in each group one class has a proportion close to 1. Entropy is the purity measure that captures this.

If there are two classes (P, N) in the prediction variable of the data set, then

P(+)= Subscribe
N(-) = Unsubscribe

Entropy(class) = H = -[P/(P+N)] log2(P/(P+N)) - [N/(P+N)] log2(N/(P+N)) ________ Eq. 1

The lowest entropy is what we are looking for when choosing the root node: a perfectly pure group has P(P) = 1 and P(N) = 0, giving H = 0.

Solving for the data set in our reference table above:

P (Subscribe) = 6 (from the Content column)
N (Unsubscribe) = 4 (from the Content column)

The formula in Eq. 1 becomes

H = -0.6 log2(0.6) - 0.4 log2(0.4)
H(class) ≈ 0.971
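As a quick check, the same value can be computed in R. A minimal sketch, assuming the channel data frame built above; entropy() is our own helper function, not something from a package.

# Entropy of the class column (Eq. 1): H = -sum(p * log2(p))
entropy <- function(x) {
  p <- table(x) / length(x)     # class proportions
  -sum(p * log2(p))
}

entropy(channel$Content)        # ~0.971 for 6 Subscribe vs 4 Unsubscribe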

Now for information gain, the formula is
IG(attribute) = H(class) - H(attribute)

where H(attribute) is the weighted average entropy of the subsets produced by splitting on that attribute:

IG(attribute) = H(class) - Σv [(Pv+Nv)/(P+N)] H(subset v) ________ Eq. 2

Here Pv and Nv are the numbers of Subscribe and Unsubscribe records within each value v of the attribute being considered for the split.

In this way you need to calculate the information gain for each candidate attribute:

IG(Education), IG(Age), IG(Job)

The split should maximize information gain: the attribute with the highest gain among Education, Age and Job becomes the root node, and the same rule is applied at every subsequent node. A small sketch of this calculation in R follows below.
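Here is a minimal sketch of the gain calculation, assuming the channel data frame and the entropy() helper from the earlier snippets (both are illustrative names of our own, not part of any package):

# Information gain (Eq. 2): class entropy minus the weighted entropy of the
# subsets produced by splitting on the attribute.
info_gain <- function(data, attribute, target = "Content") {
  h_class <- entropy(data[[target]])
  subsets <- split(data[[target]], data[[attribute]])
  weights <- sapply(subsets, length) / nrow(data)
  h_class - sum(weights * sapply(subsets, entropy))
}

info_gain(channel, "Education")
info_gain(channel, "Job")
# Age is numeric, so a real tree would first choose a cut point (e.g. Age < 20)
# and compute the gain of that binary split.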

Keep repeating the same process for every node and column, and eventually we get the following tree plotted.

The numbers at the top of each node in the tree correspond to the node numbers in the textual representation of the tree generated by the default print() method.

Each tree node shows four values, one on each side: the north side shows the majority value of the target variable in that node (in our case, Content), the west and east sides show the percentage of cases satisfying the left and right conditions respectively, and the south side shows the percentage of all observations falling in that node, which is why the root node has the maximum percentage.

The splits are mostly self-explanatory. The first split is based on Job: if the user has no job, there is roughly a 20% chance that they will unsubscribe, and so on. From our sample, people who have a job, are older than 20 and are undergraduates subscribe to the blog with a good percentage of 40%, which is clearly a useful hint at our target audience.




R Code
A very simple piece of R code, along the lines of the sketch below, can help us build the tree above.
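The post's original snippet is shown as an image, so the following is only a rough reconstruction using the rpart and rpart.plot packages; the data frame name and the control values are assumptions for this tiny sample.

library(rpart)
library(rpart.plot)                              # for rpart.plot()

fit <- rpart(Content ~ Education + Age + Job,
             data    = channel,
             method  = "class",                      # classification tree
             control = rpart.control(minsplit = 2,   # relax the defaults for
                                     minbucket = 1,  # such a small data set
                                     cp = 0.01))

print(fit)        # textual representation; node numbers match the plot
rpart.plot(fit)   # draw the tree
summary(fit)      # splits, primary and surrogate splits (discussed below)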



Control parameters (courtesy of the R rpart documentation):

minsplit: the minimum number of observations that must exist in a node in order for a split to be attempted.

minbucket: the minimum number of observations in any terminal (leaf) node. If only one of minbucket or minsplit is specified, the code either sets minsplit to minbucket*3 or minbucket to minsplit/3, as appropriate.

cp (complexity parameter): any split that does not decrease the overall lack of fit by a factor of cp is not attempted. For instance, with anova splitting, this means that the overall R-squared must increase by cp at each step. The main role of this parameter is to save computing time by pruning off splits that are obviously not worthwhile. Essentially, the user informs the program that any split which does not improve the fit by cp will likely be pruned off by cross-validation, and that hence the program need not pursue it.
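In practice the cp table is also what you prune against; a short sketch, assuming the fit object from the code above:

printcp(fit)                      # cp table with cross-validated error (xerror)
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
print(pruned)                     # the pruned tree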

Summary 



Left/Right son: "son" tells you the number of the next node below that split. The "obs" numbers show how many of the training observations fall on each side.

Primary splits: the leading candidate variables that could have been used for the split.

Surrogate splits: these are used when there are missing predictor values. Another split that approximates the original split's results can be used if its values are not missing.

We will also discuss how to check the effectiveness of these R models in further articles.

Applications

Most predictions that require a categorical answer are a good fit for a decision tree, usually with good accuracy: sports predictions, categorization of books, predicting the performance of a new employee at joining time, and many more.
This is just the start of a series of algorithms in this section; let me know how you feel in the comments section.

Further tutorials on classification

Classification Tree

