
Decision Trees

Updated: Feb 11, 2020

In this post, we will talk about two important concepts in supervised learning: classification and decision trees. Classification is the task of automatically assigning a class to observations described by certain features. Formally, an observation consists of a vector of features and a class. A classification model assigns a class to a new observation, using its input features, based on the classes of previous observations. A binary classification problem is one with exactly two possible classes; multi-class classification takes place when there are more than two possible classes.


Classification in machine learning

An example of a data set that could be used for classification is a dataset of persons. Each person could have three features and a class. The features could be attributes like age, weight, and income. For a binary classification problem, the class could be something like happy or not happy; for a multi-class classification problem, the classes could be happy, satisfied, and not happy. Features in a classification problem can be numerical, meaning they are represented by a number, such as age or height. Features can also be categorical, meaning they take one of a finite set of possible values, such as travel_class or smokes.


Categorical and Numerical features
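To make this concrete, here is a minimal sketch of such a persons table with numerical features, categorical features, and a class label. The column names and values are made up purely for illustration, not taken from a real dataset:

```python
import pandas as pd

# Illustrative "persons" dataset: numerical and categorical features plus a class.
persons = pd.DataFrame({
    "age":          [25, 40, 31, 58],                              # numerical
    "weight":       [70, 82, 65, 90],                              # numerical
    "income":       [32000, 54000, 41000, 60000],                  # numerical
    "travel_class": ["economy", "business", "economy", "first"],   # categorical
    "smokes":       [False, True, False, True],                    # categorical
    "happy":        [1, 0, 1, 0],                                  # class label (binary)
})
print(persons)
```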

Let's start with a very intuitive form of classification: the decision tree. Suppose you want to diagnose a patient as sick or not sick, so one or zero. The logical thing to do is to ask questions about the patient, somewhat like a guessing game. For example, your first question could be: is the patient young or old? The next question depends on the answer. If the patient is old, you could ask whether the patient has smoked for more than 10 years. If the patient is young, you could ask whether the patient is vaccinated against measles, and so on. You can represent these questions as a tree. To classify patients as sick or not sick, you run through this tree and come to a conclusion.


A decision tree example
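The diagnostic questions above can be written directly as nested if/else checks. This is only an illustrative sketch: the cut-off for "old" and the outcomes at the leaves are assumptions, since the description only gives the questions:

```python
def diagnose(patient):
    """Sick (1) / not sick (0), following the questions described above.
    The threshold for "old" and the leaf outcomes are illustrative guesses."""
    if patient["age"] >= 65:                      # is the patient old?
        if patient["years_smoked"] > 10:          # smoked for more than 10 years?
            return 1
        return 0
    else:                                         # the patient is young
        if patient["vaccinated_measles"]:         # vaccinated against measles?
            return 0
        return 1

print(diagnose({"age": 70, "years_smoked": 20, "vaccinated_measles": True}))  # -> 1
```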


Let me define what we mean by a tree. A tree consists of nodes connected by edges. We call the start of the tree the root, and the ends are the leaves. A node can be split into its children; the opposite relation is the parent relation. To keep it short, here's a handy overview of all the tree relationships.


overview of decision tree relationships

The questions at the nodes of your tree are simply queries on the features of the data. For example, a data set can contain the age feature, and a question could be: is age smaller than or equal to 18? This splits the data set into two parts: instances for which age is bigger than 18, and the other instances. Suppose you are given the following tree and an observation: a patient who is 40 years old, vaccinated against measles, and has never smoked in his life. The first question the model asks is whether the patient is younger than or exactly 18. The patient you are classifying is 40, so you take the "no" branch. The next question is whether the patient has smoked or not. He never did, so you take the "no" branch again. You then arrive at a leaf node, which means you can make a prediction: the patient is not sick.


A practical decision tree example
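The same walk-through can be sketched in code. Below, the tree is stored as nested dictionaries (an assumed representation, not the post's own implementation) and a small loop runs the 40-year-old patient from the root down to a leaf:

```python
# Internal nodes hold a test and two branches; leaves hold a prediction.
# The outcomes on the "yes" side of the root are illustrative guesses.
tree = {
    "test": lambda p: p["age"] <= 18,                       # root: age <= 18?
    "yes":  {"test": lambda p: p["vaccinated_measles"],     # vaccinated against measles?
             "yes": {"leaf": "not sick"},
             "no":  {"leaf": "sick"}},
    "no":   {"test": lambda p: p["smoked"],                 # has the patient ever smoked?
             "yes": {"leaf": "sick"},
             "no":  {"leaf": "not sick"}},
}

def classify(node, patient):
    """Walk from the root to a leaf and return the leaf's prediction."""
    while "leaf" not in node:
        node = node["yes"] if node["test"](patient) else node["no"]
    return node["leaf"]

patient = {"age": 40, "vaccinated_measles": True, "smoked": False}
print(classify(tree, patient))   # -> "not sick", as in the walk-through above
```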

Learn a decision tree

We now know how to classify with a decision tree, but this assumes you already have one. How can you build it? You have to learn the decision tree from a training set. To do this, you need to come up with a query for each node in the tree. At the root node, you split the training set into two parts using a binary test; an example could be age less than or equal to 18. One child node will contain all instances of the training set for which the binary test was true, and the other child node will contain all instances for which the test was false.


Learning a decision tree
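A minimal sketch of that root split, assuming the training set is a list of dictionaries and the binary test is age <= 18 (the data is made up for illustration):

```python
def split(rows, test):
    """Return (instances for which the test is true, instances for which it is false)."""
    true_rows  = [r for r in rows if test(r)]
    false_rows = [r for r in rows if not test(r)]
    return true_rows, false_rows

training_set = [
    {"age": 12, "sick": 0},
    {"age": 16, "sick": 0},
    {"age": 40, "sick": 1},
    {"age": 70, "sick": 1},
]
left, right = split(training_set, lambda r: r["age"] <= 18)
print(left)    # instances with age <= 18
print(right)   # instances with age > 18
```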

Each child node then splits its resulting set again. This process continues until you stop at a leaf node, which contains a small portion of the training set. The goal of learning a decision tree is to end up with leaves that are pure, meaning they contain only instances of one particular class. In practice, however, this is almost never possible due to noise in the data. After you learn the tree, you can use it to classify new instances. When you classify a new instance, you end up in a leaf, and the class you assign to the instance is typically the class of the majority of the training instances in that leaf.
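And a tiny sketch of the prediction rule at a leaf, taking a majority vote over made-up leaf contents:

```python
from collections import Counter

leaf_instances = [{"sick": 0}, {"sick": 0}, {"sick": 1}]   # made-up contents of one leaf
majority_class = Counter(r["sick"] for r in leaf_instances).most_common(1)[0][0]
print(majority_class)   # -> 0, so this leaf predicts "not sick"
```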





As you may have noticed, the key to a good tree is splitting on the right features. In essence, a learning algorithm that builds a tree from training data has to iterate over a number of possible tests on specific features and choose the test that gives the best split of classes. So learning comes down to two parts: making a list of possible feature tests, and choosing the best one. When you want to split a node, you first make a list of possible feature tests. For categorical features, you can simply add feature tests that haven't been used yet. For numerical features it's somewhat more complex: here you have to propose a feature, but also a threshold to split on.
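A rough sketch of that first part, listing candidate tests. The helper name and the midpoint strategy for numerical thresholds are assumptions, since the post does not prescribe how thresholds are proposed:

```python
def candidate_tests(rows, categorical, numerical, used):
    """List possible feature tests for a node (names and strategy are illustrative)."""
    tests = []
    for feat in categorical:
        if feat not in used:                 # categorical: only features not used yet
            tests.append(("equals", feat))
    for feat in numerical:
        values = sorted({r[feat] for r in rows})
        for lo, hi in zip(values, values[1:]):
            tests.append(("<=", feat, (lo + hi) / 2))   # one threshold per midpoint
    return tests

rows = [{"age": 12, "travel_class": "economy"}, {"age": 40, "travel_class": "business"}]
print(candidate_tests(rows, ["travel_class"], ["age"], used={"smokes"}))
# -> [('equals', 'travel_class'), ('<=', 'age', 26.0)]
```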


Information Gain

The next step is choosing the best test to split the training data. This is pretty complex. You can use several splitting criteria to decide which test is best. One of the most common is information gain, a concept closely related to entropy - entropy here in the sense of information theory, not physics.


information gain


Basically, information gain measures how much information you gain about the training instances in the resulting subgroups when you perform a split based on a feature test. If a test leads to subgroups that are nicely divided per class, the information gain is high. If you end up with subsets that are still pretty scrambled, the information gain is low. The test that gives the highest information gain is chosen.
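A minimal sketch of how information gain can be computed for a binary split: the entropy of the parent set minus the weighted entropy of the two child subsets. The toy labels are made up for illustration:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
    return entropy(parent) - weighted

parent = [0, 0, 1, 1]
print(information_gain(parent, [0, 0], [1, 1]))   # -> 1.0 (perfectly separating split)
print(information_gain(parent, [0, 1], [0, 1]))   # -> 0.0 (split tells us nothing)
```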


Pruning a decision tree

Last but not least, the number of nodes in the decision tree plays an important role in whether the decision tree will overfit the training set or not. If you restrict the size of your decision tree, the bias of the resulting tree model is higher, which effectively decreases your chance of overfitting. This process is called pruning the decision tree. However, take note that pruning a decision tree will not magically increase the model's predictive accuracy. Most decision tree algorithms use a pretty good stopping criterion, which causes the algorithm to stop splitting nodes at some point.
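As a sketch of how this looks in practice, scikit-learn's DecisionTreeClassifier exposes parameters such as max_depth, min_samples_leaf, and ccp_alpha that restrict the tree's size or prune it after growing. The toy data below is made up purely for illustration:

```python
from sklearn.tree import DecisionTreeClassifier

X = [[12, 0], [40, 1], [70, 1], [16, 0], [55, 1], [8, 0]]   # e.g. [age, smoked]
y = [0, 0, 1, 1, 1, 0]                                       # sick / not sick

pruned = DecisionTreeClassifier(max_depth=3,          # limit tree depth
                                min_samples_leaf=2,   # require 2+ instances per leaf
                                ccp_alpha=0.01)       # cost-complexity pruning
pruned.fit(X, y)
print(pruned.predict([[40, 0]]))
```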


Hope this article makes decision trees clear. To learn more about decision trees and other topics in data science, visit this page.

