Skip to main content

Decision Tree

Decision Tree algorithm is a part of supervised learning algorithms.

It is used for both regression and classification.

  1. Root Node: It represents the entire population or sample and this further gets divided into two or more homogeneous sets.
  2. Splitting: It is a process of dividing a node into two or more sub-nodes. Decision Node: When a sub-node splits into further sub-nodes, then it is called the decision node.
  3. Leaf / Terminal Node: Nodes do not split is called Leaf or Terminal node.
  4. Pruning: When we remove sub-nodes of a decision node, this process is called pruning. You can say the opposite process of splitting.
  5. Branch / Sub-Tree: A subsection of the entire tree is called branch or sub-tree.
  6. Parent and Child Node: A node, which is divided into sub-nodes is called a parent node of sub-nodes whereas sub-nodes are the child of a parent node.

image

Workflow of a Decision Tree

Decision tree works based on node split strategy.

The different measures of splitting the nodes of a decision tree are -

  • Entropy - It is the measure of randomness in the features. image

  • Information Gain - measures how well an attribute is separated.

    note

    We want highest information gain and lowest entropy

    Information gain = Entropy before split - Entropy after split

  • Gini Index - It is a cost function used to evaluate performance.

    Gini = 1(pi)21 - \sum{(p_i)^2}

    where, pi = probabilities of each class.

    note

    Higher value of Gini Index -> Higher iequality, heterogeneity.

  • Pruning - Trimming of the branches of a tree or removing decision nodes from the root note such that the overall accuracy is not disturbed.

image