
Decision Trees

A piecewise-constant model that adaptively learns to divide the predictor space into regions and then fits a simple model within each region, without human intervention for deciding the cuts

Can be applied for regression and classification

Terms

| Term | Meaning |
|---|---|
| Internal Nodes | Labelled by attribute names; test an attribute (contain the splitting attributes) |
| Leaf/Terminal Nodes | Labelled by class labels |
| Edges | Determined by the number of outcomes of an attribute test condition |
| Size | Number of leaves |

Goal

Find cuts to obtain \(R_m\) and improve in-sample fit, while minimizing out-of-sample loss

Since it is computationally infeasible to consider every possible partition of the predictor space, we adopt a forward stepwise procedure

Decision Tree Approaches

| Approach | Steps | Disadvantage |
|---|---|---|
| Greedy | Divide the problem into steps and take a decision at each step: build the tree top-down, splitting the training set into purer subsets (the best split) at every node | Backtracking is not possible |
| Recursive | Apply the same splitting procedure recursively to each resulting subset | |

Regression Tree

\[ \begin{aligned} \hat y &= \sum_{m=1}^M \hat y_m \cdot I \{ x \in R_m \} \\ \hat y_m &= \underset{x_i \in R_m}{\text{mean}} \ (y_i) \end{aligned} \]

where

  • \(M =\) no of leaves = no of regions
  • \(R_m =\) region \(m\)
  • \(n_m=\) no of obs in \(R_m\)

For every observation that falls into region \(R_m\), we make the same prediction: the mean of the response values of the training observations in \(R_m\)
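
As a quick illustration of this rule, here is a minimal NumPy sketch (the data and regions are made up, not from the notes; in practice the regions are learned by the splitting algorithm): each observation receives the mean response of the region it falls into.

```python
import numpy as np

# Toy 1-D data with two hypothetical regions R_1 = {x < 3} and R_2 = {x >= 3}
x = np.array([1.0, 2.0, 2.5, 4.0, 5.0, 6.0])
y = np.array([1.1, 0.9, 1.0, 3.2, 2.8, 3.0])

regions = [x < 3, x >= 3]                 # indicator I{x in R_m}
y_hat_m = [y[r].mean() for r in regions]  # mean response per region

# Prediction: every point in region m receives the same value y_hat_m
y_hat = np.select(regions, y_hat_m)
print(y_hat)   # [1. 1. 1. 3. 3. 3.]
```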

CART Algorithm

Greedy recursive partitioning algorithm that divides predictor space through successive binary splits, until a stopping criterion is reached.

This is greedy because at each step of the tree-building process, the optimal split is made at that particular step, rather than looking ahead and picking a split that will lead to a better tree in some future step

Steps

  1. At each step, generate the binary split of the predictor space into two regions that achieves the biggest reduction in \(E_\text{in}\)

  2. The two regions should be as homogeneous in the response \(y\) as possible

    • \(R_\text{left} = \{ x \in R: x_j < s \}\)
    • \(R_\text{right} = \{ x \in R: x_j \ge s \}\)
  3. For regression, this means choosing \(j, s\) such that (a brute-force search for this split is sketched after this list)

    \(\arg \min\limits_{j, s} \left \{ \sum \limits_{x_i \in R_\text{left}} L(y_i, \hat y_\text{left}) + \sum \limits_{x_i \in R_\text{right}} L(y_i, \hat y_\text{right}) \right \}\)

  4. Prune the tree by choosing a subtree \(T \subseteq T_0\) that minimizes, for a given \(\alpha \ge 0\),

\(\sum \limits_{m=1}^{\vert T \vert} \sum \limits_{x_i \in R_m} L(y_i, \hat y_m) + \alpha \vert T \vert\)

  • \(\vert T \vert =\) size of tree
  • \(\alpha\) controls tradeoff b/w subtree’s complexity and fit to training data
    • \(\alpha=0 \implies T=T_0\)
    • \(\vert T \vert \propto \dfrac{1}{\alpha}\)
    • Optimal \(\alpha\) obtained through cross-validation
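
To make the greedy criterion in step 3 concrete, here is a brute-force sketch (illustrative only; squared-error loss, candidate thresholds at the observed values, and the toy data are assumptions) that searches every predictor \(j\) and threshold \(s\):

```python
import numpy as np

def best_split(X, y):
    """Greedy CART-style search: return (j, s) minimizing the summed
    squared error around the two child means, plus that error."""
    best = (None, None, np.inf)
    n, p = X.shape
    for j in range(p):                      # every predictor
        for s in np.unique(X[:, j]):        # every candidate threshold
            left, right = y[X[:, j] < s], y[X[:, j] >= s]
            if len(left) == 0 or len(right) == 0:
                continue                    # skip degenerate splits
            err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if err < best[2]:
                best = (j, s, err)
    return best

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))
y = np.where(X[:, 0] < 5, 1.0, 4.0) + rng.normal(0, 0.1, 50)
print(best_split(X, y))   # should recover a threshold near x_0 = 5
```

For step 4, scikit-learn's `DecisionTreeRegressor` exposes cost-complexity pruning through its `ccp_alpha` parameter and `cost_complexity_pruning_path` method, so candidate \(\alpha\) values can be compared directly via cross-validation.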

Classification Tree

\[ \begin{aligned} \hat y &= \sum_{m=1}^M \hat c_m \cdot I (x \in R_m) \\ \hat c_m &= \underset{x_i \in R_m}{\text{mode}} \ (y_i) \end{aligned} \]

\(\hat c_m =\) most frequent class in \(R_m\)

Node Impurity

| Measure | Formula |
|---|---|
| Misclassification Rate | \(1 - \max\limits_c \hat p_c^m\) |
| Gini Index | \(\sum \limits_{c=1}^C \hat p_c^m (1 - \hat p_c^m)\) |
| Cross-Entropy | \(- \sum \limits_{c=1}^C \hat p_c^m \cdot \ln \hat p_c^m\) |

where \(\hat p_c^m =\) proportion of class \(c\) in node \(m\)
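
A minimal sketch of the three measures computed from the class counts of a single node (the counts below are made up):

```python
import numpy as np

def impurities(counts):
    """Misclassification rate, Gini index, and cross-entropy for one node."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()                              # class proportions p_c^m
    misclass = 1 - p.max()
    gini = np.sum(p * (1 - p))
    entropy = -np.sum(p[p > 0] * np.log(p[p > 0]))   # 0 log 0 treated as 0
    return misclass, gini, entropy

# Hypothetical node with 6 records of class A and 4 of class B
print(impurities([6, 4]))   # (0.4, 0.48, ~0.673)
```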

Steps

  1. Compute the entropy of the class label for the current set/subset of data \(S\)

\[ H(S) = -P_\text{Pos} \log_2 (P_\text{Pos}) - P_\text{Neg} \log_2 (P_\text{Neg}) \]

  2. For each independent variable \(A\), calculate the gain obtained by splitting \(S\) on it

\[ \text{Gain}(S, A) = H(S) - \sum_{v \in \text{Values}(A)} \frac{n(S_v)}{n(S)} \, H(S_v) \]

    where \(S_v\) is the subset of \(S\) with \(A = v\)

  3. Pick the independent variable with the highest gain as the splitting attribute

  4. Recursively repeat for each resulting subset
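
A small sketch of these steps for one categorical attribute (toy data; the attribute and label names are hypothetical), computing the class entropy and the information gain of splitting on that attribute:

```python
import numpy as np

def entropy(labels):
    """Class entropy H (log base 2) of a list of labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(attribute, labels):
    """Gain of splitting `labels` on the values of `attribute`."""
    attribute, labels = np.asarray(attribute), np.asarray(labels)
    weighted = sum(
        (attribute == v).mean() * entropy(labels[attribute == v])
        for v in np.unique(attribute)
    )
    return entropy(labels) - weighted

# Toy example: does 'Outlook' help predict 'Play'?
outlook = ["sunny", "sunny", "overcast", "rain", "rain", "overcast"]
play    = ["no",    "no",    "yes",      "yes",  "no",   "yes"]
print(information_gain(outlook, play))   # ~0.667
```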

Hunt’s Algorithm

Let

  • \(t\) be a node
  • \(D_t\) be the training dataset at node \(t\)
  • \(y\) be the set of class labels \(\{ C_1, C_2, \dots, C_n \}\)

If all records in \(D_t\) belong to the same class \(y_t\)

  • Make node \(t\) into a leaf node, labelled with class \(y_t\)

If \(D_t\) contains records from more than one class

  • Split using an appropriate attribute
    • Apply the splitting criterion to each attribute and obtain the impurity of the split using that attribute
    • Pick the attribute that gives the lowest impurity
  • Label node \(t\) with this attribute
  • Determine the outcomes, create a node for each outcome, and draw the edges
  • Split the dataset accordingly
  • Repeat the steps for each subtree

Additional Cases

Empty Subset

Suppose that, when splitting the data, one of the resulting subsets is empty

  1. Make node \(t\) into a leaf node
  2. \(t\) is labelled as the majority class of the parent dataset

All records have identical attribute values

  1. Make \(t\) into a leaf node
  2. \(t\) is labelled as the majority class represented in the current subset

Number of records falls below a minimum threshold

  1. Make \(t\) into a leaf node
  2. \(t\) is labelled as the majority class represented in the current subset
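
A compact recursive sketch of Hunt's algorithm, folding in the additional cases above (illustrative only; the dict-of-records representation, the Gini impurity, and the `min_records` threshold are assumptions, not from the notes):

```python
from collections import Counter

def hunt(records, labels, attributes, parent_labels=None, min_records=2):
    """Recursively grow a decision tree from a list of attribute dicts."""
    # Empty subset: leaf labelled with the majority class of the parent dataset
    if not records:
        return Counter(parent_labels).most_common(1)[0][0]
    majority = Counter(labels).most_common(1)[0][0]
    # Pure node, identical attribute values, no attributes left, or too few
    # records: leaf labelled with the majority class of the current subset
    if (len(set(labels)) == 1 or not attributes
            or all(r == records[0] for r in records)
            or len(records) < min_records):
        return majority

    def split_impurity(attr):
        # Weighted Gini impurity of splitting on `attr`
        total = 0.0
        for v in {r[attr] for r in records}:
            sub = [l for r, l in zip(records, labels) if r[attr] == v]
            props = [c / len(sub) for c in Counter(sub).values()]
            total += len(sub) / len(records) * (1 - sum(p * p for p in props))
        return total

    best = min(attributes, key=split_impurity)   # lowest-impurity attribute
    tree = {best: {}}                            # one edge per outcome of `best`
    for v in {r[best] for r in records}:
        idx = [i for i, r in enumerate(records) if r[best] == v]
        tree[best][v] = hunt([records[i] for i in idx],
                             [labels[i] for i in idx],
                             [a for a in attributes if a != best],
                             labels, min_records)
    return tree

# Tiny hypothetical usage
records = [{"Outlook": "sunny"}, {"Outlook": "rain"}, {"Outlook": "sunny"}]
labels  = ["no", "yes", "no"]
print(hunt(records, labels, ["Outlook"]))
# e.g. {'Outlook': {'sunny': 'no', 'rain': 'yes'}} (key order may vary)
```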

Splitting Continuous Attributes

Binary Split

Choose a splitting value that results in a purer partition

Multi-Way Split

  • Apply discretization
  • Each bin will be a different node
flowchart LR

subgraph Multi-Way Split
direction TB
s2((Salary))

a2(( ))
b2(( ))
c2(( ))
d2(( ))
e2(( ))

s2 -->|< 10k| a2
s2 -->|10k - 30k| b2
s2 -->|30k - 50k| c2
s2 -->|50k - 80k| d2
s2 -->| > 80k| e2
end

subgraph Binary
direction TB
s1((Salary))

a1(( ))
b1(( ))

s1 -->|< v| a1
s1 -->| >= v| b1
end
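
For the multi-way case, the discretization step is just binning; a minimal sketch using the salary bins from the diagram above (pandas is an assumption here, any binning utility works):

```python
import pandas as pd

salary = pd.Series([8_000, 25_000, 42_000, 60_000, 95_000])

# Bin edges matching the multi-way split: <10k, 10k-30k, 30k-50k, 50k-80k, >80k
bins   = [0, 10_000, 30_000, 50_000, 80_000, float("inf")]
labels = ["<10k", "10k-30k", "30k-50k", "50k-80k", ">80k"]

# Each bin becomes one outcome (child node) of the test on Salary
print(pd.cut(salary, bins=bins, labels=labels))
```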

Challenges

  • How to split the training records?
    • Find the best splitting attribute (test on attribute)
    • Measure the goodness of a split
  • When to stop splitting? Stopping only at a fully-grown tree (one that has learnt all the characteristics of the training data) may not be optimal, so other stopping conditions may be required:
    • When all records at a node have the same attribute value
    • When all records at a node have the same class label
    • When a node receives an empty subset

Attribute Test Condition and Outcomes

Nominal

A binary split has \(2^{k-1} - 1\) possible groupings, where \(k =\) no of attribute values. For example (two of the groupings for \(k = 4\)):

flowchart LR

subgraph 1
direction TB
aa[Attribute] --> v1a[v1] & notv1a["Not v1 <br> (v2 or v3 or v4)"]
end

subgraph 2
direction TB
ab[Attribute] --> v1b[v1 or v2] & notv1b["Not (v1 or v2) <br> (v3 or v4)"]
end
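
A quick illustrative sketch that enumerates these groupings for \(k = 4\) values, confirming \(2^{3} - 1 = 7\) possible binary splits:

```python
from itertools import combinations

values = ["v1", "v2", "v3", "v4"]
rest = values[1:]

# Keep values[0] on the left side to avoid counting each split twice; the left
# side must never contain all values, so we stop before taking all of `rest`.
splits = [
    ({values[0], *combo}, set(rest) - set(combo))
    for r in range(len(rest))          # r = 0 .. k-2 extra values on the left
    for combo in combinations(rest, r)
]
print(len(splits))                     # 7 = 2^(4-1) - 1
for left, right in splits:
    print(sorted(left), "|", sorted(right))
```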

Missed this class

Lec20.pdf

Terms

Variables

\[ \Delta = I(\text{Parent}) - \sum_{j=1}^k \underbrace{ \frac{N(V_j)}{N} I (V_j) }_\text{Weighted Average Impurity} \]
  • \(I(\text{Parent}) =\) Impurity at parent
  • \(N =\) no of records before split (parent)
  • \(k =\) no of splits
  • \(V_j =\) split \(j\) (child node)
  • \(N(V_j) =\) no of records at split \(V_j\)
  • \(I(V_j) =\) impurity at child (split) \(V_j\)

Steps to choose splitting attribute

  • Compute the degree of impurity of the parent node (before splitting)
  • For each candidate attribute:
    • Compute the degree of impurity of the child nodes (after splitting)
    • Compute the weighted average impurity of the split
    • Compute the gain \((\Delta)\); the larger the gain, the better the test condition
  • Choose the attribute with the largest gain
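
For instance (made-up counts, using the Gini index as the impurity \(I\)): a parent node with 10 records (6 positive, 4 negative) is split into \(V_1\) with 4 records (all positive) and \(V_2\) with 6 records (2 positive, 4 negative):

\[ \begin{aligned} I(\text{Parent}) &= 1 - (0.6^2 + 0.4^2) = 0.48 \\ I(V_1) &= 1 - (1^2 + 0^2) = 0 \\ I(V_2) &= 1 - \left( \left(\tfrac{1}{3}\right)^2 + \left(\tfrac{2}{3}\right)^2 \right) = \tfrac{4}{9} \approx 0.444 \\ \Delta &= 0.48 - \left( \tfrac{4}{10} \cdot 0 + \tfrac{6}{10} \cdot 0.444 \right) \approx 0.213 \end{aligned} \]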

Note

The gain \(\Delta\) is called information gain when the impurity measure is entropy
