Random Forests are among the most popular and best-performing methods for general machine learning tasks such as classification or regression. A Random Forest combines several decision trees into a single ensemble (forest) and averages the predictions of the individual trees. Each tree in the ensemble is trained on a random subsample of the input dataset, which lowers the variance of the ensemble as a whole.
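To make the averaging concrete, here is a minimal sketch for the regression case. It assumes a synthetic dataset from make_regression (not the data used later in this post) and compares the forest's prediction with a manual average over its trees.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, used here only for illustration (an assumption of this sketch)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y)

# Predict with every fitted tree separately, then average across trees
per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

# The forest's own prediction should agree with the manual average
forest_prediction = forest.predict(X)
print(np.allclose(manual_average, forest_prediction))
```

The fitted trees live in the estimators_ attribute, which is also what the helper functions later in this post iterate over.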
The scikit-learn implementation of Random Forests has several hyperparameters that control overfitting, among them:
- max_depth: the maximum allowed depth of the trees in the forest. Lowering it reduces the complexity of the trees and the chance of overfitting.
- min_samples_leaf: the minimum number of samples required in a terminal (leaf) node. Raising it makes the trees less likely to fit noise.
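As a quick sanity check of what these two parameters constrain, the sketch below fits a small forest and reads the resulting depth and leaf sizes back from the fitted trees. The dataset and the particular values (max_depth=4, min_samples_leaf=5) are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import _tree

# Synthetic data, used here only for illustration
X, y = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)

forest = RandomForestRegressor(
    n_estimators=20, max_depth=4, min_samples_leaf=5, random_state=0
).fit(X, y)

# Deepest tree in the forest: should not exceed max_depth
deepest = max(est.get_depth() for est in forest.estimators_)

# Smallest leaf in the forest: should not fall below min_samples_leaf.
# Leaves are the nodes whose left child is the TREE_LEAF sentinel.
smallest_leaf = min(
    est.tree_.n_node_samples[est.tree_.children_left == _tree.TREE_LEAF].min()
    for est in forest.estimators_
)

print(deepest, smallest_leaf)
```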
One can find the best values of these hyperparameters with a grid search, but it is also interesting to inspect the random forest visually and make a more educated choice of the ranges of max_depth and min_samples_leaf. One can use GraphViz to visualize individual trees, but this is useful only for trees that are not very complex (below is an example of a tree with max_depth=3):
The size of the graph becomes unmanageable as the underlying trees grow more complex (for example, as the number of samples in the input dataset increases). To address this, I decided to write two short helper functions that compute:
- Depth of each terminal (leaf) node in a given tree
- Number of samples in terminal (leaf) nodes
Let’s start with the first one:
import numpy as np
from sklearn.tree import _tree

def leaf_depths(tree, node_id=0):
    # tree.children_left and tree.children_right store the ids of the
    # left and right children of a given node
    left_child = tree.children_left[node_id]
    right_child = tree.children_right[node_id]

    # If a given node is terminal, both its left and right children
    # are set to _tree.TREE_LEAF
    if left_child == _tree.TREE_LEAF:
        # Set the depth of terminal nodes to 0
        depths = np.array([0])
    else:
        # Get the depths of the left and right children and increment them by 1
        left_depths = leaf_depths(tree, left_child) + 1
        right_depths = leaf_depths(tree, right_child) + 1
        depths = np.append(left_depths, right_depths)

    return depths
And the second one written in the same fashion:
def leaf_samples(tree, node_id=0):
    left_child = tree.children_left[node_id]
    right_child = tree.children_right[node_id]

    if left_child == _tree.TREE_LEAF:
        # A terminal node contributes its own sample count
        samples = np.array([tree.n_node_samples[node_id]])
    else:
        left_samples = leaf_samples(tree, left_child)
        right_samples = leaf_samples(tree, right_child)
        samples = np.append(left_samples, right_samples)

    return samples
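For very deep trees, the same traversal can be written without recursion. The sketch below is an alternative, not the code used in this post: the function name leaf_depths_and_samples is hypothetical, and it works on plain lists laid out like tree.children_left, tree.children_right and tree.n_node_samples, so it can be tried on a hand-built toy tree.

```python
TREE_LEAF = -1  # sentinel scikit-learn stores for "no child"

def leaf_depths_and_samples(children_left, children_right, n_node_samples):
    """Collect the depth and sample count of every leaf with an explicit stack."""
    depths, samples = [], []
    stack = [(0, 0)]  # (node_id, depth), starting at the root
    while stack:
        node_id, depth = stack.pop()
        left = children_left[node_id]
        if left == TREE_LEAF:
            # Terminal node: record its depth and its number of samples
            depths.append(depth)
            samples.append(n_node_samples[node_id])
        else:
            # Push the right child first so the left subtree is visited first
            stack.append((children_right[node_id], depth + 1))
            stack.append((left, depth + 1))
    return depths, samples

# Hand-built toy tree (sample counts in parentheses):
#        0 (10)
#       /      \
#    1 (4)    2 (6)
#             /    \
#          3 (1)  4 (5)
children_left = [1, -1, 3, -1, -1]
children_right = [2, -1, 4, -1, -1]
n_node_samples = [10, 4, 6, 1, 5]

depths, samples = leaf_depths_and_samples(children_left, children_right, n_node_samples)
print(depths, samples)  # → [1, 2, 2] [4, 1, 5]
```

With a fitted scikit-learn tree, the three arguments would be the corresponding arrays of its tree_ attribute.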
Example of visual inspection
One can inspect individual trees within the ensemble:
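from sklearn.datasets import load_boston  # removed in scikit-learn 1.2
from sklearn.ensemble import RandomForestRegressor
from sklearn.utils import check_random_state

boston = load_boston()
X = boston.data
y = boston.target
rnd = check_random_state(0)
ensemble = RandomForestRegressor(n_estimators=100, random_state=rnd)
ensemble.fit(X, y)  # estimators_ exists only after fitting
draw_tree(ensemble)  # draw_tree is defined in the next listing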
import matplotlib.pyplot as plt

def draw_tree(ensemble, tree_id=0):
    plt.figure(figsize=(8, 8))

    # Top panel: distribution of leaf depths in the selected tree
    plt.subplot(211)
    tree = ensemble.estimators_[tree_id].tree_
    depths = leaf_depths(tree)
    plt.hist(depths, histtype='step', color='#9933ff',
             bins=range(min(depths), max(depths)+1))
    plt.xlabel("Depth of leaf nodes (tree %s)" % tree_id)

    # Bottom panel: distribution of leaf sizes in the selected tree
    plt.subplot(212)
    samples = leaf_samples(tree)
    plt.hist(samples, histtype='step', color='#3399ff',
             bins=range(min(samples), max(samples)+1))
    plt.xlabel("Number of samples in leaf nodes (tree %s)" % tree_id)

    plt.show()
Or inspect the whole ensemble:
def draw_ensemble(ensemble):
    plt.figure(figsize=(8, 8))

    # Top panel: leaf depths, one light histogram per tree
    plt.subplot(211)
    depths_all = np.array([], dtype=int)
    for x in ensemble.estimators_:
        tree = x.tree_
        depths = leaf_depths(tree)
        depths_all = np.append(depths_all, depths)
        plt.hist(depths, histtype='step', color='#ddaaff',
                 bins=range(min(depths), max(depths)+1))

    # Bold histogram: average over all trees in the ensemble
    plt.hist(depths_all, histtype='step', color='#9933ff',
             bins=range(min(depths_all), max(depths_all)+1),
             weights=np.ones(len(depths_all))/len(ensemble.estimators_),
             linewidth=2)
    plt.xlabel("Depth of leaf nodes")

    # Bottom panel: leaf sizes, drawn the same way
    samples_all = np.array([], dtype=int)
    plt.subplot(212)
    for x in ensemble.estimators_:
        tree = x.tree_
        samples = leaf_samples(tree)
        samples_all = np.append(samples_all, samples)
        plt.hist(samples, histtype='step', color='#aaddff',
                 bins=range(min(samples), max(samples)+1))

    plt.hist(samples_all, histtype='step', color='#3399ff',
             bins=range(min(samples_all), max(samples_all)+1),
             weights=np.ones(len(samples_all))/len(ensemble.estimators_),
             linewidth=2)
    plt.xlabel("Number of samples in leaf nodes")

    plt.show()
Now one can conclude that, in the grid search over max_depth and min_samples_leaf, it is not worth trying max_depth > 20 or min_samples_leaf > 2, as this will clearly have a very small effect on the trees.
To validate the two helper functions, one can visualize the effect of max_depth and min_samples_leaf on the underlying trees in the forest. For example, max_depth=12 indeed results in a maximum depth of 12 and leads to less complex trees with more samples in the leaf nodes:
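A grid search restricted to those ranges could look like the sketch below. It takes the conclusion above at face value; the synthetic dataset, the specific grid values, the number of trees and the number of folds are all assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data with 13 features (an assumption of this sketch)
X, y = make_regression(n_samples=200, n_features=13, noise=10.0, random_state=0)

# Grid limited to max_depth <= 20 and min_samples_leaf <= 2
param_grid = {
    "max_depth": [5, 10, 15, 20],
    "min_samples_leaf": [1, 2],
}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=0),
    param_grid,
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
```

Narrowing the grid this way keeps the number of fits small: 8 parameter combinations times 3 folds here.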
ensemble = RandomForestRegressor(n_estimators=100, random_state=rnd, max_depth=12)
The same holds for min_samples_leaf=3:
ensemble = RandomForestRegressor(n_estimators=100, random_state=rnd, min_samples_leaf=3)
Randomization in Random Forests
In this last section we will demonstrate the effect of two more hyperparameters that control the way the randomization is introduced into trees:
- bootstrap: this parameter toggles random bootstrapping of the input dataset, in which each tree is trained on a sample drawn with replacement from the dataset, so some samples appear several times and others not at all.
- max_features: when searching for the best split of a given node, only a random subset of max_features features is considered.
Thus, these two parameters randomize the training data of each tree and the node-splitting procedure, respectively.
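Both sources of randomness can be sketched with the standard library alone. This is a conceptual illustration, not scikit-learn's internal code: it draws one bootstrap sample of row indices and one random feature subset, assuming 10,000 rows, 13 features and max_features=10.

```python
import random

rng = random.Random(0)

# Bootstrap: draw n_samples row indices with replacement
n_samples = 10_000
bootstrap_indices = [rng.randrange(n_samples) for _ in range(n_samples)]

# Fraction of distinct rows that made it into the bootstrap sample;
# for large n this is expected to be close to 1 - 1/e (about 0.632)
unique_fraction = len(set(bootstrap_indices)) / n_samples
print(round(unique_fraction, 3))

# max_features: candidate features for one split, drawn without replacement
n_features = 13
max_features = 10
feature_subset = rng.sample(range(n_features), max_features)
print(sorted(feature_subset))
```

A fresh feature subset is drawn at every split, whereas the bootstrap sample is drawn once per tree.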
In RandomForestRegressor the default and recommended value of max_features equals the total number of features, which means that in the example above randomization was introduced only through bootstrapping. What happens if we turn bootstrapping off?
ensemble = RandomForestRegressor(n_estimators=100, random_state=rnd, bootstrap=False)
As expected, the majority of the trees are the same (the small amount of leftover randomness could be due to a random choice between several equally good splits of a node). Let’s bring some randomization back by lowering max_features to 10 (the total number of features in this dataset is 13):
ensemble = RandomForestRegressor(n_estimators=100, random_state=rnd, bootstrap=False, max_features=10)
The trees in the forest are random again!
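One rough way to put a number on "the same" versus "random" is to count how many distinct tree shapes a forest contains. The sketch below uses a hypothetical helper, distinct_shapes, which treats the pair (node count, depth) as a crude fingerprint of a tree, and a synthetic 13-feature dataset rather than the one used above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data with 13 features (an assumption of this sketch)
X, y = make_regression(n_samples=200, n_features=13, noise=10.0, random_state=0)

def distinct_shapes(**params):
    """Count distinct (node_count, depth) pairs among the trees of a forest."""
    forest = RandomForestRegressor(
        n_estimators=30, random_state=0, bootstrap=False, **params
    ).fit(X, y)
    shapes = {(est.tree_.node_count, est.get_depth()) for est in forest.estimators_}
    return len(shapes)

# No bootstrapping, all features considered at every split
all_features = distinct_shapes(max_features=None)

# No bootstrapping, only 10 of the 13 features considered at every split
subset_features = distinct_shapes(max_features=10)

print(all_features, subset_features)
```

A count close to 1 suggests the trees are nearly identical, while a count close to the number of trees suggests each tree is different. Two trees can share a fingerprint and still split on different features, so this is only a coarse indicator.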
Visual inspection of Random Forests reveals their underlying structure and complexity and demonstrates the effects of their hyperparameters.
The code for this post is hosted on github: