Random Forests are among the most popular and best-performing methods for general machine learning tasks such as classification and regression. A Random Forest combines several decision trees into a single ensemble (forest) and averages the predictions of the individual trees. Each tree in the ensemble is trained on a random subsample of the input dataset, which lowers the variance of the ensemble as a whole.
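This averaging is easy to verify directly: in scikit-learn, a fitted **RandomForestRegressor** exposes the individual trees via **estimators_**, and the forest prediction equals the mean of the per-tree predictions (a minimal sketch on a synthetic dataset, used purely for illustration):

```
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y = X[:, 0] + 0.1 * rng.rand(100)

forest = RandomForestRegressor(n_estimators=10, random_state=0)
forest.fit(X, y)

# The forest prediction is the mean over the individual tree predictions
tree_preds = np.stack([t.predict(X) for t in forest.estimators_])
assert np.allclose(forest.predict(X), tree_preds.mean(axis=0))
```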

Among the hyperparameters of the Random Forest implementation in scikit-learn, two control overfitting:

- **max_depth**: the maximum allowed depth of trees in the forest; limiting it reduces the complexity of the trees and lowers the chance of overfitting.
- **min_samples_leaf**: the minimum required number of samples in terminal (leaf) nodes; raising it reduces the possibility of fitting noise.
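Both are set in the constructor, and after fitting the realized depth of each tree can be read back from the **tree_.max_depth** attribute (a small sketch on synthetic data, used only to show the constraint taking effect):

```
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X, y = rng.rand(200, 5), rng.rand(200)

forest = RandomForestRegressor(n_estimators=20, max_depth=6,
                               min_samples_leaf=3,
                               random_state=0).fit(X, y)

# No tree in the forest exceeds the requested max_depth
assert all(est.tree_.max_depth <= 6 for est in forest.estimators_)
```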

One can find the best values of these hyperparameters with a grid search over
them, but it is also useful to visualize the random forest in order to inspect
it and make a more educated choice of the ranges of **max_depth** and
**min_samples_leaf**. One can use GraphViz to visualize individual trees, but
this is practical only for fairly simple trees (below is an example of a tree
with **max_depth=3**):
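For reference, scikit-learn can produce such a drawing with `sklearn.tree.export_graphviz`; with `out_file=None` it returns the DOT source, which GraphViz's `dot` tool renders to an image. A sketch, using `make_regression` as a stand-in dataset:

```
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_graphviz

X, y = make_regression(n_samples=100, n_features=5, random_state=0)
forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# DOT source for the first tree, truncated to depth 3;
# render with e.g. `dot -Tpng tree.dot -o tree.png`
dot_source = export_graphviz(forest.estimators_[0], out_file=None, max_depth=3)
```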

The size of the graph becomes unmanageable as the underlying trees grow in complexity (i.e., as the number of samples in the input dataset increases). To address this, I decided to write two short helper functions that calculate:

- Depth of each terminal (leaf) node in a given tree
- Number of samples in terminal (leaf) nodes

### Implementation

Let’s start with the first one:

```
import numpy as np
from sklearn.tree import _tree

def leaf_depths(tree, node_id=0):

    '''
    tree.children_left and tree.children_right store ids
    of the left and right children of a given node
    '''
    left_child = tree.children_left[node_id]
    right_child = tree.children_right[node_id]

    '''
    If a given node is terminal,
    both left and right children are set to _tree.TREE_LEAF
    '''
    if left_child == _tree.TREE_LEAF:

        '''
        Set depth of terminal nodes to 0
        '''
        depths = np.array([0])

    else:

        '''
        Get depths of left and right children and
        increment them by 1
        '''
        left_depths = leaf_depths(tree, left_child) + 1
        right_depths = leaf_depths(tree, right_child) + 1

        depths = np.append(left_depths, right_depths)

    return depths
```

And the second one, written in the same fashion:

```
def leaf_samples(tree, node_id=0):

    left_child = tree.children_left[node_id]
    right_child = tree.children_right[node_id]

    if left_child == _tree.TREE_LEAF:

        samples = np.array([tree.n_node_samples[node_id]])

    else:

        left_samples = leaf_samples(tree, left_child)
        right_samples = leaf_samples(tree, right_child)

        samples = np.append(left_samples, right_samples)

    return samples
```
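A useful invariant for checking `leaf_samples`: every training sample ends up in exactly one leaf, so its return values must sum to the size of the training set. The same check can be done without the recursion, straight from the tree arrays (a sketch on synthetic data):

```
import numpy as np
from sklearn.tree import DecisionTreeRegressor, _tree

rng = np.random.RandomState(0)
X, y = rng.rand(200, 5), rng.rand(200)

tree = DecisionTreeRegressor(random_state=0).fit(X, y).tree_

# Terminal nodes have both children set to _tree.TREE_LEAF
is_leaf = tree.children_left == _tree.TREE_LEAF

# The leaves partition the training set, so their sample counts sum to len(X)
assert tree.n_node_samples[is_leaf].sum() == len(X)
```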

### Example of visual inspection

One can inspect individual trees within the ensemble:

```
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
from sklearn.utils import check_random_state

boston = load_boston()
X = boston.data
y = boston.target

rnd = check_random_state(0)

ensemble = RandomForestRegressor(n_estimators=100, random_state=rnd)
# estimators_ is populated only after fitting
ensemble.fit(X, y)

draw_tree(ensemble)
```

where:

```
def draw_tree(ensemble, tree_id=0):

    plt.figure(figsize=(8,8))
    plt.subplot(211)

    tree = ensemble.estimators_[tree_id].tree_

    depths = leaf_depths(tree)
    plt.hist(depths, histtype='step', color='#9933ff',
             bins=range(min(depths), max(depths)+1))
    plt.xlabel("Depth of leaf nodes (tree %s)" % tree_id)

    plt.subplot(212)

    samples = leaf_samples(tree)
    plt.hist(samples, histtype='step', color='#3399ff',
             bins=range(min(samples), max(samples)+1))
    plt.xlabel("Number of samples in leaf nodes (tree %s)" % tree_id)

    plt.show()
```

Or inspect the whole ensemble:

`draw_ensemble(ensemble)`

where:

```
def draw_ensemble(ensemble):

    plt.figure(figsize=(8,8))
    plt.subplot(211)

    depths_all = np.array([], dtype=int)

    for x in ensemble.estimators_:
        tree = x.tree_
        depths = leaf_depths(tree)
        depths_all = np.append(depths_all, depths)
        plt.hist(depths, histtype='step', color='#ddaaff',
                 bins=range(min(depths), max(depths)+1))

    plt.hist(depths_all, histtype='step', color='#9933ff',
             bins=range(min(depths_all), max(depths_all)+1),
             weights=np.ones(len(depths_all))/len(ensemble.estimators_),
             linewidth=2)
    plt.xlabel("Depth of leaf nodes")

    samples_all = np.array([], dtype=int)

    plt.subplot(212)

    for x in ensemble.estimators_:
        tree = x.tree_
        samples = leaf_samples(tree)
        samples_all = np.append(samples_all, samples)
        plt.hist(samples, histtype='step', color='#aaddff',
                 bins=range(min(samples), max(samples)+1))

    plt.hist(samples_all, histtype='step', color='#3399ff',
             bins=range(min(samples_all), max(samples_all)+1),
             weights=np.ones(len(samples_all))/len(ensemble.estimators_),
             linewidth=2)
    plt.xlabel("Number of samples in leaf nodes")

    plt.show()
```

Now, one can conclude that in the grid search over **max_depth** and
**min_samples_leaf** it is not worth trying **max_depth > 20** or
**min_samples_leaf > 2**, as these values clearly have very little effect on the trees.
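With the ranges narrowed by this inspection, the grid search itself takes only a few lines with `GridSearchCV`. A sketch, using a synthetic stand-in for the dataset (the grid values below are the ones the inspection suggests; the dataset and estimator sizes are illustrative):

```
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=13, noise=10.0, random_state=0)

param_grid = {"max_depth": [4, 8, 12, 16, 20],
              "min_samples_leaf": [1, 2]}

search = GridSearchCV(RandomForestRegressor(n_estimators=30, random_state=0),
                      param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)
```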

To validate these two helper functions, one can visualize the effect of
**max_depth** and **min_samples_leaf** on the underlying trees in the forest.
For example, **max_depth=12** indeed results in a maximum depth of 12 and leads to
less complex trees with more samples in the leaf nodes:

```
ensemble = RandomForestRegressor(n_estimators=100, random_state=rnd,
                                 max_depth=12)
```

The same holds for **min_samples_leaf=3**:

```
ensemble = RandomForestRegressor(n_estimators=100, random_state=rnd,
                                 min_samples_leaf=3)
```

### Randomization in Random Forests

In this last section, we demonstrate the effect of two more hyperparameters that control how randomization is introduced into the trees:

- **bootstrap**: toggles bootstrap resampling of the input dataset, in which each tree is trained on a sample drawn with replacement from the original dataset.
- **max_features**: when searching for the best split of any given node, consider only a random subset of **max_features** features.

Thus, these two parameters randomize the node splitting procedure.

In **RandomForestRegressor** the default and recommended value of
**max_features** equals the total number of features, which means that in the
example above randomization was introduced only through bootstrapping.
What happens if we turn it off?

```
ensemble = RandomForestRegressor(n_estimators=100, random_state=rnd,
                                 bootstrap=False)
```

As expected, the majority of trees are now identical (the small amount of leftover
randomness likely comes from random tie-breaking between several equally good
splits). Let's bring some randomization back by lowering **max_features** down
to **10** (the total number of features in this dataset is 13):

```
ensemble = RandomForestRegressor(n_estimators=100, random_state=rnd,
                                 bootstrap=False, max_features=10)
```

The trees in the forest are random again!

### Conclusions

Visual inspection of Random Forests reveals their underlying structure and complexity, and makes the effects of Random Forest hyperparameters easy to demonstrate.

The code for this post is hosted on GitHub: