Random forests and their…not so random decisions

Sheryl Li
4 min read · Nov 4, 2021


Today, random forests are integrated into many applications: banking, healthcare and medicine, the stock market, and more. 😄

Basically, random forest is a flexible machine learning algorithm that’s easy to use and produces awesome results (even without hyperparameter tuning). How awesome is that? 😝

image sources linked at bottom ❤

So, let’s back up. There are different types of machine learning: supervised learning, unsupervised learning, and reinforcement learning. Within supervised learning, there’s classification and regression. 🔍

Classification is the kind of problem where the outputs are discrete, like “yes” or “no”, “true” or “false”, “0” or “1”. For classification, there are various algorithms, like KNN, Naive Bayes, and random forest (which is built out of decision trees).

Okay, what’s a random forest? 🌲

(There’s a few tree emojis, Christmas trees are cute 🎄)

A random forest/random decision forest is like making one big decision based on lots of smaller decisions. It builds multiple decision trees during the training phase, and the decision that the majority of the trees agree on is chosen by the random forest as the final decision. (imagine…actually making good decisions 😆) And the “random” part? Each tree is trained on a random sample of the data and considers a random subset of the features at each split, which keeps the trees from all making the same mistakes.
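
Here’s a minimal sketch of that idea in Python, using scikit-learn’s RandomForestClassifier on the classic iris dataset (the same dataset the tutorial linked at the bottom uses):

```python
# A minimal sketch: a random forest is many decision trees voting together.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# n_estimators = the number of decision trees that get to vote
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

# Each prediction is the majority vote across all 100 trees
print(forest.score(X_test, y_test))
```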

But…why a random forest?

Aside from being simple and easy to use…

Less overfitting- by using multiple trees, you reduce the risk of overfitting that a single deep tree runs, and because the trees are built independently of each other, training parallelizes nicely.

(what even is overfitting?

Well, the goal of a machine learning model is to use the training data to learn generalizations that apply to any data in the problem domain, so it can make predictions on unseen data in the future. When a model fits the training data too well, it picks up on all the weird parts 😐: instead of learning the overall pattern, it memorizes the noise. Those quirks get learned as concepts by the model, and since they don’t carry over to new data, they wreck its ability to generalize.)
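
Here’s one rough way to see overfitting for yourself (a sketch on synthetic data; the exact accuracies will vary): a single fully-grown decision tree usually aces the training set but stumbles on the test set, while a forest of trees holds up better.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some noisy, uninformative features to "memorize"
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One fully-grown tree tends to memorize the training set...
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("tree:  ", tree.score(X_train, y_train), tree.score(X_test, y_test))

# ...while averaging 100 trees usually generalizes better to unseen data
forest = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("forest:", forest.score(X_train, y_train), forest.score(X_test, y_test))
```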

High accuracy- it runs efficiently on large datasets and produces highly accurate predictions.

Estimates missing data- with how messy data is these days, random forests come in clutch 🙏. A random forest can estimate missing values and maintain good accuracy even when a large proportion of the data is missing.
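
(One practical note: depending on the library and version, the forest may not accept missing values directly, so a common workaround is to fill them in yourself first. Here’s a sketch with scikit-learn, where the mean-imputation step is my own choice for illustration, not something built into the forest:)

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Toy features with holes in them (np.nan marks the missing entries)
X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0], [6.0, np.nan]])
y = np.array([0, 0, 1, 1])

# Fill each hole with the column mean, then hand the data to the forest
model = make_pipeline(SimpleImputer(strategy="mean"),
                      RandomForestClassifier(n_estimators=50, random_state=0))
model.fit(X, y)
print(model.predict([[5.0, np.nan]]))  # still works with a missing feature
```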

Yeah, random forests are full of decisions. But what’s a decision tree?

A decision tree is, well, a diagram that looks like a tree 🌴. It’s used to determine a course of action, and each branch of the tree represents a possible decision, occurrence, or reaction.

It’s a type of supervised machine learning (where the training data pairs each input with the output it corresponds to), and the data is split according to certain parameters. Like a normal tree, it has nodes and leaves 🍃- the leaves are the decisions/final outcomes, and the nodes are where the data gets split.
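
In code, a (hand-made, totally hypothetical) decision tree is just nested if-statements: each if is a node where the data splits, and each return is a leaf.

```python
# A hand-rolled "decision tree" for fruit; the features are made up for
# illustration. Each if-statement is a node, each return is a leaf.
def classify_fruit(diameter_cm, color):
    if diameter_cm >= 10:          # node: split on size first
        return "watermelon"        # leaf
    if color == "yellow":          # node: split on color
        return "banana"            # leaf
    if color == "red":
        return "apple"
    return "grape"                 # leaf: everything else

print(classify_fruit(3, "red"))     # apple
print(classify_fruit(20, "green"))  # watermelon
```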

Everything a decision tree decides, it decides based on entropy. Entropy is the measure of randomness or unpredictability in the dataset: a jumbled set is high-entropy, a pure set is low-entropy.
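
In code, that’s just Shannon’s formula, -Σ p·log₂(p), summed over the proportion p of each class (a quick sketch):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy: sum of -p * log2(p) over each class proportion p."""
    total = len(labels)
    return sum(-(count / total) * log2(count / total)
               for count in Counter(labels).values())

print(entropy(["apple"] * 8))                              # 0.0, perfectly predictable
print(entropy(["apple", "banana", "grape", "melon"] * 2))  # 2.0, a jumbled bowl
```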

Let’s look at an example:

made with ❤ with Miro :)

Say that we want to classify the different types of fruits in the bowl based on the different features.

In the initial basket, the entropy is high: there are many different kinds of fruit jumbled together, so there’s no way you’d predict a random pick accurately. (btw, the root node is where all the data starts out and the first decision/split happens)

When will I stop using Miro? never

We have to split the data so that the information gain is the highest.

Information gain is how much the set decreases in entropy after splitting: you go from one set with high entropy ➡️ to two subsets with lower entropy.

Looking at the training dataset, we try splitting the data on each candidate condition and check the gain we get out of each one. The condition with the highest gain is used for the first split.
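
Reusing the entropy function from above, information gain is just “entropy before the split, minus the weighted entropy after” (again, a sketch):

```python
def information_gain(parent, left, right):
    """Entropy of the parent set minus the weighted entropy of the two splits."""
    n = len(parent)
    after = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - after

bowl = ["apple", "apple", "banana", "banana"]
# A split that separates the fruits perfectly gains a full bit:
print(information_gain(bowl, ["apple", "apple"], ["banana", "banana"]))  # 1.0
# A useless split leaves the mix exactly as jumbled as before:
print(information_gain(bowl, ["apple", "banana"], ["apple", "banana"]))  # 0.0
```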

A random forest, then, is a bunch of these decision trees. Let’s say you want to classify a fruit that’s missing some information (in this case, the color). Running it through the decision trees, even though not every tree comes to the same decision, the random forest is still able to make the right final decision because of the majority. 😎
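
The final vote itself is tiny (the per-tree votes below are made up, just to show the counting step):

```python
from collections import Counter

# Hypothetical calls from six trees on the mystery fruit with no color info
tree_votes = ["orange", "orange", "apple", "orange", "cherry", "orange"]

winner, count = Counter(tree_votes).most_common(1)[0]
print(f"{winner} ({count} of {len(tree_votes)} trees)")  # orange (4 of 6 trees)
```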

Miro, sponsor me?

Resources:

Images from:

Random Forest Algorithm- Random Forest Explained (if you want, follow along with an implementation of random forest using IRIS flower analysis)

And of course, Miro for the rest of the images
