Random forest is one of the most widely used machine learning algorithms in real production settings.
Random forest is one of the most popular algorithms for regression problems (i.e. predicting continuous outcomes) because of its simplicity and high accuracy. In this guide, we’ll give you a gentle introduction to random forest and the reasons behind its high popularity.
Let’s start with an actual problem. Imagine you want to buy real estate, and you want to figure out what makes a good deal so that you don’t get taken advantage of.
The obvious thing to do would be to look at the historic prices of houses sold in the area, then create some kind of decision criteria to summarize the average selling price for a given real-estate specification. You could use this decision chart to evaluate whether the listed price of the apartment you are considering is a bargain or not. It could look like this:
The chart represents a decision tree: a series of yes/no questions that lead you from a real-estate description (“3 bedrooms”) to its historic average price. You can use the decision tree to predict the expected price of a property, given its attributes.
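To make the idea concrete, here is a minimal sketch of a regression decision tree using scikit-learn. The features (bedrooms, square meters) and prices are made up for illustration:

```python
# A minimal sketch of decision tree regression with scikit-learn.
# The housing data below is hypothetical, purely for illustration.
from sklearn.tree import DecisionTreeRegressor

# Each row: [number of bedrooms, square meters]
X = [[1, 40], [2, 60], [3, 85], [3, 100], [4, 120]]
# Historic selling prices in dollars (made up)
y = [110_000, 150_000, 200_000, 230_000, 280_000]

tree = DecisionTreeRegressor(random_state=0)
tree.fit(X, y)

# Predict the expected price of a 3-bedroom, 90 m2 apartment
print(tree.predict([[3, 90]]))
```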
However, you could come up with a distinctly different decision tree structure:
This would also be a valid decision chart, but with totally different decision criteria. These decisions are just as well-founded and reveal information that was absent from the first decision tree.
The random forest regression algorithm takes advantage of the ‘wisdom of the crowds’. It builds multiple (but different) regression decision trees and makes them ‘vote’: each tree predicts the expected price of the real estate based on the decision criteria it picked. Random forest regression then averages all of the predictions to produce a single, more reliable estimate of the expected price.
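Here is a sketch of that averaging with scikit-learn, reusing the same toy housing data (values hypothetical). In scikit-learn, the forest’s regression prediction is exactly the average of its individual trees’ predictions, which we can verify:

```python
# A minimal sketch of random forest 'voting' with scikit-learn.
# The housing data below is made up for illustration.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = [[1, 40], [2, 60], [3, 85], [3, 100], [4, 120]]  # [bedrooms, square meters]
y = [110_000, 150_000, 200_000, 230_000, 280_000]    # historic prices (hypothetical)

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)

# The forest's prediction for a 3-bedroom, 90 m2 apartment ...
print(forest.predict([[3, 90]]))

# ... is the average of the 100 individual trees' predictions.
per_tree = [tree.predict([[3, 90]])[0] for tree in forest.estimators_]
print(np.mean(per_tree))
```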
Random forest regression is used to solve a variety of business problems where a company needs to predict a continuous value, such as property prices, future sales, or expected demand.
Random forest regression is extremely useful in answering interesting and valuable business questions, but there are additional reasons why it is one of the most used machine learning algorithms.
Random forest regression is a popular algorithm due to its many benefits in production settings: it achieves high accuracy with little hyperparameter tuning, is robust to outliers and noisy data, requires little data preparation (no feature scaling is needed), and provides feature importance estimates out of the box.
Random forest is both a supervised learning algorithm and an ensemble algorithm.
It is supervised in the sense that during training, it learns the mappings between inputs and outputs. For example, an input feature (or independent variable) in the training dataset would specify that an apartment has “3 bedrooms” (feature: number of bedrooms) and this maps to the output feature (or target) that the apartment will be sold for “$200,000” (target: price sold).
Ensemble algorithms combine multiple machine learning models in order to make more accurate predictions than any underlying model could on its own. In the case of random forest, it ensembles multiple decision trees into its final decision.
Random forest can be used for both regression tasks (predicting continuous outputs, such as price) and classification tasks (predicting categorical or discrete outputs). Here, we will take a deeper look at using random forest for regression predictions.
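In scikit-learn, the two task types map onto two estimators with the same ensemble mechanics; only the target type differs (a sketch, with illustrative comments):

```python
# Same ensemble idea, two task types (scikit-learn).
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

regressor = RandomForestRegressor()    # continuous target, e.g. selling price
classifier = RandomForestClassifier()  # categorical target, e.g. 'bargain' vs 'overpriced'
```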
The random forest algorithm follows a two-step process:
1. Build multiple independent regression decision trees, each from a random subset of the training data.
2. Aggregate the predictions of all trees; for regression, the final prediction is their average.
Let’s delve deeper into how random forest regression builds regression trees.
Regression using decision trees follows the same pattern as any decision tree algorithm:
1. Attribute selection. The decision tree regression algorithm looks at all attributes and their values to determine which attribute value would lead to the ‘best split’. For regression problems, the algorithm uses MSE (mean squared error) as its objective or cost function, which needs to be minimized. This is equivalent to using variance reduction as the feature selection criterion. Note: scikit-learn also offers an MAE (mean absolute error) implementation.
2. Once it finds the best split point candidate, it splits the dataset at that value (this first split forms the root node) and repeats the attribute-selection process on each resulting subset.
3. The algorithm continues iteratively until either:
a) The tree has grown terminal (leaf) nodes all the way down to the individual samples, because no stopping criteria were set.
b) We reached some stopping criteria. For example, we might have set a maximum depth, which only allows a certain number of splits from the root node to the terminal nodes. Or we might have set a minimum number of samples in each terminal node to prevent them from splitting beyond a certain point. (Both are sketched in the code below.)
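These knobs correspond directly to DecisionTreeRegressor parameters in recent scikit-learn versions; the values below are illustrative, not recommendations:

```python
# Sketch of the split criterion and stopping criteria described above
# (scikit-learn >= 1.0 parameter names; values are illustrative).
from sklearn.tree import DecisionTreeRegressor

tree = DecisionTreeRegressor(
    criterion="squared_error",  # minimize MSE at each split ("absolute_error" for MAE)
    max_depth=4,                # at most 4 splits from the root to any terminal node
    min_samples_leaf=5,         # never create a leaf holding fewer than 5 samples
)
```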
So, why is a single tree not enough? Why do we need a forest of trees?
Decision trees have a couple of problems:
1. They tend to overfit: a fully grown tree can memorize the training data and generalize poorly to new samples.
2. They are unstable: small changes in the training data can produce a completely different tree and very different predictions.
The ensemble of decision trees introduces randomness, which mitigates the issues above. So how exactly does random forest introduce randomness? And how does this help make better predictions?
The ensemble of decision trees has high accuracy because it uses randomness on two levels:
1. Each tree is trained on a bootstrapped sample of the training data, i.e. rows drawn at random with replacement.
2. At each split, each tree considers only a random subset of the features instead of all of them.
Ensembling decision trees allows us to compensate for the weaknesses of each individual tree.
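Both levels of randomness are exposed as RandomForestRegressor parameters (a sketch; the values shown are illustrative):

```python
# The two levels of randomness as scikit-learn parameters (illustrative values).
from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor(
    bootstrap=True,       # level 1: each tree trains on a bootstrapped sample of the rows
    max_features="sqrt",  # level 2: each split considers only sqrt(n_features) random features
    n_estimators=100,
    random_state=0,
)
```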
The base model can be improved in a couple of ways by tuning the parameters of the random forest regressor, for example:
1. n_estimators: the number of trees in the forest.
2. max_depth: the maximum number of splits from the root node to a terminal node.
3. min_samples_leaf: the minimum number of samples required in a terminal node.
4. max_features: the number of features considered at each split.
A tuning sketch follows below.
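One common way to tune these is an exhaustive grid search with cross-validation. The sketch below uses scikit-learn’s GridSearchCV with an illustrative grid; X_train and y_train are placeholders for your own training data:

```python
# A sketch of hyperparameter tuning with GridSearchCV (grid values are illustrative).
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 5],
    "max_features": [1.0, "sqrt"],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",  # matches the MSE objective used by the trees
    cv=5,
)
# search.fit(X_train, y_train)  # X_train / y_train: your own training data
# print(search.best_params_)
```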
The way in which you use random forest regression in practice depends on how much you know about the entire data science process.
We recommend that beginners start by modeling data on datasets that have already been collected and cleaned, while experienced data scientists can scale their operations by choosing the right software for the task at hand.
There are over 84 public datasets you can use to try out random forest regression in practice.
Data scientists spend more than 80% of their time on data collection and cleaning. If you want to speed up the entire data pipeline, use software that automates tasks to give you more time for data modeling.
Keboola offers a platform for data scientists who want to build their own machine learning models. It comes with one-click deployed Jupyter Notebooks, through which all of the modeling can be done using Julia, R, or Python.
Deep-dive into the data science process with this Jupyter Notebook.
Want to take it a step further? Keboola can assist you with operationalizing your entire data operations pipeline.
Being a data-centric platform, Keboola also allows you to build your ETL pipelines and orchestrate tasks to get your data ready for machine learning algorithms. Deploy multiple models with different algorithms to version your work and compare which ones perform best. Start building models today with our free trial.