Random Forest Algorithm in Machine Learning

Aditya Kumar Pandey
Dec 11, 2020 · 3 min read


Machine learning is the practice of building models using different techniques and algorithms. I am not going to discuss machine learning in detail here; instead, we will learn about the random forest algorithm. Before we start, let us take a look at the table of contents.

  • Introduction
  • How does the random forest algorithm work?
  • Assumptions of random forest
  • Advantages and disadvantages

Introduction

Random forest is a supervised machine learning algorithm. Before we discuss random forest, we will first try to understand bagging and the ensemble method.

The word ensemble means combining multiple models: it is the process of combining more than one model to predict the result. Ensemble techniques are classified into two types.

  1. Bagging
  2. Boosting

Here we will discuss only bagging. Bagging is a technique in which different samples are drawn from the dataset and each sample is used to train a separate model. These models are then combined to give the final result. Bagging is also called bootstrap aggregation. Random forest is based on the bagging technique.
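As a concrete illustration, here is a minimal sketch of bagging using scikit-learn's BaggingClassifier, assuming scikit-learn is installed; the toy dataset and parameter values are placeholders for illustration, not part of the original article.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

# A toy dataset, just for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Bagging: each of the 10 models is trained on a bootstrap sample
# (rows drawn with replacement) and their votes are aggregated.
# By default, each base model is a decision tree.
bagging = BaggingClassifier(n_estimators=10, random_state=42)
bagging.fit(X_train, y_train)
print("Bagging test accuracy:", bagging.score(X_test, y_test))
```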

Random forest works on the concept of bagging. It draws different samples of data from the dataset and builds a separate decision tree on each sample. After that, it takes the average (in the case of regression) or the majority vote (in the case of classification) of all the models and gives the result.

NOTE: Random forest can be used for both regression and classification problems.
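If you want to try this in code, a minimal usage sketch with scikit-learn's ready-made classes (again assuming scikit-learn is available, with arbitrary toy data) could look like this:

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification: the forest takes a majority vote over its trees.
Xc, yc = make_classification(n_samples=500, n_features=10, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(Xc, yc)
print(clf.predict(Xc[:5]))

# Regression: the forest averages the predictions of its trees.
Xr, yr = make_regression(n_samples=500, n_features=10, random_state=0)
reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(Xr, yr)
print(reg.predict(Xr[:5]))
```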

Consider the figure below; we will use it to understand how the random forest algorithm works step by step.

  1. Let us consider that we have a dataset D with m columns and d records.
  2. From the dataset D, we pick a sample of rows (row sampling with replacement) and a subset of columns (column sampling). We give this sampled data D’ to decision tree 1, i.e. D1, and the decision tree is trained on D’.
  3. We then take another sample of data from the dataset D, again with row sampling and column sampling, and pass it to our decision tree D2.
  4. We repeat the same step for our decision tree models D3 and D4. In this way, we train several different decision tree models, each of which makes its own prediction.
  5. The forest then checks the results of all the decision tree models and gives the final result based on the majority. In the above figure, most of the decision trees predict 1, so based on the majority vote the final result is 1.
  6. In the case of a regression problem, it calculates the average of all the decision trees’ outputs and gives that as the result. (A from-scratch sketch of these steps follows after the note below.)

NOTE: Random forest uses decision trees as its base learners.
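To make the steps above concrete, here is a from-scratch sketch in Python (assuming NumPy and scikit-learn are installed; names such as train_random_forest are made up for illustration). Note that this sketch samples columns once per tree, exactly as described in the steps, whereas library implementations such as scikit-learn sample features at every split of each tree.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def train_random_forest(X, y, n_trees=4, n_features=None, rng=None):
    """Train a toy forest: row sampling with replacement plus
    column sampling once per tree, as in the steps above."""
    rng = np.random.default_rng(rng)
    n_rows, n_cols = X.shape
    n_features = n_features or max(1, int(np.sqrt(n_cols)))
    forest = []
    for _ in range(n_trees):
        rows = rng.integers(0, n_rows, size=n_rows)                 # row sampling with replacement
        cols = rng.choice(n_cols, size=n_features, replace=False)   # column sampling
        tree = DecisionTreeClassifier(random_state=0)
        tree.fit(X[np.ix_(rows, cols)], y[rows])
        forest.append((tree, cols))
    return forest


def predict_random_forest(forest, X):
    """Majority vote over the trees' predictions (classification)."""
    votes = np.array([tree.predict(X[:, cols]) for tree, cols in forest])
    # For each sample, return the class predicted by the most trees.
    return np.apply_along_axis(
        lambda v: np.bincount(v.astype(int)).argmax(), axis=0, arr=votes)


if __name__ == "__main__":
    from sklearn.datasets import make_classification
    X, y = make_classification(n_samples=300, n_features=8, random_state=1)
    forest = train_random_forest(X, y, n_trees=4, rng=1)
    print(predict_random_forest(forest, X[:10]))
```

For a regression problem, the same structure applies, except that the final prediction is the average of the trees' outputs instead of a majority vote.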

Overfitting and Underfitting

A decision tree has low bias and high variance. Low bias means that when we grow the decision tree to its complete depth, it fits the training dataset very closely, so the training error is very low.

High variance means that whenever we give the tree new test data, it tends to produce a high error.

Random forest converts this high variance into low variance, which overcomes the overfitting problem. A random forest uses multiple decision trees; each individual tree has high variance, but when we combine their predictions, the high variance is converted into low variance.
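One way to see this effect, assuming scikit-learn is available, is to compare a single decision tree with a random forest using cross-validation; the exact numbers depend on the data, but the forest typically scores higher on held-out folds:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=0)

# A single deep tree: low bias on the training data, high variance on test data.
tree_score = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5).mean()

# Averaging many such trees reduces the variance.
forest_score = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=0),
                               X, y, cv=5).mean()

print(f"Decision tree CV accuracy:  {tree_score:.3f}")
print(f"Random forest CV accuracy: {forest_score:.3f}")
```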

Advantages of Random Forest

  • It can be used for classification as well as regression problems.
  • It reduces the overfitting problem.
  • It works well with categorical as well as continuous variables.
  • Random forest can handle missing values.
  • No feature scaling is required.
  • It requires relatively little parameter tuning.
  • It is robust to outliers.

Disadvantages

It is biased toward features that have many categories.

Assumptions

There are no assumptions for the random forest algorithm.


Clap if you liked the article.
