In Data Science, one of the widely used algorithms to solve problems is Random Forest, which derives from another algorithm called Decision Tree. A decision tree is utilized for both classification and regression problems, a supervised machine learning algorithm. A decision tree is a series of sequential decisions assembled to outreach a specific result.
A dataset should be available for training an algorithm. From this data, the algorithm is going to learn. A working model capable of identifying objects can be created by providing essential aspects such as size, shape, height, weight, colour, etc. For example, if we want to differentiate between four fruits, say apples, tomatoes, oranges & grapes, how would the Decision Tree solve this task?
To have a better idea, if we gave this work of segregation of fruits to a kid based on the fruit's shape, he would separate apples, tomatoes, and oranges in one group based on round shape and grapes into another group.
However, apples and tomatoes seem to be more alike at first sight. We would have to consider some other characteristics to complete our teaching mission to identify those fruits precisely.
The right criteria to differentiate apples from oranges could be their color. Assuming apples are usually red and oranges are orange, that would be a good second parameter to correctly identify the kid's fruits. To differentiate apples and tomatoes, the kid may compare size.
The Decision Tree algorithm follows the same procedure. Among all the parameters available for describing fruits, it will swoop into the data and choose the most important parameter that best separates one class from the others and follows the same logic for the remaining parameters until it finds a way to better decide fruit it is.
Due to this, Decision Trees algorithms can also be used as a tool for parameter selection, which would identify characteristics such as shape, size colour, etc., with more importance to our identification and excluding the ones that contribute little or none to our primary goal. For example, if we had the information from where we bought these fruits, it would be irrelevant to classify which fruit. These Decision Trees structures are quite good at identifying it.
What if we could put together different trees that detect those fruits in several different angles to understand better how they distinguish from each other? That's exactly the object of a Random Forest algorithm, which generates a series of decision trees, each one looking for slightly different formation of the same examples of fruits and also looking at them through different angle: for example, one decision tree is assemble using color as the first parameter, while one other tree uses shape or weight as its parameter.
It would only make sense if we ensure that one tree is not equal to the other. Otherwise, we wouldn't have the help from the different 'points of view .' The computational power required to create a set of trees would not be accounted for. To ensure a set of distinct trees, some parameters can be controlled.
Firstly, the algorithm will select a different set of samples for each tree. Suppose we have the dataset for 500 fruits. In that case, the algorithm will randomly select 500 examples for each one of the trees, i.e. some examples may happen to be selected more than once in the same set while other examples may be absent. Also, we can administer the parameters like the number of minimum features, maximum features, etc. This parameter can set a number, let's say 3. The algorithm will then select three random characteristics for each tree and decide which of them is the most significant to separate the different fruits.
To ensure that all the decision trees are going to be different, these random factors provide a virtual guarantee. One more important parameter is n_estimators, which allows the user to select the number of decision trees to be created.
When compared to a single decision tree, the series of decision trees working together will perform better, representing the power of using the Random Forest algorithm.
Based on characteristics, if we want our model to predict which fruit it is, each of the trees will make a prediction. This prediction will be associated with one probability for each class. For example, one tree could conclude that the characteristics given are 70% likely to be grapes based on shape, 15% to be an orange based on the colour a, and 15% to be an apple based on shape and weight. In contrast, the other trees would have different probabilities associated.
Finally, a weighted vote will be calculated over these probabilities coming to a conclusion. That would be the primary procedure for a classification job like our fruit example.
In regression tasks, the Random forests can also be used if we want to predict a serial number or measure, like the price of a property or the number of products that are going to be sold in some specific period etc. The random forest can be used. Even though the procedure, in this case, is the same, the final prediction would be given by computing the mean of the individual tree predictions and not through a weighted vote, like in classification tasks.
These models learn precisely what we tell them to learn, and the quality of their predictions depends on the quality of the data we give them.
This blog is written to understand better how one of the most popular machine learning algorithms works behind the scenes.