22.2 Choose the best number of trees for a random forest
22: Automatic Parameter Tuning with Caret
Video duration: 3m
Another popular machine learning algorithm that requires tuning is the random forest, which is a model made up of many, many decision trees. Its tuning parameter is mtry: the number of randomly selected variables to try splitting on at each split. To do this, we will create a control called forest_controls, which gets trainControl with method equals "repeatedcv" and repeats equals two, and we're going to keep it simple and not do too many folds or too many repeats, just because this can be computationally intensive. We will also create a grid to search over: forest_grid gets data.frame with mtry equals one through four. Now we can go ahead and train the model. We will say forest1 gets train; the formula is Credit tilde CreditAmount plus Age plus CreditHistory plus Employment, data equals credit, method equals "rf" (which is caret's method name for random forest), trControl equals forest_controls, and tuneGrid equals forest_grid. You can run these lines; caret loads packages as necessary.

Now, it takes a bit of time to run, and it is possible to run caret in parallel. We're doing it single-threaded right now because this laptop only has two cores, so it wouldn't save us much time, but it is actually very simple to run in parallel. All you need to do is establish the parallel backend, run caret as you normally would, and then don't forget to shut down the parallel backend afterward. We can see it's done, so we will check out the model, and it found that the best value of mtry was three.

caret offers support for over 100 different models that need tuning, and if a model doesn't exist already, there is a framework to easily build your own training setup for whatever model you need. It is incredibly simple to let caret search multiple parameters of a model at once and find you the best.

A common problem with a number of machine learning models, particularly random forests, is interpreting what the variables mean. Luckily, there is a variable importance function, with one specially written in caret.
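The steps dictated above can be sketched as the following R code. The `credit` data frame and its column names (`Credit`, `CreditAmount`, `Age`, `CreditHistory`, `Employment`) are assumptions taken from the narration and may differ in your copy of the course data; the seed and fold count are illustrative choices, not from the video.

```r
library(caret)

# Repeated cross-validation, kept small because random forest
# tuning is computationally intensive (fold count is an assumption;
# the video only specifies repeats = 2)
forest_controls <- trainControl(method = "repeatedcv",
                                number = 5,
                                repeats = 2)

# Candidate values of mtry to search over
forest_grid <- data.frame(mtry = 1:4)

set.seed(42)  # illustrative: makes the resampling reproducible
forest1 <- train(Credit ~ CreditAmount + Age + CreditHistory + Employment,
                 data = credit,            # assumed course data frame
                 method = "rf",            # caret's random forest method
                 trControl = forest_controls,
                 tuneGrid = forest_grid)

forest1  # prints resampled accuracy per mtry and the value selected
```

To run this in parallel instead, register a backend before calling train() (for example with doParallel::registerDoParallel()), call train() as usual, and stop the cluster afterward; caret picks up the registered backend automatically.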
You'll say varImp of forest1, and it prints out a normalized importance score for each variable. Here it shows that credit amount, followed by age, are the two most important variables. If you wanted to visualize this, you could plot the variable importance of forest1. You'll see this is just an easy way to figure out which variables are the most important. It doesn't necessarily tell you the direction of each variable's impact, but it does tell you their relative importance in determining the model.
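In code, the variable importance step looks like this; `forest1` is the model trained above, and the 0-100 scaling reflects varImp's default behavior:

```r
# Variable importance from the tuned random forest
imp <- varImp(forest1)  # scores scaled to 0-100 by default

print(imp)  # text table of importance per predictor
plot(imp)   # dotplot, most important variable at the top
```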