22.1 Establish optimal tree depth for rpart - Video Tutorials & Practice Problems
Video transcript
Many modern machine learning algorithms have a lot of tuning parameters, and the values of those parameters can make a big difference in the performance of the model. Finding the optimal values can take a lot of work. Fortunately, there's the caret package by Max Kuhn, which greatly simplifies this process. To illustrate, we will look at the credit data. We get that by loading it from the website: load(url("http://www.jaredlander.com/data/credit.rdata")). Looking at the head of this, we see all sorts of information about people seeking credit. We could just go ahead and fit this with a regular tree and let rpart do the work. So we do library(rpart), and since we will want to visualize it, library(rpart.plot). We then build up a formula, since we will be using it repeatedly: Credit ~ CreditAmount + Age + CreditHistory + Employment. We say mod1 <- rpart(treeFormula, data = credit). We fit that; it runs perfectly fast. Then rpart.plot(mod1, extra = 4) gives us a nice plot showing the decision tree. The thing is, we don't really know whether this tree is optimal. For instance, tree depth is an important tuning parameter for a decision tree: how far should you let the tree grow? One level, two, three, four? It's best to tune for that parameter. To do that, we load the caret package and create a new object: depth1 <- train(treeFormula, data = credit, method = "rpart2"). If you use method "rpart", caret optimizes the complexity parameter (cp); if you use "rpart2", it optimizes the tree depth. We run this and see that by default caret used the bootstrap to choose the max depth; it tested max depths of four, five, and nine, and found that four was the best. However, we want to use repeated cross-validation instead of the bootstrap, so we will set up some controls.
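The steps narrated so far can be sketched in R roughly as follows; the data URL, formula, and object names come from the transcript, and the printed depths you see may vary since the default bootstrap resamples randomly:

```r
library(rpart)       # recursive partitioning trees
library(rpart.plot)  # nicer tree plots

# Load the credit data from the website; this creates a `credit` data.frame
load(url("http://www.jaredlander.com/data/credit.rdata"))
head(credit)

# Build the formula once, since we will reuse it repeatedly
treeFormula <- Credit ~ CreditAmount + Age + CreditHistory + Employment

# Fit a tree with rpart's defaults and plot it
mod1 <- rpart(treeFormula, data = credit)
rpart.plot(mod1, extra = 4)

# Let caret tune the tree: method = "rpart2" tunes maxdepth,
# whereas method = "rpart" would tune the complexity parameter (cp).
# By default train() resamples with the bootstrap.
library(caret)
depth1 <- train(treeFormula, data = credit, method = "rpart2")
depth1
```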
We say treeControls <- trainControl(method = "repeatedcv", repeats = 3, number = 10); trainControl is a special caret function. We have it repeat three times with 10 folds each. You use repeated cross-validation because even a single round of cross-validation is subject to the randomness of how the folds are assigned. We then create depth2 <- train(treeFormula, data = credit, method = "rpart2", trControl = treeControls), with method "rpart2" since we're tuning for depth. Run this, and we see that caret now uses 10-fold cross-validation repeated three times. In this case it finds that a max depth of five is most appropriate, but by default it is still only checking max depths of four, five, and nine. We might want a more thorough search, so we specify the search grid: treeGrid <- data.frame(maxdepth = 1:15). The name of the column must be the name of the parameter we are optimizing, in this case maxdepth, and we let it run from one through 15. Caret essentially brute-forces its way through the parameter search space and picks the best value according to your metric, estimated here by cross-validation. So we say depth3 <- train(treeFormula, data = credit, method = "rpart2", trControl = treeControls, and on the next line, tuneGrid = treeGrid). We run all that and see that the optimal max depth is three. Again, this is based on predictive power, so setting the max depth of your tree to three will get you the best predictive results. Let's look at some of the slots of depth3. We can get modelInfo, which tells us information about the model: the functions used to fit and prune the tree and all sorts of other details, probably more than you want right now. We can also see what it chose as the best tuning parameter, which in this case is a max depth of three. If we really want to, we can access the finalModel.
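A sketch of the repeated cross-validation workflow just described; the first two lines repeat the data-loading and formula setup so the block runs on its own, and the reported best maxdepth may differ slightly from the video because the folds are drawn randomly:

```r
library(caret)

# Setup repeated from earlier so this block is self-contained
load(url("http://www.jaredlander.com/data/credit.rdata"))
treeFormula <- Credit ~ CreditAmount + Age + CreditHistory + Employment

# 10-fold cross-validation, repeated 3 times
treeControls <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

# Tune depth with repeated CV instead of the default bootstrap
depth2 <- train(treeFormula, data = credit,
                method = "rpart2", trControl = treeControls)

# Explicit search grid: the column name must match the tuning
# parameter for the method, here maxdepth, searched from 1 to 15
treeGrid <- data.frame(maxdepth = 1:15)

depth3 <- train(treeFormula, data = credit, method = "rpart2",
                trControl = treeControls,
                tuneGrid = treeGrid)

depth3$modelInfo  # metadata about the rpart2 method
depth3$bestTune   # the maxdepth that caret selected
```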
Now, accessing finalModel to manipulate it directly is discouraged, but if you just want to check it out and make a quick plot, you can. We do that by saying rpart.plot(depth3$finalModel, extra = 4), and it shows us what our optimal tree looks like. Caret greatly simplifies the process of finding optimal tuning parameters, as illustrated here with rpart.
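The closing step, plotting the tuned tree, assumes the depth3 object produced by the train() call above:

```r
library(rpart.plot)

# The fitted rpart object caret kept is stored in finalModel;
# plot it the same way as a tree fit directly with rpart()
rpart.plot(depth3$finalModel, extra = 4)
```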