17.6 Fit decision trees to make a random forest
Video duration: 6m
A favorite modern technique of data scientists is the decision tree. This is a type of regression or classification that handles nonlinear relationships, and it is particularly useful for making predictions. Let's take a quick look at the idea, but not for too long: a regression tree partitions the data space into small regions and uses the average response within each region as the predicted value. Don't worry too much about the details; the computer does all the work for you. Classification trees work the same way, except each region predicts the dominant class among the observations that fall inside it.

The package for this is rpart, so we run require(rpart). To see this, we'll play with some German credit score data that I have put up on my website at www.jaredlander.com/data/credit.rdata. The raw data set is incredibly messy, but here it's all cleaned up and stored nicely in an rdata file, so we load it and take a look. It has all sorts of information inside: the type of checking account, the duration, credit history, purpose, and so on. It's all about credit scores for people.

So let's build the tree. We'll call it creditTree and build it using rpart, which takes a formula interface just like everything else, nice and consistent. The formula is Credit modeled on CreditAmount plus Age plus CreditHistory plus Employment, with the data coming from credit. Note that Credit is a binary variable, either good or bad, so we are fitting a classification tree, not a regression tree. We run that, and now we can view it. The problem is that printing it spits out a wall of nodes that can be a little confusing, so I prefer visualizing it. Let's require rpart.plot, a package built specifically to visualize trees, and call rpart.plot with creditTree and extra equals four.

We run this and see a nice display. Going down to the left means the answer at a node was yes; going to the right means the answer was no. You start at the top. The first variable that gets split is the credit history. If it is not one of the three listed types, you go right, and there's a 60% chance of a bad credit score. If it is one of those types, you come down and split on the next variable, the credit amount. If it's less than 7760, there's a 75% chance of a good credit score. If it's greater than 7760, you come to another variable: are they over 30? If they're not over 30, given that credit amount and that credit history, there's a 76% chance of bad credit. If they are over 30, the tree asks one more question: is the value less than 38? If not, there's a 62% chance of bad credit; if so, a 79% chance of good credit. So decision trees are actually pretty simple to interpret when you see them like that, and they do make good predictions.

Let's clear out the console and talk about the decision tree's friend, the random forest. A random forest fits many decision trees, each one on a randomly chosen subset of the variables. For each tree, it randomly picks variables and makes the fit, does this again and again, and then averages the trees as an ensemble to get better predictions.
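As a minimal sketch, here are the decision-tree commands narrated above. It assumes the rdata file loads an object named credit and that the columns are named Credit, CreditAmount, Age, CreditHistory, and Employment; those names are inferred from the narration, so check names(credit) after loading.

```r
require(rpart)
require(rpart.plot)

# Load the cleaned German credit data from the instructor's site
# (assumes the file creates a data.frame called credit)
load(url("https://www.jaredlander.com/data/credit.rdata"))
head(credit)

# Fit a classification tree: Credit is a binary good/bad factor,
# so rpart fits a classification tree rather than a regression tree
creditTree <- rpart(Credit ~ CreditAmount + Age + CreditHistory + Employment,
                    data = credit)

# Visualize the tree; extra = 4 prints the class probabilities in each node
rpart.plot(creditTree, extra = 4)
```

In the plot, each split sends "yes" answers down the left branch and "no" answers down the right, which is what makes the tree easy to read off node by node.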
This has been shown to generate better predictions because averaging smooths out the deficiencies of the individual models. To do this, we'll load two packages: useful and randomForest. The best way to use randomForest is to provide it the response and a predictor matrix rather than a formula. The first thing we do, to save typing, is store the formula in a variable: creditFormula gets Credit modeled on CreditHistory plus Purpose plus Employment plus Duration plus Age plus CreditAmount. I'll break that over a few lines so we can see it, and we'll run that. Now, if we print creditFormula, we see we have a formula object.

We can then use it to build the x and y matrices. So creditX gets build.x with creditFormula and data equals credit. And Duration should be a capital D; notice that the error did not get caught when the formula was created, only when we tried to build the matrix, because a formula just stores whatever you type, right or wrong. Now we copy that line and use build.y to get the y side, and that should be a lowercase y, not a capital. Even I sometimes forget my own functions.

Now we can go ahead and build the random forest. It's as simple as that. We print creditForest, and here we have a quick summary of the random forest. It tells us this is a classification forest, that it grew 500 trees, and that at each split it tried four variables. It has an estimated error rate of 28%, not amazing but not bad. It also gives us the confusion matrix showing how well it did: it got it right 647 plus 72 times and got it wrong 53 plus 228 times, not too bad.

There's a reason decision trees and random forests are darlings of modern data scientists: they're fast, they're efficient, they work on all kinds of data, and they can even handle missing data. They are two algorithms that are really great at making predictions, which is a lot of what modern data scientists do.
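A sketch of the random forest steps above, assuming the same credit data.frame and column names as before; build.x() and build.y() are from the useful package, and the randomForest() call takes the x and y objects directly instead of a formula.

```r
require(useful)
require(randomForest)

# Store the formula in a variable to save typing later
creditFormula <- Credit ~ CreditHistory + Purpose + Employment +
    Duration + Age + CreditAmount

# Build the predictor matrix and the response from the formula;
# a typo in a variable name only surfaces here, not when the
# formula is created, because a formula just stores the expression
creditX <- build.x(creditFormula, data = credit)
creditY <- build.y(creditFormula, data = credit)

# Fit the forest; since Credit is a factor, this is a classification
# forest. Printing it shows the number of trees (500 by default), the
# variables tried at each split, the estimated error rate, and the
# confusion matrix
creditForest <- randomForest(x = creditX, y = creditY)
creditForest
```

Passing x and y directly, as the narration recommends, avoids rebuilding the design matrix inside randomForest and keeps the matrix construction step explicit.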