16.7 Assess model quality with residuals
Video duration:
5m
Video transcript
After our model is built, we need to assess its quality. Traditionally this has been done with residuals. To illustrate this, we'll go back to the housing data and fit a regression of the value per square foot on units plus square feet plus boro. So we'll fit the model: house1 gets lm, value per square foot regressed on units plus square feet plus boro, data equals housing. We can look at a summary, which gives us all the usual information, or we can look at its coefplot. This shows us the effects of the coefficients, and once again we see that Manhattan has an outsized effect on the value per square foot.

There is all sorts of residual diagnostic information available here, and an easy way to grab it is the fortify function from ggplot2. We'll take a look at that by saying head of fortify house1, and that gives back the data that was in the model plus other statistics such as Cook's distance, the fitted values, the residuals, and the standardized residuals. So let's start plotting that to assess the model. We'll say h1, just to save the plot for later, gets ggplot, and we're going to say aes, x equals .fitted and y equals .resid, and the data is house1. Now this is important to note: we're passing in the regression model, not the fortified data set we made. When you pass a regression model into ggplot as the data, it will automatically fortify it, creating this data frame, and then you can go ahead and use the created columns. So let's add on to this: points, plus a nice horizontal line at zero. This lets us get a sense of how dispersed our data is. We'll put a smoothing curve on it so you can see the trend. And we'll give it nice labels, because it's always good to polish your plots. We run this, and it didn't get printed because we saved it as a variable; to see it we type h1, and we get a little message that geom_smooth automatically chose a smoothing method.
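The commands narrated above could look like the following sketch. Since the housing data itself isn't shown in the transcript, a small simulated stand-in is used so the code runs on its own; the column names (ValuePerSqFt, Units, SqFt, Boro) are assumptions based on the narration.

```r
library(ggplot2)

# Hypothetical stand-in for the housing data used in the video;
# the column names are assumed from the narration, the values are simulated.
set.seed(1)
housing <- data.frame(
  ValuePerSqFt = rlnorm(300, meanlog = 4.7, sdlog = 0.4),
  Units        = rpois(300, 20),
  SqFt         = rlnorm(300, meanlog = 10, sdlog = 0.5),
  Boro         = sample(c("Manhattan", "Brooklyn", "Queens",
                          "Bronx", "Staten Island"), 300, replace = TRUE)
)

# Fit the regression of value per square foot on units, square feet, and boro
house1 <- lm(ValuePerSqFt ~ Units + SqFt + Boro, data = housing)
summary(house1)
# coefplot::coefplot(house1)  # coefficient plot, if coefplot is installed

# fortify() returns the model data plus diagnostic columns:
# .hat, .sigma, .cooksd, .fitted, .resid, .stdresid
head(fortify(house1))

# Residuals versus fitted values; passing the model itself as the data
# makes ggplot fortify it automatically
h1 <- ggplot(aes(x = .fitted, y = .resid), data = house1) +
  geom_point() +
  geom_hline(yintercept = 0) +
  geom_smooth(se = FALSE) +
  labs(x = "Fitted Values", y = "Residuals")
h1
```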
When we zoom in on this plot, we see data that does not look random. In a residual plot of fitted values versus residuals, you generally want a random dispersion of the data. This can be a bit disturbing; it might indicate we don't have a good fit for our model. But let's investigate a little more before we jump to conclusions. Let's say h1 plus geom_point, adding another point layer, and we'll give it an aesthetic where we say color equals boro. Doing this we see that the pattern is due to the boroughs. Once again Manhattan is outsized on its own, and the other boroughs are clustered together. So maybe we aren't as concerned as we were.

The next plot used to assess model quality is a Q-Q plot. That can be as simple as doing plot of house1 with which equals 2; this will use base graphics to make a Q-Q plot. A Q-Q plot shows the quantiles of your standardized residuals against the theoretical quantiles from a normal distribution. If this were a good fit, all the points would fall along the straight dotted line. Since they don't, especially around the tails, it's indicating that we haven't quite made a good fit for our model. Now, being a fan of ggplot over base graphics, I want to recreate this plot using ggplot. So we say ggplot house1, feeding it just the model; for the aesthetic we say sample equals .stdresid, plus stat_qq, plus geom_abline. Running this we get an almost identical plot, but much better looking. It's purely a matter of personal preference which one you use.

Back to our model: it's usually a good idea to plot a histogram of the residuals, because when you're doing a linear regression, you generally want the residuals to be normally distributed. So we do ggplot house1, aes x equals .resid, plus geom_histogram. We look at this plot: it's close to being normally distributed, but it's not quite there. All of this indicates that we still have more work to do on our model.
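The diagnostic plots described here could be sketched as follows. As before, the housing data frame is a simulated stand-in with assumed column names, so the block is self-contained.

```r
library(ggplot2)

# Hypothetical stand-in for the housing data (column names assumed
# from the narration, values simulated).
set.seed(1)
housing <- data.frame(
  ValuePerSqFt = rlnorm(300, meanlog = 4.7, sdlog = 0.4),
  Units        = rpois(300, 20),
  SqFt         = rlnorm(300, meanlog = 10, sdlog = 0.5),
  Boro         = sample(c("Manhattan", "Brooklyn", "Queens",
                          "Bronx", "Staten Island"), 300, replace = TRUE)
)
house1 <- lm(ValuePerSqFt ~ Units + SqFt + Boro, data = housing)

# Color the residual plot by borough to see whether the pattern
# in the residuals is driven by the boroughs
ggplot(house1, aes(x = .fitted, y = .resid)) +
  geom_point(aes(color = Boro))

# Base-graphics Q-Q plot of the standardized residuals
plot(house1, which = 2)

# The same Q-Q plot in ggplot2: sample the standardized residuals,
# add the quantile points and a 45-degree reference line
ggplot(house1, aes(sample = .stdresid)) +
  stat_qq() +
  geom_abline()

# Histogram of the residuals; for a good linear model these
# should be roughly normally distributed
ggplot(house1, aes(x = .resid)) +
  geom_histogram(bins = 30)
```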
For years, diagnostics based on residuals were the primary way to assess model quality. Nowadays there are other methods, such as cross-validation, which are often preferable, but you should still know how to judge a model's quality based on its residuals.
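As a pointer toward the cross-validation idea mentioned here, a minimal k-fold loop in base R might look like the following. This is a hypothetical sketch on simulated data, not code from the video.

```r
# Minimal 5-fold cross-validation sketch (hypothetical example):
# hold out one fold at a time, fit on the rest, and measure
# out-of-sample error on the held-out fold.
set.seed(1)
df <- data.frame(x = rnorm(100))
df$y <- 2 * df$x + rnorm(100)

k <- 5
folds <- sample(rep(1:k, length.out = nrow(df)))

cv_mse <- sapply(1:k, function(i) {
  fit  <- lm(y ~ x, data = df[folds != i, ])
  pred <- predict(fit, newdata = df[folds == i, ])
  mean((df$y[folds == i] - pred)^2)   # mean squared error on fold i
})

mean(cv_mse)  # average out-of-sample mean squared error
```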