16.3 Fit multiple regression models - Video Tutorials & Practice Problems
Video duration:
19m
Video transcript
Now that we have explored our data and gotten a sense of what we want to analyze, we need to learn a little bit about multiple regression. The math is fairly heavy, so I won't delve into it too much, but I'll give you a sense of it. The basic idea is the same as simple linear regression, where you fit a straight line through a set of points. Now, instead of a straight line, you're fitting a multidimensional hyperplane, but don't worry, it's still essentially a straight line. The math looks very similar to before: you have your response variable y, your x variables, and your coefficients, which are multiplied against the x variables, plus a little bit of random noise, so y = Xβ + ε. The epsilon represents the errors, because there's always randomness. Think about fathers and sons and compare their heights: a tall father can have a shorter son, it happens. In this case, it's all about matrix algebra. We can consider y to be a vector of responses; if we think about this in terms of a spreadsheet or a data frame, it's an entire column. The X variable is a matrix. There's one column per predictor you're using, so if there are five predictors there are five columns, plus an extra column that's all ones, for the intercept. And there are as many rows as there are observations in the data. Not too complicated. The betas are the coefficients. In simple regression we had a and b; now we just have betas. The first beta is the intercept, and the other betas go with the variables. It looks a little intimidating, but truth be told, the math is actually easier than it is for simple regression. And the whole goal is to find betas that explain the data. So, with the formula y = Xβ + ε, it's possible to solve for beta.
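The X matrix described above can be inspected directly in R. As a small sketch, using the built-in mtcars data set as a stand-in (the housing data is loaded later in the lesson), model.matrix shows the design matrix lm works with, including the leading column of ones for the intercept:

```r
# Build the design matrix X for a two-predictor model.
# mtcars is a built-in data set, used here purely as a stand-in.
X <- model.matrix(mpg ~ wt + hp, data = mtcars)

head(X, 3)
# The first column, "(Intercept)", is all ones; the remaining
# columns hold one predictor each, one row per observation.
dim(X)  # 32 rows (observations) by 3 columns (intercept + 2 predictors)
```

The same call with the housing formula would show the housing design matrix.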
We do so using least squares regression and get this formula: β̂ = (XᵀX)⁻¹Xᵀy, that is, beta hat equals X transpose X, inverse, times X transpose y. Don't worry about that equation; R does it for you automatically. So let's go back to R and do some of this. We've already explored the data in a previous session: we downloaded it from my website, www.jaredlander.com/data/housing.csv, cleaned it up, gave it better names, and took out some bad rows of data. If you haven't done that, you can watch the previous lesson and follow along. Now that we have the data, let's take a look at it. We have a number of variables here; what we're interested in doing is modeling value per square foot on a number of other variables. Let's call this model house1, because we'll fit repeated models, and we'll use the lm function. Again, that stands for linear model. We're modeling value per square foot on these other predictors: units, because maybe the number of units in a building affects it; the square footage of the building, because how big the building is could have an effect; and Boro, which borough of New York City we're looking at. And we're using the housing data. We run this and it's already done. Let's check out the summary of this model. We get a lot of information here. Scrolling up, we see the formula we used to fit it and some diagnostics on the residuals. The coefficients are what we really want. They show the estimated effect of each of these variables on value per square foot, along with the standard errors, the level of uncertainty. It also gives you t-values and p-values, which are supposed to indicate what is significant and what is not. Now, this is where we can start a little bit of a debate. Bayesian statisticians point out that the significance threshold for p-values was chosen somewhat arbitrarily by R. A. Fisher about a hundred years ago, and ask why we cling to it with such importance.
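As a quick sanity check of that formula, here is a sketch on the built-in mtcars data (a stand-in for the housing data) showing that the textbook least squares solution matches what lm returns:

```r
# The least squares solution beta-hat = (X'X)^{-1} X'y, computed by
# hand, matches what lm() returns. mtcars is a stand-in data set.
X <- model.matrix(mpg ~ wt + hp, data = mtcars)
y <- mtcars$mpg

beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y  # the textbook formula
fit      <- lm(mpg ~ wt + hp, data = mtcars)  # what you actually use

# The two agree to numerical precision
cbind(by_hand = beta_hat, by_lm = coef(fit))
all.equal(as.numeric(beta_hat), as.numeric(coef(fit)))  # TRUE
```

In practice lm uses a more numerically stable decomposition internally, which is one more reason to let R do this for you.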
They'd much rather use other ways of assessing a variable's significance, and we'll explore that in a little bit. But first, we must understand what is going on with Boro. There are coefficients for Brooklyn, Manhattan, Queens, and Staten Island, but nothing for the Bronx. That is due to the matrix algebra we have to perform to fit the model. There are some limitations in the way matrix algebra works: there's something called multicollinearity, where you can't have any combination of columns equal another column. By only using four of the boroughs instead of five, we avoid the multicollinearity problem. Let's take a look at what's going on. We'll do head of model.matrix, and we'll use a similar formula, but not quite the same, where we just look at Boro. Since Boro is text data, we need to find some way to do math on it, because you can't do math on text directly. What R does is, if there are five levels of the variable, it creates four dummy variables: one for Brooklyn, one for Manhattan, one for Queens, one for Staten Island. Each one is a yes-or-no variable. Is it Brooklyn, yes or no? Is it Manhattan, yes or no? Is it Queens, yes or no? Is it Staten Island, yes or no? Whichever variable gets a one, that row of data is for that borough. If they are all zero across the row, it's the Bronx. The reason this has to be done is that if there were a fifth dummy variable for the Bronx, the five columns would always add up to one. That means their combination would always equal the intercept column, which is also all ones, and that introduces multicollinearity. That is why, whenever there is a categorical variable, dummy variables are created, and there's always one less dummy variable than there are levels in the data. It's a very important part of the matrix algebra, but something you don't have to worry about too much. So, getting back to our data, let's say we want to extract those coefficients again.
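The dummy-coding described above is easy to see with a toy factor standing in for Boro; the level names below are assumptions for illustration, but the mechanics are exactly what model.matrix does with the housing data:

```r
# How R dummy-codes a categorical variable: a toy five-level factor
# standing in for Boro. The first level alphabetically ("Bronx")
# becomes the baseline and gets no column of its own.
boro <- factor(c("Bronx", "Brooklyn", "Manhattan", "Queens", "StatenIsland"))

model.matrix(~ boro)
# One dummy column per level except the first; a row that is all
# zeros (apart from the intercept) belongs to the baseline level.
```

Which level becomes the baseline can be changed with relevel if you want the comparisons made against a different borough.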
There are a few ways to do it. You could say house1, dollar sign, coefficients; we run that and we see the coefficients. You can also use the coef function, which gets you the exact same thing. And if you really feel like being verbose, you can say coefficients of house1. Many options here; choose the one you like best. Now, even if we look at the summary again, and let's do that, it's just a mash of numbers. Staring at a wall of numbers makes it difficult to see what's really going on. I think it's much better to visualize it. Andrew Gelman at Columbia University is a big proponent of coefficient plots. I liked his idea so much that I took it and built an entire R package around it to make visualizing models easier. So we require coefplot and then run the function coefplot on our model. It comes back with a nice display showing our model. Each dot represents a coefficient fitted by the model. The horizontal bars represent the confidence interval: the thick inner bar is one standard error from the coefficient, and the thinner outer bar is two standard errors. The general rule of thumb is that if zero is not contained within the confidence interval for a coefficient, that coefficient is significant. What significance means to you is up to your interpretation, but it generally means that the variable has a real effect on the response. Fitting a model is not a one-off process; you have to fit numerous models and see how they compare to each other. So let's fit a few more. We'll call the second one house2. Again we use lm and value per square foot, and this time we're interested in the interaction between units and square footage, and we'll still use Boro. Let's look at the coefplot for this and see what the interaction, this multiplication sign, did. We have our standard variables, units and square feet, and the four dummy variables for Boro.
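The three coefficient-extraction options can be sketched side by side; this uses a built-in data set as a stand-in, but house1 works the same way:

```r
# Three equivalent ways to pull coefficients from a fitted model.
# fit stands in for house1.
fit <- lm(mpg ~ wt + hp, data = mtcars)

fit$coefficients      # direct list access with $
coef(fit)             # the common accessor function
coefficients(fit)     # the verbose alias for coef()

identical(coef(fit), coefficients(fit))  # TRUE
```

Using the accessor functions is generally preferred over $ because they work uniformly across many model types.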
But we have a new variable: units times square feet. This is an interaction term, because sometimes variables have a multiplicative effect. An anecdotal story I heard from a famous statistician involves smoking data. When considering mortality, a variable's interaction with smoking is far more powerful than that variable alone. For instance, living near power lines isn't as strong an effect on mortality as living near power lines and smoking. Biking in traffic isn't as powerful as biking in traffic interacted with smoking. Interactions are a great tool in data analysis, and they're actually quite simple. All R is doing is taking the two variables, units and square feet, making a new variable that is the product of the two, and using that as a variable in the regression. If we want to look at that, we can fit another model matrix, with value per square foot regressed only on units times square feet. If we look here, we have one variable, units, a second, square feet, and units times square feet. It's really that simple. Notice, though, that even though we only specified units times square feet, it also gives us each of the other variables separately. If we don't want that, we can use slightly different notation. We'll do head of model.matrix, value per square foot, and this time units colon square feet, with data equals housing. What this does is give us only the interaction, not the separate variables. It's just another bit of notation you need to get used to when using R. So why don't we go ahead and fit the model again with just the interaction, not the main effects. I'll copy house2 since it's so similar, paste it in, change the asterisk to a colon, and name this house3. We'll look at the coefplot yet again. We zoom in.
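The difference between the asterisk and the colon can be sketched with a built-in data set standing in for the housing data:

```r
# Interaction notation: `*` expands to both main effects plus their
# interaction, while `:` gives the interaction term alone.
# mtcars is a stand-in data set.
colnames(model.matrix(mpg ~ wt * hp, data = mtcars))
# "(Intercept)" "wt" "hp" "wt:hp"

colnames(model.matrix(mpg ~ wt:hp, data = mtcars))
# "(Intercept)" "wt:hp"
```

So units * SqFt in the housing formula is shorthand for units + SqFt + units:SqFt.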
We see a slightly different story, except that units by square feet is such a small effect that it doesn't even have visible standard errors and sits right on the zero line. While we're at it, let's look at a few more models. Models are quick and easy to fit, so you might as well fit a lot of them. This time we'll say value per square foot on square feet times units times income. This is a three-way interaction. Let's check out the coefficients. Here we have each of the variables by themselves: square feet, units, income. Then the two-way interactions: square feet and units, square feet and income, units and income. And then the three-way interaction: square feet, units, income. The formula interface in R is very smart and may take a bit of getting used to. Even though we only specified square feet times units times income, it gave us all the subsets of one variable, two variables, and three variables. A very good thing to have. Let's try another type of interaction. Let's do house5 gets lm, value per square foot on Class times Boro. I'm doing this on purpose because both Class and Boro are text data. We run this, look at house5's coefficients, and get a lot more information. What R did is create the dummy variables for both Class and Boro and then multiply all the dummy variables in Class by all the dummy variables in Boro. So there are a lot of coefficients now; at this point we're getting maybe too many coefficients for the data. We've also seen in our data that neither square footage nor units appears to be significant. So let's try the ratio of the two. To calculate the ratio, it's not quite as simple as saying square feet divided by units; we need to wrap it in the I function, which stands for identity. What it does is simply divide square feet by units and make that a new variable. So let's look at the coefficients of this.
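Both points here, the full three-way expansion and the I() wrapper for arithmetic, can be sketched on a built-in data set standing in for the housing data:

```r
# A three-way `*` expands to every main effect, every two-way
# interaction, and the three-way term. mtcars is a stand-in data set.
colnames(model.matrix(mpg ~ wt * hp * disp, data = mtcars))
# 8 terms: intercept, 3 main effects, 3 two-way, 1 three-way

# I() computes a literal arithmetic expression as one new variable,
# here the ratio of two predictors.
coef(lm(mpg ~ I(wt / hp), data = mtcars))
# one coefficient for the intercept, one for the ratio wt/hp
```

The housing version is the same idea with I(SqFt / Units) in place of the ratio above.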
We literally have a variable that is square feet divided by units. There are also some shortcuts in the formula interface, and this is where you can do some tricky programming to make your life a lot easier. So, value per square foot again, and this time we say units plus square feet, and we square that. This is coming from housing. Let's take a look at the coefficients and see what we've got. It looks like we got units, square feet, and units interacted with square feet. That seems identical to if we had just used regular multiplication, so let's go ahead and build house8 with just units times square feet. It looks like the same exact results. What R is doing there is expanding out the formula as if it were algebra. If you don't want that, if you literally just want to add units and square feet together and square the sum, not get the whole interaction expansion, you need to wrap it inside the I function yet again. So let's look at value per square foot on I of units plus square feet, squared, from housing. We see we have a typo because we didn't capitalize the P; fixing that, we check out the coefficients. Now we just get the sum of the two variables, squared. Little things like that make a big difference in your modeling; being able to use these formulas to your advantage can make your life a lot easier. Now that we've fit so many models, let's compare them. First, let's clear the console to make some room. We'll use another function from the coefplot package called multiplot, which plots multiple models. We'll just look at the first three of them, because the rest were really just playing around to see how formulas work. We run this, zoom in, and get another coefficient plot, but this time with three models all on the same plot. They're color-coded so you can see how each is doing and compare the coefficients across models.
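The formula-algebra expansion versus the literal I() version can be sketched like this, again with a built-in data set as a stand-in:

```r
# Inside a formula, (wt + hp)^2 is formula algebra: it expands to the
# main effects plus their interaction, the same as wt * hp. Wrapping
# it in I() instead makes one variable that is literally the squared
# sum. mtcars is a stand-in data set.
colnames(model.matrix(mpg ~ (wt + hp)^2, data = mtcars))
# "(Intercept)" "wt" "hp" "wt:hp"     -- same as wt * hp

colnames(model.matrix(mpg ~ I((wt + hp)^2), data = mtcars))
# "(Intercept)" "I((wt + hp)^2)"      -- one literal squared sum
```

Which one you want depends on whether you mean the interaction expansion or the arithmetic quantity itself.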
For instance, the Brooklyn coefficient seems to be largest for house2 as well as for house3, whereas the Manhattan coefficient is smallest in house2 and largest in house3. That said, just because a dot is a bit smaller, you've got to look at the whole interval: there's a great deal of overlap between the coefficients across the different models, so they all might be roughly the same. So far, we've mostly been using regression to explain the price differences in housing in New York City, and right here we see that the effect of Manhattan is clearly much bigger than any other variable. So, just like the first plot showed, where we had the bimodal histogram, having your building in Manhattan sure does make the value per square foot a lot higher, and I don't think that comes as much of a surprise. But we may also want to use regression for prediction. To do that, we need some new data, and I have set aside some data on my website called housingNew.csv. We load it in using read.table as always. We can look at it if we want, and I've already been nice and changed all the variable names so you don't have to do that again. Now let's say we want to make a prediction. What we do is housePredict gets predict. Now, predict is a special function: its first argument should be the model you're making the prediction with, house1. The next argument is newdata, the data you want to base the predictions on, which is housingNew. We want it to give us standard errors, so we say se.fit equals TRUE. And we want a prediction interval. There are a few different types of intervals you can make for a prediction: confidence intervals and prediction intervals. Confidence intervals... let me fix this plus sign to an equals sign.
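The shape of that predict call can be sketched on built-in data; the model, new data frame, and column values below are stand-ins, but the arguments mirror the housing call with house1 and housingNew:

```r
# predict() takes the fitted model, the new data, and the interval
# type. fit and newcars stand in for house1 and housingNew.
fit <- lm(mpg ~ wt + hp, data = mtcars)
newcars <- data.frame(wt = c(2.5, 3.5), hp = c(100, 200))

preds <- predict(fit, newdata = newcars, se.fit = TRUE,
                 interval = "prediction", level = 0.95)

preds$fit  # columns: fit (point prediction), lwr and upr (95% bounds)
```

With se.fit = TRUE the result is a list, so the interval matrix lives in preds$fit and the standard errors in preds$se.fit.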
Now, a confidence interval gives you a measure of how certain you are about the model, and a prediction interval gives you a sense of how uncertain your prediction is. Prediction intervals are almost always bigger than confidence intervals, because it's a prediction, so a little more noise is built in. Let's move this to the next line so we can see everything we're doing, and we'll use a 95% level. We run this and take a look at the results. We see the predicted values, which are called fit, and the lower and upper bounds. So for this row of data, we're predicting about 74, but it really could go anywhere from negative ten to 158. That's quite a large spread, and it goes negative, saying our property has negative value. Things like this have to be sanity checked. It might mean your model is mis-specified, or it just might mean the model doesn't explain the data all that well because it's hard to fit. So always look at your predictions and always take the intervals into account, because you don't want to make a prediction saying, "Hey, we're gonna get 74 dollars per square foot. Oh, by the way, it could go anywhere from a loss to 158 dollars per square foot." You really want to tune these models to get nice, tight prediction intervals. So, we've seen how to fit regressions in R, the intricacies of the formula interface, and how regression can be used for prediction or inference. Typically, when making inference, you really want to be able to explain your model: the size of the coefficients, both positive and negative, tells you the effect each variable has on the response. A highly positive coefficient means a really strong positive effect on the response; a highly negative coefficient means a negative effect. When using regression for prediction, you don't necessarily need to be able to explain the model; it can be more of a black box.
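The claim that prediction intervals are wider than confidence intervals at the same level is easy to check; a minimal sketch on built-in data standing in for the housing model:

```r
# At the same level, a prediction interval is wider than a confidence
# interval, because it adds the noise of an individual observation on
# top of the uncertainty about the mean. mtcars is a stand-in.
fit <- lm(mpg ~ wt, data = mtcars)
newdat <- data.frame(wt = 3)

ci <- predict(fit, newdata = newdat, interval = "confidence", level = 0.95)
pi <- predict(fit, newdata = newdat, interval = "prediction", level = 0.95)

ci[, "upr"] - ci[, "lwr"]  # width of the confidence interval
pi[, "upr"] - pi[, "lwr"]  # wider: the prediction interval
```

Both intervals share the same point prediction in the fit column; only the widths differ.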
And that's always the trade-off between prediction and inference. With inference, you might not get strong predictions, but you can really explain what's happening. With prediction, you might have no clue what's happening, but you can make really strong predictions. So, feel free to make your models as complex or as simple as you need, but try to get a really good fit that suits your purposes.