16.1 Fit simple linear models
Regression is the workhorse of statistics. It is a powerful tool that lets you examine all sorts of relationships in data. It was invented by Francis Galton, and the name regression refers to regressing to the mean: it turns out that, over time, highs and lows end up regressing back to the mean, and that's where the word came from. It all starts with simple linear regression, predicting one variable based on another.

To explain this, we'll look at data on the heights of fathers and sons. We'll load that up from the UsingR package, and we will look at some of the data. We have a bunch of heights for fathers and a bunch of heights for sons. Let's load ggplot2 to illustrate what we're looking to do. Don't worry about messages like "The following object is masked from package:UsingR"; movies is an object that exists in both ggplot2 and UsingR, and if you don't need to use it, it's not going to affect you much. So if we run ggplot, we'll use the father.son data set. We will map the fathers' heights to the x-axis, and we will map the sons' heights to the y-axis. We'll add in points to make the scatter plot, and we will put in a smoothing line. This line is going to be the regression line. You do that with geom_smooth, and we tell it to use lm, which is the linear model. And lastly, we'll just give it some nice labels. Right there is one of the tricky things about using ggplot if you break it up over multiple lines: I ran just the bottom line without running the top line first. So combining both lines, we get the plot we want (the full sequence of commands is sketched below).

Now let's zoom in on this. We have fathers and sons, and we want to see what relationship these two variables have. If you have a father of a given height, how tall should his son be? Generally, we expect that if a father is tall, his son should be tall, and if a father is short, his son should be short. Of course there's randomness in here, and there's variation. A short father could have a very tall son, or a tall father could have a very short son; it could go either way. That's what regression does: it tries to capture and explain away this randomness. We do that by fitting the best fit line through the data. We're trying to get the line that best accommodates all the points. In this plot, you see the shaded gray region. That is the amount of uncertainty in our estimate. The smaller the gray region, the more certain you are; the bigger the gray region, the less certain you are.

So our goal is to fit this line. If you remember back to your math classes, this is just a straight line of the form y equals mx plus b. In fact, there is a set equation for this: y equals a plus b times x plus epsilon. For a moment, ignore the epsilon; the estimated line, ŷ, is just a plus bx. That is y equals mx plus b, but with the letters changed up a little bit. The b represents the slope and the a represents the intercept. It's a basic, simple straight line. The epsilon is the random noise. Like I said, sometimes a tall father will have a short son or a short father will have a tall son; that's the randomness in there. Now, the whole point of regression is solving for a and b. It turns out this is easily done. b, the slope, is given by this equation: it is the sum of xi minus x bar times yi minus y bar, divided by the sum of xi minus x bar squared. That gives you the slope, and don't worry about calculating this by hand; the computer does it automatically.
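The plotting commands described above look roughly like this. This is a sketch, assuming the UsingR and ggplot2 packages are installed; the exact axis labels used in the video aren't specified, so the ones below are placeholders.

```r
# Load the packages: UsingR provides the father.son data, ggplot2 does the plotting
library(UsingR)
library(ggplot2)

head(father.son)    # fheight = father's height, sheight = son's height (inches)

# Scatter plot of sons' heights against fathers' heights with a fitted
# regression line; the shaded band shows the uncertainty in the fit
ggplot(father.son, aes(x = fheight, y = sheight)) +
    geom_point() +
    geom_smooth(method = "lm") +
    labs(x = "Father's height (inches)", y = "Son's height (inches)")
```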
The intercept a is simply y bar, which is the average of y, minus b, which we previously calculated, times x bar. That's all there is to it. So now that we've seen the idea of regression and we've learned the formulas, let's put it into action. We come back to R and we fit a model. We can assign the results of this model to a variable, which we'll call heightsLM. The function for fitting regressions in R is lm; it stands for linear model. We are regressing the height of the sons onto the height of the fathers. Again, we see the formula notation. sheight is the response: it's the variable of interest, the thing we are modeling. fheight is the predictor: it's what we're using to predict the response. Now, this is a point where terminology can get changed around in different disciplines. Statisticians prefer to call the y variable the response and the x variables the predictors. Other fields have different names for them. One particularly irksome name that always bothers me is independent variable for the xs and dependent variable for the ys. This is a misnomer because in probability theory independence is a two-way street: if x is independent of y, then y is also independent of x. If there's a relationship between the variables, one cannot be independent while the other is dependent. So I do not like that terminology; the best way to go is predictors and response. Finishing out this function, we tell it we're getting the data from father.son and we run it. Simple as can be. It ran very quickly. This is a small data set, but fitting models in R is generally fast because the routines are built on very efficient Fortran libraries.

So let's look at our model. All we get back is the formula we used and the coefficients. Remember that coefficients tell you the effect. The intercept doesn't always have an interpretation. In this case, it says that a father of zero height would have a son about 34 inches tall. Of course, that doesn't make any sense, because you couldn't have a father of zero height, but the intercept is still very important because it adds stability to the model. fheight is the effect that the father's height has on the son's height. That is, for every additional inch of father's height, the son gets an additional 0.5 inches. It's a little confusing at first, but it's essentially a multiplier. If we go back to our equation, we see that the coefficient for father's height is the b, while the father's height is the x. It means that for every increase in the father's height, you multiply it by b to get the increase in the son's height.

This display didn't give us a lot of information, so let's look at another display by using summary. Now we have a lot more information. We still have the formula used in the call, and we have some diagnostics on the residuals. The residuals are the errors: the differences between what was predicted and what was actually observed in the training data. Residuals are a way of judging how good the model is; we'll come back to that a little bit later. Down here, we have more information on the coefficients. We have the estimates that we saw before, and we have the standard errors, because in statistics you don't ever want just one estimate; you also need a measure of the uncertainty. The standard error tells you how uncertain you are. We'll also come back to significance later. Another piece of information is R-squared. Many people are focused on R-squared and want a high R-squared, but most statisticians don't get too excited about it, because what counts as good really depends on your field.
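To connect the formulas to the data, here is a minimal sketch that computes the slope and intercept by hand, assuming father.son has been loaded from UsingR as above.

```r
x <- father.son$fheight
y <- father.son$sheight

# slope: b = sum((x - xbar) * (y - ybar)) / sum((x - xbar)^2)
b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# intercept: a = ybar - b * xbar
a <- mean(y) - b * mean(x)

c(intercept = a, slope = b)
```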
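And a sketch of fitting and inspecting the model with lm as described above; the coefficients it prints should match the hand calculation.

```r
# Regress sons' heights on fathers' heights
heightsLM <- lm(sheight ~ fheight, data = father.son)

heightsLM            # prints the call and the two coefficients
coef(heightsLM)      # intercept around 34, fheight around 0.5

summary(heightsLM)   # coefficients with standard errors, residual summary,
                     # R-squared, and the overall p-value
```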
In the social sciences, an R-squared of 0.3 is considered a really great R-squared and means your model explains the data well. In physics, an R-squared of 0.9 is what's considered good. So it's all relative; there is no absolute "hey, that's a great R-squared." It's all relative to your field. The display also reports the residual standard error and the degrees of freedom. In this case, the degrees of freedom are the number of rows minus the number of parameters being estimated. It also gives an overall p-value for the whole regression, asking, as a whole, "Is this regression good?"

For now, I would like to take a quick detour into ANOVA. ANOVA is often used for comparing groups. As an example, let's look at the tips data. Let's clear out the console and load up the tips data using data(tips, package="reshape2"). And we look at the head of tips. Here we have information on the bill, the tip, the sex of the bill payer, and so forth. Let's say we want to compare tips by day; we want to see if tips differ on Sunday or Saturday, or what have you. So let's build an ANOVA model (the commands are sketched at the end of this section). We do that using the aov function, and we say tip is modeled by day minus one and the data comes from tips. We'll also fit the same thing, but use regression instead: tipsLM gets lm of tip by day minus one, data equals tips. The minus one in this formula means do not use an intercept.

Running this, we can see the summary of the tips ANOVA, and we get a breakdown telling us that it does look like at least one of the groups is significantly different. When we run summary of tipsLM, we get more information. In fact, we get an estimate for each of the coefficients. What this is showing us is that each coefficient is indeed significant, meaning significantly different from zero. It doesn't necessarily say that they are significantly different from each other, but it does give a lot more information about each of the groups. This provides similar information to the ANOVA, but in more detail. We now have greater insight into the individual variables, whereas the ANOVA just told us that something was different. Plotting these, as we will see when we discuss multiple regression, can really give insight into which one is different.

This has been a quick look at simple regression and a little detour into ANOVA. Simple regression is just the basis for multiple regression, which we cover in great detail. Using regression, you can explain the relationships between variables or make predictions on new data. It is a powerful tool with many extensions that lets you quickly and efficiently analyze data.
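Here is a sketch of the ANOVA detour described above, assuming the reshape2 package is installed; the object name tipsAnova is just illustrative, since the video doesn't show what the ANOVA fit was called.

```r
# Load the tips data from reshape2 and peek at it
data(tips, package = "reshape2")
head(tips)

# ANOVA comparing tips across days; the "- 1" drops the intercept so each
# day gets its own group mean
tipsAnova <- aov(tip ~ day - 1, data = tips)

# The same model fit as a regression
tipsLM <- lm(tip ~ day - 1, data = tips)

summary(tipsAnova)   # one overall test of whether any group differs
summary(tipsLM)      # an estimate, standard error, and test for each day
```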