16.4 Fit logistic regression - Video Tutorials & Practice Problems
Video duration:
10m
Linear regression is great when the response variable is continuous data. That means it's just a regular numeric variable that can take on any numeric value. In order to model other types of data, such as count data, categorical data, or binary yes-no data, you need to use generalized linear models. They are an extension of multiple regression that can work with all sorts of other types of data. First we will look at logistic regression. This is regression where you're modeling a yes-no outcome. It could be survive or not survive, it could be win or lose, it could be yes or no, true or false, two possible values. This is modeled based on the logistic distribution. The probability that yi equals one, yi being the response, so the probability of a success, is the inverse logit of xi times beta. xi is the vector of predictors for this particular row, and beta is the vector of coefficients. Don't worry too much about this, R does this automatically, and the inverse logit is a simple function that is defined as one over one plus e to the negative x. While for linear regression the betas can be solved for analytically, we saw there's just a simple equation, for logistic regression and other GLMs we need an algorithm: the Newton-Raphson method, also known as iteratively reweighted least squares. It fits a linear regression, adjusts the weights, fits a regression again, adjusts the weights, fits it again, adjusts the weights, and it does this until it converges and there's no more need to change the weights. Don't worry about having to do this yourself. R has a function called glm, which does this automatically and quickly. To illustrate, let's grab some data from the ACS, that is the American Community Survey. I've put up on my website some ACS data for New York. That's jaredlander.com/data/acs_ny.csv. So we can read this in now and see we have all sorts of demographic information.
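The inverse logit described above can be sketched in a couple of lines of R (the function name `invlogit` is our own choice for illustration; it's not built into base R):

```r
# Inverse logit: maps any real-valued linear predictor into (0, 1),
# so it can be read as a probability.
# invlogit(x) = 1 / (1 + exp(-x))
invlogit <- function(x) 1 / (1 + exp(-x))

invlogit(0)            # a linear predictor of 0 gives probability 0.5
invlogit(c(-4, 0, 4))  # large negative -> near 0, large positive -> near 1
```

Whatever value the linear predictor xi times beta takes, this function squashes it into the zero-to-one range, which is what lets a linear model produce a probability of a yes.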
We have the acreage of the property, the family income, the family type, number of bedrooms, number of children, number of people, rooms, and it goes on and on. And this is just a small subset of what the ACS provides. For those who are unfamiliar, the ACS is very similar to the census. It's all sorts of demographic information, except the results come out more frequently and are much more detailed. The census has moved towards the short form having only 10 questions; the ACS has hundreds of questions, is a very complex survey, and is a great source of information about the country. We are going to model whether or not a family's income is greater than or equal to $150,000. To do that, we need to create a new variable, which we will call Income. We will use the with function, just to make the typing a little bit easier, and say with(acs, ...) to check if FamilyIncome is greater than or equal to 150,000. Remember, the with function lets us step into the data frame and access the variables by name directly. This check is going to return a bunch of TRUEs and FALSEs, a bunch of yes-nos as to whether income is above that threshold. To see this a little better, we're going to load a couple of packages: ggplot2, of course, our favorite plotting package, and useful, which has some helpful formatting functions that go along with ggplot2. Let's take a look at the distribution of incomes and see where they fall. So we will say ggplot(acs) and map the x aesthetic to FamilyIncome. We will add a geom_density, fill it in with grey, and make the color grey as well. Remember, the fill is how it gets shaded in, and the color is the color of the line. Then we will also say geom_vline(xintercept = 150000). This is going to plot a vertical line at the $150,000 mark.
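Creating the yes-no response described above can be sketched like this; since the real CSV lives on a website, simulated incomes stand in here for the FamilyIncome column:

```r
# Sketch of building the binary response; rlnorm() simulates a
# right-skewed income distribution in place of the real ACS data.
set.seed(1)
acs <- data.frame(FamilyIncome = rlnorm(1000, meanlog = 11, sdlog = 0.9))

# with() lets us refer to columns by name; the comparison returns a
# logical vector of TRUE/FALSE values, one per row.
acs$Income <- with(acs, FamilyIncome >= 150000)
table(acs$Income)
```

That logical Income column is exactly what glm will later treat as the yes-no outcome.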
And we will just format the x axis to be a little nicer, saying scale_x_continuous(labels = multiple.dollar), a function that comes from the useful package, and we'll put some limits on the scale so it goes from zero to one million. We run these lines of code, and we need to make sure we spell continuous correctly. I was going to use auto-complete, but I decided to type it out myself, and that's the lesson I learned. Ctrl+Shift+P to run the whole line again, and we can now zoom in. There are a number of warnings that came up, but they are just warnings; you don't have to worry. Now we see the density of family incomes. We can see right here where the $150,000 mark is. The bulk of the density is below that, with a right-skewed tail going beyond. So let's go ahead and fit our first model. We'll clear the console. Fitting a logistic regression is very similar to fitting a linear model. We say income1 <- glm, and again the g stands for generalized, because we are extending linear models. We build up the formula the same way we did with lm: the response variable goes on the left, and all the predictors go on the right. We're going to use five variables, housing cost, number of workers, whether the place is owned or rented, number of bedrooms, and family type, to predict whether or not income is greater than or equal to $150,000. So we enter this in, and we will break the formula onto the next line because it's getting kind of long. We say the data is coming from acs, and now we need to specify the family, because there is a whole family of glm models. There's a family for count data, a family for binary data, a family for waiting times, lots of different types of families, so we need to specify that. In this case it's binomial, because it's binary data, yes or no. We are also going to specify that the link function should be logit.
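A glm call of this shape can be sketched on simulated data; the column names below are stand-ins for the ACS variables named above, and the coefficients used to generate the data are made up for illustration:

```r
# Minimal sketch of a logistic fit with glm(); simulated data, with
# invented column names echoing two of the ACS predictors.
set.seed(2)
n <- 500
dat <- data.frame(
  HouseCosts = rpois(n, 1500),
  NumWorkers = sample(0:3, n, replace = TRUE)
)
# Generate a yes/no outcome whose probability follows the inverse logit
# of a linear predictor.
eta <- -4 + 0.002 * dat$HouseCosts + 0.8 * dat$NumWorkers
dat$Income <- rbinom(n, 1, 1 / (1 + exp(-eta)))

# family = binomial says this is binary data; link = "logit" is the
# default for binomial but is spelled out here for clarity.
fit <- glm(Income ~ HouseCosts + NumWorkers,
           data = dat, family = binomial(link = "logit"))
coef(fit)
```

The formula syntax (response on the left of `~`, predictors on the right) is identical to lm; only the family argument is new.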
The logit is the function we showed earlier in the formula, and the link is a way of translating from the linear predictors to the scale of the original data, which is yes or no. We run this, and we see we have an error: it can't find the variable Income. Something must have gone wrong when we created it. So let's come down here and check out head(acs), and we scroll over and see INcome has a capital N. R is case sensitive, so the easiest way to fix this is to just change the name of this one variable. We could use the rename function from the plyr package, or we could use just base R; let's try that. We want names(acs), but only the last one. Unlike in some other languages, you can't use negative one to get to the last element. You need to actually specify what the last element is, because negative one actually drops the first element, and negative two drops the second element. For us, though, we can just say names(acs)[length(acs)], because when you take the length of a data frame, it gives you the number of columns. So we'll change this name to Income, with a lowercase n, and now when we look at head(acs), we see Income looks much better. So we will once again enter our formula; I will copy and paste it down here. We run our function, and now it worked. Let's look at a summary of it. Much like with lm, we get a lot of information, particularly the coefficient estimates, the standard errors, the z values, and the p values. Here it's using a z value instead of a t value because it's just a different distribution; no need to worry too much about that. What's important, though, is deviance. The measure of error for a glm isn't mean squared error, it is deviance. It's a measure of how far off you are from the true mark. We explain deviance in more detail a little bit later. For now, let's look at the coefficient plot of this model. So we do require(coefplot), in case it wasn't loaded yet, and we say coefplot(income1).
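The rename trick described above can be sketched on a toy data frame (the misspelled INcome column here just mimics the situation in the video):

```r
# Renaming the last column in base R. length() of a data frame is its
# number of columns, so names(df)[length(df)] always targets the last one.
df <- data.frame(HouseCosts = c(1800, 2400), INcome = c(TRUE, FALSE))
names(df)[length(df)] <- "Income"
names(df)
```

Note that negative indexing would not help here: `names(df)[-1]` drops the first name rather than selecting the last one, which is why the explicit length() approach is needed.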
We zoom in, and we can see the effect each of the variables has. We see that even though it has a big confidence interval, the most highly positive coefficient is for owning a place outright. That means not rented, and not even a mortgage. That is the greatest indicator of having an income over $150,000. Interpreting these coefficients takes a little bit of work. Let's just call them up, and notice that these are on the logit scale. We want to convert them back to the standard scale they were on when we first did the model. To do this we need to build a little helper function. We'll call it invlogit <- function(x), which simply does 1 / (1 + exp(-x)). exp is e raised to the power of its argument, so this is e to the negative x. Close that function and run it. Now we can take the inverse logit of the coefficients, and it puts them back on their standard scale. Logistic regression is a modeling technique that is very, very common in everything from medical fields to advertising. In medical fields it's often used to model whether a patient survived; in advertising, whether someone will click an ad. It's a very powerful tool that, when harnessed correctly, can really provide a lot of insight and make great predictions.
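The back-transformation step can be sketched end to end on a built-in dataset, since the ACS data isn't bundled with R; mtcars with its 0/1 `am` column serves as a stand-in binary response:

```r
# Back-transforming logit-scale coefficients with the helper function
# built in the video; mtcars$am (0/1 transmission type) stands in for
# the ACS yes/no income variable.
invlogit <- function(x) 1 / (1 + exp(-x))

fit <- glm(am ~ wt, data = mtcars, family = binomial(link = "logit"))
coef(fit)            # on the logit scale
invlogit(coef(fit))  # each value now lies between 0 and 1
```

Because invlogit squashes everything into the zero-to-one range, the transformed values read as probabilities rather than raw log-odds, which is what makes them easier to talk about.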