17.2 Decrease uncertainty with weakly informative priors
Video duration: 8m
Video transcript
Similar to regularization methods, Bayesian shrinkage is another way of reining in out-of-control coefficients. Some people, when they hear the word Bayesian, get all up in arms; other people get excited; and other people just say okay, it's another tool. We will show here how it can be used to great effect in handling data that wasn't necessarily collected properly.

To do this, we will look at voter ideology data, available at www.jaredlander.com/data/ideo.rdata. It took a lot of work to get it formatted nicely, but it's ready to go. So let's load it in and take a look. This is all sorts of demographic information going back to 1948, including who the person voted for.

What we can do here is fit a separate model for each year. It's a very common tactic, and it's been coined by Andy Gelman as the secret weapon. To do this, let's first figure out what years we have available to us: theYears <- unique(ideo$Year). Taking a look, we have a good number of years going from 1948 to 2000, every presidential election.

Let's pre-allocate a list to hold all these results: results <- vector(mode="list", length=length(theYears)), so we'll have as many slots as there are years. Looking at that, it is just an empty list with 14 slots. To make this easier to work with later, we give each of the elements a name: names(results) <- theYears. Now if we look at results, they're all nicely named.

What we will do now is build a loop, yes a loop, that fits a glm on each year of the data. We write for(i in theYears), which loops through the years one at a time, and then results[[as.character(i)]]. Since the list elements are named, when the loop gets to the year 2000 it converts the year to a character and stores the model in the "2000" element. We assign glm(Vote ~ Race + Income + Gender + Education, data=ideo, subset=Year==i, family=binomial(link="logit")), where Vote is Democrat or Republican and i will be 1948, 1952, 1956 and so on. We close the for-loop and run it, and it completes very quickly.

To view all the models together, we use require(coefplot) to pull up that package. We're actually interested in just one coefficient, so let's go grab that information. A nice feature of multiplot is that you can feed it all the models as a list and specify a single coefficient, in this case "Raceblack"; saying plot=FALSE makes it come back as data, which we store in voteInfo. Looking at head(voteInfo), we see the value of the coefficient, all for Raceblack, along with the confidence intervals.

Again, I don't like tables, I prefer graphs. So we call multiplot again, but this time instead of plot=FALSE we say secret.weapon=TRUE. Looking at the result very quickly: one year, 1964, has so much uncertainty that you can't even see how much uncertainty is in the other coefficients, or whether they're significant. So we take this again and add a piece of ggplot code, coord_flip(xlim=c(-20, 10)), because the plot uses flipped coordinates. That zooms in the plot so we can see what's going on.
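Putting those steps together, here is a minimal R sketch of the workflow just described. The download.file() step and the explicit ggplot2 load are additions for self-containment; the formula, column names (Vote, Race, Income, Gender, Education, Year), and multiplot() calls follow the transcript.

    # fetch and load the ideo data.frame from the URL given above
    download.file("http://www.jaredlander.com/data/ideo.rdata",
                  destfile = "ideo.rdata", mode = "wb")
    load("ideo.rdata")                # creates the ideo data.frame

    theYears <- unique(ideo$Year)     # 1948, 1952, ..., 2000

    # pre-allocate a named list, one slot per election year
    results <- vector(mode = "list", length = length(theYears))
    names(results) <- theYears

    # fit a separate logistic regression for each year
    for (i in theYears) {
        results[[as.character(i)]] <- glm(
            Vote ~ Race + Income + Gender + Education,
            data = ideo, subset = Year == i,
            family = binomial(link = "logit"))
    }

    require(coefplot)
    library(ggplot2)                  # for coord_flip

    # pull just the Raceblack coefficient from every model, as data
    voteInfo <- multiplot(results, coefficients = "Raceblack", plot = FALSE)
    head(voteInfo)

    # the secret weapon plot, zoomed in so 1964 doesn't swamp the rest
    multiplot(results, coefficients = "Raceblack", secret.weapon = TRUE) +
        coord_flip(xlim = c(-20, 10))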
Zooming in this way will cut off part of 1964's confidence interval, but we'll be able to see everything else; that's how out of control that confidence interval was. When we zoom in, we see that the other years actually have a fair amount of variance to them, and they're nowhere near the zero line. So something is clearly wrong with the model for that year.

To fix that, we can use a weakly informative prior and a Bayesian regression. Let's go through and build another for-loop. First we make resultsB <- vector(mode="list", length=length(theYears)), just like before, making sure to close off our string; thankfully RStudio's color-coding reminds us. We give it names like before: names(resultsB) <- theYears.

Now we can build the loop. We again say for(i in theYears), and this time resultsB[[as.character(i)]] <- gets a special function in the arm package called bayesglm. We're not actually going to load the arm package; we just take advantage of it by saying arm::bayesglm. That :: lets you access functions in a package when that package hasn't been loaded. This function works very similarly to glm; in fact it's so similar that I'm going to copy the formula, and everything up through the family, because why type it again and again? We're still regressing Vote on Race + Income + Gender + Education, with data=ideo, subset=Year==i and family=binomial(link="logit"). The difference is that we also specify prior.scale=2.5 and prior.df=1.

What Bayesian regression does is put prior beliefs on the coefficients, which constrains them a little bit. Andy Gelman put a lot of work into finding good default priors, and he found that a t-distribution prior with scale 2.5 and one degree of freedom, that is, a Cauchy prior, generally gets the best results overall, so that is what we will go with.

Let's clear the console to make room, and run the for-loop. This takes a little longer to run, and we get an error. The error is a compatibility issue that has come up with the arm package a few times: the subset argument doesn't always work properly. So we will cheat a little bit and subset the data frame in place, passing data=ideo[ideo$Year==i, ] instead of using the subset argument. Depending on what version of R you are running, you may or may not hit this issue; it happened with the transition from R 2.15 to 3.0, and apparently again from 3.0 to 3.0.1. Let's clear the console and run it again. That time it ran. The subsetting problem is a known issue, and the package maintainers are working on fixing it; I was hoping it had been fixed by now.

So let's go ahead and plot the secret weapon again, limited to just that one coefficient. Notice that while 1964 is still a bit out of control, it is much better than it used to be. You can have much more confidence in it now, all thanks to Bayesian shrinkage. It's important to note that the model for 1964 did not get any information from the models for the other years; completely on its own, it got shrunk down by that weakly informative prior. It's amazing how much that helped.
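For reference, here is a sketch of the Bayesian version of the loop just described, using the in-place subsetting workaround. It assumes the arm package is installed (arm:: only works for installed packages) and that theYears and the coefplot setup from the earlier code are still in scope.

    # pre-allocate a second named list for the Bayesian fits
    resultsB <- vector(mode = "list", length = length(theYears))
    names(resultsB) <- theYears

    for (i in theYears) {
        # arm::bayesglm is used without attaching arm; the data frame is
        # subset in place to sidestep the subset= compatibility issue
        resultsB[[as.character(i)]] <- arm::bayesglm(
            Vote ~ Race + Income + Gender + Education,
            data = ideo[ideo$Year == i, ],
            family = binomial(link = "logit"),
            prior.scale = 2.5,   # Cauchy prior (t with df = 1), scale 2.5
            prior.df = 1)
    }

    # the secret weapon again: 1964 is now shrunk to a sensible range
    multiplot(resultsB, coefficients = "Raceblack", secret.weapon = TRUE)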
It turns out that in 1964 the survey included an incredibly small number of African American voters, so small that we couldn't get good estimates from the regression. Using Bayesian regression, we were able to compensate for that. A quick note about the secret weapon: by fitting a model on each year's data, we can plot a variable's effect over time, so here we can see how African American voters have voted over time and how that has changed. The secret weapon really is a great tool, and Bayesian regression lets you bring an out-of-control coefficient under control without a lot of effort, simply by specifying priors on the coefficients.