17.5 Use GAMs - Video Tutorials & Practice Problems
Video duration:
5m
<v Voiceover>Generalized additive models,</v> otherwise known as GAMs, are a nice way of modeling data that aren't necessarily linear but still have a sort of additive effect. They let you combine variables through functions of those variables in order to fit a regression. To look at this, we'll use the credit data, which is available at www.jaredlander.com/data/credit.rdata. It was originally taken from the UC Irvine Machine Learning Repository. The address of the original data, which is incredibly messy, is http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data. It has no headers, and the variables are not coded. Very hard to work with. I recommend using the data I have on my website.
We load that into memory, and let's take a look at a plot of it. (keyboard tapping) We'll say ggplot(credit, aes(x=CreditAmount, y=Credit)), and bear in mind, y is a binary variable. We're going to need to do some tricks to make it more visible. (keyboard tapping) What we're doing here is jittering the data. Since it's binary, it would all fall along two straight lines. By jittering it, we're adding some random noise to spread the points out a little bit. (keyboard tapping) Here we're just adjusting the text on the x-axis. The faceting will break it up into a nice grid by the credit history and employment variables. And we will give the x-axis some nice formatting. multiple is a helper function from the useful package that formats scales nicely.
We forgot to give a plus sign here. When we did that, we ran all but one line, so we need to cancel it and run all the lines this time. Ahh, most importantly though, we forgot to close off theme. Lots of little tricks, and when code gets complicated, there are lots of little things that can fall through the cracks. (mouse clicking) This is just one way of looking at our data, but it is, indeed, very complex data. So to fit this, we're going to use, again, a generalized additive model to patch it together.
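The plot described above is typed off screen, so the following is only a rough sketch of what it might look like, not the code from the video. It assumes ggplot2 is installed and uses a small simulated data frame as a stand-in for the real credit data; only the column names come from the transcript. The factor levels, the jitter height, the label angle, and which variable goes on the rows versus the columns of the grid are all assumptions, and scales::comma is swapped in for the multiple helper from the useful package to avoid an extra dependency.

```r
library(ggplot2)

set.seed(42)
n <- 400

# Simulated stand-in for the credit data (NOT the real data set).
# Column names follow the transcript; values and factor levels are made up.
credit <- data.frame(
    CreditAmount  = round(rlnorm(n, meanlog = 8, sdlog = 0.7)),
    CreditHistory = factor(sample(c("All Paid", "Up To Date", "Late Payment"),
                                  n, replace = TRUE)),
    Employment    = factor(sample(c("< 1 yr", "1 - 4 yrs", ">= 4 yrs"),
                                  n, replace = TRUE)),
    Credit        = factor(sample(c("Good", "Bad"), n, replace = TRUE,
                                  prob = c(0.7, 0.3)))
)

creditPlot <- ggplot(credit, aes(x = CreditAmount, y = Credit)) +
    # add vertical noise so the two response levels do not collapse into lines
    geom_jitter(position = position_jitter(height = 0.2)) +
    # one panel per combination of CreditHistory (rows) and Employment (columns)
    facet_grid(CreditHistory ~ Employment) +
    xlab("Credit Amount") +
    # rotate the x-axis tick labels so large amounts stay readable
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
    # the video uses `multiple` from the useful package; scales::comma is a stand-in
    scale_x_continuous(labels = scales::comma)

print(creditPlot)
```

Note that every line except the last ends in a plus sign and that the call to theme is closed off, which are the two slips mentioned in the narration.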
Doing this requires a good package, and for that, we will use mgcv. So we load up the package, and now we fit a model. We say creditGam <- gam, and it still uses the formula interface, but now we get to throw in other functions, such as a tensor product of CreditAmount and a spline of Age. Then we'll just leave in CreditHistory as normal and Employment as normal. We'll say the data comes from credit, and we'll say family=binomial because, remember, it's essentially still a logistic regression, even though it's technically not a regression, it's a GAM.
(keyboard tapping) So let's run this, and we get a bunch of warning messages about NAs and infinities. They're warning messages. We can sidestep them. Let's see the summary. It returns information just like an lm would. It has coefficients. First it does the normal variables, CreditHistory and Employment, and it gives their coefficients. They're both factor variables, so it gives a coefficient for each level other than the baseline. Then it gives information on the approximate significance of the smooth terms. That's the tensor product of CreditAmount and the spline of Age. To fully understand these, we could look at some plots. For instance, we can plot creditGam, and we'll say select=1, se=TRUE, and shade=TRUE. What that does is show us the smooth of the tensor product of CreditAmount against CreditAmount. It just shows how it gets smoothed in. Likewise, we can do that for the spline of Age against Age.
This just gives a quick sense of the way GAMs work. GAMs are a great alternative to linear regression when the relationship between the predictors and the response isn't necessarily linear, so an ordinary linear regression isn't appropriate. That is where GAMs are at their strongest.
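The fit and the two plots walked through above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the exact code from the video. It uses mgcv's gam, te, s, and plot functions as described in the narration, but runs on a simulated stand-in for the credit data. The response is coded as 0/1 here for simplicity, and the simulated relationship between the predictors and the response is invented purely so the smooths have something to find.

```r
library(mgcv)

set.seed(42)
n <- 500

# Simulated stand-in for the credit data (NOT the real data set).
# Column names follow the transcript; values and factor levels are made up.
credit <- data.frame(
    CreditAmount  = round(rlnorm(n, meanlog = 8, sdlog = 0.7)),
    Age           = sample(19:75, n, replace = TRUE),
    CreditHistory = factor(sample(c("All Paid", "Up To Date", "Late Payment"),
                                  n, replace = TRUE)),
    Employment    = factor(sample(c("< 1 yr", "1 - 4 yrs", ">= 4 yrs"),
                                  n, replace = TRUE))
)

# Invented nonlinear effects of CreditAmount and Age on the log-odds.
eta <- 1.5 - 0.8 * log(credit$CreditAmount / 3000)^2 + sin(credit$Age / 10)
# Assumption: Credit is coded 1 (good) / 0 (bad) in this sketch.
credit$Credit <- rbinom(n, size = 1, prob = plogis(eta))

# te() is a tensor product smooth and s() is a spline smooth;
# the two factors enter as ordinary parametric terms.
creditGam <- gam(Credit ~ te(CreditAmount) + s(Age) + CreditHistory + Employment,
                 data = credit, family = binomial(link = "logit"))

# Parametric coefficients first, then approximate significance of the smooths.
summary(creditGam)

# First smooth term: te(CreditAmount), with a shaded standard-error band.
plot(creditGam, select = 1, se = TRUE, shade = TRUE)
# Second smooth term: s(Age).
plot(creditGam, select = 2, se = TRUE, shade = TRUE)
```

The select argument picks which smooth term to draw, in the order the smooths appear in the formula, so select=1 is the CreditAmount term and select=2 is the Age term.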