17.4 Use Splines - Video Tutorials & Practice Problems
Splines are a great way to fit a smoothed curve over data. When dealing with messy data that isn't quite linear, yet you still want a best-fit line, splines are probably your best bet. They have a way of fitting a squiggliness to the data, and yes, that is a technical term.
So to see this, let's load up the diamonds data and build a few splines, each with different degrees of freedom. Different degrees of freedom mean different smoothness. So we do smooth.spline, say x equals diamonds$carat, y equals diamonds$price. We'll just use the default degrees of freedom to start. Now notice this is a smoothing spline. It's sort of like a regression, smoothing price onto carat, although it has no restriction of linearity: it could be curvy, it could go all over the place. So let's make a few of these. This time let's do it with degrees of freedom equals two, and then we will say 10 degrees of freedom, 20, 50, and 100.
Okay, so now we have all of these fit. Unfortunately these are not in a nice easy-to-use format to extract data out of, so we need to build a little helper function to grab the data. So we will build get.spline.info, and it will be a function which takes in an object and returns a data frame made up of object$x, object$y and the degrees of freedom of that object, object$df. You run it to put it in memory.
So now we want to do this for all of them and build a nice big data frame. The easiest way to do that is to put these all in a list and run ldply over it. We require plyr, then we say splineDF gets ldply, because we are going from a list to a data frame, and we're going to build the list right on the spot: diaSpline1, diaSpline2, and so on. And then the function we call will be get.spline.info, so let's run this and see what it looks like. Okay, we have a lot of information here. Each of these splines has one entry per distinct carat value in the data set, so we have a lot of data here.
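The steps narrated so far can be sketched in R roughly as follows. This is a reconstruction from the voiceover, not the exact code shown on screen. The names diaSpline3 through diaSpline6 are assumed, since the narration only names the first two, and it assumes the diamonds data comes from the ggplot2 package.

```r
library(ggplot2)  # assumed source of the diamonds data
library(plyr)     # provides ldply

data(diamonds)

# Smoothing splines of price on carat.
# The first fit uses the default smoothness; the rest set df explicitly.
diaSpline1 <- smooth.spline(x = diamonds$carat, y = diamonds$price)
diaSpline2 <- smooth.spline(x = diamonds$carat, y = diamonds$price, df = 2)
diaSpline3 <- smooth.spline(x = diamonds$carat, y = diamonds$price, df = 10)
diaSpline4 <- smooth.spline(x = diamonds$carat, y = diamonds$price, df = 20)
diaSpline5 <- smooth.spline(x = diamonds$carat, y = diamonds$price, df = 50)
diaSpline6 <- smooth.spline(x = diamonds$carat, y = diamonds$price, df = 100)

# Helper: pull the x values, fitted y values and degrees of freedom
# out of a fitted smooth.spline object
get.spline.info <- function(object) {
  data.frame(x = object$x, y = object$y, df = object$df)
}

# Apply the helper to every fit and stack the results into one data frame
splineDF <- ldply(
  list(diaSpline1, diaSpline2, diaSpline3,
       diaSpline4, diaSpline5, diaSpline6),
  get.spline.info
)

head(splineDF)
```

The x component of a smooth.spline fit holds the distinct x values in increasing order, which is why each spline contributes the same number of rows to splineDF.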
Let's go ahead and plot these and see what they look like. We'll do g gets ggplot of diamonds and aes x equals carat, y equals price, plus geom_point. We left out a parenthesis, and we see that mistake right here, where we typed a couple of underscores instead of a couple of parentheses. Okay, now we want to add to this plot all of these splines we just built. So we will do g plus geom_line, and in here we will say the data is the spline data, splineDF, and the aes will be x equals x, y equals y, and color equals factor of round of degrees of freedom. What we're doing is turning degrees of freedom into a nice integer, and then turning it into a factor, so it's nice and discrete. We will also do group as df, plus scale_color_discrete, just to call the legend degrees of freedom. And we put in a backslash n here so that the title breaks over the line. It's an old computer trick to make a piece of text break over, sort of like hitting the enter key.
We run this, and it takes its time computing. And we can see here the different effects of the degrees of freedom. Fewer degrees of freedom means it's a very straight line. More degrees of freedom means it's a more squiggly line. In fact, with 102 degrees of freedom, you can see it in here squiggling all over the place, whereas 10 just has a nice curvature. So a large part of working with splines is figuring out the correct number of degrees of freedom.
And those were smoothing splines. There's another type of spline called a natural cubic spline. That's a spline that is smooth at its interior knots and whose behavior past the endpoints of the data is a straight line. To use these, we need to load up the splines package, so let's require splines. The way natural cubic splines work is that they're not a smoother of one variable onto another. They take a variable and then make new variables based on that one, by doing some sort of transformation. So let's take a look at some of those.
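Before moving on to natural cubic splines, here is a sketch of the smoothing-spline plot described above. To keep the block runnable on its own it rebuilds splineDF in a compact way using lapply, which is a shortcut assumed here and not what the narration does. The object name p is also hypothetical.

```r
library(ggplot2)
library(plyr)

# Helper from the narration: extract x, fitted y and df from a fit
get.spline.info <- function(object) {
  data.frame(x = object$x, y = object$y, df = object$df)
}

# Compact rebuild of the six fits (assumed shortcut):
# one default fit, then df = 2, 10, 20, 50, 100
fits <- c(
  list(smooth.spline(x = diamonds$carat, y = diamonds$price)),
  lapply(c(2, 10, 20, 50, 100), function(d) {
    smooth.spline(x = diamonds$carat, y = diamonds$price, df = d)
  })
)
splineDF <- ldply(fits, get.spline.info)

# Base scatterplot of price against carat
g <- ggplot(diamonds, aes(x = carat, y = price)) + geom_point()

# Overlay one line per spline, colored by rounded degrees of freedom.
# The "\n" in the legend title breaks it across two lines.
p <- g +
  geom_line(data = splineDF,
            aes(x = x, y = y, color = factor(round(df, 0)), group = df)) +
  scale_color_discrete("Degrees of \nFreedom")

print(p)
```

Rounding df before converting it to a factor keeps the legend labels readable, since the fitted degrees of freedom are not exact integers.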
Say head of ns of diamonds$carat, and we'll say the degrees of freedom is one. That means it creates one new column. If we were to do this with two, it creates two new columns, and again with three, three new columns. And these columns can be used as new predictors. You would take these columns and use them in a formula for fitting a model.
To illustrate that simply, we will use a shortcut from ggplot that does it all automatically for us. Say g, which we created earlier, plus stat_smooth. The method will be lm, so we're just fitting a regression, but we're fitting it on these new predictors that are coming from a natural cubic spline. So y on natural spline of x with six degrees of freedom, and we'll make it blue. Running that gives us this nice arching curve, and it even includes a confidence band. That's what it looks like with six degrees of freedom. That means six new columns were built and used in this regression.
Let's see what it looks like with three degrees of freedom. And we'll make this one red, just so it's easy to tell apart. We can see that with this one, with fewer degrees of freedom, it's a much straighter line. So in this case, the more degrees of freedom, the more columns, and the more curvy it can become. Splines are a great way of adding contour to your smoother. They can really capture the curvature of data when necessary.
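The natural cubic spline steps above might look like this in R. It is a sketch reconstructed from the narration. The explicit lm call and the names fit6, g6 and g3 are added here for illustration and are not part of the video.

```r
library(ggplot2)
library(splines)  # provides ns

# Each call builds a basis matrix with as many columns as df
head(ns(diamonds$carat, df = 1))
head(ns(diamonds$carat, df = 2))
head(ns(diamonds$carat, df = 3))

# Illustration (not in the video): use the basis columns directly
# as predictors in a regression formula
fit6 <- lm(price ~ ns(carat, df = 6), data = diamonds)
length(coef(fit6))  # intercept plus six spline columns

# Base scatterplot, as created earlier in the narration
g <- ggplot(diamonds, aes(x = carat, y = price)) + geom_point()

# ggplot shortcut: regress y on a natural cubic spline of x
g6 <- g + stat_smooth(method = "lm", formula = y ~ ns(x, 6), color = "blue")
g3 <- g + stat_smooth(method = "lm", formula = y ~ ns(x, 3), color = "red")

print(g6)
print(g3)
```

In the stat_smooth formula, x and y refer to the aesthetics mapped in aes, not to columns of diamonds, which is why the formula reads y ~ ns(x, 6) instead of price ~ ns(carat, 6).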