23.2 Fit a simple regression model
Video duration: 6m
To look at an example of Bayesian regression, let's use the housing data. We say housing gets read.table of http://www.jaredlander.com/data/housing1.csv. This is different from housing.csv because now all the column names are nice and clean and ready to use. sep equals comma, header equals TRUE, stringsAsFactors equals FALSE. We can then look at it and see it has information about housing in New York City. We have the neighborhood, the borough, the number of units, the year built, the square feet, the value per square foot. All this good information.
First, let's just fit a simple linear model of value per square foot against square feet and see what lm tells us. Say mod one gets lm, value per square foot tilde square feet, data equals housing. We do summary of mod one, then we get this information, and it looks like the effect of square feet is pretty much near zero.
Let's build a Bayesian model to do this using Stan. To do so we need to create a new file that's going to be a Stan file. We will save it as house one dot stan. We say yes, and RStudio will color code it for us. A basic Stan model will have a data section, a parameters section, and a model section.
So let's start off our data section. The first thing we have to declare is the number of rows in the data. This will be an integer whose lower bound is zero. We will call it N. Stan statements, unlike in R, need to end in a semicolon because the program gets translated to C++. We will also pass in a vector of size N for value. That will be our response variable. And we also have a vector of size N called square feet.
We then have parameters we want to estimate. They go in the parameters block. There's a real parameter alpha, that's the intercept. A real parameter beta, that's the effect square feet has on value. Then we're also going to be estimating sigma, the standard deviation. That's a real, but its lower bound is zero because sigma must be positive. And then we have the model block.
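Written out, the data loading and the lm fit described so far might look like the sketch below. The column names `ValuePerSqFt` and `SqFt` are assumptions, since the transcript only gives their spoken form; check them against `names(housing)` before running the model line.

```r
# Read the cleaned housing data from the URL given in the video.
housing <- read.table(
    "http://www.jaredlander.com/data/housing1.csv",
    sep = ",", header = TRUE, stringsAsFactors = FALSE
)

# Look at the columns. The names used in the formula below are assumptions
# and should be verified against this output.
names(housing)
head(housing)

# Ordinary linear model: value per square foot against square feet.
mod1 <- lm(ValuePerSqFt ~ SqFt, data = housing)

# Coefficient table; the video reports a square-feet effect near zero.
summary(mod1)
```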
The model block is where we specify our model. This is a simple linear regression, so we say that value gets modeled. That's not exactly the way Stan works, it's actually updating values, but for now we'll just say that it's being modeled as a normal variable with mean alpha plus beta times square feet and a standard deviation of sigma.
To run this we go back to the R file and load up the rstan package: library rstan. Then we create an object called house one, and we do that by calling the stan function. This takes a Stan file as its main argument, then it takes the data. Now you can't just pass the data as a data frame. You need to pass it as a named list where the names match up with those specified in the Stan file. So that's list, N equals nrow of housing, value equals housing dollar value, square feet equals housing dollar square feet. Close the list and say iter equals 100.
We run this, and what happens is the Stan code gets translated into C++ code, and that's fast. But then the C++ code gets compiled, and that is relatively slow. Now it's going into its simulations, and you could do this in parallel. You could set multiple cores to run simultaneously. We only have two cores in this machine, so it's not necessarily worth it to do it in parallel. But if you have four cores, set cores to four and it happens automatically, and things just go a lot faster.
Now that it's done with its simulations we can take a look at the model. We say house one, and we see here we get the coefficient and a standard error, and we get the R hat statistic, otherwise known as the Gelman-Rubin statistic. Generally, if these numbers are not close to one it means the chains have not converged and you don't have a great model. We can visualize this by saying pairs of house one, pars equals c, and in quotes alpha, beta. And we can see the histograms for the models and they just don't seem great. That's because we're not using all the data.
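As a rough sketch, the Stan program and the R call described above could be put together as follows. Several details are assumptions rather than things the transcript confirms: the file name `house1.stan`, the Stan variable name `sqft`, and the data frame columns `ValuePerSqFt` and `SqFt` (the transcript says only "housing dollar value", so value per square foot is assumed here to match the lm fit). The Stan program is written to disk from R so that the example runs on its own.

```r
library(rstan)

# Housing data from the URL given in the video.
housing <- read.table(
    "http://www.jaredlander.com/data/housing1.csv",
    sep = ",", header = TRUE, stringsAsFactors = FALSE
)

# The Stan program with its three blocks. File and variable names are assumed.
stan_program <- "
data {
  int<lower=0> N;        // number of rows in the data
  vector[N] value;       // response variable
  vector[N] sqft;        // predictor: square feet
}
parameters {
  real alpha;            // intercept
  real beta;             // effect of square feet on value
  real<lower=0> sigma;   // standard deviation, must be positive
}
model {
  // No priors are declared, so Stan uses flat priors over each support.
  value ~ normal(alpha + beta * sqft, sigma);
}
"
writeLines(stan_program, "house1.stan")

# The data go in as a named list; names must match the data block.
# Column names are assumptions: check names(housing) first.
house_data <- list(
    N = nrow(housing),
    value = housing$ValuePerSqFt,
    sqft = housing$SqFt
)

# Translate to C++, compile, and sample with 100 iterations per chain.
house1 <- stan(file = "house1.stan", data = house_data, iter = 100)

# Posterior summaries, including the Rhat column.
house1

# Pairwise plot of the intercept and slope draws.
pairs(house1, pars = c("alpha", "beta"))
```

On a machine with more cores, adding `cores = 4` to the `stan()` call runs the chains in parallel.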
We have this very important variable called borough, representing Manhattan, the Bronx, Brooklyn, Queens and Staten Island, and that will almost certainly have a very strong effect on value. So the next step will be to fit a multi-level model, taking into account the boroughs.