16.10 Estimate uncertainty with the bootstrap
Video duration: 6m
<v Voiceover>The bootstrap,</v> invented by Bradley Efron, is a great tool for estimating uncertainty when there is no analytic solution. It can be used in complex scenarios, even for regressions, but we're going to take a very simple example and calculate a confidence interval for batting averages.

To get to this data, let's load plyr and check out the baseball data. To make things easier, we're just going to look at data since 1990. So we say baseball gets baseball, keeping the rows where year is greater than or equal to 1990. Now when we look at the head, the information is a lot more topical.

So batting average equals total hits divided by total at bats. It's a ratio. That means you can't just divide hits by at bats row by row and then take an average, or a standard deviation, of the results. You need to sum up all the hits and all the at bats, and then divide. That means it's going to be hard to calculate a confidence interval. So to do this, let's build a function that calculates the batting average the way it's supposed to be done, but does it on subsets of the data.

So let's say bat.avg gets function. Now the bootstrap requires a few things: the first argument should be the data, the second argument should be the indices, and any further arguments can be whatever they need to be. So the first one will be data. The second will be indices, and we'll give it a default of 1 through NROW of the data. That's the cool thing about arguments in R: an argument's default can refer to another argument, because the default doesn't get evaluated right away. It only gets evaluated when it's used. So you can pass in data, and then find the NROW of data. Let's also give hits a nice default of "h", and at.bats a default of "ab". So in here, we take the sum of data, just for the rows in indices and for the column that holds hits, whatever column they choose. And let's be safe: na.rm equals TRUE.
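The point about default arguments can be seen in a small sketch. The helper `count.rows` and the data frame `toy` are hypothetical names made up for this illustration; they do not appear in the video. The default for `indices` refers to `data`, and that works because R only evaluates the default when `indices` is first used inside the function.

```r
# Hypothetical helper, for illustration only: the default for 'indices'
# is an expression that refers to another argument, 'data'.
count.rows <- function(data, indices = 1:NROW(data)) {
    # The default is evaluated here, on first use, once 'data' is known
    length(indices)
}

# Tiny made-up data frame (an assumption, not the baseball data)
toy <- data.frame(h = c(10, 20, 30), ab = c(40, 50, 60))

all.rows <- count.rows(toy)                   # default covers every row
two.rows <- count.rows(toy, indices = 1:2)    # explicit value overrides it
print(c(all.rows, two.rows))
```

This is the same pattern that lets bat.avg be called directly with just a data set, while the bootstrap can still supply its own indices.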
Let's divide that by the sum of data, again just for the rows indicated by indices, and for the at.bats column, whatever they may pass through. And again, remove the NAs with na.rm. And there, that will be our batting average.

So let's define this and test it out. We'll just pass in the baseball data set, and it'll use all the indices and pick up the defaults for at.bats and hits. And we get a nice average of .273.

Now we want to come up with some sort of uncertainty, so we are going to use the bootstrap. To do this, first we should load the boot package. Now let's save a variable, avgBoot, and we'll call the boot function: data equals baseball; statistic is the function we want to apply, in this case bat.avg. We're going to do 1200 replicates. As a rough rule, 1200 bootstrap samples have been seen to do a good job. And we're going to pass through indices, so that's i.

The bootstrap works in an amazing way. It takes your data set and resamples from it. So let's say you have 1000 rows in your data set. It'll randomly draw 1000 rows out of it with replacement, sometimes with duplicates, sometimes with triplicates. That means some rows won't be included at all, and other rows will be included multiple times. In fact, on average, about 63% of the original rows are included in any given replicate. It draws, in this case, 1200 of these replicates. That means that you've just generated 1200 new data sets. It goes and calculates the statistic on each of these, and then sees how the statistic is distributed.

So let's run this. And we put a plural, statistics; it should be just statistic. So now it's running, and it did all those replicates. So we can go ahead and print this out. We can see it comes up with the original statistic, an estimate for the bias, and an estimate for the standard error. This standard error is the measure of uncertainty. So what we can do right now is print out a confidence interval. We do boot.ci, put in the boot object.
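The function and the boot call described so far might look roughly like the sketch below. To keep it self-contained, it uses a synthetic data frame `toy` with made-up hits (`h`) and at bats (`ab`) instead of plyr's baseball data, so the numbers will not match the video. The argument `stype = "i"` is an assumption about what "pass through indices, so that's i" refers to; it is also the default in boot.

```r
library(boot)  # recommended package that ships with standard R installs

# Synthetic stand-in for the baseball data (an assumption for this sketch):
# one row per player-season with at bats 'ab' and hits 'h'.
set.seed(1234)
n <- 500
ab <- sample(50:600, size = n, replace = TRUE)
toy <- data.frame(ab = ab, h = rbinom(n, size = ab, prob = 0.27))

# Batting average as a ratio of sums over the selected rows
bat.avg <- function(data, indices = 1:NROW(data), hits = "h", at.bats = "ab") {
    sum(data[indices, hits], na.rm = TRUE) /
        sum(data[indices, at.bats], na.rm = TRUE)
}

# Uses every row and the default column names
bat.avg(toy)

# 1200 replicates; stype = "i" makes boot() hand row indices to bat.avg
avgBoot <- boot(data = toy, statistic = bat.avg, R = 1200, stype = "i")
print(avgBoot)   # original statistic, bias, and standard error

# Share of distinct original rows in one resample; close to 1 - exp(-1)
share <- length(unique(sample(n, size = n, replace = TRUE))) / n
print(share)
```

With the real data, `toy` would be replaced by the filtered baseball data frame, and the rest would stay the same.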
We want a 95% confidence level (that should be an equals sign), and we want a normal confidence interval. We run that, and we see that the average batting average should be anywhere from .271 to .274. The way it figures out this confidence interval is that it builds a distribution of all the averages it calculated. It has a vector of 1200 averages, and it puts them in order, and pulls out the necessary quantiles.

We can see this in action by plotting it. So first, let's clear the console and require ggplot2. Then let's build this up bit by bit. We'll instantiate an empty ggplot. We then say geom_histogram, and here we say aes, x should be avgBoot$t. t is all the replicates: those are the 1200 values for the average. We fill it with a nice grey, and we make the outline color a nice grey too. We are going to put in two vertical lines. They sit at the original statistic, which is stored in avgBoot$t0, plus or minus two standard errors. So that's c(-1, 1) times 2, times the square root of the variance of avgBoot$t. And we use a line type of 2.

We can then plot this, and find that we have an extra comma floating around, because instead of typing var and an open parenthesis, we used the dollar sign by accident. So we run this line again, and we get a nice histogram. These bars represent all of the different averages computed from the resamples. And right here, these vertical lines represent the confidence interval.

That is how the bootstrap works. It draws many replicates and forms them into a distribution. This is an incredibly powerful tool when you have formulas that you just can't get a confidence interval for. This can happen for all sorts of reasons, and the bootstrap has come in and been a really great brute-force tool for solving these intractable problems.
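The interval and the plot can be sketched as follows, again on a synthetic data frame `toy` (an assumption, so the numbers differ from the video). One detail worth checking for yourself: in boot.ci, `type = "norm"` builds the interval from the bootstrap bias and standard error, while the "sort the replicates and pull out quantiles" description corresponds to the percentile interval, `type = "perc"`. The sketch computes both. The plot is guarded so the rest still runs if ggplot2 is not installed, and it passes a data frame to ggplot rather than starting from an empty ggplot as in the video.

```r
library(boot)

# Synthetic stand-in for the baseball data (an assumption for this sketch)
set.seed(1234)
n <- 500
ab <- sample(50:600, size = n, replace = TRUE)
toy <- data.frame(ab = ab, h = rbinom(n, size = ab, prob = 0.27))

bat.avg <- function(data, indices = 1:NROW(data), hits = "h", at.bats = "ab") {
    sum(data[indices, hits], na.rm = TRUE) /
        sum(data[indices, at.bats], na.rm = TRUE)
}
avgBoot <- boot(data = toy, statistic = bat.avg, R = 1200, stype = "i")

# Normal-approximation interval, the type requested in the video
normCI <- boot.ci(avgBoot, conf = 0.95, type = "norm")
print(normCI$normal)    # confidence level, lower limit, upper limit

# Percentile interval: quantiles of the sorted replicates
percCI <- boot.ci(avgBoot, conf = 0.95, type = "perc")
print(percCI$percent)   # last two entries are the lower and upper limits

# Original statistic plus or minus two bootstrap standard errors;
# as.vector() drops the matrix shape of avgBoot$t before taking var()
bounds <- avgBoot$t0 + c(-1, 1) * 2 * sqrt(var(as.vector(avgBoot$t)))

# Histogram of the replicates with dashed lines at the two bounds
if (requireNamespace("ggplot2", quietly = TRUE)) {
    p <- ggplot2::ggplot(data.frame(avg = as.vector(avgBoot$t)),
                         ggplot2::aes(x = avg)) +
        ggplot2::geom_histogram(fill = "grey", color = "grey", bins = 30) +
        ggplot2::geom_vline(xintercept = bounds, linetype = 2)
    print(p)
}
```

For a statistic like this one, whose bootstrap distribution is close to symmetric, the two interval types should land near each other and near the dashed lines.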