15.1 Draw numbers from probability distributions
All statistics relies on probability distributions. They can be a bit daunting at first, but they are crucial. The most famous is the bell curve, otherwise known as the normal distribution or the Gaussian distribution. R offers all sorts of ways to work with these distributions, the first of which is drawing random numbers. Let's look at pulling 10 random numbers from the normal distribution. We start with rnorm(10). There you have it: 10 numbers randomly drawn from the normal distribution. By default this draws from the standard normal, which has a mean of 0 and a standard deviation of 1. Let's pull from a different normal distribution: we call rnorm, again asking for 10 draws, but this time with mean = 100 and sd = 20. Different numbers were drawn; not a big deal, but very handy to have.

Next is the density, which measures how likely any particular value is. Technically, for continuous variables, the probability of hitting any exact number is zero, but the density gives us an approximation. First, let's draw 10 random numbers and store them as randNorm10, drawing from the standard N(0, 1) distribution. We look at those, then find the density of each with dnorm; that's what the d stands for. We see that for -0.0177 the density is roughly 0.398, about 40%. We can see this with some simpler numbers by calling dnorm on the vector c(-1, 0, 1): for -1 the density is roughly 0.242, and at exactly 0 it is roughly 0.3989.

To visualize this, we can generate, say, 30,000 random numbers, plot them along with their densities, and see what kind of curve comes back. Let's clear out the console and load ggplot2, our favorite plotting tool. We pull 30,000 random numbers; using a lot of draws gives a nice smooth distribution. Then we compute the densities: randDensity gets dnorm of randNorm. Now we call ggplot and build a data frame in one step, with x equal to randNorm and y equal to randDensity, so the data frame is built on the fly. This time we put aes on the outside. aes can go inside ggplot or inside one of the geoms, but here we add it on its own, which is just another capability of ggplot: aes(x = x, y = y). Then we add a geom. I'm going to break this onto the next line, which works because the plus sign lets the expression continue. We add geom_point and labels for the x and y axes. We run these two lines, and after it computes we get a nice bell curve: the random draws along the x axis and the density of each draw along the y axis, showing that we do indeed have this beautiful bell curve.
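In code, the calls narrated above look roughly like this; randNorm10 is the variable name used in the narration:

```r
# Draw 10 numbers from the standard normal (mean 0, sd 1)
rnorm(n = 10)

# Draw 10 numbers from a normal with mean 100 and standard deviation 20
rnorm(n = 10, mean = 100, sd = 20)

# Store 10 standard-normal draws, then evaluate the density at each draw
randNorm10 <- rnorm(10)
randNorm10
dnorm(randNorm10)

# Densities at simple points: dnorm(0) is roughly 0.3989
dnorm(c(-1, 0, 1))
```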
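And a sketch of the density plot; the exact axis-label text is an assumption, since the narration only says labels were added:

```r
library(ggplot2)

# 30,000 draws give a nice smooth-looking curve
randNorm <- rnorm(30000)
randDensity <- dnorm(randNorm)

# Build the data frame on the fly, with aes supplied on its own
ggplot(data.frame(x = randNorm, y = randDensity)) + aes(x = x, y = y) +
  geom_point() + labs(x = "Random Normal Variables", y = "Density")
```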
The distribution function of a number is the cumulative probability of drawing that number or smaller. We put in pnorm of randNorm10 and see the probability of getting each of those numbers or less. That's very important. We can see this again with a simpler vector: pnorm(c(-3, 0, 3)). Here we see there is very little probability of getting a number less than -3, and a 50% probability of getting a number less than 0, because the normal distribution is symmetric: the mean and median sit right at 0, so of course half of the distribution lies to the left. And 3, which is three standard deviations out on the standard normal, has about a 99.87% chance that a number will be smaller. This gives you the left-tailed probability, everything to the left of a point. What if you want the probability between two numbers, say between 0 and 1? You can compute pnorm(1) minus pnorm(0), which gives the probability of a number falling in that interval.

qnorm is the quantile function: given a cumulative probability, it tells you the original number. We can look at this by jumping through a few hoops. First, print randNorm10 to remind us what our data was. Then compute the cumulative probabilities with pnorm(randNorm10). If we take the quantiles of those probabilities, we should get back the same numbers we started with. Let's find out: qnorm(pnorm(randNorm10)). It does indeed get us back to the original. We can even test whether they are identical using the identical function, and we get FALSE. That is due to floating-point precision and rounding errors; a visual comparison shows they are the same.

Another popular distribution is the binomial distribution. That is a series of trials where the result can be either yes or no, true or false, live or die, safe or not safe: some binary result. Each trial has a certain probability of success. The way you draw random numbers from it is rbinom, random binomial. If we say n = 1 and size = 10, we are doing one draw from an experiment that had 10 trials, and we will say prob = 0.4. We can run this repeatedly and get different numbers each time. It is saying: we conducted 10 trials, each with a 0.4 probability of success; how many successes were there? One draw might be two, another five, another one. Instead of just one draw, let's do a few: rbinom with n = 5, size = 10 because we're still doing a 10-trial setup, and prob still 0.4. Here we see that in one experiment there were four successes, in another two, another four, another three, another six. We copy and paste and do this again for 10 experiments: in one there were six successes, in another two, another five, and so forth.

When the size is one, this reduces to a Bernoulli random variable, which is just one trial, yes or no. We call rbinom with n = 1, size = 1, and prob still 0.4. This time it came up 0. We do it again: 0. Again: 1. We can copy and paste and draw five of these; they are all 0s or 1s. If we do it again with 10 draws, that will be ten 0s or 1s, and we expect roughly four 1s, because the probability was 0.4; even though we are dealing with small samples here, the law of large numbers suggests you should expect to see roughly four successes out of ten.

We can visualize what a binomial distribution looks like. We'll say binomData gets a data frame whose Successes column is rbinom with 10,000 draws, an experiment of size 10, and probability 0.3.
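A sketch of those cumulative-probability and quantile calls; the final all.equal line is an extra suggestion, not part of the narration, for comparing within a numerical tolerance:

```r
# Cumulative probability of each draw or anything smaller
pnorm(randNorm10)
pnorm(c(-3, 0, 3))

# Probability of a draw falling between 0 and 1
pnorm(1) - pnorm(0)

# qnorm inverts pnorm, recovering the original draws
qnorm(pnorm(randNorm10))

# FALSE because of floating-point rounding; all.equal allows a tolerance
identical(randNorm10, qnorm(pnorm(randNorm10)))
all.equal(randNorm10, qnorm(pnorm(randNorm10)))
```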
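And the binomial draws, down through the Bernoulli case and the 10,000-draw data frame used for plotting; the column name Successes follows the narration:

```r
# One draw of a 10-trial experiment, success probability 0.4
rbinom(n = 1, size = 10, prob = 0.4)

# Five experiments, then ten
rbinom(n = 5, size = 10, prob = 0.4)
rbinom(n = 10, size = 10, prob = 0.4)

# size = 1 reduces to a Bernoulli trial: each draw is 0 or 1
rbinom(n = 1, size = 1, prob = 0.4)
rbinom(n = 5, size = 1, prob = 0.4)
rbinom(n = 10, size = 1, prob = 0.4)

# 10,000 draws of a size-10 experiment with success probability 0.3
binomData <- data.frame(Successes = rbinom(n = 10000, size = 10, prob = 0.3))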
Now we'll plot the histogram of this and see what it looks like. We say ggplot of binomData, aes with x equal to Successes, plus geom_histogram with binwidth = 1. The first attempt returns an error because the column name in aes didn't match the name in the data frame; once the spelling matches, it runs, and we get a nice histogram with the peak at roughly three. That means that over 10,000 repetitions of this experiment, the most common number of successes was three, which matches the theory: with a success probability of 0.3 and 10 trials, you expect on average about three successes.

The binomial distribution is nicely approximated by the normal distribution as the size grows larger. To see this, we will run a series of experiments, first with 5 trials, then 10, then larger and larger. Let's build this up: binom5 gets a data frame with Successes equal to rbinom, 10,000 draws of an experiment of size 5, again with probability 0.3. We also record Size = 5 in the data frame so we know what we are looking at. We run this and look at the head of binom5: we see the number of successes and the size of the experiment. We copy and paste this a few more times, changing 5 trials to 10 and updating the names, then again for 100 trials and 1,000 trials, making sure each gets the appropriate name. We run these and combine them all into one data frame using rbind. (We forgot to run the first line at one point; that's another good lesson: if you don't run it, it doesn't exist.) Now we can look at the head and the tail of the combined data to see what it looks like.

Now we can make a plot showing the histogram for each of these sizes: ggplot of binomAll, aes with x equal to Successes, plus geom_histogram, plus facet_wrap broken up by Size, and we let the scales be free so each panel is easier to see. When there are only 5 trials, the histogram looks very discrete. At 10 trials it is still discrete. At 100 it is getting better, and by 1,000 trials it looks much more like the normal distribution. This normal approximation was very important in the history of statistics because it made calculations much easier.

Much like with the normal distribution, you can get the density of a binomial random variable with dbinom: say x = 3, size = 10, prob = 0.3. This asks, "What is the probability of exactly three successes out of 10 when the probability of any success is 0.3?" We can find the probability of three or fewer successes out of 10 with pbinom, saying q = 3, size = 10, and prob = 0.3. We can also compute the quantile, which, given a cumulative probability, returns the number of successes: qbinom with p = 0.3, size = 10, prob = 0.3. Given a cumulative probability of 0.3, for an experiment with 10 trials each having a 0.3 probability of success, what is the number of successes? Here it says two. Lastly, we can feed qbinom a vector of probabilities, as in the sketch below.
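The histogram call from the start of this passage, roughly as narrated:

```r
# Histogram of successes; a binwidth of 1 matches the discrete counts
ggplot(binomData, aes(x = Successes)) + geom_histogram(binwidth = 1)
```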
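And a sketch of the size comparison and the density, distribution, and quantile functions, including the vector form of qbinom just mentioned; the results in comments are approximate:

```r
# One data frame per experiment size, each tagged with its Size
binom5    <- data.frame(Successes = rbinom(n = 10000, size = 5,    prob = 0.3), Size = 5)
binom10   <- data.frame(Successes = rbinom(n = 10000, size = 10,   prob = 0.3), Size = 10)
binom100  <- data.frame(Successes = rbinom(n = 10000, size = 100,  prob = 0.3), Size = 100)
binom1000 <- data.frame(Successes = rbinom(n = 10000, size = 1000, prob = 0.3), Size = 1000)
binomAll  <- rbind(binom5, binom10, binom100, binom1000)

# One histogram panel per size; free scales since the ranges differ widely
ggplot(binomAll, aes(x = Successes)) + geom_histogram() +
  facet_wrap(~ Size, scales = "free")

# Density, distribution, and quantile for the binomial
dbinom(x = 3, size = 10, prob = 0.3)   # P(exactly 3 successes), about 0.267
pbinom(q = 3, size = 10, prob = 0.3)   # P(3 or fewer successes), about 0.650
qbinom(p = 0.3, size = 10, prob = 0.3) # smallest count with cumulative prob >= 0.3: 2

# qbinom also accepts a vector of probabilities
qbinom(p = c(0.3, 0.35, 0.4, 0.5, 0.6), size = 10, prob = 0.3)
```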
We say qbinom with the vector c(0.3, 0.35, 0.4, 0.5, 0.6), size = 10, and prob = 0.3, and we get 2, 2, 3, 3, and 3: the quantiles for those cumulative probabilities.

A third very important distribution is the Poisson distribution. This is used for counting processes: the number of accidents, the number of home runs, the number of successes. The Poisson distribution is very, very useful, and it has all the usual functions: rpois, dpois, ppois, and qpois. Like the binomial, the Poisson distribution is very well approximated by the normal distribution. To see this in action, let's do a number of draws from the Poisson, each set pulling from a distribution with a different mean. We say pois1 gets rpois with 10,000 draws and lambda, the mean of this distribution, equal to 1. Then we repeat this for lambdas of 2, 5, 10, and 20, each still with 10,000 draws. Then we combine them all into one data frame, naming the columns Lambda.1, Lambda.2, and so forth, through Lambda.10 and, lastly, Lambda.20, breaking onto another line so it is easier to read. We run these two lines and inspect the new data frame: each column is another vector of draws.

To plot this with ggplot, though, we really want the data in long format, not wide format; life is easier that way. To do this, we load the reshape2 package and melt the data down, overwriting it: pois gets melt with data = pois, variable.name = "Lambda", and value.name = "x". We didn't specify an id variable, so melt just automatically melts everything down. Looking at the head of pois, we have a Lambda column telling us which lambda we're using and an x column with the draws; the tail shows draws coming from the distribution with a lambda of 20.

To make plotting a little easier, and to learn some more skills, we'll do a bit of text extraction to strip the "Lambda." prefix from those values. We load the stringr package and say pois$Lambda gets str_extract, which pulls a pattern out of a string: the string is pois$Lambda and the pattern is any digit, even if there are multiple digits. We run this and look: now we have just the numeric part, telling us what the mean was. Let's quickly check the class with class(pois$Lambda): it's a character. We want this converted to a factor, because that will make it easier to work with in ggplot. To do this we take a slightly odd two-step jump: pois$Lambda gets as.factor of as.numeric of pois$Lambda. We need the intermediate as.numeric because factor levels built directly from character strings sort lexically, putting "10" before "2". With that done, we look at the tail of pois again, see the lambda of 20 and its draws, and we're ready to build a nice plot.
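In code, the Poisson draws and the melt step look roughly like this:

```r
# 10,000 draws each from Poisson distributions with increasing means
pois1  <- rpois(n = 10000, lambda = 1)
pois2  <- rpois(n = 10000, lambda = 2)
pois5  <- rpois(n = 10000, lambda = 5)
pois10 <- rpois(n = 10000, lambda = 10)
pois20 <- rpois(n = 10000, lambda = 20)

pois <- data.frame(Lambda.1 = pois1, Lambda.2 = pois2, Lambda.5 = pois5,
                   Lambda.10 = pois10, Lambda.20 = pois20)

# Melt from wide to long; with no id variables, every column is melted
library(reshape2)
pois <- melt(data = pois, variable.name = "Lambda", value.name = "x")
head(pois)
tail(pois)
```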
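And the text extraction and factor conversion; as in the narration, the factor column is handed to str_extract directly, which coerces it to character:

```r
library(stringr)

# Pull the digits (one or more) out of values like "Lambda.10"
pois$Lambda <- str_extract(string = pois$Lambda, pattern = "\\d+")
class(pois$Lambda)  # "character"

# Convert to numeric first so the factor levels sort 1, 2, 5, 10, 20
# rather than lexically ("1", "10", "2", "20", "5")
pois$Lambda <- as.factor(as.numeric(pois$Lambda))
tail(pois)
```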
For the plot itself, we make a nice overlaid density plot. We say ggplot of pois, with aes just x = x, plus geom_density, and inside the geom we set aes to group = Lambda, so each level of Lambda gets plotted by itself. We also map the color of the plot to Lambda and the fill to Lambda: when we draw a density plot, the inside area is the fill and the line itself is the color, so we need to specify both. To smooth the density we set adjust = 4, which widens the bandwidth of the density estimate, and alpha = 1/2, which allows for some transparency. An alpha of one half means that at every position on the plot it takes two overlapping layers of color to appear fully solid; a single layer is half transparent.

After the density geom, we set some scales. We add scale_color_discrete to let ggplot know we are dealing with discrete colors, because remember, Lambda is a factor, and scale_fill_discrete for the same reason. We give it a nice title with ggtitle, yet another way to make a title: "Probability Mass Function". The first run fails because we added an extra closing parenthesis that was not needed; extra parentheses, or missing ones, can really get in your way, so careful coding is important. We run it again and this time get a beautiful plot, which I will zoom in on, showing the different distributions: a Poisson with lambda of 1, then lambda of 2, then 5, 10, and 20. As lambda gets closer to 20 and higher, the shape resembles a bell curve very nicely. That is why the normal distribution is such a good approximation for the Poisson distribution.

These probability distributions, and many, many others, are the backbone of statistics. There are many functions in R for working with them, and they all follow the same general pattern. To draw random numbers, the prefix is r: rnorm, rbinom, rpois, rcauchy. For densities it's d, and for distribution functions it's p: dnorm, dpois, pnorm, ppois. For quantiles it's q. That is the general format for all the distributions, whether normal, Cauchy, Weibull, or Poisson. These distributions allow us to do all the modern statistics that we love.
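Putting the whole plot together, roughly as narrated:

```r
# Overlaid densities, one per lambda; color outlines each curve, fill shades it
ggplot(pois, aes(x = x)) +
  geom_density(aes(group = Lambda, color = Lambda, fill = Lambda),
               adjust = 4, alpha = 1/2) +
  scale_color_discrete() + scale_fill_discrete() +
  ggtitle("Probability Mass Function")
```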