7.5 Do group operations with the plyr Package
Video transcript
While R has a number of functions built in for working with data, the interfaces can be a little daunting, even for experienced users. Fortunately, Hadley Wickham wrote the plyr package, which really simplifies the data munging process. So let's first load up plyr. Now that we have it loaded, we're going to take a look at a dataset that comes with it: it's all about baseball. It has a lot of information on individual players throughout most of major league baseball history. The id column is the individual player, the year is the year it has information for, and the rest of the columns are different statistics: games, number of at bats, runs, hits, doubles, triples, home runs, runs batted in, stolen bases, walks, and so on. In baseball a very popular statistic is on base percentage. That is the number of hits, plus the number of walks, plus the number of times the batter was hit by a pitch, all divided by the number of at bats plus walks plus hit-by-pitches plus sacrifice flies. It's a bit of a complicated formula, but baseball fanatics love it. Now, before we can apply this formula, we have to take care of some housekeeping for the data, because before 1954 major league baseball didn't track sacrifice flies; there's just no data for them. You can actually see in the dataset that the sacrifice fly column (sf) is all NA for 1871 and, in fact, for every year before 1954. Baseball convention says you should convert these sacrifice flies to zeroes, and that's acceptable in the baseball world. So to do this we find just the values of sf before 1954 and convert them to zero. We say baseball$sf, because that's the column we're editing, subset it to the rows where baseball$year is less than 1954, and set that equal to zero.
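The cleanup just described can be sketched in R; this assumes the plyr package is installed (the baseball data ships with it, with sacrifice flies in the sf column):

```r
# Load plyr and the baseball data set that comes with it.
library(plyr)
data(baseball)

# Sacrifice flies (sf) were not tracked before 1954, so those rows are NA.
# Baseball convention says to treat them as zero.
baseball$sf[baseball$year < 1954] <- 0

# Check the fix: per the video, this now comes back FALSE.
any(is.na(baseball$sf))
```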
To make sure this worked, we do an is.na check on the column and see whether any values are NA. We see FALSE, so none of them are. Another hiccup in the data is that hit by pitch (hbp) is sometimes NA. It has nothing to do with a specific year, it just shows up, and again this is something baseball statisticians allow us to fix. So we say baseball$hbp, take the entries where is.na(baseball$hbp) is TRUE, and set those, and only those, to zero. Again we check this with any(is.na(baseball$hbp)), and we see it worked. One more thing, to keep this fair: we only want to look at players who had at least 50 at bats in a season, otherwise the data isn't really telling the real story. So we say baseball gets a subset of baseball, keeping the rows where the number of at bats (ab) is greater than or equal to 50, and leaving the column argument blank. We now have a nice clean dataset to work with, so we can calculate on base percentage for each year of every player's career. To do this we create a new column, baseball$obp. Here we learn about a new function, with, because writing out this formula while repeatedly typing baseball$h plus baseball$bb gets laborious. The first argument to with is the data you're going to be working inside of, in this case baseball. The next argument is the formula, so now we can say h + bb + hbp without qualifying each column with baseball$, which makes typing a lot easier. That sum gets divided by ab + bb + hbp + sf, and we close off the parentheses for the division and for the with statement. Now, I used with; some people, when they first learn R, get taught to attach the data frame, use each column as its own individual vector, and then detach the data frame to get everything back to normal.
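Those three steps, plus the with() trick, might look like this (a sketch assuming the plyr baseball data; the sf fix from before is repeated so the snippet stands alone):

```r
library(plyr)
data(baseball)
baseball$sf[baseball$year < 1954] <- 0

# Hit-by-pitch (hbp) is sometimes NA regardless of year; set those to zero.
baseball$hbp[is.na(baseball$hbp)] <- 0
any(is.na(baseball$hbp))  # per the video, FALSE

# Keep only player-seasons with at least 50 at bats.
baseball <- baseball[baseball$ab >= 50, ]

# Per-season on-base percentage; with() lets us refer to the columns
# without repeating baseball$ each time.
baseball$obp <- with(baseball, (h + bb + hbp) / (ab + bb + hbp + sf))
tail(baseball)
```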
I'm not even going to go over that, because it's such a bad practice: attach and detach cause so many problems that they really shouldn't exist. If anyone suggests you attach something, just ignore it; using with or within is much safer and much cleaner. Now that we've run this, let's check out what baseball looks like. We'll look at the tail of baseball, because it's probably a little more interesting, and we see the on base percentage for each player for each year they were playing. But to judge a player over his career, we really want his career on base percentage. You might be tempted to just average the on base percentage across each player's career, but that isn't as accurate as we want to be. What we want to do is break the dataset up into a separate, discrete dataset for each player. So, for instance, Derek Jeter will have a separate dataset for all the years he played, Barry Bonds will have one for all of his years, and so forth. Before we bother with that, though, let's write a function that calculates on base percentage so we don't have to re-type the formula again and again. The function will be called obp, and it will take one argument, data. This might not be the best way to write a function, since we're just going to hard-code the variables we want to use, but on base percentage isn't going to change any time soon. We'll return the result as a named vector, and we'll see in a little bit how having that name makes our results look really nice. So we say obp gets, and we use that with trick again, this time specifying data because that's the function's argument. And we take the sum of h + bb + hbp divided by the sum of ab + bb + hbp + sf. It's very important that we take the sum of the numerator divided by the sum of the denominator. Say we have n years of data for Derek Jeter.
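The function being described can be sketched as follows; the example call at the end uses made-up, hypothetical numbers just to show the shape of the result:

```r
# Career OBP is a weighted average: sum the numerator and the denominator
# across all of a player's seasons before dividing.
obp <- function(data) {
  # Note: use '=' to name the element; c(OBP <- ...) would assign a
  # variable instead and drop the name, a pitfall the video returns to.
  c(OBP = with(data, sum(h + bb + hbp) / sum(ab + bb + hbp + sf)))
}

# A tiny made-up season: (30 + 5 + 1) / (100 + 5 + 1 + 2) = 36/108 = 1/3.
obp(data.frame(h = 30, bb = 5, hbp = 1, ab = 100, sf = 2))
```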
You want to sum up all of his hits, walks, and hit-by-pitches across his whole career and divide that by the sum of all the at bats, walks, hit-by-pitches, and sacrifice flies. That is how you compute a weighted average; you can't just calculate the ratio for each year individually and then average those, because mathematically that isn't valid. So we run this function, it's now in memory, and we can use it. Let's make a new dataset, careerOBP, which we'll calculate using the ddply function. plyr has a number of functions that take data in one data structure, whether a data frame, a list, or an array, split it up according to some variable, operate on each piece independently, and then recombine the pieces into some data structure. This is called the split-apply-combine paradigm: you split the data up, you apply some transformation, and you combine the results together again. In the plyr family, all the functions have the same name format: the first letter is the type of data structure the data comes from, the second letter is the type of data structure it goes to, and "ply" is the ending for all of them. It's called plyr because you're supposed to think of a pair of pliers. For ddply, the first argument is the dataset we're working with, baseball. The second argument is the variable or variables to split the data by; in this case it's just the player id, and note that this variable needs to be in quotes, which is very important. The next argument is the function we're applying, in this case obp. Running this gets our results, so let's check out what they look like. We do head(careerOBP) and see the on base percentage for each of these players. This list would be a little more interesting in a good order.
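The ddply call described above might look like this; the data-prep lines from earlier are repeated so the snippet runs on its own:

```r
library(plyr)
data(baseball)
baseball$sf[baseball$year < 1954] <- 0
baseball$hbp[is.na(baseball$hbp)] <- 0
baseball <- baseball[baseball$ab >= 50, ]
obp <- function(data) c(OBP = with(data, sum(h + bb + hbp) / sum(ab + bb + hbp + sf)))

# Split baseball by player id (note the quotes), apply obp to each piece,
# and recombine into a data frame: d in, d out, hence ddply.
careerOBP <- ddply(baseball, "id", obp)
head(careerOBP)
```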
So let's go ahead and reorder it. We say careerOBP gets the same exact data frame, but with its rows shuffled: we say order(careerOBP$OBP, decreasing=TRUE). What order does is take the OBP vector, sort it from biggest to smallest, and return the indexes of that ordering. We feed those indexes into the row index of the data frame, putting it in the order we want, and we leave the columns as they are. Now, we got an error here, because we used the wrong column name in the careerOBP dataset. If we look at the data frame that came back, the first column is called id and the second column is called V1. If you recall, up in the obp function we said we would use OBP as the name of what we return, so it would look nice in the data frame. The problem is that we used the wrong assignment operator. The arrow (<-) is the operator for assigning a variable; when naming an element inside a vector or a data frame, you're supposed to use the equals sign (=). Because of that, we got a generic variable name. This is easy to remedy: we go back to the function, rewrite it with a single equals sign, run the function, call ddply again, and look at the data frame. Now we have the variable name we want. It's very important to understand the subtleties of assignment operators in R. So let's order the data again, this time using the correct column name. We run careerOBP, and if we look at the head we see the players with the highest career on base percentages. Notice who's at the top: Ted Williams is number one, followed closely by Babe Ruth. Do notice that Billy Hamilton and Bill Joyce, while career leaders, are not in this result, because they are mysteriously missing from the dataset. But we do see luminaries like Lou Gehrig and Rogers Hornsby.
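The corrected ordering step can be sketched end-to-end like this (the pipeline is repeated so the snippet stands alone; per the video, Ted Williams should come out on top):

```r
library(plyr)
data(baseball)
baseball$sf[baseball$year < 1954] <- 0
baseball$hbp[is.na(baseball$hbp)] <- 0
baseball <- baseball[baseball$ab >= 50, ]
obp <- function(data) c(OBP = with(data, sum(h + bb + hbp) / sum(ab + bb + hbp + sf)))
careerOBP <- ddply(baseball, "id", obp)

# order() returns the row indexes sorted by OBP, biggest first; feeding
# them into the row index reorders the data frame.
careerOBP <- careerOBP[order(careerOBP$OBP, decreasing = TRUE), ]
head(careerOBP, 10)
```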
It's a quick way to see how your favorite players are doing. This is just one example, and, believe it or not, one of the simpler examples of where ddply is so useful. Remember, ddply is the most popular function in plyr because it goes from a data frame to a data frame, the most commonly used data structure in R. The second most popular function in plyr is llply, which goes from a list to a list. To try it, we first need to create a list. Let's clear the console to make some room; we'll call it theList, and give it a list where A is a matrix of one through nine with three rows, B is the vector one through five, C is a matrix of one through four with two rows, and D is the number two. Looking at this, we have a nice four-element list: two matrices, a vector, and a number. We could use the base R function lapply, saying lapply(theList, sum), and that works perfectly well. But since we want to use plyr, we'll use llply: we say llply(theList, sum), and the results come back the same. In fact, we can be certain of that by checking identical(lapply(theList, sum), llply(theList, sum)), and we see they are indeed identical. So you might be asking, "Why would we want llply when we could just use the built-in lapply?" There are subtle differences between them, usually to do with the names of the list or the way things are treated, and you will find times in your working experience where one is preferable to the other, and it won't always be the same one. Now, even though this is returning a list, these results are numbers that could easily come back as a vector. Where in base R we would have used sapply to get a vector back, in plyr we use laply. It takes a list and returns an array, and when plyr talks about arrays that can mean vectors and matrices as well. So we say laply, that's not lapply, notice there's only one p, with theList and sum.
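The list example just described can be sketched as:

```r
library(plyr)

# A four-element list: two matrices, a vector, and a single number.
theList <- list(A = matrix(1:9, 3), B = 1:5, C = matrix(1:4, 2), D = 2)

lapply(theList, sum)  # base R: list in, list out
llply(theList, sum)   # plyr equivalent, same results
identical(lapply(theList, sum), llply(theList, sum))  # TRUE

# One p, not two: list in, array (here a vector) out.
laply(theList, sum)   # 45 15 10 2, with no names
sapply(theList, sum)  # base R analogue; unlike laply, it keeps the names
```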
And it gives us back the results. Notice that laply didn't give back vector names, whereas sapply did. That's an example of where the functions have slightly different effects: sometimes you might need those names, sometimes you might purposely not want them, so take that into consideration as you use these functions. plyr also comes with a number of other useful helper functions. For instance, with aggregate we're pretty much limited to applying one function to the data; we can apply it to multiple columns, but we're stuck with one function. plyr provides the each function, which helps us out with that. Let's clear the console and take a look. Say we want to go back to the diamonds data; we'll look at it again to remind ourselves. Suppose we want to find both the mean and the median of price for each level of cut. We can use aggregate, saying price on cut, with the diamonds data, and for the function we use each(mean, median), which creates a new function that returns two results. Looking at the output, we get the mean price and the median price for each level of cut. This is a very, very useful function. Another handy helper in plyr is colwise, which lets you apply a function to each column of a data frame. There are numerous ways to do this, but colwise is a handy little shortcut. Let's look at the diamonds data and sum up every numeric column. The easiest way is to use numcolwise, a similar function that operates only on the numeric columns. You give it the function you want to apply to each column, plus any other arguments you might need, and then, a bit unusually, numcolwise returns a new function; to that new function you pass the data you want to use. So we put in diamonds, and it took just the numeric columns and summed them up.
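Those two helpers might be used like this; this assumes the diamonds data comes from the ggplot2 package, which the course used earlier:

```r
library(plyr)
library(ggplot2)  # for the diamonds data

# each() bundles several functions into one, so aggregate can return
# both the mean and the median of price per level of cut.
aggregate(price ~ cut, data = diamonds, FUN = each(mean, median))

# numcolwise(sum) builds a new function that sums every numeric column;
# we then call that function on the data.
numcolwise(sum)(diamonds)
```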
Doing this with base R functions takes quite a bit more work. You could use sapply on the diamonds data, but it only makes sense on the numeric columns, so you need to subset the data, not by rows but by columns, keeping only the columns that are numeric. To check, we use sapply(diamonds, is.numeric), which asks of each column of diamonds: is it numeric or not? If TRUE, the column is included in our subset; if not, it isn't. Then to that subset we apply the function sum. While this gets the same results, with different levels of rounding, it can be a bit clunky to write and then figure out, so numcolwise is a very handy helper function. Now, in the past there have been complaints that plyr, and Hadley's work in general, runs slowly. He is addressing that right now with dplyr, available on GitHub, which is a much faster version of plyr. Even with the speed issues, which over the years have significantly diminished, plyr is such a helpful package because it makes working with your data so much easier than the base functions. It's a common interface: you want to go from a data frame to a list, that's dlply; you want to go from a list to a data frame, that's ldply. It's very simple; some base R functions such as tapply can give you back results that are hard to work with, while Hadley's functions are guaranteed to give you a data frame, a list, or an array. Very simple, very handy, and a great tool to know.
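The base R version described at the start of that passage can be sketched as a one-liner (again assuming diamonds from ggplot2):

```r
library(ggplot2)  # for the diamonds data

# sapply(diamonds, is.numeric) gives a logical per column; that subsets
# diamonds down to its numeric columns, and the outer sapply sums each.
sapply(diamonds[, sapply(diamonds, is.numeric)], sum)
```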