7.4 Perform group summaries with the aggregate function
7: Data Munging
7.4 Perform group summaries with the aggregate function - Video Tutorials & Practice Problems
Video duration:
5m
Play a video:
<v Voiceover>Applying a certain function</v> to particular subsets of a data is a very common task. People familiar with SQL call this aggregation. And in R, this is done with the aptly named aggregate function. To illustrate this, let's pull up the diamonds dataset cause it's a good thing to use for an example. Let's do require(ggplot2) to get the data and we just type in data(diamonds) to load it up. Sometimes explicitly loading the data is necessary, other times just loading the package gets the data into memory. It really depends on how the author of the package wrote it because if it's a big dataset, in diamonds it's around 50,000 rows, it's a little bigger, you don't want that in memory unless you specifically ask for it. So let's run this and now let's take a look at diamonds to see what it's like. We see multiple columns, lots of things going on here. So let's say we want to find the mean price of a diamond. That's simple enough by typing in mean and then (diamonds$price). We get the average price is $3,900, now bear in mind, this dataset is a few years old so if it doesn't reflect reality, that's because it's dated. But now you know that just gives us an overall picture that could be very misleading. What if we want to see the average price of each diamond for each type of cut? That is I want the average price for an ideal diamond, the average for premium diamond, the average for a good diamond. Well, we can use the aggregate function. So here we do aggregate and the first argument is a formula. Now the way the formula works is the left hand side is what you're computing on, and the right hand side, in this situation, is how you're breaking it up. So for instance, if we want to see the average price of a diamond, so the left hand side of the formula is price, then we put in the tilde and the right hand side will be cut. That means take the data, break it up by cut, and for each value of cut, compute the mean of the diamonds in there. The next argument is the dataset. And the third argument is the function you want to apply, in this case, mean. We run this, and we see the average price for each cut of diamond. Very quick, very simple, easy to do. And if there was missing data in here we would have to tell aggregate to pass on an argument allowing me to skip them. In this case it would be na.rm=TRUE. Remove the na values that compute the mean. In this case, there were no missing values, it's the same thing. Now we cut it up by one variable, just cut, but what if I want to see the average price of diamond for the combination of cut and color? For instance, a fair cut and color D, or a good cut and color D. Well the formula notation lets us do that. So once again we aggregate, and we say price on both cut and color. This notation here says break up the data based on each combination of cut and color and then apply our function. So again, the next argument is the dataset diamonds, and the next argument is the function mean. So now we run this, and it will print out a lot to the screen, we see, and we'll scroll up to look at this, for each combination of cut and color we get a price back. Very simple to do, it just took a little bit of extra in the formula. That is the power of this formula interface and we see a lot of it when modeling and it is something that a lot of other languages don't have and it makes R really simple to use. Another frequently requested aggregation is aggregating two variables cut up by some other variable, for instance, we might want to aggregate both price and carat by cut. Doing this requires a bit of funky notation but it's not too difficult. We use aggregate, and here we need to build up a two column metric. So for that we do cbind. Now cbind we haven't seen yet, it literally takes columns and puts them next to each other to make a new matrix. So the first argument's going to be price the second one will be carat. This builds us up a two column matrix of price and carat. Then we go ahead of our tilde and say cut and diamonds, and we're looking for the mean. This just returned the average price and average carat for each value of cut. Of course we can extend this even further and add multiple splitting variables. So once again, aggregate(cbind(price, carat) and we'll do it on both cut and color. Now this will print out a lot of information, and we see here that for each combination of cut and color we get the average price and the average carat, and this continues on. Aggregate is a simple yet powerful function for splitting up data into discreet sets operating on each of them independently, and combining the results back together. This notation can be a little bit tricky at first, which is why other packages such as PLIER have come along, but aggregate is still fast and easy enough to use.