8.7 Use group_by to split the data
The real power of dplyr lies in the split-apply-combine philosophy of data analysis. We're gonna take a data frame, split it up according to some variable, apply a function to each piece, then recombine the results. In dplyr, the group_by function facilitates this.

So let's say we want to find the average price for each cut of diamond. We would say dia, pipe, group_by, and since it's Hadley Wickham, there are underscores. We're going to break it up by cut. We then pipe this into summarise. We will say avg_price equals mean of price. We run that. Then we get back this nice five-row data frame telling us the average price for each cut of diamond.

Now we might want to break it up by more than just cut. Let's say we want to group it by cut and color. We do dia, pipe, group_by, cut comma color, and we pipe that result into summarise, where we once again say avg_price equals mean of price. Now each row represents a unique combination of cut and color, along with its average price. Since this is a tbl, it is smartly printing just the first 10 rows.

We can also calculate multiple functions on multiple columns. So let's do dia, pipe, group_by, cut, pipe, summarise, avg_price equals mean of price, avg_carat equals mean of carat, and total_carat equals sum of carat. We now have five rows, because there are five levels of cut, but we have three new calculated columns: one each for the mean of price, the mean of carat, and the sum of carat.

group_by facilitates the split-apply-combine paradigm of data analysis, making it really easy to perform these grouped operations.
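The three grouped summaries described in the voiceover can be sketched in R roughly as follows. This is a reconstruction from the spoken description, not the code shown on screen. It assumes that `dia` is the `diamonds` data set from ggplot2 (the transcript never says where `dia` comes from), that "pipe" means the `%>%` operator, and that the column names are spelled `avg_price`, `avg_carat`, and `total_carat`.

```r
library(dplyr)

# Assumption: `dia` is the diamonds data set that ships with ggplot2,
# stored as a tibble so that printing shows only the first 10 rows.
dia <- ggplot2::diamonds

# Split by cut, apply mean() to price, combine into one row per cut.
by_cut <- dia %>%
  group_by(cut) %>%
  summarise(avg_price = mean(price))

# Split by every combination of cut and color.
by_cut_color <- dia %>%
  group_by(cut, color) %>%
  summarise(avg_price = mean(price))

# Several summary functions over several columns in one call.
multi <- dia %>%
  group_by(cut) %>%
  summarise(
    avg_price   = mean(price),
    avg_carat   = mean(carat),
    total_carat = sum(carat)
  )

print(by_cut)
print(by_cut_color)
print(multi)
```

With two grouping variables, recent versions of dplyr may print a message noting that the result is still grouped by `cut`. That message is informational and does not change the computed averages.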