7.8 Switch storage paradigms
Data comes in many different formats and sometimes there's a need to switch between them. The two most common types of data sets are wide and long. Sometimes one is better than the other for different purposes. Wide might be better for humans to read, long might be better for computers to read. And given there's a need to switch between them, Hadley Wickham wrote a great package called reshape2 for doing just that, so let's take a look. We'll load the package reshape2 and look at a data set that ships with R. That's the airquality data set. We can see it has an Ozone reading, a Solar.R reading, a Wind reading, a Temp reading and then the Month and Day it was measured. For plotting purposes, or computational reasons for that matter, we might want this to be a month, a day, then one column telling the type of measurement (ozone, solar, wind, temperature) and another column giving the value.
Doing this is called melting. We take it from a wide format down to a long format. This is easy enough with the aptly-named melt function. I'll call the new data airMelt and we will say melt on airquality, and we're saying the id variables are both Month and Day. That's because each row will have a combination of month and day, and each combination will be repeated. So for instance, five and one will be repeated: once for ozone, once for solar, once for wind, once for temp. We're gonna store these measurements in the value column. We do that by saying value.name="Value", and we're also gonna say that the column holding the old headers is going to be called metric. So we do that with variable.name="metric". Let's break this onto a few lines so it's easier to see and then we run this and check out the results. Let's look at 10 rows of it. We see on the fifth month, the first day, for ozone it's 41, and it goes through and does this, and then later on we would see a repeat: five, one, temp and then the temperature. So forth and so on.
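The melt step described above might look like the following in R. This is a sketch, not the exact code typed in the video: the argument name id.vars is written out in full, and the new columns are named Metric and Value with capital letters, which is an assumption. The narration spells the first one with a lowercase m at this point; any spelling works as long as later steps use the same one.

```r
library(reshape2)

# airquality is wide: one row per day, one column per measurement
head(airquality)

# Melt to long format: Month and Day stay as identifiers, the four
# measurement columns are stacked into one name column and one value column.
# The names "Metric" and "Value" are a choice made for this sketch.
airMelt <- melt(airquality,
                id.vars = c("Month", "Day"),
                value.name = "Value",
                variable.name = "Metric")

# First ten rows of the long data
head(airMelt, 10)
```

Each Month and Day pair now appears four times, once per measurement.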
Now this data set looks like it's much less useful to a human, but for a computer, this can be much easier to use. Let's check out the relative dimensions of these two data sets. Airquality was 153 rows by six columns whereas airMelt is 612 rows by four columns. That's why this is called long, 'cause the data is stored in a much longer format. It takes more rows to show it, but fewer columns.
Now that we have data in a long format, we very well could want it in a wide format. Since we already have this nice data set that's already long, we will just turn airMelt into its former wide self. We will call this airCast. The first argument is your data set. In this case that's airMelt. The second argument is a formula where the left hand side holds the variables that you want running down the data set and the right hand side holds the variables whose values you want to become the headers for the new columns. So for our case, Month + Day are the identifier variables that are going to stay in place, and the right hand side is going to be Metric because each value of metric is going to become a new column. And we can specify that value.var="Value". These values here are gonna be put into the new columns. We have an error here because I tried using the variable name as a function name. What we want is airCast gets dcast. Dcast is the function in reshape2 that goes from long to wide. It is the opposite of melt. We run this and we have another error. That is because up here in our data set we used a lowercase m. Now, R is case sensitive for almost everything, so it's very important you get this right in the formulas. Now if we run it, we see it worked. We do head of airCast and the data is wide again. It's not quite like it was before because now month and day are on the left, but that's easy enough to rearrange.
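A self-contained sketch of the cast step follows. It rebuilds airMelt first so it runs on its own, and it assumes the melted columns are named Metric and Value with capital letters. The point from the narration is that the names in the formula and in value.var must match the melted column names exactly, including case.

```r
library(reshape2)

# Rebuild the long data; "Metric" and "Value" are names assumed for this sketch
airMelt <- melt(airquality,
                id.vars = c("Month", "Day"),
                value.name = "Value",
                variable.name = "Metric")

# Compare shapes: wide is 153 x 6, long is 612 x 4
dim(airquality)
dim(airMelt)

# Cast back to wide: identifiers on the left of ~, the column whose
# values become new headers on the right, value.var names the cell values
airCast <- dcast(airMelt, Month + Day ~ Metric, value.var = "Value")

head(airCast)
```

Writing ~ metric here, while the melted column is called Metric, would fail, since dcast looks the name up as written.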
In fact, we can do that right now by saying airCast gets airCast, then we will just specify the columns we want in the order we want, which in this case will be Ozone, Solar.R, Wind and Temp followed by Month and Day, capital D because it is case sensitive. Now if we look at head of airCast, it looks similar to before. The ability to melt and cast data sets is very important. Sometimes you want to go from wide to long, make some sort of transformation that can only be done in that format, then send it back to wide. Hopping back and forth is a powerful skill to learn and can really make your data munging much easier.
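The reordering can be sketched as below, again as a standalone script. Indexing by a character vector of column names is one way to do it and may differ from what was typed in the video; the Metric and Value names are the same assumption as in the earlier steps.

```r
library(reshape2)

# Wide -> long -> wide, using column names assumed for this sketch
airMelt <- melt(airquality,
                id.vars = c("Month", "Day"),
                value.name = "Value",
                variable.name = "Metric")
airCast <- dcast(airMelt, Month + Day ~ Metric, value.var = "Value")

# Put the columns back in the original order; names are case sensitive
airCast <- airCast[, c("Ozone", "Solar.R", "Wind", "Temp", "Month", "Day")]

head(airCast)
```

The column order now matches airquality, although the measurement columns may be stored as numeric rather than integer after the round trip.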