3.1 Create and access information in data.frames
Video duration: 17m
One of the best parts about R, and also what makes working with data in R so easy, is the data frame. It is a rectangular data structure, similar to a spreadsheet, whose columns can store different types of data. One column could be numeric, another could be character, and they both coexist nicely in this data frame. Data frames also allow easy access to individual rows, columns, or even individual cells. So they are a very powerful tool and form the basis of almost all work inside R.
So to get started, let's create a simple data frame with three columns. To do this, I'll create some intermediate variables. The first one will be x, and we will say that's 10 through one. Then y, that gets negative four through five. And q we will make a character vector listing some sports. As you can see, we now have three variables. So we will create a new variable called theDF and we will make it a data frame by calling the data.frame function. The arguments to data.frame are just the vectors you want to input, so x, y, and q. If we look at that by typing in the variable name, we see we get a nice rectangular dataset where each column is its own type: the first two are numeric and the last one is a character.
Now the column names automatically took on the names of the variables that went into the data frame. During the creation process, we can specify these names explicitly. So we will recreate it, and this time we will give each variable a name. The first column we will call First, and that gets x. The second column we will call Second, and that gets y. And the third column we will call Sport, and that gets q. If we look at it now, we see that the columns have nice names. One thing to note: Sport is character data, so let's check how it got saved in the data frame. We do this by using the class function.
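The creation steps described so far can be sketched roughly as follows. The narration never lists the sports stored in q, so the ten names below are placeholder assumptions; the rest follows what is said on screen.

```r
# Intermediate vectors, as described in the walkthrough
x <- 10:1    # 10 down to 1
y <- -4:5    # -4 up to 5
# Assumption: the actual sports are not given in the transcript,
# so these ten values are placeholders
q <- c("Hockey", "Football", "Baseball", "Curling", "Rugby",
       "Lacrosse", "Basketball", "Tennis", "Cricket", "Soccer")

# Without explicit names, the columns take the variable names x, y, q
defaultNames <- data.frame(x, y, q)
names(defaultNames)

# Naming the columns explicitly at creation time
theDF <- data.frame(First = x, Second = y, Sport = q)
theDF
```

All three vectors have ten elements, which is what lets data.frame line them up row by row.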
So class of theDF$Sport, and I'll explain the dollar sign in a minute. We can see that it got stored as a factor. Data frames automatically convert character data into factors because factors will be really important during modeling. However, this can lead to certain issues when dealing with data, so I often like to prevent that from happening. We can do that by using a special argument called stringsAsFactors. So once again let's recreate the data frame: theDF gets data.frame, and again we'll say First=x, Second=y, Sport=q, and we use another argument, stringsAsFactors=FALSE. This is telling it not to convert the characters into factors. So we'll check the class again, and we see now it is indeed stored as a character.
During this training, to keep the console clean, I'm going to keep clearing it. I understand that it's something you might not necessarily do during your normal workflow, but it's a common thing when giving presentations, just to keep things neat and to maintain focus. In R, the way you clear the screen is by using the key combination Ctrl+L, and now we have a clean console. Now our cursor's in the console; to get it back up into the text editor we hit Ctrl+1, and now we're in the text editor again.
There's a good deal of metadata that describes what is going on in a data frame. A very common task is finding out how many rows are in the data frame, so we can use the nrow function: nrow(theDF), and we see it has 10 rows. Likewise, ncol gives us the number of columns and dim gives us both the rows and the columns. Nice and easy to use, easy to figure out. Alternatively, instead of using lowercase nrow and ncol, we can use the uppercase versions, NROW and NCOL. Now for data frames, these work just like the lowercase versions, but they are more flexible. For instance, if we were to call lowercase nrow on x, which is a vector, we don't get an error, we actually get NULL. That's because vectors don't have rows, they have elements.
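Here is a small sketch of the factor conversion and the dimension helpers just described. One caveat worth knowing: since R 4.0.0, data.frame no longer converts strings to factors by default, so the sketch passes stringsAsFactors = TRUE explicitly to reproduce the behavior shown in the video. The sport names are placeholder assumptions, since the narration does not list them.

```r
x <- 10:1
y <- -4:5
# Assumption: placeholder sports; the transcript does not list them
q <- c("Hockey", "Football", "Baseball", "Curling", "Rugby",
       "Lacrosse", "Basketball", "Tennis", "Cricket", "Soccer")

# Request the factor conversion explicitly (the default before R 4.0.0)
asFactor <- data.frame(First = x, Second = y, Sport = q,
                       stringsAsFactors = TRUE)
class(asFactor$Sport)    # "factor"

# Keep the strings as plain character data
theDF <- data.frame(First = x, Second = y, Sport = q,
                    stringsAsFactors = FALSE)
class(theDF$Sport)       # "character"

nrow(theDF)    # 10
ncol(theDF)    # 3
dim(theDF)     # 10 3

# nrow() on a plain vector returns NULL rather than raising an error
nrow(x)
```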
The proper command to get the length of a vector is length. That gives us 10. However, if we use capital NROW(x), that gives us 10 as well. So capital NROW is sort of like a little safety function that works on all types of objects, not just data frames or matrices.
Another useful helper function for data frames is names. This gives us the column names of the data frame. So for instance, we'll type in names(theDF) and we'll see First, Second, and Sport, just as we expected. And if we want to grab just, let's say, the third column name, we can type in names(theDF) and then ask for the third element using square brackets. And we see, okay, it grabs just Sport.
Getting the row names is similarly easy. To do this we use rownames. Type in rownames(theDF) and we see right now it's just the generic one, two, three, four through 10. If we want to, we can assign new names, 'cause let's say we don't want the default. We do this again using the rownames function, but this time we assign a vector to it. The vector has to be the same length as the number of rows. So we'll just literally use the words one, two, three, four through ten, but I'll spell them out. And I will break this up over two lines so it's easier to see. And I'll go ahead and correct this typo on Four. Now if we run this and then call rownames again, we see the row names are nice and easy to read. If we want to print out the whole data frame, again we just type in theDF, and we see that the row names have indeed become these nice verbose versions. Now if for some reason we don't like that and we want to change it back to the default, all we have to do is rownames(theDF) gets NULL. And now we can check it and see that it's back to the default. I'll go ahead and clear the screen again.
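The helper functions from this stretch of the video might look like the sketch below. The sport names are placeholder assumptions, and the capitalization of the spelled-out row names is a guess at what was typed on screen.

```r
x <- 10:1
y <- -4:5
# Assumption: placeholder sports; the transcript does not list them
q <- c("Hockey", "Football", "Baseball", "Curling", "Rugby",
       "Lacrosse", "Basketball", "Tennis", "Cricket", "Soccer")
theDF <- data.frame(First = x, Second = y, Sport = q,
                    stringsAsFactors = FALSE)

length(x)    # 10
NROW(x)      # 10, where nrow(x) would give NULL

names(theDF)       # all column names
names(theDF)[3]    # "Sport"

rownames(theDF)    # default row names "1" through "10"

# Assign spelled-out row names, split over two lines for readability
rownames(theDF) <- c("One", "Two", "Three", "Four", "Five",
                     "Six", "Seven", "Eight", "Nine", "Ten")
customNames <- rownames(theDF)
theDF

# Assigning NULL restores the default numbering
rownames(theDF) <- NULL
rownames(theDF)
```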
When working with large data frames, printing out the entire data frame to the screen probably isn't the best idea. Even once you get past 20 rows, you can't really take in the information. Fortunately there's the head function, which prints out only the first few rows. So head(theDF), and we can see it prints out the first six rows by default. If we want to print out more, we could do head(theDF, n=7). This will print out the first seven rows. So head is a very helpful function. Likewise, we can check the bottom of the data by using tail. Here this prints out the last six rows, five through 10. Both of these functions are very handy to have. And of course, we could always use the class function to see what type of object our variable is, and in this case, it is a data frame. And again we'll clear the console to keep it clean.
Now let's say we want to access an individual column from the data frame. There are a few ways to go about it, and I gave a little peek of one earlier. So let's go ahead and grab the Sport column. The first way I'll show you is using the dollar sign. So you say theDF$Sport. This prints out that column, and it prints it out as a vector, which is why it breaks over a few lines. So that's one way to get the column.
We'll look at a few other ways to grab an individual column, but first let's see how to grab an individual cell out of this data frame. So again let's look at the data frame so we see what we have. And let's say we want to grab the cell that's in the third row and the second column. What we can do is say theDF, open square brackets; the first argument to these square brackets is going to be the row number, in this case three. The second argument is going to be the column number, in this case two. Then we close off the square brackets. That returns negative two, which was the third row, second column.
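A brief sketch of the peeking and single-cell access described above, again with placeholder sport names as an assumption:

```r
x <- 10:1
y <- -4:5
# Assumption: placeholder sports; the transcript does not list them
q <- c("Hockey", "Football", "Baseball", "Curling", "Rugby",
       "Lacrosse", "Basketball", "Tennis", "Cricket", "Soccer")
theDF <- data.frame(First = x, Second = y, Sport = q,
                    stringsAsFactors = FALSE)

head(theDF)           # first six rows by default
head(theDF, n = 7)    # first seven rows
tail(theDF)           # last six rows, here rows 5 through 10
class(theDF)          # "data.frame"

# A whole column as a vector, via the dollar sign
theDF$Sport

# A single cell: row 3, column 2
theDF[3, 2]           # -2
```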
Now let's say we want to grab the third row but the elements in both the second and third columns. So we enter theDF, third row, second through third columns. We run that and we see we have both the element for the Second column and for the Sport column. Similarly, we could select multiple rows. So let's grab the third and fifth rows. So it's theDF, square brackets, and since we want the third and fifth, we build a vector consisting of three and five, and let's just grab the second column. That gets us negative two and zero, which, as we can see from the printed data frame, is exactly what we wanted. And of course, you can grab both multiple rows and multiple columns. So we'll go ahead and say theDF, and we'll build a vector in here to grab the third and fifth rows and the second through third columns. Looking at that, we get the third and fifth rows for both of those columns.
Now, continuing to use this square bracket notation, we can grab just one column. The way we do that is theDF, then put a comma right away to say we are not gonna enter any row information. It's just gonna be the entire column, no row selection. And we'll put the number three for the third column. And we see we get the third column back as a vector yet again. Now if we want to grab the entire second and entire third column, we could do theDF, remembering to leave the first argument blank, and then 2:3, and we'll get both the second and third columns together as a data frame.
Now notice when we got just one column, it returned it as a vector. When we took two columns, it returned them as a data frame. This is due to an old quirk in the way R works: R was intended to be an interactive language where you typed in a command and got a result back, and usually if you selected just one column, you just wanted a vector of it. However, oftentimes you will want that one column to still be a data frame. We can accomplish this by using a special argument.
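The row and column selections walked through here can be sketched as follows, assuming the same placeholder sport names as before:

```r
x <- 10:1
y <- -4:5
# Assumption: placeholder sports; the transcript does not list them
q <- c("Hockey", "Football", "Baseball", "Curling", "Rugby",
       "Lacrosse", "Basketball", "Tennis", "Cricket", "Soccer")
theDF <- data.frame(First = x, Second = y, Sport = q,
                    stringsAsFactors = FALSE)

theDF[3, 2:3]          # row 3, columns Second and Sport
theDF[c(3, 5), 2]      # rows 3 and 5 of column 2: -2 and 0
theDF[c(3, 5), 2:3]    # rows 3 and 5, columns 2 through 3

# Leaving the row argument blank selects every row
theDF[, 3]             # one column comes back as a vector
theDF[, 2:3]           # two columns come back as a data frame
```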
So first let's confirm that when we select just one column it comes back as a vector. Let's grab one column and check its class. It's a character, not a data frame. So now let's go ahead and use that special argument to keep the single column as a data frame. So theDF, same as before, third column. Now we say drop=FALSE. This means don't drop it down into a vector. And we can see now our display looks much more like a one-column data frame. We can confirm this by checking its class. It is indeed a data frame.
Now we can select just a single row, the entire row, in a similar fashion but with slightly different results. So let's grab the second row out of this data frame. So theDF, and this time for the first argument we put in two, and we leave the second argument blank. Running this we see we get all three columns, and it's a data frame. We can confirm this by typing in class. Now the reason a single row does not get dropped down into a vector is that every element of a vector needs to be of the same type. In data frames, each column can be a different type, so a row cannot drop down into a vector. And again, we can select multiple entire rows by doing theDF and then saying two through four and leaving the column argument blank. There we got three rows out of this. I'll go ahead and clear the screen again to keep it clean.
Now there's yet another way to select columns, or even a combination of rows and columns, and that is to specify the column by name. So we could do theDF. I'm going to leave the row argument blank. This will work with row arguments, but for now I want to illustrate selecting entire columns. And we put in a vector of column names. So in this case I'll take the First column and the Sport column. By doing that, we see we get both columns, and we specified them by name. And you can specify them in any order. Let's do this again and reverse the order.
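As a rough sketch of drop=FALSE, whole-row selection, and selection by column name, with placeholder sport names assumed as before:

```r
x <- 10:1
y <- -4:5
# Assumption: placeholder sports; the transcript does not list them
q <- c("Hockey", "Football", "Baseball", "Curling", "Rugby",
       "Lacrosse", "Basketball", "Tennis", "Cricket", "Soccer")
theDF <- data.frame(First = x, Second = y, Sport = q,
                    stringsAsFactors = FALSE)

class(theDF[, 3])                  # "character"
class(theDF[, 3, drop = FALSE])    # "data.frame"

# A single row stays a data frame because its columns differ in type
theDF[2, ]
class(theDF[2, ])                  # "data.frame"

theDF[2:4, ]                       # rows 2 through 4, all columns

# Columns by name, returned in the order requested
theDF[, c("First", "Sport")]
theDF[, c("Sport", "First")]
```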
And we see that the results are returned in the order in which I specified the columns. It's a great way to rearrange the columns. Now we could have put in just a single column, and it drops it into a vector as before. We could, once again, do theDF and tell it we want Sport and say drop=FALSE to maintain a data frame even when we're selecting just one column.
Now things get even trickier. There are lots of little nuances with selecting columns out of data frames. We could supply just a column argument and no row argument. So we put in theDF, and we just say right away Sport. Doing this returns a single-column data frame. So it's sort of a shortcut for line 57. Lots of little quirks here that can take some time getting used to. There is yet another way to grab a single column: we use theDF with double square brackets and put in Sport. This time it returns the column as a vector. So there are all these little nuances, and there are different reasons why these things work the way they do. As we continue on, they'll start to make more sense. Using the single square bracket notation, it is possible to select multiple columns yet again. Again we supply it with a vector, let's say First and Sport, and we get multiple columns.
Data frames, as we've seen, are very versatile and can store many different types of data. They can store numeric, character, factor, integer, all in the same dataset, and that's a very powerful part of R. It is perhaps one of the reasons R has become so popular: it makes working with data so easy.
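The different single-column forms are easy to mix up, so here they are side by side in one short sketch. The sport names are placeholder assumptions, and the comma-free form in the last call is an assumption about what was typed on screen for the multi-column case.

```r
x <- 10:1
y <- -4:5
# Assumption: placeholder sports; the transcript does not list them
q <- c("Hockey", "Football", "Baseball", "Curling", "Rugby",
       "Lacrosse", "Basketball", "Tennis", "Cricket", "Soccer")
theDF <- data.frame(First = x, Second = y, Sport = q,
                    stringsAsFactors = FALSE)

theDF[, "Sport"]                  # vector
theDF[, "Sport", drop = FALSE]    # one-column data frame
theDF["Sport"]                    # one-column data frame, no comma needed
theDF[["Sport"]]                  # vector

# Single brackets with a vector of names select several columns
theDF[c("First", "Sport")]
```

In short, single brackets without a comma keep the data frame structure, while double brackets and the dollar sign pull out the column itself.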