7.7 Join datasets - Video Tutorials & Practice Problems
Video duration:
5m
Play a video:
Video transcript
<v Voiceover>Today, data</v> comes from all over the place. Some might be stored in a database, some might be in a csv, others on a website. R can easily go and grab all this data and then it needs to be joined together. Joining is a term that should be familiar to a lot of sequel users. It's a way of taking data sets, finding something they have in common, and combining them based on that. To see that happen we need to read some data into R first. We'll call the first data set codes and we'll do read.table and we'll say, data slash countryCodes.csv, and we will tell it header equals true, sep equals comma, and stringsAsFactors equals false. Now, the stringsAsFactors equals false is very important here because we are going to be doing a join on character data and if it's stored as a factor and in a different data frame it's stored as a factor with the same labels but different underlying integers that can cause trouble, so it's very important that our strings are not factors. This data set along with the rest of the data sets in these videos is available at www.jerrodlander.com/data. The next data set we're going to read in is countries. We do read.table, it's data slash GovType.csv, I'm a say header equals true, sep equals quote, and stringsAsFactors equals false. When we run this we see we have an error and that's because arguments need to be separated by commas not periods, and R is very sensitive to this. There we go. We can look at this data. We'll do head of codes. Here we see it has a country code and that's an international standard for abbreviating country names. We have a column named country name, and a year, and a bunch of other information that we don't need to worry about too much. Let's look at countries. It's a little hard to read this because of the way it got printed out because at least one of the columns is all over the place. There is a helper function in our call to view, which easily displays data. In Rstudio it looks really great, so why don't we look at it. By doing view of countries, it pulls up this nice spreadsheet like environment and you can scroll across and see that it's a lot of information here but it's really just the second column that is messy. You can see we have country and government type, so it would be natural to try to join these two data sets on country, so let's close this, come back here, and let's clear out the console. There are a view ways to do a join in R. In base R there's a function called merge, so let's see how that would work. We don't wanna be in the console again, so let's delete that. We come up here to the text editor and we write countryMerged, that'll be the name of our new data frame, and we say merge. X is the table on the left, in our case that will be codes, and y is the table on the right which will be countries, and now if you recall the names of our data sets are a little different. In one it's country.name. In the other one it's country. We need to specify which column in which table will be the one's we're doing the join on. We say, by.x to indicate which column in the x table is our join column and that is country.name. The column in the y table is just country. If we run this we get back a nice result which we will look at in the viewer. We have country name, code, year, and a lot of information, we scroll across the government type. The two sets have been combined. Now, that seemed nice and snappy, but it was a small data set. When dealing with larger data sets, merge can be slow and I prefer to use the join function out of the plyr package. Let's close this window, come down here, and load plyr. The join function in plyr is very nice but it has one glaring limitation. The column or columns you want to join on in the two tables or the two data frames have to have the same name, so where as right now we have country.name and country, that won't work for join. We need to rename one of them. Fortunately, plyr has a helper function for doing this. We will say codes gets rename, which is a special function to help rename columns, we could do this the old fashion way by using the names function, but this gives us a little more convenience. The first argument is the data frame we want to rename some of the columns in, so that's codes, and the second argument is a named vector, specifying the old name and the new name. In this case the old name is country.name and the name which has to be a string is country. Run that and now we're ready to do a join. We could do countryJoined gets join, here we could say x equals codes, y equals countries, and by is only one by and this is a string, country. Running that we get a nice joined data set and we can see it by doing view. And we have this good information again. You can see that it has the code, it has the country, it has the year, it has all this information that we had before, possibly in a different order. Merge and join really make it easy to combine different data sets into one using some sort of common column or columns. It's just a matter of speed and convenience as to which one you choose.