20.1 Build a recommendation engine with RecommenderLab
20: More Machine Learning
Video duration:
13m
Video transcript
<v Voiceover>Ever</v> since the Netflix Prize, recommendation engines have been a very popular topic amongst data scientists. They're used everywhere from Facebook to Amazon to Google, and naturally many companies wanna use them. In the spirit of the Netflix competition, the example data set we'll be using today is about movie ratings. It comes from the GroupLens project out of the University of Minnesota. The URL is grouplens.org/datasets/movielens. And here they provide different sized data sets, running from 100,000 rows to a million rows to 10 million. For our purposes, we will download the 100K data set and build a recommendation engine on that. So first step, we need to actually download the file. We could have gone ahead and clicked it, but I believe if you can program, it's better than clicking. So we will say download.file, and we will put in the URL we are trying to download from. In this case that is http://files.grouplens.org/datasets/movielens/ml-100k.zip. We then specify where the file is going to go on our computer, and I'll just store it in our working directory as ml-100k.zip. We run that, and R goes ahead and downloads that file. You can see this in my Git pane, in that we now have a new file called ml-100k.zip. This is a zip file, so we need to open it up, we need to unzip it. Of course you could go ahead and find that file, click on it and unzip it, but again, I like to program things. So let's just say unzip. Yes, R is that amazing, you can actually download files and unzip them without ever leaving R. The name is ml-100k.zip, and we will unzip it to the movies directory. So we run this, and you can see we now have a new directory called movies. In fact we can do dir movies and see that inside that directory is another folder called ml-100k. So we'll do dir movies/ml-100k. And in here, we can see a bunch of the different files they used.
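The download and unzip steps described above might look roughly like this in code. This is a sketch, not the exact commands typed in the video: the variable names `url` and `zipFile`, the `file.exists` guard, and `mode = "wb"` (which keeps the binary file intact on Windows) are assumptions added here. It needs a network connection.

```r
# URL and local file name as given in the transcript
url <- "http://files.grouplens.org/datasets/movielens/ml-100k.zip"
zipFile <- "ml-100k.zip"

# Download into the working directory, skipping the download if the
# file is already there (the guard is an addition, not from the video)
if (!file.exists(zipFile)) {
    download.file(url, destfile = zipFile, mode = "wb")
}

# Extract the archive into a directory called movies
unzip(zipFile, exdir = "movies")

# List what was extracted
dir("movies")
dir("movies/ml-100k")
```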
If you read the readme that comes with this data set, you'll see that u.data are the user ratings, and that's what we want. It is a tab separated file with no column headers, so we will need to create those. We will say ratings gets read.table, put in the name of the file, which for us is movies/ml-100k/u.data. Tell R that header=FALSE, sep= tab, which is backslash t, and on the next row we will specify col.names, and we pass that a vector of UserID, Rating, and Timestamp. Now here is a perfect example of trying to put in column names when you don't have a full sense of your data. That's why it's always important to examine your data. In here, we left out one of the column names. In between UserID and Rating, there is MovieID. Let's go in there, fix our mistake, and run these lines of code again. That time it worked properly. We can see head of ratings. Each row has a UserID, MovieID, the Rating, and a Timestamp of when it happened. So you have a user reviewing multiple movies, and multiple users reviewing the same movie. And this Timestamp, it doesn't look too friendly, so we're going to convert it to something that's more human-friendly. We'll say ratings$Timestamp gets as.POSIXct ratings$Timestamp, and the origin is equal to 1970-01-01, that's the Unix epoch. Now if we check out the head of ratings, we can see the Timestamp is much more user-friendly. Currently, this data is in the long format. To be more useful for us, we want this in the wide format. So we'll use the reshape2 package to cast it into the wide format. We say require reshape2, and we are going to create a new object, ratingsMat, that is the result of dcast. dcast was covered in previous live lessons and in the book, and it takes in a formula interface. That is, we'll say UserID is a column that will remain a column, whereas MovieID will be cast across into new columns. That means each unique value of MovieID will become its own column.
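The reading and timestamp steps above can be sketched as follows. To keep the example self-contained it reads a few made-up tab separated lines (`toyLines`, a hypothetical stand-in) instead of the real file; with the data downloaded you would pass `"movies/ml-100k/u.data"` as the first argument instead of `text =`. The `tz = "UTC"` argument is an assumption added so the result does not depend on the local time zone.

```r
# A few made-up rows in the same layout as u.data:
# UserID, MovieID, Rating, Timestamp, separated by tabs, no header
toyLines <- c("1\t10\t5\t881250949",
              "1\t20\t3\t891717742",
              "2\t10\t4\t878887116",
              "3\t20\t1\t880606923")

# All four column names, including the MovieID that was first left out
ratings <- read.table(text = toyLines, header = FALSE, sep = "\t",
                      col.names = c("UserID", "MovieID", "Rating", "Timestamp"))

# Convert seconds since the Unix epoch into a date-time
ratings$Timestamp <- as.POSIXct(ratings$Timestamp,
                                origin = "1970-01-01", tz = "UTC")

head(ratings)
```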
And the cells in each of those columns will be that user's rating for that movie. The data is ratings, and we're going to use the Rating column to populate all these new movie columns. We run this and it gets completed nice and neatly. We'll take a moment to see what it looks like now, but we've gotta make a few more transformations. This is a very wide data set, so we only want to look at the top left-hand corner of it. So we'll load in the package useful, and do corner of ratingsMat. So now we see just the top left-hand corner, and UserID is just one, two, three, four, five, et cetera, and the MovieIDs are also just numerically numbered. It'd be a lot easier for us if the UserIDs said user one, user two, user three, so the best way to manipulate this object is to set the column names to be names of the movies, and row names to be the names of the users. So to create the row names we will say rownames of ratingsMat gets sprintf, User%s. And we will populate that with ratingsMat$UserID. So now if we look at the corner of ratingsMat, we'll see that the row names became User1, User2, User3, User4, that's the beauty of sprintf. That %s is a variable you get to insert into, and it is indeed vectorized. Now that we've used UserID, we can get rid of it, 'cause it's not gonna be good for our data. So we will say ratingsMat$UserID gets NULL. If we look at the corner of it now, you'll see that that column for UserID is gone, it's not in our way anymore. Just like we did for the row names, where we created User1, User2, we need to tack the word Movie onto each of the column names. We are going to do it in a very similar fashion. Colnames of ratingsMat gets sprintf Movie%s. And then here we're going to put colnames of ratingsMat. Now if we look at the data, we can see the row names are the users, the column names are the movies, and when a user rated a movie, the cell is the rating. When a user did not rate a movie, the cell is just NA because it is missing.
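The cast from long to wide and the renaming steps above can be sketched like this. The small `ratings` data frame is made-up toy data standing in for the real ratings, so the MovieID values 10 and 20 are hypothetical. The transcript uses `corner` from the useful package to peek at the result; plain printing is used here instead because the toy table is tiny.

```r
library(reshape2)

# Made-up long-format ratings, one row per user/movie pair
ratings <- data.frame(UserID  = c(1, 1, 2, 3),
                      MovieID = c(10, 20, 10, 20),
                      Rating  = c(5, 3, 4, 1))

# UserID stays a column, each unique MovieID becomes its own column,
# and the cells are filled from the Rating column
ratingsMat <- dcast(ratings, UserID ~ MovieID, value.var = "Rating")

# Row names become User1, User2, ... (sprintf is vectorized)
rownames(ratingsMat) <- sprintf("User%s", ratingsMat$UserID)

# UserID now lives in the row names, so drop the column
ratingsMat$UserID <- NULL

# Column names become Movie10, Movie20, ...
colnames(ratingsMat) <- sprintf("Movie%s", colnames(ratingsMat))

# Unrated user/movie pairs show up as NA
ratingsMat
```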
And lastly, just to ensure that this is indeed a matrix, we will say ratingsMat gets as.matrix ratingsMat. Now that we have our matrix ready to go, it's time to actually build some recommendations. So let's load up recommenderlab, a great package for building recommendation engines. So require recommenderlab, and we'll create a new object, rateMat, which is as ratingsMat, then in quotes, realRatingMatrix, and yes, capitalization matters. To see what this looks like, we will do head of as rateMat data.frame. And yes, this looks a lot like our original data, but this transformation process was important for the modeling engine. If we want to see just the ratings for each of the movies, we could say as rateMat list, and let's just look at the first one. You can see, this user gave each of these movies a certain rating. Copy and paste, and you see much of the same. Again, I am a big fan of visualizations, so I wanna see how many items every user rated. So let's go ahead and say image of rateMat. This might take a little bit of time to populate, because it's drawing a lot of information, but it's gonna get us a nice plot showing the number of reviews by user. This is nice information, you know, most users rated fewer movies, and a few users rated a lot of movies. That's pretty much to be expected. Another useful visualization will be to see the histogram of the ratings. So we will do hist of getRatings normalize rateMat. And I'll give it 100 breaks, to make for better plotting. Now we have this nice visualization, on a normalized basis, of the number of ratings for each of these movies. Now, it's time to actually go ahead and build a recommendation engine. So we will say itemRec gets Recommender rateMat, then the method will be popular. We're gonna do recommendations based on popularity. And look how quickly that processed. Granted, it's not the biggest data set, but it still processed incredibly quickly, and it did the learning based on 943 users. This is our engine.
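The recommenderlab steps above can be sketched as follows. The 4 by 4 `ratingsMat` is made-up toy data standing in for the real 943-user matrix, so the plots will look nothing like the ones in the video. The method name is written as `"POPULAR"` here, which is how recommenderlab lists the popularity-based method.

```r
library(recommenderlab)

# Made-up ratings table with users as rows and movies as columns
ratingsMat <- data.frame(Movie1 = c(5, NA, 1, 4),
                         Movie2 = c(4, 2, NA, 5),
                         Movie3 = c(NA, 4, 3, 2),
                         Movie4 = c(3, 5, 4, NA),
                         row.names = sprintf("User%s", 1:4))

# Make sure it is a plain matrix, then coerce to recommenderlab's class
ratingsMat <- as.matrix(ratingsMat)
rateMat <- as(ratingsMat, "realRatingMatrix")

# Back to long form, and the ratings given by the first user
head(as(rateMat, "data.frame"))
as(rateMat, "list")[[1]]

# Visual checks: the rating matrix and the normalized ratings
image(rateMat)
hist(getRatings(normalize(rateMat)), breaks = 100)

# Popularity-based recommender
itemRec <- Recommender(rateMat, method = "POPULAR")
itemRec
```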
If we want to extract the model out of this, we say getModel itemRec. And you can see all this information that's stored in the model that you could use for your recommendation engine. We just showed one way to do this, with an item-based recommendation engine. There are also user-based recommendation engines. In fact, there are different algorithms you can use, there's matrix factorization, there's association rules, and recommenderlab contains many different ways to do this, and it's definitely worth exploring.
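As a sketch of where to go from here, the code below extracts the model, lists the methods recommenderlab has registered for rating data, and asks the fitted engine for recommendations. The `predict` call and the registry lookup are not shown in the video, so treat them as an illustrative next step, and the toy matrix is made-up data.

```r
library(recommenderlab)

# Made-up ratings, users as rows and movies as columns
ratingsMat <- matrix(c(5, 4, NA, 3,
                       NA, 2, 4, 5,
                       1, NA, 3, 4,
                       4, 5, 2, NA),
                     nrow = 4, byrow = TRUE,
                     dimnames = list(sprintf("User%s", 1:4),
                                     sprintf("Movie%s", 1:4)))
rateMat <- as(ratingsMat, "realRatingMatrix")
itemRec <- Recommender(rateMat, method = "POPULAR")

# Everything the fitted engine stores
model <- getModel(itemRec)
names(model)

# Other registered algorithms for this kind of data
methodNames <- names(recommenderRegistry$get_entries(dataType = "realRatingMatrix"))
methodNames

# Top recommendations for the first two users, among unrated movies only
pred <- predict(itemRec, rateMat[1:2], n = 2)
as(pred, "list")
```

Swapping the `method` argument for another name from `methodNames` is how you would try a different algorithm on the same `rateMat`.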