4.7 Scrape data from the web
<v Voiceover>Oftentimes,</v> data is tracked on a website and you need to be able to pull it out of there. Fortunately, there's the XML package, which does this very well. So the first step is to load the XML package. If you don't have it installed, now is a good time. So we do require(XML), with XML in all capital letters. And this gives us the capability to do all sorts of web scraping.

As an example, I have a post up on my website that's a follow-up to the New York Giants Super Bowl parade. If we scroll down, we can see that my friend was involved in a game where they wanted to see the odds of a Giants player getting the first score and a Patriots player getting the second score. So naturally, I used R to analyze it, and I put up the results in this nice little table that has 10 rows and three columns. So we want to grab this table and read it into R.

The first thing we're going to do is store that URL as a variable, so that it's easier to use. So we say theURL gets, then we go back to the browser, highlight the URL, copy it, come back here, and paste it into R in quotes. Let's run that.

So now we can go ahead and pull in the data. We say bowlGame gets readHTMLTable. The first argument is the file, or in this case the URL, so we just put in that variable, theURL. The second argument is which. This argument tells R which table to grab. We could be on a page that has multiple tables, and unless this is specified, R wouldn't know which one we want. As you can see on the website, there are no headers in this table, so we need to tell the function header = FALSE, and as usual, I like setting stringsAsFactors = FALSE to speed up processing.

We run this, then we can check it out and see that R downloaded the data, stored it in a data frame, and gave the columns the generic names V1, V2, V3.

The common slander against R is that it is not that great for scraping data.
However, as we've just seen, the XML package makes this incredibly easy and has recently been determined to be faster than web scraping in certain other languages.
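The steps from the walkthrough can be sketched in a few lines of R. This is a minimal sketch, not the exact code from the video: the transcript does not give the URL or the value passed to which, so the example parses a small made-up HTML string (theHTML, a hypothetical stand-in for the page) and assumes which = 1 to grab the first table. The names theURL and bowlGame and the arguments header and stringsAsFactors come from the walkthrough.

```r
# Load the XML package (install.packages("XML") first if it is missing).
library(XML)

# Hypothetical stand-in for the web page: one table with no header row.
# The cell values are invented for illustration, not taken from the post.
theHTML <- '<html><body>
<table>
<tr><td>Player A</td><td>Player X</td><td>1</td></tr>
<tr><td>Player B</td><td>Player Y</td><td>2</td></tr>
</table>
</body></html>'

# Parse the text into an HTML document. With a live page you would
# store the address in theURL and pass theURL to readHTMLTable instead.
doc <- htmlParse(theHTML, asText = TRUE)

# which = 1 (an assumption) selects the first table on the page,
# header = FALSE says the table has no header row, and
# stringsAsFactors = FALSE keeps text columns as character vectors.
bowlGame <- readHTMLTable(doc, which = 1, header = FALSE,
                          stringsAsFactors = FALSE)

# Inspect the resulting data frame and its generic column names.
bowlGame
names(bowlGame)
```

Because the table has no header row, the columns should come back with the generic names described above rather than names taken from the page.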