4.8 Read XML data - Video Tutorials & Practice Problems
Video duration:
27m
Video transcript
To begin with, we're gonna repeat something learned in a previous live lesson: reading in data from an HTML table. To do this, we're going to open our browser. I'm going to use Chrome, but use any browser you feel comfortable with. We're going to go to this URL: www.jaredlander.com/2012/02/another-kind-of-super-bowl-pool (there are dashes in between each word). When we come here and scroll down a little bit, we see that there is an HTML table. If we right click and Inspect Element, you can see it's a table: there's a body, there's a tr for each of the rows, and within a tr there are tds for the individual cells. Now this is a simple table, but we might want to read it into R for any number of reasons, just to grab the data, and we want to do it in the easiest way. So let's go back to R. The first thing we need to do is load the XML package. If you don't have the XML package, you come down here to the packages pane and install it. To load it we say require(XML). Now the package is ready to use. I'm going to assign the results of this operation to a variable, so "bowl" gets readHTMLTable, and I'm going to use the Tab key on my keyboard to bring up an autocomplete list. This way I don't have to type out the function name completely; I can select the one I want, which in this case is readHTMLTable. The first argument is the location of the file, in this case the URL I just showed, and it has to be inside quotes because it's the name of the file you are reading from. You also say header = FALSE, because in this case we have no headers, and stringsAsFactors = FALSE, which prevents the character data from being converted into factors. Lastly, you need to tell R which table to get. In our case it was the first HTML table on the page, but there could be situations where it's the second or the third. You specify that by saying which = <the number of the table>.
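The call described above can be sketched in a self-contained way. Since the live Super Bowl pool page may not be available, this sketch parses a small made-up HTML table inline instead of passing the URL; the arguments are exactly the ones named in the lesson:

```r
library(XML)

# A tiny HTML table standing in for the Super Bowl pool page
# (the cell values here are invented for illustration)
html <- '<html><body><table>
  <tr><td>A</td><td>1</td><td>x</td></tr>
  <tr><td>B</td><td>2</td><td>y</td></tr>
</table></body></html>'

# Parse the HTML, then read the first table out of it.
# With a live page you would pass the URL string instead of doc.
doc  <- htmlParse(html, asText = TRUE)
bowl <- readHTMLTable(doc, header = FALSE,
                      stringsAsFactors = FALSE, which = 1)
bowl
# With no header row, readHTMLTable auto-names the columns V1, V2, V3
```
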
So I will highlight both lines of code and run them using Ctrl+Enter on my keyboard. If we come down here and type "bowl", we can see that we captured all the data. The column names were made V1, V2, V3, because none were provided, and it got each of the columns of data. In this case it's a simple 10 x 3 table, and we have all the information. So reading data that is stored in an HTML table is quite simple using the readHTMLTable function in the XML package. But sometimes the data isn't stored nice and neatly, so you need to go and parse it out. To do that we will continue to use the XML package, but we need to use some new functions. For our example data, we're going to go to Chrome and check out the menu for one of my favorite pizza places: www.menupages.com/restaurants/fiores-pizza/menu. This is for a pizza place on Bleecker Street in the Village in New York City, and right here we have information about its address, its location, its phone number; for instance, 165 Bleecker Street. I want to be able to parse that and capture it. So let's right click this and Inspect Element. This will let us learn a little bit about what's happening in the HTML file and how we can access it. We can see that the address is located within an li element with the class "address adr". Within there, there's a span with the class "addr street-address" that holds the street address; there's also a span for the zip, a span for the country, and so forth. So we can use this information to figure out where to get information from this HTML file. Going back to R, the first thing we need to do is make a note of the URL of this page. I'll save it as a variable: "address" gets, then in quotes, the URL. We then run this. Next we need to read in the page.
So I will assign that to thePage, which gets readLines, with a capital L; again I use Tab complete to get the rest of it, and I pass in address. It now went out and read that HTML file. If we want to look at the first part of it, we can type head(thePage). We see we have all this HTML, what looks like a little bit of JavaScript, and all sorts of information that came from that website. This isn't the rendered page that you see; this is the actual HTML source that renders the page. To parse it, you need the XML package, so let's load it in case you don't have it loaded. We are going to parse this page: I will create a new variable, pageRender, and that gets the result of htmlParse of thePage. Now that we've parsed the page, we need to extract information out of it using XPath. XPath is a way of specifying locations within the structure of an XML document, and HTML is just a specialized form of XML. So let's go and grab that address information out of the web page. Going back to the web page just to remind ourselves what it looks like, we can see it is in an li tag with the class "address adr"; within that, it's in a span tag with the class "addr street-address". We are going to use that information in XPath to grab it out. Now, today is not meant to be a lesson in XPath; that's a whole book in itself. But we'll give you the little bit of information that you need for now, and hopefully that will get you through. So let's go back to R and reassign the variable "address". We will use xpathApply (we'll let Tab completion finish it), and what this does is go through the parsed page and extract information. We need to feed it pageRender, and then we pass it a string containing the pattern of HTML tags we are looking for. First up was that li with the class "address adr", so we say "//li". The two slashes mean look for any li: we don't have to start at the top and work our way down; it searches for any li anywhere in the document.
Then we'll put in square brackets, because we're going to qualify it: it's not just any li, it's a specific one that matches a certain pattern. We want to specify a certain class for this li, so we say "@class =", then in single quotes we put the class that was assigned to it, which in this case was "address adr" (address, space, adr). We close the single quote, and we close the square bracket. Further, within this li we are looking for the span with the class "addr street-address". So we put a slash, just one slash this time, because it's within the li we found, then span, because we're looking for a span, then square brackets with @class, and we tell it what class we're looking for: "addr street-address". We close off the single quote, close off the square brackets, and close the double quote because we're done with this pattern. There's another argument, so put a comma, and I'm going to hit the Enter key so it goes to the next line; this is optional, but it should make it easier to read. We have to pass in a function: fun = xmlValue, which extracts the value encapsulated in the tag. The result is a list, and I just want the first element, so I take element one with double square brackets. If we run this, we have 165 Bleecker Street, nice and simple. Let's do this again and look for the city that Fiore's is in. So we say "city" gets xpathApply. We are again feeding it pageRender, and now we have to feed it a pattern. Again we are looking for that same li, which could appear anywhere but has a particular class. Within that we are looking for a span that has no class; then within that, another span, and this one has a class equal to "locality". Once again we pass it the function xmlValue, and we take the first element with double square brackets. We run this, and we can see we grabbed the city, which is New York. Sometimes when you're working with HTML, you're not just concerned with the class of an object, but with its id.
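The two xpathApply calls above can be sketched end to end. Since the menupages page may no longer exist, this example parses inline markup that mimics the structure the lesson inspects (the class names are the ones from the transcript; the address text is invented):

```r
library(XML)

# Inline HTML mimicking the address markup described in the lesson
html <- '<html><body><ul>
<li class="address adr">
  <span class="addr street-address">165 Bleecker St</span>
  <span><span class="locality">New York</span></span>
</li>
</ul></body></html>'

pageRender <- htmlParse(html, asText = TRUE)

# Any <li> with class "address adr", then its street-address span;
# xmlValue pulls out the text inside the matched tag
address <- xpathApply(pageRender,
    "//li[@class='address adr']/span[@class='addr street-address']",
    fun = xmlValue)[[1]]

# The locality sits in a span nested inside an unclassed span
city <- xpathApply(pageRender,
    "//li[@class='address adr']/span/span[@class='locality']",
    fun = xmlValue)[[1]]

address  # "165 Bleecker St"
city     # "New York"
```
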
So let's go back to our menu and check it out. We come here where it says "Insalata", right click and Inspect Element, and we can see that it's inside a div with the id "restaurant-menu". Then we can grab all the headers: let's scroll, there's Insalata, there's Pizza Slices, and here in that div there's an h3 for Insalata, an h3 for Pizza Slices, and another one for Sandwiches and Rolls. So we want to use XPath. Going back to R, I'm going to clear out the console. To find this, I'm going to make a variable named "headers", and it gets the result of xpathSApply. Previously we saw xpathApply, which returns a list; but just as sapply in base R tries to return a vector, so does xpathSApply. We pass it pageRender, and now we need to make a pattern. In this pattern I'm looking for anything: in our example it was a div, but there might be a situation where it's not a div, so we write it generically. We use //, because I'm not worried about nesting, then *. The * means "match any HTML tag". It could have been a div, it could have been a span. The reason we're safe is that we're specifying an id, and in HTML an id can only be used for one object. So we say "@id =" and then in single quotes "restaurant-menu". Then we are interested in the h3 items of this div, so we say "/h3". We close it off and say the function is xmlValue. When we run this, we should see all those h3 headers that were in the web page. Nice and simple; it gets us the information we need. So let's go back to our web page and say we might want to get these prices. For instance, under Sandwiches and Rolls, I want to see that a chicken roll is seven dollars and a sausage roll is seven dollars; under Pizza Slices, a slice is $2.75 and a slice with a topping is $3.50. So let's look at this table and see how it's built. Once again we right click and Inspect Element, and in here we see we have a table whose class is "prices-three". There's a body, there's a tr, and a td.
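The id-based lookup can be sketched the same way, again with inline markup standing in for the live page (the h3 titles are the ones mentioned in the lesson; the surrounding structure is invented):

```r
library(XML)

# Inline HTML standing in for the menu page; the id is the one
# the lesson inspects ("restaurant-menu")
html <- '<html><body>
<div id="restaurant-menu">
  <h3>Insalata</h3>
  <h3>Pizza Slices</h3>
  <h3>Sandwiches and Rolls</h3>
</div>
</body></html>'

pageRender <- htmlParse(html, asText = TRUE)

# "//*" matches any tag; the id pins it down to one element,
# since an id may be used only once in valid HTML
headers <- xpathSApply(pageRender,
                       "//*[@id='restaurant-menu']/h3", xmlValue)
headers
# A character vector: "Insalata", "Pizza Slices", "Sandwiches and Rolls"
```
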
The interesting thing about this website is that sometimes the prices are stored in tables with the class "prices-three", and sometimes it's "prices-four" or "prices-five". So we need to build our code generically, so it grabs any table whose class starts with the word "prices", and it doesn't matter whether it's "prices-three", "prices-four", or "prices-five". Let's go back into R; once again I will clear out the console. Let's once again use xpathSApply: I say "items" gets xpathSApply, and again we pass it pageRender. Now we need to build our pattern. We're looking for any table, so it's //, because it's not necessarily nested anywhere, then table, and we'll qualify it with the square brackets. XPath has a special function called starts-with(). As you can guess, it looks for something that starts with a certain value. In here you put @class, because we're searching on the class attribute, then a comma, then in single quotes the prefix we want to match, which in this case is "prices-". Close off the single quotes, close off the function, close off the square brackets, close the pattern, and close the xpathSApply call. We run this and type "items", and we see that we grabbed that HTML table with all the information in it. In fact, we grabbed all the HTML tables whose class started with "prices-", and we have all the information in there. Every element in this list of items is an HTML table, so we can loop through, grab each table, and store it as a data frame, all inside a list. So let's overwrite items: let's use lapply, because we are going to iterate over this list of items, and we pass it the function readHTMLTable, with the additional argument stringsAsFactors = FALSE. You run that, and now if you view items, you can see each element of the list has become a nicely formatted data frame, which is going to be much easier to work with. So far we have seen how to grab the value of XML tags.
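The starts-with() pattern and the lapply conversion go together like this. This is a self-contained sketch: the two tables, their classes, and their prices are made up to mimic the structure the lesson describes:

```r
library(XML)

# Two price tables whose classes share the "prices-" prefix
html <- '<html><body>
<table class="prices-three">
  <tr><td>Slice</td><td>2.75</td></tr>
  <tr><td>Slice with Topping</td><td>3.50</td></tr>
</table>
<table class="prices-four">
  <tr><td>Chicken Roll</td><td>7.00</td></tr>
</table>
</body></html>'

pageRender <- htmlParse(html, asText = TRUE)

# starts-with() matches prices-three, prices-four, prices-five alike;
# with no function supplied, xpathSApply returns the matched nodes
items <- xpathSApply(pageRender,
                     "//table[starts-with(@class, 'prices-')]")

# Each element is an HTML table node; convert each to a data frame
items <- lapply(items, readHTMLTable, stringsAsFactors = FALSE)
length(items)  # one data frame per matched table
```
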
But sometimes you actually want to grab what is in an attribute of an XML tag. So let's go to our browser, open a new tab, and go to this URL: www.menupages.com/restaurants/all-areas/all-neighborhoods/pizza. This brings up a list of the first hundred results for pizza places in New York City. Our goal is to get the name of each pizza place and the link to its menu. So we come here, and let's take 1. Aldo's Pizza: right click and Inspect Element, and we can see, if we expand this, that the name of the pizza place is inside the a tag, but the URL is an attribute of the a tag. So we need to use different functions in R to grab these different pieces of information. Let's go back to R and clear out the console. The first thing we are going to do is load the plyr package, because we're going to use it to help convert a list to a data frame: require(plyr). This is a package written by Hadley Wickham, and it was covered extensively in previous live lessons. First we need the link to the website, so let's go back to our browser, copy it, and make a variable to store it as a string. Now we go ahead and parse this HTML file: we say "doc" gets htmlParse of that variable. Now comes the difficult part: going through, finding the patterns in this page, and getting the information out. We will do this iteratively. We create a new object: placeNameLink gets xpathApply of doc. The pattern here is crucial. We start with a table that's nested anywhere. Within there we want to grab the tr; within there, a td, and we will specify a class, so "@class =" and then in single quotes "name-address". We can see this if we go back to Chrome: right here the td has the class "name-address". Within there we will go ahead and grab the link, so we close off the single quote and the square bracket, then /a, because the link is nested in that td, and here we say @class equals "link". Now the difficult part is specifying the function.
So we will say "fun =", and we're going to define a function right in place that will call other functions. We declare function(x), open the curly brace, and this is just going to be a one-line function, so it will automatically return its only line, which creates a vector. The first element is name = xmlValue, a function we've seen before, of x, and we tell it not to be recursive by saying recursive = FALSE. Then our vector will also have an element called "link", which will be the result of xmlAttrs. This lets us grab the attributes of the tag. In our case, if we go back to Chrome, you will see that the a tag has two attributes: class and href. We want to grab the second attribute, which will be /restaurants/1-aldos-pizza in this case, and other values in other places. So we come back to R, and we say we're going to grab the XML attributes, and we want the second element. We close our vector, we close the curly brace of the function, and we close the parentheses of the xpathApply call. We run this. Right now it's actually stored as a list. I'll print it out to the screen; it might get a little messy, but this way you can see what you're getting. Again I use Tab complete to finish the variable name. You can see we have a 100-element list, each element of which has two values: the name and the link. To make this a little more usable, we'll use ldply from the plyr package to turn it into a nice data frame. We can even overwrite it: placeNameLink gets ldply, the l because we're taking in a list, the d because we're going to a data frame, and we just pass in placeNameLink; it takes care of it all automatically. If we look at the head of placeNameLink, we see that we have a nice data frame with a column for the names of the restaurants and a column for the links.
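Here is a self-contained sketch of the whole name-and-link extraction. The listing markup is invented to mimic the structure described in the lesson (td class "name-address", a class "link"), and plyr is assumed to be installed; one small departure from the transcript is grabbing the href attribute by name rather than by position, which is a bit more robust:

```r
library(XML)
library(plyr)  # for ldply; assumed installed, as in the lesson

# Inline HTML mimicking the restaurant-listing markup
html <- '<html><body><table>
<tr><td class="name-address">
  <a class="link" href="/restaurants/1-aldos-pizza">Aldo&#39;s Pizza</a>
</td></tr>
<tr><td class="name-address">
  <a class="link" href="/restaurants/2-fiores-pizza">Fiore&#39;s Pizza</a>
</td></tr>
</table></body></html>'

doc <- htmlParse(html, asText = TRUE)

placeNameLink <- xpathApply(doc,
    "//table/tr/td[@class='name-address']/a[@class='link']",
    fun = function(x) {
        # name comes from the tag's text, link from its href attribute
        c(name = xmlValue(x, recursive = FALSE),
          link = unname(xmlAttrs(x)["href"]))
    })

# ldply: list in, data frame out
placeNameLink <- ldply(placeNameLink)
placeNameLink  # one row per listing, columns "name" and "link"
```
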
This is a very simple way to grab information stored in XML, both as the value of a tag and as an attribute of a tag. That was a quick intro to XPath and how to use it with the xpathApply function in the XML package. So far we have seen how to parse XML files using XPath, which, while very useful, can be a bit tedious. Now we will see a built-in R tool that makes it a little easier. For this example we will look at an XML file that simulates comments on a social media site. We come to this text editor, and we see that we have this nicely formatted file. It starts off with an opening tag for the social site, then comments, and then there are the individual comments. Here you have an id, when it was published, when it was updated, a category; you have all this information. Perhaps you want to get down to this content, where it says this is the best place to eat, can't get enough of it. Perhaps you want to come down here, where it has information about the location at 7th Avenue South and Perry Street. Maybe we want to go back up to the original content and get the sentimentScore, or the sentimentPositive, which are attributes of the tag. So let's go back to R. We will assign a variable, theFile, which is located at http://www.jaredlander.com/data/socialcomments.xml. Using the XML package, which we'll load again just to be sure we have it loaded, we can very simply parse it. We say theParsed gets xmlToList, and we pass it the location of the file. We now have a nice list in R. We can say length(theParsed) and notice it has just one element; that's because it's a nested list. Let's look at the structure of the object with str(theParsed). We can see we get all sorts of information: it has ripped apart that XML file. We can see it starts as a list of one item, and then there are ten items. The first item in here is the id, and then we go so forth and so on, diving deeper and deeper into this list.
So let's go through here and say theParsed, then the first element of the list, then the first element of that. Then we say dollar sign, and using RStudio's completion we can come here and grab the id, and we see that's the id for the comment. Let's go further: let's do dollar sign "author", and then "$name". We see that I, Jared Lander, am the one who wrote this review for this fake data. Maybe we want to see when it was published, so we say theParsed, again the first element of the first element, and then $published. This was published September 1, 2014. That's gotten us all of the information that was stored in the tags. If we go back to our text editor: so far we've been grabbing information like this published tag, this id tag, or the name tag inside the author tag. But now I want to grab the attributes of the content tag: the sentimentScore and the sentimentPositive. We want to be able to grab 9 and yes. So let's make use of some special R functionality. Again we say theParsed, the first element, and again the first element, then $content, and within there we say $.attrs. This shows us all the attributes of the content tag: there was type, which equals text; sentimentScore, which was 9; and sentimentPositive, which was yes. Now let's say I actually want to grab the sentimentScore. I don't just want to see it; I want the value. So I come back up here and type theParsed, square brackets 1, square brackets 1, $content, then $.attrs, and subset that by the name of the element that I want, which is sentimentScore. And now we get 9, as we had hoped. We have just seen how a simple R function like xmlToList can make parsing a file incredibly easy. That's the beautiful thing about R: you could write it yourself, but odds are a function has already been written to do it for you.
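The xmlToList workflow can be sketched on an inline snippet modeled loosely on the social-comments file from the lesson (the tag names and values here are invented to match what the transcript describes, not the real file):

```r
library(XML)

# A tiny XML snippet modeled on the social-comments file
xml <- '<socialSite>
  <comments>
    <comment>
      <id>101</id>
      <author><name>Jared Lander</name></author>
      <published>2014-09-01</published>
      <content type="text" sentimentScore="9" sentimentPositive="yes">
        This is the best place to eat!
      </content>
    </comment>
  </comments>
</socialSite>'

# xmlToList converts the whole document into a nested list
theParsed <- xmlToList(xmlParse(xml, asText = TRUE))

# Tag values become list elements...
theParsed[[1]][[1]]$id           # "101"
theParsed[[1]][[1]]$author$name  # "Jared Lander"

# ...and a tag's attributes land in a ".attrs" element
theParsed[[1]][[1]]$content$.attrs[["sentimentScore"]]  # "9"
```
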
So you might as well take advantage of these thousands of pre-built packages to make your life easier.