9.2 Extract text - Video Tutorials & Practice Problems
Video duration: 31m
Extracting information out of text is an important part of today's world of unstructured data. Doing so requires a lot of tools, the most useful of which is regular expressions. While this isn't meant to be a full lesson on regular expressions, we will cover them lightly. The first thing we need is some fun data to extract out of, so let's use a list of the United States Presidents. To do this, we need the XML package, which we load by saying require XML, and then we want to read in data. It's a bit of a long URL, but it comes from the Library of Congress, and it's good information. So let's create a variable, theURL, and we will say it is http://www.loc.gov, loc for Library of Congress, /rr/print/list/057_chron.html. We can visit this URL and see what it looks like on a website. We have the year the President was in office, the President's name, the First Lady's name, and the Vice President's name. So let's go back to R, run this, and now actually read in that data. So we create a variable name, presidents, and we readHTMLTable. We put in the URL, and we specify it is the third table. Now going back to the website, how do we really know it's the third table? From playing around and checking out the source code, we saw it was the third table in there. So some of this stuff isn't so automatic; you really need to play with it, figure out what you're doing, and find the patterns in the sites you are pulling from. So we tell it it is the third table, and we want this to come back as a data frame. It's possible sometimes, with ill-formed websites, to get other structures back, but we want this function to give us back a data frame, if at all possible. We also want to skip the first row; that's due to, again, the way it's all formatted. Then we say header=TRUE and stringsAsFactors=FALSE. So we run this, it goes and reaches out to the website, and we can see what it looks like by saying head of presidents.
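The step described above can be sketched as follows. This is a minimal sketch, assuming the XML package is installed and the Library of Congress page still has the same layout (the argument names match the `readHTMLTable` interface in the XML package):

```r
# Scrape the table of Presidents from the Library of Congress page,
# assuming the XML package is installed and the page is still live.
require(XML)

theURL <- "http://www.loc.gov/rr/print/list/057_chron.html"

# which=3 asks for the third table on the page; skip.rows=1 drops the
# malformed first row; stringsAsFactors=FALSE keeps text as characters
presidents <- readHTMLTable(theURL, which = 3, as.data.frame = TRUE,
                            skip.rows = 1, header = TRUE,
                            stringsAsFactors = FALSE)

head(presidents)
```

Which table number to use is something you discover by inspecting the page source, as discussed above; there is no automatic way to know it is the third one.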
But you see we get the year, the President, the First Lady, and the Vice President. Now, there are some things in here, some HTML issues, where a \n is an escape character, and there's no image, but we'll hold off on that for now. Let's check out the tail of presidents. Get down towards the bottom, and there are a few missing entries here, just some weirdly shaped data due to the way the website came through. Let's explore this a little bit more and see what else we can work with here, because some of this data's a bit misshapen. And this is a big part of doing data science: 80% of the work is often just munging the data and getting it ready to analyze. So let's check out the tail of presidents$YEAR. We see there's not just information on the year, there's all this other text. It's a poorly written table with some bad things happening in it. So to clean this up a little bit, we're just going to take the first 64 rows. So that's presidents gets presidents one through 64, keeping all the columns. We're choosing 64 because we can see that's where the trouble begins. So we run this line of code, and now our data will be much better. Now let's clear the screen. R has a number of functions already pre-built for working with strings. That said, as with many things in R, Hadley Wickham has a package that just works better. It's easier to use and more understandable. So we will load this. It's called stringr, pronounced "string R" or "stringer", depending on how you want to say it. Now that we have that package loaded, we can go through and split up the year. If you recall, the year is generally written as a four digit year, a dash, a four digit year. There are some circumstances, such as with William Henry Harrison or Franklin Pierce, where there's only one year. That's because the President either died in office or only served part of a term; there are many different reasons, so that happens sometimes. So let's scroll back up here and come back to R.
Let's go through and split apart the years and separate the beginning year from the ending year. We'll make a new variable called yearList, and that's gonna take the result of splitting up the data based on a hyphen. The function inside stringr to do this is str_split. The first argument is the string you're putting in; in this case that is presidents$YEAR. The second argument is the pattern; that's what you're going to split on, in this case a hyphen. And the third argument, which we'll look at later, is n, the maximum number of pieces to return: a string could be split five times and you might just want the first three pieces. For now, we don't care about that. So let's run this, and now we can look at the head of yearList. And we see here, it came back with an empty list. That's because, once again, R is case-sensitive. Putting in year in lowercase letters doesn't work when the column name is stored in all uppercase letters. So running this again should get us the proper results. And we can see it nicely splits the data: 1789 separate from 1797, 1797 separate from 1801. Now what we would like to do is combine this into a nice, easy to use matrix, because lists can be a bit of a pain. Well, there are many ways to do this. We could use some version of ldply, or lots of other options, but the safest and easiest way will be to use a combination of rbind, which stacks vectors into rows, Reduce, a special function for repeating another function again and again, and data.frame, to make sure it all gets wrapped up into a data frame. A data frame is just a special version of a matrix. So we say yearMatrix gets data.frame of Reduce. Now Reduce is gonna take rbind and repeatedly call it on yearList. First it will call it on the first two elements of yearList, so that means 1789 1797 will be stacked on top of 1797 1801; it takes the result of that, tacks on the next one, and then the next, and the next. So we run this and we get a warning message.
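The split-and-stack step above can be sketched offline with a few hand-typed example years standing in for the scraped presidents$YEAR column:

```r
library(stringr)

# stand-in data instead of the scraped presidents$YEAR column
years <- c("1789-1797", "1797-1801", "1841")

# split each entry on the hyphen; entries with no hyphen stay whole
yearList <- str_split(string = years, pattern = "-")

# Reduce repeatedly calls rbind: first on elements 1 and 2 of the list,
# then rbind's that result with element 3, and so on.
# Note: a single year like "1841" gets recycled into both columns.
yearMatrix <- data.frame(Reduce(rbind, yearList))
yearMatrix
```

The default column names come back as X1 and X2, which is why the next step renames them.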
That's because some of the row names were repeated. It looks scary, like it's an error, because it comes across in red, but it is indeed just a warning message, and things worked. We can confirm that by typing in head of yearMatrix, and again we need to watch the capitalization. And we see here we have two columns, X1 and X2, storing the beginning and ending years. Now to make this look a little bit nicer, we're going to give these columns names. So we say names of yearMatrix, which, in actuality, is a data frame, even though we're calling it a matrix, and the columns are gonna be called Start and Stop. Now when we look at it, the columns are nicely named. So what we can do now is combine these columns back onto the original data set. The reason this works, and we can just use something as simple as cbind, is because everything is kept in order. The first row of this table corresponds with the first row of the presidents table. It's a perfect alignment, and that's why it's simple to do something like cbind. Let's go ahead and do it now: presidents gets cbind of presidents and yearMatrix. Now in this cbind, presidents is on the left, so in the resulting data, the original presidents data will be on the left, and the rightmost columns will be these new columns. So we look at head of presidents, and we see it's the same data as before, but now we have more information. We can also look at the tail of presidents. And we see that indeed, Barack Obama is 2009 to present, and therefore he has a start time, but not a stop time. Now let's say, for some reason, you just want to get the first three characters of each of the Presidents' names. Not sure why you would want that in this particular instance, but it is often a useful thing to do. So let's clear the terminal, and use a special function called str_sub. This lets you get a substring. The first argument is, again, the string, and in this case that will be presidents$PRESIDENT. And we use tab completion to automatically fill in the name.
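The renaming and cbind steps look like this, again sketched with made-up stand-in data rather than the real scraped table:

```r
library(stringr)

# stand-in for the scraped presidents table
presidents <- data.frame(YEAR = c("1789-1797", "1797-1801"),
                         PRESIDENT = c("George Washington", "John Adams"),
                         stringsAsFactors = FALSE)

yearList <- str_split(presidents$YEAR, pattern = "-")
yearMatrix <- data.frame(Reduce(rbind, yearList))

# give the two new columns meaningful names
names(yearMatrix) <- c("Start", "Stop")

# cbind works because the rows of yearMatrix line up
# one-to-one, in order, with the rows of presidents
presidents <- cbind(presidents, yearMatrix)
presidents
```

Because presidents comes first in the cbind call, the original columns stay on the left and Start and Stop land on the right.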
The second argument is start, and we want the starting position to be the first character, and the last argument is end, which we want to be the third character. We run this, and we get a nice long vector with the first three letters of each President's name. Now that seems simple enough, but perhaps we want the fourth through eighth characters of the President's name. So we will take this line of code, copy it, paste it, and instead of saying go from one to three, we will say go from four to eight. We run that, and now we have that portion of their names. Now that's not difficult, and pretty nice to use. Let's say we want to find all those Presidents whose first year in office ended in a one, such as 1801, 1841, 1981. What we do then is we say presidents, open a square bracket for subsetting, and we're gonna say str_sub, and here we will do the usual, string=presidents$Start, because we want the year they started. We'll say the starting character should be four, because we're looking for the fourth digit in the year. Each of these years is a four digit year, and we only care about the fourth digit. So likewise, we are going to end at four. We only want those rows where this equals one, where indeed the fourth digit of the year was one. And we don't care about all the columns so much, so we are just going to keep the following columns: year, president, start, and stop. So YEAR, PRESIDENT, Start, and Stop. This allowed us to subset both by row and by column. When we run this, we get a nice table showing the Presidents whose first year in office ended in a one. Thomas Jefferson, William Henry Harrison, John Tyler, Chester A. Arthur, McKinley, Roosevelt, so on and so on. So this function so far has made it real easy to grab text when you know what character you want to start at and what character you want to stop at. But let's say you don't want that, you want to do more of a generic search. Perhaps any President who has the word john in his name. This could be someone like John F.
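Both uses of str_sub, grabbing substrings and subsetting by the fourth digit of the start year, can be sketched with a few stand-in names and years:

```r
library(stringr)

# stand-in data in place of presidents$PRESIDENT and presidents$Start
prez   <- c("George Washington", "John Adams", "Thomas Jefferson")
starts <- c("1789", "1797", "1801")

# first three characters of each name
str_sub(string = prez, start = 1, end = 3)

# fourth through eighth characters
str_sub(string = prez, start = 4, end = 8)

# keep only the names whose start year has a 1 as its fourth digit
prez[str_sub(string = starts, start = 4, end = 4) == "1"]
```

In the real data the same comparison goes inside the row position of the square brackets, with the column names listed after the comma, so you subset rows and columns at once.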
Kennedy, or it could be someone like Lyndon Johnson. So you can't just search for the beginning of a word; you have to be able to search anywhere in the word. So, let's go ahead and search for john. Let's clear the console to make things neat. The function we will use for regular expression searches is str_detect. This is from Hadley's stringr package. There are functions built into R to do regular expressions, but they're a bit confusing, and not nearly as friendly as Hadley Wickham's stuff. We're going to search in the presidents dataframe in the PRESIDENT column. And we are going to search for john. Running this will get back all FALSEs. It couldn't find john anywhere in all of the Presidents. Why is that? Because it is case-sensitive. We either need to search for it carefully, or we need to tell it to ignore the case. So, let's do this again. This time we're going to say str_detect, presidents$PRESIDENT, and instead of just putting in john, we're going to say ignore.case john. Now, ignore.case takes the form of a function, and then you can search this way. The built-in regular expression functions let you set ignore.case as an option, whereas in str_detect, you need to feed the string inside a function that says ignore.case. When we run this, we get a number of TRUEs. What this did was give us TRUEs and FALSEs for whether or not that President had john in his name. Let's take this search, copy it, and use it to subset the presidents, just by putting it in the row selection of the square brackets. Run this, and we get this nice big dataframe which is hard to read, so we will view it in the built-in RStudio viewer. And we see John Adams, John Quincy Adams, John Tyler, Andrew Johnson, John F. Kennedy, Lyndon B. Johnson, lots of options here. The reason Lyndon Johnson is listed twice is because of his Vice President. For part of his term, he had no Vice President, and for another part, he had Hubert Humphrey. So it displays as two rows.
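Here is the same search, sketched with stand-in names. Note that the ignore.case() helper used in the video belongs to older stringr versions; in current stringr the equivalent is the regex() modifier with ignore_case = TRUE, which is what this sketch uses:

```r
library(stringr)

# stand-in data in place of presidents$PRESIDENT
prez <- c("John Adams", "Lyndon B. Johnson", "Ulysses S. Grant")

# case-sensitive search: lowercase "john" matches nothing
str_detect(string = prez, pattern = "john")

# current stringr: wrap the pattern in regex() to ignore case
# (older stringr versions used ignore.case("john") instead)
hasJohn <- str_detect(prez, regex("john", ignore_case = TRUE))

# use the logical vector to subset
prez[hasJohn]
</imports stripped>
```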
Now we want to move on to even more complicated regular expressions, because they are an incredibly powerful tool. All we've done is search for a hard-coded string. Regular expressions are great for searching for patterns. I've put up a bit of data on my website that has information on the different times that the United States has been in wars. So let's do con gets, and since we are going to be loading an R data file, we can't just call load on the URL as we would for just about any other thing in R; we need to wrap it in the special url function. And here we will go to http://www.jaredlander.com/data/warTimes.rdata. So now we have this connection, and we can load from the con. And the reason we get this error is because even sometimes I forget my own URL and put an extra period in there instead of using the correct one. So let's try this again, and now it should load properly. That magic number error is sort of like an MD5 sum check, where it makes sure the data is appropriate to load into R, and clearly that fake URL wasn't appropriate. Now that the data has been loaded, let's close the connection, because there's no need to keep it open. Let's look at the data we pulled down. We'll look at the first 10 entries. It is a vector with years, and sometimes months, sometimes days, sometimes just one date, sometimes two, and sometimes there is a weird separator here of garbled characters, something like ACAEA; this is something we're really gonna have to hack through. So we're gonna be getting down and dirty. Let's clear the console. First, let's find any times where the separator between the start and end time is a hyphen or a dash. So we will call warTimes, and we will subset it based on the results of str_detect, which does a regular expression search. We will feed it warTimes, and the pattern is a hyphen. Looking at that, we see only two instances where there is a hyphen inside warTimes. Because of this, we can't just split the data on a hyphen; we have to split it on both a hyphen and this ACAEA.
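The connection dance for a remote .rdata file looks like this. This sketch assumes internet access and that the URL is still live:

```r
# load() on a remote .rdata file needs a connection, not a bare URL string
con <- url("http://www.jaredlander.com/data/warTimes.rdata")
load(con)        # creates the warTimes object in the workspace
close(con)       # no need to keep the connection open afterwards

head(warTimes, 10)
```

If you mistype the URL, load() fails with the "magic number" error mentioned above, because whatever the server returned is not a valid R data file.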
So what we do here is theTimes gets str_split, which will split the data based on some pattern. The string will still be warTimes, but the pattern is now going to be something a little different. It has to be in quotes, and we're gonna search for either ACAEA or a dash. So what we do is we put the ACAEA inside grouping parentheses, so the regular expression knows those characters all go together. Then we use a vertical pipe for or, and a dash for the hyphen, and we tell it the maximum number of pieces we want is two. We do this because, in this very instance right here, if we let this split into more than two, we'd get everything before ACAEA, we'd get mid, and then we'd get July 1944. So let's run this and we can check it out. We get a list mostly of two element vectors with the beginning and end times of the wars. Some of these will just be a single element, some will be two, and like we mentioned before, some will be a month and a date, some will just have a month, some just a year: lots of different information in here. Let's go through this list and just grab the starting time of the wars. We'll ignore the ending time and just look at the start time. We can do that by saying theStart gets sapply, because we're providing a list and we want the result back as a vector. We're feeding in theTimes, and the function we're going to use is a function we'll build on the spot. What we're gonna do is, for each element of the list, grab just the first entry in that element. So we say function x, and we're just gonna grab x square bracket one. That goes through, and we see that we grabbed just the time the war started, not the time the war ended. However, if you notice here, it's still kinda ugly. There's a trailing space after 1774, again after 1774, after 1775, lots of trailing spaces in this data. Fortunately, there's a function to fix that.
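A sketch of the split-and-grab-first step. The real file's separator is a run of garbled characters (spoken as "ACAEA" in the video); here the literal placeholder string "ACAEA" stands in for it, and the data is made up:

```r
library(stringr)

# stand-in data; "ACAEA" is a placeholder for the garbled separator
warTimes <- c("June 1774 ACAEA October 1774",
              "1775ACAEA1783",
              "January ACAEA mid-July 1944")

# split on either the garbled separator or a hyphen; n = 2 caps the
# result at two pieces, so the hyphen inside "mid-July" doesn't
# produce a third split
theTimes <- str_split(string = warTimes, pattern = "(ACAEA)|-", n = 2)

# for each element of the list, grab just its first entry: the start time
theStart <- sapply(theTimes, function(x) x[1])
theStart
```

Notice the trailing spaces left behind on some entries; the next step cleans those up.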
It is str_trim, and you feed it the data you want to trim (again, you need to use proper capitalization). What this function does is trim off any leading or trailing white space in each element of the vector. So any leading spaces and any trailing spaces are now gone. Everything we've shown so far is detecting strings, saying hey, where does this exist, or splitting strings. Let's say we want to extract information, to pull it out. For this example, let's go through the data and pull out January any time we find it. For that we will use str_extract. The first argument is string, so for that we'll use theStart, and the second argument is the pattern, which for us will be January. We run this, and we get a long vector where, if January was found, that's all that returns, and if January was not found, NA is returned. Now this can be helpful at certain times, but getting all those NAs can be annoying. Instead of just getting January, if we want to get all the information where January existed, we can subset the vector by those elements where January was detected. So we do that by saying theStart, using the square bracket, str_detect, the string is theStart, and the pattern is January. Remember, this pattern is case-sensitive. We run this, and we see a typo: argument names are sensitive too, and that's a good reason to use tab. Instead of trying to type in the argument name yourself, or even a long variable name, use tab completion to your advantage. So if we run this again, we see we get all the times where a war started some time in January. Sometimes it just says January; they didn't tell us the year or anything. That's because the year usually came later in the date. Other times we have the year, other times a full date. So there's lots of different quality of information, and that is a very common theme: different quality of information.
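Trimming, extracting, and the detect-based subset can be sketched together with a few made-up start times:

```r
library(stringr)

# stand-in start times, some with stray white space
theStart <- c("June 1774 ", " 1775", "January ", "January 21, 1846")

# remove leading and trailing white space from each element
theStart <- str_trim(theStart)

# extract "January" wherever it occurs; NA where it doesn't
str_extract(string = theStart, pattern = "January")

# or keep the whole entries that contain January, NAs and all avoided
theStart[str_detect(string = theStart, pattern = "January")]
```

The extract version returns only the matched text (or NA), while the detect-based subset keeps the full entries, which is usually what you want when the surrounding context matters.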
So now let's say we want to extract only those numbers where there are four in a row, meaning it's a year. If there's one digit it could be a day, if there are two digits, that could also be a day, but four digits means a year. So let's search for that situation. We'll say head, because we don't want to see all of it, then str_extract. In here we will use tab completion this time, the string is theStart, and the pattern, how do we say we want any digit four different times? What we could do is a special case: in square brackets, zero dash nine means match any number zero through nine. But we want that four times, so we can write zero dash nine, zero dash nine, zero dash nine, zero dash nine. So we close out the quotes, and we close the str_extract, and we want to see 20, so we'll tell head the number of rows is 20. Here we get the instances where there were four digits in a row, and the NAs were non-matches. Even this was a bit of a pain in the butt. What happens if you want 16 numbers in a row? You most likely don't want to sit here typing square bracket zero dash nine square bracket 16 times. Luckily, there's a shortcut. We will say head again, because we don't want to see it all, str_extract, once again the string is theStart, and the pattern is square bracket zero dash nine, but only once, then to the right-hand side of it, in curly braces, you can put the number four. This means it is searching for any number four times in a row. If we run this, we get the same results. Either way, that zero dash nine can be painful, so there's yet another shortcut. So we say head once again, str_extract, we say the string is theStart, and we say the pattern is \\d. Now \\d stands for digit, searching for any digit. In most other languages, you probably only need one slash; in R you need two slashes. And don't forget, we want to find four numbers in a row, so we put the curly braces with the number four, close down our pattern, and again, we're looking for the first 20 entries.
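All three ways of writing the four-digit pattern give the same answer; here they are side by side on a few stand-in entries:

```r
library(stringr)

theStart <- c("June 1774", "1775", "January", "January 21, 1846")

# the long way: four explicit character classes
str_extract(theStart, "[0-9][0-9][0-9][0-9]")

# the same thing with a repetition count in curly braces
str_extract(theStart, "[0-9]{4}")

# and with the digit shorthand; R string literals need the backslash
# doubled, so \d is written "\\d"
str_extract(theStart, "\\d{4}")
```

Entries without four digits in a row come back as NA in each case.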
And here we have it, the same exact information, but much easier to write. So now, let's say we arbitrarily want to find any string that contains either one digit, two digits, or three digits all in a row. The way we can do that is head, because we don't want to see all of it, str_extract, the string is theStart, and the pattern, we don't need to name the argument, so we won't, is going to be \\d, curly brace one, comma three. And remember, this needs to be in a string. What this says is: find one, two, or three digits in a row. We run this and we get either 1, or 177, or 14, or 181. What we did not find was any situation where there were no digits; and note that in a four digit year, only the first three digits get matched, since the match is at most three digits long. Perhaps we want to find just those entries where the four digit year was at the beginning of the entry. We can do this, once again, with regular expressions: head, str_extract, and here we're not gonna name arguments anymore because they will match positionally; we do need to give it theStart, and here, our pattern is going to start with a little caret. This means search the beginning of the line. It means don't find our four digit year just anywhere; find it at the very beginning, it has to start the line. Again, we do \\d, then the number four in curly braces, just to say we want it four times. Let's look at the first 30 of them this time. The reason we see that plus continuation prompt here is because we only closed off one of the parentheses, not both. So in order to remedy this, we need to come down to the console and hit escape, then come back up here and put the missing parenthesis in. We run this and we get the situations where the four digit year was at the start of the line. Let's say we want to be even more specific. We want to find entries where the four digit year was the only thing on the line. That means the line started with it and ended with it.
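The bounded repetition and the caret anchor, sketched on a few stand-in entries:

```r
library(stringr)

theStart <- c("June 21", "1774", "January 2, 1777", "September")

# one, two, or three digits in a row; greedy, so a four digit year
# yields its first three digits ("177"), not just "1"
str_extract(theStart, "\\d{1,3}")

# the caret anchors the match: a four-digit year, but only if the
# entry starts with it
str_extract(theStart, "^\\d{4}")
```

Without the caret, "January 2, 1777" would still yield a four-digit match from 1777; with it, only entries that begin with the year match.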
So we do head of str_extract, theStart, and this time our pattern will indeed start with the caret, \\d, curly brace four, but this time we end it with the dollar sign, which means the end of the line. We look at the first 30 of these, and again, we have to close that middle parenthesis, so we go down to the console and hit escape, come back up here and run the line, and once again, case sensitivity is indeed an issue. We run this line of code, and we see these are all entries where the year was the only thing mentioned for the starting time. Regular expressions also allow you to substitute text for text that you found. Let's clear the console and take a look at this. Let's do head of str_replace. The first argument is going to be the string, so that's theStart; the second argument is the pattern we are searching for, in this case \\d; and the replacement will be an x. And let's look at 30 of these. Before we run this, the mistake we keep making is that we keep forgetting the closing parenthesis for the regular expression. We put that in, run it, and now we see, in many places, one of the digits was replaced with an x. Now this might seem odd. Why does only one of them, in fact only the first one, get replaced? That's because str_replace only replaces the first match. In order to replace all of the matches, we need to use a slightly different function. We'll say head, str_replace_all, and this will replace all of them. We say theStart, the pattern is a digit, and the replacement is an x. We close out the parenthesis, and check them out. Now all digits are replaced by x's. Now that is replacing each individual digit with an x. Maybe we just want to replace all the digits at once with one x. So if we have a year, like 1815, we get one x instead of four x's. To do that, we once again will use str_replace_all, we say theStart, but this time, our search pattern has to be a little smarter.
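The dollar-sign anchor and the two replacement functions, sketched on stand-in entries:

```r
library(stringr)

theStart <- c("1774", "June 1774", "1775 to 1783")

# ^...$ means the four-digit year must be the entire entry
str_extract(theStart, "^\\d{4}$")

# str_replace only replaces the first match in each string...
str_replace(theStart, pattern = "\\d", replacement = "x")

# ...while str_replace_all replaces every match
str_replace_all(theStart, pattern = "\\d", replacement = "x")

# replace each clump of one-to-four digits with a single x
str_replace_all(theStart, pattern = "\\d{1,4}", replacement = "x")
```

The last call previews the smarter pattern discussed next: the quantifier makes the whole run of digits one match, so each year collapses to a single x.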
We will search for any digit one through four times, and we will replace it with an x. We look at that, and now it's a very similar result, but instead of each digit being replaced, we have every clump of digits being replaced with one x. For instance, up here where we had June xx comma four x's, now we just have June x comma x. Now regular expressions can be even more powerful than this. They let you replace text with text that you have found. This is very useful, because sometimes you need to search for some sort of text and replace it with itself plus a little bit of something else. So to do this we're going to make a fake HTML entry to see how this would work. Let's make a new vector called commands, and the reason we're using HTML is because so many times when you scrape websites, you're using all sorts of crazy regular expressions, so it's good to practice with them. This vector's going to have two elements. The first one will be an opening anchor tag, a href=index.html, then the text the link is here, and then the closing anchor tag. The next entry is going to be this is bold text, wrapped, of course, in the HTML bold tags. So let's run this; we now have this nice new vector, and we will take a look. We have a two element vector. What I want to do is find and extract just the text that's between the tags. That means in the first element, I want it to say the link is here, and in the second element, this is bold text. The best way to do that is to search for the entire tagged string, and then back substitute the text in between. So this is going to take a complicated pattern. We will use str_replace, and the string will be commands. The pattern will be, and this is where regular expressions become complicated, we wanna find that first tag, so we know we need this opening tag at the front. And we know we're going to want some sort of text that we'll want to substitute back in.
That text will be right here. And these parentheses are important; they're grouping parentheses, and we'll explain that in a bit. And then we want to find the closing tag. Now, these tags could have anything in them. For instance, in the first element, the first tag is a, space, href=index.html, while in the second element, it's just a b. So in order to search for anything, we can use a period; that's a wild card, and it will match one wild character. What we want, though, is to search for more than one wild character. A single wild card will get us the b, and it'll get us the a, but that's it. It wouldn't get us the a href=index.html. So we add a plus: search for any amount of wild cards. That'll match anything. The problem is, it's going to keep going and match the whole line without stopping. So you need to put a question mark in there to tell it to be lazy: check one character at a time, until we hit our next literal character. So what it's gonna do is keep consuming wild card characters until it finds our greater than symbol. Then, for the text we're capturing, it's a similar idea: we're searching for any text, but we want it to stop before it gets to the next less than symbol. So same idea: period, plus, question mark. It's a lazy wild card search. Then we want to close off our closing tag with just a period plus, because it's going to the end of the line in this particular situation, so we don't need to worry too much about it. Now the cool thing is, it's gonna find this whole pattern, which is the whole element of text, and rip it out. And it's going to replace it with something. We could replace it with just an arbitrary thing, like the word hi, but we really want to replace it with the text that was found. To do that, we can use a special tag, \\1. That special tag refers to the first grouping of parentheses, takes whatever was matched in there, and reinserts it. If we had two groupings of parentheses, we could have used \\2 to refer to the second one.
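Putting the whole back-reference pattern together on the fake HTML vector described above:

```r
library(stringr)

# fake HTML entries to practice on
commands <- c("<a href=index.html>The link is here</a>",
              "<b>This is bold text</b>")

# <.+?>  lazily matches an opening tag, whatever is inside it
# (.+?)  lazily captures the text between the tags as group 1
# <.+>   matches the closing tag through to the end of the string
# \\1    in the replacement reinserts the captured group
str_replace(commands, pattern = "<.+?>(.+?)<.+>", replacement = "\\1")
```

Without the question marks the wild cards would be greedy and swallow past the first greater-than symbol; the lazy versions stop at the first one they reach.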
Let's see this in action. We got the link is here, and this is bold text, just as we wanted. The ability to search through text is crucial in today's big data world. Whether you're just doing simple splits on characters, or you're pulling out specific indexes of characters, maybe the third through fifth one, or you're doing some crazy advanced regular expressions, knowing these tools really eases up your data transformation process.