10.2 Reading from URLs - Video Tutorials & Practice Problems
Video duration:
8m
In this section, we'll write another script whose effect will be identical to the one from the previous section, but instead of using curl to download a file and then reading it locally, we'll read the text from the URL directly. I distinctly remember when I first wrote a program to do this; it was actually in the language PHP. Later I learned how to do it in Perl, Python, and Ruby as well. But that first time I wrote a program to read a URL from the live web and download its contents, it felt miraculous. The technique we'll discuss here gives us the power to write programs to access and process practically any public site on the web. By the way, this practice, which is called web scraping, should be done with caution, and it's important to avoid abusing it. As in the previous section, our first step is to find an npm module to do what we need, namely, read the contents of a remote URL. There are actually multiple ways to do it; eventually the web search "node read webpage url" found an article with the necessary steps using a module called request. Let's make a file for our script, the same way as before. All right, there we go; now we're set up. In this case, the way I got started was literally just to copy some code from the documentation. Here it is: it says var request. Now, remember, we use let, but the sample code used var, so I'm just copying it: var request equals require request. We've seen that pattern before. So apparently request here is a function that takes in a URL and an anonymous function with three arguments: error, response, and body. Let's run it to see what happens. Aha, well, something happened. There's a lot of stuff here, and it's kind of confusing, but I think what's going on is that we're looking at the body. Let's comment that out. Aha: error null, status code 200. 200 (OK) is the status code for a successful response.
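The snippet from the request documentation follows Node's standard error-first callback pattern: you call request(url, callback), and the callback receives (error, response, body). Here's a minimal sketch of that calling convention; since the real module needs `npm install request` plus a network connection, a stub with the same signature stands in for it, and the URL and response values below are made up for illustration:

```javascript
// Stub with the same (url, callback) signature as the request module,
// so we can see the callback shape without npm or network access.
let request = (url, callback) => {
  // The real module performs an HTTP GET; here we fake a 200 OK response.
  callback(null, { statusCode: 200 }, '<!doctype html><html>...</html>');
};

request('http://www.example.com', (error, response, body) => {
  console.log('error:', error);                     // error: null
  console.log('statusCode:', response.statusCode);  // statusCode: 200
  console.log('body:', body);                       // the page source
});
```

With the real module, only the first three lines change: `let request = require('request');` replaces the stub, and everything about the callback stays the same.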
And then this stuff here is the body, which, as you might imagine for google.com, is quite extensive. But we start to see the shape of a solution, because the body of the response for phrases.txt is exactly the text we need: it's the same content we used readFileSync to read in the last section. That means we can use the same code we did in the last section, except instead of reading from the file, we can just use body, which is read from the URL. We'll leave error and response here for now, but we're not actually going to need them to write this script. Let's open the palindrome file script; we'll be reusing a lot of the ideas here. In fact, let's just copy the whole thing. All right, we can get rid of this line, since it's repeated, and this stuff will just go inside here. Let's cut it, like that. So what do we need to change? Let's change var to let, which replaces this line here, and we'll want a URL for this request. I like to bind it to a variable, so let's say url equals, and then the URL is this one. Then here we can say request, url, function, error, response, body, removing the space just for stylistic purposes. And instead of let text equal fs.readFileSync, well, the text is just the body. Now, it's important to note that this isn't the stuff inside the body tag; it's the entire body of the HTTP response, so it's the whole page. In this case it doesn't matter, because phrases.txt is just plain text, but the distinction would matter for an HTML page. So let's write let text equals body. We don't really need this variable; I'm just keeping the parallel structure for now, and I'll eliminate text in a second. So there we go: everything else in here is the same. Let's see if it works. Look at that, that's great. We can even prove the output is exactly the same, like this: redirect it to a file (call it from-url) next to the earlier output (from-file), and then use diff. The null result indicates that the two files are identical. Cool. We don't actually need error and response, so let's remove them.
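Since the body handed to the callback is just one big string, the rest of the script is plain string processing. Here's a self-contained sketch of that step, with a hard-coded string standing in for the live body, and a simplified palindrome check standing in for the one built in earlier sections:

```javascript
// Simplified palindrome check, a stand-in for the one from earlier sections:
// keep only letters, then compare the string with its reverse.
let isPalindrome = string => {
  let processed = string.toLowerCase().replace(/[^a-z]/g, '');
  return processed.length > 0 &&
         processed === [...processed].reverse().join('');
};

// Hard-coded stand-in for the `body` string the request callback receives.
let body = "Madam, I'm Adam.\nA man, a plan, a canal: Panama\nNot a palindrome\n";

// The same processing the script applies to the real body:
// split into lines, keep the palindromes, print them.
body.split('\n').filter(isPalindrome).forEach(line => console.log(line));
// Madam, I'm Adam.
// A man, a plan, a canal: Panama
```

The point is that once body arrives, nothing about the palindrome logic cares whether the text came from a local file or the live web.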
All right, well, this is working. We have exactly the same result, but in this case we're reading the text off the live web. Let's polish this up just a little bit. There's really no need for the text variable; I can just call body.split on newline directly. All right, still working. There's one little detail I want to mention before moving on to the next section. Let's take a look at this URL in a browser. Here's what it looks like, and you can see that this is actually served from Amazon AWS, where AWS stands for Amazon Web Services. So this is a redirect. We can use curl -i to see the headers: this is a 301 status code, the 301 redirect, and you can see it redirects to AWS. In this case, request is smart enough to follow redirects, which in curl we do with the -L flag, like this. We saw this before with -OL. So here we have a 301 redirect, then finally 200 OK. But it's important to note that this behavior is not universal for these kinds of libraries: some URL libraries require you to indicate explicitly that they should follow redirects. That's just something to bear in mind. In this case we got lucky, and request did exactly what we wanted: we were able to reuse the palindrome file code to read a text file from the live web. This technique is even more powerful if you're reading an HTML page, whose contents you can then turn into a DOM, which lets you pull out specific elements using JavaScript. We'll learn how to do that in the next section.
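The redirect inspection above can be reproduced at the command line. The URL below is a placeholder, since the actual phrases.txt location isn't shown here; substitute the real one:

```shell
# Show the response headers; a 301 status means the resource has moved,
# and the Location header points at the new address (here, on AWS).
curl -i https://example.com/phrases.txt

# Add -L to follow the redirect automatically; combined with -O to save
# the file under its remote name, this is the -OL combination from before.
curl -OL https://example.com/phrases.txt
```

The request module follows such redirects by default, which is why the script worked without any extra configuration.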