4.5 Application: Unique Words
Instructor: Our task is to extract all of the unique words in a fairly long piece of text, something where we can't just eyeball it, and then count how many times each word appears. This is really inconvenient to do in the REPL. So as you can see, I'm back at the terminal window, and we're gonna do this in a file. I'll open up a file called count.js in Atom, and we'll start by defining a constant for our text. You may recall from a couple of sections ago that constants can technically be changed in some ways (the contents of an object or array declared with const can still be modified), but using one is a signal to the reader that this is not something we're planning to mutate. For our block of text, we'll choose one of Shakespeare's sonnets. So const sonnet equals, and now I wanna be able to paste in this block of text and preserve the newlines. It turns out that the template literal, or backtick, notation that we've seen before works for this. So I'm gonna paste it in from my buffer, like that. So let's print it out. Remember, I really like to have some hello world or equivalent just to start off with, to make sure something is working. Aha, that worked.
This is Sonnet 116, by the way. And you may be interested to know that it actually rhymes. You think, well, yeah, of course. If you know anything about sonnets, you know that they rhyme. It has an ABAB rhyme scheme. Minds rhymes with finds, shaken rhymes with taken, but wait a minute, love and remove don't rhyme. So what's going on here? Is Shakespeare a bad poet? And the answer is no. He was maybe the greatest poet in the history of the English language. Love and remove rhymed in the pronunciation of his time. In the same way, come and doom rhymed, and so did proved and loved. So this is one of my favorite sonnets for several reasons, but one of them is that the rhyme scheme doesn't work at all in Modern English, but it works great in the Elizabethan English of the late 1500s. All right.
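As a rough sketch of what count.js might contain at this point, the snippet below defines the constant with a template literal and prints it. The transcript doesn't show the pasted text, so the first quatrain of Sonnet 116 is used here as a shortened stand-in for the full poem.

```javascript
// Stand-in text: only the first quatrain of Sonnet 116, not the full poem.
// The backtick (template literal) notation preserves the line breaks.
const sonnet = `Let me not to the marriage of true minds
Admit impediments. Love is not love
Which alters when it alteration finds,
Or bends with the remover to remove.`;

// A "hello world"-style check that the file runs and the text is intact.
console.log(sonnet);
```

Running the file with something like node count.js should print the four lines exactly as they were pasted.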
So now we need an object to hold all of the unique words. So let's think about it: how are we gonna find the unique words? Well, we're gonna use a great trick that works especially well with these sorts of hashes, or associative arrays, or plain objects. We're gonna use our regex matcher to pull out all the words, and then we're gonna iterate through those words. If a word already exists in our object, which we're gonna call uniques, we'll increment its count. Otherwise, we'll set the count equal to one. That's a lot of words, so let's go through it in the code. This is an empty object.
We can extract the words like this. Let's go back to our regex matcher. Copy this. So I'm gonna paste in this sonnet. And how are we gonna get the words out of here? We wanna skip the spaces. We wanna skip punctuation, although we might wanna keep the apostrophes. That's actually left as an exercise. So how are we gonna match words? Well, let's take a look at our quick reference. This is a good bit. You can see here that JavaScript supports a word character pattern, \w. And remember, from splitting, we saw that \s+ was one or more whitespace characters. Well, we can do \w+. That matches one or more word characters. Pretty cool. It is missing the words with apostrophes, though. As I mentioned, that's left as an exercise. It's actually a little harder than it seems, 'cause you want to bundle word characters and apostrophes together. So I'll just give you a hint for the exercise. There's a way to do that, a way to consider more than one kind of character: it's the square bracket. A square bracket containing \w and an apostrophe will match either of those. So that's a little hint for the exercise, but for now let's extract these words using the string match method we saw a couple of sections ago. So the words in this sonnet are sonnet.match. And then we'll use a regex literal between slashes, and inside it we'll put \w+. And there's one more thing. Actually, we can console.log this thing. Let's run this.
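The code as dictated up to this point might look like the following sketch. The exact layout of count.js is an assumption, since the transcript doesn't show the file, and the sonnet is again shortened to its first quatrain as a stand-in for the full text.

```javascript
// Stand-in text: only the first quatrain of Sonnet 116, not the full poem.
const sonnet = `Let me not to the marriage of true minds
Admit impediments. Love is not love
Which alters when it alteration finds,
Or bends with the remover to remove.`;

// An empty plain object that will eventually hold the word counts.
const uniques = {};

// Match a run of one or more word characters. No regex flags are given yet.
const words = sonnet.match(/\w+/);
console.log(words);
```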
Uh-oh, what happened there? It only matched the first one. That's because we forgot one thing, which is this: g, for global. So this will work, I hope. Aha, look at that.
All right, now we're ready to iterate through this. We're gonna use a for loop, 'cause that's what we know how to use. We'll be learning a better way in the next chapter, but for now we'll do this: for, let i equal zero. This is our loop variable, our little counter. Then i less than the length of the words, which is a property of the array, words.length, and then the increment operator, i++. And for each of these, well, let's just for convenience define a word. This is just the ith word. And actually, we can console.log this. It's really useful to print things out like this as we go along. So let's just build this up.
All right. So now we're going to keep the count inside this uniques variable. If uniques already has the word as a key, we're going to increment the count. Otherwise, we're gonna set it equal to one, because it's the first time. We'll use the plus equals operator. We saw this when we created a string out of two strings. So we're gonna increment this: plus equals one. Else, and this is if it doesn't have the key already, uniques of word starts off at one, like that.
Let's put in some console.log statements to make this a little clearer. Actually, let's do this here. We'll use a template literal, with a colon. Actually, let's just pipe this to less. So: Let, me, not, to, the, marriage, of, true, minds, Admit, impediments, Love, is, not. And now, the second time it gets to the word not, it's already been seen, and so we've hit this branch here. This is a useful thing: just console.log stuff. Actually, we can put this outside the if statement; it's still inside the loop. So: Let, one; me, not, to, the, marriage, of, true, minds, Admit, impediments, Love, is, not. So now you can see that the count has been incremented. One, one, one, two: "the" has appeared twice. "To" has now appeared twice, and so on.
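Putting the steps above together, a minimal sketch of the counting loop could look like this. It uses the first quatrain as a stand-in for the full sonnet, and the key test uses Object.hasOwn, which is a choice made for this sketch; the transcript doesn't show the exact condition used in the video.

```javascript
// Stand-in text: only the first quatrain of Sonnet 116, not the full poem.
const sonnet = `Let me not to the marriage of true minds
Admit impediments. Love is not love
Which alters when it alteration finds,
Or bends with the remover to remove.`;

const uniques = {};

// The g (global) flag makes match return every word, not just the first.
const words = sonnet.match(/\w+/g);

for (let i = 0; i < words.length; i++) {
  const word = words[i];
  // Assumption: Object.hasOwn is one way to test for an existing key.
  if (Object.hasOwn(uniques, word)) {
    uniques[word] += 1; // seen before: increment the count
  } else {
    uniques[word] = 1; // first occurrence: start the count at one
  }
  // Logging inside the loop shows the counts building up word by word.
  console.log(`${word}: ${uniques[word]}`);
}

console.log(uniques);
```

Because the match is case-sensitive, Love and love are counted as separate keys in this sketch. With a long text, piping the output through a pager (for example, node count.js | less) makes the per-word log lines easier to read.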
So you can get some insight into the execution of the program by putting in these log statements. And then at the end, we'll remove them like this and then just output the results. There we go. We see that most of the words appear once, but there are quite a few that appear multiple times. Not, to, and the all appear four times, and so on.
By the way, I'd like to mention that lining up these curly braces can get quite tricky when you have nested curly braces like this, but any good text editor will help you with that. So for example, if I place the cursor here, right next to this closing curly brace, you can see that it's underlined very, very subtly. And the matching underline here indicates that these two braces match each other: this is the opening brace, and this is the closing brace. Same thing if I go here. And similarly, if you go down here: opening brace, closing brace, opening brace, closing brace.
All right, this is a reasonable use of JavaScript plain objects, but it's worth noting that a plain object isn't actually all that flexible a data type. There actually is a JavaScript object specialized for this kind of application, so I just wanna mention it briefly. It's called Map. Let's look at it inside Node. This is a Map object. It's initialized with nothing. If you look at what this is, it kind of looks like a plain object, with these empty curly braces here. But instead of using the square bracket notation, you use the set method, like this. If you want a key called loved, for example, initialized to zero, you would do this. And then to get that value, you would use the get method with the key. And then to increment it, you could do something like this. So in an industrial-strength JavaScript application, it's probably a better idea to use Map for this case, because that's what it's designed for. And converting this program here to use Map is left as an exercise.
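The Map operations described above can be sketched as follows. The transcript doesn't show what was typed into the Node REPL, so the variable name uniques and the exact increment expression are assumptions. This covers only the basic operations, not the full conversion of the program, which remains the exercise.

```javascript
// A Map initialized with nothing.
const uniques = new Map();

// Use the set method instead of square bracket notation.
uniques.set("loved", 0);

// Use the get method with the key to read the value back.
console.log(uniques.get("loved")); // prints 0

// One way to increment: read the current value and set it again.
uniques.set("loved", uniques.get("loved") + 1);
console.log(uniques.get("loved")); // prints 1
```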