5.2 How does Machine Learning use data?
All right, so let's look at how we use data in the machine learning process. So this is kind of a general look at the process, where you collect the data, then you validate it. So you want to make sure that it's accurate; some people refer to this as cleaning the data. And then you model the data, that's what a data scientist will help you do, and start to make predictions. So you basically start using the system to see how well it works, because that's the next step: check the accuracy of the predictions. And there are going to be mistakes. There's no perfect model. Even after you've been working on it for a long time, it's always going to make some mistakes, just like people always make mistakes. And then what you do is you perform an error analysis. So what does that mean? You try to identify what types of errors are happening, and then you go back, change the model, and repeat. Now that's a little oversimplified, because you may go back and collect more data, or you may go back and clean the data. So there are other things you can do besides changing the model, but often going back and changing the model is where you end up.

Now, when the model is accurate enough, then you can deploy it. And remember, that doesn't mean that it's perfect; it just means it's good enough to solve your problem better than whatever's solving it now. And so that's really what we want to focus on: when is it good enough that we can start using it? Because until you start using the model, you're not getting any value out of it.

So let's look at a very specific example. So this is the process for a common machine learning problem, a document classifier. So what would a document classifier be? Well, there are a lot of different kinds of document classifiers. The word document just basically means it's some type of textual data.
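The collect-validate-model-assess loop described above can be sketched in a few lines of Python. Everything here is a made-up stand-in (the function names, the toy records, and the 0.6 "good enough" bar are all illustrative, not part of any real pipeline), but it shows the shape of the iteration: clean the data, train, check accuracy, and only stop when the model clears your "better than what's solving it now" threshold.

```python
# Minimal sketch of the iterative ML loop, with hypothetical stand-ins.

def validate(records):
    """'Cleaning' step: drop records that are obviously bad (empty text)."""
    return [r for r in records if r["text"].strip()]

def train_model(records):
    """Stand-in for modeling: always predict the majority training label."""
    labels = [r["label"] for r in records]
    majority = max(set(labels), key=labels.count)
    return lambda text: majority

def accuracy(model, records):
    """Check the accuracy of the predictions against known labels."""
    correct = sum(1 for r in records if model(r["text"]) == r["label"])
    return correct / len(records)

# Collect (here: toy labeled data), then validate.
data = [
    {"text": "great service", "label": "positive"},
    {"text": "terrible outage", "label": "negative"},
    {"text": "love this plan", "label": "positive"},
    {"text": "   ", "label": "positive"},  # bad record; validation drops it
]
clean = validate(data)
model = train_model(clean)

GOOD_ENOUGH = 0.6  # deploy once it beats whatever solves the problem today
while accuracy(model, clean) < GOOD_ENOUGH:
    # Error analysis goes here: inspect mistakes, collect more data,
    # re-clean, or change the model, then retrain and repeat.
    model = train_model(clean)

print("accuracy:", accuracy(model, clean))
```

In a real project the retraining step would actually change something (more data, different features, a different algorithm); the point of the sketch is just that the loop has a defined exit condition.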
So when we looked at the system that was taking tweets and identifying whether they were spam or not, that's a document classifier: it's classifying tweets as spam or not spam. Also, when we were looking at sentiment analysis of a tweet (is it positive, negative, or neutral?), the tweet is the document, even though I know that doesn't sound like a document, and the classification is positive, negative, or neutral. And when we were looking at trying to identify the topics of pages on a website, that was a document classifier too. What it was doing was looking at a web page and then, from the list of topics that it had, deciding which one or more of those topics to assign to that document.

So let's look at how the process works when you're creating a classifier for documents. The first thing to do is to define what's called the coding guidelines. So what do we mean by the coding guidelines? Well, you remember when we said that when you're doing supervised machine learning, when you're starting with data, you're starting by having human beings label the data. So for a document classifier, they're gonna label the documents that are in the training data with the correct answers. And so the first thing you have to do is to define the coding guidelines. And you might say, "Well, what does that mean?" What it really is, is the instructions for the human beings to do their labeling.

So let me give you an example. When we talked about the tweets being positive, negative, or neutral, you needed a set of instructions that told the human beings how to decide whether something's positive, negative, or neutral. And you might say, "Well, that's easy. Why do you need a document for that? Anybody can look at something and know if it's positive, negative, or neutral." Well, that's not always the case. And so for example, suppose there is a tweet that says, "I hate AT&T.
"I'm so glad I switched to Verizon." Is that tweet positive, negative, or neutral? And honestly, the answer is, it depends on if you're AT&T or Verizon. (chuckles) And so that's the kind of question that would be in the coding guidelines, is how do you make those kinds of determinations? The next step you're gonna do is to acquire documents. So in the example that we gave about the tweets, you have to be able to get access to all of Twitter's data so that you're pulling those tweets in and you're able to then look at them. Then you give them to the human beings to code. So you're gonna take a small number of tweets, maybe a few thousand, and you're gonna use that as your initial training data and have those human beings code the documents, or label the documents, is another word that they use. And at that point, you're gonna develop the classifier. So you're going to work with the data scientist, the data scientist is gonna come up with what the features are along with you, that makes sense to model, and then you're going to try and come up with a few different algorithms in data science and see which ones seem to be providing better answers than others. And the first way you're gonna do that is to assess it qualitatively. So not with numbers, you're just gonna kind of give it a sniff test. You're gonna look at some examples and say, " Does it look like this is working? "How do I feel about what I'm seeing here? "Do these answers seem okay?" If they do, then you wanna go on to something a little more rigorous, which is a quantitative assessment. So the quantitative assessment actually has you trying to do something statistical where you're saying, "I know what the right answers are for some of these things, "and I'm going to use those right answers to determine "if the model is working well or not." 
And so after you do those assessments and evaluations, then you're gonna start analyzing the errors to see, "Hey, is there always a certain situation where it doesn't work?" Like for example, when it mentions two brands like AT&T and Verizon, does it seem like we don't have very consistent data being labeled? And so then you might want to go back to the coding guidelines, but usually you would go back and get more documents to make more training data, or you might tweak the classifier, or maybe assess it again qualitatively to see, after you've pulled those errors out, what you see with everything else. But you're gonna prioritize those errors and then figure out which previous step to return to. Now at a certain point, you're gonna develop a classifier that works really well, and then you're gonna deliver it. And because it's working, it's in production, and you're done. And so that's kind of the process that you follow.

So let's look at an example of machine learning that has text analytics features in it. So you still label the content, but then there's a text analysis step that happens, because you might have a bunch of features that are text analytics features. And so that feature model is actually going to contain not just data analytics, like we talked about in a previous lesson, but also text analytics. So these are, again, some examples of features that you might extract from text analytics into your feature model. Now, after you've trained the model and you've got a classifier, new content comes in that the model didn't see, wasn't trained on, and the classifier applies a label to it: one of the labels that your human beings put in the training data, along with a certain level of confidence. The higher the confidence level, the more likely it is that the label is actually correct.
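The error-analysis step above amounts to bucketing the model's mistakes by a hypothesized cause so you can prioritize. A minimal sketch, using the "two brands mentioned" situation from the transcript as the hypothesized cause; the brand list, `error_type` function, and sample mistakes are all illustrative.

```python
# Error analysis sketch: count mistakes by suspected cause.
from collections import Counter

BRANDS = ("at&t", "verizon")  # hypothetical brand list for this example

def error_type(text):
    """Bucket a misclassified tweet by whether it mentions two brands."""
    mentions = sum(b in text.lower() for b in BRANDS)
    return "two brands mentioned" if mentions >= 2 else "other"

mistakes = [
    "I hate AT&T. I'm so glad I switched to Verizon.",
    "Verizon vs AT&T, who wins?",
    "service was fine today",
]
print(Counter(error_type(t) for t in mistakes))
```

If one bucket dominates, that tells you which step to return to: inconsistent labels on two-brand tweets point back at the coding guidelines, while scattered errors point at more training data or a model change.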
And so you can think about, for example, if you were doing a topic classifier, a document classifier that was trying to figure out the subject or the topic of a document, you can think of a higher-confidence label as meaning the document is more about that topic than one with lower confidence. It's one way of thinking about it.

Now there are some technical things that your data scientist is going to do that you probably won't get involved in, but I'll just explain one of them here. There's actually a decision about how to do the classification. There are two techniques: one's called the multiple one-way approach, and the other is the multi-way approach. And they kind of sound like the same thing, but they're not. So if you have a multi-way classifier, like the one at the top, basically you have one document classifier, and when a document comes in, it can set a confidence level for any of the labels you want to predict. And so in this instance, you're seeing that it might be trying to classify a document by industry. And it's saying, "Hey, big data and aerospace are the two top ones; government is right after that." And so then you can set a threshold that says, "Hey, anything over 0.8, I'm gonna consider to be close enough to be about that. And I'm gonna say that it gets the three labels of big data, aerospace, and government, and that's what we're going to do."

Now you could accomplish the same thing by sending it through a whole bunch of one-way classifiers. So you have a multiple set of classifiers: the big data classifier, the aerospace classifier, the government classifier, and basically each one just gives an answer that says yes or no. Yes, this is about it; no, it's not. It can have a confidence level as well, but it's not looking at all of the topics; each classifier is only looking at one of them. And so you might say, "Well, why would you pick one or the other?"
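The multi-way approach with a 0.8 threshold can be sketched as below. The confidence numbers are made up to match the example in the transcript (big data and aerospace on top, government right after); a real multi-way classifier would produce these scores itself.

```python
# Multi-way approach sketch: one classifier scores every label,
# and a threshold decides which labels the document receives.

def labels_over_threshold(confidences, threshold=0.8):
    """Return labels whose confidence clears the threshold, highest first."""
    return sorted((label for label, c in confidences.items() if c > threshold),
                  key=lambda label: -confidences[label])

# Hypothetical per-industry confidences for one incoming document.
scores = {"big data": 0.93, "aerospace": 0.88, "government": 0.82,
          "retail": 0.41, "finance": 0.12}
print(labels_over_threshold(scores))
# -> ['big data', 'aerospace', 'government']
```

Raising the threshold trades recall for precision: at 0.9, only "big data" would survive.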
Well, you might get more accuracy one way or the other. Suppose the correct answer in this situation is that this document is about big data and government, but it's really not that much about aerospace. Well, your multi-way classifier would have gotten that wrong, while your multiple one-way classifiers would have gotten that right in this example. And so there are two different methods to classify documents, and your data scientist will look at each of those methods to see which one seems to be giving better answers. So whether your data scientist chooses multiple one-way classifiers or a single multi-way classifier, that's not something you're going to care too much about, but it is one of the many decisions that come up when you're working with data. And so I hope that this section really helped you see that there are a lot of different approaches, and working with your data scientist, you can pick the one that's most effective for solving your problem.
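The multiple one-way approach, by contrast, runs one independent yes/no classifier per topic. A minimal sketch, where each binary classifier is an invented keyword check standing in for a real trained model, and the sample document is constructed to match the transcript's example (about big data and government, but not aerospace):

```python
# Multiple one-way approach sketch: a separate yes/no classifier per topic.

def make_one_way(keywords):
    """Build a hypothetical binary classifier for a single topic."""
    def classifier(text):
        text = text.lower()
        return any(k in text for k in keywords)  # yes/no for this topic only
    return classifier

one_way = {
    "big data": make_one_way(["hadoop", "analytics", "petabyte"]),
    "aerospace": make_one_way(["satellite", "aircraft", "rocket"]),
    "government": make_one_way(["agency", "federal", "regulation"]),
}

doc = "Federal agency adopts petabyte-scale analytics platform"
labels = [topic for topic, clf in one_way.items() if clf(doc)]
print(labels)
# -> ['big data', 'government']
```

Because each classifier decides independently, a document weakly related to aerospace simply fails that one yes/no test, which is exactly the case where this approach can beat a single multi-way classifier.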