6.1 How do we correct errors? - Video Tutorials & Practice Problems
Video duration:
7m
All right, so one of the things to think about that's an issue for people is correcting errors. Now, we've talked a lot about correcting errors in the model before, but we've done it in a technical way: there's the right answer and the wrong answer, and we want to make the model more accurate so it gets the right answer more often. But there are also situations where you care about the confidence that your users have in the model, and people don't always decide that accuracy is the most important thing for them. That might surprise you, but we're going to give you some examples.

One of the things to think about when you're looking at the errors that occur, if we take the example of a chatbot, is that there are different kinds of errors. Did we get the question wrong but the answer right? Did we get the question right but the answer wrong? You might find in your situation that people hate one of these errors more than the other. So if you could make your model statistically more accurate, but the errors that remain are the ones your users hate, that might not be better. Maybe they really detest you getting their question wrong, so that you don't even understand what they were asking; maybe they find that more frustrating than a case where you understood the question but the answer wasn't quite right. Neither of those is good, but it depends on your situation and your users which one is the worse problem to have. So sometimes you want to ask whether accuracy is always the best measure of success.

Here's an example from the IBM Watson system, when it was playing Jeopardy against human opponents, including the greatest-of-all-time Jeopardy champion, Ken Jennings. Here is a particular situation that Watson got into. The Final Jeopardy category was US Cities, and you can see the clue.
It says: its largest airport was named for a World War II hero, its second largest for a World War II battle. And Watson's answer was "What is Toronto?" Now, I don't know if you know the right answer to that question, or the right question, I guess. But the truth is that even if you don't, you know that Watson's answer is wrong, because Toronto is not a US city. What happened here is that Watson was tuned to be as accurate as possible, and not to use the category to screen its answers. The reason it did that is that in the main part of the Jeopardy game, the categories are sometimes puns and jokes and other wordplay, and Watson was actually getting more answers wrong by trying to limit itself to the category. The thing the designers forgot was that the categories are never like that in Final Jeopardy, so they should have put that filter back on.

What this indicates is that your AI model can make mistakes that human beings would never make, and when your AI model does that, it can reduce the confidence that your user has in the system. Now, this is kind of a trivial example, because who cares whether Watson does a good job on Jeopardy? But suppose your hospital was using Watson to diagnose your medical condition, and Watson came out with a crazy diagnosis that absolutely couldn't be true, even if someone swore to you that Watson's diagnoses are more accurate than doctors'. We call this a "howler," because it makes you howl in pain, it's such a bad answer. If the system brings up something so far off that even someone who doesn't know the right answer knows it's wrong, that can undermine people's confidence in your system. And it could be that that's a worse problem to have than being a little bit less accurate. So that's something to consider as you're designing your system. Let's use a marketing example.
Let's go back to the example we've used a couple of times: sentiment analysis, where we're trying to find out which subjects being discussed in social media get more positive or more negative reactions from people. So we remember this example. Well, a howler in this situation would be identifying something as positive that's actually negative, or vice versa, identifying something as negative that's actually positive: the complete opposite. Whereas a mistake that's still a mistake, but not as bad, is to identify something that's actually positive as neutral, or something that's actually negative as neutral. If you were just scoring things numerically, using the accuracy counts we showed before, with F-measure, precision, and recall, those mistakes would be counted as equal. But human beings don't think they're equal. Human beings think that if something was positive and you called it negative, they start to question your whole system. So you might find that it's better to be somewhat less accurate numerically, but for the mistakes that remain to be smaller ones, with no howlers.

Now, one way you can improve your system is to solicit user feedback. Soliciting user feedback means asking the users of your system to correct things that were wrong. You can ask questions like: did this give you the right answer? Do you think this answer should have been different? Now here's the problem. Remember when we talked about creating that gold data and using inter-coder agreement? What you really want to look at is whether the corrections your users give back to you are ones that other people would agree with. The other thing that can be hard is that you haven't trained your users the same way you trained the human labelers of your gold data.
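One way to make that agreement check concrete, just as with gold data, is to measure inter-coder agreement on the corrections users send back. Here is a minimal sketch using Cohen's kappa; the two users, the ten items, and the "right"/"wrong" labels are made-up illustration data, not from the lecture:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both raters labeled identically.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each rater's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical feedback: two users judging the same ten chatbot answers.
user_1 = ["right", "wrong", "right", "right", "wrong",
          "right", "wrong", "right", "right", "right"]
user_2 = ["right", "wrong", "right", "wrong", "wrong",
          "right", "right", "right", "right", "right"]

print(round(cohens_kappa(user_1, user_2), 2))  # 0.52: only moderate agreement
```

A kappa near 1 means users correct the system consistently; a value like this, closer to 0.5, is the warning sign the lecture describes, where untrained users disagree enough that their feedback may not be trustworthy training data.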
You didn't give them a document that told them, here is how to make those decisions. So they might be inconsistent with each other, or they might agree with each other but for different reasons, which can lead to inconsistencies in the data. So you really ought to think about whether correcting errors by taking input from your users is actually going to make the system better, or whether it's going to make it worse.
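Coming back to the earlier point that not all mistakes are equal: the gap between plain accuracy and how badly the mistakes hurt can be sketched with a howler-weighted cost on the sentiment example. The cost values below are illustrative assumptions, not numbers from the lecture; the idea is only that opposite-polarity errors are penalized much more than positive-or-negative-called-neutral errors:

```python
# Assumed, illustrative cost matrix: (gold label, predicted label) -> penalty.
# Opposite-polarity mistakes ("howlers") cost 5; neutral confusions cost 1.
COST = {
    ("positive", "negative"): 5,  # howler
    ("negative", "positive"): 5,  # howler
    ("positive", "neutral"):  1,
    ("negative", "neutral"):  1,
    ("neutral",  "positive"): 1,
    ("neutral",  "negative"): 1,
}

def accuracy(gold, pred):
    """Fraction of predictions that exactly match the gold labels."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def weighted_cost(gold, pred):
    """Total human-perceived penalty; correct predictions cost nothing."""
    return sum(COST.get((g, p), 0) for g, p in zip(gold, pred))

gold    = ["positive", "negative", "neutral", "positive", "negative"]
model_a = ["positive", "positive", "neutral", "positive", "negative"]  # one howler
model_b = ["neutral",  "neutral",  "neutral", "positive", "negative"]  # two mild errors

print(accuracy(gold, model_a), weighted_cost(gold, model_a))  # 0.8 5
print(accuracy(gold, model_b), weighted_cost(gold, model_b))  # 0.6 2
```

Model A is more accurate (0.8 versus 0.6), but its single howler makes its weighted cost higher, which is exactly the trade-off the lecture describes: the numerically worse model may leave users more confident in the system.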