5.3 How do we measure our progress?
All right, so how do we measure our progress while we're working with data? Before we start thinking about progress, one of the things we ought to think about is: is the data actually correct? You may remember from our discussion of big data that veracity is one of the areas where you, the marketer, really have to come into play. So when you have objective outcome data, you want to make sure the metric you're using really matches your purpose. Is that metric really an indicator that the problem is being solved correctly, that the data is being labeled correctly?

For human opinion, remember that we talked before about inter-coder agreement. If you're basing your outcome data on human opinion, you don't want to take just one person's opinion. You want multiple people labeling the data so you can see when they agree with each other. Sometimes you might say, "We'll have three people do it, and as long as two out of three agree, we're good." Sometimes accurate data is so important that you want even more than three people labeling it; how many depends on how costly it is if your data is wrong.

You also need strong task definition. Remember when we talked about coding guidelines? Those are the instructions your human labelers use to decide what the outcome data is, to render their opinion. Good instructions help the humans be consistent with each other.

Once you have your objective outcomes or your human-agreement outcomes, that is called your gold data. Why do we call it gold data? Because we consider it to be rock-solid truth: exactly what we'd like the system to produce if it were labeling things with no help. We're going to give the system a bunch of data that has correct answers, correct labels, and that's our gold data. The first thing we have to make sure of is that the data really is correct, so that we're not feeding the system incorrect data, which would then cause it to make incorrect predictions.

We use the gold data to train the system, which I bet you've figured out by now. The other thing we do is use the gold data to test the system. How? We split the gold data. We're not going to use all of it to train the system; we use some of it to train and keep a holdout set, some data we hold out to use as a test. It's similar to how tests work for children in school. The book has exercises so you can test yourself and learn the answers, but the test won't give you those same exact questions; it will give you similar questions, because otherwise you end up teaching to the test. The same thing happens in AI. If you test on the same data you trained on, guess what: the AI is probably going to do pretty well, because you already told it the answers. But if you give it some of that gold data to train on and hold some of it back to test, that held-out data is what tells you how good your model is. Is your model actually working?
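To make the split concrete, here's a minimal sketch in Python, assuming the gold data sits in two parallel lists and using scikit-learn's train_test_split helper. The documents, labels, and the 80/20 ratio are all illustrative choices, not anything prescribed in the video.

```python
from sklearn.model_selection import train_test_split

# Gold data: documents plus the labels that objective outcomes or
# human labelers agreed on. These example values are made up.
documents = [
    "report on hadoop clusters",
    "recipe for banana bread",
    "survey of data lake vendors",
    "travel blog post",
    "whitepaper on streaming analytics",
]
labels = ["big_data", "not_big_data", "big_data", "not_big_data", "big_data"]

# Hold out 20% of the gold data. The model never sees these answers
# during training, so scoring on them gives an honest estimate.
train_docs, test_docs, train_labels, test_labels = train_test_split(
    documents, labels, test_size=0.2, random_state=42
)
```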
You can see what the AI model says for that held-out data and then check it, because your human labelers or your objective data already told you what the outcomes are. That's how you can test the accuracy of your AI model.

Now, you can go even further and split the data even more. You can have a training set, a test set, and a validation set; you can have multiple training sets. What I'm showing you here is a very simplistic version of how you can do it, and your data scientists may use more sophisticated methods. It's okay that they do. There are all sorts of ways to make testing easier, sometimes even when you don't have a lot of gold data, and that can be really helpful in some situations.

One thing to focus on when you're doing your testing is a pair of concepts called precision and recall. The reason we need to talk about them is that accuracy is derived from your model being correct, from not making mistakes, and there are actually two different kinds of mistakes it can make. That's why we have these two names.

Precision asks: how many of the answers the model gave are really correct? Go back to the example model we used previously, a classifier deciding whether a document is about big data or not. There are two answers: yes, it's about big data, or no, it's not. What would precision mean in that situation? Precision means: of all the documents the model said were about big data, how many actually were? Did it predict that a document was about big data when it really wasn't, when your gold data says it's not?

Recall is a different problem. Recall asks: of all the documents in the system that really are about big data, how many did the model actually find?

You can see that these two trade off against each other. You could get 100% precision by returning one document, saying it's about big data, and being correct about it. But maybe there were actually 100 documents about big data, and you returned only one, so the recall would be terrible: only 1%. Similarly, you could return every document in the database and say it's about big data, and you'd get 100% recall, because all the big-data documents would be there; but your precision would be terrible, because most of the documents in the database aren't about big data. So precision and recall work against each other: when you do something to improve one, it sometimes makes the other worse.

Often precision and recall are equally important, but sometimes one matters more than the other, and you have to think about that for your situation. For example, if you were labeling all of your documents for a faceted search problem, where people type in searches looking for pages on a website, you might want recall to be more important, so that people can click on those facets without losing the right answers. You might want to over-tag for recall.
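As a sketch of the arithmetic, here's how you could compute precision and recall by hand for the big-data classifier. The gold labels and predictions below are invented purely for illustration.

```python
# Made-up gold labels and model predictions for six documents,
# where True means "about big data".
gold = [True, True, False, False, True, False]
predicted = [True, False, True, False, True, False]

# True positives: documents the model called big-data that really are.
tp = sum(p and g for p, g in zip(predicted, gold))

# Precision: of everything the model said was big data, how much was right?
precision = tp / sum(predicted)
# Recall: of everything that really is big data, how much did the model find?
recall = tp / sum(gold)

print(f"precision={precision:.2f}, recall={recall:.2f}")  # both 0.67 here
```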
On the other hand, suppose you have an application that returns alerts, say an alert about a story that popped up on a certain subject. In that case, precision might be more important, because you wouldn't want to unleash a blizzard of stories that were only sort of about the topic; you'd want to make sure almost every one you sent was really spot-on for that topic. So whether you care more about precision or recall depends on the problem you're trying to solve, and if you're not sure, you can treat them as equal.

So how do we combine precision and recall to come up with a number that says what accuracy really is? With something called an F-measure. An F-measure takes precision and recall and combines them into a single metric. You can give them the same weight, saying precision and recall matter equally, or you can weight one more than the other; as we mentioned, the right choice depends on what you're doing.

In the table from a real client project shown in the video, the precision and the recall combine to give an F-measure. An F-measure in the 70s is probably pretty good, but when it gets lower than that, you might want to work on your model a bit more. This particular example involved industry codes: the system was very accurate when it only had to get one digit right, but it got harder when it had to get a more detailed code correct. The way the system worked, it returned six-digit numbers, but the first digit was really an area covering all the six-digit numbers that started with that digit. So getting the first digit right was really important, getting two digits right was even better, and getting all six digits right was obviously very hard, because there were hundreds and hundreds of these codes. This is an example of how you might use an F-measure to combine precision and recall; remember, you can combine them equally, or weight whichever of precision or recall matters more in your problem.

And as we looked at before, you'll use some type of improvement process. You bring in the data and validate it: that's how we started. You train the model and deploy it; the model starts making predictions; then you evaluate the predictions, and you can use that F-measure to do it. Then you make corrections: you might correct the model, or you might add more data, like we've talked about in the past.

No matter what kind of system you put together, AI is never set-it-and-forget-it. There are always changes coming in with the data. Sometimes the system will just start to perform a little worse; it will start to drift a bit, because the type of information coming in changes a little and isn't as reflective of the original training data as it once was. That's when you want to add more data, and that's when you might want to tweak the system. Using this kind of production improvement process is what helps you detect that your accuracy is going down and clues you in that it's time to really make some adjustments.
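For reference, one standard way to do that weighted combination is the F-beta score, sketched below: beta = 1 treats precision and recall equally, beta > 1 favors recall (as in the faceted-search case), and beta < 1 favors precision (as in the alerts case). Watching this number over evaluation rounds is also one way to notice the drift just described. The example inputs are made up.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-beta: a weighted harmonic mean of precision and recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Equal weighting (the usual F1)...
print(f_measure(0.75, 0.60))            # ~0.667
# ...versus a recall-heavy weighting for a faceted-search problem.
print(f_measure(0.75, 0.60, beta=2.0))  # 0.625
```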
Now, how does the process end? Well, if you're continuing to use the system, it kind of never ends. But if you think about an ending, maybe it's the point where you decide to put something in production, when you update your model or deploy it for the first time. Often the way you think about that is that the new model is faster than the previous solution: maybe it's giving you better scale, tagging or labeling more of the data faster than before. Maybe it's cheaper: you found a way to optimize the machine so it isn't using as many server cycles as it did before. Or, usually the most important one, it's more accurate. You're often trying to solve your problem better and better over time, and this is really how the process ends. There's an old joke that says your model isn't done until the users are all dead, and that's kind of what we're talking about here: as long as you're continuing to use it to solve the problem, you're always going to be looking to make it better.