5.1 How does the data process work?
All right, so if you have big data (and you pretty much need it if you're going to do AI), it's likely that it's in silos. So what do we mean by silos? Data silos are about how data might live in different places, and even if you wanted to make that data work together, it might not be easy to do. The reason this has happened is that we've taken a function-centric approach to technology over the years, rather than a data-centric approach.

So what do I mean by that? Well, you might have a CRM system, a marketing automation system, and a CMS. Those are all systems where marketing and sales store data. But is that really the data-centric way to do it? No, it's not. If you had a data-centric way of doing it, you might still have these pieces of software, but they wouldn't be storing the data inside their proprietary databases, forcing you to use some kind of application programming interface, an API, to get the data out. Because you might want to use the CRM data, the CMS data, and the marketing automation data all together. But what we've done is create operational systems that basically own the data. And that's what creates silos: you might have data that you want to use across these systems, but it's not that easy to do. Because we've taken this function-centric approach, each piece of software has its own database, and what you probably need instead, to use AI, is something a little different.

So let's talk about how we go from where we are now to where we might need to be for AI. To back up a second, the reason this is important (and here we're mostly talking about structured data, if you remember the difference between structured and unstructured data) is that in order to understand this type of data, you need to understand what's called its "schema." A schema is basically a map that says, "This field has this data in it, and this is what it means." With a spreadsheet, the rows and the columns mostly tell you what the meaning is, but databases need something a bit more formal, because you can't look at a database and intrinsically know what it contains; you need some kind of documentation. Whether it's in the spreadsheet or outside the database, the schema is the roadmap to understanding the data. What are these fields? What do they mean? What can you use them for?

Going back to our view of these operational systems, the problem isn't just that they own the data and you have to kind of beg them to let it out; it's also that they each have their own schemas. You might have a customer number in your financial system, but how does it get into your CRM system? Maybe your CRM system assigns customer numbers all its own. You might have an email that links to a webpage, but the email is in your marketing automation system and the webpage is in your CMS. So if you really wanted to share some content between those two, you'd end up copying the data from one to the other. That's not really the ideal way to do it. They each have their own schema, so they might define things a little differently.
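To make the idea of a schema concrete, here is a minimal sketch in Python. The table, fields, and descriptions are all hypothetical; the point is simply that a schema is a map from fields to their types and meanings.

```python
# A minimal, hypothetical schema: a map from field names to the
# type of each field and what it means. Real databases express
# this in DDL plus documentation; the roadmap idea is the same.
crm_customer_schema = {
    "customer_id": {"type": "int",  "meaning": "The CRM's own customer number"},
    "email":       {"type": "str",  "meaning": "Primary contact email"},
    "created_at":  {"type": "date", "meaning": "Date the record was created"},
}

def describe(schema: dict) -> None:
    """Print the roadmap a schema gives you: fields, types, meanings."""
    for field, info in schema.items():
        print(f"{field}: {info['type']} -- {info['meaning']}")

describe(crm_customer_schema)
```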
If we go back to the customer number example, your CRM system might have a customer number that's not even in the same format as the customer number in the financial system that does your invoicing. This can make data very hard to share, and very hard for you to cross the silos to get value out of it. Those different schemas might call fields the exact same name, like customer number, but give them different meanings; or things that are the same might have different names; or they might be sourced from different versions of the truth. So you can end up with data that you can't easily combine because of its different definitions and different sources.

Obviously that's a problem. And technical people are very good at solving problems, so they came up with something called a "data warehouse." Now, what's that, you might ask? A data warehouse is basically a way of taking all the data in your operational systems and copying it, first, into a staging area. What that staging area does is try to unite all the schemas: it changes the formatting of data from each operational system, updates the way the data looks, and applies all sorts of transformations so that one schema, a schema to rule them all, can describe the data from all your operational systems. Once the data gets into the warehouse, it's unified. It's consistent. It's up to date, because you have processes that are constantly copying it out of the operational systems. It's organized; that's what the one schema gives you. It's secure, because even though it's one big data warehouse, you can apply access controls that are very well understood. It's accessible, because all the data is in one place. And it's very powerful, because you can now pull together data from different operational silos.

So the benefits of a data warehouse are really undeniable, and if you're lucky, you have one at your company, and that's where you're going to get the data to solve your AI problem. But even though the benefits are undeniable, for some companies they're unattainable, because, as you can imagine, building something like this is extremely costly. It can take years to pull together all of these operational systems. And the data can be hard to use, because when you have all of that data in one place, the schema that explains all of it is necessarily very complex. You might have so much data in one spot that you have way more than you need to solve your problem: maybe you needed all the CRM data and a couple more fields from your marketing automation system, and instead you've got the data from 15 different systems all lumped together under this one schema. And because the schema is so hard to put together, it's very hard to add new kinds of data. Once you've created the schema and pulled in all the operational systems, if you decide a year from now to bring in another operational system, you might have to perform real surgery on the schema to make that work. So it can be very, very painful: it takes years to do, and it can be really hard to change. Because of those problems, the first thing they decided to fix was making the data easier to use.
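Before we get to that fix, here's a minimal sketch of the kind of transform a staging area performs. The customer number formats below are invented purely for illustration; the point is that one warehouse-side definition replaces each system's own.

```python
# Hypothetical staging-area transform: unify two operational systems'
# customer numbers under a single warehouse definition.
def to_warehouse_customer_id(raw: str, source: str) -> int:
    """Normalize a source-specific customer number to the warehouse format."""
    if source == "crm":        # e.g. "CUST-00042"
        return int(raw.split("-")[1])
    if source == "finance":    # e.g. "42/2023"
        return int(raw.split("/")[0])
    raise ValueError(f"unknown source system: {source}")

# Both records now agree on one meaning of customer_id.
assert to_warehouse_customer_id("CUST-00042", "crm") == 42
assert to_warehouse_customer_id("42/2023", "finance") == 42
```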
So they came up with the idea of a data mart. What's a data mart? What it does is take subsets of the data in the data warehouse, but the subsets aren't the same as those operational silos. You could set up a data mart that held all of the data from the CRM system that you needed plus the few fields from the marketing automation system that you needed, and that would actually make the data a lot easier to use.

Now, data lakes are a different idea. Data lakes avoid the data warehouse entirely. The point here is: since data warehouses take years to build and are so difficult, why not just skip that? A data lake says you don't need any unified schema up front. You pull all the data in exactly as it is, and then you apply the schemas later, before you put things into the data marts. This makes each of those schemas a little easier, and there's no big project where you have to unite all the data under one schema. So some companies use this data lake approach. And maybe the company you've chosen for your AI problem is using a data warehouse or a data lake, and maybe it has data marts you can use. Hopefully one of those things is true, and you'll be able to get access to the data that you need.

The problem here, though, is that you've just set up another set of silos. Now the data marts have become the silos. And if your AI problem actually crosses data marts, you've got the same problem you had before, when the data crossed operational systems: you have to understand all of the schemas to solve the problem, and you have to know which data mart each piece of data is in. Depending on your problem, that doesn't seem a whole lot easier than the data warehouse was. So some people had the great idea of just making one of the data marts the data warehouse, giving that one a really big schema. Well, okay, you can do that, but then it's just as hard as creating the data warehouse would have been in the first place. It doesn't actually solve any problems; it just changes the way the picture looks.

Now, I don't know what situation you're in with the data you need to solve your problem, but this is definitely something you want to pay attention to. You want to understand where the data is and what the right places to pull it from are, understand the schemas around that data so you know what it means, and you also have to be confident that the data is being updated from those operational systems quickly enough that, when your models run, you're solving a problem you can still take action on.

All of those ways of dealing with data are a little bit difficult, but there is another way to do it, and that is to use a taxonomy. This is different from developing a unified schema for whatever data you decided to put in your data warehouse. It means using some kind of generic taxonomy that you can plug your data into. The reason this is helpful is that the taxonomy is so generic that, no matter what kind of data you want to pull in later, it can accommodate that data fairly easily; it's generic enough to handle almost anything. And you can see what I mean by how generic it is.
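Before we look at how generic these taxonomies can be, here's a minimal sketch of the schema-on-read idea behind data lakes. The raw records and the mart-side schema below are hypothetical; the point is that the lake stores data as-is, and a small schema gets applied only when a specific data mart is built.

```python
# Hypothetical data lake: raw records kept exactly as the source
# systems produced them, with no unified schema up front.
lake = [
    {"src": "crm", "cust": "CUST-00042", "mail": "ada@example.com"},
    {"src": "mkt", "customer_no": "42", "last_campaign": "spring-sale"},
]

def marketing_mart(lake: list) -> list:
    """Apply one small schema at read time, for one mart's needs."""
    mart = []
    for rec in lake:
        if rec["src"] == "crm":
            mart.append({"customer_id": int(rec["cust"].split("-")[1]),
                         "email": rec["mail"]})
        elif rec["src"] == "mkt":
            mart.append({"customer_id": int(rec["customer_no"]),
                         "campaign": rec["last_campaign"]})
    return mart

print(marketing_mart(lake))
```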
At the top level, a lot of these taxonomies have very simple concepts like people, places, and things. It's kind of hard to imagine data that you wouldn't be able to map into a taxonomy that generic. So even though taxonomies, like a data warehouse, do require you to come up with schemas, the schemas might be a little simpler, because you're decomposing your databases into very simple concepts, these top-level ideas of people, places, and things. And as you add more data in the future, you don't have to redo the schema or redo the taxonomy. You can keep it the way it is and just add more detail underneath it. So this is a different approach that can help a great deal.

So what are the steps of actually pulling your data together? The first step is to digitize the data. You might say, "Well, come on, all data must be digital by now." No, it's really not. I don't know if you remember the example we gave of the committee that was approving new webpages: their data wasn't digitized. They had done all sorts of things, but most of the data was still in their heads. And there's also lots of data that's still on paper. So if your data isn't digital, the first thing you have to do is digitize it, so that you have data a computer can use at all. Now, there are problems that come out of that digitize step. Usually the way data got digitized was by creating all those silos with those operational systems. Those siloed databases fragment the ownership of the data, and that's where you get inconsistent data definitions and, even worse, inconsistent data quality, because some people are better at keeping their data clean than others. Because of the silos, it's hard to access the data, and it can even be hard to secure it. The reason is that you might have people who are allowed to see data from multiple silos, and it can be very difficult to limit them to just the data they're allowed to see within each silo.

The second step is to aggregate. In the aggregation step, you could be pulling things into a data lake or a data warehouse, just as we showed earlier. So the first step is to make sure the data is digital; the second step is to make sure you can pull the data out of those silos into some form: data lake, data warehouse, taxonomy. For this step, it doesn't matter which. Pick some approach that lets you aggregate the data, so you can bring data together across multiple silos.

And the last step is to organize. This is where the taxonomy can really come into play, because you can organize data under a taxonomy even if it isn't all in your data lake or your data warehouse; the taxonomy can still capture the schema, even for data that remains in the operational systems. It basically knows where the bodies are buried: it understands where the data comes from, and it can put it into a format, this taxonomy, that allows you to make connections across the data, even if it's a little difficult to access because it's still in a silo. (There's a small sketch of this below.)

The last thing we want to look at in this section is what's called the CRISP-DM process. CRISP-DM stands for Cross-Industry Standard Process for Data Mining. That's a mouthful, and I think you can tell that CRISP-DM wasn't branded by a marketer. It's something data scientists came up with.
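Before digging into CRISP-DM, here's the sketch of the organize step promised above: a minimal, hypothetical mapping of records from two siloed systems into a generic people/places/things taxonomy. The classes, field names, and mapping rules are all invented for illustration.

```python
# Hypothetical organize step: file siloed records under a generic
# taxonomy whose top-level classes are "person", "place", "thing".
TAXONOMY = {"person": [], "place": [], "thing": []}

def organize(record: dict, source: str) -> None:
    """Map one source record into a top-level taxonomy class.

    Each source keeps its own schema; the taxonomy only needs to
    know how to read each source. These rules are invented.
    """
    if source == "crm":      # CRM rows describe customers
        TAXONOMY["person"].append({"name": record["contact_name"], "from": "crm"})
    elif source == "cms":    # CMS rows describe webpages
        TAXONOMY["thing"].append({"url": record["page_url"], "from": "cms"})

organize({"contact_name": "Ada Lovelace"}, "crm")
organize({"page_url": "https://example.com/pricing"}, "cms")
print(TAXONOMY)
```

Adding a new source later means adding another mapping rule, not redoing the top levels, which is exactly the flexibility the taxonomy approach is after.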
What the CRISP-DM authors did was try to create an industry-standard process to support all data science applications, and they did a pretty good job. If you look at the diagram, it starts with developing a business understanding of the problem; you can imagine that's where you come in. Then you work to understand the data that's required, and then start the data preparation, part of which involves the feature analysis we talked about before. Then your data scientist actually starts modeling the data, you evaluate the models, and when you eventually have something that works, you can deploy it. But before it works, you might have to go back earlier in the process, maybe to get a better understanding of the business or the data, or to prepare the data differently. So you can go around and around that circle until you eventually get to the point where you think, "Hey, it evaluated well. We've solved the problem. We can deploy it."

And this is a good process, but it's actually a little more complicated than a lot of problems require. So what I want to do is talk you through a simplification that might take advantage of things you've already done. You still start with framing the business problem and identifying the needed data, but the first thing you need to do here is determine whether that data already exists; only if it doesn't do you collect and prepare it. So you might not have to prepare the data at all. You might be able to skip that preparation step if you've already got the data in pretty good form. Next, instead of immediately going out and starting to model, determine whether a model already exists. Even if it doesn't quite solve the problem, maybe it comes close; maybe it's a good starting point. If you don't have any starting point, then yes, you are going to have to go ahead and model the data. But at each step of the process, what I'm asking you to do is find out whether something already exists. If your problem is at a large company, it's very likely that somebody has done some work on it before, and you want to look at exactly what's gone on in the past so you can take advantage of it. Then you go ahead and use the model to assess the situation and, if your evaluation doesn't go well, hypothesize what the problem might be. Then you go back, experiment with new approaches, and use the model again, eventually deploying. So it's based on the original CRISP-DM model, but at each step I want you to really think about whether that step has already been performed. Did it happen already? If it did, that can save you a lot of time in solving your problem.
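To close, here's a rough, hypothetical sketch of that simplified loop. Every helper function is a stand-in with a toy implementation; the point is the shape of the process: reuse what exists first, and iterate until evaluation passes.

```python
# A toy sketch of the simplified CRISP-DM loop. All helpers are
# invented stand-ins; only the control flow is the point.

def find_existing_data():   return [1, 2, 3]       # pretend data exists
def collect_and_prepare():  return [1, 2, 3]
def find_existing_model():  return None             # pretend no prior model
def build_model(data):      return sum(data) / len(data)
def evaluate(model, data):  return model > 1.5      # toy "did it work?" check
def deploy(model):          print(f"deployed model: {model}")

data = find_existing_data() or collect_and_prepare()  # reuse data first
model = find_existing_model()                         # reuse a model first
while True:
    if model is None:
        model = build_model(data)                     # model only if needed
    if evaluate(model, data):                         # evaluation passed?
        deploy(model)
        break
    data = collect_and_prepare()                      # hypothesize, rework, retry
    model = None
```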