2.13 Transform and cleanse data - Video Tutorials & Practice Problems
Video transcript
<v Instructor>So far, we have talked a lot</v> about connecting into data sources, and bringing them into the power query editor. And we've done a little bit of work with merging data sets and appending data sets. But what about actually doing some cleansing of the data? In this sub lesson, we will take a look at just that. What we'll take a look at first here is our product list. So if we take a look at our product list query, like I have here, and I click on the product name, I can see a bunch of dirty data. So it looks like I have some spaces in the front of the product name here, I likely have some spaces on the very back as well, looks like this one column here called cable lock, actually maybe has some type of carriage return line feeding it, 'cause it's split it off into two lines here. So we may wanna look at cleaning that. And we look at some of the standardization issues between some of these column values as well. If we take a look at column number two, this looks like it is, oh sorry, row number two, it looks like it's all uppercase. And this one here it looks like it's all lowercase. And down here looks like we have some that are initial case and then lowercase after that. So let's go and look at how to actually clean that data. So what I'm gonna do here is my product name is highlighted, what I want to do is go to my transform tab. And if I go over to the text column group right here, and I choose format, I have a number of different things I can do to clean this data up. So what I'm gonna do here first is I wanna actually go through and clean all the row values in here first, and what the clean function is gonna do is get rid of any non-printable characters that are showing up in those values. So as I pointed out a moment ago with the column number, sorry the row number five there that has the cable lock in it looks like it must have some type of carriage return line feeding it. So if I go clean, and go like that, we're gonna see that those characters got removed from there. So our cable lock is all on one line. Okay, so that's one clean that we can do. Now, how about we look to get rid of this wide space that is in front of these strings, and any potential wide space that may be on the back. So to do that, let's also go to the format function here. And we'll click on that. And we're gonna say trim. And what trim is gonna do is get rid of the wide space at the front and very back of this string value. And lastly, to standardize this column here, we can stand product name, we can go to format. And we can say capitalize each word so that we now have a standard look and feel across all the rows in this particular column. We're doing a lot of this here just to make that column value look a little bit more professional when somebody wants to go and use that for input somewhere down the line. Or if you're gonna use this in analytics, it has a nice look and feel when it comes to you reporting later. So that is one type of clean there that we wanna do. Something that I should point out is keep in mind here that we applied the steps there to clean up this particular column in the product list query, we could have done it in the products query, or the product finished goods query. The only challenge with that is if we did it there, we would have to run all three of those cleaning steps in those queries individually. By putting it into the appended product list query, we're able to do those transformations once. So what we're gonna look at next here is if I go back to my product list, and just scroll over and find our product category and product subcategories, we are going to find out that there are some row values here that do not have a value for subcategory name and category name. And those all happen to come from the query called product finished goods. Because if you keep in mind here, product finished goods did not have a column called subcategory name, or category name. So what ended up happening when those values got appended together in the product list is they ended up with a null value. So what we're gonna do here is we're gonna go and clean that up by finding our subcategory name and category name columns. And I'm going to click on these both together here. And this time, I'm gonna right click, and I'm gonna go in here and choose replace values. And what I'm gonna do here now is the value that I wanna find is the null, and I wanna replace this with a value, I'm going to say other, like so. So that works really nice if I want to have the value for null across both of these replaced with the exact same value. So that works really well, like I mentioned, if you wanna have them replaced with the same value, but what if you want each one of these have a different value? What we would need to do is actually go into each of these columns individually to perform that. So let's go back and get rid of the supply step. And part of this reason I took you down this path is I wanna show you to have the ability to get rid of previous steps. So let's go ahead and get rid of that replacement there, where I took the null across each one of these two columns and replaced it with the word other. And let's go and do these individually. So I'm gonna right click, I'm gonna come down here and say replace values. And I'm gonna say null, because we have the null value, which we're trying to replace. And what I'm gonna say here is other subcategory this time. So go ahead and get that typed out, like so. And just to save myself a bit of time I'm gonna copy that there. So I'm gonna say, Okay, and what we'll notice here, now, if I click on this right here, we will scroll down here, we're gonna find a new category called other subcategory. I go over and do the exact same thing right here. So I'm just gonna go click on that say replace values, I'm gonna choose the null and just my time saver here, I'm gonna do that. And I'm gonna say other category, like so go ahead and click, Okay. And now any of the column values that were null are gonna come up with this new label of other category. Typically, the reason that we go ahead and replace null values with something more meaningful is that I guess the opposite of what I was just saying is the null value is a confusing value for people to look at, you know, it's a little odd to look at, at the reporting layer. And if we replace it with another value, it is a little bit easier to go ahead and consume at the reporting layer and lets the users know that it falls into some other type of category. And this is really just all part of cleansing your data sets, so they'll look clean and consistent to your potential consumers or even other producers of information. Okay, so that is some cleaning that we did on those two columns. I'm just gonna go ahead and cancel there. So the subcategory name and category name have been taken care of. So now let's go over to our sales tables, so I'm gonna move over to the sales query here. And what we're gonna notice here first is we have this column that has two different values in it looks like it has an order status and a type all embedded into one column. But what if we want to separate order status into its own column, and the type of order into its own column. So take this and go ahead and separate these apart. To do that, we are going to go in and split this column apart. So what I'll do here is I'm going to highlight on order status and type, I go to the transform tab. And over here to the text column group, I can see this function here called split call. And what I'm gonna do here is I'm going to go in and say split column by delimiter. And the delimiter that we are going to split the column on is the comma. So we went and chose the space. Let's go ahead and choose the comma as the split. And we are going to do it at the leftmost. If we left it at each occurrence of the delimiter work 'cause there's only one occurrence in each one of the columns, and go like so, click, Okay. We will notice now that we have a brand new column called Order Status and Type.1, which has the value shipped in it. And if we go over here, we're gonna see a brand new column. Oh, sorry, my apologies, it was right beside it. So we can see order status and type two right next to it here. So they were split into their own columns. So what I'm gonna do here now is I'm just gonna actually go through and rename these. So the engine wasn't smart enough to give them proper names, it just called them the original name dot one and dot two. So I'm just gonna go ahead and double click on there, and let's just call this one order status. And this one right here, what we're gonna do is we're just gonna leave this one and call it order types, so I'm just gonna leave type, like this, and Enter, and now I have that split off into two columns on their own. Now, we do wanna be aware of that, if we remember what the data looked like before we did this split. So I'm just gonna go back over here and click on the expanded product list to go one above where we did the split. We're gonna see that it had a comma and then a space and then online order. So what has happened is given that we split on that comma there, there's actually a wide space in front of online order. So I'm gonna go ahead and get rid of that. So I'm gonna go back down to this step right here, where we renamed the columns. And likely done a moment ago, I'm just gonna right click on this. And I'm gonna go down to transform. So I'm gonna get at this trimming from a different location. So as opposed to up on the transform tab, I'm gonna go transform, and I'm gonna say trim. And that is going to get rid of the wide space in front of that particular row value across that column there. So the trim will actually get rid of the wide spaces mentioned earlier at the front and the back of that particular string. So we have seen a few different ways to transform and cleanse your data. We went through and cleaned up a particular column, we did some replacing of values across a column, then we took a look at splitting data. And those are some major things that we typically do when it comes to transforming and cleaning data. So that brings an end to this sub lesson.