2: Importing and transforming data using Power Query Editor
2.13 Transform and cleanse data
2: Importing and transforming data using Power Query Editor
2.13 Transform and cleanse data - Video Tutorials & Practice Problems
Video duration:
10m
Play a video:
<v Instructor>In this sub lesson we're going to use</v> various techniques to go through and transform and cleanse our data. In particular, we're gonna take a look at using replace values, use the split columns functionality, and we're gonna work on formatting our columns. So let's go ahead and get our Power BI Desktop file open that supports this sublesson. Okay, once again, at the GitHub repository, I am going to open up sublesson 13 file here and give it a moment. Okay, so the desktop file open, we are gonna go into the Parquet editor, like so. And what we're gonna do first is, we are going to go to our product query. And what we're gonna do in this instance here is we're gonna first take a look at the product description column. We can see that a number of the rows here in the product description column have a null value. And we'd like to go ahead and clean that up by replacing it with the word, no description. So, the way that we're gonna do that is we're going to right click on the product description and what we'll do is to scroll down a little bit and we can choose this option called replace values which is gonna give us a dialog box where we can go ahead and type in what we wanna search for and what we wanna replace. Okay, so I've gone into the value to find, I've typed in the word null in all lowercase, keeping in mind that the M language is a case sensitive language. So I've typed in null in lowercase and what I wanna replace it with is the phrase no description. And that's gonna make it a little bit more clear at the reporting level, whether there is actually a description or not, as opposed to leaving people wondering with a blank or a null, or actually is gonna call it out and say, this particular product has no description. So I'm gonna go ahead and click OK and what we're gonna see here is in the column here now, any of the values that had null, now say no description. And if I wanna take a look at the other values that are in this column, what I can do is click this drop down box and take a look at all the other description that are in here. Okay, so that is cleansing the very first column we wanna take a look at in the product query. So the next columns we wanna look to cleanse are the ones that we did back over in the merge queries. And our product subcategory and product category we will actually wanna get rid of these null values as well and replace it the word internal product. So, in this particular dataset, any of the rows have a null value, are internal products that are used to build things that we ultimately sell as AdventureWorks. So if we were to kind of scroll over to the right-hand side and take a look at the products that have a null value these are all the parts that we're gonna use to actually build bikes, that we actually gonna go ahead and sell. So if I just scroll back over the right hand side the two columns that I actually wanna change values for are the product subcategory and product category. So the nice thing is I can actually go and replace these values both at the same time by clicking the control key and highlighting both columns. So now what I can do is I can actually right click and say replace values and I can go ahead and do the exact same thing. So I'm just gonna go ahead and type those into the field. And I can see that what I wanna do here is search for, the value to find it being null all in lowercase once again then replace that with internal product. Once again, and click OK. And we'll notice across both of these here now internal product has been replaced wherever the word null showed up. So now I can go into the product subcategories and take a look at all of them and everything has a value, I can do the same thing with product category, I can see that I've got bikes, clothing, components, accessories and of course the internal product. So I've gone through here now and I've cleaned up the values in my actual product query. The next thing that I'm gonna do is I am going to go over to my sales territory query and do a little bit of cleansing here as well. So we can see that I've clicked on sales territories keeping in mind, one thing that I did earlier on was I moved the sales territory, which is in the first rows up into the header. So if I just rewind on these applied steps this is what things look like at the end of the extract process of bringing things in. Then I went into the promote headers and then I went through and did a change data type even though this change data type is doing nothing this is (indistinct). So I'm actually just gonna go ahead and get rid of that. Cause it doesn't really add any value cause this actually did have the correct data type on top of things. Okay, so what I really came to this particular column is going to do, was to split these values apart. So I can see a pattern inside of this column right here. I can see I've got what appears to be continents on the far right, countries going down the middle and then regions over here. So what I gonna to do is go to the home tab, and under the transformed group of columns I can see there's a transformation commander called co or split call. So I'm gonna click on that. I'm gonna say by delimiter and power query is pretty smart. It's gonna go through and say it's a custom delimiter. It figures out the pipe symbol which is above the entry key and most keyboards. And it's gonna do the split it each one of the occurrences, the delimiter. So I'm gonna click OK. And now my columns have all been split apart. Okay. So that's the first thing that we wanna do here. The next thing we wanna do here is actually each one of these columns. If we were to go back a step here has some or a couple steps I should say, it has some spaces in between these delimiters here's. What I wanna do is in my last step here I'm going to highlight all the columns, so like this. What I'm gonna do here is go into my transform group this time. And under the text columns, there's a format command here and I'm gonna say trim. And what that's gonna do is get rid of any white space on the left and right hand side of the character strings to chop that out. Which is a very common thing in Excel files and text files. The difference here is it was actually part of the delimiter piece. So we wanted to go ahead and get rid of it. The next thing we're gonna do here is actually go through each one these columns and give them a proper name. Because sales territory.one, two and three are meaningless. So I'm gonna go through and rename these to sales territory region, country, and continent. So I'm gonna go ahead and do that right now. Okay, so I can see now in my query that all of my columns have been successfully renamed to sales territory region, country, and content. So no they're a little bit more meaningful. Okay. So what I gotta do here next now is we're gonna go back to our product query to do some cleansing. And we can see here in our product name we got a bunch of stuff that looks a little odd. So we've got some spaces at the front of some of these product names. We can see that there's a mixture of uppercase and lowercase, and it looks like we've got some of these are being split onto another additional lines here. So we've got some special characters that are embedded into this particular column here. So what we're gonna do here is just to show you what is happening with this transformations. I'm gonna do two things here first. So what I'm gonna do here is actually click on product name and say, add column, and I'm going to go to the, I'm gonna go to the extract command here, and I'm gonna say length. And what that's gonna do is actually give me the length of that column at that point in time. And what I'm gonna do here now is I'm just gonna move it over to the left-hand side and get it close to product name. Just so I can show you the differences after all of these things have been completed. So I'm gonna call this length before. So it's called length I'm just gonna say length before something like this. And that says, here's the link before we actually did all of our transformations. Okay. So that's the first thing we wanna do. Now, I'm gonna actually go through and perform the transformations I wanna do. So I'm gonna click on the product name. Let's go over to our transform tab and go into format. And what I wanna do here first is I'm gonna go and run a clean function, which is gonna get rid of any special characters that are embedded in my particular rows here. So I go clean, and we're gonna notice it. Some of these had the carriage returns in line feeds that were in the actual columns removed away, out of those particular rows, okay. So that's the first thing we're doing. The next thing we gonna do, is go to the format command. And we're gonna say trim once again, so that all the white space at the very beginning and end of the column values for each one of these rows is removed. The last thing we gonna do here is go to the format and we're gonna say capitalize each word, so that we go through and standardize the product name to be a little bit more clean and professional looking. Okay. So we've done quite a bit of work on that particular column. Now, the last thing I wanna show you is actually showing you the length of this column after all this work had been completed. So I'm gonna do here now is I'm going to go to the add column tab again. And once again, I'm gonna click extract, and I'm gonna say length just to show you what this looks like after all this work has been done. Now, I'm gonna move that length column over towards the length before, and I'm gonna go here and I'm gonna actually call this length after just so we can see. Now this is something I would never normally do in a production environment, but I really wanna show you the differences of what happened to this column before the transformations and after the transformations. So we can see that these columns, for example the adjustable race column originally was a length of 25 cause there's a bunch of white space in here. After I went through and did all the cleansing, it's now of length 50. And we can go through and take a look a bunch of these row values to see that there are in fact some differences between what the data looked like before from a length perspective and what the data looked like afterwards. So, once we're done with that, we don't need these columns anymore. So what I gonna do is click on length before and length after, I'm just gonna go ahead and remove these columns we don't need them anymore. There's much better ways of doing that. In fact, I wouldn't have put them in there in the first place, other than to demonstrate what was happening with that particular technique. Okay. So we've actually gone through and we've managed to clean up several of our queries. We went through and cleaned up the product query, we did some cleansing of the sales territory query and then we went back on the product query and did a little additional cleansing as well. So our data sets are now just about complete. And that brings us to the end of this particular sublesson on cleansing data.