2.8 Filter rows - Video Tutorials & Practice Problems
Video duration:
8m
Play a video:
Video transcript
<v Instructor>As mentioned in a previous sub-lesson,</v> some of the first things we wanna do when connecting into a data source is to ensure that we remove the columns we do not need and filter the rows that are not needed as well. So let's go ahead and take a look at this data source, which we're gonna bring into our spreadsheet and show you how this is gonna work. So we can see that this is a CSV file and it looks like it's a report. It's got some header information at the top, blank row here. And even within the data set itself, there are these blank rows, which we wanna get rid of here. So we're just going to show you those right there. So we've got blank there, there and there. And at the very end, we have some blank rows that we're gonna wanna deal with as well. So unlike hooking into an Excel table, which has some structure to it, this CSV file is going to be a little less structured. We have to do some clean up of the actual data as we're bringing it through into the Power Query Editor. So let's go ahead and get into the Power Query Editor and ingest this file and show you how we're going to take advantage of the features of row filtering to clean this data set up. So I am back over in my Excel spreadsheet now. And so now I wanna make a connection into the file that I was just showing you. So I'm gonna go to the Data tab. I'm going to go to Get Data, let's say From File from Text/CSV. And I wanna connect in this one called Product Finished Goods right here, and it's going to click Import. And we can see like we did before we can see a preview, what the data looks like, the file origin, the delimiter, and then how power creator is doing its data type detection. And we can once again, take a look at this to verify that, yes, this is the file that we did wanna bring in. So previously we had been clicking on Load and Load to, and just getting a connection set up and leaving our file at that. So we're just doing the E side of the ETL. Now we actually wanna start doing some transformations. So we will go ahead and click this Transform Data button, which is gonna take us into the actual power corridor. So we can pick up from the East side of the ETL and start doing some transformations. So we can see the contents of the file that we previously showed you here in the Data Preview pane. And now we wanna go ahead and use the remove rows functionality to clean this file up. So the first thing we're going to do is we know that we wanna get rid of the top three rows by looks of it. And it looks like we probably wanna get rid of the bottom three rows as well. So let's go into Remove Rows and click on this and let's say Remove Top and the dialog box will pop up. And we're gonna say top three, go ahead and click, Okay. And now the top three rows disappear. We can once again, go back to Reduce Rows and click Remove Bottom Rows. So the next one down, and we also want to remove the bottom three there as well. And now the bottom three rows are gone. Next thing we wanna do is we wanna get rid of these blank rows. So we've got blanks in these few rows, going all the way across all the column values. So what we can do is choose, Remove Rows and say Remove Blank Rows. So now the data is further reduced. Next, what we can do is we see the in row number one, it appears that these are the column headers. So let's go ahead and move over the Transform tab over here. And under the table group, we can go ahead and click Use First Rows Headers, and suddenly our data looks like that. So we've done quite a few steps here. So let's go through and review what has happened in our applied steps now. So if we go right to the very top and click on Source, this will show us the source file that we actually connected into. So if I take a look in my Formula bar up here, I can click on this and it looks like it's connecting into the CSV file. Once again, I mentioned earlier, you don't need to know this M code, but it's really nice to look at so that you can understand what is being generated as you go through each one of the steps, in the Power Query Editor. So let's close that back up. All right, so that's step number one. The next thing that happens is that the data types are chosen. Those two steps are performed automatically. Then we got into the steps where we started doing some work. So here's where we've removed the top three rows and the bottom three rows, got rid of the blank rows, promoted to headers and then this step was automatically included or inserted by the Power Query Editor And it went through and did the data type changes once again. So our data set is now looking fairly clean. So let's go ahead and do some further cleansing lists. Now, if we take a closer look of the data set, we're gonna see that we have some duplication in our dataset. So if we take a look at rows four, five, and six here, that we can see the same product number of pizza itself across these three rows. Now, if we take a look across all the column values, in fact, these three rows aren't duplicates because they do have different values across the columns. So I can see in the color column, it was very first one, row four is black and then red and then blue. So if we take a look at deduplication across all the column values, four, five, and six are actually not duplicates. But if we only take a look at the product number itself, these are declared as duplicates right below it in rows seven and eight, we can see that there's a product number that repeats itself. And if we looked across all the columns and I'm not gonna scroll all the way to the right-hand side, these values are actually duplicates across all the column values. So we have a couple of different ways of dealing with these duplicates, and it's very important to understand the differences between the two. So I'm gonna go back to my Home tab and I'm gonna go back over to Remove Rows and keep in mind here now that I'm highlighted on the product number call. So I'm gonna run my deduplication process on the Product Number column only. So I'm going to go to remove rows and say, Remove Duplicates. And what you're gonna notice here is we're gonna go down from 16 rows to 13, and we'll explain in a moment. So we can see a brand new step was added into applied steps called Remove Duplicates. And we can see now that we only have one instance of this row right here. So the 2908 and one instance 2036, and that deduplication process works fine, if we want to do or deduplication on this particular column here only. But to me, that is a little problematic given this data set, because as shown earlier the 2908 rows, those three aren't actually duplicates, we consider their values across all the columns. So what I'm gonna do here is I'm gonna get rid of the supplied step. So that is one of the really nice things about the applied step is I can actually get rid of them as I build them up. So I'm just going like that. And now my dataset is back to the 16 ropes. So it's worthwhile highlighting at this point here that in power query, you're just taking a copy of the data from your source. You're not actually modifying the source information. So as we go back across each one of these applied steps like we did a moment ago where we just click through the individual steps, this is a copy of the data and showing us what each one of the steps look like after it's performed and what the data set then looks like. So what we're gonna do now is we wanna do the deduplication across all the column values. So I'm gonna click on Product Number, go Control + A to select the entire table. I'm gonna go over to Remove Rows and now choose, Remove Duplicates again. So now I should be knocked down to 15 rows because these 2908 rows, they are declared to be different because their column values are not the same across all three of those rows. So they stay and only this 2036 row remains. Okay, so I said that's a really important thing to distinguish in terms of how the deduplication process is running. And you certainly don't want that one to catch you off guard. So it's worthwhile highlighting the differences there. And that brings us to the end of this sub lesson.