4: Utilizing Best Practices for Preparing and Analyzing Data
4.3 Implement and maintain data integrity
4: Utilizing Best Practices for Preparing and Analyzing Data
4.3 Implement and maintain data integrity - Video Tutorials & Practice Problems
Video duration:
17m
Play a video:
Video transcript
<v Instructor>Data integrity refers to the consistency</v> accuracy, and completeness of a data set. It is a critical requirement for any data analysis solution where we need to generate reliable and actionable insights by processing raw data. Without this key aspect the resulting insights could be incomplete or worse misleading. In the context of this course, data completeness implies having access to all data required for analysis. Accuracy refers to a key requirement for having correct and validated data which may be entered manually, sourced from another system or processed and reported from our solutions. And finally, the third attribute for data integrity requires that all data should be created accessed, processed, and presented in a consistent manner. As data analysts we can apply sounded integrity practices to uncover numerous benefits for our solutions. Such practices can help us to be more efficient as we build sustainable solutions that can consistently generate accurate results in a timely manner. Some benefits of implementing and maintaining data integrity within our solutions include identifying and removing duplicate data which allows us to maintain referential integrity within our datasets, consistent business rule and logic implementation, which results in reliable lookups and calculations within our solutions. And the primary benefits for end users include timely and accurate reporting. Modern database systems provide a variety of options for implementing and maintaining data integrity. Obviously Excel is not meant to be a database and is thus lacking in such capabilities. It is thus imperative for us to implement and maintain a certain level of data integrity within our XL based solutions that either bring in data from other systems or use manually enter data with an Excel. Let's first look at some challenges related to implementing and maintaining data integrity within Excel based solutions. Later in the sub lesson we will discuss some options to get around these challenges. Now for most data analysis solutions we import data from other sources, typically databases and then try to collectively process this external data in addition to any manually entered data we might have set up within Excel. And as we do this quite often we will need to relate some of these data sets for look-ups, validation, upending merging, and so on. Although more recent versions of Excel provide data modeling capabilities which are outside the scope of this fundamental scores and generally lacks basic features for relational datasets like the ability for assigning primary and foreign keys or other built-in capabilities for defining and maintaining relationships within our data sets. Moreover, the most common use for Excel is data entry or creation. As such it comes pre-packaged with certain built-in features to facilitate such activities. One such feature is auto formatting and conversion of data when importing or entering it into Excel and in some cases when working with data within Excel formulas. For example, default text format for dates with an Excel is based on the windows system settings. Since this can change from one machine to another importing date data from CSV files may sometimes have undesirable results. In other cases textual data without letters maybe interpreted as numeric which causes issues for fields like order numbers where we might have some leading zeros. And since Excel stores dates as numbers, some numeric fields may be interpreted as dates within Excel. Finally, Boolean values or results within Excel formulas will be automatically converted to a zero or a one when used with mathematical operators. Apart from the data import and processing related issues we just looked at, Excel, also lacks comprehensive capabilities around access provisioning, data quality controls, audit trails, version control and structured workflows, including approvals. Since we typically import data into Excel and then process it to generate the desired output, we also run the risk off misconfigured inputs and unintentional changes to formulas used for data processing. Additionally, unless we store our Excel files on platforms with built-in version control features like OneDrive or SharePoint, we also need to plan for the risk of a potential XL file corruption. Now that we have reviewed some common challenges with data integrity within Excel, let's look at some ways to address some of these challenges. In such context, Excel provides several features and capabilities that can help us to implement certain data integrity controls for our solutions. To start with, there are a few categories for Excel functions that can assist with avoiding, identifying and handling issues related to data integrity. These functions can be reviewed under the following categories, within the functions library on the forum last tab of the ribbon bar, logical, look up and reference and information under the more functions group. In addition to these functions, we can make use of data validation and conditional formatting capabilities within Excel to further support data integrity related measures. Within Excel, data validation is used to restrict the type of data or the values that users can enter in a cell. One of the most common data validation uses is to create a dropdown list also known as a data validation list. Conditional formatting on the other hand is used to selectively change the appearance of cells or ranges based on certain conditions that we provide. Now let's review an example that uses some of these capabilities. So for this example, we have a student list with their ID, first name and last name and the student master table on the top left here. And then we have another student data table which has other data for individual students. To ensure completeness, we would want to check if the IDs available within the student data table are also available within the student master table. We'll probably wanna highlight the records that are not available in the master. And then we'll also pull in some data from the master table into the student data table to generate the full name for each student. In order to create the link between these two tables, we'll try to retrieve the index of each student record in the student data table from the student master table. To do that, we'll use the XMATCH function and the value we wanna look up as the student ID for this row. And we wanna look it up within the ID column of the student master table. Since we're looking for an exact match we'll use zero as the third argument and then press Enter. As you can see for each row in the student data table, we have now pulled in the correct index from the student master table. In case one of the student IDs was not available within the master data table, so let's say ST-06 the formula we just authored returns an error. Finally, we should handle this error and return a friendly message. So in this case, we'll do use the IFERROR function. And we'll check the value returned by the XMATCH function. And if that is an error let's say we'll just return the empty string. As you can see the record that previously returned an error, now displays an empty string. To take this a step further, we may want to highlight the cells that are turned and empty string for the match index. To do that, we can select the entire column and then choose one of the conditional formatting options under the styles group of the home tab on the ribbon bar. Here we have options to highlight cells based on their values or how their values compare to other values within a selected range. In our case since we just wanna compare the value to an empty string we can look under the highlight cell rules group and then look for the equal to option Here we can provide the value we're comparing to which is the empty string and then select an appropriate format for the cells we wanna highlight and then press OK. As you can see now, the cell with the empty match index is highlighted with the format that we just provided. Next, we'll try to pull some data from the student master table into our student data table on the right To do that, we'll use the index we just calculated so that we don't have to handle any header scenarios again. So we'll start the formula by checking for the index. And if it is empty, we just wanna return an empty string for the full name as well. If it is not empty however, we wanna concat the first name of the student. Which is the value within the first name column of the suit and master data. Add this index. And then we add a space between the first name and the last name. And then repeat the same formula for the last name. Hold and close the brackets, and then press Enter. As you can see, we've successfully pulled in the full name for each student with the exception of the student that wasn't available in the student master. Now let's build on this example and look at another table student grades Here, we want to enter the grades for individual students for the different subjects they might've enrolled for. In order to ensure accuracy for the data entry, we would want to limit the IDs in this first column here to only the values available within the ID column of the student master table. Similarly, we wanna limit the subject field to only the values available within the code column of the subject table to the left here. To enforce such constraint on data entry, we can use the data validation feature of Excel. This feature allows us to limit the content of a specific cell to either a list of values or to a specific format. To use this feature, we go under the data tab of the ribbon bar and then click on the data validation button. We'll start with a setting stab here and look at the options available for the allow setting. Here we can select the data validation type that we wanna enforce for this cell. The different options are, a whole number, a decimal, a list of values which is what we're gonna use for this example, dates and times, text with a specific length or any other custom formula that results in a Boolean coup or false value. Now, since we wanna provide a list of values for the cell we'll use the list option from this dropdown. For the source setting, we can simply click on the entry box here and then select the values that we wanna include in the list. Next we'll press OK and then validate if we get the list of values in a dropdown. Now, before we implement a similar dropdown for the subject column we will copy the standard format that we have implemented for dropdowns in this model. As we discussed earlier in this lesson such formats would include the cell formatting options and the standard input and output messages for our dropdowns in this model. These formats would typically be stored on the formats sheet, within the same model. Now let's select the dropdown template here, copy, go back to our data integrity sheet and then paste it in the subject column. Now, while it's selecting the same cell if we click on our data validation button we can see that it has the standard input and editor messaging already included. Going back to the settings tab, we notice that we should probably change our source for this data validation to something more relevant for this cell, which in our case is the subject codes. And then we'll click OK again, and notice that Excel is complaining about the value in this cell. If you open the list, we'll see that the values expected in the cell are now the subject codes. However, we still have the original value of one entered in the cell. As we clear this out, it will also remove the warning message from Excel. Now, if required, we can implement another data validation for the grade column in this table to ensure that only decimal values are allowed. However, we'll skip that for this example. Please note that the source reference for our dropdown lists can also accept a formula that returns a list of values. For example, we can use the indirect function with a structured reference for an Excel table, or we can use a defined name that uses the same structured reference or maybe a dynamic array formula to return the list of values. Using such options, we can reference a dynamic list of values that can change without impacting the dropdown list. We will review some examples of such approach later in this lesson. In addition to the use of Excel capabilities we just discussed, there are other measures that can be implemented to improve data integrity within our models. For example, we can list some input data and other critical processed information on the same worksheet as the generated outputs from our model. This allows the end users to review such inputs in context of the processed results and potentially identify any discrepancies. We can also implement worksheet and workbook protection to prevent unintentional edits to formulas and formatting. We will be reviewing this topic in further detail later in the course. And finally, if we're working outside of solutions that include version control capabilities like OneDrive and SharePoint, then we should be making periodic backups of Excel files to support potential data recovery activities. So with that, we have now covered the critical concept of data integrity and context of Excel-based data analysis solutions. Over the remainder of this lesson, we will now design and implement a couple of examples to demonstrate the skills we have learned thus far.