By: Matt Rhoads, Ed.D
Dr. Matt Rhoads is a Tech and Instructional Leader and Innovator with hands in Adult Ed, K-12, and Higher Education. He is the author of several books and is the host of Navigating Education – The Podcast.
Building data literacy to make data-driven decisions takes several steps. It does not simply just happen. But with practice, we can conduct data-driven decision-making seamlessly throughout the school day. Last week for Part 1 of this blog series, we focused on collecting student data and understanding there’s plenty of it to be collected in K-12 schools for us to use in making instructional data-driven decisions in classroom and school/district wide settings. In Part 2 of this 4 part blog series on data literacy and data driven decision-making, we are going to focus our attention today on cleaning data.
Why We Clean Data?
We clean data for a variety of reasons. It is a step that must be taken because its a process of identifying and correcting pieces of data within a data set that may be corrupt or inaccurate. Usually, when you export and import data into Excel or Sheets from elsewhere, there could be some inconsistencies and errors present since its coming from a different database. Furthermore, cleaning data helps ensure these possible inconsistencies and errors are not littered throughout a data set. Ultimately, inconsistencies and errors found within the data result in errors when a statistical analysis of the data occurs. So, it is vital our data is clean before we can move on to conducting any form of statistical analysis to derive any newfound knowledge from the data to be used in making an instruction decision.
Basic How-To’s for Cleaning Data
To clean data effectively, there is a step by step process you can follow to ensure your data is clean and ready for statistical analysis. This step by step process to clean data incorporates six major steps. Depending on the step, it could take 30 seconds to 5 minutes to complete the step. It all depends on the data set you are working with as well as how efficient you are with each step to clean the data set.
Step 1: Eliminate the Extra Space by Trimming Them
Sometimes when you export and import data from another database into Excel and Sheets, an additional space is found within the data cell. There could be one or many additional spaces throughout your data cell. If a space is present, a statistical formula utilized on the data will not work. One way to get around this is to to use the the =TRIM(cell#:cell#) function to erase any extra space present in the data set. This function can be used on each column of data spaces appear within your data cells.
Step 2: Get rid of Blank Cells and Decide What to Do
If there are blank cells present within a numeric data set, then we have a problem. When there is not a value in a data cell within a cell, we must figure out a way either to delete the cell, fill the cell in with a 0, or conduct a mean replacement, which means filling the empty cell with the overall average found in that column of numeric data. Once a decision has been made, we must do this for the entire data set. A shortcut instead of manually putting in the value for the cell is to press “F5” on your keyboard, which opens up a dialogue box. Inside this box, click on the “Special” box at the bottom left hand side of the box. Go down to the menu and click on “blanks.” After clicking “OK”, it will automatically select each and every cell that is blank in the data set. Now, it will go much faster and you will be able to fill in the blanks in no time.
Step 3: Delete Duplicates within your Data Set
Within a data set there could be a number of duplicate pieces of data. Generally within a data set, it is very rare to find duplicate pieces of data. When duplicates arise, they can throw off statistical calculations so we have to make sure whether the duplicates found are legitimate duplicates or duplicates we must eliminate from the data set. For Excel and Sheets, there are two different ways of doing this. Therefore, explanations for both will be provided.
First, for Excel, click and highlight your entire data set you are cleaning followed by pressing the “Conditional Formatting” option at the top of Excel’s interface. Once this option is pressed, click “Highlight Cells Rules” followed by “Duplicate Values,” which should be an additional drop-down tab given on the interface. Once you press “Duplicate Values,” Excel will have each individual header of your columns provided on the box that appears on your screen. This will ultimately give you the option to determine which columns you want Excel to target duplicate data. After you make your decision of what columns you want to be cleaned for duplicates, click “OK” at the bottom of the box. This will clean the selected columns of duplicates for Excel.
Next, for Sheets, you must begin by selecting the data set you want duplicates removed from in addition to clicking the “Data” tab at the top of the Sheets interface. Then, near the bottom of the drop down tap that appears, you will see the option to “Remove Duplicates.” After you have clicked this option, it is extremely similar to how it is conducted on Excel because a box appears with each column of the selected data set appearing for you to determine which columns you want duplicates removed from your data set. Click the columns you want this to occur to and then press “Remove Duplicates.” On Sheets, it will provide a box of the total number of duplicates removed from your data set. This signifies you are done and ready to move on to the next step.
Step 4: All Text Must Match – Watch Out for Case Sensitivity and Spelling
Now, we must determine if all of data within the data set is in the right case as well as whether it is spelled correctly. There may be times in Excel or Sheets where textual errors appear in capitalization of names, titles, or places. Usually, this occurs when the data is imported onto a spreadsheet. When this occurs, it can be difficult to mine the text and conduct statistical analysis. An easy way to ensure all of your data is in the proper case is to insert one of the following functions: =LOWER(cell#:cell#), UPPER(cell#:cell#), and PROPER(cell#:cell#). By having all of the textual data within the same case makes it so much easier to mine the data.
Besides looking at case sensitivity when looking at the data, spelling should also be looked at because if some of the textual data is spelled incorrectly, it will be difficult to mine the textual data for patterns. An quick and easy way to do a quick run down of the data is to click on “File” for both Excel and Sheets and then “Check Spelling & Grammar.” A quick run through with the spelling and grammar will ensure uniformity throughout your data set.
Step 5: Split and Merge Data Columns, as Needed
When you import data into Excel or Sheets from an external data source, the data can be merged into two or more columns or even can split one column into a multitude of different columns. This sometimes is a huge mess that will need to be cleaned up. One common example of this occurring is that the data imported on the spreadsheet will separate first and last names. Therefore, you must be sure to analyze your data set and see whether the data that has been imported has been successfully merged into the proper data cells. If not, you must either manually do this by reorganizing the data by highlight data and merging/splitting data cells or use the =CONCATENATE function so that you can join two or more strings of text into one string. To do this, type =CONCATENATE into a blank cell on Excel or Sheets and type in within its first range the first text (i.e., “text1”) you want to string to an adjoining text (i.e., “text2”). Furthermore, an example of this formula in action would look like =CONCATENATE(cell#, “lastname” cell#, “firstname). Ultimately, Excel and Sheets will conduct is to string together the last and first name onto one cell by combining the data together in one single string. Also, note that this same function can be used to combine data on a data set that was not originally together.
Step 6: Conduct Error Analysis and Review
You are almost there. This our last step. You have completed much of the required data cleaning up to this point. Now, to be one hundred percent ready to conduct statistical data analysis on your data set, the last step that needs to be done is error analysis on the data set to be sure its error-free. Errors in the data will cause your statistical formulas to not properly compute the data. Ultimately, error analysis looks different on Excel and Sheets. As a result, discussions on error analysis of your data set will be outlined for both Excel and Sheets.
On Excel, error analysis relates directly to the conditional formatting feature you can access at the top of its interface. Before doing this make sure you have highlighted the data set you would like to conduct the error analysis on. Once you click on “Conditional Formatting,” you will press “New Formatting Rule” at the bottom of the tab. Within the formatting box appears, click on the option that states “Format only cells that contain.” After this has been selected, be sure to look at the “Edit the Rule Description” option below and select “Format only cells with errors.” Lastly, select “OK” at the bottom of the box and Excel will conduct error analysis on the datasheet. It will highlight areas on the datasheet that have errors, which you then can pinpoint and fix. Remember, also take note that within the “Conditional Formatting” options, you have an option to develop rules based on what you want to specifically format in your data. This can range from duplicates to numbers greater than or less than a certain value; there are many rules you can sort through or even create, which makes it a great tool to complete your data cleaning process.
For Sheets, error analysis takes a few different forms because there are no specific functions that relate to computing an all-encompassing error analysis. Rather, there are several specific options you can utilize to conduct an error analysis, but note that it is not as systematic as Excel’s error analysis. To start, highlight the data set you want to conduct error analysis on and then click on “Format” at the top of Sheets interface. Then, underneath this tab, select “Conditional Formatting.” After this has been selected, a box to your right will appear that provides conditional formatting rules for you to apply to your data set. Within the pre-set options, you will notice “error” does not exist as an option for you to select. However, you can format the data in the same manner as you would conducting error analysis on Excel by developing multiple rules that Sheets will then format the data. This option is at the bottom of the interface on the right-hand side of Sheets that will allow you to create as many rules as you would like to format on the data set you have selected.
At the end of the day, Sheets allows you to conduct error analysis in a very customizable matter. On the other hand, Excel allows you to do it in a systematic swipe but it is much more difficult to customize your conditional formatting when compared to Sheets. With this said, whichever software you decide to conduct your data analysis, error analysis should take place before moving on before inputting statistical functions and formulas on Excel and Sheets.
Cleaning Data in Action
To illustrate how many of the techniques explained above are put into action, two video demonstrations are embedded in this blog for you to view. I recommend watching the first and second video in order as they provide valuable step by step processes of how to import data from other sources and then clean the data.
The first video explanation is valuable because it shows how to export and obtain data (like we talked about in Part 1 of this blog series) and clean the data. It shows the step by step process of exporting the data by either copying and pasting the data OR downloading and uploading a .CSV file to Sheets.
In the second video explanation, it demonstrates 10 valuable tips in cleaning data on Excel. Many of these same tips of how to clean data can also be used on Sheets. We see here how raw text and numeric data can be cleaned with ten
Next Step: The Data is Ready for Statistical Analysis – But Review Once Again Before Moving On
After cleaning your data, it is time for statistical analysis. However, before moving on, make sure to review the data set multiple times to ensure its ready. We all will make some mistakes throughout this process. Therefore, the review will catch these mistakes so they will not come up while you are conducting statistical analysis.
Cleaning data is one of the most monotonous and toughest parts of the data driven decision-making process. It is not fun to clean data. It can be challenging. But, it does not have to take a long time if each of the steps discussed today are put into action every single time you interact with a new set of data. With practice, it will become second nature.
Ultimately, once you are done cleaning data, you are a ready to conduct statistical analysis. This is the most fun and engaging part of data-driven decision-making because we are transforming data into newfound knowledge that we can use to make an impact. Part 3 of this blog series will cover many of the basic descriptive and multivariate formulas you can use on Excel and Sheets to conduct statistical analysis. In addition, Part 3 will illustrate step by step tutorials on how to use the formulas while working with data you can collect in a K-12 setting.
Note: I recommend reviewing this post and the videos presented before moving onto to Part 3 next week. If you have any questions or comments, please feel free to reach out. See you next week!